math_
cumsum(column_or_name, partition_by=None, order_by_column=None, is_normalized=False, is_descending=False, alias='cumsum')
Calculate the cumulative sum of a column, optionally partitioned by other columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column_or_name | Column | The column for which to calculate the cumulative sum. | required |
partition_by | list[Column] | A list of columns to partition by. None is treated as an empty list (no partitioning). | None |
order_by_column | Column \| None | The column to order by. If None, rows are ordered by column_or_name itself. | None |
is_normalized | bool | Whether to normalize the cumulative sum. | False |
is_descending | bool | Whether to order the cumulative sum in descending order. | False |
alias | str | Alias for the resulting column. | 'cumsum' |
Returns:
Name | Type | Description |
---|---|---|
Column | Column | A column representing the cumulative sum. |
Examples:
>>> from pyspark.sql import functions as F
>>> from pysparky.functions.math_ import cumsum
>>> df = spark.createDataFrame([(1, "A", 10), (2, "A", 20), (3, "B", 30)], ["id", "category", "value"])
>>> result_df = df.select("id", "category", "value", cumsum(F.col("value"), partition_by=[F.col("category")], is_descending=True))
>>> result_df.display()
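To make the partitioning and ordering semantics concrete, here is a minimal pure-Python model of a partitioned running sum. It is a sketch under stated assumptions, not pysparky's implementation: `cumsum_sketch` is a hypothetical helper, ordering within a partition is assumed to use the summed value itself, and normalization is assumed to divide by the partition total.

```python
from collections import defaultdict


def cumsum_sketch(rows, value_key, partition_keys=(), is_normalized=False, is_descending=False):
    """Illustrative model of a partitioned running sum over a list of dicts."""
    # Group row positions by their partition key (an empty key means one global partition).
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row[k] for k in partition_keys)].append(idx)

    result = [None] * len(rows)
    for indices in groups.values():
        # Assumption: rows are ordered by the summed value when no order column is given.
        ordered = sorted(indices, key=lambda i: rows[i][value_key], reverse=is_descending)
        total = sum(rows[i][value_key] for i in ordered)
        running = 0
        for i in ordered:
            running += rows[i][value_key]
            # Assumption: normalizing divides by the partition total (must be non-zero).
            result[i] = running / total if is_normalized else running
    return result


rows = [
    {"id": 1, "category": "A", "value": 10},
    {"id": 2, "category": "A", "value": 20},
    {"id": 3, "category": "B", "value": 30},
]
# Results are returned in the original row order.
print(cumsum_sketch(rows, "value", partition_keys=["category"], is_descending=True))
# prints [30, 20, 30]
```

Under these assumptions, category A is visited in the order 20, 10, so the row with value 20 gets 20 and the row with value 10 gets 30. This model accumulates strictly row by row, so tied order values may be treated differently by a Spark window.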
Source code in pysparky/functions/math_.py
haversine_distance(lat1, long1, lat2, long2)
Calculates the Haversine distance between two sets of latitude and longitude coordinates.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lat1 | ColumnOrName | The column containing the latitude of the first coordinate. | required |
long1 | ColumnOrName | The column containing the longitude of the first coordinate. | required |
lat2 | ColumnOrName | The column containing the latitude of the second coordinate. | required |
long2 | ColumnOrName | The column containing the longitude of the second coordinate. | required |
Returns:
Name | Type | Description |
---|---|---|
Column | Column | The column containing the calculated Haversine distance. |
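For reference, the Haversine formula itself can be sketched in plain Python on scalar inputs. This is an illustration of the formula only, not the library's column-based code: `haversine_km` is a hypothetical helper, inputs are assumed to be in degrees, and the result is assumed to be in kilometres using a mean Earth radius of 6371 km. The unit and radius used by `haversine_distance` are not stated in this page.

```python
import math

# Assumption: mean Earth radius in kilometres; the library's constant may differ.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, long1, lat2, long2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(long2 - long1)
    # Haversine of the central angle between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Two points on the equator, 90 degrees of longitude apart: a quarter of a great circle.
print(haversine_km(0.0, 0.0, 0.0, 90.0))
```

The same expression maps directly onto column arithmetic, since each step (radians, sine, cosine, square root, arcsine) has a counterpart in `pyspark.sql.functions`.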
Examples: