math_
cumsum(column_or_name, partition_by=None, order_by_column=None, is_normalized=False, is_descending=False, alias='cumsum')
Calculate the cumulative sum of a column, optionally partitioned by other columns.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `column_or_name` | `Column` | The column for which to calculate the cumulative sum. | *required* |
| `partition_by` | `list[Column]` | A list of columns to partition by. Defaults to an empty list. | `None` |
| `order_by_column` | `Column \| None` | The column to order by; if `None`, orders by `column_or_name` itself. | `None` |
| `is_normalized` | `bool` | Whether to normalize the cumulative sum. Defaults to `False`. | `False` |
| `is_descending` | `bool` | Whether to order the cumulative sum in descending order. Defaults to `False`. | `False` |
| `alias` | `str` | Alias for the resulting column. Defaults to `"cumsum"`. | `'cumsum'` |
Returns:

| Name | Type | Description |
|---|---|---|
| `Column` | `Column` | A column representing the cumulative sum. |
Example
>>> df = spark.createDataFrame([(1, "A", 10), (2, "A", 20), (3, "B", 30)], ["id", "category", "value"])
>>> result_df = df.select("id", "category", "value", cumsum(F.col("value"), partition_by=[F.col("category")], is_descending=True))
>>> result_df.show()
+---+--------+-----+------+
| id|category|value|cumsum|
+---+--------+-----+------+
| 1| A| 10| 30|
| 2| A| 20| 20|
| 3| B| 30| 30|
+---+--------+-----+------+
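The partition, order, and normalize semantics above can be mirrored in plain Python. This is a minimal sketch of the documented behavior, not the Spark implementation; rows here are plain dicts rather than DataFrame rows:

```python
from collections import defaultdict

def cumsum_sketch(rows, key, value, descending=False, normalized=False):
    # Group rows by the partition key, mirroring partition_by.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = {}
    for group in groups.values():
        # Order within the partition, mirroring is_descending.
        ordered = sorted(group, key=lambda r: r[value], reverse=descending)
        total = sum(r[value] for r in ordered)
        running = 0
        for r in ordered:
            running += r[value]
            # is_normalized divides the running sum by the partition total.
            result[id(r)] = running / total if normalized else running
    # Return cumulative sums in the original row order.
    return [result[id(r)] for r in rows]
```

Applied to the rows above with `descending=True`, this yields `[30, 20, 30]`, matching the `cumsum` column shown.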
Source code in pysparky/functions/math_.py
haversine_distance(lat1, long1, lat2, long2)
Calculates the Haversine distance between two sets of latitude and longitude coordinates.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `lat1` | `ColumnOrName` | The column containing the latitude of the first coordinate. | *required* |
| `long1` | `ColumnOrName` | The column containing the longitude of the first coordinate. | *required* |
| `lat2` | `ColumnOrName` | The column containing the latitude of the second coordinate. | *required* |
| `long2` | `ColumnOrName` | The column containing the longitude of the second coordinate. | *required* |
Returns:

| Name | Type | Description |
|---|---|---|
| `Column` | `Column` | The column containing the calculated Haversine distance. |
Example
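No example ships with this docstring, but the underlying formula can be sketched in plain Python. This is a pure-Python illustration of the Haversine formula, not the Spark column expression; the mean Earth radius of 6371 km is an assumption, since the docstring does not state the units:

```python
import math

def haversine_km(lat1, long1, lat2, long2):
    # Convert decimal degrees to radians.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(long2 - long1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Great-circle distance, using a mean Earth radius of ~6371 km.
    return 6371 * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
```

Paris (48.8566, 2.3522) to London (51.5074, -0.1278) comes out at roughly 343 km.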
Source code in pysparky/functions/math_.py
sumif(condition, value=1, otherwise_value=0)
Return a conditional sum using Spark expressions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `condition` | `Column` | Boolean Spark expression to filter rows. | *required* |
| `value` | `Union[Column, int, float]` | Column or scalar to sum when `condition` is True. | `1` |
| `otherwise_value` | `Union[Column, int, float]` | Column or scalar to use when `condition` is False. | `0` |
Returns:

| Name | Type | Description |
|---|---|---|
| `Column` | `Column` | Spark aggregation expression that sums `value` where `condition` is True, and `otherwise_value` otherwise. |
Example
>>> df = spark.createDataFrame([("A", 10), ("B", 20), ("A", 30)], ["category", "value"])
>>> df.select(sumif(F.col("category") == "A").alias("count_a")).show()
+-------+
|count_a|
+-------+
| 2|
+-------+
>>> df.select(sumif(F.col("category") == "A", F.col("value")).alias("sum_a")).show()
+-----+
|sum_a|
+-----+
| 40|
+-----+
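The two calls above reduce to a simple conditional sum. A plain-Python sketch of the same semantics follows, assuming `sumif` compiles to `F.sum(F.when(condition, value).otherwise(otherwise_value))`, which the signature suggests but the source should confirm:

```python
def sumif_sketch(rows, condition, value=lambda r: 1, otherwise_value=lambda r: 0):
    # Mirrors F.sum(F.when(condition, value).otherwise(otherwise_value)):
    # with the defaults (1 / 0) this simply counts rows matching the condition.
    return sum(value(r) if condition(r) else otherwise_value(r) for r in rows)
```

On the rows above, `sumif_sketch(rows, lambda r: r[0] == "A")` gives 2 and passing `value=lambda r: r[1]` gives 40, matching the outputs shown.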