# Dedup

## get_latest_record_from_column(sdf, window_partition_column_name, window_order_by_column_names, window_function=F.row_number)
Fetches the most recent record for each partition of a DataFrame, ordered by the specified column(s), with an optional custom ranking window function.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`sdf` | `DataFrame` | The DataFrame to process. | required |
`window_partition_column_name` | `str` | The column used to partition the DataFrame. | required |
`window_order_by_column_names` | `str \| list` | The column(s) used to sort the DataFrame. | required |
`window_function` | `Column` | The window function for ranking records. Defaults to `F.row_number`. | `row_number` |
Returns:

Name | Type | Description |
---|---|---|
`DataFrame` | `DataFrame` | A DataFrame with the most recent record for each partition. |
Source code in pysparky/transformations/dedup.py
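This function's docstring ships no example, so here is a hypothetical usage sketch. The sample data and column names are invented, and the expected output assumes the function keeps the top-ranked row per partition when ordering descending by the order-by column:

>>> data = [("u1", "2024-01-01"), ("u1", "2024-03-01"), ("u2", "2024-02-15")]
>>> sdf = spark.createDataFrame(data, ["user_id", "event_date"])
>>> get_latest_record_from_column(sdf, "user_id", "event_date").show()
+-------+----------+
|user_id|event_date|
+-------+----------+
|     u1|2024-03-01|
|     u2|2024-02-15|
+-------+----------+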
## get_only_duplicate_record(sdf, column_name)
Retrieves only the duplicate records from the input DataFrame based on the specified column.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`sdf` | `DataFrame` | The input Spark DataFrame. | required |
`column_name` | `str` | The column name to check for duplicates. | required |
Returns:

Name | Type | Description |
---|---|---|
`DataFrame` | `DataFrame` | A DataFrame containing only the duplicate records. |
Examples:
>>> data = [(1, "A"), (2, "B"), (3, "C"), (1, "A"), (4, "D"), (2, "B")]
>>> sdf = spark.createDataFrame(data, ["id", "value"])
>>> duplicate_records = get_only_duplicate_record(sdf, "id")
>>> duplicate_records.show()
+---+-----+
| id|value|
+---+-----+
| 1| A|
| 1| A|
| 2| B|
| 2| B|
+---+-----+
Source code in pysparky/transformations/dedup.py
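The actual implementation is collapsed behind the source link above; the following is a minimal sketch of one common way to write such a filter, using a window count over the key column. The helper name and the `_cnt` column are illustrative, not part of the library:

```python
from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F

def duplicates_sketch(sdf: DataFrame, column_name: str) -> DataFrame:
    # Count how many rows share each key value, then keep rows whose key occurs more than once.
    w = Window.partitionBy(column_name)
    return (
        sdf.withColumn("_cnt", F.count("*").over(w))
        .filter(F.col("_cnt") > 1)
        .drop("_cnt")
    )
```

Flipping the predicate to `== 1` produces the unique-record counterpart described next.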
## get_only_unique_record(sdf, column_name)
Retrieves only the unique records from the input DataFrame based on the specified column.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`sdf` | `DataFrame` | The input Spark DataFrame. | required |
`column_name` | `str` | The column name to check for duplicates. | required |
Returns:

Name | Type | Description |
---|---|---|
`DataFrame` | `DataFrame` | A DataFrame containing only the unique records. |
Examples:
>>> data = [(1, "A"), (2, "B"), (3, "C"), (1, "A"), (4, "D"), (2, "B")]
>>> sdf = spark.createDataFrame(data, ["id", "value"])
>>> unique_records = get_only_unique_record(sdf, "id")
>>> unique_records.show()
+---+-----+
| id|value|
+---+-----+
| 3| C|
| 4| D|
+---+-----+
Source code in pysparky/transformations/dedup.py
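As a contrast to the window-based sketch above, the same result can be reached with an aggregation plus a left-semi join. Again a hypothetical sketch, not the library's actual code:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def uniques_sketch(sdf: DataFrame, column_name: str) -> DataFrame:
    # Collect the key values that occur exactly once in the input.
    single_keys = (
        sdf.groupBy(column_name)
        .agg(F.count("*").alias("cnt"))
        .filter(F.col("cnt") == 1)
        .select(column_name)
    )
    # A left-semi join keeps the full input rows whose key appears in single_keys.
    return sdf.join(single_keys, on=column_name, how="left_semi")
```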
## quarantine_duplicate_record(sdf, column_name)
Splits the input DataFrame into two DataFrames: one containing unique records based on the specified column, and the other containing duplicate records.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`sdf` | `DataFrame` | The input Spark DataFrame. | required |
`column_name` | `str` | The column name to check for duplicates. | required |
Returns:

Type | Description |
---|---|
`tuple[DataFrame, DataFrame]` | A tuple of two DataFrames: the first contains the unique records, the second contains the duplicate records. |
Examples:
>>> data = [(1, "A"), (2, "B"), (3, "C"), (1, "A"), (4, "D"), (2, "B")]
>>> sdf = spark.createDataFrame(data, ["id", "value"])
>>> unique_records, duplicate_records = quarantine_duplicate_record(sdf, "id")
>>> unique_records.show()
+---+-----+
| id|value|
+---+-----+
| 3| C|
| 4| D|
+---+-----+
>>> duplicate_records.show()
+---+-----+
| id|value|
+---+-----+
| 1| A|
| 1| A|
| 2| B|
| 2| B|
+---+-----+
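Given the two functions above, one plausible implementation of this split (a sketch, not necessarily the library's code) is a simple composition:

```python
from pyspark.sql import DataFrame

def quarantine_sketch(sdf: DataFrame, column_name: str) -> tuple[DataFrame, DataFrame]:
    # First element: rows whose key is unique; second element: rows whose key repeats.
    return (
        get_only_unique_record(sdf, column_name),
        get_only_duplicate_record(sdf, column_name),
    )
```

Both halves scan the input independently, so caching `sdf` before the split can avoid recomputing it.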