Debug

`get_distinct_value_from_df_columns(df, columns_names, display=True)`

Get the distinct values of the specified DataFrame columns and, optionally, display how often each value occurs. Counts are shown only for columns with fewer than 20 distinct values.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input Spark DataFrame.	required
`columns_names`	`list[str]`	List of column names to process.	required
`display`	`bool`	Whether to show per-value counts (only for columns with fewer than 20 distinct values).	`True`

Returns:

Type	Description
`dict[str, list]`	A dictionary mapping each column name to the list of its distinct values.
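
The return shape and the display rule can be illustrated without a Spark session. The following is a minimal plain-Python sketch, not the library's implementation: `distinct_values_sketch`, its `max_shown` parameter, and the sample `rows` are hypothetical names introduced here, and a list of dicts stands in for the Spark DataFrame.

```python
from collections import Counter


# Hypothetical stand-in: plain dicts play the role of Spark rows.
def distinct_values_sketch(rows, columns_names, display=True, max_shown=20):
    """Mimic the return shape and display rule described above."""
    result = {}
    for col in columns_names:
        # Count how often each value occurs in this column.
        counts = Counter(row[col] for row in rows)
        # Counter keeps first-seen order; Spark's distinct() guarantees no order.
        result[col] = list(counts)
        # Print counts only for columns with fewer than max_shown distinct values
        # (the documented function hard-codes this limit as 20).
        if display and len(counts) < max_shown:
            for value, n in counts.items():
                print(f"{col}={value!r}: {n}")
    return result


rows = [
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "S"},
    {"color": "red", "size": "M"},
]
distinct = distinct_values_sketch(rows, ["color", "size"], display=False)
```

With a real Spark DataFrame `df` holding the same data, the corresponding call would be `get_distinct_value_from_df_columns(df, ["color", "size"])`. The keys of the returned dictionary would match, but the order of values within each list is not guaranteed.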

Source code in `pysparky/debug.py`

from pyspark.sql import DataFrame


def get_distinct_value_from_df_columns(
    df: DataFrame, columns_names: list[str], display: bool = True
) -> dict[str, list]:
    """
    Get distinct values from specified DataFrame columns and optionally display their counts.

    Args:
        df (DataFrame): The input Spark DataFrame.
        columns_names (list[str]): List of column names to process.
        display (bool): Whether to display the counts of distinct values. Default is True.

    Returns:
        dict[str, list]: A dictionary where keys are column names and values are lists of distinct values.
    """
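    distinct_values = {}
    for col in columns_names:
        # Collect the distinct values of this column to the driver.
        data = df.select(col).distinct()
        distinct_values[col] = [row[col] for row in data.collect()]

        # Show per-value counts only for columns with fewer than 20 distinct
        # values; the collected list already gives that number.
        if display and len(distinct_values[col]) < 20:
            df.groupBy(col).count().show()
    return distinct_values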