# Spark ext

## column_function(spark, column_obj)

Evaluates a Column expression in the context of a single-row DataFrame.

This function creates a DataFrame with a single row and applies the given Column expression to it. This is particularly useful for testing Column expressions, evaluating complex transformations, or creating sample data based on Column operations.
Parameters:

Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The SparkSession object. | required
column_obj | Column | The Column object or expression to evaluate. | required
Returns:

Type | Description
---|---
DataFrame | pyspark.sql.DataFrame: A single-row DataFrame containing the result of the Column expression.
Examples:

Simple column expression:

```python
from pyspark.sql import functions as F

result = spark.column_function(F.lit("Hello, World!"))
result.show()
# +-------------+
# |         col0|
# +-------------+
# |Hello, World!|
# +-------------+
```

Complex column expression:

```python
import datetime

complex_col = F.when(
    F.current_date() > F.lit(datetime.date(2023, 1, 1)), "Future"
).otherwise("Past")
result = spark.column_function(complex_col)
result.show()
# +------+
# |  col0|
# +------+
# |Future|
# +------+
```

Using with user-defined functions (UDFs):

```python
from pyspark.sql.types import IntegerType

square_udf = F.udf(lambda x: x * x, IntegerType())
result = spark.column_function(square_udf(F.lit(5)))
result.show()
# +----+
# |col0|
# +----+
# |  25|
# +----+
```
Notes
- This function is particularly useful for debugging or testing Column expressions without the need to create a full DataFrame.
- The resulting DataFrame will always have a single column named 'col0' unless the input Column object has a specific alias.
- Be cautious when using this with resource-intensive operations, as it still creates a distributed DataFrame operation.
Source code in pysparky/spark_ext.py
## convert_1d_list_to_dataframe(spark, list_, column_names, axis='column')

Converts a 1-dimensional list into a PySpark DataFrame.

This function takes a 1-dimensional list and converts it into a PySpark DataFrame with the specified column names. The list can be converted into a DataFrame with either a single column or a single row, based on the specified axis.
Parameters:

Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session to use for creating the DataFrame. | required
list_ | list | The 1-dimensional list to convert. | required
column_names | str or list of str | The name(s) of the column(s) for the DataFrame. | required
axis | str | Specifies whether to convert the list into a single column or a single row. Acceptable values are "column" (default) and "row". | 'column'
Returns:

Name | Type | Description
---|---|---
DataFrame | DataFrame | A PySpark DataFrame created from the 1-dimensional list.
Raises:

Type | Description
---|---
AttributeError | If the axis parameter is not "column" or "row".
Examples:

```python
>>> spark = SparkSession.builder.appName("example").getOrCreate()
>>> list_ = [1, 2, 3, 4]
>>> column_names = ["numbers"]
>>> df = convert_1d_list_to_dataframe(spark, list_, column_names, axis="column")
>>> df.show()
+-------+
|numbers|
+-------+
|      1|
|      2|
|      3|
|      4|
+-------+
>>> column_names = ["ID1", "ID2", "ID3", "ID4"]
>>> df = convert_1d_list_to_dataframe(spark, list_, column_names, axis="row")
>>> df.show()
+---+---+---+---+
|ID1|ID2|ID3|ID4|
+---+---+---+---+
|  1|  2|  3|  4|
+---+---+---+---+
```
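The examples above need a running SparkSession, but the reshaping that distinguishes the two axes is plain Python. The sketch below (`reshape_1d` is a hypothetical helper, not part of pysparky) shows the row tuples such a function would plausibly pass to `spark.createDataFrame` for each axis:

```python
def reshape_1d(list_, column_names, axis="column"):
    """Hypothetical sketch of the reshaping behind convert_1d_list_to_dataframe."""
    if isinstance(column_names, str):
        column_names = [column_names]
    if axis == "column":
        # one tuple per element -> a single-column DataFrame
        rows = [(item,) for item in list_]
    elif axis == "row":
        # one tuple holding every element -> a single-row DataFrame
        rows = [tuple(list_)]
    else:
        raise AttributeError("axis must be 'column' or 'row'")
    return rows, column_names

print(reshape_1d([1, 2, 3, 4], ["numbers"], axis="column"))
# ([(1,), (2,), (3,), (4,)], ['numbers'])
print(reshape_1d([1, 2, 3, 4], ["ID1", "ID2", "ID3", "ID4"], axis="row"))
# ([(1, 2, 3, 4)], ['ID1', 'ID2', 'ID3', 'ID4'])
```

Either result can then be handed to `spark.createDataFrame(rows, schema=column_names)`.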
Source code in pysparky/spark_ext.py
## convert_dict_to_dataframe(spark, dict_, column_names, explode=False)

Transforms a dictionary with list values into a Spark DataFrame.
Parameters:

Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The SparkSession object. | required
dict_ | dict | The dictionary to transform. Keys will become the first column, and values will become the second column. | required
column_names | list[str] | A list containing the names of the columns. | required
explode | bool | If True, list values are exploded into separate rows. | False
Returns:

Type | Description
---|---
DataFrame | pyspark.sql.DataFrame: A DataFrame with the dictionary keys and their corresponding exploded list values.
Examples:

```python
datadict_ = {
    "key1": 1,
    "key2": 2,
}
column_names = ["keys", "values"]
df = convert_dict_to_dataframe(spark, datadict_, column_names)
display(df)
# key1,1
# key2,2
```
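Independently of Spark, the key/value pairing can be sketched in plain Python (`dict_to_rows` is an illustrative helper, not the library's implementation): keys become the first column and values the second, with an optional explode step for list values:

```python
def dict_to_rows(dict_, explode=False):
    """Hypothetical sketch: keys -> first column, values -> second column."""
    if not explode:
        return list(dict_.items())
    # with explode=True, each element of a list value gets its own row
    return [(key, item) for key, value in dict_.items() for item in value]

print(dict_to_rows({"key1": 1, "key2": 2}))
# [('key1', 1), ('key2', 2)]
print(dict_to_rows({"key1": [1, 2], "key2": [3]}, explode=True))
# [('key1', 1), ('key1', 2), ('key2', 3)]
```

The resulting list of tuples maps directly onto `spark.createDataFrame(rows, schema=column_names)`.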
Source code in pysparky/spark_ext.py
## createDataFrame_from_dict(spark, dict_)

Creates a Spark DataFrame from a dictionary in a pandas-like style.
Parameters:

Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The SparkSession object. | required
dict_ | dict | The dictionary to convert, where keys are column names and values are lists of column data. | required
Returns:

Name | Type | Description
---|---|---
DataFrame | DataFrame | The resulting Spark DataFrame.
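No example is shown for this function. Assuming a pandas-like mapping of column names to equal-length lists, the column-to-row transposition such a function would perform can be sketched as follows (`dict_to_row_tuples` is a hypothetical helper, not part of pysparky):

```python
def dict_to_row_tuples(dict_):
    """Hypothetical sketch: transpose pandas-style column lists into the
    row tuples that spark.createDataFrame(rows, schema=columns) expects."""
    columns = list(dict_.keys())
    # zip pairs the i-th element of every column list into the i-th row
    rows = list(zip(*dict_.values()))
    return rows, columns

rows, columns = dict_to_row_tuples({"name": ["Alice", "Bob"], "age": [30, 25]})
print(columns)  # ['name', 'age']
print(rows)     # [('Alice', 30), ('Bob', 25)]
```

From there, `spark.createDataFrame(rows, schema=columns)` would yield the pandas-style DataFrame the docstring describes.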