LLM
build_text_generation_udf(model_bc, tokenizer_bc, system_prompt)
Creates a Spark UDF for text generation using a Hugging Face model and tokenizer.
The returned UDF can be applied to a string column of a Spark DataFrame to generate text row by row. The model and tokenizer are passed in as Spark broadcast variables, so each executor receives a cached copy instead of having them serialized with every task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_bc | Broadcast | Broadcasted Hugging Face model. | required |
tokenizer_bc | Broadcast | Broadcasted Hugging Face tokenizer. | required |
system_prompt | str | Prompt to prepend to each input string before generation. | required |
Returns:
Name | Type | Description |
---|---|---|
function | UserDefinedFunction | A Spark UDF that takes a string input and returns the generated text. |
Raises:
Type | Description |
---|---|
TypeError | If |
Example
...
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
t5_udf = build_text_generation_udf(
    sc.broadcast(model),
    sc.broadcast(tokenizer),
    "sentiment of the text",
)
results_df = input_df.withColumn("output_column", t5_udf("sentence"))
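For readers who want a feel for what a UDF like this might do per row, the following is a minimal sketch and not the library's actual implementation. The helper names `make_generate_fn` and `build_udf_sketch` are hypothetical, and the way the prompt is joined to the input (`": "`) and the handling of null values are assumptions. The source only states that the prompt is prepended to each input string.

```python
from typing import Any, Callable, Optional


def make_generate_fn(
    model_bc: Any, tokenizer_bc: Any, system_prompt: str
) -> Callable[[Optional[str]], Optional[str]]:
    """Hypothetical helper: build the plain Python function a UDF could wrap."""

    def generate(text: Optional[str]) -> Optional[str]:
        # Assumption: null input rows are passed through as null.
        if text is None:
            return None
        # Broadcast variables expose the shipped object through `.value`.
        model = model_bc.value
        tokenizer = tokenizer_bc.value
        # Assumption: prompt and input are joined with ": ".
        prompt = f"{system_prompt}: {text}"
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs)
        # Decode the first (only) generated sequence back to text.
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return generate


def build_udf_sketch(model_bc: Any, tokenizer_bc: Any, system_prompt: str):
    """Hypothetical stand-in for build_text_generation_udf."""
    # Imported lazily so the per-row logic can be exercised without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(make_generate_fn(model_bc, tokenizer_bc, system_prompt), StringType())
```

Keeping the per-row logic in a plain function makes it possible to check prompt construction and decoding with small stand-in objects before running anything on a cluster.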