
Conversation

@dboyker dboyker commented Jan 26, 2026

Thanks for the nice tutorial on gemma + peft!

After following it, the script throws this warning:

You passed a dataset that is already processed (contains an input_ids field) together with a formatting function. Therefore formatting_func will be ignored. Either remove the formatting_func or pass a dataset that is not already processed.

Because formatting_func is ignored, the model is not correctly fine-tuned: the script runs, but there is no guarantee that the model outputs the format Quote: [...] Author: [...].

Line 115 is the one that causes this:
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True). Executing it adds the input_ids column to the dataset, which, together with the formatting_func arg, triggers the warning as seen here: https://github.com/huggingface/trl/blob/main/trl/trainer/sft_trainer.py#L938-L944
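To illustrate the mechanism (a toy sketch, not the real gemma tokenizer or the `datasets` library: `stub_tokenizer` and the sample quotes below are made up), mapping a tokenizer over the dataset adds an `input_ids` column, and that column is exactly what SFTTrainer checks to decide the dataset is "already processed":

```python
def stub_tokenizer(texts):
    # Stand-in for a real tokenizer: returns fake token ids per text.
    return {"input_ids": [[ord(c) for c in t] for t in texts]}

# A dict-of-lists standing in for the quotes dataset.
data = {"quote": ["Be yourself.", "Less is more."]}

# Equivalent of data.map(lambda samples: tokenizer(samples["quote"]), batched=True):
# the original columns are kept and the tokenizer's output columns are merged in.
processed = {**data, **stub_tokenizer(data["quote"])}

print("input_ids" in processed)  # the column whose presence triggers the warning
```

Once `input_ids` is present, passing a formatting_func alongside it is contradictory, which is why TRL emits the warning and ignores the function.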

Removing line 115 is safe with respect to tokenization: the tokenizer is inferred in the SFTTrainer __init__: https://github.com/huggingface/trl/blob/main/trl/trainer/sft_trainer.py#L639-L650

In addition, this PR modifies formatting_func to avoid raising AttributeError: 'list' object has no attribute 'endswith'. It now returns a string instead of a list (and it no longer slices the quote and author fields).
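For reference, a minimal sketch of the corrected formatting_func (the field names "quote" and "author" follow the tutorial's dataset; the exact template string is an assumption, not the PR's verbatim code):

```python
def formatting_func(example):
    # Return a single string, not a list: TRL calls .endswith on the result,
    # which raises AttributeError: 'list' object has no attribute 'endswith'
    # when a list is returned.
    return f"Quote: {example['quote']}\nAuthor: {example['author']}"

example = {"quote": "Be yourself.", "author": "Oscar Wilde"}
text = formatting_func(example)
print(text)
```

Returning one formatted string per example keeps the Quote: [...] Author: [...] structure in the training text, which is what the fine-tuned model is expected to reproduce.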
