Conversation

@ferreirafabio (Contributor)

Overview

As discussed with @geoalgo today, this PR adds CLI arguments for controlling input truncation and model generation limits:

  • --max_len (default 8192): Maximum character length for truncating input text (instructions, completions) before sending to models, preventing context-limit overruns. This was previously hard-set to 200, which led judges to notice cut-off completions and base their decisions on the truncation.

  • --max_tokens (default 32768): Maximum number of tokens all models (A, B, and judge) can generate in their responses (previously hard-coded to 32k; now configurable).

  • Fixed a minor bug: --results_folder was parsed but never passed to the CliArgs dataclass.

The first two parameters were previously hard-coded with inconsistent values (200 and 4096) and conflated with each other. This PR separates them into distinct concepts.
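To illustrate the distinction, here is a minimal sketch (helper names are hypothetical, not the actual openjury code): max_len trims input at the character level, while max_tokens is a budget forwarded to the generation call.

```python
# Illustrative sketch only -- helper names are hypothetical, not the
# actual openjury implementation. max_len limits *input characters*;
# max_tokens caps *output tokens*.

def truncate_input(text: str, max_len: int = 8192) -> str:
    """Character-level truncation applied to input text before prompting."""
    return text[:max_len]

def generation_kwargs(max_tokens: int = 32768) -> dict:
    """Token budget forwarded to the model's generation call."""
    return {"max_tokens": max_tokens}

instruction = "x" * 10_000
prompt = truncate_input(instruction)   # at most 8192 characters reach the model
kwargs = generation_kwargs()           # output capped at 32768 tokens
```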

Changes

  • generate_and_evaluate.py: Added --max_len and --max_tokens CLI arguments, fixed --result_folder not being passed
  • generate.py: Separated max_len (truncation) from max_tokens (generation) in generate_instructions() and generate_base()
  • evaluate.py: Updated default max_len to 8192
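A minimal sketch of how such arguments might be wired up with argparse (illustrative only; the actual generate_and_evaluate.py may differ):

```python
import argparse

# Hypothetical reconstruction of the two new CLI arguments.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--max_len",
    type=int,
    default=8192,
    help="Maximum character length for truncating input text before sending to models.",
)
parser.add_argument(
    "--max_tokens",
    type=int,
    default=32768,
    help="Maximum number of tokens models A, B, and the judge may generate.",
)

args = parser.parse_args(["--max_len", "16384"])  # --max_tokens falls back to its default
```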

Usage

```
python -m openjury.generate_and_evaluate \
  --dataset alpaca-eval \
  --model_A ... \
  --model_B ... \
  --judge_model ... \
  --max_len 16384 \
  --max_tokens 8192
```

  • Add --max_len (default 8192) for truncating input text (instructions, completions)
  • Add --max_tokens (default 32768) for limiting model generation output
  • Separate these concepts which were previously conflated
  • Update defaults consistently across generate.py and evaluate.py
  • Fix bug: --result_folder CLI arg was parsed but not passed to CliArgs
@ferreirafabio ferreirafabio changed the title Add --max_len and --max_tokens CLI arguments Add distinct max_len and max_tokens parameters Jan 7, 2026
@geoalgo (Collaborator) left a comment:
Awesome, thanks for catching and fixing this. I have only one comment about the naming.

```python
        " `[result_folder]/[evaluation_name]`.",
    )
    parser.add_argument(
        "--max_len",
```
@geoalgo (Collaborator):

max_len and max_tokens do not convey what the parameters do. Could you replace them with better names?

Perhaps max_token_completion and max_token_judge would be better?

@ferreirafabio (Contributor, Author) replied:

Thanks for the feedback @geoalgo. I've been thinking about this more and realized that max_token_completion and max_token_judge don't fully capture what's happening, since truncation occurs at the character level, not based on tokens.

Here's what I would suggest:
--max_out_tokens_models: max tokens models A/B can generate
--max_out_tokens_judge: max tokens the judge can generate
--truncate_all_input_chars: max chars to truncate all input text (instructions before A/B, completions before judge)

I considered splitting the last one into separate params (--max_in_chars_models for instructions and --max_in_chars_judge for completions), but I couldn't think of a practical use case where you'd want different truncation limits for each. I'd say the common scenarios are "both short" (to save costs) or "both long" (for a thorough eval).

Let me know if this naming works for you, or if you'd prefer something different.

- Rename --max_len → --truncate_all_input_chars
  (truncates instructions before A/B, completions before judge)
- Split --max_tokens into:
  - --max_out_tokens_models (output limit for models A/B)
  - --max_out_tokens_judge (output limit for judge)
- Fix bug in generate_base where max_len was used instead of max_tokens
- Update function signatures in generate.py and evaluate.py
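The generate_base bug in the list above belongs to a common class of mix-up; a hypothetical illustration (not the real generate_base code) of passing the character-truncation limit where the token budget belongs:

```python
# Hypothetical illustration of the bug class fixed here (not the actual
# generate_base code): the character-truncation limit was passed where
# the generation token budget belonged.

def generate(prompt: str, *, max_tokens: int) -> int:
    """Stand-in for a model call; returns the token budget it received."""
    return max_tokens

max_len = 8192      # limit on *input characters*
max_tokens = 32768  # limit on *output tokens*

buggy_budget = generate("hi", max_tokens=max_len)     # wrong limit applied
fixed_budget = generate("hi", max_tokens=max_tokens)  # correct token budget
```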
@geoalgo geoalgo merged commit 593f0f2 into OpenEuroLLM:main Jan 8, 2026
1 check failed
@ferreirafabio ferreirafabio deleted the feature/configurable-max-len branch January 12, 2026 14:40