feat: Add GEPA prompt optimization (vf-gepa)
#571
Description
This PR adds GEPA (Genetic-Pareto) integration to Verifiers: an automatic prompt optimization system that improves environment prompts through reflection-based evolution.
GEPA works by:
New CLI Command: vf-gepa
Optimize system prompt with medium budget (~12 candidates)
Optimize both system prompt and tool descriptions
Custom configuration
Results are saved to ./gepa_results/<env_id>/<run_id>/, including optimized components, original components for comparison, and optimization metrics.
Type of Change
Testing
uv run pytest locally.
wordle, gsm8k, and tool_test (w/ --components tool_descriptions) environments (e.g. gemini-2.5-flash on a 5-min GEPA run):
second attempt:
Checklist
Additional Notes
How GEPA Works
The integration consists of several key components:
1. GEPAAdapter (verifiers/gepa/adapter.py)
Bridges Verifiers environments with GEPA's optimization protocol:
2. Budget Modes
Three preset budgets control optimization intensity:
light (~6 candidates): Quick iteration, ~30-60 min
medium (~12 candidates): Balanced exploration, ~1-2 hours
heavy (~18 candidates): Thorough optimization, ~2-4 hours
3. Component Selection
GEPA can optimize multiple components:
--components system_prompt # Default
--components tool_descriptions # For tool-using environments
--components system_prompt tool_descriptions # Both
When optimizing tool_descriptions, each tool's description becomes a separate optimizable component (tool_0_description, tool_1_description, etc.).
Rubric Changes: Feedback Support
A key non-breaking change enables reward functions to return both a score and textual feedback:
Before (still works):
New: Return feedback for better GEPA optimization
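A minimal sketch of the two return styles, under stated assumptions: the reward-function signatures are simplified to plain strings, the local RewardResult class only mirrors the {"score": float, "feedback": str} shape described in this PR, and normalize_reward is a hypothetical stand-in rather than the PR's actual parsing code.

```python
from typing import TypedDict, Union


class RewardResult(TypedDict):
    # Assumed shape, mirroring the {"score": float, "feedback": str} form in this PR.
    score: float
    feedback: str


def exact_match_reward(completion: str, answer: str) -> float:
    # Before (still works): return a plain float score.
    return 1.0 if completion.strip() == answer.strip() else 0.0


def exact_match_reward_with_feedback(completion: str, answer: str) -> RewardResult:
    # New: return the same score plus text that a reflection model can read.
    if completion.strip() == answer.strip():
        return {"score": 1.0, "feedback": "Correct: output matches the expected answer."}
    return {
        "score": 0.0,
        "feedback": f"Incorrect: expected {answer!r}, got {completion!r}.",
    }


def normalize_reward(result: Union[float, RewardResult]) -> tuple[float, str]:
    # Hypothetical helper: accept either return style and yield (score, feedback).
    if isinstance(result, dict):
        return float(result["score"]), result.get("feedback", "")
    return float(result), ""
```

Normalizing both styles to a (score, feedback) pair is one way to keep float-only reward functions working unchanged while letting newer ones supply an explanation.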
The feedback is collected via rubric.get_feedback(state) and used by GEPA's reflection model to understand why rollouts succeeded or failed. This enables more targeted prompt improvements.
Changes to Rubric class:
RewardResult TypedDict in types.py for type-safe {"score": float, "feedback": str} returns
_parse_reward_result() to handle both float and dict returns
get_feedback(state) method to aggregate feedback from all reward functions
state["feedbacks"] during score_rollout() and score_group()
Experiment Tracking
Built-in support for wandb and MLflow:
Track with wandb:
Track with MLflow: