feat: Add GEPA prompt optimization (`vf-gepa`) #571

rsalimans47 · 2025-11-25T14:19:42Z

Description

Note: This feature was originally developed by Zapier for internal use with Verifiers environments. We're excited to contribute it back to the open-source project.

This PR adds GEPA (Genetic-Pareto) integration to Verifiers: an automatic prompt optimization system that improves environment prompts through reflection-based evolution.

GEPA works by:

Testing current prompts on training examples
Collecting rich feedback from rubric evaluations
Using an LLM to reflect on failures and propose improved prompts
Iteratively refining the prompts over successive candidates
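
For reviewers who want the loop in concrete terms, here is a minimal sketch of the four steps above. It is not the code in this PR: `optimize_prompt`, `fake_evaluate`, and `fake_propose` are hypothetical names, the evaluator and proposer are stubs standing in for rollouts and the reflection LLM, and candidate selection is a plain greedy "keep the best" rule rather than GEPA's Pareto-based selection.

```python
from typing import Callable

# Hypothetical signatures: evaluate(prompt, example) -> (score, feedback),
# propose(prompt, failure_feedback) -> new prompt.
Evaluate = Callable[[str, str], tuple[float, str]]
Propose = Callable[[str, list[str]], str]


def optimize_prompt(
    seed_prompt: str,
    examples: list[str],
    evaluate: Evaluate,
    propose: Propose,
    max_candidates: int = 4,
) -> tuple[str, float]:
    """Greedy reflect-and-refine loop: keep the best-scoring prompt seen so far."""

    def run(prompt: str) -> tuple[float, list[str]]:
        # Steps 1 and 2: test the prompt and collect feedback on failures.
        results = [evaluate(prompt, ex) for ex in examples]
        scores = [score for score, _ in results]
        failures = [fb for score, fb in results if score < 1.0]
        return sum(scores) / len(scores), failures

    best_prompt = seed_prompt
    best_score, failures = run(best_prompt)
    for _ in range(max_candidates):
        if not failures:
            break  # nothing left to reflect on
        # Step 3: reflect on failures and propose an improved prompt.
        candidate = propose(best_prompt, failures)
        cand_score, cand_failures = run(candidate)
        # Step 4: keep the candidate only if it improves the score.
        if cand_score > best_score:
            best_prompt, best_score, failures = candidate, cand_score, cand_failures
    return best_prompt, best_score


# Stub evaluator and proposer so the sketch runs without an LLM.
def fake_evaluate(prompt: str, example: str) -> tuple[float, str]:
    if "step by step" in prompt:
        return 1.0, "ok"
    return 0.0, f"Wrong answer on {example!r}: no reasoning shown"


def fake_propose(prompt: str, failures: list[str]) -> str:
    return prompt + " Think step by step."


best, score = optimize_prompt(
    "Solve the problem.", ["2+2", "3*3"], fake_evaluate, fake_propose
)
print(best, score)
```

The greedy rule keeps the sketch short; a Pareto-based selector would instead retain every candidate that is best on at least one example.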

New CLI Command: `vf-gepa`

Optimize system prompt with medium budget (~12 candidates)

vf-gepa wordle --budget medium

Optimize both system prompt and tool descriptions

vf-gepa wiki-search --budget heavy --components system_prompt tool_descriptions

Custom configuration

vf-gepa my-env --max-metric-calls 1000 -n 100 --num-val 30 -m gpt-5-mini

Results are saved to ./gepa_results/<env_id>/<run_id>/ including optimized components, original components for comparison, and optimization metrics.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes
Tested on wordle, gsm8k, and tool_test (w/ --components tool_descriptions) environments

Example output (gemini-2.5-flash on a 5-minute GEPA run):

================================================================================
GEPA OPTIMIZATION COMPLETE
================================================================================
Best validation score: 0.900
Initial validation score: 0.600
Improvement: 0.300
Total candidates fully explored: 4

Second attempt:

================================================================================
GEPA OPTIMIZATION COMPLETE
================================================================================
Best validation score: 1.000
Initial validation score: 0.400
Improvement: 0.600
Total candidates fully explored: 2

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

How GEPA Works

The integration consists of several key components:

1. GEPAAdapter (verifiers/gepa/adapter.py)
Bridges Verifiers environments with GEPA's optimization protocol:

Component extraction/injection: Extracts optimizable text (system prompts, tool descriptions) and injects optimized versions back into environments
Evaluation: Runs rollouts and collects scores using the environment's rubric
Reflective dataset generation: Converts rubric feedback into structured reflection data for GEPA
Tool-aware proposal: Uses specialized templates for tool description optimization that include tool names and parameter schemas
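
To illustrate the reflective dataset step, the sketch below turns scored rollouts into records that a reflection model could read. It is an assumption-laden example, not the adapter's actual code: `make_reflective_records`, the rollout field names (`prompt`, `completion`, `score`, `feedback`), and the record keys are all illustrative.

```python
from typing import Any


def make_reflective_records(rollouts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert scored rollouts into one reflection record per rollout."""
    records = []
    for rollout in rollouts:
        records.append(
            {
                "Inputs": rollout["prompt"],
                "Generated Outputs": rollout["completion"],
                # Prefer textual rubric feedback; fall back to the bare score.
                "Feedback": rollout.get("feedback") or f"Score: {rollout['score']:.2f}",
            }
        )
    return records


rollouts = [
    {"prompt": "What is 2+2?", "completion": "5", "score": 0.0,
     "feedback": "Expected: 4, Got: 5."},
    {"prompt": "What is 3*3?", "completion": "9", "score": 1.0},
]
records = make_reflective_records(rollouts)
for record in records:
    print(record)
```

Falling back to the numeric score means reward functions that still return a plain float produce a usable, if less informative, record.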

2. Budget Modes
Three preset budgets control optimization intensity:

light (~6 candidates): Quick iteration, ~30-60 min
medium (~12 candidates): Balanced exploration, ~1-2 hours
heavy (~18 candidates): Thorough optimization, ~2-4 hours

3. Component Selection
GEPA can optimize multiple components:
--components system_prompt # Default
--components tool_descriptions # For tool-using environments
--components system_prompt tool_descriptions # Both

When optimizing tool_descriptions, each tool's description becomes a separate optimizable component (tool_0_description, tool_1_description, etc.).
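
A rough sketch of how the per-tool naming could work, assuming tools are represented as OpenAI-style function schemas (an assumption; the PR text does not state the format). `extract_tool_components` and `inject_tool_components` are hypothetical helper names, not functions from this PR.

```python
import copy


def extract_tool_components(tools: list[dict]) -> dict[str, str]:
    """Map each tool's description to a tool_<i>_description component."""
    return {
        f"tool_{i}_description": tool["function"]["description"]
        for i, tool in enumerate(tools)
    }


def inject_tool_components(tools: list[dict], components: dict[str, str]) -> list[dict]:
    """Return a copy of the tools with optimized descriptions applied."""
    updated = copy.deepcopy(tools)  # leave the original schemas untouched
    for i, tool in enumerate(updated):
        key = f"tool_{i}_description"
        if key in components:
            tool["function"]["description"] = components[key]
    return updated


tools = [
    {"type": "function",
     "function": {"name": "search", "description": "Search pages.", "parameters": {}}},
    {"type": "function",
     "function": {"name": "read", "description": "Read a page.", "parameters": {}}},
]
components = extract_tool_components(tools)
components["tool_1_description"] = "Read the full text of a page by title."
optimized_tools = inject_tool_components(tools, components)
print(optimized_tools[1]["function"]["description"])
```

Working on a deep copy keeps the original descriptions available for the side-by-side comparison saved with the results.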

Rubric Changes: Feedback Support

A key non-breaking change enables reward functions to return both a score and textual feedback:

Before (still works):

def accuracy(completion, answer, **kwargs) -> float:
    return 1.0 if completion == answer else 0.0

New: Return feedback for better GEPA optimization

def accuracy_with_feedback(completion, answer, **kwargs):
    correct = completion == answer
    return {
        "score": 1.0 if correct else 0.0,
        "feedback": f"Expected: {answer}, Got: {completion}. {'✓ Correct!' if correct else '✗ Incorrect.'}"
    }

The feedback is collected via rubric.get_feedback(state) and used by GEPA's reflection model to understand why rollouts succeeded or failed. This enables more targeted prompt improvements.

Changes to Rubric class:

Added RewardResult TypedDict in types.py for type-safe {"score": float, "feedback": str} returns
Updated _parse_reward_result() to handle both float and dict returns
Added get_feedback(state) method to aggregate feedback from all reward functions
Feedback is stored in state["feedbacks"] during score_rollout() and score_group()
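
The backward-compatible handling described above can be sketched as follows. This is a standalone illustration, not the `Rubric` implementation: `parse_reward_result` and `get_feedback` here are free functions written for this example, and the newline-joined aggregation format is an assumption.

```python
from typing import TypedDict, Union


class RewardResult(TypedDict):
    score: float
    feedback: str


def parse_reward_result(result: Union[float, int, RewardResult]) -> tuple[float, str]:
    """Normalize a reward function's return value to (score, feedback)."""
    if isinstance(result, dict):
        return float(result["score"]), str(result.get("feedback", ""))
    return float(result), ""  # plain numeric rewards carry no feedback


def get_feedback(state: dict) -> str:
    """Join the non-empty feedback strings stored on the state."""
    return "\n".join(fb for fb in state.get("feedbacks", []) if fb)


def accuracy(completion, answer, **kwargs) -> float:
    return 1.0 if completion == answer else 0.0


def accuracy_with_feedback(completion, answer, **kwargs) -> RewardResult:
    correct = completion == answer
    return {
        "score": 1.0 if correct else 0.0,
        "feedback": f"Expected: {answer}, Got: {completion}.",
    }


state = {"feedbacks": []}
total = 0.0
for reward_fn in (accuracy, accuracy_with_feedback):
    score, feedback = parse_reward_result(reward_fn("5", "4"))
    total += score
    state["feedbacks"].append(feedback)

print(total, repr(get_feedback(state)))
```

Both styles of reward function go through the same path, so existing float-returning functions need no changes.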

Experiment Tracking

Built-in support for wandb and MLflow:

Track with wandb:

vf-gepa my-env --budget medium --use-wandb --wandb-project my-project

Track with MLflow:

vf-gepa my-env --budget medium --use-mlflow --mlflow-tracking-uri http://localhost:5000

CLAassistant · 2025-11-25T14:19:52Z

All committers have signed the CLA.

rsalimans47 added 16 commits November 22, 2025 16:37

added gepa integrations folder and readme

0e45d29

first pass on gepa integration and command

03e189a

fixed typo

c050a81

unify vf-gepa cli args with vf-eval

1d62237

renamed '--auto' to '--budget', adjusted default minibatch size

57d5439

a few bugfixes in gepa adapter

92af75e

unified log path with vf-eval

7df2a43

changed location of gepa adapter

dbbfff7

fail fast and loud + fixed unit test

ffebc61

added wandb and mlflow flags

54e0e66

added vf-gepa cli tests

23fd447

simplified build_program in GEPAAdapter

0f1cd1f

Fix GEPA tool call handling and tool_test dict access

17c2a99

improved tool description optimization, new folder structure

1c21915

updated docs

47d705b

improved comments in code

18c71a3
