Conversation

@jli-together jli-together commented Dec 22, 2025

Summary

This PR demonstrates how to use the GEPA method with our evaluation APIs for iterative prompt optimization.


Note

Adds end-to-end GEPA optimization workflows as runnable notebooks.

  • New Evals/GEPA_Optimization.ipynb: optimizes a summarization prompt on CNN/DailyMail using dspy, batch summary generation, and Together Eval compare with a judge model; tracks win rates, saves prompts/results
  • New Evals/Prompt_Optimization.ipynb: optimizes a judge/evaluator prompt via a TogetherEvalAdapter (upload, poll, download, per-subset metrics), minibatch reflection with an optimizer LLM, validation/test evaluation, and results export

Written by Cursor Bugbot for commit e704809.
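At a high level, both notebooks implement the same GEPA-style loop: generate outputs with the current prompt, score them against a baseline with a judge, then ask an optimizer LLM to reflect and propose an improved prompt, keeping only candidates that improve the score. A minimal sketch of that loop, with `generate`, `score`, and `reflect` as hypothetical stand-ins for the notebooks' model and Together Eval calls:

```python
def run_gepa_loop(prompt, train_data, generate, score, reflect, max_iterations=5):
    """Iteratively improve a prompt: evaluate, reflect, propose, keep improvements."""
    best_prompt = prompt
    best_score = score(generate(best_prompt, train_data), train_data)
    history = [(best_prompt, best_score)]
    for _ in range(max_iterations):
        # Reflection sees the current prompt, its score, and the training data
        candidate = reflect(best_prompt, best_score, train_data)
        cand_score = score(generate(candidate, train_data), train_data)
        history.append((candidate, cand_score))
        if cand_score > best_score:  # hill climbing: keep only improvements
            best_prompt, best_score = candidate, cand_score
    return best_prompt, best_score, history
```

The notebooks additionally track per-iteration win rates and persist the best prompt and results to disk.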

@cursor cursor bot left a comment

```python
    summarizer_lm: dspy.LM,
    optimizer_lm: SimpleOptimizerLM,
    max_iterations: int = 5
):
```
Unused train_data parameter breaks GEPA methodology

The run_manual_gepa function accepts train_data as a parameter but never uses it. The GEPA methodology relies on sampling failure examples from training data to guide prompt improvement, as correctly implemented in Prompt_Optimization.ipynb. Instead, reflect_and_improve_prompt only receives a win rate percentage without any actual failure examples to analyze. This makes the optimizer LLM blind to specific failure patterns, significantly reducing the effectiveness of the iterative improvement process.
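One way to address this, sketched here as an assumption about the notebook's data shape (`article`/`summary` dicts and per-example win/loss results) rather than its exact helpers: sample a few failing examples from `train_data` each iteration and include them in the reflection prompt, so the optimizer LLM sees concrete failure patterns instead of only an aggregate win rate.

```python
import random

def sample_failures(train_data, results, k=3, seed=0):
    """Pick up to k training examples the current prompt lost on."""
    failures = [ex for ex, won in zip(train_data, results) if not won]
    rng = random.Random(seed)  # seeded for reproducible optimization runs
    return rng.sample(failures, min(k, len(failures)))

def build_reflection_prompt(current_prompt, win_rate, failure_examples):
    """Assemble the optimizer-LM input: current prompt, score, and failures."""
    examples = "\n\n".join(
        f"Input: {ex['article']}\nSummary: {ex['summary']}"
        for ex in failure_examples
    )
    return (
        f"The prompt below achieved a {win_rate:.0%} win rate.\n\n"
        f"Current prompt:\n{current_prompt}\n\n"
        f"Examples it failed on:\n{examples}\n\n"
        "Propose an improved prompt that fixes these failure patterns."
    )
```

This mirrors the sampling-based reflection already implemented in Prompt_Optimization.ipynb.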


```python
    # Remove language tags if present
    if new_prompt.startswith('markdown\n') or new_prompt.startswith('text\n'):
        new_prompt = '\n'.join(new_prompt.split('\n')[1:])
```
Incomplete language tag removal corrupts extracted prompts

The reflect_and_propose_prompt function only removes markdown and text language tags from the extracted prompt, while the equivalent function in GEPA_Optimization.ipynb also handles python and plaintext. If the optimizer LLM wraps its response in a code block with an unhandled language tag (e.g., ```plaintext), the tag text would remain at the start of the new judge prompt, potentially corrupting it and degrading evaluation quality.
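A more robust approach, sketched here as an assumption rather than either notebook's exact code, is to strip any leading code fence or language tag in one pass instead of enumerating tags with `startswith` checks:

```python
import re

def strip_code_fence(text):
    """Remove a wrapping markdown code fence and any leading language tag
    (markdown, text, python, plaintext, ...) from an LLM response."""
    text = text.strip()
    # Leading fence with optional language tag, e.g. ```plaintext
    text = re.sub(r'^```[a-zA-Z]*\n', '', text)
    # Bare language tag left at the start when no fence was present
    text = re.sub(r'^(?:markdown|text|python|plaintext)\n', '', text)
    # Trailing fence
    text = re.sub(r'\n```$', '', text)
    return text.strip()
```

Handling the tag and the fence together also covers the common case where the optimizer LLM returns the whole prompt wrapped in a fenced block.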

@jli-together jli-together changed the title Add GEPA Optimization for Summarization [MOSH-976] Add GEPA Optimization for Summarization Dec 23, 2025