Labels
enhancement (New feature or request), help wanted (Extra attention is needed)
Description
Overview
Implement an LLM-based judge that can compare agent answers against baselines or evaluate answer quality.
Use Cases
- Baseline Comparison: Compare new answer to human-graded baseline
- Quality Scoring: Evaluate answer on criteria (accuracy, completeness, relevance)
- Regression Detection: Identify if new agent version performs worse
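The three use cases above could be expressed as an evaluation mode plus a simple score comparison for regression detection. This is an illustrative sketch; the type and function names are assumptions, not part of the issue:

```typescript
// Illustrative mapping of the three use cases to an evaluation mode.
type EvaluationMode = "baseline-comparison" | "quality-scoring" | "regression-detection";

// Regression detection reduces to comparing scores between agent versions:
// a new version regresses if its score drops by more than a tolerance.
function isRegression(oldScore: number, newScore: number, tolerance = 0.05): boolean {
  return newScore < oldScore - tolerance;
}
```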
Implementation
- Create src/evaluation/llm-judge.ts
- Use Claude API for judging (configurable model)
- Structured output for scores and reasoning
- Prompt templates for different evaluation types
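One way the judge interface and its structured output could look is sketched below. All names (JudgeConfig, JudgeResult, LlmJudge) are hypothetical; the issue only fixes the file path and the requirements (configurable model, structured score and reasoning, token usage for cost tracking):

```typescript
// Hypothetical shape for src/evaluation/llm-judge.ts.
export interface JudgeConfig {
  model: string;            // configurable model, per the issue
  promptTemplate: string;   // template selected by evaluation type
}

export interface JudgeResult {
  score: number;            // structured score, e.g. on a 0-10 scale
  reasoning: string;        // model's explanation for the score
  inputTokens: number;      // usage counts feed cost tracking
  outputTokens: number;
}

export interface LlmJudge {
  // Baseline comparison: candidate answer vs. human-graded baseline.
  compare(candidate: string, baseline: string): Promise<JudgeResult>;
  // Quality scoring against named criteria (accuracy, completeness, relevance).
  evaluate(answer: string, criteria: string[]): Promise<JudgeResult>;
}
```

Keeping token counts on the result object lets cost tracking live outside the judge itself.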
Tasks
- Design judge interface
- Implement Claude API integration
- Create comparison prompt templates
- Add structured output parsing
- Implement cost tracking for API calls
- Add caching to reduce API costs
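The caching task could start as simply as an in-memory map keyed by a hash of model plus prompt, so identical judge requests never hit the API twice. A minimal sketch, with the judge call abstracted as a callback (names are illustrative):

```typescript
import { createHash } from "node:crypto";

// In-memory cache: key = hash(model + prompt), value = raw judge response.
const cache = new Map<string, string>();

function cacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(model + "\0" + prompt).digest("hex");
}

async function judgeWithCache(
  model: string,
  prompt: string,
  callJudge: (prompt: string) => Promise<string>,
): Promise<string> {
  const key = cacheKey(model, prompt);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // cache hit: no API cost incurred
  const result = await callJudge(prompt);
  cache.set(key, result);
  return result;
}
```

A persistent cache (e.g. on disk, keyed the same way) would carry savings across runs; the separator byte in the key prevents collisions between model and prompt boundaries.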
Acceptance Criteria
- LLM judge can compare two answers
- Returns structured score and reasoning
- Configurable model and prompts
- Cost tracking for API calls
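Cost tracking can be derived from the per-call token counts. The sketch below treats per-million-token prices as configuration; the numbers used in any real deployment must come from the provider's current pricing, not from this example:

```typescript
// Placeholder pricing structure -- values are configuration, not real prices.
interface ModelPricing {
  inputPerMTok: number;   // USD per million input tokens (assumed)
  outputPerMTok: number;  // USD per million output tokens (assumed)
}

// Cost of a single judge call given its token usage.
function callCost(inputTokens: number, outputTokens: number, p: ModelPricing): number {
  return (inputTokens / 1_000_000) * p.inputPerMTok
       + (outputTokens / 1_000_000) * p.outputPerMTok;
}
```

Accumulating `callCost` across a run gives the total spend the acceptance criteria ask to track.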