
Implement LLM-judge evaluator for answer comparison #29

@jharris1679

Description


Overview

Implement an LLM-based judge that can compare agent answers against baselines or evaluate answer quality.

Use Cases

  1. Baseline Comparison: Compare new answer to human-graded baseline
  2. Quality Scoring: Evaluate answer on criteria (accuracy, completeness, relevance)
  3. Regression Detection: Identify whether a new agent version performs worse than the previous one
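The three use cases above share a common output shape. A possible sketch of the judge's result type and a validity guard (all names here are illustrative, not settled design):

```typescript
// Hypothetical types for the LLM judge; names are assumptions, not final API.
export type EvaluationType =
  | "baseline-comparison"
  | "quality-scoring"
  | "regression-check";

export interface JudgeResult {
  score: number;          // normalized to [0, 1]
  reasoning: string;      // the judge model's explanation
  evaluationType: EvaluationType;
}

// Guard against malformed model output before trusting a score.
export function isValidResult(r: JudgeResult): boolean {
  return (
    Number.isFinite(r.score) &&
    r.score >= 0 &&
    r.score <= 1 &&
    r.reasoning.length > 0
  );
}
```

Normalizing every evaluation type to one result shape keeps downstream consumers (dashboards, regression gates) agnostic to which prompt produced the score.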

Implementation

  • Create src/evaluation/llm-judge.ts
  • Use Claude API for judging (configurable model)
  • Structured output for scores and reasoning
  • Prompt templates for different evaluation types
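One way the prompt-template and structured-output pieces could fit together in `src/evaluation/llm-judge.ts` (the function names and the JSON schema the judge is asked to emit are assumptions for illustration):

```typescript
// Sketch of a comparison prompt plus structured-output parsing.
// The requested JSON schema ({"score", "reasoning"}) is an assumption.
export function buildComparisonPrompt(baseline: string, candidate: string): string {
  return [
    "You are grading an agent's answer against a human-graded baseline.",
    `Baseline answer:\n${baseline}`,
    `Candidate answer:\n${candidate}`,
    'Respond with JSON only: {"score": <number 0-1>, "reasoning": "<why>"}',
  ].join("\n\n");
}

export function parseJudgement(raw: string): { score: number; reasoning: string } {
  // Models sometimes wrap JSON in prose or code fences; extract the first object.
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("No JSON object in judge response");
  const parsed = JSON.parse(match[0]);
  if (typeof parsed.score !== "number" || typeof parsed.reasoning !== "string") {
    throw new Error("Judge response missing score or reasoning");
  }
  return { score: parsed.score, reasoning: parsed.reasoning };
}
```

Keeping templates as plain functions makes it easy to add per-evaluation-type variants later without touching the parsing side.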

Tasks

  • Design judge interface
  • Implement Claude API integration
  • Create comparison prompt templates
  • Add structured output parsing
  • Implement cost tracking for API calls
  • Add caching to reduce API costs
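The cost-tracking and caching tasks could be sketched as below. The per-token prices are placeholders that should come from configuration, and the caching wrapper is a minimal in-memory version, assuming the judge call is passed in as a function:

```typescript
import { createHash } from "node:crypto";

// Placeholder USD prices per million tokens; real values belong in config.
const PRICE_PER_MTOK = { input: 3.0, output: 15.0 };

export function callCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * PRICE_PER_MTOK.input + outputTokens * PRICE_PER_MTOK.output) /
    1_000_000
  );
}

// Cache judgements by a hash of (model, prompt) so re-running the same
// comparison does not pay for another API call.
const cache = new Map<string, string>();

export function cacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}\n${prompt}`).digest("hex");
}

export async function judgeWithCache(
  model: string,
  prompt: string,
  call: (prompt: string) => Promise<string>,
): Promise<string> {
  const key = cacheKey(model, prompt);
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const result = await call(prompt);
  cache.set(key, result);
  return result;
}
```

An in-memory map is enough for a single evaluation run; a disk- or database-backed cache would be needed to share results across runs.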

Acceptance Criteria

  • LLM judge can compare two answers
  • Returns structured score and reasoning
  • Configurable model and prompts
  • Cost tracking for API calls

Metadata


    Labels

    enhancement (New feature or request), help wanted (Extra attention is needed)
