
πŸ€– AI Agent Evaluation System

Agentuity
Multi-Agent Evaluation Framework

Welcome to the Agentuity AI Agent Evaluation System! This project provides a comprehensive framework for evaluating AI models using a multi-agent architecture that orchestrates dataset loading, prompt templating, model execution, and result analysis.

🎯 Overview

This evaluation system uses multiple specialized agents to create a robust, scalable framework for testing AI models against ground truth datasets. Each agent has a discrete task and can pass information to others through the Agentuity key-value store.

The system includes a modern React-based web interface that provides:

  • Results Dashboard: View evaluation summaries, detailed results, and real-time progress
  • Dataset Management: Upload and manage evaluation datasets
  • Evaluation Configuration: Set up new evaluations with custom parameters
  • Settings Management: Configure API endpoints and authentication

System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Dataset Loader │───▢│ Template Manager│───▢│ Evaluation      β”‚
β”‚     Agent       β”‚    β”‚     Agent       β”‚    β”‚ Runner Agent    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                        β”‚
                                                        β–Ό
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚  LLM as Judge   │◀───│   Claude API    β”‚
                       β”‚     Agent       β”‚    β”‚  (integrated)   β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                        β”‚
                                                        β–Ό
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚  Results API    │───▢│  React Frontend β”‚
                       β”‚     Agent       β”‚    β”‚   (Web UI)      β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ—οΈ Agent Structure

Core Agents

  • dataset_loader: Loads and validates ground truth evaluation datasets
  • template_manager: Handles prompt templates with variable substitution (see the substitution sketch after this list)
  • evaluation_runner: Executes evaluation cases, coordinates model testing, and communicates directly with AI models (Claude, GPT, etc.)
  • llm_as_judge: Uses Claude to intelligently compare model outputs against expected results with 0-100 similarity scoring
  • results_api: Serves evaluation data from KV store to the React frontend
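
As a rough illustration of the variable substitution handled by template_manager, the sketch below fills {{variable}} placeholders in a prompt template. The template format matches the prompt_template example later in this README; the substitute_variables helper itself is purely illustrative and not part of the codebase.

import re

def substitute_variables(template: str, variables: dict[str, str]) -> str:
    """Replace {{name}} placeholders with values from `variables` (illustrative helper)."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Missing value for template variable: {name}")
        return variables[name]
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, template)

template = "You are an expert on state capitals. Answer the following question: {{query}}"
print(substitute_variables(template, {"query": "What is the capital of New York?"}))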

πŸ“‹ Prerequisites

Before you begin, ensure you have the following installed:

  • Python: Version 3.10 or higher
  • UV: Version 0.5.25 or higher (Documentation)
  • Node.js: Version 18 or higher
  • npm: Latest version
  • Agentuity CLI: Latest version

πŸš€ Getting Started

Authentication

agentuity login

Quick Setup

  1. Start Development Environment:

    # Start the Agentuity development server
    agentuity dev
  2. Configure Backend:

    # Run the setup script (uses dev environment by default)
    python setup_env.py
    
    # Or manually set environment variables (optional)
    export AGENTUITY_BASE_URL="https://dev-i9bbrvz6i.agentuity.run"
  3. Install Backend Dependencies:

    uv sync
  4. Install Frontend Dependencies:

    cd react-frontend
    npm install
  5. Test Connection:

    # Test your dev environment
    python setup_env.py --test

Run the Frontend

Start the React web interface:

cd react-frontend
npm run dev

The frontend will be available at http://localhost:5173

Development Mode (Backend)

Run the evaluation system in development mode:

agentuity dev

Or start locally without the console:

uv run server.py

πŸ› οΈ Configuration

Development Environment

The system is configured to use the Agentuity development environment:

  • Base URL: https://dev-i9bbrvz6i.agentuity.run
  • Dataset Loader Agent ID: agent_abcf9ad4245d2d89aed9eb38aef21fd6
  • Results API Agent ID: agent_b46de37831f94d01b06b2ccfd183efa0
  • Authentication: Optional (not required in dev mode)

Environment Variables

# Required
AGENTUITY_BASE_URL=https://dev-uzeflmvib.agentuity.run

# Optional
DATASET_LOADER_AGENT_ID=agent_abcf9ad4245d2d89aed9eb38aef21fd6
RESULTS_API_AGENT_ID=agent_b46de37831f94d01b06b2ccfd183efa0
AGENTUITY_API_TOKEN=  # Not required for dev mode

Development Setup

  1. Start the development server:

    agentuity dev
  2. Verify agents are loaded:

    [INFO ] Loaded 6 agents
    [INFO ] Loaded agent: dataset_loader [agent_abcf9ad4245d2d89aed9eb38aef21fd6]
    [INFO ] Loaded agent: results_api [agent_b46de37831f94d01b06b2ccfd183efa0]
    [INFO ] Starting server on port 63229
    
  3. Use the public URL https://dev-uzeflmvib.agentuity.run and set it in your react-frontend/src/services/api file.

API Testing

Test your development environment:

# Using curl (no auth required)
curl https://dev-uzeflmvib.agentuity.run/agent_abcf9ad4245d2d89aed9eb38aef21fd6

# Using the setup script
python setup_env.py --test
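
If you prefer Python over curl, a minimal connectivity check might look like the following. It assumes the requests package is installed and reuses the dev URL and dataset_loader agent ID listed above; the exact response body depends on the agent.

import requests

BASE_URL = "https://dev-uzeflmvib.agentuity.run"
DATASET_LOADER_AGENT_ID = "agent_abcf9ad4245d2d89aed9eb38aef21fd6"

# Hit the dataset_loader agent endpoint; no auth header is needed in dev mode
resp = requests.get(f"{BASE_URL}/{DATASET_LOADER_AGENT_ID}", timeout=10)
print(resp.status_code, resp.text[:200])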

πŸ“Š Usage

Option 1: Using the Web Interface (Recommended)

  1. Start the Backend: Run agentuity dev
  2. Start the Frontend:
    cd react-frontend
    npm run dev
  3. Configure Settings: Go to Settings β†’ General and configure your API endpoints
  4. Upload Dataset: Use the Dataset Manager to upload or create evaluation data
  5. Configure Evaluation: Go to "New Evaluation" and set up your test parameters
  6. Monitor Progress: Watch real-time progress in the Results Dashboard
  7. Analyze Results: View detailed analytics, charts, and export data

Option 2: Using the API Directly

1. Prepare Your Dataset

Create a ground truth dataset in JSON format:

[
  {
    "query": "What is the capital of New York?",
    "response": "Albany"
  },
  {
    "query": "What is the capital of California?", 
    "response": "Sacramento"
  }
]
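
Before uploading, you can sanity-check that a dataset file really is a list of query/response pairs. This is a minimal, illustrative validator; the load_ground_truth helper and the example file name are assumptions, not part of the project.

import json

def load_ground_truth(path: str) -> list[dict]:
    """Load a dataset and verify it is a list of {"query", "response"} pairs."""
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)
    if not isinstance(cases, list):
        raise ValueError("Dataset must be a JSON array of cases")
    for i, case in enumerate(cases):
        missing = {"query", "response"} - set(case)
        if missing:
            raise ValueError(f"Case {i} is missing fields: {sorted(missing)}")
    return cases

cases = load_ground_truth("datasets/state_capitals.json")
print(f"Loaded {len(cases)} cases")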

2. Create an Evaluation Request

Send a request to the dataset_loader agent:

{
  "evaluation_id": "state_capitals_eval_001",
  "dataset_path": "datasets/state_capitals.json",
  "format": "query_response_pairs",
  "prompt_template": {
    "template": "You are an expert on state capitals. Answer the following question: {{query}}",
    "variables": ["query"]
  },
  "model_config": {
    "model_name": "claude-3-5-sonnet-latest",
    "max_tokens": 100,
    "temperature": 0.1
  },
  "evaluation_settings": {
    "similarity_threshold": 80,
    "judge_model": "claude-3-5-haiku-latest"
  }
}
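
One way to send this request from Python is sketched below, assuming the requests package and the dev endpoint shown earlier; the exact response shape depends on the dataset_loader agent.

import requests

BASE_URL = "https://dev-uzeflmvib.agentuity.run"
DATASET_LOADER_AGENT_ID = "agent_abcf9ad4245d2d89aed9eb38aef21fd6"

evaluation_request = {
    "evaluation_id": "state_capitals_eval_001",
    "dataset_path": "datasets/state_capitals.json",
    "format": "query_response_pairs",
    "prompt_template": {
        "template": "You are an expert on state capitals. Answer the following question: {{query}}",
        "variables": ["query"],
    },
    "model_config": {"model_name": "claude-3-5-sonnet-latest", "max_tokens": 100, "temperature": 0.1},
    "evaluation_settings": {"similarity_threshold": 80, "judge_model": "claude-3-5-haiku-latest"},
}

# POST the request to the dataset_loader agent to kick off the evaluation
resp = requests.post(f"{BASE_URL}/{DATASET_LOADER_AGENT_ID}", json=evaluation_request, timeout=30)
resp.raise_for_status()
print(resp.text)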

3. Review Results

The system will generate a comprehensive evaluation report:

{
  "evaluation_id": "state_capitals_eval_001",
  "summary": {
    "total_cases": 50,
    "average_similarity_score": 87.2,
    "high_similarity_count": 42,
    "medium_similarity_count": 6,
    "low_similarity_count": 2,
    "high_similarity_rate": 0.84
  },
  "detailed_results": [...],
  "model_config": {...}
}

πŸ”„ Data Flow

The evaluation system uses the Agentuity key-value store for agent communication:

  1. Initial Request β†’ dataset_loader agent
  2. Dataset Loading β†’ Key: eval_run_{id}_dataset
  3. Template Processing β†’ Key: eval_run_{id}_processed
  4. Evaluation Execution β†’ Key: eval_run_{id}_results
  5. LLM Judging β†’ Key: eval_run_{id}_comparison (Final structured results)
  6. Results API β†’ Serves data to React frontend via operation-based requests
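
As a small illustration of the key convention above, the helper below builds the per-stage keys for a given evaluation run. The function itself is illustrative and not part of the codebase.

def kv_keys(evaluation_id: str) -> dict[str, str]:
    """Return the KV store keys used at each stage of an evaluation run."""
    return {
        "dataset":    f"eval_run_{evaluation_id}_dataset",     # raw ground truth cases
        "processed":  f"eval_run_{evaluation_id}_processed",   # cases with templated prompts
        "results":    f"eval_run_{evaluation_id}_results",     # raw model outputs
        "comparison": f"eval_run_{evaluation_id}_comparison",  # judged, structured results
    }

print(kv_keys("state_capitals_eval_001")["comparison"])
# eval_run_state_capitals_eval_001_comparison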

πŸ› οΈ Configuration

Environment Variables

Set up API keys for different models:

agentuity env set --secret ANTHROPIC_API_KEY=your_key_here
agentuity env set --secret OPENAI_API_KEY=your_key_here

Evaluation Settings

Configure default evaluation parameters:

agentuity env set DEFAULT_MODEL=claude-3-5-sonnet-latest
agentuity env set DEFAULT_JUDGE_MODEL=claude-3-5-haiku-latest
agentuity env set DEFAULT_SIMILARITY_THRESHOLD=80
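
Inside an agent, these defaults can be read with standard environment lookups. A minimal sketch, where the fallback values simply mirror the commands above:

import os

DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "claude-3-5-sonnet-latest")
DEFAULT_JUDGE_MODEL = os.getenv("DEFAULT_JUDGE_MODEL", "claude-3-5-haiku-latest")
DEFAULT_SIMILARITY_THRESHOLD = int(os.getenv("DEFAULT_SIMILARITY_THRESHOLD", "80"))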

πŸ“ Project Structure

β”œβ”€β”€ agents/
β”‚   β”œβ”€β”€ dataset_loader/         # Loads evaluation datasets
β”‚   β”œβ”€β”€ template_manager/       # Manages prompt templates
β”‚   β”œβ”€β”€ evaluation_runner/      # Executes evaluation cases and communicates with AI models
β”‚   β”œβ”€β”€ llm_as_judge/          # Uses Claude to judge response similarity (0-100 scoring)
β”‚   └── results_api/           # Serves evaluation data to frontend
β”œβ”€β”€ react-frontend/             # React web interface
β”‚   β”œβ”€β”€ src/                   # React source code
β”‚   β”œβ”€β”€ public/                # Static assets
β”‚   β”œβ”€β”€ package.json           # Frontend dependencies
β”‚   └── vite.config.ts         # Vite configuration
β”œβ”€β”€ datasets/                   # Ground truth datasets
β”œβ”€β”€ .venv/                      # Virtual environment
β”œβ”€β”€ pyproject.toml             # Backend dependencies
β”œβ”€β”€ server.py                  # Server entry point
└── agentuity.yaml            # Project configuration

🎯 LLM-as-Judge Evaluation

The system uses Claude as an intelligent judge to evaluate response similarity:

  • Semantic Understanding: Understands context, synonyms, and meaning beyond simple string matching
  • Flexible Scoring: Provides 0-100 similarity scores with detailed reasoning
  • Context Aware: Considers the original question, expected response, and actual response
  • Consistent: Uses low temperature settings for reliable, repeatable judgments
  • Explainable: Provides reasoning for each similarity score
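
The sketch below shows the kind of judging call the llm_as_judge agent could make with the official anthropic Python SDK. The prompt wording and the judge_similarity helper are illustrative, not the agent's actual implementation.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_similarity(query: str, expected: str, actual: str) -> str:
    """Ask Claude for a 0-100 similarity score plus reasoning (illustrative prompt)."""
    prompt = (
        "You are judging how closely a model response matches an expected answer.\n"
        f"Question: {query}\n"
        f"Expected response: {expected}\n"
        f"Actual response: {actual}\n"
        "Reply with a similarity score from 0 to 100 and a short justification."
    )
    message = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=300,
        temperature=0.0,  # low temperature for repeatable judgments
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

print(judge_similarity(
    "What is the capital of New York?",
    "Albany",
    "The capital of New York is Albany.",
))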

🌐 Deployment

Deploy your evaluation system to Agentuity Cloud:

agentuity deploy

πŸ†˜ Troubleshooting

If you encounter issues:

  1. Check agent logs in the Agentuity Console
  2. Verify your dataset format matches the expected schema
  3. Ensure all required environment variables are set
  4. Check the React frontend logs in your terminal
  5. Verify API connectivity in the Settings page
  6. Join our Discord community for support

πŸ“š Documentation

πŸ“ License

This project is licensed under the terms specified in the LICENSE file.

Architecture

  • React Frontend: Modern UI built with Vite, React, TypeScript, and Tailwind CSS
  • Agentuity Backend: Multi-agent system for running evaluations and storing results
  • Results API Agent: Serves evaluation data from KV store to frontend

Setup Instructions

1. Deploy the Results API Agent

The Results API agent is already included in your project. Deploy it with:

# Deploy all agents including the results API
agentuity deploy

2. Configure the Frontend

  1. Install dependencies:

    cd react-frontend
    npm install
  2. Update API configuration in the Settings page:

    • Go to Settings β†’ General
    • Set your Agentuity Base URL (e.g., https://dev-uzeflmvib.agentuity.run)
    • Set the Results API Agent ID (e.g., agent_b46de37831f94d01b06b2ccfd183efa0)
    • Test the connection
  3. Start the development server:

    npm run dev

3. Test the API

You can test the Results API agent directly:

# List evaluations
curl -X POST https://dev-uzeflmvib.agentuity.run/agent_b46de37831f94d01b06b2ccfd183efa0 \
  -H "Content-Type: application/json" \
  -d '{"operation": "list_evaluations"}'

# Get evaluation details
curl -X POST https://dev-uzeflmvib.agentuity.run/agent_b46de37831f94d01b06b2ccfd183efa0 \
  -H "Content-Type: application/json" \
  -d '{"operation": "get_evaluation_details", "evaluation_id": "your_eval_id"}'

API Operations

The Results API agent provides these operations (sent in request body):

  • {"operation": "list_evaluations"} - List all evaluations with metadata
  • {"operation": "get_evaluation_details", "evaluation_id": "eval_001"} - Get detailed results for a specific evaluation
  • {"operation": "get_evaluation_cases", "evaluation_id": "eval_001"} - Get individual case results for an evaluation

Data Flow

  1. Evaluation Execution: Your existing agents (dataset_loader, template_manager, evaluation_runner, llm_as_judge) run evaluations and store results in the KV store
  2. Results API: The new Results API agent reads from KV store and serves data via operation-based requests
  3. Frontend: React app sends POST requests with operation parameters to the Results API agent

Features

Results Dashboard

  • Summary View: High-level metrics and charts
  • Detailed View: Individual query-response pairs with similarity scores
  • Case Details: Full comparison view with judge reasoning
  • Real-time Status: Shows running, completed, and failed evaluations

Settings

  • API Configuration: Configure Agentuity endpoints and authentication
  • Connection Testing: Verify connectivity to your Agentuity instance
  • Default Parameters: Set default model and evaluation settings

KV Store Data Structure

The system expects evaluation data in this format:

Metadata (eval_metadata namespace)

{
  "evaluation_id": "eval_001",
  "status": "comparison_completed",
  "started_at": "2024-01-15T10:30:00Z",
  "completed_at": "2024-01-15T10:35:00Z",
  "dataset_name": "superhero_powers.json",
  "model_name": "claude-3-5-sonnet-latest",
  "judge_model": "claude-3-5-haiku-latest",
  "similarity_threshold": 80,
  "total_cases": 12,
  "comparison_summary": {
    "average_similarity": 87.2,
    "high_similarity_count": 10,
    "medium_similarity_count": 2,
    "low_similarity_count": 0,
    "high_similarity_rate": 0.83
  }
}

Comparison Results (eval_comparison namespace)

{
  "evaluation_id": "eval_001",
  "total_cases": 12,
  "comparison_results": [
    {
      "case_id": "case_001",
      "original_query": "What are Superman's main superpowers?",
      "expected_response": "Superman's main superpowers include...",
      "model_response": "Superman possesses incredible strength...",
      "similarity_score": 92,
      "similarity_category": "high",
      "judge_reasoning": "Both responses accurately list...",
      "success": true
    }
  ]
}
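
As a worked example of how the comparison_summary in the metadata can be derived from these per-case results, here is an illustrative aggregation. The high/medium/low cutoffs (the threshold for "high", threshold minus 20 for "medium") are assumptions for the sketch, not the agent's documented rules.

def summarize(comparison_results: list[dict], similarity_threshold: int = 80) -> dict:
    """Aggregate per-case similarity scores into a summary (illustrative cutoffs)."""
    scores = [case["similarity_score"] for case in comparison_results]
    high = sum(score >= similarity_threshold for score in scores)
    medium = sum(similarity_threshold - 20 <= score < similarity_threshold for score in scores)
    low = len(scores) - high - medium
    return {
        "average_similarity": round(sum(scores) / len(scores), 1),
        "high_similarity_count": high,
        "medium_similarity_count": medium,
        "low_similarity_count": low,
        "high_similarity_rate": round(high / len(scores), 2),
    }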

Development

Frontend Development

cd react-frontend
npm run dev    # Start development server
npm run build  # Build for production

Backend Development

agentuity deploy  # Deploy agent changes
agentuity logs    # View agent logs

Troubleshooting

  1. Connection Failed: Check that your Agentuity instance is running and the agent ID is correct
  2. No Evaluations Found: Ensure your evaluation agents are storing data in the expected KV store format
  3. CORS Issues: Make sure your Agentuity instance allows requests from your frontend domain

Next Steps

  • Add authentication/authorization
  • Implement real-time updates for running evaluations
  • Add export functionality for results
  • Create evaluation templates and presets
