Welcome to the Agentuity AI Agent Evaluation System! This project provides a comprehensive framework for evaluating AI models using a multi-agent architecture that orchestrates dataset loading, prompt templating, model execution, and result analysis.
This evaluation system uses multiple specialized agents to create a robust, scalable framework for testing AI models against ground truth datasets. Each agent has a discrete task and can pass information to others through the Agentuity key-value store.
The system includes a modern React-based web interface that provides:
- Results Dashboard: View evaluation summaries, detailed results, and real-time progress
- Dataset Management: Upload and manage evaluation datasets
- Evaluation Configuration: Set up new evaluations with custom parameters
- Settings Management: Configure API endpoints and authentication
```
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Dataset Loader  │────▶│ Template Manager │────▶│    Evaluation    │
│      Agent       │     │      Agent       │     │   Runner Agent   │
└──────────────────┘     └──────────────────┘     └──────────────────┘
                                                            │
                                                            ▼
┌──────────────────┐     ┌──────────────────┐
│   LLM as Judge   │◀────│    Claude API    │
│      Agent       │     │   (integrated)   │
└──────────────────┘     └──────────────────┘
          │
          ▼
┌──────────────────┐     ┌──────────────────┐
│   Results API    │────▶│  React Frontend  │
│      Agent       │     │     (Web UI)     │
└──────────────────┘     └──────────────────┘
```
- dataset_loader: Loads and validates ground truth evaluation datasets
- template_manager: Handles prompt templates with variable substitution (a sketch of this substitution follows below)
- evaluation_runner: Executes evaluation cases, coordinates model testing, and communicates directly with AI models (Claude, GPT, etc.)
- llm_as_judge: Uses Claude to intelligently compare model outputs against expected results with 0-100 similarity scoring
- results_api: Serves evaluation data from KV store to the React frontend
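To make the template_manager's role concrete, here is a minimal sketch of `{{variable}}` substitution as used in the prompt templates shown later in this README. It is illustrative only; the agent's actual implementation may differ.

```python
import re

def render_template(template: str, variables: dict[str, str]) -> str:
    """Replace {{name}} placeholders with values from `variables`.

    Raises KeyError if the template references a variable that was not supplied.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1).strip()
        return variables[name]

    return re.sub(r"\{\{(.*?)\}\}", substitute, template)

# Example using the prompt template from the evaluation request shown below
prompt = render_template(
    "You are an expert on state capitals. Answer the following question: {{query}}",
    {"query": "What is the capital of New York?"},
)
print(prompt)
```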
Before you begin, ensure you have the following installed:
- Python: Version 3.10 or higher
- UV: Version 0.5.25 or higher (Documentation)
- Node.js: Version 18 or higher
- npm: Latest version
- Agentuity CLI: Latest version
- Login to Agentuity:

  ```
  agentuity login
  ```

- Start Development Environment:

  ```
  # Start the Agentuity development server
  agentuity dev
  ```

- Configure Backend:

  ```
  # Run the setup script (uses dev environment by default)
  python setup_env.py

  # Or manually set environment variables (optional)
  export AGENTUITY_BASE_URL="https://dev-i9bbrvz6i.agentuity.run"
  ```

- Install Backend Dependencies:

  ```
  uv sync
  ```

- Install Frontend Dependencies:

  ```
  cd react-frontend
  npm install
  ```

- Test Connection:

  ```
  # Test your dev environment
  python setup_env.py --test
  ```
Start the React web interface:

```
cd react-frontend
npm run dev
```

The frontend will be available at http://localhost:5173

Run the evaluation system in development mode:

```
agentuity dev
```

Or start locally without the console:

```
uv run server.py
```

The system is configured to use the Agentuity development environment:
- Base URL: https://dev-i9bbrvz6i.agentuity.run
- Dataset Loader Agent ID: agent_abcf9ad4245d2d89aed9eb38aef21fd6
- Results API Agent ID: agent_b46de37831f94d01b06b2ccfd183efa0
- Authentication: Optional (not required in dev mode)
```
# Required
AGENTUITY_BASE_URL=https://dev-uzeflmvib.agentuity.run

# Optional
DATASET_LOADER_AGENT_ID=agent_abcf9ad4245d2d89aed9eb38aef21fd6
RESULTS_API_AGENT_ID=agent_b46de37831f94d01b06b2ccfd183efa0
AGENTUITY_API_TOKEN=  # Not required for dev mode
```
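For reference, a client script could assemble agent URLs from these variables roughly as follows. This is a hedged sketch, not the actual logic of setup_env.py; the defaults simply mirror the dev values listed above.

```python
import os

# Defaults mirror the dev environment documented in this README (assumption, not required)
BASE_URL = os.environ.get("AGENTUITY_BASE_URL", "https://dev-uzeflmvib.agentuity.run")
DATASET_LOADER_ID = os.environ.get("DATASET_LOADER_AGENT_ID", "agent_abcf9ad4245d2d89aed9eb38aef21fd6")
RESULTS_API_ID = os.environ.get("RESULTS_API_AGENT_ID", "agent_b46de37831f94d01b06b2ccfd183efa0")

def agent_url(agent_id: str) -> str:
    """Agents are addressed as <base_url>/<agent_id>, as in the curl examples below."""
    return f"{BASE_URL.rstrip('/')}/{agent_id}"

print(agent_url(DATASET_LOADER_ID))
```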
- Start the development server:

  ```
  agentuity dev
  ```

- Verify agents are loaded:

  ```
  [INFO ] Loaded 6 agents
  [INFO ] Loaded agent: dataset_loader [agent_abcf9ad4245d2d89aed9eb38aef21fd6]
  [INFO ] Loaded agent: results_api [agent_b46de37831f94d01b06b2ccfd183efa0]
  [INFO ] Starting server on port 63229
  ```

- Use the public URL:

  ```
  https://dev-uzeflmvib.agentuity.run
  ```

  You must set this URL in your react-frontend/src/services/api file.
Test your development environment:
```
# Using curl (no auth required)
curl https://dev-uzeflmvib.agentuity.run/agent_abcf9ad4245d2d89aed9eb38aef21fd6

# Using the setup script
python setup_env.py --test
```

- Start the Backend: Run agentuity dev
- Start the Frontend: Run cd react-frontend && npm run dev
- Configure Settings: Go to Settings → General and configure your API endpoints
- Upload Dataset: Use the Dataset Manager to upload or create evaluation data
- Configure Evaluation: Go to "New Evaluation" and set up your test parameters
- Monitor Progress: Watch real-time progress in the Results Dashboard
- Analyze Results: View detailed analytics, charts, and export data
Create a ground truth dataset in JSON format:
```json
[
  {
    "query": "What is the capital of New York?",
    "response": "Albany"
  },
  {
    "query": "What is the capital of California?",
    "response": "Sacramento"
  }
]
```

Send a request to the dataset_loader agent:
```json
{
  "evaluation_id": "state_capitals_eval_001",
  "dataset_path": "datasets/state_capitals.json",
  "format": "query_response_pairs",
  "prompt_template": {
    "template": "You are an expert on state capitals. Answer the following question: {{query}}",
    "variables": ["query"]
  },
  "model_config": {
    "model_name": "claude-3-5-sonnet-latest",
    "max_tokens": 100,
    "temperature": 0.1
  },
  "evaluation_settings": {
    "similarity_threshold": 80,
    "judge_model": "claude-3-5-haiku-latest"
  }
}
```

The system will generate a comprehensive evaluation report:
```json
{
  "evaluation_id": "state_capitals_eval_001",
  "summary": {
    "total_cases": 50,
    "average_similarity_score": 87.2,
    "high_similarity_count": 42,
    "medium_similarity_count": 6,
    "low_similarity_count": 2,
    "high_similarity_rate": 0.84
  },
  "detailed_results": [...],
  "model_config": {...}
}
```

The evaluation system uses the Agentuity key-value store for agent communication:
- Initial Request → dataset_loader agent (a request sketch follows this list)
- Dataset Loading → Key: eval_run_{id}_dataset
- Template Processing → Key: eval_run_{id}_processed
- Evaluation Execution → Key: eval_run_{id}_results
- LLM Judging → Key: eval_run_{id}_comparison (final structured results)
- Results API → Serves data to React frontend via operation-based requests
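To illustrate the entry point of this flow, here is a minimal sketch of submitting the evaluation request from the example above over plain HTTP, mirroring the curl convention used elsewhere in this README (POST to base_url/agent_id). The URL and agent ID are the dev values documented earlier; the shape of the agent's acknowledgement response is an assumption.

```python
import requests

# Dev endpoint and dataset_loader agent ID documented earlier in this README
BASE_URL = "https://dev-uzeflmvib.agentuity.run"
DATASET_LOADER_ID = "agent_abcf9ad4245d2d89aed9eb38aef21fd6"

evaluation_request = {
    "evaluation_id": "state_capitals_eval_001",
    "dataset_path": "datasets/state_capitals.json",
    "format": "query_response_pairs",
    "prompt_template": {
        "template": "You are an expert on state capitals. Answer the following question: {{query}}",
        "variables": ["query"],
    },
    "model_config": {
        "model_name": "claude-3-5-sonnet-latest",
        "max_tokens": 100,
        "temperature": 0.1,
    },
    "evaluation_settings": {
        "similarity_threshold": 80,
        "judge_model": "claude-3-5-haiku-latest",
    },
}

# POST the request body to <base_url>/<agent_id>, as in the curl examples
response = requests.post(f"{BASE_URL}/{DATASET_LOADER_ID}", json=evaluation_request, timeout=60)
response.raise_for_status()
print(response.json())  # whatever acknowledgement the agent returns
```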
Set up API keys for different models:

```
agentuity env set --secret ANTHROPIC_API_KEY=your_key_here
agentuity env set --secret OPENAI_API_KEY=your_key_here
```

Configure default evaluation parameters:

```
agentuity env set DEFAULT_MODEL=claude-3-5-sonnet-latest
agentuity env set DEFAULT_JUDGE_MODEL=claude-3-5-haiku-latest
agentuity env set DEFAULT_SIMILARITY_THRESHOLD=80
```

The project is organized as follows:

```
├── agents/
│   ├── dataset_loader/      # Loads evaluation datasets
│   ├── template_manager/    # Manages prompt templates
│   ├── evaluation_runner/   # Executes evaluation cases and communicates with AI models
│   ├── llm_as_judge/        # Uses Claude to judge response similarity (0-100 scoring)
│   └── results_api/         # Serves evaluation data to frontend
├── react-frontend/          # React web interface
│   ├── src/                 # React source code
│   ├── public/              # Static assets
│   ├── package.json         # Frontend dependencies
│   └── vite.config.ts       # Vite configuration
├── datasets/                # Ground truth datasets
├── .venv/                   # Virtual environment
├── pyproject.toml           # Backend dependencies
├── server.py                # Server entry point
└── agentuity.yaml           # Project configuration
```
The system uses Claude as an intelligent judge to evaluate response similarity (a sketch of the pattern follows the list below):
- Semantic Understanding: Understands context, synonyms, and meaning beyond simple string matching
- Flexible Scoring: Provides 0-100 similarity scores with detailed reasoning
- Context Aware: Considers the original question, expected response, and actual response
- Consistent: Uses low temperature settings for reliable, repeatable judgments
- Explainable: Provides reasoning for each similarity score
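The following is a minimal sketch of this judging pattern using the Anthropic Python SDK. It is illustrative only and is not the project's llm_as_judge implementation; the prompt wording and the score/reason parsing are assumptions.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_similarity(query: str, expected: str, actual: str,
                     judge_model: str = "claude-3-5-haiku-latest") -> dict:
    """Ask Claude for a 0-100 similarity score plus a one-sentence justification."""
    prompt = (
        "You are judging whether a model response matches an expected answer.\n"
        f"Question: {query}\n"
        f"Expected response: {expected}\n"
        f"Actual response: {actual}\n"
        "Reply with a line 'SCORE: <0-100>' followed by a line 'REASON: <one sentence>'."
    )
    message = client.messages.create(
        model=judge_model,
        max_tokens=200,
        temperature=0.0,  # low temperature for consistent, repeatable judgments
        messages=[{"role": "user", "content": prompt}],
    )
    text = message.content[0].text
    score = next(int(line.split(":", 1)[1]) for line in text.splitlines() if line.startswith("SCORE:"))
    reason = next((line.split(":", 1)[1].strip() for line in text.splitlines() if line.startswith("REASON:")), "")
    return {"similarity_score": score, "judge_reasoning": reason}
```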
Deploy your evaluation system to Agentuity Cloud:
```
agentuity deploy
```

If you encounter issues:
- Check agent logs in the Agentuity Console
- Verify your dataset format matches the expected schema
- Ensure all required environment variables are set
- Check the React frontend logs in your terminal
- Verify API connectivity in the Settings page
- Join our Discord community for support
This project is licensed under the terms specified in the LICENSE file.
- React Frontend: Modern UI built with Vite, React, TypeScript, and Tailwind CSS
- Agentuity Backend: Multi-agent system for running evaluations and storing results
- Results API Agent: Serves evaluation data from KV store to frontend
The Results API agent is already included in your project. Deploy it with:
```
# Deploy all agents including the results API
agentuity deploy
```
- Install dependencies:

  ```
  cd react-frontend
  npm install
  ```

- Update API configuration in the Settings page:
  - Go to Settings → General
  - Set your Agentuity Base URL (e.g., https://dev-uzeflmvib.agentuity.run)
  - Set the Results API Agent ID (e.g., agent_b46de37831f94d01b06b2ccfd183efa0)
  - Test the connection

- Start the development server:

  ```
  npm run dev
  ```
You can test the Results API agent directly:
```
# List evaluations
curl -X POST https://dev-uzeflmvib.agentuity.run/agent_b46de37831f94d01b06b2ccfd183efa0 \
  -H "Content-Type: application/json" \
  -d '{"operation": "list_evaluations"}'

# Get evaluation details
curl -X POST https://dev-uzeflmvib.agentuity.run/agent_b46de37831f94d01b06b2ccfd183efa0 \
  -H "Content-Type: application/json" \
  -d '{"operation": "get_evaluation_details", "evaluation_id": "your_eval_id"}'
```

The Results API agent provides these operations (sent in the request body); a small Python client sketch follows the list:
{"operation": "list_evaluations"}- List all evaluations with metadata{"operation": "get_evaluation_details", "evaluation_id": "eval_001"}- Get detailed results for a specific evaluation{"operation": "get_evaluation_cases", "evaluation_id": "eval_001"}- Get individual case results for an evaluation
- Evaluation Execution: Your existing agents (dataset_loader, evaluation_runner, llm_as_judge) run evaluations and store results in the KV store
- Results API: The new Results API agent reads from KV store and serves data via operation-based requests
- Frontend: React app sends POST requests with operation parameters to the Results API agent
- Summary View: High-level metrics and charts
- Detailed View: Individual query-response pairs with similarity scores
- Case Details: Full comparison view with judge reasoning
- Real-time Status: Shows running, completed, and failed evaluations
- API Configuration: Configure Agentuity endpoints and authentication
- Connection Testing: Verify connectivity to your Agentuity instance
- Default Parameters: Set default model and evaluation settings
The system expects evaluation data in this format:

```json
{
  "evaluation_id": "eval_001",
  "status": "comparison_completed",
  "started_at": "2024-01-15T10:30:00Z",
  "completed_at": "2024-01-15T10:35:00Z",
  "dataset_name": "superhero_powers.json",
  "model_name": "claude-3-5-sonnet-latest",
  "judge_model": "claude-3-5-haiku-latest",
  "similarity_threshold": 80,
  "total_cases": 12,
  "comparison_summary": {
    "average_similarity": 87.2,
    "high_similarity_count": 10,
    "medium_similarity_count": 2,
    "low_similarity_count": 0,
    "high_similarity_rate": 0.83
  }
}
```

Individual case results follow this format:

```json
{
"evaluation_id": "eval_001",
"total_cases": 12,
"comparison_results": [
{
"case_id": "case_001",
"original_query": "What are Superman's main superpowers?",
"expected_response": "Superman's main superpowers include...",
"model_response": "Superman possesses incredible strength...",
"similarity_score": 92,
"similarity_category": "high",
"judge_reasoning": "Both responses accurately list...",
"success": true
}
]
}
```

```
cd react-frontend
npm run dev      # Start development server
npm run build    # Build for production
```

```
agentuity deploy # Deploy agent changes
agentuity logs   # View agent logs
```

- Connection Failed: Check that your Agentuity instance is running and the agent ID is correct
- No Evaluations Found: Ensure your evaluation agents are storing data in the expected KV store format
- CORS Issues: Make sure your Agentuity instance allows requests from your frontend domain
- Add authentication/authorization
- Implement real-time updates for running evaluations
- Add export functionality for results
- Create evaluation templates and presets
