Run agent benchmarks in an environment accessed via MCP (Model Context Protocol) tools. The agent perceives the world through MCP tools, reasons about a task, and selects actions until the task succeeds or a step limit is reached.
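At a high level, each episode is a perceive-reason-act loop. The sketch below illustrates that loop only; the Policy callable, its return convention, and the observation handling are illustrative assumptions rather than this repository's actual interface (only ClientSession.call_tool comes from the MCP Python SDK).

```python
from typing import Any, Callable, Optional

from mcp import ClientSession

# Hypothetical "policy" type: given the goal and the latest tool result, it returns
# the next MCP tool call as (tool_name, arguments), or None once it judges the task
# complete. Illustrative only -- not this repository's actual interface.
Policy = Callable[[str, Any], Optional[tuple[str, dict[str, Any]]]]


async def run_episode(session: ClientSession, policy: Policy,
                      goal: str, max_steps: int = 100) -> bool:
    observation: Any = None
    for _ in range(max_steps):
        decision = policy(goal, observation)        # reason: choose the next action
        if decision is None:                        # the policy reports the goal is met
            return True
        tool_name, arguments = decision
        result = await session.call_tool(tool_name, arguments)  # act through an MCP tool
        observation = result.content                # perceive: feed the result back in
    return False                                    # step limit reached without success
```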
- Python 3.11+
- MCP server exposing the benchmark tools (see MCP server)
- OpenAI API key (for the LLM used by the agent)
git clone <repository-url>
cd tool-task-agent
python -m venv .venv
source .venv/bin/activate
pip install -e .
Or without editable install:
pip install -r requirements.txt
Create a .env file in the project root (never commit it):
OPENAI_API_KEY=sk-your-openai-api-key
Optional overrides:
MCP_URL=http://localhost:8000/mcp
BENCHMARK_MODEL=gpt-4o-mini
LOG_LEVEL=INFO
USE_IMAGE=true
GOAL_SUFFIX=
Start your MCP server before running the benchmark (e.g. on http://localhost:8000/mcp).
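If you just need something listening on that URL for a smoke test, the official MCP Python SDK serves tools over streamable HTTP on port 8000 at /mcp by default. The stub below is a stand-in with dummy tools (get_task here is a guessed name; get_image matches the perception tool mentioned in the options table), not the actual benchmark server.

```python
# pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

# Stand-in server for smoke testing only; the real benchmark server exposes its own tools.
mcp = FastMCP("benchmark-stub")


@mcp.tool()
def get_task() -> str:
    """Return a dummy task description."""
    return "Pick up the red block."


@mcp.tool()
def get_image() -> str:
    """Return a (here empty) base64-encoded observation image."""
    return ""


if __name__ == "__main__":
    # Streamable HTTP listens on port 8000 and serves the /mcp path by default.
    mcp.run(transport="streamable-http")
```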
The run_evaluation.py script runs several episodes and saves per-episode JSON files along with the LLM history.
python run_evaluation.py --num-episodes 3 --output-dir eval_results --max-steps 100 --no-image
Options:
| Option | Default | Description |
|---|---|---|
| --num-episodes | 1 | Number of episodes |
| --output-dir | eval_results | Directory for episode_N.json files |
| --max-steps | 100 | Max steps per episode |
| --no-image | — | Disable get_image in perception |
Example:
python run_evaluation.py --num-episodes 5 --output-dir eval_results --max-steps 100 --no-image
Output:
- eval_results/episode_N.json — goal, task_success, steps, llm_history
- logs/llm_history_epN_YYYYMMDD_HHMMSS.txt — human-readable LLM log per episode
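A quick way to aggregate a finished run, assuming only the task_success field listed above:

```python
import json
from pathlib import Path

# Aggregate a finished run from the per-episode JSON files described above.
episodes = [json.loads(p.read_text())
            for p in sorted(Path("eval_results").glob("episode_*.json"))]
if episodes:
    successes = sum(1 for e in episodes if e.get("task_success"))
    print(f"episodes:     {len(episodes)}")
    print(f"success rate: {successes / len(episodes):.1%}")
else:
    print("no episode_*.json files found in eval_results/")
```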
Quick check that the MCP server is reachable and returns a task:
python scripts/test_mcp.py
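For reference, a minimal connectivity check of this kind can be written with the MCP Python SDK's streamable HTTP client. The sketch below simply connects and lists the server's tools; it is an approximation of what scripts/test_mcp.py does, not a copy of it.

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

MCP_URL = "http://localhost:8000/mcp"  # or read from the MCP_URL environment variable


async def main() -> None:
    # Open a streamable HTTP connection, complete the MCP handshake, and list tools.
    async with streamablehttp_client(MCP_URL) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("server reachable; tools:", [t.name for t in tools.tools])


if __name__ == "__main__":
    asyncio.run(main())
```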