Run agent benchmarks in an environment accessed via MCP (Model Context Protocol) tools. The agent perceives the world through MCP tools, reasons about a task, and selects actions until the task succeeds or a step limit is reached.
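At a high level, each episode is a perceive-reason-act loop. The sketch below illustrates that loop only; the Policy callable, its return convention, and the observation handling are illustrative assumptions rather than this repository's actual interface (only ClientSession.call_tool comes from the MCP Python SDK).

```python
from typing import Any, Callable, Optional

from mcp import ClientSession

# Hypothetical "policy" type: given the goal and the latest tool result, it returns
# the next MCP tool call as (tool_name, arguments), or None once it judges the task
# complete. Illustrative only -- not this repository's actual interface.
Policy = Callable[[str, Any], Optional[tuple[str, dict[str, Any]]]]


async def run_episode(session: ClientSession, policy: Policy,
                      goal: str, max_steps: int = 100) -> bool:
    observation: Any = None
    for _ in range(max_steps):
        decision = policy(goal, observation)        # reason: choose the next action
        if decision is None:                        # the policy reports the goal is met
            return True
        tool_name, arguments = decision
        result = await session.call_tool(tool_name, arguments)  # act through an MCP tool
        observation = result.content                # perceive: feed the result back in
    return False                                    # step limit reached without success
```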
- Python 3.11+
- MCP server exposing the benchmark tools (see MCP server)
- OpenAI API key (for the LLM used by the agent)
git clone <repository-url>
cd tool-task-agent
python -m venv .venv
source .venv/bin/activate
pip install -e .
Or without editable install:
pip install -r requirements.txt
Create a .env file in the project root (never commit it):
OPENAI_API_KEY=sk-your-openai-api-key
Optional overrides:
MCP_URL=http://localhost:8000/mcp
BENCHMARK_MODEL=gpt-4o-mini
LOG_LEVEL=INFO
USE_IMAGE=true
GOAL_SUFFIX=
Start your MCP server before running the benchmark (e.g. on http://localhost:8000/mcp).
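If you just need something listening on that URL for a smoke test, the official MCP Python SDK serves tools over streamable HTTP on port 8000 at /mcp by default. The stub below is a stand-in with dummy tools (get_task here is a guessed name; get_image matches the perception tool mentioned in the options table), not the actual benchmark server.

```python
# pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

# Stand-in server for smoke testing only; the real benchmark server exposes its own tools.
mcp = FastMCP("benchmark-stub")


@mcp.tool()
def get_task() -> str:
    """Return a dummy task description."""
    return "Pick up the red block."


@mcp.tool()
def get_image() -> str:
    """Return a (here empty) base64-encoded observation image."""
    return ""


if __name__ == "__main__":
    # Streamable HTTP listens on port 8000 and serves the /mcp path by default.
    mcp.run(transport="streamable-http")
```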
The run_evaluation.py script runs several episodes and saves per-episode JSON files along with the LLM history.
python run_evaluation.py --num-episodes 3 --output-dir eval_results --max-steps 100 --no-image
Options:
| Option | Default | Description |
|---|---|---|
| --num-episodes | 1 | Number of episodes |
| --output-dir | eval_results | Directory for episode_N.json files |
| --max-steps | 100 | Max steps per episode |
| --no-image | — | Disable get_image in perception |
Example:
python run_evaluation.py --num-episodes 5 --output-dir eval_results --max-steps 100 --no-image
Output:
- eval_results/episode_N.json — goal, task_success, steps, llm_history
- logs/llm_history_epN_YYYYMMDD_HHMMSS.txt — human-readable LLM log per episode
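A quick way to aggregate a finished run, assuming only the task_success field listed above:

```python
import json
from pathlib import Path

# Aggregate a finished run from the per-episode JSON files described above.
episodes = [json.loads(p.read_text())
            for p in sorted(Path("eval_results").glob("episode_*.json"))]
if episodes:
    successes = sum(1 for e in episodes if e.get("task_success"))
    print(f"episodes:     {len(episodes)}")
    print(f"success rate: {successes / len(episodes):.1%}")
else:
    print("no episode_*.json files found in eval_results/")
```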
Quick check that the MCP server is reachable and returns a task:
python scripts/test_mcp.py
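For reference, a minimal connectivity check of this kind can be written with the MCP Python SDK's streamable HTTP client. The sketch below simply connects and lists the server's tools; it is an approximation of what scripts/test_mcp.py does, not a copy of it.

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

MCP_URL = "http://localhost:8000/mcp"  # or read from the MCP_URL environment variable


async def main() -> None:
    # Open a streamable HTTP connection, complete the MCP handshake, and list tools.
    async with streamablehttp_client(MCP_URL) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("server reachable; tools:", [t.name for t in tools.tools])


if __name__ == "__main__":
    asyncio.run(main())
```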