A benchmark for evaluating advanced reasoning in language models and multi-agent systems.
- Install requirements:

  ```
  pip install -r requirements.txt
  ```

- Create a `.env` file (copy `.env.example` and fill in your values):

  ```
  # For generating model answers (baseline/workflow)
  API_KEY=xxx
  LLM=xxx
  BASE_URL=xxx

  # For answer evaluation (grading)
  EVAL_API_KEY=xxx
  EVAL_LLM=xxx
  EVAL_BASE_URL=xxx
  ```
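For reference, here is a minimal sketch of how these two configurations might be consumed at runtime, assuming `python-dotenv` and an OpenAI-compatible API (suggested by `BASE_URL`); the actual loading code in `arcbench/` may differ:

```python
# Sketch: reading the generation and evaluation configs from .env.
# Assumes python-dotenv and an OpenAI-compatible endpoint; names taken
# from .env.example above.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads .env from the working directory

# Client used to generate model answers (baseline/workflow runs)
gen_client = OpenAI(api_key=os.getenv("API_KEY"), base_url=os.getenv("BASE_URL"))
gen_model = os.getenv("LLM")

# Separate client used only for grading the generated answers
eval_client = OpenAI(api_key=os.getenv("EVAL_API_KEY"), base_url=os.getenv("EVAL_BASE_URL"))
eval_model = os.getenv("EVAL_LLM")
```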
Run all experiments with:

```
python -B arcbench/main.py
```

- Model answers are generated using the model defined by `LLM`, `API_KEY`, etc.
- Automatic grading is performed using the model specified by `EVAL_LLM`, `EVAL_API_KEY`, etc.
- Results are saved in `results/` and include the model name in the filename.
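Since result filenames include the model name, you can locate a run's output with a simple glob. The `*{model}*` pattern below is an assumption about the exact filename layout:

```python
# Sketch: list result files for the currently configured generation model.
# Only the presence of the model name in the filename is documented; the
# exact pattern is an assumption.
import glob
import os

model = os.getenv("LLM", "")
for path in sorted(glob.glob(os.path.join("results", f"*{model}*"))):
    print(path)
```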
To try a different LLM for generation, edit `LLM` in your `.env` and run again.
To use a different model for grading, change the `EVAL_LLM`, `EVAL_API_KEY`, and `EVAL_BASE_URL` variables.
- Make sure your input data exists (`data/ArcBench.jsonl`).
- Logs and errors are saved in the `results` and `logs` folders.
- By default, both baseline and workflow experiments are run. If you only want to run one, edit the bottom of `arcbench/main.py` (see the sketch after this list).
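For illustration, the bottom of `arcbench/main.py` might look roughly like the following; the function names here are hypothetical and may not match the real file. Commenting out one of the two calls runs only the other experiment:

```python
# Hypothetical sketch of the entry point in arcbench/main.py. The real
# function names and signatures may differ; this only illustrates the idea
# of commenting out one of the two runs.
def run_baseline() -> None:
    ...  # generate answers with the plain model

def run_workflow() -> None:
    ...  # generate answers with the multi-agent workflow

if __name__ == "__main__":
    run_baseline()
    # run_workflow()  # comment out whichever experiment you want to skip
```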