Problem Statement
I would like to be able to easily run popular public benchmarks, like SWEBench, from the evals sdk
Proposed Solution
No response
Use Case
from evals import swebench_evaluator
agent = Agent()
result = swebench_evaluator(agent)
print(result)
Alternatives Solutions
No response
Additional Context
No response