[FEATURE] Make running popular public benchmarks easy

### Problem Statement

I would like to be able to easily run popular public benchmarks, such as SWE-bench, directly from the evals SDK.

### Proposed Solution

_No response_

### Use Case

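```python
# Desired API (this is the feature being requested).
from evals import swebench_evaluator

# `Agent` is the agent class under evaluation; its import is omitted here.
agent = Agent()

# Run the benchmark against the agent and collect the results.
result = swebench_evaluator(agent)

print(result)
```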
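To make the request more concrete, here is a rough, self-contained sketch of the shape such a helper could take. Everything in it is hypothetical: `BenchmarkTask`, `BenchmarkResult`, `run_benchmark`, and the toy tasks are illustrative names, not part of the evals SDK or of SWE-bench, and a real SWE-bench run would need its dataset and test harness instead of the stand-in checks used here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class BenchmarkTask:
    """One benchmark item (hypothetical shape, not an SDK type)."""
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # True if the agent's output passes


@dataclass
class BenchmarkResult:
    """Aggregate outcome of a benchmark run (hypothetical shape)."""
    benchmark: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Avoid division by zero when no tasks were run.
        return self.passed / self.total if self.total else 0.0


def run_benchmark(
    name: str,
    agent: Callable[[str], str],
    tasks: Iterable[BenchmarkTask],
) -> BenchmarkResult:
    """Run every task through the agent and count how many checks pass."""
    passed = 0
    total = 0
    for task in tasks:
        total += 1
        if task.check(agent(task.prompt)):
            passed += 1
    return BenchmarkResult(benchmark=name, passed=passed, total=total)


# Toy stand-in tasks; a real benchmark would load these from its dataset.
toy_tasks = [
    BenchmarkTask("t1", "2+2", lambda out: out.strip() == "4"),
    BenchmarkTask("t2", "capital of France", lambda out: "Paris" in out),
]


def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent: only knows one answer.
    return "4" if prompt == "2+2" else "unknown"


result = run_benchmark("toy", toy_agent, toy_tasks)
print(result)
```

A benchmark-specific helper like the requested `swebench_evaluator` could then be a thin wrapper that loads the benchmark's tasks and calls something like `run_benchmark`, so that adding other public benchmarks mostly means adding a loader.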

### Alternative Solutions

_No response_

### Additional Context

_No response_