Overview
Implement the SWE-Bench adapter as the first external benchmark integration. This enables users to run their agent configuration against real GitHub issue resolution tasks and measure resolution rates.
Parent Issue: #9
Linear Issue: ANS-466
Decisions Made
| Decision | Choice | Rationale |
|---|---|---|
| Execution | Variant containers | True A/B testing isolation, aligns with sniffbench's core value prop |
| Evaluation | Full pipeline (local harness) | No point generating predictions without evaluation |
| Patch extraction | File diff | Most natural for coding agents, guaranteed valid format |
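The "file diff" patch-extraction choice implies packaging the agent's `git diff` output into the predictions format the SWE-bench harness consumes. A minimal sketch, assuming the standard predictions schema (`instance_id`, `model_name_or_path`, `model_patch`); the helper name is hypothetical:

```python
# Hypothetical helper: package an extracted `git diff` into a SWE-bench
# prediction record. Field names follow the harness's predictions schema.
def make_prediction(instance_id: str, model_name: str, patch: str) -> dict:
    return {
        "instance_id": instance_id,        # e.g. "astropy__astropy-12907"
        "model_name_or_path": model_name,  # identifies the variant under test
        "model_patch": patch,              # unified diff captured via `git diff`
    }
```

Because the patch is a literal `git diff`, it is guaranteed to apply cleanly against the same `base_commit` it was produced from.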
Implementation Plan
Phase 1: Foundation
- Create `src/benchmark/` module structure
- Implement dataset loading from HuggingFace (`princeton-nlp/SWE-bench_Lite`)
- Build instance runner that:
  - Clones repo at `base_commit`
  - Mounts into variant container
  - Provides `problem_statement` to agent
  - Extracts patch via `git diff`
- Add `sniff bench swe-bench` CLI command
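The runner steps above can be sketched as a per-instance command plan. Everything here is illustrative: the image tag, workdir layout, and container mount point are assumptions, not the adapter's actual interface:

```python
# Hypothetical sketch of the per-instance runner: given one SWE-bench Lite
# record, plan the shell steps the adapter would execute. Command shapes
# (git clone/checkout, docker run with a bind mount) are illustrative.
def plan_instance_run(instance: dict,
                      image: str = "sniffbench/variant:latest") -> list:
    repo_url = f"https://github.com/{instance['repo']}.git"
    workdir = f"/tmp/swebench/{instance['instance_id']}"
    return [
        ["git", "clone", repo_url, workdir],
        ["git", "-C", workdir, "checkout", instance["base_commit"]],
        # Mount the checkout into the variant container; the agent receives
        # the problem_statement and edits files under /workspace.
        ["docker", "run", "--rm", "-v", f"{workdir}:/workspace", image],
        # Extract whatever the agent changed as a unified diff.
        ["git", "-C", workdir, "diff"],
    ]
```

Keeping the plan as plain argument lists makes it easy to dry-run, log, or hand to `subprocess.run` one step at a time.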
Phase 2: Evaluation
- Integrate SWE-Bench harness (`python -m swebench.harness.run_evaluation`)
- Parse results and report resolution rates
- Store benchmark results in runs tracking
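Integration could shell out to the harness module named above. The module path comes from this issue; the flag names (`--dataset_name`, `--predictions_path`, `--max_workers`, `--run_id`) match the `swebench` package but should be verified against the installed version:

```python
import sys

# Sketch: build the subprocess argv for the SWE-bench evaluation harness.
# Flag names are assumptions to check against the installed swebench release.
def harness_command(predictions_path: str, run_id: str, workers: int = 8) -> list:
    return [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Lite",
        "--predictions_path", predictions_path,
        "--max_workers", str(workers),
        "--run_id", run_id,
    ]
```

Using `sys.executable` keeps the harness in the same Python environment as the adapter, which matters for the "Python 3.9+ with swebench package" requirement below.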
Phase 3: Polish
- Incremental runs (resume from where we left off)
- Parallel instance execution
- Caching (repos, Docker images)
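The "resume from where we left off" item reduces to filtering the dataset against already-recorded results. A minimal sketch, assuming completed instance IDs can be read from the runs store:

```python
# Sketch of incremental runs: skip instances that already have a stored
# result. `done_ids` would come from the runs-tracking store.
def pending_instances(all_ids: list, done_ids: set) -> list:
    # Preserve dataset order so reruns are deterministic.
    return [iid for iid in all_ids if iid not in done_ids]
```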
CLI Commands
sniff bench swe-bench --variant lite
sniff bench swe-bench --variant lite --limit 10
sniff bench swe-bench --use-variant control
Requirements
- Docker with 16GB RAM, 8 cores
- ~100GB disk for env-level cache
- Python 3.9+ with swebench package
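A preflight check could fail fast when the ~100GB cache requirement cannot be met. A sketch using the standard library; the function name and default path are assumptions:

```python
import shutil

# Hypothetical preflight check against the ~100GB disk requirement for the
# env-level cache. Returns True when the path has at least `needed_gb` free.
def enough_disk(path: str = "/", needed_gb: int = 100) -> bool:
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb
```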