Overview
Implement the SWE-Bench adapter as the first external benchmark integration. This enables users to run their agent configuration against real GitHub issue resolution tasks and measure resolution rates.
Parent Issue: #9
Linear Issue: ANS-466
Decisions Made
| Decision | Choice | Rationale |
|---|---|---|
| Execution | Variant containers | True A/B testing isolation, aligns with sniffbench's core value prop |
| Evaluation | Full pipeline (local harness) | No point generating predictions without evaluation |
| Patch extraction | File diff | Most natural for coding agents, guaranteed valid format |
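The "file diff" patch-extraction choice implies packaging the agent's `git diff` output into the predictions format the SWE-bench harness consumes. A minimal sketch, assuming the standard predictions schema (`instance_id`, `model_name_or_path`, `model_patch`); the helper name is hypothetical:

```python
# Hypothetical helper: package an extracted `git diff` into a SWE-bench
# prediction record. Field names follow the harness's predictions schema.
def make_prediction(instance_id: str, model_name: str, patch: str) -> dict:
    return {
        "instance_id": instance_id,        # e.g. "astropy__astropy-12907"
        "model_name_or_path": model_name,  # identifies the variant under test
        "model_patch": patch,              # unified diff captured via `git diff`
    }
```

Because the patch is a literal `git diff`, it is guaranteed to apply cleanly against the same `base_commit` it was produced from.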
Implementation Plan
Phase 1: Foundation
- Create `src/benchmark/` module structure
- Implement dataset loading from HuggingFace (`princeton-nlp/SWE-bench_Lite`)
- Build instance runner that:
  - Clones repo at `base_commit`
  - Mounts into variant container
  - Provides `problem_statement` to agent
  - Extracts patch via `git diff`
- Add `sniff bench swe-bench` CLI command
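The runner steps above can be sketched as a per-instance command plan. Everything here is illustrative: the image tag, workdir layout, and container mount point are assumptions, not the adapter's actual interface:

```python
# Hypothetical sketch of the per-instance runner: given one SWE-bench Lite
# record, plan the shell steps the adapter would execute. Command shapes
# (git clone/checkout, docker run with a bind mount) are illustrative.
def plan_instance_run(instance: dict,
                      image: str = "sniffbench/variant:latest") -> list:
    repo_url = f"https://github.com/{instance['repo']}.git"
    workdir = f"/tmp/swebench/{instance['instance_id']}"
    return [
        ["git", "clone", repo_url, workdir],
        ["git", "-C", workdir, "checkout", instance["base_commit"]],
        # Mount the checkout into the variant container; the agent receives
        # the problem_statement and edits files under /workspace.
        ["docker", "run", "--rm", "-v", f"{workdir}:/workspace", image],
        # Extract whatever the agent changed as a unified diff.
        ["git", "-C", workdir, "diff"],
    ]
```

Keeping the plan as plain argument lists makes it easy to dry-run, log, or hand to `subprocess.run` one step at a time.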
Phase 2: Evaluation
- Integrate SWE-Bench harness (`python -m swebench.harness.run_evaluation`)
- Parse results and report resolution rates
- Store benchmark results in runs tracking
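Integration could shell out to the harness module named above. The module path comes from this issue; the flag names (`--dataset_name`, `--predictions_path`, `--max_workers`, `--run_id`) match the `swebench` package but should be verified against the installed version:

```python
import sys

# Sketch: build the subprocess argv for the SWE-bench evaluation harness.
# Flag names are assumptions to check against the installed swebench release.
def harness_command(predictions_path: str, run_id: str, workers: int = 8) -> list:
    return [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Lite",
        "--predictions_path", predictions_path,
        "--max_workers", str(workers),
        "--run_id", run_id,
    ]
```

Using `sys.executable` keeps the harness in the same Python environment as the adapter, which matters for the "Python 3.9+ with swebench package" requirement below.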
Phase 3: Polish
- Incremental runs (resume from where we left off)
- Parallel instance execution
- Caching (repos, Docker images)
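The "resume from where we left off" item reduces to filtering the dataset against already-recorded results. A minimal sketch, assuming completed instance IDs can be read from the runs store:

```python
# Sketch of incremental runs: skip instances that already have a stored
# result. `done_ids` would come from the runs-tracking store.
def pending_instances(all_ids: list, done_ids: set) -> list:
    # Preserve dataset order so reruns are deterministic.
    return [iid for iid in all_ids if iid not in done_ids]
```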
CLI Commands
sniff bench swe-bench --variant lite
sniff bench swe-bench --variant lite --limit 10
sniff bench swe-bench --use-variant control
Requirements
- Docker with 16GB RAM, 8 cores
- ~100GB disk for env-level cache
- Python 3.9+ with swebench package
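A preflight check could fail fast when the ~100GB cache requirement cannot be met. A sketch using the standard library; the function name and default path are assumptions:

```python
import shutil

# Hypothetical preflight check against the ~100GB disk requirement for the
# env-level cache. Returns True when the path has at least `needed_gb` free.
def enough_disk(path: str = "/", needed_gb: int = 100) -> bool:
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb
```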