Merged
19 commits
acc9689
Add 5 new bootstrap test cases
jharris1679 Feb 13, 2026
3dc4452
Add opencode agent wrapper using @opencode-ai/sdk
jharris1679 Feb 13, 2026
03df9a8
fix: add opencode config inline, fix score display, add agent invocation
jharris1679 Feb 13, 2026
2e481c4
fix: spawn opencode server with correct cwd to fix empty responses
jharris1679 Feb 14, 2026
1f555b2
fix: default to failure for unverified evaluators, add HANDOFF.md
jharris1679 Feb 14, 2026
16d48e2
fix: update pnpm lockfile and fix test runner discovery
jharris1679 Feb 14, 2026
581a80c
fix: correct score normalization when optional-only criteria are excl…
jharris1679 Feb 14, 2026
57749cd
feat: add --model CLI flag and save agent response in results
jharris1679 Feb 14, 2026
96658db
docs: update HANDOFF.md with full session notes
jharris1679 Feb 14, 2026
0a736ef
feat: add event streaming and sandbox file snapshots
jharris1679 Feb 14, 2026
c63e5c4
docs: clean up HANDOFF.md for next agent
jharris1679 Feb 14, 2026
54a4b5c
fix: resolve SSE stream from correct property on subscribe result
jharris1679 Feb 14, 2026
e98ae2c
refactor: centralize default agent config
jharris1679 Feb 14, 2026
89f5f15
feat: add --agent and --model flags to closed-issues run
jharris1679 Feb 14, 2026
441ef67
fix: add node *.test.js to rubric test chain, update HANDOFF.md
jharris1679 Feb 15, 2026
a358598
fix: merge built-in and project-specific case directories
jharris1679 Feb 15, 2026
f18fe3f
fix: address PR review feedback from CodeRabbit and human review
jharris1679 Feb 15, 2026
a6943c2
fix: regenerate lockfile after removing randombytes
jharris1679 Feb 15, 2026
ccfce82
fix: remove unused runAgentLocally function (lint error)
jharris1679 Feb 15, 2026
72 changes: 72 additions & 0 deletions HANDOFF.md
@@ -0,0 +1,72 @@
# Handoff Notes

## Current state

The `add-glm-agent` branch (PR #48) adds an opencode agent integration to sniffbench. It works end-to-end: spawns an opencode server, sends prompts via the SDK, streams events via SSE, captures tool calls and file snapshots, evaluates with the rubric, and saves results.

**To test:**
```bash
pnpm run build
npx sniff run --agent opencode --cases bootstrap-005
npx sniff closed-issues run --agent opencode --local
```

## Test results (GLM-4.7-4bit via opencode)

### Bootstrap cases

| Case | Score | Verdict | Notes |
|------|-------|---------|-------|
| bootstrap-003 (Python Unit Tests) | 100% | **False pass** | Starter code already passes all tests — no-op |
| bootstrap-004 (Palindrome Checker) | 0% | Fail | Agent edited file but tests failed. JS test runner (`node *.test.js`) was missing from rubric command chain — now fixed |
| bootstrap-005 (Binary Search) | 100% | **Legit pass** | Agent implemented full binary search from `pass` stub |
| bootstrap-006 (Refactor Bad Code) | 0% | Fail | Agent refactored and renamed function `c`, breaking `require('./shipping_calculator').c` in test file. Also hit missing JS test runner issue |
| bootstrap-007 (CSV Parser) | 100% | **Legit pass** | Agent implemented CSV parser from empty `return []` stub |
| simple-001 (Simple math) | 0% | Fail | Code already passes, so this should be a free win. Failed because the JS test runner was missing from the rubric |
| fail-001 (Intentionally failing) | 0% | Expected fail | Case exists to verify failure reporting works |

**Key fix:** Added `node *.test.js` to the test command chain in `defaults.ts`. All JS cases were failing because the rubric only tried `npm test` (no package.json) then fell through to Python/Go runners.
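
The fall-through can be illustrated with a minimal sketch. It is written in Python for brevity and is not the contents of `defaults.ts`: the runner tables, the `pick_runner` helper, and the Python/Go commands and patterns are all hypothetical. Only `npm test` and `node *.test.js` come from the notes above.

```python
from fnmatch import fnmatch

# Hypothetical runner table: (command, file pattern that must be present).
# Only `npm test` and `node *.test.js` are taken from the notes; the
# Python and Go entries are placeholders for illustration.
RUNNERS_BEFORE = [
    ("npm test", "package.json"),
    ("python -m pytest", "test_*.py"),
    ("go test ./...", "*_test.go"),
]

# After the fix: a plain Node entry sits right after `npm test`.
RUNNERS_AFTER = [
    RUNNERS_BEFORE[0],
    ("node *.test.js", "*.test.js"),
    *RUNNERS_BEFORE[1:],
]


def pick_runner(files, runners):
    """Return the first command whose required pattern matches a file, else None."""
    for command, pattern in runners:
        if any(fnmatch(name, pattern) for name in files):
            return command
    return None


# A JS case with no package.json, like bootstrap-004.
js_case = ["palindrome.js", "palindrome.test.js"]
print(pick_runner(js_case, RUNNERS_BEFORE))  # nothing applies
print(pick_runner(js_case, RUNNERS_AFTER))   # the Node entry applies
```

Under these assumptions a JS case without `package.json` matches no runner until the Node entry is added, which is the failure mode the fix addresses.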

### Closed-issues cases (real GitHub issues)

| Case | Score | Notes |
|------|-------|-------|
| #12 (Add --compare flag) | 10/100 | 0 files changed. Agent explored but produced no edits. Hard task: 363 additions across 2 files |
| #38 (Split variant/variants) | 10/100 | Created new `variants.ts` instead of modifying existing files. Understood intent but wrong approach |

These are significantly harder than the bootstrap cases: they require understanding a real codebase and making coordinated multi-file changes, which is likely beyond a 4-bit quantized local model's capability.

## What's next

### 1. Bootstrap test cases need work
- **bootstrap-003 (python-unit-test):** No-op — starter code already passes all tests. Make it actually buggy (e.g., `text.split(' ')` instead of `text.split()`).
- **bootstrap-006 (refactor):** Test file imports `c` by name. If the agent renames it (the whole point of refactoring), tests break. Either update tests to import by new name, or make the test more flexible.
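
To show why the `split(' ')` variant suggested for bootstrap-003 would make a usable bug, here is a small sketch. The `count_words` functions are hypothetical and stand in for whatever the case's starter code does with `text`.

```python
def count_words_buggy(text):
    # split(' ') splits on every single space, so runs of spaces and
    # leading/trailing spaces produce empty strings that get counted.
    return len(text.split(' '))


def count_words(text):
    # split() with no argument splits on runs of whitespace and
    # drops empty strings.
    return len(text.split())


sample = "hello   world "
print(sample.split(' '))  # ['hello', '', '', 'world', '']
print(sample.split())     # ['hello', 'world']
```

Any test with repeated or trailing whitespace would fail against the buggy version and pass once the agent switches to `split()`.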

### 2. Re-run JS cases
After the `node *.test.js` fix, bootstrap-004 and simple-001 should be re-run to get accurate scores.

### 3. CodeRabbit review items
- Remove unused `randombytes` dependency from package.json
- Remove redundant `allowSyntheticDefaultImports` from tsconfig.json
- Remove redundant `"dist/**/*"` from tsconfig exclude

### 4. Hardcoded model config
The local-glm provider config (baseURL, model path, API key) is hardcoded as the constructor default in `src/agents/opencode.ts`. It should be externalized and read from the opencode config file or a sniffbench config file.

### 5. Comprehension cases
The 12 comp-* cases reference a `comprehension` rubric that doesn't exist yet. These are Q&A tasks that need LLM-judge or human evaluation, not test suites.

## Key technical details

- **ESM wrapper:** The `@opencode-ai/sdk` is ESM-only but the project is CommonJS. The `.mjs` wrapper in `src/agents/opencode-sdk.mjs` bridges this. tsc doesn't copy `.mjs` files, so the build script includes a manual `cp` step.
- **Port management:** `nextPort` counter in opencode.ts increments per run to avoid collisions. Resets on process restart.
- **Server lifecycle:** Each agent run spawns a real opencode server process on a unique port, with the case's temp dir as cwd. The server is killed in the `finally` block.
- **Event streaming:** Uses `client.event.subscribe()` (SSE) + `session.promptAsync()`. The stream object is at `sseResult.stream` (not `.data`). Events arrive as `message.part.updated` (tool calls, text, reasoning), `message.updated` (final tokens/cost), `session.status` (idle = done).
- **File snapshots:** After agent runs, `snapshotFiles()` in runner.ts walks the host tempDir and captures all files with a `changed` flag vs originals. Skips node_modules, .git, __pycache__, files >100KB.
- **Agent config:** `DEFAULT_AGENT` constant in `src/agents/registry.ts` is used across all CLI commands. No more hardcoded `'claude-code'` strings.
- **Closed-issues runner:** Now routes through agent wrappers (supports `--agent` and `--model` flags) instead of shelling out directly to `claude` CLI.
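
The file-snapshot behavior described above can be sketched as follows. This is a Python illustration, not the `snapshotFiles()` implementation in runner.ts: the return shape, the `originals` mapping, and reading files as UTF-8 text are assumptions.

```python
import os

SKIP_DIRS = {"node_modules", ".git", "__pycache__"}
MAX_BYTES = 100 * 1024  # the notes say files >100KB are skipped


def snapshot_files(root, originals):
    """Walk `root` and return {relative_path: {"content", "changed"}}.

    `originals` maps relative paths to the starter content
    (a hypothetical shape chosen for this sketch).
    """
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.getsize(full) > MAX_BYTES:
                continue
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, encoding="utf-8", errors="replace") as fh:
                content = fh.read()
            snapshot[rel] = {
                "content": content,
                # New files and edited files both count as changed.
                "changed": originals.get(rel) != content,
            }
    return snapshot
```

Comparing against the starter content means a file the agent created counts as changed as well as one it edited.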

## Git conventions
- Do NOT include `Co-Authored-By` or Anthropic email in commits
- Include model version (e.g., `claude-opus-4-6`) in commit body if desired
- Never amend commits — always create new ones
91 changes: 91 additions & 0 deletions cases/bootstrap/binary-search.yaml
@@ -0,0 +1,91 @@
id: bootstrap-005
title: "Binary Search Implementation"
prompt: |
  Complete the binary_search function implementation. The function
  should find the index of a target value in a sorted array, or
  return -1 if not found.

  Binary search must:
  - Run in O(log n) time complexity
  - Handle empty arrays
  - Handle values not present in the array
  - Work with any comparable values

  Run: python binary_search.test.py
  Make all tests pass.

source: bootstrap
category: codefix
language: python
difficulty: medium

tags:
  - python
  - algorithms
  - binary-search

files:
  - path: binary_search.py
    content: |
      def binary_search(arr, target):
          """
          Perform binary search on a sorted array.

          Args:
              arr: Sorted list of comparable elements
              target: Value to search for

          Returns:
              Index of target if found, -1 otherwise

          Time complexity: O(log n)
          """
          # TODO: Implement binary search
          pass

  - path: binary_search.test.py
    content: |
      import unittest
      from binary_search import binary_search

      class TestBinarySearch(unittest.TestCase):

          def test_found_elements(self):
              arr = [1, 3, 5, 7, 9, 11, 13, 15]
              self.assertEqual(binary_search(arr, 7), 3)
              self.assertEqual(binary_search(arr, 1), 0)
              self.assertEqual(binary_search(arr, 15), 7)
              self.assertEqual(binary_search(arr, 9), 4)

          def test_not_found(self):
              arr = [1, 3, 5, 7, 9]
              self.assertEqual(binary_search(arr, 2), -1)
              self.assertEqual(binary_search(arr, 6), -1)
              self.assertEqual(binary_search(arr, 10), -1)

          def test_empty_array(self):
              self.assertEqual(binary_search([], 5), -1)

          def test_single_element(self):
              arr = [42]
              self.assertEqual(binary_search(arr, 42), 0)
              self.assertEqual(binary_search(arr, 0), -1)

          def test_two_elements(self):
              arr = [1, 2]
              self.assertEqual(binary_search(arr, 1), 0)
              self.assertEqual(binary_search(arr, 2), 1)

          def test_strings(self):
              arr = ['apple', 'banana', 'cherry', 'date']
              self.assertEqual(binary_search(arr, 'cherry'), 2)
              self.assertEqual(binary_search(arr, 'grape'), -1)

          def test_large_array(self):
              arr = list(range(1000))
              self.assertEqual(binary_search(arr, 42), 42)
              self.assertEqual(binary_search(arr, 999), 999)
              self.assertEqual(binary_search(arr, 1000), -1)

      if __name__ == '__main__':
          unittest.main()
125 changes: 125 additions & 0 deletions cases/bootstrap/csv-parser.yaml
@@ -0,0 +1,125 @@
id: bootstrap-007
title: "File Processing - CSV Parser"
prompt: |
  Implement a CSV parser that can read and parse a CSV file.
  The implementation should handle:
  - Basic comma-separated values
  - Quoted fields containing commas
  - Header row extraction
  - Converting to array of objects

  Run: python csv_parser.test.py
  Make all tests pass.

source: bootstrap
category: codefix
language: python
difficulty: medium

tags:
  - python
  - file-processing
  - csv

files:
  - path: csv_parser.py
    content: |
      import csv

      def parse_csv(filepath, has_header=True):
          """
          Parse a CSV file and return data as list of dicts (or lists).

          Args:
              filepath: Path to the CSV file
              has_header: Whether the first row is a header row

          Returns:
              List of dictionaries (if has_header=True) or list of lists
          """
          # TODO: Implement this function
          return []

  - path: csv_parser.test.py
    content: |
      import unittest
      import os
      import tempfile
      import textwrap
      from csv_parser import parse_csv

      class TestCSVParser(unittest.TestCase):

          def test_simple_csv_with_header(self):
              data = textwrap.dedent('''\
                  name,age,city
                  Alice,30,New York
                  Bob,25,Los Angeles
                  Charlie,35,Chicago''')

              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
                  f.write(data)
                  f.flush()

                  result = parse_csv(f.name, has_header=True)

                  self.assertEqual(len(result), 3)
                  self.assertEqual(result[0]['name'], 'Alice')
                  self.assertEqual(result[0]['age'], '30')
                  self.assertEqual(result[1]['city'], 'Los Angeles')

Comment on lines 53 to 70

⚠️ Potential issue | 🟠 Major

CSV test data will contain unintended leading whitespace, causing assertions to fail.

The triple-quoted strings in the test methods embed continuation lines with leading spaces from the Python indentation. After YAML block-scalar processing, the data string will look like:

name,age,city
        Alice,30,New York
        Bob,25,Los Angeles
        Charlie,35,Chicago

When parsed, result[0]['name'] will be '        Alice', not 'Alice', so Line 65's assertion will fail. The same issue affects all four test methods.

Fix by using textwrap.dedent or by left-aligning the CSV content within the triple-quoted string (breaking Python indentation aesthetics but keeping data correct).

Proposed fix using textwrap.dedent
+      import textwrap
+
       ...
 
           def test_simple_csv_with_header(self):
-              data = '''name,age,city
-              Alice,30,New York
-              Bob,25,Los Angeles
-              Charlie,35,Chicago'''
+              data = textwrap.dedent('''\
+                  name,age,city
+                  Alice,30,New York
+                  Bob,25,Los Angeles
+                  Charlie,35,Chicago''')

Apply the same pattern to all test methods.

🤖 Prompt for AI Agents
In `@cases/bootstrap/csv-parser.yaml` around lines 52 - 68, The test data string
in test_simple_csv_with_header (and the other CSV test methods) contains
unintended leading indentation, causing parsed field values to include spaces;
wrap the multiline CSV literal in textwrap.dedent (import textwrap at top) or
left-align the triple-quoted CSV so the variable data has no leading spaces
before writing it in the tempfile.NamedTemporaryFile block, ensuring
parse_csv(...) returns 'Alice' not '        Alice' and the assertions pass.
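
The effect described in this comment can be reproduced in isolation. The sketch below uses the standard `csv` and `textwrap` modules; the `first_name` helper and the two-column data are made up for illustration and are not part of the case file.

```python
import csv
import io
import textwrap

# Continuation lines inside a triple-quoted string keep their source indentation.
indented = '''name,age
        Alice,30'''

# The backslash after ''' suppresses the leading newline, and dedent
# strips the whitespace common to every line.
dedented = textwrap.dedent('''\
        name,age
        Alice,30''')


def first_name(data):
    # Parse the CSV text and return the 'name' field of the first data row.
    rows = list(csv.DictReader(io.StringIO(data)))
    return rows[0]['name']


print(repr(first_name(indented)))  # '        Alice'
print(repr(first_name(dedented)))  # 'Alice'
```

`csv` keeps leading whitespace in a field by default, so the indented literal yields a padded value while the dedented one does not.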

                  os.unlink(f.name)

          def test_csv_without_header(self):
              data = textwrap.dedent('''\
                  Alice,30,New York
                  Bob,25,Los Angeles''')

              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
                  f.write(data)
                  f.flush()

                  result = parse_csv(f.name, has_header=False)

                  self.assertEqual(len(result), 2)
                  self.assertEqual(result[0][0], 'Alice')
                  self.assertEqual(result[1][2], 'Los Angeles')

                  os.unlink(f.name)

          def test_quoted_fields(self):
              data = textwrap.dedent('''\
                  product,price,description
                  Widget,10.00,"A widget, really."
                  Gadget,15.00,"A device, good."''')

              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
                  f.write(data)
                  f.flush()

                  result = parse_csv(f.name, has_header=True)

                  self.assertEqual(len(result), 2)
                  self.assertEqual(result[0]['description'], 'A widget, really.')
                  self.assertEqual(result[1]['description'], 'A device, good.')

                  os.unlink(f.name)

          def test_single_row(self):
              data = textwrap.dedent('''\
                  name,value
                  test,123''')

              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
                  f.write(data)
                  f.flush()

                  result = parse_csv(f.name, has_header=True)

                  self.assertEqual(len(result), 1)
                  self.assertEqual(result[0]['name'], 'test')

                  os.unlink(f.name)

      if __name__ == '__main__':
          unittest.main()
97 changes: 97 additions & 0 deletions cases/bootstrap/palindrome-checker.yaml
@@ -0,0 +1,97 @@
id: bootstrap-004
title: "Palindrome Checker"
prompt: |
  Implement a palindrome checker that works correctly across
  different edge cases. The tests are already written - you need
  to make them all pass.

  A palindrome reads the same forwards and backwards.
  You should:
  - Ignore case
  - Ignore non-alphanumeric characters
  - Handle empty strings as valid palindromes

  Run: node palindrome.test.js
  Fix the implementation until all tests pass.

source: bootstrap
category: codefix
language: javascript
difficulty: easy

tags:
  - javascript
  - algorithms
  - string-manipulation

files:
  - path: palindrome.js
    content: |
      function isPalindrome(str) {
        // TODO: Implement properly
        return str === str.split('').reverse().join('');
      }

      module.exports = { isPalindrome };

  - path: palindrome.test.js
    content: |
      const { isPalindrome } = require('./palindrome');

      function test(name, fn) {
        try {
          fn();
          console.log(`✓ ${name}`);
        } catch (err) {
          console.log(`✗ ${name}: ${err.message}`);
          process.exit(1);
        }
      }

      function assertEqual(actual, expected, message) {
        if (actual !== expected) {
          throw new Error(message || `Expected ${expected}, got ${actual}`);
        }
      }

      // Basic palindromes
      test('racecar is palindrome', () => {
        assertEqual(isPalindrome('racecar'), true);
      });

      test('hello is not palindrome', () => {
        assertEqual(isPalindrome('hello'), false);
      });

      // Case insensitive
      test('RaceCar is palindrome', () => {
        assertEqual(isPalindrome('RaceCar'), true);
      });

      test('A man a plan a canal Panama', () => {
        assertEqual(isPalindrome('A man a plan a canal Panama'), true);
      });

      // With spaces and punctuation
      test('Was it a car or a cat I saw', () => {
        assertEqual(isPalindrome('Was it a car or a cat I saw'), true);
      });

      // Edge cases
      test('empty string', () => {
        assertEqual(isPalindrome(''), true);
      });

      test('single character', () => {
        assertEqual(isPalindrome('a'), true);
      });

      test('numeric', () => {
        assertEqual(isPalindrome('12321'), true);
      });

      test('numeric with letters', () => {
        assertEqual(isPalindrome('1a2 3 2a1'), true);
      });

      console.log('All tests passed!');