AnswerLayer · jharris1679 · Feb 15, 2026 · Feb 13, 2026
diff --git a/HANDOFF.md b/HANDOFF.md
@@ -0,0 +1,72 @@
+# Handoff Notes
+
+## Current state
+
+The `add-glm-agent` branch (PR #48) adds an opencode agent integration to sniffbench. It works end-to-end: spawns an opencode server, sends prompts via the SDK, streams events via SSE, captures tool calls and file snapshots, evaluates with the rubric, and saves results.
+
+**To test:**
+```bash
+pnpm run build
+npx sniff run --agent opencode --cases bootstrap-005
+npx sniff closed-issues run --agent opencode --local
+```
+
+## Test results (GLM-4.7-4bit via opencode)
+
+### Bootstrap cases
+
+| Case | Score | Verdict | Notes |
+|------|-------|---------|-------|
+| bootstrap-003 (Python Unit Tests) | 100% | **False pass** | Starter code already passes all tests — no-op |
+| bootstrap-004 (Palindrome Checker) | 0% | Fail | Agent edited file but tests failed. JS test runner (`node *.test.js`) was missing from rubric command chain — now fixed |
+| bootstrap-005 (Binary Search) | 100% | **Legit pass** | Agent implemented full binary search from `pass` stub |
+| bootstrap-006 (Refactor Bad Code) | 0% | Fail | Agent refactored and renamed function `c`, breaking `require('./shipping_calculator').c` in test file. Also hit missing JS test runner issue |
+| bootstrap-007 (CSV Parser) | 100% | **Legit pass** | Agent implemented CSV parser from empty `return []` stub |
+| simple-001 (Simple math) | 0% | Fail | Code already passes, so this should be a free win. Failed because the JS test runner was missing from the rubric |
+| fail-001 (Intentionally failing) | 0% | Expected fail | Case exists to verify failure reporting works |
+
+**Key fix:** Added `node *.test.js` to the test command chain in `defaults.ts`. All JS cases were failing because the rubric only tried `npm test`, which fails when there is no package.json, and then fell through to the Python/Go runners.
+
+### Closed-issues cases (real GitHub issues)
+
+| Case | Score | Notes |
+|------|-------|-------|
+| #12 (Add --compare flag) | 10/100 | 0 files changed. Agent explored but produced no edits. Hard task: 363 additions across 2 files |
+| #38 (Split variant/variants) | 10/100 | Created new `variants.ts` instead of modifying existing files. Understood intent but wrong approach |
+
+These are significantly harder than the bootstrap cases: they require understanding a real codebase and making coordinated multi-file changes, which is likely beyond the capability of a 4-bit quantized local model.
+
+## What's next
+
+### 1. Bootstrap test cases need work
+- **bootstrap-003 (python-unit-test):** No-op — starter code already passes all tests. Make it actually buggy (e.g., `text.split(' ')` instead of `text.split()`).
+- **bootstrap-006 (refactor):** Test file imports `c` by name. If the agent renames it (the whole point of refactoring), tests break. Either update tests to import by new name, or make the test more flexible.
+
+### 2. Re-run JS cases
+After the `node *.test.js` fix, bootstrap-004 and simple-001 should be re-run to get accurate scores.
+
+### 3. CodeRabbit review items
+- Remove unused `randombytes` dependency from package.json
+- Remove redundant `allowSyntheticDefaultImports` from tsconfig.json
+- Remove redundant `"dist/**/*"` from tsconfig exclude
+
+### 4. Hardcoded model config
+The local-glm provider config (baseURL, model path, API key) is hardcoded as the constructor default in `src/agents/opencode.ts`. It should be externalized and read from the opencode config file or a sniffbench config file.
+
+### 5. Comprehension cases
+The 12 comp-* cases reference a `comprehension` rubric that doesn't exist yet. These are Q&A tasks that need LLM-judge or human evaluation, not test suites.
+
+## Key technical details
+
+- **ESM wrapper:** The `@opencode-ai/sdk` is ESM-only but the project is CommonJS. The `.mjs` wrapper in `src/agents/opencode-sdk.mjs` bridges this. tsc doesn't copy `.mjs` files, so the build script includes a manual `cp` step.
+- **Port management:** `nextPort` counter in opencode.ts increments per run to avoid collisions. Resets on process restart.
+- **Server lifecycle:** Each agent run spawns a real opencode server process on a unique port, with the case's temp dir as cwd. The server is killed in the `finally` block.
+- **Event streaming:** Uses `client.event.subscribe()` (SSE) + `session.promptAsync()`. The stream object is at `sseResult.stream` (not `.data`). Events arrive as `message.part.updated` (tool calls, text, reasoning), `message.updated` (final tokens/cost), `session.status` (idle = done).
+- **File snapshots:** After agent runs, `snapshotFiles()` in runner.ts walks the host tempDir and captures all files with a `changed` flag vs originals. Skips node_modules, .git, __pycache__, files >100KB.
+- **Agent config:** `DEFAULT_AGENT` constant in `src/agents/registry.ts` is used across all CLI commands. No more hardcoded `'claude-code'` strings.
+- **Closed-issues runner:** Now routes through agent wrappers (supports `--agent` and `--model` flags) instead of shelling out directly to `claude` CLI.
+
+## Git conventions
+- Do NOT include `Co-Authored-By` or Anthropic email in commits
+- Include model version (e.g., `claude-opus-4-6`) in commit body if desired
+- Never amend commits — always create new ones
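
Review note: the file-snapshot step described under "Key technical details" in HANDOFF.md can be sketched as below. This is a Python illustration of the described behaviour, not the TypeScript `snapshotFiles()` in runner.ts. The name `snapshot_files`, the return shape, and reading "100KB" as 100 * 1024 bytes are all assumptions made for the example.

```python
import os

# Directories the handoff notes say are skipped.
SKIP_DIRS = {"node_modules", ".git", "__pycache__"}
# Assumption: "100KB" is taken to mean 100 * 1024 bytes.
MAX_BYTES = 100 * 1024


def snapshot_files(root, originals):
    """Capture files under root and flag which ones differ from originals.

    originals maps relative paths (with "/" separators) to the file
    content before the agent ran. Hypothetical helper for illustration.
    """
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.getsize(full) > MAX_BYTES:
                continue  # oversized files are left out of the snapshot
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, encoding="utf-8", errors="replace") as f:
                content = f.read()
            snapshot[rel] = {
                "content": content,
                # Files absent from originals count as changed (newly created).
                "changed": originals.get(rel) != content,
            }
    return snapshot
```

In this sketch a file the agent created from scratch is reported as changed, which matters for cases like closed-issue #38 where the agent added a new file instead of editing existing ones.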
diff --git a/cases/bootstrap/binary-search.yaml b/cases/bootstrap/binary-search.yaml
@@ -0,0 +1,91 @@
+id: bootstrap-005
+title: "Binary Search Implementation"
+prompt: |
+  Complete the binary_search function implementation. The function
+  should find the index of a target value in a sorted array, or
+  return -1 if not found.
+
+  Binary search must:
+  - Run in O(log n) time complexity
+  - Handle empty arrays
+  - Handle values not present in the array
+  - Work with any comparable values
+
+  Run: python binary_search.test.py
+  Make all tests pass.
+
+source: bootstrap
+category: codefix
+language: python
+difficulty: medium
+
+tags:
+  - python
+  - algorithms
+  - binary-search
+
+files:
+  - path: binary_search.py
+    content: |
+      def binary_search(arr, target):
+          """
+          Perform binary search on a sorted array.
+
+          Args:
+              arr: Sorted list of comparable elements
+              target: Value to search for
+
+          Returns:
+              Index of target if found, -1 otherwise
+
+          Time complexity: O(log n)
+          """
+          # TODO: Implement binary search
+          pass
+
+  - path: binary_search.test.py
+    content: |
+      import unittest
+      from binary_search import binary_search
+
+      class TestBinarySearch(unittest.TestCase):
+
+          def test_found_elements(self):
+              arr = [1, 3, 5, 7, 9, 11, 13, 15]
+              self.assertEqual(binary_search(arr, 7), 3)
+              self.assertEqual(binary_search(arr, 1), 0)
+              self.assertEqual(binary_search(arr, 15), 7)
+              self.assertEqual(binary_search(arr, 9), 4)
+
+          def test_not_found(self):
+              arr = [1, 3, 5, 7, 9]
+              self.assertEqual(binary_search(arr, 2), -1)
+              self.assertEqual(binary_search(arr, 6), -1)
+              self.assertEqual(binary_search(arr, 10), -1)
+
+          def test_empty_array(self):
+              self.assertEqual(binary_search([], 5), -1)
+
+          def test_single_element(self):
+              arr = [42]
+              self.assertEqual(binary_search(arr, 42), 0)
+              self.assertEqual(binary_search(arr, 0), -1)
+
+          def test_two_elements(self):
+              arr = [1, 2]
+              self.assertEqual(binary_search(arr, 1), 0)
+              self.assertEqual(binary_search(arr, 2), 1)
+
+          def test_strings(self):
+              arr = ['apple', 'banana', 'cherry', 'date']
+              self.assertEqual(binary_search(arr, 'cherry'), 2)
+              self.assertEqual(binary_search(arr, 'grape'), -1)
+
+          def test_large_array(self):
+              arr = list(range(1000))
+              self.assertEqual(binary_search(arr, 42), 42)
+              self.assertEqual(binary_search(arr, 999), 999)
+              self.assertEqual(binary_search(arr, 1000), -1)
+
+      if __name__ == '__main__':
+          unittest.main()
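
Review note: for calibrating this case, here is one implementation that would satisfy the tests in `binary_search.test.py`. It is a minimal sketch of a standard iterative binary search, not the output the agent produced during the run.

```python
def binary_search(arr, target):
    """Return the index of target in the sorted list arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    # The loop is skipped entirely for an empty list, so -1 is returned.
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1  # target can only be in the right half
        else:
            hi = mid - 1  # target can only be in the left half
    return -1
```

Only `==` and `<` are used on the elements, so the same code handles the string test as well as the integer ones.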
diff --git a/cases/bootstrap/csv-parser.yaml b/cases/bootstrap/csv-parser.yaml
@@ -0,0 +1,125 @@
+id: bootstrap-007
+title: "File Processing - CSV Parser"
+prompt: |
+  Implement a CSV parser that can read and parse a CSV file.
+  The implementation should handle:
+  - Basic comma-separated values
+  - Quoted fields containing commas
+  - Header row extraction
+  - Converting to array of objects
+
+  Run: python csv_parser.test.py
+  Make all tests pass.
+
+source: bootstrap
+category: codefix
+language: python
+difficulty: medium
+
+tags:
+  - python
+  - file-processing
+  - csv
+
+files:
+  - path: csv_parser.py
+    content: |
+      import csv
+
+      def parse_csv(filepath, has_header=True):
+          """
+          Parse a CSV file and return data as list of dicts (or lists).
+
+          Args:
+              filepath: Path to the CSV file
+              has_header: Whether the first row is a header row
+
+          Returns:
+              List of dictionaries (if has_header=True) or list of lists
+          """
+          # TODO: Implement this function
+          return []
+
+  - path: csv_parser.test.py
+    content: |
+      import unittest
+      import os
+      import tempfile
+      import textwrap
+      from csv_parser import parse_csv
+
+      class TestCSVParser(unittest.TestCase):
+
+          def test_simple_csv_with_header(self):
+              data = textwrap.dedent('''\
+                  name,age,city
+                  Alice,30,New York
+                  Bob,25,Los Angeles
+                  Charlie,35,Chicago''')
+
+              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
+                  f.write(data)
+                  f.flush()
+
+                  result = parse_csv(f.name, has_header=True)
+
+                  self.assertEqual(len(result), 3)
+                  self.assertEqual(result[0]['name'], 'Alice')
+                  self.assertEqual(result[0]['age'], '30')
+                  self.assertEqual(result[1]['city'], 'Los Angeles')
+
+              os.unlink(f.name)
+
+          def test_csv_without_header(self):
+              data = textwrap.dedent('''\
+                  Alice,30,New York
+                  Bob,25,Los Angeles''')
+
+              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
+                  f.write(data)
+                  f.flush()
+
+                  result = parse_csv(f.name, has_header=False)
+
+                  self.assertEqual(len(result), 2)
+                  self.assertEqual(result[0][0], 'Alice')
+                  self.assertEqual(result[1][2], 'Los Angeles')
+
+              os.unlink(f.name)
+
+          def test_quoted_fields(self):
+              data = textwrap.dedent('''\
+                  product,price,description
+                  Widget,10.00,"A widget, really."
+                  Gadget,15.00,"A device, good."''')
+
+              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
+                  f.write(data)
+                  f.flush()
+
+                  result = parse_csv(f.name, has_header=True)
+
+                  self.assertEqual(len(result), 2)
+                  self.assertEqual(result[0]['description'], 'A widget, really.')
+                  self.assertEqual(result[1]['description'], 'A device, good.')
+
+              os.unlink(f.name)
+
+          def test_single_row(self):
+              data = textwrap.dedent('''\
+                  name,value
+                  test,123''')
+
+              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
+                  f.write(data)
+                  f.flush()
+
+                  result = parse_csv(f.name, has_header=True)
+
+                  self.assertEqual(len(result), 1)
+                  self.assertEqual(result[0]['name'], 'test')
+
+              os.unlink(f.name)
+
+      if __name__ == '__main__':
+          unittest.main()
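
Review note: since the starter file already imports `csv`, a solution built on the standard library is probably what the case expects. The sketch below is one such implementation that would satisfy the four tests above. It is illustrative only and is not the agent's actual output.

```python
import csv


def parse_csv(filepath, has_header=True):
    """Parse a CSV file into a list of dicts (with header) or a list of lists."""
    # newline='' is what the csv module documentation recommends for file objects.
    with open(filepath, newline='') as f:
        if has_header:
            # DictReader uses the first row as keys and handles quoted commas.
            return [dict(row) for row in csv.DictReader(f)]
        return [row for row in csv.reader(f)]
```

All values come back as strings, which matches the test's expectation that `age` is `'30'` and not `30`.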
diff --git a/cases/bootstrap/palindrome-checker.yaml b/cases/bootstrap/palindrome-checker.yaml
@@ -0,0 +1,97 @@
+id: bootstrap-004
+title: "Palindrome Checker"
+prompt: |
+  Implement a palindrome checker that works correctly across
+  different edge cases. The tests are already written - you need
+  to make them all pass.
+
+  A palindrome reads the same forwards and backwards.
+  You should:
+  - Ignore case
+  - Ignore non-alphanumeric characters
+  - Handle empty strings as valid palindromes
+
+  Run: node palindrome.test.js
+  Fix the implementation until all tests pass.
+
+source: bootstrap
+category: codefix
+language: javascript
+difficulty: easy
+
+tags:
+  - javascript
+  - algorithms
+  - string-manipulation
+
+files:
+  - path: palindrome.js
+    content: |
+      function isPalindrome(str) {
+        // TODO: Implement properly
+        return str === str.split('').reverse().join('');
+      }
+
+      module.exports = { isPalindrome };
+
+  - path: palindrome.test.js
+    content: |
+      const { isPalindrome } = require('./palindrome');
+
+      function test(name, fn) {
+        try {
+          fn();
+          console.log(`✓ ${name}`);
+        } catch (err) {
+          console.log(`✗ ${name}: ${err.message}`);
+          process.exit(1);
+        }
+      }
+
+      function assertEqual(actual, expected, message) {
+        if (actual !== expected) {
+          throw new Error(message || `Expected ${expected}, got ${actual}`);
+        }
+      }
+
+      // Basic palindromes
+      test('racecar is palindrome', () => {
+        assertEqual(isPalindrome('racecar'), true);
+      });
+
+      test('hello is not palindrome', () => {
+        assertEqual(isPalindrome('hello'), false);
+      });
+
+      // Case insensitive
+      test('RaceCar is palindrome', () => {
+        assertEqual(isPalindrome('RaceCar'), true);
+      });
+
+      test('A man a plan a canal Panama', () => {
+        assertEqual(isPalindrome('A man a plan a canal Panama'), true);
+      });
+
+      // Mixed case with spaces
+      test('Was it a car or a cat I saw', () => {
+        assertEqual(isPalindrome('Was it a car or a cat I saw'), true);
+      });
+
+      // Edge cases
+      test('empty string', () => {
+        assertEqual(isPalindrome(''), true);
+      });
+
+      test('single character', () => {
+        assertEqual(isPalindrome('a'), true);
+      });
+
+      test('numeric', () => {
+        assertEqual(isPalindrome('12321'), true);
+      });
+
+      test('numeric with letters', () => {
+        assertEqual(isPalindrome('1a2 3 2a1'), true);
+      });
+
+      console.log('All tests passed!');
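
Review note: the starter `isPalindrome` compares the raw string with its reverse, so it passes the `racecar` and `hello` tests and then stops at `RaceCar`, where the test harness calls `process.exit(1)`. The normalization the prompt asks for (ignore case, ignore non-alphanumeric characters) is sketched here in Python. The case file itself is JavaScript, so this is illustrative only, and `is_palindrome` is a hypothetical name.

```python
def is_palindrome(text: str) -> bool:
    """Check for a palindrome, ignoring case and non-alphanumeric characters."""
    # Keep only letters and digits, lowercased, before comparing.
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]
```

One difference to keep in mind when porting: Python's `str.isalnum()` is Unicode-aware, so a JavaScript version that filters with an ASCII-only character class would behave differently on non-ASCII input. The tests in this case use ASCII only.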