Add opencode agent using @opencode-ai/sdk #48
Merged
19 commits
acc9689 Add 5 new bootstrap test cases (jharris1679)
3dc4452 Add opencode agent wrapper using @opencode-ai/sdk (jharris1679)
03df9a8 fix: add opencode config inline, fix score display, add agent invocation (jharris1679)
2e481c4 fix: spawn opencode server with correct cwd to fix empty responses (jharris1679)
1f555b2 fix: default to failure for unverified evaluators, add HANDOFF.md (jharris1679)
16d48e2 fix: update pnpm lockfile and fix test runner discovery (jharris1679)
581a80c fix: correct score normalization when optional-only criteria are excl… (jharris1679)
57749cd feat: add --model CLI flag and save agent response in results (jharris1679)
96658db docs: update HANDOFF.md with full session notes (jharris1679)
0a736ef feat: add event streaming and sandbox file snapshots (jharris1679)
c63e5c4 docs: clean up HANDOFF.md for next agent (jharris1679)
54a4b5c fix: resolve SSE stream from correct property on subscribe result (jharris1679)
e98ae2c refactor: centralize default agent config (jharris1679)
89f5f15 feat: add --agent and --model flags to closed-issues run (jharris1679)
441ef67 fix: add node *.test.js to rubric test chain, update HANDOFF.md (jharris1679)
a358598 fix: merge built-in and project-specific case directories (jharris1679)
f18fe3f fix: address PR review feedback from CodeRabbit and human review (jharris1679)
a6943c2 fix: regenerate lockfile after removing randombytes (jharris1679)
ccfce82 fix: remove unused runAgentLocally function (lint error) (jharris1679)
@@ -0,0 +1,72 @@
# Handoff Notes

## Current state

The `add-glm-agent` branch (PR #48) adds an opencode agent integration to sniffbench. It works end-to-end: spawns an opencode server, sends prompts via the SDK, streams events via SSE, captures tool calls and file snapshots, evaluates with the rubric, and saves results.

**To test:**
```bash
pnpm run build
npx sniff run --agent opencode --cases bootstrap-005
npx sniff closed-issues run --agent opencode --local
```

## Test results (GLM-4.7-4bit via opencode)

### Bootstrap cases

| Case | Score | Verdict | Notes |
|------|-------|---------|-------|
| bootstrap-003 (Python Unit Tests) | 100% | **False pass** | Starter code already passes all tests (no-op) |
| bootstrap-004 (Palindrome Checker) | 0% | Fail | Agent edited file but tests failed. JS test runner (`node *.test.js`) was missing from rubric command chain (now fixed) |
| bootstrap-005 (Binary Search) | 100% | **Legit pass** | Agent implemented full binary search from `pass` stub |
| bootstrap-006 (Refactor Bad Code) | 0% | Fail | Agent refactored and renamed function `c`, breaking `require('./shipping_calculator').c` in test file. Also hit missing JS test runner issue |
| bootstrap-007 (CSV Parser) | 100% | **Legit pass** | Agent implemented CSV parser from empty `return []` stub |
| simple-001 (Simple math) | 0% | Fail | Code already passes, so this should be a free win. Failed because JS test runner was missing from rubric |
| fail-001 (Intentionally failing) | 0% | Expected fail | Case exists to verify failure reporting works |

**Key fix:** Added `node *.test.js` to the test command chain in `defaults.ts`. All JS cases were failing because the rubric only tried `npm test` (no package.json) then fell through to Python/Go runners.
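
The fall-through described in the key fix can be pictured with a small sketch. This is an illustration in Python, not the logic in `defaults.ts`: the function name `pick_test_command`, the file checks, and the Python and Go entries are assumptions made for the example. Only `npm test` and `node *.test.js` come from the notes above.

```python
import glob
import os


def pick_test_command(workdir):
    """Return the first applicable test command for workdir, or None.

    Hypothetical stand-in for a rubric command chain: each entry pairs a
    command label with the file pattern that makes it applicable.
    """
    def has(pattern):
        return bool(glob.glob(os.path.join(workdir, pattern)))

    chain = [
        ("npm test", has("package.json")),
        ("node *.test.js", has("*.test.js")),    # the entry the fix adds
        ("python *.test.py", has("*.test.py")),  # placeholder label
        ("go test ./...", has("go.mod")),        # placeholder label
    ]
    for command, applicable in chain:
        if applicable:
            return command
    return None
```

Without the second entry, a directory holding only `palindrome.js` and `palindrome.test.js` matches nothing JavaScript-specific and falls through to runners that cannot execute it, which is consistent with the 0% scores reported for the JS cases.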

### Closed-issues cases (real GitHub issues)

| Case | Score | Notes |
|------|-------|-------|
| #12 (Add --compare flag) | 10/100 | 0 files changed. Agent explored but produced no edits. Hard task: 363 additions across 2 files |
| #38 (Split variant/variants) | 10/100 | Created new `variants.ts` instead of modifying existing files. Understood intent but wrong approach |

These are significantly harder than bootstrap cases: they require understanding a real codebase and making coordinated multi-file changes. Likely beyond a 4-bit quantized local model's capability.

## What's next

### 1. Bootstrap test cases need work
- **bootstrap-003 (python-unit-test):** No-op: starter code already passes all tests. Make it actually buggy (e.g., `text.split(' ')` instead of `text.split()`).
- **bootstrap-006 (refactor):** Test file imports `c` by name. If the agent renames it (the whole point of refactoring), tests break. Either update tests to import by new name, or make the test more flexible.
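
The bug suggested for bootstrap-003 relies on a real difference in `str.split`: with no argument it splits on runs of whitespace and drops empty strings, while `split(' ')` splits on every single space. A small illustration follows; the `count_words_*` helpers are hypothetical and are not taken from the case file.

```python
def count_words_correct(text):
    # split() with no argument collapses runs of whitespace and ignores
    # leading and trailing whitespace.
    return len(text.split())


def count_words_buggy(text):
    # split(' ') treats every single space as a separator, so repeated or
    # trailing spaces produce empty strings that are counted as words.
    return len(text.split(' '))


sample = "hello   world "  # three spaces in the middle, one trailing
print(count_words_correct(sample))
print(count_words_buggy(sample))
```

A starter file seeded with the second form would fail any test that uses irregular spacing, which gives the agent something to fix.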

### 2. Re-run JS cases
After the `node *.test.js` fix, bootstrap-004 and simple-001 should be re-run to get accurate scores.

### 3. CodeRabbit review items
- Remove unused `randombytes` dependency from package.json
- Remove redundant `allowSyntheticDefaultImports` from tsconfig.json
- Remove redundant `"dist/**/*"` from tsconfig exclude

### 4. Hardcoded model config
The local-glm provider config (baseURL, model path, API key) is hardcoded in `src/agents/opencode.ts` constructor default. Should be externalized — read from opencode config file or a sniffbench config file.
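
One possible shape for externalizing the provider config is a layered lookup: defaults, then an optional JSON file, then environment overrides. The sketch below is in Python for illustration only; `load_provider_config`, the key names, and the `EXAMPLE_PROVIDER_*` variables are all hypothetical and do not correspond to any existing sniffbench or opencode setting.

```python
import json
import os

# Hypothetical defaults. The real values live in src/agents/opencode.ts
# and are deliberately not reproduced here.
DEFAULTS = {"baseURL": None, "model": None, "apiKey": None}


def load_provider_config(path=None, env=None):
    """Merge defaults, an optional JSON config file, and env overrides."""
    env = os.environ if env is None else env
    config = dict(DEFAULTS)
    if path and os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            config.update(json.load(fh))
    # Environment variables win over the file. Names are illustrative.
    overrides = {
        "baseURL": env.get("EXAMPLE_PROVIDER_BASE_URL"),
        "model": env.get("EXAMPLE_PROVIDER_MODEL"),
        "apiKey": env.get("EXAMPLE_PROVIDER_API_KEY"),
    }
    config.update({key: value for key, value in overrides.items() if value})
    return config
```

Keeping the API key in the environment rather than in a checked-in file is the main reason for the last layer.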

### 5. Comprehension cases
The 12 comp-* cases reference a `comprehension` rubric that doesn't exist yet. These are Q&A tasks that need LLM-judge or human evaluation, not test suites.

## Key technical details

- **ESM wrapper:** The `@opencode-ai/sdk` is ESM-only but the project is CommonJS. The `.mjs` wrapper in `src/agents/opencode-sdk.mjs` bridges this. tsc doesn't copy `.mjs` files, so the build script includes a manual `cp` step.
- **Port management:** `nextPort` counter in opencode.ts increments per run to avoid collisions. Resets on process restart.
- **Server lifecycle:** Each agent run spawns a real opencode server process on a unique port, with the case's temp dir as cwd. The server is killed in the `finally` block.
- **Event streaming:** Uses `client.event.subscribe()` (SSE) + `session.promptAsync()`. The stream object is at `sseResult.stream` (not `.data`). Events arrive as `message.part.updated` (tool calls, text, reasoning), `message.updated` (final tokens/cost), `session.status` (idle = done).
- **File snapshots:** After agent runs, `snapshotFiles()` in runner.ts walks the host tempDir and captures all files with a `changed` flag vs originals. Skips node_modules, .git, __pycache__, files >100KB.
- **Agent config:** `DEFAULT_AGENT` constant in `src/agents/registry.ts` is used across all CLI commands. No more hardcoded `'claude-code'` strings.
- **Closed-issues runner:** Now routes through agent wrappers (supports `--agent` and `--model` flags) instead of shelling out directly to `claude` CLI.
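
The file-snapshot step in the list above can be sketched as a directory walk that prunes the skipped folders and compares each file with its original content. This is a Python illustration of the described behaviour, not a port of `snapshotFiles()` in runner.ts: the return shape, the `originals` argument, and reading 100KB as 100 * 1024 bytes are assumptions.

```python
import os

SKIP_DIRS = {"node_modules", ".git", "__pycache__"}
MAX_BYTES = 100 * 1024  # assumption: "100KB" taken as 100 * 1024 bytes


def snapshot_files(root, originals):
    """Return {relative_path: {"content": str, "changed": bool}} for root.

    originals maps relative paths to their content before the agent ran.
    """
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.getsize(full) > MAX_BYTES:
                continue
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, encoding="utf-8", errors="replace") as fh:
                content = fh.read()
            snapshot[rel] = {
                "content": content,
                # New files have no original, so they count as changed.
                "changed": originals.get(rel) != content,
            }
    return snapshot
```

Treating files that did not exist before the run as changed is a design choice of this sketch; it would make a newly created file such as `variants.ts` in the closed-issues run visible in the results.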

## Git conventions
- Do NOT include `Co-Authored-By` or Anthropic email in commits
- Include model version (e.g., `claude-opus-4-6`) in commit body if desired
- Never amend commits; always create new ones
@@ -0,0 +1,91 @@
id: bootstrap-005
title: "Binary Search Implementation"
prompt: |
  Complete the binary_search function implementation. The function
  should find the index of a target value in a sorted array, or
  return -1 if not found.

  Binary search must:
  - Run in O(log n) time complexity
  - Handle empty arrays
  - Handle values not present in the array
  - Work with any comparable values

  Run: python binary_search.test.py
  Make all tests pass.

source: bootstrap
category: codefix
language: python
difficulty: medium

tags:
  - python
  - algorithms
  - binary-search

files:
  - path: binary_search.py
    content: |
      def binary_search(arr, target):
          """
          Perform binary search on a sorted array.

          Args:
              arr: Sorted list of comparable elements
              target: Value to search for

          Returns:
              Index of target if found, -1 otherwise

          Time complexity: O(log n)
          """
          # TODO: Implement binary search
          pass

  - path: binary_search.test.py
    content: |
      import unittest
      from binary_search import binary_search

      class TestBinarySearch(unittest.TestCase):

          def test_found_elements(self):
              arr = [1, 3, 5, 7, 9, 11, 13, 15]
              self.assertEqual(binary_search(arr, 7), 3)
              self.assertEqual(binary_search(arr, 1), 0)
              self.assertEqual(binary_search(arr, 15), 7)
              self.assertEqual(binary_search(arr, 9), 4)

          def test_not_found(self):
              arr = [1, 3, 5, 7, 9]
              self.assertEqual(binary_search(arr, 2), -1)
              self.assertEqual(binary_search(arr, 6), -1)
              self.assertEqual(binary_search(arr, 10), -1)

          def test_empty_array(self):
              self.assertEqual(binary_search([], 5), -1)

          def test_single_element(self):
              arr = [42]
              self.assertEqual(binary_search(arr, 42), 0)
              self.assertEqual(binary_search(arr, 0), -1)

          def test_two_elements(self):
              arr = [1, 2]
              self.assertEqual(binary_search(arr, 1), 0)
              self.assertEqual(binary_search(arr, 2), 1)

          def test_strings(self):
              arr = ['apple', 'banana', 'cherry', 'date']
              self.assertEqual(binary_search(arr, 'cherry'), 2)
              self.assertEqual(binary_search(arr, 'grape'), -1)

          def test_large_array(self):
              arr = list(range(1000))
              self.assertEqual(binary_search(arr, 42), 42)
              self.assertEqual(binary_search(arr, 999), 999)
              self.assertEqual(binary_search(arr, 1000), -1)

      if __name__ == '__main__':
          unittest.main()
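
For reference, one implementation that should satisfy the tests in this case is the standard iterative binary search. This is a sketch of a passing solution, not the code the agent produced in the recorded run.

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if it is absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return -1
```

The loop halves the search interval on every iteration, which gives the O(log n) bound the prompt asks for, and an empty list skips the loop entirely because `hi` starts at -1.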
@@ -0,0 +1,125 @@
id: bootstrap-007
title: "File Processing - CSV Parser"
prompt: |
  Implement a CSV parser that can read and parse a CSV file.
  The implementation should handle:
  - Basic comma-separated values
  - Quoted fields containing commas
  - Header row extraction
  - Converting to array of objects

  Run: python csv_parser.test.py
  Make all tests pass.

source: bootstrap
category: codefix
language: python
difficulty: medium

tags:
  - python
  - file-processing
  - csv

files:
  - path: csv_parser.py
    content: |
      import csv

      def parse_csv(filepath, has_header=True):
          """
          Parse a CSV file and return data as list of dicts (or lists).

          Args:
              filepath: Path to the CSV file
              has_header: Whether the first row is a header row

          Returns:
              List of dictionaries (if has_header=True) or list of lists
          """
          # TODO: Implement this function
          return []

  - path: csv_parser.test.py
    content: |
      import unittest
      import os
      import tempfile
      import textwrap
      from csv_parser import parse_csv

      class TestCSVParser(unittest.TestCase):

          def test_simple_csv_with_header(self):
              data = textwrap.dedent('''\
                  name,age,city
                  Alice,30,New York
                  Bob,25,Los Angeles
                  Charlie,35,Chicago''')

              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
                  f.write(data)
                  f.flush()

              result = parse_csv(f.name, has_header=True)

              self.assertEqual(len(result), 3)
              self.assertEqual(result[0]['name'], 'Alice')
              self.assertEqual(result[0]['age'], '30')
              self.assertEqual(result[1]['city'], 'Los Angeles')

              os.unlink(f.name)

          def test_csv_without_header(self):
              data = textwrap.dedent('''\
                  Alice,30,New York
                  Bob,25,Los Angeles''')

              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
                  f.write(data)
                  f.flush()

              result = parse_csv(f.name, has_header=False)

              self.assertEqual(len(result), 2)
              self.assertEqual(result[0][0], 'Alice')
              self.assertEqual(result[1][2], 'Los Angeles')

              os.unlink(f.name)

          def test_quoted_fields(self):
              data = textwrap.dedent('''\
                  product,price,description
                  Widget,10.00,"A widget, really."
                  Gadget,15.00,"A device, good."''')

              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
                  f.write(data)
                  f.flush()

              result = parse_csv(f.name, has_header=True)

              self.assertEqual(len(result), 2)
              self.assertEqual(result[0]['description'], 'A widget, really.')
              self.assertEqual(result[1]['description'], 'A device, good.')

              os.unlink(f.name)

          def test_single_row(self):
              data = textwrap.dedent('''\
                  name,value
                  test,123''')

              with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
                  f.write(data)
                  f.flush()

              result = parse_csv(f.name, has_header=True)

              self.assertEqual(len(result), 1)
              self.assertEqual(result[0]['name'], 'test')

              os.unlink(f.name)

      if __name__ == '__main__':
          unittest.main()
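
Since the stub already imports `csv`, a short solution built on the standard library should pass these tests. The sketch below is one such solution under that assumption; it is not the agent's output from the recorded run.

```python
import csv


def parse_csv(filepath, has_header=True):
    """Parse a CSV file into a list of dicts (with header) or lists."""
    # newline="" lets the csv module handle line endings itself, as the
    # csv documentation recommends for file objects passed to a reader.
    with open(filepath, newline="", encoding="utf-8") as fh:
        if has_header:
            return list(csv.DictReader(fh))
        return list(csv.reader(fh))
```

`csv.DictReader` takes the first row as field names, and both readers handle quoted fields containing commas, so the four behaviours listed in the prompt are covered without custom parsing. All values come back as strings, which matches the `'30'` assertion in the test file.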
@@ -0,0 +1,97 @@
id: bootstrap-004
title: "Palindrome Checker"
prompt: |
  Implement a palindrome checker that works correctly across
  different edge cases. The tests are already written - you need
  to make them all pass.

  A palindrome reads the same forwards and backwards.
  You should:
  - Ignore case
  - Ignore non-alphanumeric characters
  - Handle empty strings as valid palindromes

  Run: node palindrome.test.js
  Fix the implementation until all tests pass.

source: bootstrap
category: codefix
language: javascript
difficulty: easy

tags:
  - javascript
  - algorithms
  - string-manipulation

files:
  - path: palindrome.js
    content: |
      function isPalindrome(str) {
        // TODO: Implement properly
        return str === str.split('').reverse().join('');
      }

      module.exports = { isPalindrome };

  - path: palindrome.test.js
    content: |
      const { isPalindrome } = require('./palindrome');

      function test(name, fn) {
        try {
          fn();
          console.log(`✓ ${name}`);
        } catch (err) {
          console.log(`✗ ${name}: ${err.message}`);
          process.exit(1);
        }
      }

      function assertEqual(actual, expected, message) {
        if (actual !== expected) {
          throw new Error(message || `Expected ${expected}, got ${actual}`);
        }
      }

      // Basic palindromes
      test('racecar is palindrome', () => {
        assertEqual(isPalindrome('racecar'), true);
      });

      test('hello is not palindrome', () => {
        assertEqual(isPalindrome('hello'), false);
      });

      // Case insensitive
      test('RaceCar is palindrome', () => {
        assertEqual(isPalindrome('RaceCar'), true);
      });

      test('A man a plan a canal Panama', () => {
        assertEqual(isPalindrome('A man a plan a canal Panama'), true);
      });

      // With spaces and punctuation
      test('Was it a car or a cat I saw', () => {
        assertEqual(isPalindrome('Was it a car or a cat I saw'), true);
      });

      // Edge cases
      test('empty string', () => {
        assertEqual(isPalindrome(''), true);
      });

      test('single character', () => {
        assertEqual(isPalindrome('a'), true);
      });

      test('numeric', () => {
        assertEqual(isPalindrome('12321'), true);
      });

      test('numeric with letters', () => {
        assertEqual(isPalindrome('1a2 3 2a1'), true);
      });

      console.log('All tests passed!');
CSV test data will contain unintended leading whitespace, causing assertions to fail.
The triple-quoted strings in the test methods embed continuation lines with leading spaces from the Python indentation. After YAML block-scalar processing, each continuation line of the data string still carries those leading spaces.
When parsed, `result[0]['name']` will be `' Alice'`, not `'Alice'`, so Line 65's assertion will fail. The same issue affects all four test methods.
Fix by using `textwrap.dedent` or by left-aligning the CSV content within the triple-quoted string (breaking Python indentation aesthetics but keeping data correct).
Proposed fix: wrap each CSV literal in `textwrap.dedent` and apply the same pattern to all test methods.
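
The effect described in this comment can be reproduced in isolation. The sketch below uses a made-up two-line literal rather than the test file itself, and compares the first field of the data row before and after `textwrap.dedent`.

```python
import textwrap

# A triple-quoted literal as it would look inside an indented test method:
# the backslash suppresses the first newline, but every line still carries
# the four spaces of source indentation.
raw = '''\
    name,age
    Alice,30'''

dedented = textwrap.dedent(raw)

# First field of the data row, split the naive way a CSV reader would.
first_field_raw = raw.splitlines()[1].split(',')[0]
first_field_dedented = dedented.splitlines()[1].split(',')[0]

print(repr(first_field_raw))
print(repr(first_field_dedented))
```

`textwrap.dedent` removes only the whitespace common to all lines, so it is safe here as long as every CSV row is indented by the same amount.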