Add opencode agent using @opencode-ai/sdk#48
Conversation
- Python Unit Tests (text processing)
- Palindrome Checker (algorithms)
- Binary Search Implementation
- Refactor Bad Code (shipping calculator)
- CSV Parser (file processing)

Expands bootstrap suite from 2 to 7 cases.
> **Note:** Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
**Walkthrough**

Adds five bootstrap exercises (implementations/tests or stubs), a new Opencode agent wrapper and SDK shim, registers and runs agents during per-case evaluation, updates runner scoring and evaluator behavior, adjusts CLI rounding, package/TS config, and adds HANDOFF documentation.
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant User
    participant Runner
    participant Agent as OpencodeAgent
    participant CLI
    participant SDK as Opencode SDK
    participant Tools
    User->>Runner: request run(case)
    Runner->>Agent: createOpencodeAgent(cliPath?)
    Runner->>Agent: run(prompt, {cwd, timeout, permissionMode})
    Agent->>CLI: spawn opencode server (with cwd)
    CLI-->>Agent: server stdout -> server URL
    Agent->>SDK: dynamic import & create client session (connect to server)
    Agent->>SDK: stream prompt
    SDK->>Tools: invoke tools (if any)
    Tools-->>SDK: return tool results
    SDK-->>Agent: stream parts (text_delta / tool parts / complete)
    Agent->>Runner: emit 'text_delta', 'tool_start', 'tool_end'
    SDK-->>Agent: complete (tokens, cost, model)
    Agent->>Runner: emit 'complete' and return AgentResult
    Runner->>User: continue evaluation with AgentResult
    alt Error
        SDK-->>Agent: error
        Agent->>Runner: emit 'error' and return error AgentResult
    end
```
**Estimated Code Review Effort**: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@cases/bootstrap/csv-parser.yaml`:
- Around line 52-68: The test data string in test_simple_csv_with_header (and
the other CSV test methods) contains unintended leading indentation, causing
parsed field values to include spaces; wrap the multiline CSV literal in
textwrap.dedent (import textwrap at top) or left-align the triple-quoted CSV so
the variable data has no leading spaces before writing it in the
tempfile.NamedTemporaryFile block, ensuring parse_csv(...) returns 'Alice' not '
Alice' and the assertions pass.
In `@cases/bootstrap/python-unit-test.yaml`:
- Around line 29-33: The TODO claiming count_words is buggy is incorrect — the
current count_words(text) implementation using text.split() already handles
whitespace/newlines and counts tokens; remove the misleading TODO comment or
update the docstring to reflect correctness, or if you intend to change behavior
(e.g., ignore punctuation), update tests and implement punctuation stripping
inside count_words (e.g., normalize by removing punctuation before splitting)
and adjust the docstring accordingly.
In `@src/agents/opencode.ts`:
- Around line 37-38: The constructor currently hardcodes cliPath to
'/opt/homebrew/bin/opencode', which is not portable; change the default in the
constructor for the class in src/agents/opencode.ts so cliPath defaults to
'opencode' (a bare executable name) and rely on PATH resolution when spawning
processes; update any places referencing the constructor signature
(constructor(cliPath: string = '/opt/homebrew/bin/opencode')) to use the new
default and keep existing behavior when an explicit path is passed.
- Around line 162-165: The local const startTime inside the tool completion
branch shadows the outer startTime used for the run duration; rename the inner
variable (e.g., toolStartTime or toolStartedAt) wherever it's declared and used
in the block that references toolPart, toolStartTimes, and callID so durationMs
continues to compute Date.now() - toolStartTime without shadowing the run-level
startTime; update any subsequent references in that scope to the new name to
avoid accidental use of the run startTime.
- Around line 183-192: The code reads response.info unguarded which can throw if
info is undefined; change to use a safe default (e.g., const info =
response.info ?? {}) and compute tokens using optional chaining/fallbacks
(input/output defaulting to 0) so tokens.inputTokens, tokens.outputTokens,
tokens.totalTokens never access undefined, and only set model when
info.providerID and info.modelID are present (keep the existing conditional but
reference the guarded info). Ensure cacheReadTokens/cacheWriteTokens remain
initialized to 0.
- Around line 142-149: The tool part branch assumes toolPart.state is always
defined before reading state.status; update the conditional in the part.type ===
'tool' branch to null-check toolPart.state (e.g., require toolPart.state &&
toolPart.state.status === 'pending') before creating the ToolCall, and apply the
same null-guard pattern to the later check around the second access (the check
at the location corresponding to the later state.status usage) so accessing
.status cannot throw when state is undefined.
🧹 Nitpick comments (7)
tsconfig.json (2)
**10-10**: `allowSyntheticDefaultImports` is redundant when `esModuleInterop` is already enabled.

With `esModuleInterop: true` on line 9, TypeScript already implies `allowSyntheticDefaultImports: true`. This line can be removed without any behavioral change.
**20-20**: Redundant exclude pattern: `"dist/**/*"` is already covered by `"dist"`.

The `"dist"` entry already excludes the entire directory tree. Adding `"dist/**/*"` is a no-op.

Proposed fix

```diff
- "exclude": ["node_modules", "dist", "**/*.test.ts", "dist/**/*"]
+ "exclude": ["node_modules", "dist", "**/*.test.ts"]
```

src/agents/opencode.ts (3)
**17-25**: Lazy-load pattern is not concurrency-safe, though benign here.

If `loadSDK()` is called concurrently (e.g., two agent runs started simultaneously), both calls may trigger `import()` since neither has set `createOpencode` yet. This is benign (dynamic `import()` returns a cached module), but the intent of the guard is defeated. A simple fix is to cache the promise instead.

Promise-caching approach

```diff
-let createOpencode: any;
-const loadSDK = async () => {
-  if (!createOpencode) {
-    const sdkWrapper = await import('./opencode-sdk.mjs');
-    createOpencode = sdkWrapper.createOpencode;
-  }
-  return createOpencode;
-};
+let sdkPromise: Promise<any> | null = null;
+const loadSDK = () => {
+  if (!sdkPromise) {
+    sdkPromise = import('./opencode-sdk.mjs').then((m) => m.createOpencode);
+  }
+  return sdkPromise;
+};
```
**194-200**: Parts are iterated twice — once for events (lines 133-181) and once for answer text (lines 194-200). Consider accumulating the answer text during the first iteration to avoid the redundant pass.

Combine into single loop

```diff
+ let answer = '';
  for (const part of parts) {
    if (part.type === 'text') {
      const textPart = part as { text?: string };
      if (textPart.text) {
+       answer += textPart.text;
        options.onEvent?.({
          type: 'text_delta',
          text: textPart.text,
        });
      }
    } else if (part.type === 'tool') {
      // ... tool handling ...
    }
  }

  // Remove second loop (lines 194-200)
- let answer = '';
- for (const part of parts) {
-   if (part.type === 'text') {
-     const textPart = part as { text?: string };
-     answer += textPart.text || '';
-   }
- }
```
**93-97**: The `signal` parameter is supported, but consider using the dedicated `timeout` option instead.

The Opencode SDK's `createOpencode()` function does accept a `signal?: AbortSignal` option for cancellation. Using `AbortSignal.timeout(timeoutMs)` as shown will work correctly. However, the SDK also provides a dedicated `timeout` option specifically for server start timeout (default 5000ms), which may be a clearer and more idiomatic way to express the intent.

package.json (1)
**52-52**: Remove unused `randombytes` dependency.

This package is not imported or used anywhere in the codebase and can be safely removed from the dependencies in package.json.
src/agents/opencode-sdk.mjs.d.ts (1)
**5-7**: Import the actual type from `@opencode-ai/sdk` instead of using `any`.

The `@opencode-ai/sdk` package exports `createOpencode` with proper TypeScript types. Using `any` discards type safety and IDE autocompletion. Replace the declaration with:

```typescript
import type { createOpencode } from '@opencode-ai/sdk';
export { createOpencode };
```
```python
def test_simple_csv_with_header(self):
    data = '''name,age,city
    Alice,30,New York
    Bob,25,Los Angeles
    Charlie,35,Chicago'''

    with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
        f.write(data)
        f.flush()

    result = parse_csv(f.name, has_header=True)

    self.assertEqual(len(result), 3)
    self.assertEqual(result[0]['name'], 'Alice')
    self.assertEqual(result[0]['age'], '30')
    self.assertEqual(result[1]['city'], 'Los Angeles')
```
**CSV test data will contain unintended leading whitespace, causing assertions to fail.**

The triple-quoted strings in the test methods embed continuation lines with leading spaces from the Python indentation. After YAML block-scalar processing, the `data` string will look like:

```
name,age,city
    Alice,30,New York
    Bob,25,Los Angeles
    Charlie,35,Chicago
```

When parsed, `result[0]['name']` will be `'    Alice'`, not `'Alice'`, so line 65's assertion will fail. The same issue affects all four test methods.

Fix by using `textwrap.dedent` or by left-aligning the CSV content within the triple-quoted string (breaking Python indentation aesthetics but keeping data correct).
Proposed fix using `textwrap.dedent`

```diff
+ import textwrap
+
  ...
  def test_simple_csv_with_header(self):
-     data = '''name,age,city
-     Alice,30,New York
-     Bob,25,Los Angeles
-     Charlie,35,Chicago'''
+     data = textwrap.dedent('''\
+         name,age,city
+         Alice,30,New York
+         Bob,25,Los Angeles
+         Charlie,35,Chicago''')
```

Apply the same pattern to all test methods.
🤖 Prompt for AI Agents
In `@cases/bootstrap/csv-parser.yaml` around lines 52 - 68, The test data string
in test_simple_csv_with_header (and the other CSV test methods) contains
unintended leading indentation, causing parsed field values to include spaces;
wrap the multiline CSV literal in textwrap.dedent (import textwrap at top) or
left-align the triple-quoted CSV so the variable data has no leading spaces
before writing it in the tempfile.NamedTemporaryFile block, ensuring
parse_csv(...) returns 'Alice' not ' Alice' and the assertions pass.
```python
def count_words(text):
    """Count the number of words in a string."""
    # TODO: This implementation is buggy. Fix it!
    words = text.split()
    return len(words)
```
**The "buggy" implementation already passes all provided tests.**

`text.split()` (no arguments) splits on any whitespace and handles multiple spaces, empty strings, and newlines correctly. The punctuation test (`"hello, world!"` → 2) also passes, since `split()` counts `"hello,"` and `"world!"` as two tokens.

The prompt claims the implementation is buggy with a TODO to fix it, but there's nothing to fix — the agent can pass all tests without changing a line. Consider either introducing an actual bug (e.g., `text.split(' ')`, which fails on multiple spaces) or adding tests that genuinely fail (e.g., stripping punctuation before counting).
🤖 Prompt for AI Agents
In `@cases/bootstrap/python-unit-test.yaml` around lines 29 - 33, The TODO
claiming count_words is buggy is incorrect — the current count_words(text)
implementation using text.split() already handles whitespace/newlines and counts
tokens; remove the misleading TODO comment or update the docstring to reflect
correctness, or if you intend to change behavior (e.g., ignore punctuation),
update tests and implement punctuation stripping inside count_words (e.g.,
normalize by removing punctuation before splitting) and adjust the docstring
accordingly.
- Add agent config directly in opencode.ts with SDK initialization parameters
- Add response structure validation and server cleanup in finally block
- Fix score display in CLI and runner (scores already percentages)
- Add agent call between dependency install and evaluation in runner
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/agents/opencode.ts`:
- Around line 159-165: The AbortSignal.timeout call can receive zero/negative
values when timeoutMs ≤ 5000, causing an invalid or immediate abort; update the
client.session.prompt invocation (the promptResult call) to compute a safe
timeout before passing a signal — e.g., calculate a timeoutDuration = timeoutMs
- 5000 and only pass signal: AbortSignal.timeout(timeoutDuration) when
timeoutDuration > 0, or else omit the signal or use a small positive fallback
(e.g., Math.max(1, timeoutDuration)); adjust the code around
client.session.prompt, sessionId, and timeoutMs accordingly so the prompt isn't
given an invalid timeout.
- Around line 95-127: The config object is hardcoded with machine-specific
values (model: 'local-glm/glm-4.7-local-4bit', provider.baseURL,
provider.apiKey, and the absolute model id '/Users/studio/models/GLM-4.7-4bit');
replace this by loading settings from the Opencode config
(~/.config/opencode/opencode.json) or environment/agent options before
constructing the config variable so it’s portable. Implement reading/parsing of
the opencode config (or use process.env fallbacks) and populate the
config.model, provider['local-glm'].options.baseURL, apiKey and
provider['local-glm'].models['glm-4.7-local-4bit'].id from those values instead
of hardcoding them; ensure defaults are validated and fallback errors are
handled where config is used (e.g., the config variable and any functions that
consume it).
🧹 Nitpick comments (3)
src/agents/opencode.ts (2)
**131-137**: Hardcoded hostname and port for the Opencode server.

`hostname: '127.0.0.1'` and `port: 4097` are baked in. If another service occupies that port, or the agent runs in a different network context, this silently fails. Consider making these configurable (env vars or constructor options).
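A small sketch of how such configuration could look via environment variables (the variable names `OPENCODE_HOST` and `OPENCODE_PORT` are hypothetical, not existing options):

```typescript
// Hypothetical env var names -- pick whatever fits the project's conventions.
function serverAddress(env: Record<string, string | undefined> = process.env) {
  return {
    hostname: env.OPENCODE_HOST ?? '127.0.0.1',
    port: Number(env.OPENCODE_PORT ?? 4097),
  };
}
```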
**270-270**: Inconsistent indentation on `catch` block.

The closing brace + `catch` on line 270 is missing indentation relative to the `try` on line 131. This is likely just a formatting issue but hurts readability.

src/evaluation/runner.ts (1)
**227-236**: Populate `agentTrace` with agent execution metrics.

`agentResult` contains execution metrics (tokens, cost, duration, tool calls, model) but is discarded after the success check. `CaseResult` has an `agentTrace` field designed for this telemetry — extract `numTurns`, `totalTokens`, `toolsUsed` from `agentResult` and populate it to retain agent execution visibility in results.
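One way the mapping could look, as a sketch (the `AgentResultLike` shape below is assumed from the fields named in this comment, not the project's actual type):

```typescript
// Assumed shape for illustration; the real AgentResult lives in the agent module.
interface AgentResultLike {
  numTurns?: number;
  tokens?: { totalTokens?: number };
  toolCalls?: { name: string }[];
}

// Map agent metrics onto the agentTrace fields named above.
function toAgentTrace(result: AgentResultLike) {
  return {
    numTurns: result.numTurns ?? 1,
    totalTokens: result.tokens?.totalTokens ?? 0,
    toolsUsed: [...new Set((result.toolCalls ?? []).map((c) => c.name))],
  };
}
```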
src/agents/opencode.ts (Outdated)
```typescript
const config = {
  model: 'local-glm/glm-4.7-local-4bit',
  provider: {
    'local-glm': {
      api: 'openai',
      options: {
        baseURL: 'http://127.0.0.1:8081/v1',
        apiKey: 'local-glm-key'
      },
      models: {
        'glm-4.7-local-4bit': {
          name: 'GLM-4.7 Local (4-bit)',
          id: '/Users/studio/models/GLM-4.7-4bit',
          reasoning: false,
          tool_call: true,
          temperature: true,
          limit: {
            context: 32768,
            output: 4096
          },
          cost: {
            input: 0,
            output: 0
          },
          modalities: {
            input: ['text'],
            output: ['text']
          }
        }
      }
    }
  }
};
```
**Hardcoded local-only configuration is not portable and leaks machine-specific paths.**

The entire config block — model name, baseURL, apiKey, and especially the absolute path `/Users/studio/models/GLM-4.7-4bit` — is specific to a single developer's machine. This will break for every other contributor and in CI. The config should be loaded from the Opencode config file (mentioned in the PR description as `~/.config/opencode/opencode.json`) or sourced from environment variables / agent options.
Suggested approach
```diff
- const config = {
-   model: 'local-glm/glm-4.7-local-4bit',
-   provider: {
-     'local-glm': {
-       api: 'openai',
-       options: {
-         baseURL: 'http://127.0.0.1:8081/v1',
-         apiKey: 'local-glm-key'
-       },
-       models: {
-         'glm-4.7-local-4bit': {
-           ...
-         }
-       }
-     }
-   }
- };
+ // Load config from ~/.config/opencode/opencode.json or accept via options
+ const configPath = process.env.OPENCODE_CONFIG
+   || path.join(os.homedir(), '.config', 'opencode', 'opencode.json');
+ const config = JSON.parse(fs.readFileSync(configPath, 'utf-8'));
```

🤖 Prompt for AI Agents
In `@src/agents/opencode.ts` around lines 95 - 127, The config object is hardcoded
with machine-specific values (model: 'local-glm/glm-4.7-local-4bit',
provider.baseURL, provider.apiKey, and the absolute model id
'/Users/studio/models/GLM-4.7-4bit'); replace this by loading settings from the
Opencode config (~/.config/opencode/opencode.json) or environment/agent options
before constructing the config variable so it’s portable. Implement
reading/parsing of the opencode config (or use process.env fallbacks) and
populate the config.model, provider['local-glm'].options.baseURL, apiKey and
provider['local-glm'].models['glm-4.7-local-4bit'].id from those values instead
of hardcoding them; ensure defaults are validated and fallback errors are
handled where config is used (e.g., the config variable and any functions that
consume it).
src/agents/opencode.ts (Outdated)
```typescript
const promptResult = await client.session.prompt({
  path: { id: sessionId },
  body: {
    parts: [{ type: 'text', text: prompt }],
  },
  signal: AbortSignal.timeout(timeoutMs - 5000),
});
```
**`AbortSignal.timeout(timeoutMs - 5000)` can produce a negative or zero timeout.**

If `timeoutMs` is ≤ 5000 (e.g., caller passes a small timeout), this creates an invalid or immediately-firing abort signal. Guard against this.

Proposed fix

```diff
- signal: AbortSignal.timeout(timeoutMs - 5000),
+ signal: AbortSignal.timeout(Math.max(timeoutMs - 5000, 1000)),
```
+ signal: AbortSignal.timeout(Math.max(timeoutMs - 5000, 1000)),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const promptResult = await client.session.prompt({ | |
| path: { id: sessionId }, | |
| body: { | |
| parts: [{ type: 'text', text: prompt }], | |
| }, | |
| signal: AbortSignal.timeout(timeoutMs - 5000), | |
| }); | |
| const promptResult = await client.session.prompt({ | |
| path: { id: sessionId }, | |
| body: { | |
| parts: [{ type: 'text', text: prompt }], | |
| }, | |
| signal: AbortSignal.timeout(Math.max(timeoutMs - 5000, 1000)), | |
| }); |
🤖 Prompt for AI Agents
In `@src/agents/opencode.ts` around lines 159 - 165, The AbortSignal.timeout call
can receive zero/negative values when timeoutMs ≤ 5000, causing an invalid or
immediate abort; update the client.session.prompt invocation (the promptResult
call) to compute a safe timeout before passing a signal — e.g., calculate a
timeoutDuration = timeoutMs - 5000 and only pass signal:
AbortSignal.timeout(timeoutDuration) when timeoutDuration > 0, or else omit the
signal or use a small positive fallback (e.g., Math.max(1, timeoutDuration));
adjust the code around client.session.prompt, sessionId, and timeoutMs
accordingly so the prompt isn't given an invalid timeout.
The SDK's createOpencodeServer spawns `opencode serve` without passing
a cwd, so the server inherits the parent process's working directory.
When the runner passes a temp dir via session.create({ directory }),
the SDK silently returns an empty response for non-project directories.
Fix: spawn the server ourselves with cwd set to the case's working
directory, then connect with createOpencodeClient. Also fixes variable
shadowing, adds null guards on response.info and toolPart.state, uses
PATH-resolved 'opencode' instead of hardcoded /opt/homebrew path, and
adds post-build copy step for the .mjs ESM wrapper.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review feedback — a few things to address before this is ready to merge

**1. bootstrap-003 (python-unit-test) is a no-op test case**

CodeRabbit caught this too — the "buggy" `count_words` already passes every provided test. Fix: Either make the starter implementation actually buggy (e.g., `text.split(' ')`, which fails on multiple spaces) or add tests that genuinely fail.

**2. We're evaluating agent quality, not just test execution**

The rubric right now mostly checks exit codes and basic signals ("No test runner found" still passes, "Pattern check not fully implemented" still passes). For these bootstrap cases to be meaningful benchmarks, we need to evaluate whether the agent's output is good — not just whether the sandbox ran without errors. Think about:

**3. Other CodeRabbit feedback to address**

From the review:

**4. What I fixed**
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
package.json (1)
**1-75**: ⚠️ Potential issue | 🟡 Minor

CI failure: lockfile is out of sync with package.json.

The pipeline reports that lockfile specifiers don't match `package.json`. Run `npm install` (or the project's package manager) to regenerate the lockfile and commit it.
🤖 Fix all issues with AI agents
In `@package.json`:
- Line 52: The package.json currently lists "randombytes" as a direct dependency
but the codebase doesn't use it; remove the "randombytes" entry from
dependencies in package.json (or move it to devDependencies only if used in
scripts/tests), then run your package manager install to let npm/yarn resolve it
transitively (or verify it's no longer installed) and run a quick search for
"randombytes" and imports/requires across the repo (and check `@opencode-ai/sdk`
usage) to ensure it's truly unused before committing.
In `@src/agents/opencode.ts`:
- Around line 286-288: The factory createOpencodeAgent currently accepts only
cliPath causing OpencodeAgent instances to always use the hardcoded default
config; update createOpencodeAgent to accept a config parameter and forward it
into the OpencodeAgent constructor (createOpencodeAgent(cliPath?: string,
config?: OpencodeConfig) -> return new OpencodeAgent(cliPath, config)), or
alternatively remove/avoid the hardcoded default in the OpencodeAgent
constructor and require callers to pass config explicitly; reference
createOpencodeAgent and the OpencodeAgent constructor to implement the chosen
change so instances receive the intended config.
🧹 Nitpick comments (6)
src/agents/opencode-sdk.mjs.d.ts (1)
**5-7**: Consider adding a minimal function signature instead of `any`.

Typing `createOpencodeClient` as `any` discards all type safety at the call site. Even a loose signature like `(options: { baseUrl: string }) => any` would catch basic misuse.

package.json (1)
**12-12**: Build copy step is fragile and not cross-platform.

The raw `cp` command won't work on Windows. Consider using a cross-platform alternative like `cpx`, `cpy-cli`, or a small Node script. Also, this will fail silently if `dist/agents/` doesn't exist yet (though `tsc` should create it).

src/agents/opencode.ts (4)
**28-30**: Global port counter has no upper-bound check and no collision detection.

`nextPort` increments indefinitely without wrapping or verifying availability. If many runs occur (or the process is long-lived), this could exceed valid port ranges. Consider bounding it (e.g., `4097 + (nextPort++ % 16000)`) or picking a random available port.
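The wrapping suggestion above could be sketched like this (the 16000-port span is the reviewer's example, not a project constant):

```typescript
// Wrap the counter so allocated ports stay within [base, base + span).
let nextPort = 0;
function allocatePort(base = 4097, span = 16000): number {
  return base + (nextPort++ % span);
}
```

This keeps ports in a valid range but still doesn't detect collisions; binding to port 0 and reading the assigned port remains the more robust option.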
**49-80**: After the URL is resolved, the `exit` listener can still fire and call `reject` on a settled promise.

While this is a no-op in Node.js (settled promises ignore subsequent resolve/reject), it's a code smell. Consider adding a `resolved` flag or removing the listeners after the URL is found to make the intent clearer.
**238-245**: Use `?? 0` instead of `|| 0` for numeric fallbacks.

`|| 0` treats any falsy value (including legitimate `0`) as missing. While `0 || 0` happens to produce `0`, using `?? 0` (nullish coalescing) is more semantically correct and future-proof for cases where you might check against the fallback behavior.
**280-282**: Consider forceful kill if the server doesn't terminate promptly.

`serverProc?.kill()` sends `SIGTERM`, but a hung server could linger. A common pattern is to set a short timer and follow up with `SIGKILL` if the process hasn't exited.
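A sketch of that escalation pattern (the 2-second grace period is an arbitrary illustration):

```typescript
import type { ChildProcess } from 'node:child_process';

// Send SIGTERM first; escalate to SIGKILL if the process hasn't exited in time.
function killWithFallback(proc: ChildProcess, graceMs = 2000): void {
  proc.kill('SIGTERM');
  const timer = setTimeout(() => {
    // exitCode/signalCode are both null while the process is still running.
    if (proc.exitCode === null && proc.signalCode === null) {
      proc.kill('SIGKILL');
    }
  }, graceMs);
  timer.unref?.(); // don't keep the event loop alive just for the fallback
  proc.once('exit', () => clearTimeout(timer));
}
```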
package.json (Outdated)
```json
"commander": "^12.0.0",
"dockerode": "^4.0.2",
"ora": "^8.0.0",
"randombytes": "^2.1.0",
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Check if randombytes is directly imported/required in source code
rg -n "randombytes" --type=ts --type=js -g '!node_modules'
```

Repository: AnswerLayer/sniffbench
Length of output: 48
🏁 Script executed:

```shell
# Also check the package.json to see if randombytes is listed as a dependency of @opencode-ai/sdk
cat package.json | grep -A 10 -B 10 "randombytes"
```

Repository: AnswerLayer/sniffbench
Length of output: 664
🏁 Script executed:

```shell
# Search for any reference to 'randombytes' across the entire codebase (excluding node_modules)
rg "randombytes" -g '!node_modules'
```

Repository: AnswerLayer/sniffbench
Length of output: 107
**Remove `randombytes` if unused**

This dependency is not directly used anywhere in the codebase. If it's a transitive dependency from `@opencode-ai/sdk`, remove it from direct dependencies and let npm resolve it automatically. If it's not needed at all, remove it entirely.
🤖 Prompt for AI Agents
In `@package.json` at line 52, The package.json currently lists "randombytes" as a
direct dependency but the codebase doesn't use it; remove the "randombytes"
entry from dependencies in package.json (or move it to devDependencies only if
used in scripts/tests), then run your package manager install to let npm/yarn
resolve it transitively (or verify it's no longer installed) and run a quick
search for "randombytes" and imports/requires across the repo (and check
`@opencode-ai/sdk` usage) to ensure it's truly unused before committing.
```typescript
export function createOpencodeAgent(cliPath?: string): OpencodeAgent {
  return new OpencodeAgent(cliPath);
}
```
**Factory function doesn't forward config, so all instances get the hardcoded default.**

`createOpencodeAgent` accepts only `cliPath` but not `config`. The registry will always create agents with the non-portable default config. Either pass `config` through the factory or remove the default config from the constructor.

Proposed fix

```diff
-export function createOpencodeAgent(cliPath?: string): OpencodeAgent {
-  return new OpencodeAgent(cliPath);
+export function createOpencodeAgent(cliPath?: string, config?: Record<string, any>): OpencodeAgent {
+  return new OpencodeAgent(cliPath, config);
 }
```

🤖 Prompt for AI Agents
In `@src/agents/opencode.ts` around lines 286 - 288, The factory
createOpencodeAgent currently accepts only cliPath causing OpencodeAgent
instances to always use the hardcoded default config; update createOpencodeAgent
to accept a config parameter and forward it into the OpencodeAgent constructor
(createOpencodeAgent(cliPath?: string, config?: OpencodeConfig) -> return new
OpencodeAgent(cliPath, config)), or alternatively remove/avoid the hardcoded
default in the OpencodeAgent constructor and require callers to pass config
explicitly; reference createOpencodeAgent and the OpencodeAgent constructor to
implement the chosen change so instances receive the intended config.
The rubric was giving false passes in multiple places:

- Test command chain fell back to `echo` (exit 0) when no runner found
- Pattern evaluator stub returned passed:true
- Unknown evaluator types returned passed:true
- Criteria with only optional evaluators scored 1.0 (perfect)

Now: unimplemented evaluators fail by default, test fallback exits 1, and criteria with only optional evaluators are excluded from the overall score calculation rather than getting a free perfect score.

Also adds HANDOFF.md with debugging notes, SDK response reference, and remaining TODO items for the other agent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/rubrics/defaults.ts (1)
**89-93**: ⚠️ Potential issue | 🟡 Minor

Pre-existing bug: `find -name` does not support brace expansion — this pattern never matches any files.

`find . -name "*.{js,ts,py}"` passes the literal string `*.{js,ts,py}` to `find`, which won't match anything. Brace expansion is a shell feature, not supported by `find -name`. This evaluator silently succeeds (no files found → `awk` gets no input → exit 0), so the "Reasonable file sizes" check is effectively a no-op. Since it's `optional: true`, it doesn't affect scoring, but it's still dead logic.

Fix would be to use multiple `-name` clauses with `-o`:

Suggested fix

```diff
- run: 'find . -name "*.{js,ts,py}" -exec wc -l {} + 2>/dev/null | awk \'$1 > 1000 {exit 1}\' || true',
+ run: 'find . \\( -name "*.js" -o -name "*.ts" -o -name "*.py" \\) -exec wc -l {} + 2>/dev/null | awk \'$1 > 1000 {exit 1}\' || true',
```
🤖 Fix all issues with AI agents
In `@HANDOFF.md`:
- Around line 67-82: The fenced code block showing the promptResult.data
structure lacks a language specifier; update the opening fence to include a
language hint (e.g., use "json" or "jsonc") so the block becomes ```jsonc and
enables proper syntax highlighting and linter validation for the
promptResult.data example; no other content changes are needed—the rest of the
block (the promptResult.data object and its fields like info, parts, tokens,
etc.) should remain unchanged.
In `@src/evaluation/runner.ts`:
- Around line 218-236: agent.run(...) returns an AgentResult with metrics that
are currently discarded; update the code to attach agentResult (or at minimum
selected fields like tokens, costUsd, toolCalls, model, durationMs) to the
CaseResult returned by the runner (extend the CaseResult type/interface and
populate it after the agent.run call); also change the error throw to avoid
"undefined" by including a fallback (e.g., use agentResult.error ?? 'unknown
error') when throwing or logging the failure; reference getAgent, agent.run,
AgentResult and CaseResult when making these edits and ensure options.onProgress
continues to be called with progress updates.
- Around line 334-348: The pattern evaluator block (evaluator.type ===
'pattern') currently returns a hardcoded failure; implement real matching
against evaluator.failIfMatch patterns and the available files (references:
strictRubric contains the pattern rules, evaluator.failIfMatch, files, and
evalResult). Iterate evaluator.failIfMatch (treat entries as regex strings or
plain substrings), scan each entry against each file's content in files, collect
any matches (file name + matched pattern), and set evalResult.passed = false,
score = 0.0 and evidence to a summary of matches if any are found; if no
matches, set evalResult.passed = true, score = 1.0 and evidence = 'No pattern
matches found'. Ensure you handle invalid regexes safely (fallback to substring
match) and include the matched file names in the evidence.
🧹 Nitpick comments (2)
src/rubrics/defaults.ts (1)
**139-146**: Inconsistency: `strictRubric` correctness command lacks the `|| exit 1` fallback.

The `defaultRubric` and `minimalRubric` both use `|| exit 1` to explicitly fail when no runner succeeds, but the strict rubric relies on the last command's natural exit code. While `go test ./...` will return non-zero if Go isn't installed, the behavior is subtly different (noisy stderr, and if the last runner happens to be installed but returns 0 for "no test files," it could false-pass). Consider aligning with the other rubrics for consistency.

Suggested fix

```diff
- run: 'npm test || pytest || go test ./...',
+ run: 'npm test 2>/dev/null || pytest 2>/dev/null || go test ./... 2>/dev/null || exit 1',
```

src/evaluation/runner.ts (1)
366-390: Scoring normalization logic looks correct — good improvement.Excluding criteria with only optional evaluators from weight normalization prevents phantom 100% scores. The
_totalWeight > 0 ? ... : 0guard at line 390 handles the edge case where no criteria participate.One nit: the
_totalWeightvariable name (line 289) uses an underscore prefix, which conventionally signals an intentionally unused variable. Since it's actively used on lines 385 and 390, this is misleading. Consider renaming toparticipatingWeightor justtotalWeight.Suggested rename
- let _totalWeight = 0; + let participatingWeight = 0;And update lines 385 and 390 accordingly.
HANDOFF.md (outdated)

```
promptResult.data = {
  info: {
    id, sessionID, role, time: { created, completed },
    modelID, providerID, cost,
    tokens: { total, input, output, reasoning, cache: { read, write } },
    finish: "stop"
  },
  parts: [
    { type: "step-start", snapshot },
    { type: "reasoning", text, time: { start, end } },
    { type: "text", text, time: { start, end } },
    { type: "tool", tool, callID, state: { status: "pending"|"completed" } },
    { type: "step-finish", reason, snapshot, cost, tokens }
  ]
}
```
Fenced code block missing language specifier.
The code block starting at line 67 should have a language identifier. Since this represents a JSON-like data structure, consider adding a language hint for syntax highlighting and to satisfy linting.
Suggested fix: change the opening fence from `` ``` `` to `` ```jsonc `` so the `promptResult.data` block gets syntax highlighting.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 67-67: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@HANDOFF.md` around lines 67 - 82, The fenced code block showing the
promptResult.data structure lacks a language specifier; update the opening fence
to include a language hint (e.g., use "json" or "jsonc") so the block becomes
```jsonc and enables proper syntax highlighting and linter validation for the
promptResult.data example; no other content changes are needed—the rest of the
block (the promptResult.data object and its fields like info, parts, tokens,
etc.) should remain unchanged.
    // Run the agent to attempt to solve the case
    options.onProgress?.({
      type: 'running',
      caseId: caseData.id,
      caseIndex,
      totalCases,
      message: 'Running agent...',
    });

    const agent = getAgent(options.agent);
    const agentResult: AgentResult = await agent.run(caseData.prompt, {
      cwd: tempDir,
      timeoutMs: (options.timeoutSeconds || 300) * 1000,
      permissionMode: 'acceptEdits',
    });

    if (!agentResult.success) {
      throw new Error(`Agent execution failed: ${agentResult.error}`);
    }
Agent result data (tokens, cost, tool calls, model) is discarded — only success/failure is used.
agentResult contains valuable benchmarking data (tokens, costUsd, toolCalls, model, durationMs, etc.) but is never propagated to CaseResult. For a benchmarking tool, this data loss seems significant. Consider extending CaseResult to carry agentResult or at least the key metrics.
Additionally, agentResult.error is typed as string | undefined (see AgentResult in src/agents/types.ts), so line 235 could produce "Agent execution failed: undefined".
Suggested fix for the error message
      if (!agentResult.success) {
    -   throw new Error(`Agent execution failed: ${agentResult.error}`);
    +   throw new Error(`Agent execution failed: ${agentResult.error || 'unknown error'}`);
      }

🤖 Prompt for AI Agents
In `@src/evaluation/runner.ts` around lines 218 - 236, agent.run(...) returns an
AgentResult with metrics that are currently discarded; update the code to attach
agentResult (or at minimum selected fields like tokens, costUsd, toolCalls,
model, durationMs) to the CaseResult returned by the runner (extend the
CaseResult type/interface and populate it after the agent.run call); also change
the error throw to avoid "undefined" by including a fallback (e.g., use
agentResult.error ?? 'unknown error') when throwing or logging the failure;
reference getAgent, agent.run, AgentResult and CaseResult when making these
edits and ensure options.onProgress continues to be called with progress
updates.
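A minimal TypeScript sketch of the change this prompt describes. The `AgentResult` and `CaseResult` shapes below are assumptions pieced together from the review, not the actual types in `src/agents/types.ts`:

```typescript
// Hypothetical shapes, assumed from the review discussion; the real types
// live in the repo's src/agents/types.ts and evaluation types.
interface AgentResult {
  success: boolean;
  error?: string;
  tokens?: { input: number; output: number };
  costUsd?: number;
  toolCalls?: number;
  model?: string;
  durationMs?: number;
}

interface CaseResult {
  caseId: string;
  passed: boolean;
  // New optional fields carrying the agent's metrics:
  agentTokens?: { input: number; output: number };
  agentCostUsd?: number;
  agentToolCalls?: number;
  agentModel?: string;
  agentDurationMs?: number;
}

function attachAgentMetrics(result: CaseResult, agentResult: AgentResult): CaseResult {
  if (!agentResult.success) {
    // Fallback avoids "Agent execution failed: undefined"
    throw new Error(`Agent execution failed: ${agentResult.error ?? 'unknown error'}`);
  }
  return {
    ...result,
    agentTokens: agentResult.tokens,
    agentCostUsd: agentResult.costUsd,
    agentToolCalls: agentResult.toolCalls,
    agentModel: agentResult.model,
    agentDurationMs: agentResult.durationMs,
  };
}
```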
      } else if (evaluator.type === 'pattern') {
        // Run pattern evaluator (check for matches in files)
    -   // For now, just pass - full implementation will use grep/find
    +   // Default to fail until fully implemented
        evalResult = {
    -     passed: true,
    -     score: 1.0,
    -     evidence: 'Pattern check not fully implemented',
    +     passed: false,
    +     score: 0.0,
    +     evidence: 'Pattern check not yet implemented',
        };
      } else {
    -   // Other evaluator types (llm_judge, benchmark, etc.) - placeholder
    +   // Other evaluator types (llm_judge, benchmark, etc.) - not implemented
        evalResult = {
    -     passed: true,
    -     score: 1.0,
    -     evidence: 'Evaluator type not yet implemented',
    +     passed: false,
    +     score: 0.0,
    +     evidence: `Evaluator type '${evaluator.type}' not yet implemented`,
        };
🧩 Analysis chain
🏁 Script executed:
# Check for pattern evaluators in defaults.ts and their optional status
rg -n "type: 'pattern'" -B2 -A8 src/rubrics/defaults.ts | head -100

Repository: AnswerLayer/sniffbench
Length of output: 997
🏁 Script executed:
# Also search more broadly for strictRubric definition
rg -n "strictRubric" -A50 src/rubrics/defaults.ts | head -150

Repository: AnswerLayer/sniffbench
Length of output: 1792
🏁 Script executed:
# Check the current implementation of pattern evaluator in runner.ts
rg -n "type === 'pattern'" -B5 -A10 src/evaluation/runner.ts

Repository: AnswerLayer/sniffbench
Length of output: 670
🏁 Script executed:
# Check safety criterion weight in strictRubric
rg -n "safety:" -A10 src/rubrics/defaults.ts | grep -A10 "strictRubric" | head -20

Repository: AnswerLayer/sniffbench
Length of output: 48
🏁 Script executed:
# Verify if pattern evaluator implementation actually exists elsewhere in runner.ts
rg -n "failIfMatch\|patternMatch\|pattern.*match" src/evaluation/runner.ts

Repository: AnswerLayer/sniffbench
Length of output: 48
🏁 Script executed:
# Get the full safety criterion section from strictRubric
sed -n '178,204p' src/rubrics/defaults.ts

Repository: AnswerLayer/sniffbench
Length of output: 874
🏁 Script executed:
# Check if pattern evaluator is implemented at all by looking for execution logic
rg -n "failIfMatch" src/evaluation/runner.ts

Repository: AnswerLayer/sniffbench
Length of output: 48
Pattern evaluators in strictRubric are non-optional but unimplemented, permanently failing the safety criterion (20% weight).
The strictRubric defines two non-optional pattern evaluators in the safety criterion (lines 190–201):
- "No hardcoded secrets" (line 190)
- "No dangerous functions" (line 197)
Neither is marked optional: true. However, the pattern evaluator in runner.ts (line 334–341) only returns passed: false, score: 0.0 with a stub message. This means these evaluators always fail regardless of actual code quality, reducing the safety criterion to a guaranteed 0% score (20% of the overall strict rubric).
Implement the pattern evaluator logic (it requires matching failIfMatch patterns against files listed in files) or mark these evaluators as optional: true in the strict rubric until implementation is ready.
🤖 Prompt for AI Agents
In `@src/evaluation/runner.ts` around lines 334 - 348, The pattern evaluator block
(evaluator.type === 'pattern') currently returns a hardcoded failure; implement
real matching against evaluator.failIfMatch patterns and the available files
(references: strictRubric contains the pattern rules, evaluator.failIfMatch,
files, and evalResult). Iterate evaluator.failIfMatch (treat entries as regex
strings or plain substrings), scan each entry against each file's content in
files, collect any matches (file name + matched pattern), and set
evalResult.passed = false, score = 0.0 and evidence to a summary of matches if
any are found; if no matches, set evalResult.passed = true, score = 1.0 and
evidence = 'No pattern matches found'. Ensure you handle invalid regexes safely
(fallback to substring match) and include the matched file names in the
evidence.
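The prompt above can be sketched as follows. The `PatternEvaluator` and `Files` shapes are hypothetical; only the `failIfMatch` semantics (regex with substring fallback, fail on any match, evidence naming the files) come from the review:

```typescript
// Sketch of the pattern evaluator described in the prompt above.
// Evaluator and file shapes are assumptions, not the repo's real types.
interface PatternEvaluator { failIfMatch: string[] }
type Files = Record<string, string>; // file name -> content

function evaluatePatterns(evaluator: PatternEvaluator, files: Files) {
  const matches: string[] = [];
  for (const pattern of evaluator.failIfMatch) {
    let test: (content: string) => boolean;
    try {
      const re = new RegExp(pattern);
      test = (content) => re.test(content);
    } catch {
      // Invalid regex: fall back to a plain substring check
      test = (content) => content.includes(pattern);
    }
    for (const [name, content] of Object.entries(files)) {
      if (test(content)) matches.push(`${name}: ${pattern}`);
    }
  }
  return matches.length > 0
    ? { passed: false, score: 0.0, evidence: `Matched forbidden patterns: ${matches.join('; ')}` }
    : { passed: true, score: 1.0, evidence: 'No pattern matches found' };
}
```

This would let the non-optional safety evaluators in `strictRubric` pass or fail on actual file content instead of failing unconditionally.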
- Regenerate pnpm-lock.yaml to include @opencode-ai/sdk and randombytes (fixes CI frozen-lockfile failure)
- Add `python *.test.py` and `python *_test.py` to test runner chain so Python test files with non-pytest naming are discovered
- Fix syntax errors in minimalRubric (missing braces/indentation)

Generated with claude-opus-4-6
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/rubrics/defaults.ts`:
- Line 33: The run command string in defaults.ts uses shell globs "python
*.test.py" and "python *_test.py" which only execute the first matched file;
update the run command(s) (the string assigned to run in
src/rubrics/defaults.ts) to either loop over matching files (e.g., iterate
matches and run each with python) or, preferably, remove the manual python glob
entries and rely on pytest (which already discovers tests) — also apply the same
change to the other occurrences referenced in the comment (the similar run
strings on lines around 117 and 141) so multiple Python test files are executed
correctly.
🧹 Nitpick comments (1)
src/rubrics/defaults.ts (1)
141-141: `strictRubric` test command can false-pass on a wrong language's tests.

The `||`-chain means if `npm test` fails (JS tests broken), the shell falls through to `python *.test.py` / `pytest` / `go test`. A project that has both JS and Python files could report success from the Python runner while JS tests are actually failing. This applies to the other rubrics too, but is especially surprising under a "strict" label.

This is inherent to the auto-discovery approach and may be acceptable for now, but worth noting as a design limitation given the reviewer feedback about false positives in the PR discussion.
src/rubrics/defaults.ts
Outdated
| type: 'command', | ||
| name: 'Tests pass', | ||
| run: 'npm test 2>/dev/null || pytest 2>/dev/null || go test ./... 2>/dev/null || echo "No test runner found"', | ||
| run: 'npm test 2>/dev/null || python *.test.py 2>/dev/null || python *_test.py 2>/dev/null || pytest 2>/dev/null || go test ./... 2>/dev/null || exit 1', |
Shell glob python *.test.py only executes the first matching file.
When multiple test files match *.test.py, the shell expands the glob into positional arguments: python a.test.py b.test.py .... Python treats only the first as the script; the rest become sys.argv entries and are never executed. The same issue applies to *_test.py. This affects lines 117 and 141 as well.
Consider iterating over matches, or relying on pytest (which already discovers both patterns) and dropping the manual python glob steps:
Option A: loop over matched files
- run: 'npm test 2>/dev/null || python *.test.py 2>/dev/null || python *_test.py 2>/dev/null || pytest 2>/dev/null || go test ./... 2>/dev/null || exit 1',
+ run: 'npm test 2>/dev/null || for f in *.test.py *_test.py; do [ -f "$f" ] && python "$f"; done 2>/dev/null || pytest 2>/dev/null || go test ./... 2>/dev/null || exit 1',

Option B: rely on pytest for Python discovery (simpler)
- run: 'npm test 2>/dev/null || python *.test.py 2>/dev/null || python *_test.py 2>/dev/null || pytest 2>/dev/null || go test ./... 2>/dev/null || exit 1',
+ run: 'npm test 2>/dev/null || pytest 2>/dev/null || go test ./... 2>/dev/null || exit 1',
🤖 Prompt for AI Agents
In `@src/rubrics/defaults.ts` at line 33, The run command string in defaults.ts
uses shell globs "python *.test.py" and "python *_test.py" which only execute
the first matched file; update the run command(s) (the string assigned to run in
src/rubrics/defaults.ts) to either loop over matching files (e.g., iterate
matches and run each with python) or, preferably, remove the manual python glob
entries and rely on pytest (which already discovers tests) — also apply the same
change to the other occurrences referenced in the comment (the similar run
strings on lines around 117 and 141) so multiple Python test files are executed
correctly.
…uded

When criteria with only optional evaluators are excluded from scoring, the overall score was calculated as (weightedSum / totalWeight) * 100, but weightedSum already incorporates a /100 factor. This caused scores to appear as 1% when they should be 100%.

Fix: normalize by the participating fraction of total weight so excluded criteria don't dilute the score.

Generated with claude-opus-4-6
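The normalization this commit describes can be illustrated with a small sketch; the criterion shape and the 0-1 score scale here are assumptions for illustration, not the runner's actual types:

```typescript
// Sketch of the scoring fix: criteria whose evaluators are all optional are
// excluded, and the overall score is normalized by the weight that actually
// participated, so excluded criteria don't dilute the result.
interface Criterion { weight: number; score: number; participates: boolean }

function overallScore(criteria: Criterion[]): number {
  let weightedSum = 0;
  let participatingWeight = 0;
  for (const c of criteria) {
    if (!c.participates) continue; // skip optional-only criteria
    weightedSum += c.score * c.weight;
    participatingWeight += c.weight;
  }
  // Guard against the case where no criteria participate at all
  return participatingWeight > 0 ? (weightedSum / participatingWeight) * 100 : 0;
}
```

With this normalization, a perfect score on the participating 80% of weight yields 100 rather than 80.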
- Add --model option to sniff run, passed through to agent
- Save agentResponse, agentToolCalls, agentModel, and agentTokens in the CaseResult so the actual agent output is persisted in the results JSON alongside the evaluation scores

Generated with claude-opus-4-6
Comprehensive handoff covering SDK fix, rubric fixes, what's next (event streaming, file snapshots), and technical reference. Generated with claude-opus-4-6
Switch from session.prompt() to promptAsync() + event.subscribe() (SSE) to capture intermediate tool calls, text deltas, and reasoning in real-time. agentToolCalls will now be populated with actual tool usage. Add file snapshots: after the agent runs, walk the workspace directory and capture all files with content and changed flag (compared against original case files). Results JSON now includes agentFiles array. claude-opus-4-6
Remove historical narrative and deprecated SDK references. Keep only actionable items, technical gotchas, and conventions. claude-opus-4-6
The SDK's event.subscribe() returns ServerSentEventsResult with the stream on .stream, not .data. Try multiple paths for resilience. claude-opus-4-6
Add DEFAULT_AGENT constant in agents/registry.ts and replace all hardcoded 'claude-code' defaults across CLI commands. claude-opus-4-6
Route closed-issues runner through the agent wrapper system instead of shelling out directly to claude CLI. Supports opencode, claude-code, and any other registered agent. claude-opus-4-6
All JavaScript bootstrap cases were failing because the test command chain only tried npm test then Python/Go runners. Add node *.test.js as a fallback. Update HANDOFF.md with full test results and analysis. claude-opus-4-6
The case loader was picking either .sniffbench/cases/ OR the built-in cases/ directory, not both. When closed-issues cases existed in .sniffbench/cases/, all bootstrap cases were invisible. Now loadCases() accepts string | string[], getDefaultCasesDirs() returns both directories, and cases are deduplicated by ID (project-specific takes precedence). claude-opus-4-6
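The dedup-by-ID merge described above can be sketched as follows (the `Case` shape is assumed; only the precedence rule comes from the commit message):

```typescript
// Sketch: merge built-in and project-specific cases, deduplicating by ID.
// Project-specific cases (.sniffbench/cases/) take precedence over built-ins.
interface Case { id: string }

function mergeCases(projectCases: Case[], builtinCases: Case[]): Case[] {
  const byId = new Map<string, Case>();
  for (const c of builtinCases) byId.set(c.id, c);
  for (const c of projectCases) byId.set(c.id, c); // overwrite on ID collision
  return [...byId.values()];
}
```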
- Remove unused randombytes dependency from package.json
- Remove redundant allowSyntheticDefaultImports from tsconfig (implied
by esModuleInterop)
- Remove redundant "dist/**/*" from tsconfig exclude (covered by "dist")
- Fix broken brace expansion in find command (defaults.ts maintainability
check) — find -name doesn't support {js,ts,py} syntax
- Align strictRubric test command with default/minimal rubrics: add
node *.test.js, 2>/dev/null suppression, and || exit 1 fallback
- Make bootstrap-003 starter code actually buggy (split(' ') instead of
split()) so it's not a no-op pass
- Fix CSV parser test data leading whitespace from YAML indentation
using textwrap.dedent()
claude-opus-4-6
Dead code after switching closed-issues runner to agent wrappers. claude-opus-4-6
Summary
Adds a new opencode agent to sniffbench that uses the official Opencode SDK for programmatic interaction. The implementation follows the same pattern as the claude-code agent but uses opencode's SDK instead.
Changes
- `src/agents/opencode.ts` - Agent wrapper using `@opencode-ai/sdk` (v1.1.65)
- `src/agents/opencode-sdk.mjs` - ESM wrapper for SDK imports (handled ESM-only module compatibility)
- `src/agents/opencode-sdk.mjs.d.ts` - TypeScript type declarations for wrapper
- `src/agents/registry.ts` - Registered opencode agent

Implementation Details
The agent:
- Uses the config at `/Users/studio/.config/opencode/opencode.json`
- Calls the `session.create()` and `session.prompt()` methods
- Returns the `AssistantMessage` response
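The session flow above can be sketched as follows. Only `session.create()` and `session.prompt()` are named by this PR; the client parameter, the prompt-call argument names, and the response handling below are assumptions based on the `promptResult.data` shape shown in HANDOFF.md, not the documented `@opencode-ai/sdk` API:

```typescript
// Hedged sketch of the session flow. The client is typed `any` because the
// actual SDK client type and call signatures are not confirmed by this PR.
async function runPrompt(client: any, prompt: string): Promise<string> {
  const session = await client.session.create();
  const promptResult = await client.session.prompt({
    sessionID: session.id, // assumed argument names
    prompt,
  });
  // Per HANDOFF.md, promptResult.data.parts holds text/tool/reasoning parts;
  // concatenate the text parts to recover the assistant's reply.
  const textParts = (promptResult.data?.parts ?? [])
    .filter((p: any) => p.type === 'text')
    .map((p: any) => p.text);
  return textParts.join('');
}
```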
`sniff run --agent opencode --cases simple-001` - PASSED
New Features
Bug Fixes
Chores
Documentation