ralph: #29 — Implement LLM-judge evaluator for answer comparison by jharris1679 · Pull Request #56 · AnswerLayer/sniffbench

jharris1679 · 2026-02-16T17:04:30Z

Issue

Closes #29

Status: ✓ verified

Build, tests, and lint all pass locally.

Summary

Automated implementation by Ralph (rlmkit + MiniMax M2.5).
Review the changes carefully — this was generated by a local model.

Summary by CodeRabbit

Release Notes

New Features
- Added OpenCode agent for server-managed code execution with event streaming
- Introduced LLM-based judge for AI-powered evaluation and answer comparison
- Implemented rubric-driven evaluation system with weighted criteria and flexible evaluators
- Enhanced evaluation runner with comprehensive case metrics and sandbox isolation
Improvements
- Improved type safety across codebase with stricter type annotations
- Standardized result structures and field naming for consistency

- Fix null type assignments in llm-judge.ts return statements - Add null checks before returning LLM judge results - Remove unused runLLMJudgeComparisonEvaluator import from runner.ts - Fix missing variable declarations (evalDurationMs, evalStartTime) in runner.ts - All build, test, and lint checks now pass

coderabbitai · 2026-02-16T17:04:46Z

Walkthrough

This PR enhances type safety by replacing any with unknown in the Opencode agent, introduces an LLM-based judge for evaluating answers via Claude API with caching and cost tracking, redesigns the evaluation result schema around a rubric-centric model with weighted criteria, and updates the evaluation runner and CLI to use the new field names and structures.

Changes

Cohort / File(s)	Summary
Type Safety in Opencode `src/agents/opencode-sdk.mjs.d.ts`, `src/agents/opencode.ts`	Changed `createOpencodeClient` and internal type annotations from `any` to `unknown`; updated `OpencodeAgent` constructor and config from `Record<string, any>` to `Record<string, unknown>` with temporary variable naming (_url, _serverProc) for clearer resource management.
LLM Judge Evaluation `src/evaluation/llm-judge.ts`	Introduces new module with Claude API integration for single and comparative answer evaluation, structured score/reasoning output, caching support, and cost tracking; exports LLMJudge class, LLMJudgeScore, ComparisonResult interfaces, and evaluator utility functions.
Evaluation Schema Redesign `src/cases/types.ts`	Replaces ad-hoc rubric pieces with RubricCriterion and RubricReference interfaces; consolidates evaluator base with mandatory name field; expands result model (EvaluatorResult, CriterionResult, CaseResult, RunResult, RunSummary) to support richer grading, criterion-level evidence, and rubric-driven metadata.
Evaluation Runner Updates `src/evaluation/runner.ts`	Refactors result shapes—RunResult uses id/timestamp/cases instead of runId/startedAt/caseResults; CaseResult replaces caseId with id and adds evidence/criteria/evaluators; wires evaluators with duration measurements; adds placeholder handling for llm_judge/llm_judge_comparison.
CLI Command Adjustments `src/cli/commands/run.ts`	Updates per-case result messages to use result.id instead of result.caseId; treats timedOut display with null coalescing checks and non-null verification.
Reference Implementations `src/agents/opencode.ts.bak`, `src/agents/opencode.ts.bak2`, `src/evaluation/llm-judge.ts.bak`, `src/evaluation/runner.ts.bak`, `src/evaluation/runner.ts.orig`	Backup and reference files documenting Opencode agent orchestration, LLM judge implementation, and evaluation runner workflow for future reference.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Evaluation Client
    participant Runner as Evaluation Runner
    participant Sandbox as Sandbox/Agent
    participant Judge as LLM Judge
    participant Claude as Claude API
    participant Cache as Cache Store

    Client->>Runner: runCases(cases, options)
    
    loop For each case
        Runner->>Sandbox: Execute agent with prompt
        Sandbox-->>Runner: agentResult, agentFiles
        
        alt Rubric with LLM Judge Evaluator
            Runner->>Judge: evaluate(criteria, answer, context)
            
            Judge->>Cache: Check cache key
            alt Cache hit
                Cache-->>Judge: Cached LLMJudgeScore
            else Cache miss
                Judge->>Claude: POST prompt with criteria & answer
                Claude-->>Judge: JSON response (score, reasoning, etc.)
                Judge->>Judge: parseResponse() → LJMJudgeScore
                Judge->>Cache: Store result
            end
            
            Judge-->>Runner: LJMJudgeScore {score, passed, reasoning}
        else Standard Evaluators (Command/Pattern)
            Runner->>Runner: Execute evaluator logic
        end
        
        Runner->>Runner: Aggregate criterion results & score
        Runner-->>Client: onEvent(case_complete) with CaseResult
    end
    
    Runner->>Runner: Compute RunSummary
    Runner-->>Client: RunResult {id, timestamp, cases, summary}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Add opencode agent using @opencode-ai/sdk #48: Added Opencode agent implementation and SDK wrapper (createOpencodeClient export)—this PR tightens its type safety and internal structure.
feat(runs): implement run tracking and variant registration (ANS-455) #36: Modified evaluation runner result shapes and case result wiring—this PR builds on those structural changes with the new rubric schema.
Feature/15 claude agent sdk wrapper #31: Introduced Claude SDK integration for LLM-based agents—this PR extends that pattern to evaluator-level judgments.

Poem

🐰 A judge fair and wise, powered by Claude's light,
Evaluates answers with reasoning precise and bright.
Rubrics weighted, criteria clear,
From unknown lands, type safety draws near!
With caches humming and scores flowing free,
Our evaluation pipeline blooms with glee! 🌱✨

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Out of Scope Changes check	⚠️ Warning	The PR includes changes to types.ts, runner.ts, opencode agents, and CLI that go beyond the LLM judge implementation outlined in issue `#29`.	Clarify whether type schema restructuring and agent/runner changes are required for LLM judge integration or should be separate PRs for modularity and review clarity.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly identifies the implementation of an LLM-judge evaluator and references issue `#29`, accurately summarizing the main change.
Linked Issues check	✅ Passed	The PR implements all core requirements from issue `#29`: LLM judge class with evaluate/compare/evaluateAgainstBaseline methods, Claude API integration, structured outputs, cost tracking, and caching support.
Docstring Coverage	✅ Passed	Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Merge Conflict Detection	✅ Passed	✅ No merge conflicts detected when merging into `main`

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch ralph/issue-29

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

coderabbitai

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

src/agents/opencode.ts (1)

97-120: ⚠️ Potential issue | 🟠 Major

Hardcoded machine-specific model path in default config.

Line 109 references /Users/studio/models/GLM-4.7-4bit — a local macOS filesystem path. This will fail on any other machine. Consider making this configurable or documenting it as a required override.
src/cli/commands/run.ts (1)
119-119: ⚠️ Potential issue | 🔴 Critical

result.runId is undefined — the runner returns id, not runId.

The runCases function (in runner.ts.orig line 163) sets id: runId but never sets a runId property. Since RunResult.runId is optional, result.runId will be undefined, producing "Run ID: undefined" in the output and a filename of undefined.json.
-      `Run ID: ${chalk.cyan(result.runId)}`,
+      `Run ID: ${chalk.cyan(result.id)}`,
-    const outputFile = path.join(outputDir, `${result.runId}.json`);
+    const outputFile = path.join(outputDir, `${result.id}.json`);
Also applies to: 135-135
src/evaluation/runner.ts (2)
394-405: ⚠️ Potential issue | 🟠 Major

allPassed includes optional evaluators, contradicting the scoring logic.

Lines 394-397 correctly exclude optional evaluators from the score, but Line 405 counts all evaluator results (including optional ones) when determining passed. An unimplemented optional evaluator (passed: false) will mark the entire criterion as failed even though it doesn't affect the score.
🐛 Proposed fix — filter optional evaluators from the pass check
-    const allPassed = evaluatorResults.filter((e) => !e.passed).length === 0;
+    const requiredResults = evaluatorResults.filter(
+      (_, idx) => !criterion.evaluators[idx].optional
+    );
+    const allPassed = requiredResults.length === 0 || requiredResults.every((e) => e.passed);
191-199: ⚠️ Potential issue | 🟡 Minor

Path traversal risk — file.path is not validated before writing.

file.path from case data is joined directly onto tempDir without sanitization. A malicious or misconfigured case file with a path like ../../etc/cron.d/exploit would write outside the temp directory.
🛡️ Proposed fix — validate resolved path stays within tempDir
       for (const file of caseData.files) {
         const filePath = path.join(tempDir, file.path);
+        // Guard against path traversal
+        if (!filePath.startsWith(tempDir + path.sep) && filePath !== tempDir) {
+          throw new Error(`Case file path escapes temp directory: ${file.path}`);
+        }
         const fileDir = path.dirname(filePath);

🤖 Fix all issues with AI agents

In `@src/agents/opencode.ts`:
- Around line 340-347: The message.updated handler unconditionally appends text
from props.parts into the answer, which can duplicate content already appended
by the streaming delta handler; add a guard so props.parts is only merged when
no streaming deltas were received (or when answer is still empty). Concretely,
introduce or reuse a boolean flag set to true in the streaming-delta processing
code (the code that appends to answer from deltas) such as hasStreamedDeltas or
receivedDeltas, then in the message.updated block that reads props.parts only
append when that flag is false (or answer is empty), and ensure the flag is
initialized/cleared appropriately so fallback behavior still works when there
were no streaming deltas.

In `@src/agents/opencode.ts.bak`:
- Around line 1-437: Delete the backup files named opencode.ts.bak,
opencode.ts.bak2, llm-judge.ts.bak, runner.ts.bak, and runner.ts.orig from the
repository (remove them from the commit and the index) and add their patterns to
.gitignore so future backups aren’t committed; ensure no code imports or
references these artifacts (check files containing spawnServer and class
OpencodeAgent for accidental imports) before committing the deletion and push a
follow-up commit that removes these artifacts.

In `@src/cases/types.ts`:
- Around line 134-135: EvaluatorType now lists 'llm_judge_comparison' but the
Evaluator union (and specifically the LLMJudgeEvaluator interface) lacks a
corresponding interface, causing the runner to cast types; add a new interface
LLMJudgeComparisonEvaluator and include it in the Evaluator union (mirror the
shape/fields expected by runner.ts where it checks evaluator.type ===
'llm_judge_comparison') or alternatively remove 'llm_judge_comparison' from
EvaluatorType if not supported; update the union to reference the new
LLMJudgeComparisonEvaluator (or delete the enum member) so the type system no
longer requires any (evaluator.type as any) workarounds.
- Around line 415-476: Multiple interfaces (CaseResult, RunResult, RunSummary,
CriterionResult, EvaluatorBase, LLMJudgeEvaluator) are declared repeatedly via
declaration merging; consolidate each into a single complete interface by
locating every duplicate declaration of each symbol and merging their properties
into one unified declaration (preserve all unique fields like
CaseResult.agentResponse, agentToolCalls, agentTokens, agentFiles, timedOut,
timestamp; RunResult/RunSummary fields from all variants; CriterionResult fields
from all declarations; and all members of EvaluatorBase and LLMJudgeEvaluator)
then remove the extra/duplicated interface blocks so each symbol appears exactly
once.

In `@src/evaluation/llm-judge.ts`:
- Around line 269-303: The costTracker is only incrementing callCount in
callClaude but never updates inputTokens, outputTokens, or costUsd; modify
callClaude to extract usage metadata (e.g., usage.input_tokens and
usage.output_tokens) from the SDK response (the streaming response or final
'result' message) after receiving the message and before parseResponse, update
this.costTracker.inputTokens and outputTokens by adding those values, compute
costUsd using a model-pricing lookup (add or reuse a small mapping like
pricePerInputToken/model and pricePerOutputToken/model or a
getModelPricePerToken(model) helper) and increment this.costTracker.costUsd
accordingly, and ensure the same extraction and updates are implemented in the
other Claude integration path referenced around lines 139-156 (the alternate
Claude call function) so getCostTracker() returns accurate cumulative costs.
- Around line 407-418: The LLMJudge is being instantiated inside
runLLMJudgeEvaluator and runLLMJudgeComparisonEvaluator which discards
instance-level cache and costTracker each call; change the code to reuse a
persistent LLMJudge instance by either (A) adding an optional parameter to
runLLMJudgeEvaluator and runLLMJudgeComparisonEvaluator (e.g., judge?: LLMJudge)
and use the provided instance when present, or (B) implement a module-level
factory/getter (e.g., getSharedLLMJudge(options) or a cached variable
sharedJudge) that returns a singleton LLMJudge created once with the provided
LLMJudgeOptions so this.shared cache and costTracker persist across invocations;
update call sites to pass an existing LLMJudge where appropriate or rely on the
shared factory to ensure caching and cost tracking are preserved.
- Around line 370-376: The generateCacheKey function currently truncates the
concatenated args with str.substring(0, 200) which causes collisions; change it
to compute a stable hash of the full concatenated string (e.g., use
crypto.createHash('sha256').update(str).digest('hex')) and build the key as type
+ ':' + this.model + ':' + <hex-hash> (remove the substring truncation), and add
the appropriate import of Node's crypto module so the function uses the SHA-256
hex digest of the full args instead of a truncated prefix.
- Around line 463-477: The returned result from runLLMJudgeEvaluator is
incorrectly labeled as 'llm_judge_comparison'; update both return objects in
runLLMJudgeEvaluator (the successful return that builds { name: evaluator.name
|| 'llm_judge', type: 'llm_judge_comparison', ... } and the error-path return)
to use type: 'llm_judge' so the EvaluatorResult.type correctly reflects
single-answer evaluation; ensure you change both occurrences and keep the rest
of the result shape (score, passed, evidence, details, durationMs) intact.

In `@src/evaluation/llm-judge.ts.bak`:
- Around line 269-303: In callClaude, the costTracker only increments callCount
but never updates inputTokens, outputTokens, or costUsd; update callClaude to
extract token/usage info from the Claude SDK response (inspect the streaming
messages for a usage/metrics object or final response metadata after the
'result' message), parse input and output token counts and add them to
this.costTracker.inputTokens / outputTokens, compute costUsd (use the correct
per-token rate for this.model or a configurable RATE constant) and add to
this.costTracker.costUsd, and ensure this happens before returning (in the same
loop or immediately after parsing the final message in callClaude) so
getCostTracker() reflects actual usage.
- Around line 465-477: The returned result objects in runLLMJudgeEvaluator
incorrectly use type 'llm_judge_comparison'; update both the normal return
object in runLLMJudgeEvaluator and the error-fallback return (the catch/fallback
block that builds a failed result) to use type 'llm_judge' instead of
'llm_judge_comparison', leaving other fields (name, score, passed, evidence,
details, durationMs) unchanged; locate the return that references evaluator.name
and the fallback return that sets passed: false and replace only the type
string.
- Around line 370-376: The cache key truncates the concatenated inputs to 200
chars in generateCacheKey which can cause collisions for long answers; replace
the truncation with a full-input hash (e.g. sha256) of the string composed from
type, this.model and the joined args so the key uses type + ':' + this.model +
':' + <hex-hash>; use Node's synchronous crypto.createHash to avoid changing
method signature and remove the substring(0,200) logic so generateCacheKey
reliably returns a unique key for long inputs.

In `@src/evaluation/runner.ts`:
- Line 318: The shared evalStartTime variable is causing cumulative timings;
change the logic to record a fresh start timestamp for each evaluator and for
each criterion before their work begins and compute durationMs as Date.now()
minus that per-item start time. Specifically, inside the loop that processes
each evaluator (referencing evalStartTime and evaluator.durationMs) set a local
evaluatorStart = Date.now() just before running the evaluator and use Date.now()
- evaluatorStart to populate evaluator.durationMs; similarly, before running
each criterion (referencing criterion.durationMs) set a local criterionStart =
Date.now() and compute criterion.durationMs = Date.now() - criterionStart after
it finishes. Ensure any existing uses of the top-level evalStartTime are
replaced by these per-item start variables so durations reflect individual
execution time.

In `@src/evaluation/runner.ts.bak`:
- Around line 1-555: This PR includes a development artifact file named
runner.ts.bak which duplicates the real runner implementation (functions like
runCases, runSingleCase, evaluateWithRubric, snapshotFiles,
installDependencies); remove runner.ts.bak from the commit and repository (git
rm --cached or git rm and commit the deletion), ensure only the canonical
src/evaluation/runner.ts remains, update the PR so no .bak file is present, and
re-run tests/CI to confirm there are no duplicate-file import or build issues.

In `@src/evaluation/runner.ts.orig`:
- Around line 370-395: runLLMJudgeEvaluator is being fed
JSON.stringify(agentFiles) which can produce very large prompts and high API
costs; replace that direct serialization by preprocessing agentFiles with a
function (e.g., summarizeOrTruncateAgentFiles or buildAgentFilesContext) that
for each file produces either a truncated snippet (limit bytes/characters per
file) or a short summary (first N lines + metadata or a condensed summary) and
also enforces a total context size cap, then pass JSON.stringify of that reduced
context to runLLMJudgeEvaluator (update the caller in runner.ts where agentFiles
is used and add/implement the new summarization/truncation helper).
- Around line 1-566: Remove this development artifact file
(src/evaluation/runner.ts.orig) from the repository and ensure the canonical
implementation (e.g., src/evaluation/runner.ts) is the one tracked; delete the
.orig file in the PR, remove any imports or references to it, and verify key
symbols such as runCases, runSingleCase, evaluateWithRubric, snapshotFiles, and
createSandboxManager are only present in the intended runner.ts to avoid
duplicate definitions.
- Around line 419-427: The code references evalDurationMs and evalStartTime out
of their block scope; move timing so criterion-level and function-level timers
are declared outside their respective loops: before the evaluator for-loop,
declare a criterion-level start timestamp (e.g., criterionStartTime) and after
the evaluator loop compute criterionDurationMs and use that in the
criteriaResults push (replace evalDurationMs), and at the top of the
run/evaluation function declare a function-level start timestamp (e.g.,
runStartTime) and compute totalDurationMs for the final return (replace
evalStartTime usage); update references to evalDurationMs and evalStartTime to
these new symbols and ensure they are computed after the loops complete.

🧹 Nitpick comments (6)

src/agents/opencode.ts (2)
298-303: Redundant guard: if (!text) continue makes the following if (text) always true.
           } else if (partAny.type === 'reasoning') {
             const text = (props as { delta?: string }).delta || partAny.text || '';
             if (!text) continue;
-            if (text) {
-              options.onEvent?.({ type: 'thinking', text });
-            }
+            options.onEvent?.({ type: 'thinking', text });
237-238: Dead code: toolName defaults to 'unknown', so if (!toolName) continue never triggers.

The fallback || 'unknown' on line 237 ensures toolName is always truthy, making the guard on line 238 unreachable.
src/evaluation/llm-judge.ts.bak (1)

463-492: Unused durationMs variable — computed but never referenced.

const durationMs = Date.now() - startTime; at lines 463 and 479 is assigned but the return statements use Date.now() - startTime again. Same pattern at lines 528 and 545.

src/evaluation/runner.ts (1)

28-28: Dead commented-out import — wire the LLM judge or remove.

The runLLMJudgeEvaluator import is double-commented (// //) and the llm_judge branch on Line 370 is a hard-coded placeholder returning score: 0. The LLM judge module (llm-judge.ts) is implemented in this PR — consider wiring it in now, or at minimum leave a single clean // TODO with an issue reference rather than dead commented code.

src/evaluation/llm-judge.ts (2)

463-463: Unused _durationMs variables on Lines 463, 479, 528, 545.

Each function computes _durationMs but never uses it — durationMs in the returned object is recomputed inline with a second Date.now() - startTime call. Remove the dead assignments.

Also applies to: 479-479, 528-528, 545-545

142-157: Default model claude-3-5-sonnet-20241022 may be outdated — make it easy to update.

Consider extracting the default model string to a module-level constant for easier updates.

coderabbitai · 2026-02-16T17:10:51Z

src/agents/opencode.ts

          // Extract final answer text from message parts if we haven't captured it via deltas
-          if (props?.parts && !answer) {
-            for (const p of props.parts) {
-              if (p.type === 'text' && p.text) {
-                answer += p.text;
+          if (props && (props as { parts?: unknown[] } & Record<string, unknown>).parts) {
+            for (const p of (props as { parts?: unknown[] | null | undefined }).parts ?? []) {
+              if ((p as { type?: string; text?: string }).type === 'text' && (p as { type?: string; text?: string }).text) {
+                answer += (p as { type?: string; text?: string }).text;
              }
            }
          }


⚠️ Potential issue | 🟠 Major

Potential duplicate answer content from message.updated events.

When streaming deltas are received (lines 229-233), answer accumulates text. Then message.updated (lines 341-346) unconditionally appends text from the full message parts. This can double the captured answer. The fallback at lines 373-392 already handles the "no streaming data" case. Guard this section:

// Extract final answer text from message parts if we haven't captured it via deltas - if (props && (props as { parts?: unknown[] } & Record<string, unknown>).parts) { + if (!answer && props && (props as { parts?: unknown[] } & Record<string, unknown>).parts) {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Extract final answer text from message parts if we haven't captured it via deltas

if (props?.parts && !answer) {

for (const p of props.parts) {

if (p.type === 'text' && p.text) {

answer += p.text;

if (props && (props as { parts?: unknown[] } & Record<string, unknown>).parts) {

for (const p of (props as { parts?: unknown[] | null | undefined }).parts ?? []) {

if ((p as { type?: string; text?: string }).type === 'text' && (p as { type?: string; text?: string }).text) {

answer += (p as { type?: string; text?: string }).text;

}

}

}

// Extract final answer text from message parts if we haven't captured it via deltas

if (!answer && props && (props as { parts?: unknown[] } & Record<string, unknown>).parts) {

for (const p of (props as { parts?: unknown[] | null | undefined }).parts ?? []) {

if ((p as { type?: string; text?: string }).type === 'text' && (p as { type?: string; text?: string }).text) {

answer += (p as { type?: string; text?: string }).text;

}

}

}

🤖 Prompt for AI Agents

In `@src/agents/opencode.ts` around lines 340 - 347, The message.updated handler unconditionally appends text from props.parts into the answer, which can duplicate content already appended by the streaming delta handler; add a guard so props.parts is only merged when no streaming deltas were received (or when answer is still empty). Concretely, introduce or reuse a boolean flag set to true in the streaming-delta processing code (the code that appends to answer from deltas) such as hasStreamedDeltas or receivedDeltas, then in the message.updated block that reads props.parts only append when that flag is false (or answer is empty), and ensure the flag is initialized/cleared appropriately so fallback behavior still works when there were no streaming deltas.

coderabbitai · 2026-02-16T17:10:51Z

src/agents/opencode.ts.bak

+/**
+ * Opencode agent wrapper using SDK
+ *
+ * Uses @opencode-ai/sdk for programmatic interaction with opencode.
+ * Spawns the opencode server with the correct working directory so
+ * the agent operates on the test case files.
+ */
+
+import { spawn, ChildProcess } from 'child_process';
+import {
+  AgentWrapper,
+  AgentResult,
+  AgentRunOptions,
+  ToolCall,
+  emptyAgentResult,
+} from './types.js';
+
+// Import SDK client dynamically since it's ESM-only
+let _createOpencodeClient: (() => unknown) | undefined; // SDK type not fully defined
+const loadSDK = async () => {
+  if (!_createOpencodeClient) {
+    const sdkWrapper = await import('./opencode-sdk.mjs');
+    _createOpencodeClient = sdkWrapper.createOpencodeClient;
+  }
+  return _createOpencodeClient;
+};
+
+// Port counter to avoid collisions between concurrent runs
+let nextPort = 4097;
+
+/**
+ * Spawn an opencode server process with the given working directory.
+ * Returns the server URL and a close function.
+ */
+async function spawnServer(
+  cwd: string,
+  config: Record<string, unknown>,
+  timeoutMs: number,
+): Promise<{ url: string; proc: ChildProcess }> {
+  const port = nextPort++;
+  const proc = spawn('opencode', ['serve', `--hostname=127.0.0.1`, `--port=${port}`], {
+    cwd,
+    env: {
+      ...process.env,
+      OPENCODE_CONFIG_CONTENT: JSON.stringify(config),
+    },
+  });
+
+  const _url = await new Promise<string>((resolve, reject) => { // eslint-disable-line @typescript-eslint/no-unused-vars
+    const id = setTimeout(() => {
+      proc.kill();
+      reject(new Error(`Timeout waiting for opencode server after ${timeoutMs}ms`));
+    }, timeoutMs);
+
+    let output = '';
+    proc.stdout?.on('data', (chunk: Buffer) => {
+      output += chunk.toString();
+      for (const line of output.split('\n')) {
+        if (line.startsWith('opencode server listening')) {
+          const match = line.match(/on\s+(https?:\/\/[^\s]+)/);
+          if (match) {
+            clearTimeout(id);
+            resolve(match[1]);
+            return;
+          }
+        }
+      }
+    });
+    proc.stderr?.on('data', (chunk: Buffer) => {
+      output += chunk.toString();
+    });
+    proc.on('exit', (code) => {
+      clearTimeout(id);
+      reject(new Error(`Server exited with code ${code}: ${output}`));
+    });
+    proc.on('error', (err) => {
+      clearTimeout(id);
+      reject(err);
+    });
+  });
+
+  return { url: _url, proc };
+}
+
+/**
+ * Opencode agent wrapper using SDK
+ */
+export class OpencodeAgent implements AgentWrapper {
+  name = 'opencode';
+  displayName = 'Opencode';
+
+  private cliPath: string;
+  private config: Record<string, unknown>;
+
+  constructor(cliPath: string = 'opencode', config?: Record<string, unknown>) {
+    this.cliPath = cliPath;
+    this.config = config || {
+      model: 'local-glm/glm-4.7-local-4bit',
+      provider: {
+        'local-glm': {
+          api: 'openai',
+          options: {
+            baseURL: 'http://127.0.0.1:8081/v1',
+            apiKey: 'local-glm-key',
+          },
+          models: {
+            'glm-4.7-local-4bit': {
+              name: 'GLM-4.7 Local (4-bit)',
+              id: '/Users/studio/models/GLM-4.7-4bit',
+              reasoning: false,
+              tool_call: true,
+              temperature: true,
+              limit: { context: 32768, output: 4096 },
+              cost: { input: 0, output: 0 },
+              modalities: { input: ['text'], output: ['text'] },
+            },
+          },
+        },
+      },
+    };
+  }
+
+  async isAvailable(): Promise<boolean> {
+    try {
+      const version = await this.getVersion();
+      return version !== null;
+    } catch {
+      return false;
+    }
+  }
+
+  async getVersion(): Promise<string | null> {
+    return new Promise((resolve) => {
+      const proc = spawn(this.cliPath, ['--version'], { timeout: 5000 });
+      let stdout = '';
+      proc.stdout?.on('data', (data: Buffer) => {
+        stdout += data.toString();
+      });
+      proc.on('close', (code: number | null) => {
+        resolve(code === 0 && stdout.trim() ? stdout.trim() : null);
+      });
+      proc.on('error', () => resolve(null));
+    });
+  }
+
+  async run(prompt: string, options: AgentRunOptions): Promise<AgentResult> {
+    const runStartTime = Date.now();
+    const timeoutMs = options.timeoutMs || 300000;
+    const toolCalls: ToolCall[] = [];
+    let model = 'unknown';
+    let sessionId = '';
+    let _serverProc: ChildProcess | null = null;
+
+    try {
+      // Spawn server in the case's working directory
+      const cwd = options.cwd || process.cwd();
+      const config = options.model
+        ? { ...this.config, model: options.model }
+        : this.config;
+      const { url, proc } = await spawnServer(cwd, config, 15000);
+      _serverProc = proc;
+
+      const createClient = await loadSDK();
+      if (!createClient) throw new Error("Failed to load SDK");
+      const client = createClient();
+
+      const createResult = await client.session.create({});
+      if (createResult.error) {
+        throw new Error(`Failed to create session: ${JSON.stringify(createResult.error)}`);
+      }
+
+      const session = createResult.data;
+      sessionId = session.id;
+      model = options.model || session.version || 'unknown';
+
+      options.onEvent?.({ type: 'start', timestamp: runStartTime, model });
+
+      // Subscribe to SSE events BEFORE sending the prompt so we capture everything
+      // event.subscribe() returns ServerSentEventsResult directly (not { data, error })
+      const sseResult = await client.event.subscribe({}) as unknown;
+      const stream: AsyncIterable<unknown> | undefined =
+        (sseResult as { stream?: AsyncIterable<unknown>; data?: { stream?: AsyncIterable<unknown> } })?.stream ||
+        (sseResult as { data?: { stream?: AsyncIterable<unknown> } })?.data?.stream ||
+        (sseResult as { data?: AsyncIterable<unknown> })?.data;
+
+      if (!stream) {
+        throw new Error(
+          `Event stream not available — subscribe() returned: ${JSON.stringify(Object.keys(sseResult || {}))}`,
+        );
+      }
+
+      // Send prompt asynchronously (returns immediately, events stream the progress)
+      const asyncResult = await client.session.promptAsync({
+        path: { id: sessionId },
+        body: {
+          parts: [{ type: 'text', text: prompt }],
+        },
+      });
+
+      if (asyncResult.error) {
+        throw new Error(`Prompt failed: ${JSON.stringify(asyncResult.error)}`);
+      }
+
+      // Process SSE events until the session goes idle or we time out
+      let answer = '';
+      let numTurns = 0;
+      let totalTokens = { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 };
+      let totalCost: number = 0;
+      const deadline = Date.now() + timeoutMs - 5000;
+
+      for await (const event of stream) {
+        if (Date.now() > deadline) {
+          options.onEvent?.({ type: 'status', message: 'Timed out waiting for agent' });
+          break;
+        }
+
+        const eventType = (event as { type?: string; event?: string })?.type ?? (event as { type?: string; event?: string })?.event ?? '';
+
+        if (eventType === 'message.part.updated') {
+          const eventAny = event as { properties?: unknown; data?: unknown };
+          const props = eventAny.properties || eventAny.data || {};
+          if (!props) continue;
+          const part = (props as { part?: unknown }).part || ({} as any);
+          if (!part) continue;
+
+          const partAny = part as { type?: string; text?: string; state?: { status?: string; input?: unknown; time?: { start?: number; end?: number }; output?: unknown }; callID?: string; callId?: string; tool?: string; tokens?: { input?: number; output?: number; cache?: { read?: number; write?: number }; total?: number }; cost?: number };
+          if (partAny.type === 'text') {
+            // Streaming text delta
+            const delta = (props as { delta?: string }).delta || '';
+            if (delta) {
+              answer += delta;
+              options.onEvent?.({ type: 'text_delta', text: delta });
+            }
+          } else if (partAny.type === 'tool') {
+            const status = partAny.state?.status || '';
+            const callID = partAny.callID || partAny.callId || '';
+            const toolName: string = (partAny.tool as string) || 'unknown';
+            if (!toolName) continue;
+
+            if (status === 'running' || status === 'pending') {
+              // Only add if not already tracked
+              if (!toolCalls.find((t) => t.id === callID)) {
+                const toolCall: ToolCall = {
+                  id: callID,
+                  name: toolName,
+                  input: (partAny.state?.input || {}) as Record<string, unknown>,
+                  timestamp: Date.now(),
+                };
+                toolCalls.push(toolCall);
+                options.onEvent?.({ type: 'tool_start', tool: toolCall });
+                options.onEvent?.({ type: 'status', message: `Tool: ${toolName}` });
+              }
+            } else if (status === 'completed') {
+              const existing = toolCalls.find((t) => t.id === callID);
+              if (existing) {
+                existing.durationMs = partAny.state?.time?.end && partAny.state.time?.start
+                  ? (partAny.state.time.end - partAny.state.time.start) * 1000
+                  : Date.now() - existing.timestamp;
+                existing.success = true;
+                existing.result = partAny.state?.output
+                  ? String(partAny.state.output).substring(0, 500)
+                  : undefined;
+              } else {
+                // Tool completed without a prior start event (can happen if subscription started late)
+                toolCalls.push({
+                  id: callID,
+                  name: toolName,
+                  input: (partAny.state?.input || {}) as Record<string, unknown>,
+                  timestamp: Date.now(),
+                  durationMs: partAny.state?.time?.end && partAny.state.time?.start
+                    ? (partAny.state.time.end - partAny.state.time.start) * 1000
+                    : 0,
+                  success: true,
+                  result: partAny.state?.output
+                    ? String(partAny.state.output).substring(0, 500)
+                    : undefined,
+                });
+              }
+              options.onEvent?.({
+                type: 'tool_end',
+                toolId: callID,
+                success: true,
+                durationMs: toolCalls.find((t) => t.id === callID)?.durationMs || 0,
+              });
+            } else if (status === 'error') {
+              const existing = toolCalls.find((t) => t.id === callID);
+              if (existing) {
+                existing.success = false;
+                existing.durationMs = Date.now() - existing.timestamp;
+              }
+              options.onEvent?.({
+                type: 'tool_end',
+                toolId: callID,
+                success: false,
+                durationMs: existing?.durationMs || 0,
+              });
+            }
+          } else if (partAny.type === 'reasoning') {
+            const text = (props as { delta?: string }).delta || partAny.text || '';
+            if (!text) continue;
+            if (text) {
+              options.onEvent?.({ type: 'thinking', text });
+            }
+          } else if (partAny.type === 'step-finish') {
+            numTurns++;
+            // Accumulate per-step tokens/cost
+            const partTyped = partAny as { tokens?: { input?: number; output?: number; cache?: { read?: number; write?: number }; total?: number }; cost?: number };
+            if (partTyped.tokens) {
+              totalTokens.input += partTyped.tokens.input || 0;
+              totalTokens.output += partTyped.tokens.output || 0;
+              totalTokens.cacheRead += partTyped.tokens.cache?.read || 0;
+              totalTokens.cacheWrite += partTyped.tokens.cache?.write || 0;
+              totalTokens.total += partTyped.tokens.total || 0;
+            }
+            if (partTyped.cost) {
+              totalCost += partTyped.cost;
+            }
+          }
+        } else if (eventType === 'message.updated') {
+          // A full message update — extract final info from here
+          const eventAny = event as { properties?: unknown; data?: unknown };
+          const props = (eventAny.properties || eventAny.data) as { parts?: unknown[] } & Record<string, unknown>;
+          const info = props as { providerID?: string; modelID?: string; tokens?: { input?: number; output?: number; cache?: { read?: number; write?: number }; total?: number }; cost?: number } | undefined;
+          if (info?.providerID && info?.modelID) {
+            model = `${info.providerID}/${info.modelID}`;
+          }
+          // Use message-level tokens as authoritative total if available
+          if (info?.tokens?.total) {
+            totalTokens = {
+              input: info.tokens.input || totalTokens.input,
+              output: info.tokens.output || totalTokens.output,
+              cacheRead: info.tokens.cache?.read || totalTokens.cacheRead,
+              cacheWrite: info.tokens.cache?.write || totalTokens.cacheWrite,
+              total: info.tokens.total,
+            };
+          }
+          if (info?.cost !== undefined) {
+            totalCost = info.cost;
+          }
+          // Extract final answer text from message parts if we haven't captured it via deltas
+          if (props && (props as { parts?: unknown[] } & Record<string, unknown>).parts) {
+            for (const p of (props as { parts?: unknown[] | null | undefined }).parts ?? []) {
+              if ((p as { type?: string; text?: string }).type === 'text' && (p as { type?: string; text?: string }).text) {
+                answer += (p as { type?: string; text?: string }).text;
+              }
+            }
+          }
+        } else if (eventType === 'session.status') {
+          const eventAny = event as { properties?: unknown; data?: unknown };
+          const props = (eventAny.properties || eventAny.data) as { parts?: unknown[] } & Record<string, unknown>;
+          const status = props as { type?: string; attempt?: number; message?: string } | undefined;
+          if (status?.type === 'idle') {
+            // Agent finished processing
+            options.onEvent?.({ type: 'status', message: 'Session idle — agent finished' });
+            break;
+          } else if (status?.type === 'busy') {
+            options.onEvent?.({ type: 'status', message: 'Agent working...' });
+          } else if (status?.type === 'retry') {
+            options.onEvent?.({
+              type: 'status',
+              message: `Retrying (attempt ${status.attempt}): ${status.message}`,
+            });
+          }
+        } else if (eventType === 'session.error') {
+          const eventAny = event as { properties?: unknown; data?: unknown };
+          const props = (eventAny.properties || eventAny.data) as { parts?: unknown[] } & Record<string, unknown>;
+          const errMsg = (props as { error?: { message?: string } | undefined })?.error?.message || JSON.stringify(props) || 'Unknown error';
+          options.onEvent?.({ type: 'error', message: errMsg, code: 'SESSION_ERROR' });
+        }
+      }
+
+      // If answer is still empty, fetch the final messages from the session
+      if (!answer) {
+        const messagesResult = await client.session.messages({
+          path: { id: sessionId },
+        });
+        if (messagesResult.data) {
+          const messages = messagesResult.data as { role?: string; parts?: unknown[] }[];
+          // Find the last assistant message
+          for (let i = messages.length - 1; i >= 0; i--) {
+            const msg = messages[i] as { role?: string; parts?: unknown[] };
+            if ((msg as { role?: string }).role === 'assistant' && msg.parts) {
+              for (const p of msg.parts) {
+                if ((p as { type?: string; text?: string }).type === 'text' && (p as { type?: string; text?: string }).text) {
+                  answer += (p as { type?: string; text?: string }).text;
+                }
+              }
+              break;
+            }
+          }
+        }
+      }
+
+      const result: AgentResult = {
+        answer,
+        success: true,
+        timedOut: Date.now() > deadline,
+        durationMs: Date.now() - runStartTime,
+        tokens: {
+          inputTokens: totalTokens.input,
+          outputTokens: totalTokens.output,
+          cacheReadTokens: totalTokens.cacheRead,
+          cacheWriteTokens: totalTokens.cacheWrite,
+          totalTokens: totalTokens.total,
+        },
+        costUsd: totalCost,
+        numTurns: numTurns || 1,
+        toolCalls,
+        toolsUsed: [...new Set(toolCalls.map((t) => t.name))],
+        model,
+        raw: { sessionId },
+      };
+
+      options.onEvent?.({ type: 'complete', result });
+      return result;
+    } catch (error) {
+      const errorMessage = error instanceof Error ? error.message : String(error);
+
+      options.onEvent?.({ type: 'error', message: errorMessage, code: 'ERROR' });
+
+      const errorResult = emptyAgentResult(errorMessage);
+      errorResult.durationMs = Date.now() - runStartTime;
+      errorResult.toolCalls = toolCalls;
+      errorResult.toolsUsed = [...new Set(toolCalls.map((t) => t.name))];
+      errorResult.model = model;
+
+      options.onEvent?.({ type: 'complete', result: errorResult });
+      return errorResult;
+    } finally {
+      _serverProc?.kill();
+    }
+  }
+}
+
+export function createOpencodeAgent(cliPath?: string): OpencodeAgent {
+  return new OpencodeAgent(cliPath);
+}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Find all .bak, .bak2, .orig files in the repository echo "=== Backup files found ===" fd -e bak -e bak2 -e orig echo "" echo "=== Comparing opencode.ts.bak and opencode.ts.bak2 ===" if [ -f src/agents/opencode.ts.bak ] && [ -f src/agents/opencode.ts.bak2 ]; then diff src/agents/opencode.ts.bak src/agents/opencode.ts.bak2 | head -20 echo "Exit code: $?" else echo "One or both files not found" fi echo "" echo "=== Comparing opencode.ts.bak with opencode.ts ===" if [ -f src/agents/opencode.ts.bak ] && [ -f src/agents/opencode.ts ]; then diff src/agents/opencode.ts.bak src/agents/opencode.ts | head -20 echo "Exit code: $?" else echo "One or both files not found" fi echo "" echo "=== File sizes ===" ls -lh src/agents/opencode.ts* 2>/dev/null || echo "Files not found"

Repository: AnswerLayer/sniffbench

Length of output: 1035

Remove .bak and .orig files from the repository.

Five backup files exist and should not be committed to version control: opencode.ts.bak, opencode.ts.bak2, llm-judge.ts.bak, runner.ts.bak, and runner.ts.orig. These are development artifacts. Use git history to track prior states instead of storing backup files.

🤖 Prompt for AI Agents

In `@src/agents/opencode.ts.bak` around lines 1 - 437, Delete the backup files named opencode.ts.bak, opencode.ts.bak2, llm-judge.ts.bak, runner.ts.bak, and runner.ts.orig from the repository (remove them from the commit and the index) and add their patterns to .gitignore so future backups aren’t committed; ensure no code imports or references these artifacts (check files containing spawnServer and class OpencodeAgent for accidental imports) before committing the deletion and push a follow-up commit that removes these artifacts.

coderabbitai · 2026-02-16T17:10:51Z

src/cases/types.ts

+  | 'llm_judge_comparison' // Use LLM to compare two answers
  | 'agent_behavior'; // Evaluate agent behavior metrics


⚠️ Potential issue | 🟡 Minor

'llm_judge_comparison' added to EvaluatorType but no corresponding evaluator interface exists in the Evaluator union.

The Evaluator union type (line 272) only includes LLMJudgeEvaluator (with type: 'llm_judge'). There is no dedicated interface for 'llm_judge_comparison'. The runner (runner.ts.orig line 379) uses (evaluator.type as any) === 'llm_judge_comparison' to work around this. Either add a proper LLMJudgeComparisonEvaluator interface to the union, or remove the type from the union if it's not ready.

🤖 Prompt for AI Agents

In `@src/cases/types.ts` around lines 134 - 135, EvaluatorType now lists 'llm_judge_comparison' but the Evaluator union (and specifically the LLMJudgeEvaluator interface) lacks a corresponding interface, causing the runner to cast types; add a new interface LLMJudgeComparisonEvaluator and include it in the Evaluator union (mirror the shape/fields expected by runner.ts where it checks evaluator.type === 'llm_judge_comparison') or alternatively remove 'llm_judge_comparison' from EvaluatorType if not supported; update the union to reference the new LLMJudgeComparisonEvaluator (or delete the enum member) so the type system no longer requires any (evaluator.type as any) workarounds.

coderabbitai · 2026-02-16T17:10:52Z

src/cases/types.ts

+// Fix missing properties in CaseResult
 export interface CaseResult {
-  /** Case that was evaluated */
-  caseId: string;
+  /** Case ID */
+  id: string;
+
+  /** Case title */
+  title: string;

-  /** Overall score from 0 to 100 */
+  /** Overall score (0-1) */
  score: number;

-  /** Whether the case passed (score >= pass threshold) */
+  /** Whether the case passed */
  passed: boolean;

-  /** Results for each criterion */
-  criteriaResults: CriterionResult[];
+  /** Evidence/reasoning */
+  evidence: string;
+
+  /** Individual criterion results */
+  criteria: CriterionResult[];
+
+  /** Individual evaluator results */
+  evaluators: EvaluatorResult[];
+
+  /** Duration in milliseconds */
+  durationMs: number;

-  /** Agent behavior trace */
-  agentTrace?: AgentTrace;
+  /** Error if any */
+  error?: string;

-  /** The agent's text response */
+  /** Agent response */
  agentResponse?: string;

-  /** Tool calls the agent made */
-  agentToolCalls?: { name: string; durationMs?: number; success?: boolean }[];
+  /** Agent tool calls */
+  agentToolCalls?: Array<{
+    name: string;
+    durationMs: number;
+    success: boolean;
+  }>;

-  /** Model used */
+  /** Agent model */
  agentModel?: string;

-  /** Token usage */
-  agentTokens?: { input: number; output: number; total: number };
+  /** Agent tokens */
+  agentTokens?: {
+    input: number;
+    output: number;
+    total: number;
+  };
+
+  /** Agent files */
+  agentFiles?: Array<{
+    path: string;
+    content: string;
+    changed: boolean;
+  }>;
+
+  /** Whether the case timed out */
+  timedOut?: boolean;
+
+  /** Timestamp */
+  timestamp?: Date;
+}


🛠️ Refactor suggestion | 🟠 Major

Interfaces are declared multiple times via declaration merging — consolidate into single definitions.

CaseResult is declared three times (lines 343, 416, 587), RunResult three times (lines 375, 479, 650), RunSummary three times (lines 398, 512, 683), CriterionResult three times (lines 320, 536, 707), EvaluatorBase three times (lines 173, 563, 734), and LLMJudgeEvaluator three times (lines 248, 575, 746).

While TypeScript merges them, this is clearly a code-generation artifact where the LLM appended "fix" blocks instead of editing the original definitions. This makes the file nearly unreadable and is error-prone. Merge each interface into a single, complete declaration.

🤖 Prompt for AI Agents

In `@src/cases/types.ts` around lines 415 - 476, Multiple interfaces (CaseResult, RunResult, RunSummary, CriterionResult, EvaluatorBase, LLMJudgeEvaluator) are declared repeatedly via declaration merging; consolidate each into a single complete interface by locating every duplicate declaration of each symbol and merging their properties into one unified declaration (preserve all unique fields like CaseResult.agentResponse, agentToolCalls, agentTokens, agentFiles, timedOut, timestamp; RunResult/RunSummary fields from all variants; CriterionResult fields from all declarations; and all members of EvaluatorBase and LLMJudgeEvaluator) then remove the extra/duplicated interface blocks so each symbol appears exactly once.

coderabbitai · 2026-02-16T17:10:52Z

src/evaluation/llm-judge.ts

+  private async callClaude(prompt: string): Promise<LLMJudgeScore | ComparisonResult | null> {
+    if (!this.apiKey) {
+      throw new Error('ANTHROPIC_API_KEY not set');
+    }
+
+    this.costTracker.callCount++;
+
+    // Dynamic import of SDK
+    const sdk = await import('@anthropic-ai/claude-agent-sdk');
+
+    const response = await sdk.query({
+      prompt,
+      options: {
+        model: this.model,
+        // Note: system prompt is not supported in this SDK version
+        settingSources: [],
+      },
+    });
+
+    let result: LLMJudgeScore | ComparisonResult | null = null;
+
+    for await (const message of response) {
+      if (message.type === 'result' && message.subtype === 'success' && (message as { result?: string }).result) {
+        const content = (message as { result?: string }).result as string || '';
+        result = this.parseResponse(content);
+        break;
+      }
+    }
+
+    if (!result) {
+      throw new Error('Failed to parse LLM response');
+    }
+
+    return result;
+  }


⚠️ Potential issue | 🟠 Major

Cost tracking is incomplete — token counts and USD cost are never updated.

callCount is incremented (Line 274), but inputTokens, outputTokens, and costUsd remain 0 forever. The response from callClaude doesn't extract usage metadata. This was an acceptance criterion ("API cost tracking implemented") and will mislead consumers of getCostTracker().

Once the SDK integration is fixed, extract usage.input_tokens and usage.output_tokens from the API response and compute cost based on model pricing.

Also applies to: 139-156

🤖 Prompt for AI Agents

In `@src/evaluation/llm-judge.ts` around lines 269 - 303, The costTracker is only incrementing callCount in callClaude but never updates inputTokens, outputTokens, or costUsd; modify callClaude to extract usage metadata (e.g., usage.input_tokens and usage.output_tokens) from the SDK response (the streaming response or final 'result' message) after receiving the message and before parseResponse, update this.costTracker.inputTokens and outputTokens by adding those values, compute costUsd using a model-pricing lookup (add or reuse a small mapping like pricePerInputToken/model and pricePerOutputToken/model or a getModelPricePerToken(model) helper) and increment this.costTracker.costUsd accordingly, and ensure the same extraction and updates are implemented in the other Claude integration path referenced around lines 139-156 (the alternate Claude call function) so getCostTracker() returns accurate cumulative costs.

coderabbitai · 2026-02-16T17:10:52Z

src/evaluation/runner.ts

  const criteriaResults: CriterionResult[] = [];
  let totalWeightedScore = 0;
  let _totalWeight = 0;
+  const evalStartTime = Date.now();


⚠️ Potential issue | 🟠 Major

Bug: evalStartTime is shared across all evaluators and criteria, producing cumulative durations instead of per-item durations.

evalStartTime is set once (Line 318) and never reset. Every evaluator's durationMs (Line 390) and every criterion's durationMs (Line 414) are computed as Date.now() - evalStartTime, so the values grow monotonically — the second evaluator reports the total elapsed time since the first evaluator started, not its own duration.

🐛 Proposed fix — track per-evaluator and per-criterion start times

- const evalStartTime = Date.now(); // Evaluate each criterion in the rubric for (const [criterionKey, criterion] of Object.entries(rubric.criteria)) { const evaluatorResults: EvaluatorResult[] = []; let criterionScore = 0; let evaluatorCount = 0; + const criterionStartTime = Date.now(); for (const evaluator of criterion.evaluators) { let evalResult: Omit<EvaluatorResult, 'name' | 'type' | 'durationMs'>; + const evaluatorStartTime = Date.now(); if (evaluator.type === 'command') { // ... existing evaluator logic ... @@ evaluatorResults.push evaluatorResults.push({ name: evaluator.name || evaluator.type, type: evaluator.type as EvaluatorType, - durationMs: Date.now() - evalStartTime, + durationMs: Date.now() - evaluatorStartTime, ...evalResult, }); @@ criteriaResults.push criteriaResults.push({ name: criterionKey, weight: criterion.weight, score: rawScore, passed: allPassed, evidence: `Criterion: ${criterionKey}`, evaluatorResults, - durationMs: Date.now() - evalStartTime, + durationMs: Date.now() - criterionStartTime, });

Also applies to: 387-392, 407-415

🤖 Prompt for AI Agents

In `@src/evaluation/runner.ts` at line 318, The shared evalStartTime variable is causing cumulative timings; change the logic to record a fresh start timestamp for each evaluator and for each criterion before their work begins and compute durationMs as Date.now() minus that per-item start time. Specifically, inside the loop that processes each evaluator (referencing evalStartTime and evaluator.durationMs) set a local evaluatorStart = Date.now() just before running the evaluator and use Date.now() - evaluatorStart to populate evaluator.durationMs; similarly, before running each criterion (referencing criterion.durationMs) set a local criterionStart = Date.now() and compute criterion.durationMs = Date.now() - criterionStart after it finishes. Ensure any existing uses of the top-level evalStartTime are replaced by these per-item start variables so durations reflect individual execution time.

coderabbitai · 2026-02-16T17:10:52Z

src/evaluation/runner.ts.bak

+/**
+ * Evaluation runner - executes cases in sandboxes and evaluates results
+ *
+ * This is the core evaluation engine that:
+ * 1. Sets up the sandbox environment
+ * 2. Runs the case (agent attempts to solve the problem)
+ * 3. Applies the rubric to evaluate the result
+ */
+
+import * as fs from 'fs';
+import * as path from 'path';
+import * as os from 'os';
+import {
+  Case,
+  CaseFile,
+  CaseResult,
+  CriterionResult,
+  EvaluatorResult,
+  RunResult,
+  RunSummary,
+  EvaluatorType,
+} from '../cases/types';
+import { createSandboxManager, checkDocker, RECOMMENDED_IMAGES } from '../sandbox';
+import { Sandbox, SandboxConfig } from '../sandbox/types';
+import { getRubricRegistry } from '../rubrics/loader';
+import { getAgent } from '../agents/registry';
+import type { AgentResult } from '../agents/types';
+// // import { runLLMJudgeEvaluator } from './llm-judge';
+
+export interface RunnerOptions {
+  /** Agent being evaluated (for logging) */
+  agent: string;
+
+  /** Model to use (passed to agent) */
+  model?: string;
+
+  /** Timeout per case in seconds */
+  timeoutSeconds?: number;
+
+  /** Enable network in sandbox */
+  networkEnabled?: boolean;
+
+  /** Callback for progress updates */
+  onProgress?: (update: ProgressUpdate) => void;
+
+  /** Callback when a case completes */
+  onCaseComplete?: (result: CaseResult) => void;
+}
+
+export interface ProgressUpdate {
+  type: 'starting' | 'running' | 'validating' | 'complete' | 'error';
+  caseId: string;
+  caseIndex: number;
+  totalCases: number;
+  message?: string;
+}
+
+/**
+ * Get the appropriate Docker image for a language
+ */
+function getImageForLanguage(language: string): string {
+  const langLower = language.toLowerCase();
+
+  if (langLower === 'javascript' || langLower === 'typescript' || langLower === 'node') {
+    return RECOMMENDED_IMAGES.node.latest;
+  }
+  if (langLower === 'python') {
+    return RECOMMENDED_IMAGES.python.latest;
+  }
+  if (langLower === 'go' || langLower === 'golang') {
+    return RECOMMENDED_IMAGES.go.latest;
+  }
+  if (langLower === 'rust') {
+    return RECOMMENDED_IMAGES.rust.latest;
+  }
+  if (langLower === 'java') {
+    return RECOMMENDED_IMAGES.java.latest;
+  }
+
+  // Default to Node.js for unknown languages
+  return RECOMMENDED_IMAGES.node.latest;
+}
+
+/**
+ * Run a set of cases and return results
+ */
+export async function runCases(cases: Case[], options: RunnerOptions): Promise<RunResult> {
+  const runId = `run-${Date.now()}-${Math.random().toString(36).substring(2, 8)}`;
+  const startedAt = new Date();
+  const results: CaseResult[] = [];
+
+  // Check Docker availability first
+  const dockerStatus = await checkDocker();
+  if (!dockerStatus.available) {
+    throw new Error(`Docker is not available: ${dockerStatus.error}\n${dockerStatus.suggestion}`);
+  }
+
+  const manager = createSandboxManager();
+  let rubricId = 'default';
+
+  try {
+    for (let i = 0; i < cases.length; i++) {
+      const caseData = cases[i];
+
+      options.onProgress?.({
+        type: 'starting',
+        caseId: caseData.id,
+        caseIndex: i,
+        totalCases: cases.length,
+        message: `Starting ${caseData.title}`,
+      });
+
+      try {
+        const result = await runSingleCase(caseData, manager, options, i, cases.length);
+        results.push(result);
+        options.onCaseComplete?.(result);
+        // Track the rubric ID from the first case
+        if (i === 0) {
+          const registry = getRubricRegistry();
+          const rubric = registry.resolve(caseData.rubric);
+          rubricId = rubric.id;
+        }
+      } catch (err) {
+        const errorResult: CaseResult = {
+          id: caseData.id,
+          title: caseData.title,
+          score: 0,
+          passed: false,
+          evidence: (err as Error).message,
+          criteria: [],
+          evaluators: [],
+          durationMs: 0,
+          error: (err as Error).message,
+          timestamp: new Date(),
+        };
+        results.push(errorResult);
+        options.onCaseComplete?.(errorResult);
+      }
+    }
+  } finally {
+    // Clean up all sandboxes
+    await manager.destroyAll();
+  }
+
+  const completedAt = new Date();
+  const totalDurationMs = completedAt.getTime() - startedAt.getTime();
+
+  // Calculate summary
+  const scores = results.map((r) => r.score);
+  const averageScore = scores.length > 0 ? scores.reduce((a, b) => a + b, 0) / scores.length : 0;
+
+  const summary: RunSummary = {
+    total: results.length,
+    passed: results.filter((r) => r.passed).length,
+    failed: results.filter((r) => !r.passed && !r.error).length,
+    skipped: 0,
+    timedOut: results.filter((r) => r.timedOut).length,
+    averageScore,
+    totalDurationMs,
+  };
+
+  return {
+    id: runId,
+    timestamp: startedAt,
+    cases: results,
+    summary,
+    durationMs: totalDurationMs,
+    agent: options.agent,
+    rubricId,
+  };
+}
+
+/**
+ * Run a single case in a sandbox
+ */
+async function runSingleCase(
+  caseData: Case,
+  manager: ReturnType<typeof createSandboxManager>,
+  options: RunnerOptions,
+  caseIndex: number,
+  totalCases: number
+): Promise<CaseResult> {
+  const startTime = Date.now();
+
+  // Create a temporary directory for this case
+  const tempDir = fs.mkdtempSync(path.join(os.tmpdir(), `sniff-${caseData.id}-`));
+
+  try {
+    // Write case files to temp directory (if any)
+    if (caseData.files) {
+      for (const file of caseData.files) {
+        const filePath = path.join(tempDir, file.path);
+        const fileDir = path.dirname(filePath);
+
+        // Create directories if needed
+        fs.mkdirSync(fileDir, { recursive: true });
+        if (file.content !== undefined) {
+          fs.writeFileSync(filePath, file.content);
+        }
+      }
+    }
+
+    // Create sandbox
+    const sandboxConfig: SandboxConfig = {
+      workdir: tempDir,
+      image: getImageForLanguage(caseData.language),
+      timeoutSeconds: options.timeoutSeconds || 300,
+      networkEnabled: options.networkEnabled || false,
+    };
+
+    options.onProgress?.({
+      type: 'running',
+      caseId: caseData.id,
+      caseIndex,
+      totalCases,
+      message: 'Creating sandbox...',
+    });
+
+    const sandbox = await manager.create(sandboxConfig);
+
+    try {
+      // Install dependencies if needed
+      await installDependencies(sandbox, caseData.language, options, caseIndex, totalCases, caseData.id);
+
+      // Run the agent to attempt to solve the case
+      options.onProgress?.({
+        type: 'running',
+        caseId: caseData.id,
+        caseIndex,
+        totalCases,
+        message: 'Running agent...',
+      });
+
+      const agent = getAgent(options.agent);
+      const _agentResult: AgentResult = await agent.run(caseData.prompt, {
+        cwd: tempDir,
+        model: options.model,
+        timeoutMs: (options.timeoutSeconds || 300) * 1000,
+        permissionMode: 'acceptEdits',
+      });
+
+      if (!_agentResult.success) {
+        throw new Error(`Agent execution failed: ${_agentResult.error}`);
+      }
+
+      // Snapshot files the agent produced (before rubric evaluation)
+      const _agentFiles = snapshotFiles(tempDir, caseData.files);
+
+      // Evaluate using the rubric
+      options.onProgress?.({
+        type: 'validating',
+        caseId: caseData.id,
+        caseIndex,
+        totalCases,
+        message: 'Evaluating with rubric...',
+      });
+
+      const result = await evaluateWithRubric(caseData, sandbox, options, _agentResult, _agentFiles);
+      const durationMs = Date.now() - startTime;
+
+      options.onProgress?.({
+        type: 'complete',
+        caseId: caseData.id,
+        caseIndex,
+        totalCases,
+        message: result.passed ? `Passed (${Math.round(result.score)}%)` : `Failed (${Math.round(result.score)}%)`,
+      });
+
+      return {
+        ...result,
+        agentResponse: _agentResult.answer,
+        agentToolCalls: _agentResult.toolCalls.map((t) => ({
+          name: t.name,
+          durationMs: t.durationMs || 0,
+          success: t.success || false,
+        })),
+        agentModel: _agentResult.model,
+        agentTokens: _agentResult.tokens
+          ? {
+              input: _agentResult.tokens.inputTokens,
+              output: _agentResult.tokens.outputTokens,
+              total: _agentResult.tokens.totalTokens,
+            }
+          : undefined,
+        agentFiles: _agentFiles,
+        durationMs,
+        timestamp: new Date(),
+      };
+    } finally {
+      await sandbox.destroy();
+    }
+  } finally {
+    // Clean up temp directory
+    try {
+      fs.rmSync(tempDir, { recursive: true, force: true });
+    } catch {
+      // Ignore cleanup errors
+    }
+  }
+}
+
+/**
+ * Evaluate a case using its rubric
+ */
+async function evaluateWithRubric(
+  caseData: Case,
+  sandbox: Sandbox,
+  _options: RunnerOptions,
+  _agentResult: AgentResult,
+  _agentFiles: { path: string; content: string; changed: boolean }[]
+): Promise<CaseResult> {
+  const registry = getRubricRegistry();
+  const rubric = registry.resolve(caseData.rubric);
+
+  const criteriaResults: CriterionResult[] = [];
+  let totalWeightedScore = 0;
+  let _totalWeight = 0;
+  const evalStartTime = Date.now();
+
+  // Evaluate each criterion in the rubric
+  for (const [criterionKey, criterion] of Object.entries(rubric.criteria)) {
+    const evaluatorResults: EvaluatorResult[] = [];
+    let criterionScore = 0;
+    let evaluatorCount = 0;
+
+    for (const evaluator of criterion.evaluators) {
+      let evalResult: Omit<EvaluatorResult, 'name' | 'type' | 'durationMs'>;
+
+      if (evaluator.type === 'command') {
+        // Run command evaluator
+        const result = await sandbox.exec(evaluator.run, {
+          timeoutSeconds: 60,
+        });
+
+        const passed = result.exitCode === 0;
+        let score = passed ? 1.0 : 0.0;
+
+        // Handle partial credit
+        if (evaluator.partialCredit && !passed) {
+          // For test runners, try to parse pass/fail ratio
+          const testMatch = result.stdout.match(/(\d+) passed/);
+          const failMatch = result.stdout.match(/(\d+) failed/);
+          if (testMatch && failMatch) {
+            const passedTests = parseInt(testMatch[1], 10);
+            const failedTests = parseInt(failMatch[1], 10);
+            const total = passedTests + failedTests;
+            if (total > 0) {
+              score = passedTests / total;
+            }
+          }
+        }
+
+        evalResult = {
+          passed,
+          score,
+          evidence: (result.stdout + '\n' + result.stderr).trim(),
+          details: {
+            exitCode: result.exitCode,
+            timedOut: result.timedOut,
+          },
+        };
+      } else if (evaluator.type === 'pattern') {
+        // Run pattern evaluator (check for matches in files)
+        // Default to fail until fully implemented
+        evalResult = {
+          passed: false,
+          score: 0.0,
+          evidence: 'Pattern check not yet implemented',
+        };
+      } else if ((evaluator.type as any) === 'llm_judge' || (evaluator.type as any) === 'llm_judge_comparison') {
+        // Run LLM judge evaluator
+        // TODO: Implement baseline answer storage and comparison
+        // For now, use a placeholder evaluator
+        evalResult = {
+          passed: false,
+          score: 0.0,
+          evidence: 'LLM judge comparison not yet fully implemented',
+        };
+      } else {
+        // Other evaluator types (llm_judge, benchmark, etc.) - not implemented
+        evalResult = {
+          passed: false,
+          score: 0.0,
+          evidence: `Evaluator type '${evaluator.type}' not yet implemented`,
+        };
+      }
+      evaluatorResults.push({
+        name: evaluator.name || evaluator.type,
+        type: evaluator.type as EvaluatorType,
+        durationMs: Date.now() - evalStartTime,
+        ...evalResult,
+      });
+
+      if (!evaluator.optional) {
+        criterionScore += evalResult.score;
+        evaluatorCount++;
+      }
+    }
+
+    // Average score for this criterion
+    // If no non-optional evaluators ran, this criterion doesn't participate in scoring
+    const hasRequiredEvaluators = evaluatorCount > 0;
+    const rawScore = hasRequiredEvaluators ? criterionScore / evaluatorCount : 0.0;
+    const weightedScore = hasRequiredEvaluators ? (rawScore * criterion.weight) / 100 : 0;
+    const allPassed = evaluatorResults.filter((e) => !e.passed).length === 0;
+
+    criteriaResults.push({
+      name: criterionKey,
+      weight: criterion.weight,
+      score: rawScore,
+      passed: allPassed,
+      evidence: `Criterion: ${criterionKey}`,
+      evaluatorResults,
+      durationMs: Date.now() - evalStartTime,
+    });
+
+    totalWeightedScore += weightedScore;
+    // Only count weight for criteria that had non-optional evaluators
+    if (hasRequiredEvaluators) {
+      _totalWeight += criterion.weight;
+    }
+  }
+
+  // Normalize score by participating weight (criteria with only optional evaluators are excluded)
+  // Each criterion's weightedScore = rawScore * weight / 100, so totalWeightedScore
+  // is a fraction of 1.0 when all weights sum to 100. When some criteria are excluded,
+  // rescale so the participating criteria fill the full 0-100% range.
+  const participatingFraction = _totalWeight / 100;
+  const overallScore = participatingFraction > 0 ? (totalWeightedScore / participatingFraction) * 100 : 0;
+
+  // Determine pass/fail (default threshold: 70%)
+  const passThreshold = 70;
+  const passed = overallScore >= passThreshold;
+
+  const result: CaseResult = {
+    id: caseData.id,
+    title: caseData.title,
+    score: overallScore,
+    passed,
+    evidence: `Overall score: ${overallScore.toFixed(2)}%`,
+    criteria: criteriaResults,
+    evaluators: [],
+    durationMs: Date.now() - evalStartTime,
+    timestamp: new Date(),
+  };
+  return result;
+}
+
+/**
+ * Install dependencies based on language
+ */
+async function installDependencies(
+  sandbox: Sandbox,
+  language: string,
+  options: RunnerOptions,
+  caseIndex: number,
+  totalCases: number,
+  caseId: string
+): Promise<void> {
+  const langLower = language.toLowerCase();
+
+  options.onProgress?.({
+    type: 'running',
+    caseId,
+    caseIndex,
+    totalCases,
+    message: 'Installing dependencies...',
+  });
+
+  if (langLower === 'python') {
+    // Check for requirements.txt
+    const result = await sandbox.exec('test -f requirements.txt && pip install -r requirements.txt || true');
+    if (result.exitCode !== 0 && result.stderr) {
+      console.warn('Warning: pip install failed:', result.stderr);
+    }
+    // Also install pytest if running tests
+    await sandbox.exec('pip install pytest --quiet 2>/dev/null || true');
+  } else if (langLower === 'javascript' || langLower === 'typescript' || langLower === 'node') {
+    // Check for package.json
+    const result = await sandbox.exec('test -f package.json && npm install --silent || true');
+    if (result.exitCode !== 0 && result.stderr) {
+      console.warn('Warning: npm install failed:', result.stderr);
+    }
+  } else if (langLower === 'go' || langLower === 'golang') {
+    // Check for go.mod
+    await sandbox.exec('test -f go.mod && go mod download || true');
+  }
+}
+
+/**
+ * Snapshot all files in the workspace after the agent runs.
+ * Compares against the original case files to flag which ones changed.
+ * Reads directly from the host tempDir (bind-mounted into the sandbox).
+ */
+function snapshotFiles(
+  tempDir: string,
+  originalFiles?: CaseFile[]
+): { path: string; content: string; changed: boolean }[] {
+  const results: { path: string; content: string; changed: boolean }[] = [];
+  const origMap = new Map<string, string>();
+
+  // Build map of original file contents for comparison
+  if (originalFiles) {
+    for (const f of originalFiles) {
+      if (f.content !== undefined) {
+        origMap.set(f.path, f.content);
+      }
+    }
+  }
+
+  // Walk the temp directory and collect all files
+  function walk(dir: string, prefix: string) {
+    let entries: fs.Dirent[];
+    try {
+      entries = fs.readdirSync(dir, { withFileTypes: true });
+    } catch {
+      return;
+    }
+    for (const entry of entries) {
+      const relPath = prefix ? `${prefix}/${entry.name}` : entry.name;
+      const fullPath = path.join(dir, entry.name);
+
+      // Skip common non-essential directories
+      if (entry.isDirectory()) {
+        if (['node_modules', '.git', '__pycache__', '.pytest_cache', 'venv', '.venv'].includes(entry.name)) {
+          continue;
+        }
+        walk(fullPath, relPath);
+        continue;
+      }
+
+      if (!entry.isFile()) continue;
+
+      // Skip binary and large files
+      try {
+        const stat = fs.statSync(fullPath);
+        if (stat.size > 100_000) continue; // Skip files over 100KB
+      } catch {
+        continue;
+      }
+
+      try {
+        const content = fs.readFileSync(fullPath, 'utf-8');
+        const original = origMap.get(relPath);
+        const changed = original === undefined || original !== content;
+        results.push({ path: relPath, content, changed });
+      } catch {
+        // Skip files that can't be read as UTF-8
+      }
+    }
+  }
+
+  walk(tempDir, '');
+  return results;
+}


⚠️ Potential issue | 🟠 Major

Remove .bak file from the repository.

runner.ts.bak is a development artifact — it's a near-exact copy of runner.ts and should not be committed. It will cause confusion about which file is canonical and adds maintenance burden. Delete it before merging.

🤖 Prompt for AI Agents

In `@src/evaluation/runner.ts.bak` around lines 1 - 555, This PR includes a development artifact file named runner.ts.bak which duplicates the real runner implementation (functions like runCases, runSingleCase, evaluateWithRubric, snapshotFiles, installDependencies); remove runner.ts.bak from the commit and repository (git rm --cached or git rm and commit the deletion), ensure only the canonical src/evaluation/runner.ts remains, update the PR so no .bak file is present, and re-run tests/CI to confirm there are no duplicate-file import or build issues.

coderabbitai · 2026-02-16T17:10:52Z

src/evaluation/runner.ts.orig

+/**
+ * Evaluation runner - executes cases in sandboxes and evaluates results
+ *
+ * This is the core evaluation engine that:
+ * 1. Sets up the sandbox environment
+ * 2. Runs the case (agent attempts to solve the problem)
+ * 3. Applies the rubric to evaluate the result
+ */
+
+import * as fs from 'fs';
+import * as path from 'path';
+import * as os from 'os';
+import {
+  Case,
+  CaseFile,
+  CaseResult,
+  CriterionResult,
+  EvaluatorResult,
+  RunResult,
+  RunSummary,
+  EvaluatorType,
+} from '../cases/types';
+import { createSandboxManager, checkDocker, RECOMMENDED_IMAGES } from '../sandbox';
+import { Sandbox, SandboxConfig } from '../sandbox/types';
+import { getRubricRegistry } from '../rubrics/loader';
+import { getAgent } from '../agents/registry';
+import { runLLMJudgeEvaluator } from './llm-judge';
+import type { AgentResult } from '../agents/types';
+
+export interface RunnerOptions {
+  /** Agent being evaluated (for logging) */
+  agent: string;
+
+  /** Model to use (passed to agent) */
+  model?: string;
+
+  /** Timeout per case in seconds */
+  timeoutSeconds?: number;
+
+  /** Enable network in sandbox */
+  networkEnabled?: boolean;
+
+  /** Callback for progress updates */
+  onProgress?: (update: ProgressUpdate) => void;
+
+  /** Callback when a case completes */
+  onCaseComplete?: (result: CaseResult) => void;
+}
+
+export interface ProgressUpdate {
+  type: 'starting' | 'running' | 'validating' | 'complete' | 'error';
+  caseId: string;
+  caseIndex: number;
+  totalCases: number;
+  message?: string;
+}
+
+/**
+ * Get the appropriate Docker image for a language
+ */
+function getImageForLanguage(language: string): string {
+  const langLower = language.toLowerCase();
+
+  if (langLower === 'javascript' || langLower === 'typescript' || langLower === 'node') {
+    return RECOMMENDED_IMAGES.node.latest;
+  }
+  if (langLower === 'python') {
+    return RECOMMENDED_IMAGES.python.latest;
+  }
+  if (langLower === 'go' || langLower === 'golang') {
+    return RECOMMENDED_IMAGES.go.latest;
+  }
+  if (langLower === 'rust') {
+    return RECOMMENDED_IMAGES.rust.latest;
+  }
+  if (langLower === 'java') {
+    return RECOMMENDED_IMAGES.java.latest;
+  }
+
+  // Default to Node.js for unknown languages
+  return RECOMMENDED_IMAGES.node.latest;
+}
+
+/**
+ * Run a set of cases and return results
+ */
+export async function runCases(cases: Case[], options: RunnerOptions): Promise<RunResult> {
+  const runId = `run-${Date.now()}-${Math.random().toString(36).substring(2, 8)}`;
+  const startedAt = new Date();
+  const results: CaseResult[] = [];
+
+  // Check Docker availability first
+  const dockerStatus = await checkDocker();
+  if (!dockerStatus.available) {
+    throw new Error(`Docker is not available: ${dockerStatus.error}\n${dockerStatus.suggestion}`);
+  }
+
+  const manager = createSandboxManager();
+  let rubricId = 'default';
+
+  try {
+    for (let i = 0; i < cases.length; i++) {
+      const caseData = cases[i];
+
+      options.onProgress?.({
+        type: 'starting',
+        caseId: caseData.id,
+        caseIndex: i,
+        totalCases: cases.length,
+        message: `Starting ${caseData.title}`,
+      });
+
+      try {
+        const result = await runSingleCase(caseData, manager, options, i, cases.length);
+        results.push(result);
+        options.onCaseComplete?.(result);
+        // Track the rubric ID from the first case
+        if (i === 0) {
+          const registry = getRubricRegistry();
+          const rubric = registry.resolve(caseData.rubric);
+          rubricId = rubric.id;
+        }
+      } catch (err) {
+        const errorResult: CaseResult = {
+          id: caseData.id,
+          title: caseData.title,
+          score: 0,
+          passed: false,
+          evidence: (err as Error).message,
+          criteria: [],
+          evaluators: [],
+          durationMs: 0,
+          error: (err as Error).message,
+          timestamp: new Date(),
+        };
+        results.push(errorResult);
+        options.onCaseComplete?.(errorResult);
+      }
+    }
+  } finally {
+    // Clean up all sandboxes
+    await manager.destroyAll();
+  }
+
+  const completedAt = new Date();
+  const totalDurationMs = completedAt.getTime() - startedAt.getTime();
+
+  // Calculate summary
+  const scores = results.map((r) => r.score);
+  const averageScore = scores.length > 0 ? scores.reduce((a, b) => a + b, 0) / scores.length : 0;
+
+  const summary: RunSummary = {
+    total: results.length,
+    passed: results.filter((r) => r.passed).length,
+    failed: results.filter((r) => !r.passed && !r.error).length,
+    skipped: 0,
+    timedOut: results.filter((r) => r.timedOut).length,
+    averageScore,
+    totalDurationMs,
+  };
+
+  return {
+    id: runId,
+    timestamp: startedAt,
+    cases: results,
+    summary,
+    durationMs: totalDurationMs,
+    agent: options.agent,
+    rubricId,
+  };
+}
+
+/**
+ * Run a single case in a sandbox
+ */
+async function runSingleCase(
+  caseData: Case,
+  manager: ReturnType<typeof createSandboxManager>,
+  options: RunnerOptions,
+  caseIndex: number,
+  totalCases: number
+): Promise<CaseResult> {
+  const startTime = Date.now();
+
+  // Create a temporary directory for this case
+  const tempDir = fs.mkdtempSync(path.join(os.tmpdir(), `sniff-${caseData.id}-`));
+
+  try {
+    // Write case files to temp directory (if any)
+    if (caseData.files) {
+      for (const file of caseData.files) {
+        const filePath = path.join(tempDir, file.path);
+        const fileDir = path.dirname(filePath);
+
+        // Create directories if needed
+        fs.mkdirSync(fileDir, { recursive: true });
+        if (file.content !== undefined) {
+          fs.writeFileSync(filePath, file.content);
+        }
+      }
+    }
+
+    // Create sandbox
+    const sandboxConfig: SandboxConfig = {
+      workdir: tempDir,
+      image: getImageForLanguage(caseData.language),
+      timeoutSeconds: options.timeoutSeconds || 300,
+      networkEnabled: options.networkEnabled || false,
+    };
+
+    options.onProgress?.({
+      type: 'running',
+      caseId: caseData.id,
+      caseIndex,
+      totalCases,
+      message: 'Creating sandbox...',
+    });
+
+    const sandbox = await manager.create(sandboxConfig);
+
+    try {
+      // Install dependencies if needed
+      await installDependencies(sandbox, caseData.language, options, caseIndex, totalCases, caseData.id);
+
+      // Run the agent to attempt to solve the case
+      options.onProgress?.({
+        type: 'running',
+        caseId: caseData.id,
+        caseIndex,
+        totalCases,
+        message: 'Running agent...',
+      });
+
+      const agent = getAgent(options.agent);
+      const agentResult: AgentResult = await agent.run(caseData.prompt, {
+        cwd: tempDir,
+        model: options.model,
+        timeoutMs: (options.timeoutSeconds || 300) * 1000,
+        permissionMode: 'acceptEdits',
+      });
+
+      if (!agentResult.success) {
+        throw new Error(`Agent execution failed: ${agentResult.error}`);
+      }
+
+      // Snapshot files the agent produced (before rubric evaluation)
+      const agentFiles = snapshotFiles(tempDir, caseData.files);
+
+      // Evaluate using the rubric
+      options.onProgress?.({
+        type: 'validating',
+        caseId: caseData.id,
+        caseIndex,
+        totalCases,
+        message: 'Evaluating with rubric...',
+      });
+
+      const result = await evaluateWithRubric(caseData, sandbox, options, agentResult, agentFiles);
+      const durationMs = Date.now() - startTime;
+
+      options.onProgress?.({
+        type: 'complete',
+        caseId: caseData.id,
+        caseIndex,
+        totalCases,
+        message: result.passed ? `Passed (${Math.round(result.score)}%)` : `Failed (${Math.round(result.score)}%)`,
+      });
+
+      return {
+        ...result,
+        agentResponse: agentResult.answer,
+        agentToolCalls: agentResult.toolCalls.map((t) => ({
+          name: t.name,
+          durationMs: t.durationMs || 0,
+          success: t.success || false,
+        })),
+        agentModel: agentResult.model,
+        agentTokens: agentResult.tokens
+          ? {
+              input: agentResult.tokens.inputTokens,
+              output: agentResult.tokens.outputTokens,
+              total: agentResult.tokens.totalTokens,
+            }
+          : undefined,
+        agentFiles,
+        durationMs,
+        timestamp: new Date(),
+      };
+    } finally {
+      await sandbox.destroy();
+    }
+  } finally {
+    // Clean up temp directory
+    try {
+      fs.rmSync(tempDir, { recursive: true, force: true });
+    } catch {
+      // Ignore cleanup errors
+    }
+  }
+}
+
+/**
+ * Evaluate a case using its rubric
+ */
+async function evaluateWithRubric(
+  caseData: Case,
+  sandbox: Sandbox,
+  _options: RunnerOptions,
+  agentResult: AgentResult,
+  agentFiles: { path: string; content: string; changed: boolean }[]
+): Promise<CaseResult> {
+  const registry = getRubricRegistry();
+  const rubric = registry.resolve(caseData.rubric);
+
+  const criteriaResults: CriterionResult[] = [];
+  let totalWeightedScore = 0;
+  let _totalWeight = 0;
+
+  // Evaluate each criterion in the rubric
+  for (const [criterionKey, criterion] of Object.entries(rubric.criteria)) {
+    const evaluatorResults: EvaluatorResult[] = [];
+    let criterionScore = 0;
+    let evaluatorCount = 0;
+
+    for (const evaluator of criterion.evaluators) {
+      const evalStartTime = Date.now();
+      let evalResult: Omit<EvaluatorResult, 'name' | 'type' | 'durationMs'>;
+
+      if (evaluator.type === 'command') {
+        // Run command evaluator
+        const result = await sandbox.exec(evaluator.run, {
+          timeoutSeconds: 60,
+        });
+
+        const passed = result.exitCode === 0;
+        let score = passed ? 1.0 : 0.0;
+
+        // Handle partial credit
+        if (evaluator.partialCredit && !passed) {
+          // For test runners, try to parse pass/fail ratio
+          const testMatch = result.stdout.match(/(\d+) passed/);
+          const failMatch = result.stdout.match(/(\d+) failed/);
+          if (testMatch && failMatch) {
+            const passedTests = parseInt(testMatch[1], 10);
+            const failedTests = parseInt(failMatch[1], 10);
+            const total = passedTests + failedTests;
+            if (total > 0) {
+              score = passedTests / total;
+            }
+          }
+        }
+
+        evalResult = {
+          passed,
+          score,
+          evidence: (result.stdout + '\n' + result.stderr).trim(),
+          details: {
+            exitCode: result.exitCode,
+            timedOut: result.timedOut,
+          },
+        };
+      } else if (evaluator.type === 'pattern') {
+        // Run pattern evaluator (check for matches in files)
+        // Default to fail until fully implemented
+        evalResult = {
+          passed: false,
+          score: 0.0,
+          evidence: 'Pattern check not yet implemented',
+        };
+      } else if (evaluator.type === 'llm_judge') {
+        // Run LLM judge evaluator
+        const result = await runLLMJudgeEvaluator(evaluator, agentResult.answer, JSON.stringify(agentFiles));
+        evalResult = {
+          passed: result.passed,
+          score: result.score,
+          evidence: result.evidence,
+          details: result.details,
+        };
+      } else if ((evaluator.type as any) === 'llm_judge_comparison') {
+        // Run LLM judge comparison evaluator
+        // TODO: Implement baseline answer storage and comparison
+        // For now, use a placeholder evaluator
+        evalResult = {
+          passed: false,
+          score: 0.0,
+          evidence: 'LLM judge comparison not yet fully implemented',
+        };
+      } else {
+        // Other evaluator types (llm_judge, benchmark, etc.) - not implemented
+        evalResult = {
+          passed: false,
+          score: 0.0,
+          evidence: `Evaluator type '${evaluator.type}' not yet implemented`,
+        };
+      }
+
+      const evalDurationMs = Date.now() - evalStartTime;
+
+      evaluatorResults.push({
+        name: evaluator.name || evaluator.type,
+        type: evaluator.type as EvaluatorType,
+        durationMs: evalDurationMs,
+        ...evalResult,
+      });
+
+      if (!evaluator.optional) {
+        criterionScore += evalResult.score;
+        evaluatorCount++;
+      }
+    }
+
+    // Average score for this criterion
+    // If no non-optional evaluators ran, this criterion doesn't participate in scoring
+    const hasRequiredEvaluators = evaluatorCount > 0;
+    const rawScore = hasRequiredEvaluators ? criterionScore / evaluatorCount : 0.0;
+    const weightedScore = hasRequiredEvaluators ? (rawScore * criterion.weight) / 100 : 0;
+    const allPassed = evaluatorResults.filter((e) => !e.passed).length === 0;
+
+    criteriaResults.push({
+      name: criterionKey,
+      weight: criterion.weight,
+      score: rawScore,
+      passed: allPassed,
+      evidence: `Criterion: ${criterionKey}`,
+      evaluatorResults,
+      durationMs: evalDurationMs,
+    });
+
+    totalWeightedScore += weightedScore;
+    // Only count weight for criteria that had non-optional evaluators
+    if (hasRequiredEvaluators) {
+      _totalWeight += criterion.weight;
+    }
+  }
+
+  // Normalize score by participating weight (criteria with only optional evaluators are excluded)
+  // Each criterion's weightedScore = rawScore * weight / 100, so totalWeightedScore
+  // is a fraction of 1.0 when all weights sum to 100. When some criteria are excluded,
+  // rescale so the participating criteria fill the full 0-100% range.
+  const participatingFraction = _totalWeight / 100;
+  const overallScore = participatingFraction > 0 ? (totalWeightedScore / participatingFraction) * 100 : 0;
+
+  // Determine pass/fail (default threshold: 70%)
+  const passThreshold = 70;
+  const passed = overallScore >= passThreshold;
+
+  return {
+    id: caseData.id,
+    title: caseData.title,
+    score: overallScore,
+    passed,
+    evidence: `Overall score: ${overallScore.toFixed(2)}%`,
+    criteria: criteriaResults,
+    evaluators: [],
+    durationMs: Date.now() - evalStartTime,
+    timestamp: new Date(),
+  };
+}
+
+/**
+ * Install dependencies based on language
+ */
+async function installDependencies(
+  sandbox: Sandbox,
+  language: string,
+  options: RunnerOptions,
+  caseIndex: number,
+  totalCases: number,
+  caseId: string
+): Promise<void> {
+  const langLower = language.toLowerCase();
+
+  options.onProgress?.({
+    type: 'running',
+    caseId,
+    caseIndex,
+    totalCases,
+    message: 'Installing dependencies...',
+  });
+
+  if (langLower === 'python') {
+    // Check for requirements.txt
+    const result = await sandbox.exec('test -f requirements.txt && pip install -r requirements.txt || true');
+    if (result.exitCode !== 0 && result.stderr) {
+      console.warn('Warning: pip install failed:', result.stderr);
+    }
+    // Also install pytest if running tests
+    await sandbox.exec('pip install pytest --quiet 2>/dev/null || true');
+  } else if (langLower === 'javascript' || langLower === 'typescript' || langLower === 'node') {
+    // Check for package.json
+    const result = await sandbox.exec('test -f package.json && npm install --silent || true');
+    if (result.exitCode !== 0 && result.stderr) {
+      console.warn('Warning: npm install failed:', result.stderr);
+    }
+  } else if (langLower === 'go' || langLower === 'golang') {
+    // Check for go.mod
+    await sandbox.exec('test -f go.mod && go mod download || true');
+  }
+}
+
+/**
+ * Snapshot all files in the workspace after the agent runs.
+ * Compares against the original case files to flag which ones changed.
+ * Reads directly from the host tempDir (bind-mounted into the sandbox).
+ */
+function snapshotFiles(
+  tempDir: string,
+  originalFiles?: CaseFile[]
+): { path: string; content: string; changed: boolean }[] {
+  const results: { path: string; content: string; changed: boolean }[] = [];
+  const origMap = new Map<string, string>();
+
+  // Build map of original file contents for comparison
+  if (originalFiles) {
+    for (const f of originalFiles) {
+      if (f.content !== undefined) {
+        origMap.set(f.path, f.content);
+      }
+    }
+  }
+
+  // Walk the temp directory and collect all files
+  function walk(dir: string, prefix: string) {
+    let entries: fs.Dirent[];
+    try {
+      entries = fs.readdirSync(dir, { withFileTypes: true });
+    } catch {
+      return;
+    }
+    for (const entry of entries) {
+      const relPath = prefix ? `${prefix}/${entry.name}` : entry.name;
+      const fullPath = path.join(dir, entry.name);
+
+      // Skip common non-essential directories
+      if (entry.isDirectory()) {
+        if (['node_modules', '.git', '__pycache__', '.pytest_cache', 'venv', '.venv'].includes(entry.name)) {
+          continue;
+        }
+        walk(fullPath, relPath);
+        continue;
+      }
+
+      if (!entry.isFile()) continue;
+
+      // Skip binary and large files
+      try {
+        const stat = fs.statSync(fullPath);
+        if (stat.size > 100_000) continue; // Skip files over 100KB
+      } catch {
+        continue;
+      }
+
+      try {
+        const content = fs.readFileSync(fullPath, 'utf-8');
+        const original = origMap.get(relPath);
+        const changed = original === undefined || original !== content;
+        results.push({ path: relPath, content, changed });
+      } catch {
+        // Skip files that can't be read as UTF-8
+      }
+    }
+  }
+
+  walk(tempDir, '');
+  return results;
+}


⚠️ Potential issue | 🟠 Major

Remove .orig file from the repository — this is a development artifact.

Same concern as the .bak files. Use git history to track prior versions.

🤖 Prompt for AI Agents

In `@src/evaluation/runner.ts.orig` around lines 1 - 566, Remove this development artifact file (src/evaluation/runner.ts.orig) from the repository and ensure the canonical implementation (e.g., src/evaluation/runner.ts) is the one tracked; delete the .orig file in the PR, remove any imports or references to it, and verify key symbols such as runCases, runSingleCase, evaluateWithRubric, snapshotFiles, and createSandboxManager are only present in the intended runner.ts to avoid duplicate definitions.

coderabbitai · 2026-02-16T17:10:52Z

src/evaluation/runner.ts.orig

+      } else if (evaluator.type === 'llm_judge') {
+        // Run LLM judge evaluator
+        const result = await runLLMJudgeEvaluator(evaluator, agentResult.answer, JSON.stringify(agentFiles));
+        evalResult = {
+          passed: result.passed,
+          score: result.score,
+          evidence: result.evidence,
+          details: result.details,
+        };
+      } else if ((evaluator.type as any) === 'llm_judge_comparison') {
+        // Run LLM judge comparison evaluator
+        // TODO: Implement baseline answer storage and comparison
+        // For now, use a placeholder evaluator
+        evalResult = {
+          passed: false,
+          score: 0.0,
+          evidence: 'LLM judge comparison not yet fully implemented',
+        };
+      } else {
+        // Other evaluator types (llm_judge, benchmark, etc.) - not implemented
+        evalResult = {
+          passed: false,
+          score: 0.0,
+          evidence: `Evaluator type '${evaluator.type}' not yet implemented`,
+        };
+      }


⚠️ Potential issue | 🟡 Minor

LLM judge evaluator integration: runLLMJudgeEvaluator passes JSON.stringify(agentFiles) as context.

At line 372, the entire agent file snapshot (up to 100KB per file) is serialized and sent as context to the LLM judge. This could produce very large prompts, driving up API costs significantly. Consider truncating or summarizing the file contents before sending.

🤖 Prompt for AI Agents

In `@src/evaluation/runner.ts.orig` around lines 370 - 395, runLLMJudgeEvaluator is being fed JSON.stringify(agentFiles) which can produce very large prompts and high API costs; replace that direct serialization by preprocessing agentFiles with a function (e.g., summarizeOrTruncateAgentFiles or buildAgentFilesContext) that for each file produces either a truncated snippet (limit bytes/characters per file) or a short summary (first N lines + metadata or a condensed summary) and also enforces a total context size cap, then pass JSON.stringify of that reduced context to runLLMJudgeEvaluator (update the caller in runner.ts where agentFiles is used and add/implement the new summarization/truncation helper).

coderabbitai · 2026-02-16T17:10:52Z

src/evaluation/runner.ts.orig

+    criteriaResults.push({
+      name: criterionKey,
+      weight: criterion.weight,
+      score: rawScore,
+      passed: allPassed,
+      evidence: `Criterion: ${criterionKey}`,
+      evaluatorResults,
+      durationMs: evalDurationMs,
+    });


⚠️ Potential issue | 🔴 Critical

evalDurationMs is out of scope — will cause a compilation/runtime error.

evalDurationMs is declared with const at line 397 inside the evaluator for loop (lines 325–410), but referenced at line 426 outside that loop. This is a block-scoping error.

Similarly, evalStartTime (declared at line 326 inside the evaluator loop) is referenced at line 455 outside both loops.

🐛 Proposed fix

Track criterion-level timing separately:

for (const [criterionKey, criterion] of Object.entries(rubric.criteria)) { const evaluatorResults: EvaluatorResult[] = []; let criterionScore = 0; let evaluatorCount = 0; + const criterionStartTime = Date.now(); for (const evaluator of criterion.evaluators) { const evalStartTime = Date.now(); ... } ... criteriaResults.push({ name: criterionKey, weight: criterion.weight, score: rawScore, passed: allPassed, evidence: `Criterion: ${criterionKey}`, evaluatorResults, - durationMs: evalDurationMs, + durationMs: Date.now() - criterionStartTime, });

And for the return statement at line 455, track the function start time:

async function evaluateWithRubric( ... ): Promise<CaseResult> { + const rubricStartTime = Date.now(); const registry = getRubricRegistry(); ... return { ... - durationMs: Date.now() - evalStartTime, + durationMs: Date.now() - rubricStartTime, timestamp: new Date(), };

🤖 Prompt for AI Agents

In `@src/evaluation/runner.ts.orig` around lines 419 - 427, The code references evalDurationMs and evalStartTime out of their block scope; move timing so criterion-level and function-level timers are declared outside their respective loops: before the evaluator for-loop, declare a criterion-level start timestamp (e.g., criterionStartTime) and after the evaluator loop compute criterionDurationMs and use that in the criteriaResults push (replace evalDurationMs), and at the top of the run/evaluation function declare a function-level start timestamp (e.g., runStartTime) and compute totalDurationMs for the final return (replace evalStartTime usage); update references to evalDurationMs and evalStartTime to these new symbols and ensure they are computed after the loops complete.

jharris1679 added 30 commits February 16, 2026 02:15

ralph: work on #29 (iter 1)

396e04c

ralph: work on #29 (iter 2)

5ff7b67

ralph: work on #29 (iter 3)

90afa8e

ralph: work on #29 (iter 4)

8db2cf9

ralph: work on #29 (iter 5)

dca0a62

ralph: work on #29 (iter 6)

5e68155

ralph: work on #29 (iter 7)

591e66a

ralph: work on #29 (iter 8)

1907aee

fix: resolve syntax errors in llm-judge.ts and runner.ts (#29)

3c15eea

ralph: work on #29 (iter 10)

bfd1865

ralph: work on #29 (iter 11)

fe5fdbe

ralph: work on #29 (iter 14)

51cc8a0

fix: resolve TypeScript build errors in llm-judge and runner (#29)

accb89b

ralph: work on #29 (iter 17)

202890f

ralph: work on #29 (iter 24)

ba89155

ralph: work on #29 (iter 25)

731e6ef

ralph: work on #29 (iter 26)

ba82825

ralph: work on #29 (iter 27)

8aa5584

ralph: work on #29 (iter 28)

d3124fe

ralph: work on #29 (iter 29)

a356603

ralph: work on #29 (iter 30)

22968cb

ralph: work on #29 (iter 31)

f8a4c81

fix: resolve TypeScript type errors in opencode agent (#29)

e8dbf84

ralph: work on #29 (iter 33)

fead796

ralph: work on #29 (iter 35)

48d3025

ralph: work on #29 (iter 36)

9ce33a0

ralph: work on #29 (iter 37)

58596fd

ralph: work on #29 (iter 38)

b5825e1

ralph: work on #29 (iter 41)

57d57fb

jharris1679 added 9 commits February 16, 2026 10:06

ralph: work on #29 (iter 42)

49343b6

ralph: work on #29 (iter 43)

0837c90

ralph: work on #29 (iter 46)

5e5c120

ralph: work on #29 (iter 48)

3573088

ralph: work on #29 (iter 49)

0e61dac

ralph: work on #29 (iter 50)

17db07f

ralph: work on #29 (iter 51)

a6df782

ralph: work on #29 (iter 52)

e92fea0

ralph: work on #29 (iter 53)

a039b62

greptile-apps bot reviewed Feb 16, 2026

View reviewed changes

coderabbitai bot reviewed Feb 16, 2026

View reviewed changes

		\| 'llm_judge_comparison' // Use LLM to compare two answers
		\| 'agent_behavior'; // Evaluate agent behavior metrics

Comments

Conversation

jharris1679 commented Feb 16, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issue

Status: ✓ verified

Summary

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jharris1679 commented Feb 16, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 16, 2026 •

edited

Loading