ralph: #22 — Difficulty Ladder System — add easy/medium/hard scaffolding levels to cases#55

Open
jharris1679 wants to merge 3 commits into main from ralph/issue-22

Conversation

@jharris1679 jharris1679 commented Feb 16, 2026

Issue

Closes #22

Status: ✓ verified

Build, tests, and lint all pass locally.

Summary

Automated implementation by Ralph (rlmkit + MiniMax M2.5).
Review the changes carefully — this was generated by a local model.

Summary by CodeRabbit

New Features

  • Difficulty Ladder System: Cases now support multi-level scaffolding variants. Each case can define easy, medium, and hard difficulty levels with customizable hints, additional files, solution visibility, and notes to guide learners through progressive challenge levels.


coderabbitai bot commented Feb 16, 2026

Walkthrough

Introduces a Difficulty Ladder system that generates scaffolded variants of a base case at easy, medium, and hard difficulty levels, each with customizable hints, additional files, and solution visibility. Includes a fluent builder API and default scaffolding templates.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Difficulty Ladder System: src/cases/difficulty-ladder.ts | Adds ScaffoldingLevel and DifficultyLadder interfaces, core generation functions (generateDifficultyVariants, getScaffoldingLevel, hasDifficultyLadder, createVariant), DifficultyLadderBuilder class with fluent API (withEasy, withMedium, withHard), and default scaffolding strategies (defaultEasyScaffolding, defaultMediumScaffolding, defaultHardScaffolding, createDefaultLadder). |
| Type Extensions: src/cases/types.ts | Imports ScaffoldingLevel type; extends Case interface with optional difficultyLadder field; documents the DifficultyLadder construct for integration with case schema. |
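
To make the fluent API summarized above more concrete, here is a minimal sketch of how such a builder might be used. Only the method names withEasy, withMedium, and withHard come from this PR; the class name LadderBuilderSketch, the hints field name, and the argument shapes are assumptions for illustration and may not match src/cases/difficulty-ladder.ts.

```typescript
// Trimmed-down, hypothetical shapes; the real types live in src/cases/.
type CaseDifficulty = 'easy' | 'medium' | 'hard';

interface ScaffoldingLevel {
  level: CaseDifficulty;
  hints?: string[]; // field name is an assumption
  showSolution?: boolean;
  notes?: string;
}

type LevelInput = Omit<ScaffoldingLevel, 'level'>;

// Hypothetical stand-in for DifficultyLadderBuilder.
class LadderBuilderSketch {
  private levels: Partial<Record<CaseDifficulty, ScaffoldingLevel>> = {};

  withEasy(input: LevelInput): this {
    this.levels.easy = { ...input, level: 'easy' };
    return this;
  }

  withMedium(input: LevelInput): this {
    this.levels.medium = { ...input, level: 'medium' };
    return this;
  }

  withHard(input: LevelInput): this {
    this.levels.hard = { ...input, level: 'hard' };
    return this;
  }

  // Returns a copy so later builder calls cannot mutate the result.
  build(): Partial<Record<CaseDifficulty, ScaffoldingLevel>> {
    return { ...this.levels };
  }
}

// Each call returns the builder itself, which is what makes chaining work.
const sketchLevels = new LadderBuilderSketch()
  .withEasy({ hints: ['Start in src/parser.ts', 'Check the null branch'] })
  .withMedium({ hints: ['The bug is in input handling'] })
  .withHard({})
  .build();

console.log(Object.keys(sketchLevels));
```

Returning this from each with* method keeps the chain typed correctly even if the builder is later subclassed.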

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A ladder tall with rungs so fine,
Easy, medium, hard—they all align!
Each case now wears three different hats,
More scaffolds here, fewer details there,
Sweet variants bloom everywhere! 🌟

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ⚠️ Warning | PR implements the schema design and scaffolding generation for difficulty variants [#22], but does not address CLI integration (--difficulty flag) or performance metrics tracking requirements. | Future PRs should complete the remaining tasks: add --difficulty flag to sniff run CLI and implement per-difficulty performance metrics tracking. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title accurately and concisely describes the main change: implementing a difficulty ladder system with scaffolding levels for cases. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing the difficulty ladder system schema and generation logic as specified in issue #22; no unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Merge Conflict Detection | ✅ Passed | ✅ No merge conflicts detected when merging into main |



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@src/cases/difficulty-ladder.ts`:
- Around line 96-106: The variant currently spreads ...baseCase which pulls in
baseCase.difficultyLadder (causing nested/bloated variants) and then
unconditionally sets notes to level.notes (overwriting baseCase.notes with
undefined); to fix, exclude difficultyLadder when building the variant (e.g.,
destructure baseCase to remove difficultyLadder before spreading) and set notes
using a nullish fallback like level.notes ?? baseCase.notes so existing notes
are preserved; update the construction of variant (the object built from
baseCase and fields id, title, prompt, files, difficulty, solution, notes)
accordingly.
- Around line 182-188: The build() method is unsafely casting this.levels
(Partial<Record<CaseDifficulty, ScaffoldingLevel>>) to a full Record which can
produce undefined at runtime; update build() in DifficultyLadderBuilder to
validate that all CaseDifficulty keys are present in this.levels and throw a
clear error if any are missing. CaseDifficulty is a string-literal union, not an
enum, so iterate over an explicit ['easy','medium','hard'] array rather than
Object.values(CaseDifficulty). Then return the object with levels typed as
Record<CaseDifficulty, ScaffoldingLevel>; alternatively, if you prefer not to
enforce presence, change the DifficultyLadder.levels type to
Partial<Record<CaseDifficulty, ScaffoldingLevel>> so the public API reflects
optional entries. Pick one of these two fixes and apply it to the build()
implementation and DifficultyLadder type accordingly.

In `@src/cases/types.ts`:
- Around line 47-56: You have two identical exported interfaces named
DifficultyLadder (in types.ts and difficulty-ladder.ts) and the current
DifficultyLadder.base is a self-referential Case which causes circular nesting;
remove the duplicate export from difficulty-ladder.ts and keep the single source
of truth in types.ts, then change DifficultyLadder.base from type Case to a
lightweight identifier (e.g., baseId: string) and update all usages that
reference DifficultyLadder.base (including any constructors, serializers,
deserializers, and the Case type definition that currently has
difficultyLadder?: DifficultyLadder) to use baseId or resolve the Case by id
where needed; ensure imports/reference sites (functions/classes named
DifficultyLadder, Case, and any methods in difficulty-ladder.ts that manipulate
ladders) are updated to the consolidated type and that serialization no longer
nests full Case objects.
- Line 1: There’s a circular import and duplicated type definitions: move the
ScaffoldingLevel type and the DifficultyLadder type definition into the shared
types file (where Case, CaseFile, CaseDifficulty live), remove the duplicate
DifficultyLadder from difficulty-ladder.ts, and stop importing ScaffoldingLevel
from difficulty-ladder.ts in types.ts; then update difficulty-ladder.ts to
import ScaffoldingLevel and DifficultyLadder from the consolidated types module,
alongside its existing imports of Case/CaseFile/CaseDifficulty, so the
dependency runs one way only (difficulty-ladder.ts depends on types.ts, never
the reverse), ensuring all usages reference the single definitions
ScaffoldingLevel and DifficultyLadder in the types file.

In `@src/cases/types.ts.bak`:
- Around line 1-541: Remove the backup file src/cases/types.ts.bak from the repo
and ensure future .bak files are ignored: delete the file from git (e.g., git rm
src/cases/types.ts.bak, or git rm --cached src/cases/types.ts.bak to keep a
local copy) and add the pattern "*.bak" to the repository .gitignore; verify no
other .bak artifacts remain (search for files ending in .bak) and commit the
.gitignore change so the backup won't be re-added.
🧹 Nitpick comments (5)
src/cases/types.ts (1)

128-129: Missing JSDoc on difficultyLadder field.

Every other field on Case has a JSDoc comment, but this one only has a bare // comment. Use /** ... */ for consistency so tooling surfaces the documentation.

Proposed fix
-  // Difficulty ladder configuration for generating variants
-  difficultyLadder?: DifficultyLadder;
+  /** Difficulty ladder configuration for generating variants */
+  difficultyLadder?: DifficultyLadder;
src/cases/difficulty-ladder.ts (4)

17-32: ScaffoldingLevel.level duplicates CaseDifficulty.

The level field is typed as the literal union 'easy' | 'medium' | 'hard', which is exactly CaseDifficulty. Reuse the existing type alias for consistency and to avoid drift if levels are added later.

Proposed fix
 export interface ScaffoldingLevel {
   /** How much context is provided */
-  level: 'easy' | 'medium' | 'hard';
+  level: CaseDifficulty;

55-71: generateDifficultyVariants could be simplified with a loop.

The three sequential createVariant calls are repetitive. A loop over the difficulty levels would be more concise and automatically adapt if levels change in the future.

Proposed refactor
 export function generateDifficultyVariants(baseCase: Case): Case[] {
-  const variants: Case[] = [];
-
-  // Create easy variant
-  const easyVariant = createVariant(baseCase, 'easy');
-  if (easyVariant) variants.push(easyVariant);
-
-  // Create medium variant
-  const mediumVariant = createVariant(baseCase, 'medium');
-  if (mediumVariant) variants.push(mediumVariant);
-
-  // Create hard variant
-  const hardVariant = createVariant(baseCase, 'hard');
-  if (hardVariant) variants.push(hardVariant);
-
-  return variants;
+  const difficulties: CaseDifficulty[] = ['easy', 'medium', 'hard'];
+  return difficulties
+    .map((d) => createVariant(baseCase, d))
+    .filter((v): v is Case => v !== null);
 }

199-239: Default scaffolding hints don't differentiate context — they're just generic messages.

Per the issue requirements, the difficulty levels should provide varying amounts of context/scaffolding:

  • Easy: "relevant files identified, hints provided"
  • Hard: "agent must discover files and approach"

Currently all three defaults return similar generic encouragement strings and none of them use _baseCase to vary additionalFiles or tailor hints to the actual task. The "easy" default doesn't add any additional files or identify relevant files, and the "hard" default actually provides more hints (3) than "medium" (1), which inverts the intended scaffolding gradient.

Consider aligning the defaults with the issue spec: easy should include additionalFiles (e.g., solution stubs, relevant file pointers), medium should be minimal, and hard should strip hints entirely.
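
One way the gradient described above could look is sketched below. This is not the PR's implementation: the function names, the hints field, the RELEVANT_FILES.md path, and the hint wording are all hypothetical. The only intent is to show easy naming the relevant files, medium giving a single vague pointer, and hard giving nothing.

```typescript
type CaseDifficulty = 'easy' | 'medium' | 'hard';

interface CaseFile {
  path: string;
  content?: string;
}

interface ScaffoldingLevel {
  level: CaseDifficulty;
  hints?: string[]; // field name is an assumption
  additionalFiles?: CaseFile[];
}

// Only the part of Case these sketches need.
interface BaseCaseSketch {
  files?: CaseFile[];
}

// Easy: name every starting file and ship a pointer file alongside them.
function easyScaffoldingSketch(baseCase: BaseCaseSketch): ScaffoldingLevel {
  const paths = (baseCase.files ?? []).map((f) => f.path);
  return {
    level: 'easy',
    hints: paths.map((p) => `Relevant file: ${p}`),
    additionalFiles:
      paths.length > 0
        ? [{ path: 'RELEVANT_FILES.md', content: paths.join('\n') }]
        : [],
  };
}

// Medium: say how many files matter, but not which ones.
function mediumScaffoldingSketch(baseCase: BaseCaseSketch): ScaffoldingLevel {
  const count = baseCase.files?.length ?? 0;
  return {
    level: 'medium',
    hints: count > 0 ? [`${count} starting file(s) are relevant to the fix.`] : [],
  };
}

// Hard: no hints, no extra files; the agent must discover everything.
function hardScaffoldingSketch(_baseCase: BaseCaseSketch): ScaffoldingLevel {
  return { level: 'hard', hints: [] };
}
```

With this shape the number of hints never increases from easy to hard, which is the property the current defaults violate.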


114-116: Prefer ?? over || for nullish coalescing.

|| will also coalesce on falsy values (which is fine here since the value is an object or undefined), but ?? communicates intent more clearly and is the idiomatic choice in modern TypeScript.

Proposed fix
-  return baseCase.difficultyLadder?.levels[difficulty] || null;
+  return baseCase.difficultyLadder?.levels[difficulty] ?? null;
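
For readers less familiar with the difference, the following standalone sketch shows where the two operators diverge. The retryLimits table is a made-up example unrelated to this PR; it exists only because the divergence needs a legitimately falsy value such as 0.

```typescript
// Hypothetical lookup table with a legitimately falsy entry.
const retryLimits: Record<string, number | undefined> = { easy: 0, medium: 2 };

// || falls back on any falsy value, so the stored 0 is lost.
const withOr = retryLimits['easy'] || 5; // 5

// ?? falls back only on null or undefined, so the stored 0 survives.
const withNullish = retryLimits['easy'] ?? 5; // 0

// A missing key is undefined, so both operators agree here.
const missingKey = retryLimits['hard'] ?? 5; // 5

console.log(withOr, withNullish, missingKey);
```

In getScaffoldingLevel the looked-up value is an object or undefined, so both operators behave the same there; the change is about signalling intent.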

Comment on lines +96 to +106
const variant: Case = {
...baseCase,
id: `${baseCase.id}-${difficulty}`,
title: `${baseCase.title} (${difficulty})`,
prompt,
files: allFiles,
difficulty,
// Don't include solution in variants (unless explicitly requested)
solution: level.showSolution ? baseCase.solution : undefined,
notes: level.notes,
};

⚠️ Potential issue | 🟡 Minor

Variant inherits difficultyLadder from base and silently drops notes.

Two issues with the spread + overrides:

  1. ...baseCase copies the difficultyLadder property into every variant. Variants shouldn't carry the ladder configuration — it creates unnecessary circular nesting and bloats serialized output.
  2. notes: level.notes explicitly sets notes even when level.notes is undefined, which overwrites whatever baseCase.notes contained (the spread would have preserved it). Use a conditional or nullish-coalescing pattern instead.
Proposed fix
   const variant: Case = {
     ...baseCase,
     id: `${baseCase.id}-${difficulty}`,
     title: `${baseCase.title} (${difficulty})`,
     prompt,
     files: allFiles,
     difficulty,
-    // Don't include solution in variants (unless explicitly requested)
     solution: level.showSolution ? baseCase.solution : undefined,
-    notes: level.notes,
+    notes: level.notes ?? baseCase.notes,
+    difficultyLadder: undefined,  // variants shouldn't carry the ladder
   };
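
An alternative to assigning difficultyLadder: undefined is to destructure the property away before spreading, so the key is absent from the variant rather than present with an undefined value. The sketch below uses a cut-down CaseSketch type and a hypothetical toVariantSketch helper; the real createVariant also handles prompt, files, and solution.

```typescript
// Cut-down stand-in for the Case interface (assumption: only these fields matter here).
interface CaseSketch {
  id: string;
  title: string;
  notes?: string;
  difficultyLadder?: { autoGenerate?: boolean };
}

// Hypothetical helper showing only the spread and notes handling.
function toVariantSketch(
  baseCase: CaseSketch,
  difficulty: string,
  levelNotes?: string,
): CaseSketch {
  // Pull difficultyLadder out so the spread below never copies it.
  const { difficultyLadder: _omitted, ...rest } = baseCase;
  return {
    ...rest,
    id: `${baseCase.id}-${difficulty}`,
    title: `${baseCase.title} (${difficulty})`,
    // Fall back to the base notes when the level does not define any.
    notes: levelNotes ?? baseCase.notes,
  };
}

const easyVariantSketch = toVariantSketch(
  { id: 'parser-001', title: 'Fix the parser', notes: 'base notes', difficultyLadder: {} },
  'easy',
);
console.log('difficultyLadder' in easyVariantSketch); // false
```

Both approaches serialize identically with JSON.stringify, which skips undefined properties; the destructuring form additionally keeps the key out of Object.keys and in checks.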

Comment on lines +182 to +188
build(): DifficultyLadder {
return {
base: this.base,
levels: this.levels as Record<CaseDifficulty, ScaffoldingLevel>,
autoGenerate: true,
};
}

⚠️ Potential issue | 🟠 Major

Unsafe cast: Partial<Record<...>> → Record<...> can produce undefined at runtime.

this.levels is Partial<Record<CaseDifficulty, ScaffoldingLevel>>, so any subset of keys may be set. The as Record<...> cast on line 185 silences the compiler, but consumers of DifficultyLadder.levels (e.g., levels['medium']) will get undefined at runtime when the key wasn't set — violating the non-optional Record contract.

Either validate that all three levels are present and throw if not, or change DifficultyLadder.levels to Partial<Record<...>> so the type is honest.

Option A: Validate at build time
   build(): DifficultyLadder {
+    const required: CaseDifficulty[] = ['easy', 'medium', 'hard'];
+    for (const d of required) {
+      if (!this.levels[d]) {
+        throw new Error(`Missing scaffolding level: ${d}`);
+      }
+    }
     return {
       base: this.base,
       levels: this.levels as Record<CaseDifficulty, ScaffoldingLevel>,
       autoGenerate: true,
     };
   }
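
The second option (making the type honest instead of validating) is not spelled out above, so here is a rough sketch of it. PartialLadderSketch, levelOrNull, and definedDifficulties are hypothetical names, and the interfaces are reduced to what the example needs.

```typescript
type CaseDifficulty = 'easy' | 'medium' | 'hard';

interface ScaffoldingLevel {
  level: CaseDifficulty;
  notes?: string;
}

// Option B: the type admits that some levels may be missing.
interface PartialLadderSketch {
  levels: Partial<Record<CaseDifficulty, ScaffoldingLevel>>;
}

// Callers get an explicit null instead of an unexpected undefined.
function levelOrNull(
  ladder: PartialLadderSketch,
  difficulty: CaseDifficulty,
): ScaffoldingLevel | null {
  return ladder.levels[difficulty] ?? null;
}

// Lists only the difficulties that were actually configured, in ladder order.
function definedDifficulties(ladder: PartialLadderSketch): CaseDifficulty[] {
  const order: CaseDifficulty[] = ['easy', 'medium', 'hard'];
  return order.filter((d) => ladder.levels[d] !== undefined);
}

const easyOnlyLadder: PartialLadderSketch = {
  levels: { easy: { level: 'easy', notes: 'files listed' } },
};
console.log(definedDifficulties(easyOnlyLadder)); // ['easy']
```

With this shape the compiler forces every consumer of levels[...] to handle the missing case, so no cast is needed in build().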

@@ -1,3 +1,4 @@
import type { ScaffoldingLevel } from './difficulty-ladder';

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's check the imports structure
echo "=== types.ts imports from difficulty-ladder ==="
rg "from.*difficulty-ladder" src/cases/types.ts

echo ""
echo "=== difficulty-ladder.ts imports from types ==="
rg "from.*types" src/cases/difficulty-ladder.ts

Repository: AnswerLayer/sniffbench

Length of output: 281


🏁 Script executed:

# Check if both files exist and get their contents
echo "=== Checking src/cases/types.ts imports section ==="
head -20 src/cases/types.ts

echo ""
echo "=== Checking src/cases/difficulty-ladder.ts imports section ==="
head -20 src/cases/difficulty-ladder.ts

Repository: AnswerLayer/sniffbench

Length of output: 1460


🏁 Script executed:

# Search for DifficultyLadder definitions in both files
echo "=== DifficultyLadder in types.ts ==="
rg -A 10 "interface DifficultyLadder|type DifficultyLadder" src/cases/types.ts

echo ""
echo "=== DifficultyLadder in difficulty-ladder.ts ==="
rg -A 10 "interface DifficultyLadder|type DifficultyLadder" src/cases/difficulty-ladder.ts

Repository: AnswerLayer/sniffbench

Length of output: 697


🏁 Script executed:

# Check what ScaffoldingLevel is and where it's defined/used
echo "=== ScaffoldingLevel definition ==="
rg -B 2 -A 5 "ScaffoldingLevel" src/cases/difficulty-ladder.ts | head -30

echo ""
echo "=== Uses of ScaffoldingLevel in types.ts ==="
rg "ScaffoldingLevel" src/cases/types.ts

Repository: AnswerLayer/sniffbench

Length of output: 1121


Circular dependency and duplicate type definitions between types.ts and difficulty-ladder.ts.

types.ts imports ScaffoldingLevel from difficulty-ladder.ts, while difficulty-ladder.ts imports Case, CaseFile, and CaseDifficulty from types.ts. Additionally, DifficultyLadder is defined identically in both files, creating unnecessary duplication.

Move ScaffoldingLevel and DifficultyLadder into types.ts—both are pure type definitions that belong alongside the rest of the schema. This eliminates the circular dependency and consolidates the duplicate definitions. difficulty-ladder.ts would then depend only on types.ts with no reverse dependency.
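
A rough sketch of the consolidated layout follows, collapsed into one block so it can run on its own. The interfaces are reduced to a few fields and the body of hasDifficultyLadder is a guess; only the type and function names come from the PR. One caveat worth stating: the import on line 1 is an import type, which TypeScript erases at compile time, so the cycle is probably a maintainability concern rather than a runtime hazard.

```typescript
// --- types.ts (sketch): every schema type lives here, nothing imported from difficulty-ladder.ts ---
type CaseDifficulty = 'easy' | 'medium' | 'hard';

interface ScaffoldingLevel {
  level: CaseDifficulty;
  notes?: string;
}

interface DifficultyLadder {
  levels: Record<CaseDifficulty, ScaffoldingLevel>;
  autoGenerate?: boolean;
}

interface Case {
  id: string;
  title: string;
  difficulty: CaseDifficulty;
  difficultyLadder?: DifficultyLadder;
}

// --- difficulty-ladder.ts (sketch): logic only, no type declarations ---
// In the split layout this file would begin with something like:
//   import type { Case, CaseDifficulty, ScaffoldingLevel } from './types';

// Assumed implementation; the PR's actual body is not shown in this review.
function hasDifficultyLadder(c: Case): boolean {
  return c.difficultyLadder !== undefined;
}

const plainCase: Case = { id: 'c-1', title: 'Plain', difficulty: 'medium' };
console.log(hasDifficultyLadder(plainCase)); // false
```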


Comment on lines +47 to +56
export interface DifficultyLadder {
/** Base case (the original task) */
base: Case;

/** Scaffolding levels for each difficulty */
levels: Record<CaseDifficulty, ScaffoldingLevel>;

/** Whether to auto-generate variants on load */
autoGenerate?: boolean;
}

⚠️ Potential issue | 🟠 Major

DifficultyLadder is defined in both types.ts and difficulty-ladder.ts.

This interface is declared identically in both files (here at lines 47-56 and in difficulty-ladder.ts at lines 37-46). Having two exported definitions of the same interface will cause ambiguity for consumers — which one should they import? Consolidate it in one place (preferably types.ts since it's a type definition).

Additionally, DifficultyLadder.base: Case creates a self-referential structure since Case itself contains difficultyLadder?: DifficultyLadder. This means a Case embeds a ladder that re-embeds the whole Case. Consider using a lighter reference (e.g., baseId: string) instead of embedding the full Case to avoid circular nesting and serialization issues.
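
As an illustration of the lighter reference, the sketch below stores only a baseId on the ladder and resolves the full case through a lookup table. LadderRefSketch, CaseSketch, resolveBaseCase, and the Map-based registry are all hypothetical; how cases are actually loaded and indexed in this repository is not shown in the PR.

```typescript
// The ladder holds an identifier instead of a full Case.
interface LadderRefSketch {
  baseId: string;
  autoGenerate?: boolean;
}

interface CaseSketch {
  id: string;
  title: string;
  difficultyLadder?: LadderRefSketch;
}

// Looks the base case up by id; fails loudly if the id is unknown.
function resolveBaseCase(
  ladder: LadderRefSketch,
  registry: Map<string, CaseSketch>,
): CaseSketch {
  const base = registry.get(ladder.baseId);
  if (base === undefined) {
    throw new Error(`Unknown base case id: ${ladder.baseId}`);
  }
  return base;
}

const parserCase: CaseSketch = {
  id: 'parser-001',
  title: 'Fix the parser',
  difficultyLadder: { baseId: 'parser-001' },
};
const caseRegistry = new Map<string, CaseSketch>([[parserCase.id, parserCase]]);

// Serialization stays flat: the ladder carries an id, not a copy of the case.
const serializedCase = JSON.stringify(parserCase);
console.log(serializedCase);
```

The trade-off is that consumers need access to a registry (or loader) to get from a ladder back to its base case.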


Comment on lines +1 to +541
/**
* Case and Rubric Schema Types
*
* Cases are structured as "interview questions" - a prompt given to an agent
* with optional starting files and metadata.
*
* Rubrics define how to grade the agent's response - weighted criteria
* with evaluators that produce scores.
*/

// =============================================================================
// Case Types (The Interview Question)
// =============================================================================

/**
* A file provided as part of a case (starting code, tests, etc.)
*/
export interface CaseFile {
/** Relative path within the workspace */
path: string;

/** File content (inline) */
content?: string;

/** Reference to external file (alternative to inline content) */
ref?: string;

/** Whether this file is read-only (agent shouldn't modify) */
readonly?: boolean;
}

/**
* Source/origin of a case
*/
export type CaseSource = 'bootstrap' | 'generated' | 'manual' | 'imported' | 'closed_issue';

/**
* Difficulty level
*/
export type CaseDifficulty = 'easy' | 'medium' | 'hard';

/**
* Agent behavior expectations for a case
*/
export interface CaseExpectations {
/** Maximum time in seconds */
maxTimeSeconds?: number;

/** Maximum tokens (input + output) */
maxTokens?: number;

/** Maximum iterations/turns */
maxIterations?: number;

/** Tools the agent is allowed to use */
allowedTools?: string[];

/** Tools the agent should not use */
disallowedTools?: string[];
}

/**
* A test case - the "interview question" given to an agent
*/
export interface Case {
/** Unique identifier (e.g., "bootstrap-001", "error-handling-py-001") */
id: string;

/** Human-readable title */
title: string;

/** The interview question - what we're asking the agent to do */
prompt: string;

/** Optional starting files (empty = greenfield task) */
files?: CaseFile[];

/**
* Rubric to use for evaluation.
* Can be:
* - string: reference to a rubric ID (e.g., "default", "strict-security")
* - object: inline rubric or extension of existing rubric
*/
rubric?: string | RubricReference;

/** Where this case came from */
source: CaseSource;

/** Primary programming language */
language: string;

/** Difficulty level */
difficulty: CaseDifficulty;

/** Category for organization (e.g., "error-handling", "security", "performance") */
category: string;

/** Tags for filtering */
tags?: string[];

/** Expected agent behavior bounds */
expectations?: CaseExpectations;

/** Version of this case (for tracking changes) */
version?: string;

/** Reference solution (not shown to agent, used for validation) */
solution?: CaseFile[];

/** Additional notes or hints (not shown to agent) */
notes?: string;

// Metadata added by loader
/** Source file path (added by loader) */
_sourcePath?: string;

/** When this case was loaded (added by loader) */
_loadedAt?: Date;
}

// =============================================================================
// Rubric Types (How We Grade)
// =============================================================================

/**
* Types of evaluators available
*/
export type EvaluatorType =
| 'command' // Run a shell command, check exit code
| 'pattern' // Regex match on files
| 'benchmark' // Run command, extract numeric metric
| 'diff' // Compare output to expected
| 'llm_judge' // Use LLM to evaluate (subjective criteria)
| 'agent_behavior'; // Evaluate agent behavior metrics

/**
* Base evaluator configuration
*/
export interface EvaluatorBase {
/** Type of evaluator */
type: EvaluatorType;

/** Human-readable name for this check */
name?: string;

/** Whether this evaluator is optional (won't fail if it errors) */
optional?: boolean;

/** Whether to award partial credit (vs pass/fail) */
partialCredit?: boolean;

/** Threshold for passing (0.0-1.0, default 1.0) */
passThreshold?: number;
}

/**
* Command evaluator - runs a shell command
*/
export interface CommandEvaluator extends EvaluatorBase {
type: 'command';

/** Command to run */
run: string;

/** How to parse output (for partial credit) */
parse?: 'exit_code' | 'json' | 'junit' | 'tap';

/** JSONPath expression to extract score (when parse=json) */
scorePath?: string;

/** Fail if this pattern is found in output */
failIfMatch?: string;

/** Fail if this pattern is NOT found in output */
failIfNoMatch?: string;
}

/**
* Pattern evaluator - regex match on files
*/
export interface PatternEvaluator extends EvaluatorBase {
type: 'pattern';

/** Glob pattern for files to check */
files: string;

/** Fail if this pattern matches */
failIfMatch?: string;

/** Fail if this pattern does NOT match */
requireMatch?: string;

/** Case-insensitive matching */
ignoreCase?: boolean;
}

/**
* Benchmark evaluator - extract numeric metrics
*/
export interface BenchmarkEvaluator extends EvaluatorBase {
type: 'benchmark';

/** Command to run */
run: string;

/** Name of the metric being measured */
metric: string;

/** JSONPath to extract value (if output is JSON) */
valuePath?: string;

/** Regex to extract value from output */
valuePattern?: string;

/** Minimum acceptable value */
minValue?: number;

/** Maximum acceptable value */
maxValue?: number;

/** Target value (for partial credit calculation) */
targetValue?: number;
}

/**
* Diff evaluator - compare output to expected
*/
export interface DiffEvaluator extends EvaluatorBase {
type: 'diff';

/** Command that produces actual output */
run: string;

/** Expected output (inline) */
expected?: string;

/** Path to file with expected output */
expectedFile?: string;

/** Ignore whitespace differences */
ignoreWhitespace?: boolean;

/** Ignore case differences */
ignoreCase?: boolean;
}

/**
* LLM Judge evaluator - use AI to evaluate subjective criteria
*/
export interface LLMJudgeEvaluator extends EvaluatorBase {
type: 'llm_judge';

/** What to evaluate */
evaluate: 'code_quality' | 'readability' | 'documentation' | 'custom';

/** Custom prompt for evaluation (when evaluate=custom) */
prompt?: string;

/** Files to include in evaluation context */
files?: string;

/** Model to use (default: configured default) */
model?: string;
}

/**
* Agent behavior evaluator - measure how the agent worked
*/
export interface AgentBehaviorEvaluator extends EvaluatorBase {
type: 'agent_behavior';

/** Which metric to evaluate */
metric: 'time' | 'tokens' | 'iterations' | 'tool_calls' | 'self_corrections';

/** Maximum acceptable value */
maxValue?: number;

/** Minimum acceptable value */
minValue?: number;

/** Target value (for partial credit) */
targetValue?: number;
}

/**
* Union of all evaluator types
*/
export type Evaluator =
| CommandEvaluator
| PatternEvaluator
| BenchmarkEvaluator
| DiffEvaluator
| LLMJudgeEvaluator
| AgentBehaviorEvaluator;

/**
* A criterion in a rubric (e.g., "correctness", "code_quality")
*/
export interface RubricCriterion {
/** Weight of this criterion (should sum to 100 across all criteria) */
weight: number;

/** Human-readable description */
description?: string;

/** Evaluators that contribute to this criterion's score */
evaluators: Evaluator[];
}

/**
* A rubric - defines how to grade an agent's response
*/
export interface Rubric {
/** Unique identifier */
id: string;

/** Human-readable name */
name: string;

/** Description of when to use this rubric */
description?: string;

/** Another rubric to extend (inherit criteria from) */
extends?: string;

/** The grading criteria */
criteria: Record<string, RubricCriterion>;

// Metadata
/** Source file path (added by loader) */
_sourcePath?: string;
}

/**
* Reference to a rubric with optional overrides
*/
export interface RubricReference {
/** ID of rubric to use as base */
extends: string;

/** Override specific criteria */
criteria?: Record<string, Partial<RubricCriterion>>;
}

// =============================================================================
// Result Types (What We Measured)
// =============================================================================

/**
* Result from a single evaluator
*/
export interface EvaluatorResult {
/** Name of the evaluator */
name: string;

/** Type of evaluator */
type: EvaluatorType;

/** Score from 0.0 to 1.0 */
score: number;

/** Whether this evaluator passed (score >= threshold) */
passed: boolean;

/** Evidence (stdout, stderr, or explanation) */
evidence: string;

/** Evaluator-specific details */
details?: Record<string, unknown>;

/** Error message if evaluator failed to run */
error?: string;

/** Duration in milliseconds */
durationMs: number;
}

/**
* Result for a single criterion
*/
export interface CriterionResult {
/** Name of the criterion */
name: string;

/** Weight of this criterion */
weight: number;

/** Weighted score (score * weight / 100) */
weightedScore: number;

/** Raw score from 0.0 to 1.0 */
score: number;

/** Whether this criterion passed */
passed: boolean;

/** Results from individual evaluators */
evaluatorResults: EvaluatorResult[];
}

/**
* Agent behavior trace (captured during execution)
*/
export interface AgentTrace {
/** Total execution time in ms */
totalTimeMs: number;

/** Total tokens used (input + output) */
totalTokens: number;

/** Number of turns/iterations */
iterations: number;

/** Tools that were called */
toolsUsed: string[];

/** Number of self-corrections detected */
selfCorrections: number;

/** Per-turn details */
turns?: AgentTurn[];
}

/**
* A single turn in the agent's execution
*/
export interface AgentTurn {
/** When this turn started */
timestamp: Date;

/** Tokens in (prompt) */
tokensIn: number;

/** Tokens out (response) */
tokensOut: number;

/** Tools called in this turn */
toolCalls: string[];

/** Whether this turn was a self-correction */
selfCorrection: boolean;
}

/**
* Result from evaluating a single case
*/
export interface CaseResult {
/** Case that was evaluated */
caseId: string;

/** Overall score from 0 to 100 */
score: number;

/** Whether the case passed (score >= pass threshold) */
passed: boolean;

/** Results for each criterion */
criteriaResults: CriterionResult[];

/** Agent behavior trace */
agentTrace?: AgentTrace;

/** The agent's text response */
agentResponse?: string;

/** Tool calls the agent made */
agentToolCalls?: { name: string; durationMs?: number; success?: boolean }[];

/** Model used */
agentModel?: string;

/** Token usage */
agentTokens?: { input: number; output: number; total: number };

/** Files produced by the agent (snapshot of workspace after agent runs) */
agentFiles?: { path: string; content: string; changed: boolean }[];

/** Total duration in milliseconds */
durationMs: number;

/** Whether it timed out */
timedOut: boolean;

/** Error if something went wrong */
error?: string;

/** When this result was produced */
timestamp: Date;
}

/**
* Result from a full evaluation run
*/
export interface RunResult {
/** Unique run identifier */
runId: string;

/** When the run started */
startedAt: Date;

/** When the run completed */
completedAt: Date;

/** Agent that was evaluated */
agent: string;

/** Rubric used */
rubricId: string;

/** Results for each case */
caseResults: CaseResult[];

/** Summary statistics */
summary: RunSummary;
}

/**
* Summary statistics for a run
*/
export interface RunSummary {
/** Total cases run */
total: number;

/** Cases that passed */
passed: number;

/** Cases that failed */
failed: number;

/** Cases that were skipped */
skipped: number;

/** Cases that timed out */
timedOut: number;

/** Average score across all cases */
averageScore: number;

/** Total duration in milliseconds */
totalDurationMs: number;
}

⚠️ Potential issue | 🟠 Major

Remove this .bak file from the repository.

types.ts.bak is a backup artifact and should not be committed. It duplicates types.ts (minus the new difficulty-ladder additions) and will cause confusion. Add *.bak to .gitignore and delete this file.

#!/bin/bash
# Check if .bak files are already in .gitignore
fd ".gitignore" --exec grep -n "bak" {} +
echo "---"
# Check for any other .bak files in the repo
fd -e bak
