ralph: #22 — Difficulty Ladder System — add easy/medium/hard scaffolding levels to cases #55
jharris1679 wants to merge 3 commits into main
Walkthrough
Introduces a Difficulty Ladder system that generates scaffolded variants of a base case at easy, medium, and hard difficulty levels with customizable hints, additional files, and solution visibility. Includes a fluent builder API and default scaffolding templates.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 5 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@src/cases/difficulty-ladder.ts`:
- Around line 96-106: The variant currently spreads ...baseCase which pulls in
baseCase.difficultyLadder (causing nested/bloated variants) and then
unconditionally sets notes to level.notes (overwriting baseCase.notes with
undefined); to fix, exclude difficultyLadder when building the variant (e.g.,
destructure baseCase to remove difficultyLadder before spreading) and set notes
using a nullish fallback like level.notes ?? baseCase.notes so existing notes
are preserved; update the construction of variant (the object built from
baseCase and fields id, title, prompt, files, difficulty, solution, notes)
accordingly.
- Around line 182-188: The build() method is unsafely casting this.levels
(Partial<Record<CaseDifficulty, ScaffoldingLevel>>) to a full Record which can
produce undefined at runtime; update build() in DifficultyLadder to validate
that all CaseDifficulty keys are present in this.levels (e.g., iterate over
Object.values(CaseDifficulty) or the explicit ['easy','medium','hard'] set) and
throw a clear error if any are missing, then return the object with levels typed
as Record<CaseDifficulty, ScaffoldingLevel>; alternatively, if you prefer not to
enforce presence, change the DifficultyLadder.levels type to
Partial<Record<CaseDifficulty, ScaffoldingLevel>> so the public API reflects
optional entries—pick one of these two fixes and apply it to the build()
implementation and DifficultyLadder type accordingly.
In `@src/cases/types.ts`:
- Around line 47-56: You have two identical exported interfaces named
DifficultyLadder (in types.ts and difficulty-ladder.ts) and the current
DifficultyLadder.base is a self-referential Case which causes circular nesting;
remove the duplicate export from difficulty-ladder.ts and keep the single source
of truth in types.ts, then change DifficultyLadder.base from type Case to a
lightweight identifier (e.g., baseId: string) and update all usages that
reference DifficultyLadder.base (including any constructors, serializers,
deserializers, and the Case type definition that currently has
difficultyLadder?: DifficultyLadder) to use baseId or resolve the Case by id
where needed; ensure imports/reference sites (functions/classes named
DifficultyLadder, Case, and any methods in difficulty-ladder.ts that manipulate
ladders) are updated to the consolidated type and that serialization no longer
nests full Case objects.
- Line 1: There’s a circular import and duplicated type definitions: move the
ScaffoldingLevel type and the DifficultyLadder type definition into the shared
types file (where Case, CaseFile, CaseDifficulty live), remove the duplicate
DifficultyLadder from difficulty-ladder.ts, and stop importing ScaffoldingLevel
from difficulty-ladder.ts; then update difficulty-ladder.ts to import
ScaffoldingLevel and DifficultyLadder from the consolidated types module (and
remove any reverse imports of Case/CaseFile/CaseDifficulty), ensuring all usages
reference the single definitions ScaffoldingLevel and DifficultyLadder in the
types file.
In `@src/cases/types.ts.bak`:
- Around line 1-541: Remove the backup file src/cases/types.ts.bak from the repo
and ensure future .bak files are ignored: delete the file from git (e.g., git rm
src/cases/types.ts.bak, or git rm --cached to keep the local copy, then commit)
and add the pattern "*.bak" to the repository .gitignore; verify no other .bak
artifacts remain (search for files ending in .bak) and commit the .gitignore
change so the backup won't be re-added.
🧹 Nitpick comments (5)
src/cases/types.ts (1)
128-129: Missing JSDoc on difficultyLadder field.
Every other field on Case has a JSDoc comment, but this one only has a bare // comment. Use /** ... */ for consistency so tooling surfaces the documentation.
Proposed fix
-  // Difficulty ladder configuration for generating variants
-  difficultyLadder?: DifficultyLadder;
+  /** Difficulty ladder configuration for generating variants */
+  difficultyLadder?: DifficultyLadder;
src/cases/difficulty-ladder.ts (4)
17-32: ScaffoldingLevel.level duplicates CaseDifficulty.
The level field is typed as the literal union 'easy' | 'medium' | 'hard', which is exactly CaseDifficulty. Reuse the existing type alias for consistency and to avoid drift if levels are added later.
Proposed fix
 export interface ScaffoldingLevel {
   /** How much context is provided */
-  level: 'easy' | 'medium' | 'hard';
+  level: CaseDifficulty;
55-71: generateDifficultyVariants could be simplified with a loop.
The three sequential createVariant calls are repetitive. A loop over the difficulty levels would be more concise and automatically adapt if levels change in the future.
Proposed refactor
 export function generateDifficultyVariants(baseCase: Case): Case[] {
-  const variants: Case[] = [];
-
-  // Create easy variant
-  const easyVariant = createVariant(baseCase, 'easy');
-  if (easyVariant) variants.push(easyVariant);
-
-  // Create medium variant
-  const mediumVariant = createVariant(baseCase, 'medium');
-  if (mediumVariant) variants.push(mediumVariant);
-
-  // Create hard variant
-  const hardVariant = createVariant(baseCase, 'hard');
-  if (hardVariant) variants.push(hardVariant);
-
-  return variants;
+  const difficulties: CaseDifficulty[] = ['easy', 'medium', 'hard'];
+  return difficulties
+    .map((d) => createVariant(baseCase, d))
+    .filter((v): v is Case => v !== null);
 }
199-239: Default scaffolding hints don't differentiate context; they're just generic messages.
Per the issue requirements, the difficulty levels should provide varying amounts of context/scaffolding:
- Easy: "relevant files identified, hints provided"
- Hard: "agent must discover files and approach"
Currently all three defaults return similar generic encouragement strings, and none of them use _baseCase to vary additionalFiles or tailor hints to the actual task. The "easy" default doesn't add any additional files or identify relevant files, and the "hard" default actually provides more hints (3) than "medium" (1), which inverts the intended scaffolding gradient.
Consider aligning the defaults with the issue spec: easy should include additionalFiles (e.g., solution stubs, relevant file pointers), medium should be minimal, and hard should strip hints entirely.
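One possible reading of that gradient, as a minimal sketch rather than the PR's actual defaults: the SketchCase, SketchFile, and SketchLevel shapes, the defaultLevels function, the HINTS.md file name, and the hint wording are all hypothetical and only illustrate scaffolding that shrinks from easy to hard.

```typescript
// Hypothetical, simplified shapes; the PR's Case and ScaffoldingLevel carry more fields.
type Difficulty = 'easy' | 'medium' | 'hard';

interface SketchFile {
  path: string;
  content?: string;
}

interface SketchCase {
  id: string;
  files?: SketchFile[];
}

interface SketchLevel {
  level: Difficulty;
  hints: string[];
  additionalFiles: SketchFile[];
}

// Build defaults whose scaffolding shrinks from easy to hard.
function defaultLevels(baseCase: SketchCase): Record<Difficulty, SketchLevel> {
  const paths = (baseCase.files ?? []).map((f) => f.path);
  const fileList = paths.length > 0 ? paths.join(', ') : '(none provided)';

  return {
    easy: {
      level: 'easy',
      // Easy: name the relevant files and suggest a starting point.
      hints: [`Relevant files: ${fileList}`, 'Read the listed files before editing.'],
      // HINTS.md is an invented file name used only for this sketch.
      additionalFiles: [{ path: 'HINTS.md', content: `Relevant files: ${fileList}` }],
    },
    medium: {
      level: 'medium',
      // Medium: one nudge, no extra files.
      hints: ['Locate the relevant module before making changes.'],
      additionalFiles: [],
    },
    hard: {
      level: 'hard',
      // Hard: the agent must discover files and approach on its own.
      hints: [],
      additionalFiles: [],
    },
  };
}
```

Deriving the easy hints from the base case's own file paths is one way to put the otherwise unused base-case parameter to work; the real defaults may need task-specific hints that cannot be derived mechanically.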
114-116: Prefer ?? over || for nullish coalescing.
|| will also coalesce on falsy values (which is fine here since the value is an object or undefined), but ?? communicates intent more clearly and is the idiomatic choice in modern TypeScript.
Proposed fix
-  return baseCase.difficultyLadder?.levels[difficulty] || null;
+  return baseCase.difficultyLadder?.levels[difficulty] ?? null;
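To make the difference concrete, here is a small sketch using a string value, where the two operators do diverge. The function names withOr and withNullish are made up for illustration and do not appear in the PR.

```typescript
// `||` falls back on any falsy value, including the empty string.
function withOr(value: string | undefined): string {
  return value || 'fallback';
}

// `??` falls back only when the value is null or undefined.
function withNullish(value: string | undefined): string {
  return value ?? 'fallback';
}
```

With an empty string as input, withOr returns the fallback while withNullish keeps the empty string; for object-or-undefined values such as a scaffolding level, both behave the same.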
const variant: Case = {
  ...baseCase,
  id: `${baseCase.id}-${difficulty}`,
  title: `${baseCase.title} (${difficulty})`,
  prompt,
  files: allFiles,
  difficulty,
  // Don't include solution in variants (unless explicitly requested)
  solution: level.showSolution ? baseCase.solution : undefined,
  notes: level.notes,
};
Variant inherits difficultyLadder from base and silently drops notes.
Two issues with the spread + overrides:
1. ...baseCase copies the difficultyLadder property into every variant. Variants shouldn't carry the ladder configuration: it creates unnecessary circular nesting and bloats serialized output.
2. notes: level.notes explicitly sets notes even when level.notes is undefined, which overwrites whatever baseCase.notes contained (the spread would have preserved it). Use a conditional or nullish-coalescing pattern instead.
Proposed fix
const variant: Case = {
...baseCase,
id: `${baseCase.id}-${difficulty}`,
title: `${baseCase.title} (${difficulty})`,
prompt,
files: allFiles,
difficulty,
- // Don't include solution in variants (unless explicitly requested)
solution: level.showSolution ? baseCase.solution : undefined,
- notes: level.notes,
+ notes: level.notes ?? baseCase.notes,
+ difficultyLadder: undefined, // variants shouldn't carry the ladder
};
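The proposed fix sets difficultyLadder to undefined, which still leaves the key on the variant object. The destructuring alternative mentioned in the agent prompt could look like the following minimal sketch. MiniCase, MiniLevel, and buildVariant are hypothetical, trimmed-down stand-ins for the PR's Case, ScaffoldingLevel, and variant construction, not the actual implementation.

```typescript
// Hypothetical, trimmed-down types for illustration only.
type Difficulty = 'easy' | 'medium' | 'hard';

interface MiniLevel {
  notes?: string;
  showSolution?: boolean;
}

interface MiniCase {
  id: string;
  title: string;
  difficulty: Difficulty;
  notes?: string;
  solution?: string[];
  difficultyLadder?: { levels: Partial<Record<Difficulty, MiniLevel>> };
}

function buildVariant(baseCase: MiniCase, difficulty: Difficulty, level: MiniLevel): MiniCase {
  // Pull difficultyLadder out so the rest spread never copies it into the variant.
  const { difficultyLadder: _ladder, ...rest } = baseCase;

  return {
    ...rest,
    id: `${baseCase.id}-${difficulty}`,
    title: `${baseCase.title} (${difficulty})`,
    difficulty,
    solution: level.showSolution ? baseCase.solution : undefined,
    // Fall back to the base notes only when the level defines none.
    notes: level.notes ?? baseCase.notes,
  };
}
```

With this shape the variant has no difficultyLadder key at all, so serialized output does not carry an explicit undefined entry or a nested ladder.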
build(): DifficultyLadder {
  return {
    base: this.base,
    levels: this.levels as Record<CaseDifficulty, ScaffoldingLevel>,
    autoGenerate: true,
  };
}
Unsafe cast: Partial<Record<...>> → Record<...> can produce undefined at runtime.
this.levels is Partial<Record<CaseDifficulty, ScaffoldingLevel>>, so any subset of keys may be set. The as Record<...> cast on line 185 silences the compiler, but consumers of DifficultyLadder.levels (e.g., levels['medium']) will get undefined at runtime when the key wasn't set — violating the non-optional Record contract.
Either validate that all three levels are present and throw if not, or change DifficultyLadder.levels to Partial<Record<...>> so the type is honest.
Option A: Validate at build time
build(): DifficultyLadder {
+ const required: CaseDifficulty[] = ['easy', 'medium', 'hard'];
+ for (const d of required) {
+ if (!this.levels[d]) {
+ throw new Error(`Missing scaffolding level: ${d}`);
+ }
+ }
return {
base: this.base,
levels: this.levels as Record<CaseDifficulty, ScaffoldingLevel>,
autoGenerate: true,
};
}
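Option A can also be written so that the cast disappears entirely, because TypeScript narrows each destructured level after the presence check. The sketch below is an assumption-laden illustration: LadderBuilder, MiniLevel, withLevel, and ALL_LEVELS are hypothetical names, not the PR's builder. Since CaseDifficulty is shown in the types file as a string-literal union, it has no runtime value, so the sketch iterates an explicit array instead of Object.values(CaseDifficulty).

```typescript
// Hypothetical stand-ins for the PR's CaseDifficulty and ScaffoldingLevel.
type Difficulty = 'easy' | 'medium' | 'hard';

interface MiniLevel {
  level: Difficulty;
  hints: string[];
}

// A type alias has no runtime representation, so the levels are listed explicitly.
const ALL_LEVELS: readonly Difficulty[] = ['easy', 'medium', 'hard'];

class LadderBuilder {
  private levels: Partial<Record<Difficulty, MiniLevel>> = {};

  withLevel(level: MiniLevel): this {
    this.levels[level.level] = level;
    return this;
  }

  build(): Record<Difficulty, MiniLevel> {
    const { easy, medium, hard } = this.levels;
    if (!easy || !medium || !hard) {
      const missing = ALL_LEVELS.filter((d) => !this.levels[d]);
      throw new Error(`Missing scaffolding level(s): ${missing.join(', ')}`);
    }
    // After the check, each value is narrowed to MiniLevel, so no cast is needed.
    return { easy, medium, hard };
  }
}
```

Reporting every missing level in one error, rather than stopping at the first, is a design choice of this sketch and not something the review requires.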
@@ -1,3 +1,4 @@
import type { ScaffoldingLevel } from './difficulty-ladder';
🧩 Analysis chain
🏁 Script executed:
# First, let's check the imports structure
echo "=== types.ts imports from difficulty-ladder ==="
rg "from.*difficulty-ladder" src/cases/types.ts
echo ""
echo "=== difficulty-ladder.ts imports from types ==="
rg "from.*types" src/cases/difficulty-ladder.ts
Repository: AnswerLayer/sniffbench
Length of output: 281
🏁 Script executed:
# Check if both files exist and get their contents
echo "=== Checking src/cases/types.ts imports section ==="
head -20 src/cases/types.ts
echo ""
echo "=== Checking src/cases/difficulty-ladder.ts imports section ==="
head -20 src/cases/difficulty-ladder.ts
Repository: AnswerLayer/sniffbench
Length of output: 1460
🏁 Script executed:
# Search for DifficultyLadder definitions in both files
echo "=== DifficultyLadder in types.ts ==="
rg -A 10 "interface DifficultyLadder|type DifficultyLadder" src/cases/types.ts
echo ""
echo "=== DifficultyLadder in difficulty-ladder.ts ==="
rg -A 10 "interface DifficultyLadder|type DifficultyLadder" src/cases/difficulty-ladder.ts
Repository: AnswerLayer/sniffbench
Length of output: 697
🏁 Script executed:
# Check what ScaffoldingLevel is and where it's defined/used
echo "=== ScaffoldingLevel definition ==="
rg -B 2 -A 5 "ScaffoldingLevel" src/cases/difficulty-ladder.ts | head -30
echo ""
echo "=== Uses of ScaffoldingLevel in types.ts ==="
rg "ScaffoldingLevel" src/cases/types.ts
Repository: AnswerLayer/sniffbench
Length of output: 1121
Circular dependency and duplicate type definitions between types.ts and difficulty-ladder.ts.
types.ts imports ScaffoldingLevel from difficulty-ladder.ts, while difficulty-ladder.ts imports Case, CaseFile, and CaseDifficulty from types.ts. Additionally, DifficultyLadder is defined identically in both files, creating unnecessary duplication.
Move ScaffoldingLevel and DifficultyLadder into types.ts—both are pure type definitions that belong alongside the rest of the schema. This eliminates the circular dependency and consolidates the duplicate definitions. difficulty-ladder.ts would then depend only on types.ts with no reverse dependency.
export interface DifficultyLadder {
  /** Base case (the original task) */
  base: Case;

  /** Scaffolding levels for each difficulty */
  levels: Record<CaseDifficulty, ScaffoldingLevel>;

  /** Whether to auto-generate variants on load */
  autoGenerate?: boolean;
}
DifficultyLadder is defined in both types.ts and difficulty-ladder.ts.
This interface is declared identically in both files (here at lines 47-56 and in difficulty-ladder.ts at lines 37-46). Having two exported definitions of the same interface will cause ambiguity for consumers — which one should they import? Consolidate it in one place (preferably types.ts since it's a type definition).
Additionally, DifficultyLadder.base: Case creates a self-referential structure since Case itself contains difficultyLadder?: DifficultyLadder. This means a Case embeds a ladder that re-embeds the whole Case. Consider using a lighter reference (e.g., baseId: string) instead of embedding the full Case to avoid circular nesting and serialization issues.
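A minimal sketch of what the consolidated, non-nesting shape might look like follows. The interfaces are deliberately reduced versions of the PR's types, and resolveBase together with its Map-based registry are assumptions introduced only to show how a base case could be looked up by ID when needed.

```typescript
// Hypothetical consolidated shapes, as they might appear in a single types module.
type CaseDifficulty = 'easy' | 'medium' | 'hard';

interface ScaffoldingLevel {
  level: CaseDifficulty;
  hints?: string[];
  notes?: string;
}

interface DifficultyLadder {
  /** ID of the base case, instead of an embedded Case */
  baseId: string;

  /** Scaffolding levels for each difficulty */
  levels: Record<CaseDifficulty, ScaffoldingLevel>;

  /** Whether to auto-generate variants on load */
  autoGenerate?: boolean;
}

interface Case {
  id: string;
  title: string;
  difficultyLadder?: DifficultyLadder;
}

// Resolve the base case on demand from a registry keyed by ID (the registry is an assumption).
function resolveBase(ladder: DifficultyLadder, registry: Map<string, Case>): Case {
  const base = registry.get(ladder.baseId);
  if (!base) {
    throw new Error(`Unknown base case: ${ladder.baseId}`);
  }
  return base;
}
```

Because the ladder holds only an ID, a Case that carries its own ladder serializes to a flat object rather than re-embedding the whole Case.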
/**
 * Case and Rubric Schema Types
 *
 * Cases are structured as "interview questions" - a prompt given to an agent
 * with optional starting files and metadata.
 *
 * Rubrics define how to grade the agent's response - weighted criteria
 * with evaluators that produce scores.
 */

// =============================================================================
// Case Types (The Interview Question)
// =============================================================================

/**
 * A file provided as part of a case (starting code, tests, etc.)
 */
export interface CaseFile {
  /** Relative path within the workspace */
  path: string;

  /** File content (inline) */
  content?: string;

  /** Reference to external file (alternative to inline content) */
  ref?: string;

  /** Whether this file is read-only (agent shouldn't modify) */
  readonly?: boolean;
}

/**
 * Source/origin of a case
 */
export type CaseSource = 'bootstrap' | 'generated' | 'manual' | 'imported' | 'closed_issue';

/**
 * Difficulty level
 */
export type CaseDifficulty = 'easy' | 'medium' | 'hard';

/**
 * Agent behavior expectations for a case
 */
export interface CaseExpectations {
  /** Maximum time in seconds */
  maxTimeSeconds?: number;

  /** Maximum tokens (input + output) */
  maxTokens?: number;

  /** Maximum iterations/turns */
  maxIterations?: number;

  /** Tools the agent is allowed to use */
  allowedTools?: string[];

  /** Tools the agent should not use */
  disallowedTools?: string[];
}

/**
 * A test case - the "interview question" given to an agent
 */
export interface Case {
  /** Unique identifier (e.g., "bootstrap-001", "error-handling-py-001") */
  id: string;

  /** Human-readable title */
  title: string;

  /** The interview question - what we're asking the agent to do */
  prompt: string;

  /** Optional starting files (empty = greenfield task) */
  files?: CaseFile[];

  /**
   * Rubric to use for evaluation.
   * Can be:
   * - string: reference to a rubric ID (e.g., "default", "strict-security")
   * - object: inline rubric or extension of existing rubric
   */
  rubric?: string | RubricReference;

  /** Where this case came from */
  source: CaseSource;

  /** Primary programming language */
  language: string;

  /** Difficulty level */
  difficulty: CaseDifficulty;

  /** Category for organization (e.g., "error-handling", "security", "performance") */
  category: string;

  /** Tags for filtering */
  tags?: string[];

  /** Expected agent behavior bounds */
  expectations?: CaseExpectations;

  /** Version of this case (for tracking changes) */
  version?: string;

  /** Reference solution (not shown to agent, used for validation) */
  solution?: CaseFile[];

  /** Additional notes or hints (not shown to agent) */
  notes?: string;

  // Metadata added by loader
  /** Source file path (added by loader) */
  _sourcePath?: string;

  /** When this case was loaded (added by loader) */
  _loadedAt?: Date;
}

// =============================================================================
// Rubric Types (How We Grade)
// =============================================================================

/**
 * Types of evaluators available
 */
export type EvaluatorType =
  | 'command' // Run a shell command, check exit code
  | 'pattern' // Regex match on files
  | 'benchmark' // Run command, extract numeric metric
  | 'diff' // Compare output to expected
  | 'llm_judge' // Use LLM to evaluate (subjective criteria)
  | 'agent_behavior'; // Evaluate agent behavior metrics

/**
 * Base evaluator configuration
 */
export interface EvaluatorBase {
  /** Type of evaluator */
  type: EvaluatorType;

  /** Human-readable name for this check */
  name?: string;

  /** Whether this evaluator is optional (won't fail if it errors) */
  optional?: boolean;

  /** Whether to award partial credit (vs pass/fail) */
  partialCredit?: boolean;

  /** Threshold for passing (0.0-1.0, default 1.0) */
  passThreshold?: number;
}

/**
 * Command evaluator - runs a shell command
 */
export interface CommandEvaluator extends EvaluatorBase {
  type: 'command';

  /** Command to run */
  run: string;

  /** How to parse output (for partial credit) */
  parse?: 'exit_code' | 'json' | 'junit' | 'tap';

  /** JSONPath expression to extract score (when parse=json) */
  scorePath?: string;

  /** Fail if this pattern is found in output */
  failIfMatch?: string;

  /** Fail if this pattern is NOT found in output */
  failIfNoMatch?: string;
}

/**
 * Pattern evaluator - regex match on files
 */
export interface PatternEvaluator extends EvaluatorBase {
  type: 'pattern';

  /** Glob pattern for files to check */
  files: string;

  /** Fail if this pattern matches */
  failIfMatch?: string;

  /** Fail if this pattern does NOT match */
  requireMatch?: string;

  /** Case-insensitive matching */
  ignoreCase?: boolean;
}

/**
 * Benchmark evaluator - extract numeric metrics
 */
export interface BenchmarkEvaluator extends EvaluatorBase {
  type: 'benchmark';

  /** Command to run */
  run: string;

  /** Name of the metric being measured */
  metric: string;

  /** JSONPath to extract value (if output is JSON) */
  valuePath?: string;

  /** Regex to extract value from output */
  valuePattern?: string;

  /** Minimum acceptable value */
  minValue?: number;

  /** Maximum acceptable value */
  maxValue?: number;

  /** Target value (for partial credit calculation) */
  targetValue?: number;
}

/**
 * Diff evaluator - compare output to expected
 */
export interface DiffEvaluator extends EvaluatorBase {
  type: 'diff';

  /** Command that produces actual output */
  run: string;

  /** Expected output (inline) */
  expected?: string;

  /** Path to file with expected output */
  expectedFile?: string;

  /** Ignore whitespace differences */
  ignoreWhitespace?: boolean;

  /** Ignore case differences */
  ignoreCase?: boolean;
}

/**
 * LLM Judge evaluator - use AI to evaluate subjective criteria
 */
export interface LLMJudgeEvaluator extends EvaluatorBase {
  type: 'llm_judge';

  /** What to evaluate */
  evaluate: 'code_quality' | 'readability' | 'documentation' | 'custom';

  /** Custom prompt for evaluation (when evaluate=custom) */
  prompt?: string;

  /** Files to include in evaluation context */
  files?: string;

  /** Model to use (default: configured default) */
  model?: string;
}

/**
 * Agent behavior evaluator - measure how the agent worked
 */
export interface AgentBehaviorEvaluator extends EvaluatorBase {
  type: 'agent_behavior';

  /** Which metric to evaluate */
  metric: 'time' | 'tokens' | 'iterations' | 'tool_calls' | 'self_corrections';

  /** Maximum acceptable value */
  maxValue?: number;

  /** Minimum acceptable value */
  minValue?: number;

  /** Target value (for partial credit) */
  targetValue?: number;
}

/**
 * Union of all evaluator types
 */
export type Evaluator =
  | CommandEvaluator
  | PatternEvaluator
  | BenchmarkEvaluator
  | DiffEvaluator
  | LLMJudgeEvaluator
  | AgentBehaviorEvaluator;

/**
 * A criterion in a rubric (e.g., "correctness", "code_quality")
 */
export interface RubricCriterion {
  /** Weight of this criterion (should sum to 100 across all criteria) */
  weight: number;

  /** Human-readable description */
  description?: string;

  /** Evaluators that contribute to this criterion's score */
  evaluators: Evaluator[];
}

/**
 * A rubric - defines how to grade an agent's response
 */
export interface Rubric {
  /** Unique identifier */
  id: string;

  /** Human-readable name */
  name: string;

  /** Description of when to use this rubric */
  description?: string;

  /** Another rubric to extend (inherit criteria from) */
  extends?: string;

  /** The grading criteria */
  criteria: Record<string, RubricCriterion>;

  // Metadata
  /** Source file path (added by loader) */
  _sourcePath?: string;
}

/**
 * Reference to a rubric with optional overrides
 */
export interface RubricReference {
  /** ID of rubric to use as base */
  extends: string;

  /** Override specific criteria */
  criteria?: Record<string, Partial<RubricCriterion>>;
}

// =============================================================================
// Result Types (What We Measured)
// =============================================================================

/**
 * Result from a single evaluator
 */
export interface EvaluatorResult {
  /** Name of the evaluator */
  name: string;

  /** Type of evaluator */
  type: EvaluatorType;

  /** Score from 0.0 to 1.0 */
  score: number;

  /** Whether this evaluator passed (score >= threshold) */
  passed: boolean;

  /** Evidence (stdout, stderr, or explanation) */
  evidence: string;

  /** Evaluator-specific details */
  details?: Record<string, unknown>;

  /** Error message if evaluator failed to run */
  error?: string;

  /** Duration in milliseconds */
  durationMs: number;
}

/**
 * Result for a single criterion
 */
export interface CriterionResult {
  /** Name of the criterion */
  name: string;

  /** Weight of this criterion */
  weight: number;

  /** Weighted score (score * weight / 100) */
  weightedScore: number;

  /** Raw score from 0.0 to 1.0 */
  score: number;

  /** Whether this criterion passed */
  passed: boolean;

  /** Results from individual evaluators */
  evaluatorResults: EvaluatorResult[];
}

/**
 * Agent behavior trace (captured during execution)
 */
export interface AgentTrace {
  /** Total execution time in ms */
  totalTimeMs: number;

  /** Total tokens used (input + output) */
  totalTokens: number;

  /** Number of turns/iterations */
  iterations: number;

  /** Tools that were called */
  toolsUsed: string[];

  /** Number of self-corrections detected */
  selfCorrections: number;

  /** Per-turn details */
  turns?: AgentTurn[];
}

/**
 * A single turn in the agent's execution
 */
export interface AgentTurn {
  /** When this turn started */
  timestamp: Date;

  /** Tokens in (prompt) */
  tokensIn: number;

  /** Tokens out (response) */
  tokensOut: number;

  /** Tools called in this turn */
  toolCalls: string[];

  /** Whether this turn was a self-correction */
  selfCorrection: boolean;
}

/**
 * Result from evaluating a single case
 */
export interface CaseResult {
  /** Case that was evaluated */
  caseId: string;

  /** Overall score from 0 to 100 */
  score: number;

  /** Whether the case passed (score >= pass threshold) */
  passed: boolean;

  /** Results for each criterion */
  criteriaResults: CriterionResult[];

  /** Agent behavior trace */
  agentTrace?: AgentTrace;

  /** The agent's text response */
  agentResponse?: string;

  /** Tool calls the agent made */
  agentToolCalls?: { name: string; durationMs?: number; success?: boolean }[];

  /** Model used */
  agentModel?: string;

  /** Token usage */
  agentTokens?: { input: number; output: number; total: number };

  /** Files produced by the agent (snapshot of workspace after agent runs) */
  agentFiles?: { path: string; content: string; changed: boolean }[];

  /** Total duration in milliseconds */
  durationMs: number;

  /** Whether it timed out */
  timedOut: boolean;

  /** Error if something went wrong */
  error?: string;

  /** When this result was produced */
  timestamp: Date;
}

/**
 * Result from a full evaluation run
 */
export interface RunResult {
  /** Unique run identifier */
  runId: string;

  /** When the run started */
  startedAt: Date;

  /** When the run completed */
  completedAt: Date;

  /** Agent that was evaluated */
  agent: string;

  /** Rubric used */
  rubricId: string;

  /** Results for each case */
  caseResults: CaseResult[];

  /** Summary statistics */
  summary: RunSummary;
}

/**
 * Summary statistics for a run
 */
export interface RunSummary {
  /** Total cases run */
  total: number;

  /** Cases that passed */
  passed: number;

  /** Cases that failed */
  failed: number;

  /** Cases that were skipped */
  skipped: number;

  /** Cases that timed out */
  timedOut: number;

  /** Average score across all cases */
  averageScore: number;

  /** Total duration in milliseconds */
  totalDurationMs: number;
}
Remove this .bak file from the repository.
types.ts.bak is a backup artifact and should not be committed. It duplicates types.ts (minus the new difficulty-ladder additions) and will cause confusion. Add *.bak to .gitignore and delete this file.
#!/bin/bash
# Check if .bak files are already in .gitignore
fd ".gitignore" --exec grep -n "bak" {} +
echo "---"
# Check for any other .bak files in the repo
fd -e bak
Issue
Closes #22
Status: ✓ verified
Build, tests, and lint all pass locally.
Summary
Automated implementation by Ralph (rlmkit + MiniMax M2.5).
Review the changes carefully — this was generated by a local model.
Summary by CodeRabbit
New Features