ralph: #44 — Change interview grading from 1-10 scale to binary pass/fail#51
ralph: #44 — Change interview grading from 1-10 scale to binary pass/fail#51jharris1679 wants to merge 1 commit intomainfrom
Conversation
WalkthroughThe PR implements binary pass/fail grading instead of a 1-10 scale in documentation and introduces a comprehensive interview command with multi-stage workflows supporting normal interview and baseline comparison modes, baseline management, metrics computation, and sandbox orchestration. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant CLI as Interview CLI
participant Agent
participant BaselineStore as Baseline Store
participant Metrics
User->>CLI: Start interview command
CLI->>CLI: Load interview cases
loop For each case
CLI->>BaselineStore: Load baseline (if exists)
BaselineStore-->>CLI: Baseline data
CLI->>Agent: Execute agent with case
Agent-->>CLI: Stream output + metrics
CLI->>CLI: Display agent output
CLI->>Metrics: Compute behavior metrics
Metrics-->>CLI: Formatted metrics
CLI->>User: Prompt: Pass or Fail?
User-->>CLI: Provide verdict + notes
CLI->>CLI: Accumulate result
end
CLI->>BaselineStore: Save new baseline/run
CLI->>User: Display comparison summary
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
docs/USAGE.md (2)
126-134:⚠️ Potential issue | 🟡 MinorExample output still shows numeric 1-10 grades, contradicting the pass/fail change.
Line 71 was updated to say "pass/fail," but the
runs listexample still displays8.5/10and9.0/10in theGradecolumn. This should reflect the new binary grading (e.g.,Pass/Failor a pass-rate percentage).
278-294:⚠️ Potential issue | 🟡 MinorComparison output examples still use numeric grades.
Lines 286–287 show
Grade 1/Grade 2as8/10,9/10,7/10,8/10, and Line 291 showsAverage grade: 7.5 → 8.5 ↑ +13.3%. These should be updated to use pass/fail or pass-rate metrics to match the new grading system.README.md (1)
78-90:⚠️ Potential issue | 🟡 MinorComparison example still shows numeric grades.
Line 85 (
Grade 7/10 → 9/10 ↑) contradicts the pass/fail change made on Line 40. Update this example to reflect binary grading (e.g.,Result: Fail → Pass).
🤖 Fix all issues with AI agents
In `@src/cli/commands/interview.ts.backup`:
- Around line 298-328: ask() currently resolves to '' for both timeout and EOF
which is indistinguishable from user pressing Enter and causes askResult() to
loop forever; change ask() so that on timeout it rejects with a distinct
TimeoutError (or throws a descriptive Error) and on stdin close (rl 'close')
resolve to null (or another distinct sentinel) instead of ''. Update all callers
such as askResult() to catch the timeout rejection and handle it (e.g.,
abort/re-prompt with a message) and to treat null as EOF so they don't re-prompt
infinitely; ensure you remove the previous logic that relied on '' as the
timeout/EOF signal and adapt any rl.once/rl.removeListener handling accordingly.
- Around line 1050-1056: The status computation incorrectly treats an existing
baseline with result === undefined as a failure; update the logic in the loop
over casesToRun (where hasBaseline = store.baselines[c.id]) so that "not
evaluated" is shown when hasBaseline is missing OR hasBaseline.result ===
undefined; otherwise display "✓ Pass" when hasBaseline.result is truthy and "✗
Fail" when falsey. Modify the status assignment around hasBaseline/result to
explicitly check for undefined rather than relying on truthy/falsy.
- Around line 682-686: The baseline display logic treats undefined the same as
false and shows "✗ Fail"; update the block that computes resultDisplay (and
dateDisplay) to explicitly check for baseline.result === undefined and show a
distinct "• Unevaluated" state (matching runInterviewQuestion behavior), only
using chalk.green('✓ Pass') when true and chalk.red('✗ Fail') when false; ensure
dateDisplay still handles missing baseline.resultAt (keep existing split
fallback) and update the string printed (around console.log(box(caseData.prompt,
`Question: ${caseData.title}`)) and the following console.log that references
resultDisplay and dateDisplay) so unevaluated baselines are displayed as "•
Unevaluated" instead of "✗ Fail".
- Around line 1-9: The file interview.ts.backup contains the Interview command
copy and won't be compiled due to its .ts.backup extension; remove or archive
that backup and ensure the real implementation lives in
src/cli/commands/interview.ts (or merge any missing code from
interview.ts.backup into interview.ts), then verify the Interview command (the
conversational "Interview" implementation and its exported function/class) is
correctly exported and registered with your CLI entry (e.g., the commands index
or CLI registry) so the build includes it.
🧹 Nitpick comments (2)
src/cli/commands/interview.ts.backup (2)
486-493:_outputStartedis dead code.
_outputStartedis set on Line 493 but never read anywhere. The actual flag used throughout istextOutputStarted(Line 489). Remove_outputStartedto avoid confusion.Similarly,
_isRunModeon Line 873 is assigned but never read.Proposed fix
- let _outputStarted = false; const startTime = Date.now(); let textOutputStarted = false; try { const result = await getAgentResponse(caseData, agent, projectRoot, (event) => { - _outputStarted = true; - switch (event.type) {const isCompareMode = options.compare === true; - const _isRunMode = !!options.run;
694-746: Significant code duplication betweenrunComparisonCaseandrunInterviewQuestionevent handlers.The event-handling callback in
runComparisonCase(Lines 694–746) is a near-copy of the one inrunInterviewQuestion(Lines 495–593), minus thethinkingandstatuscases. Consider extracting a shared helper (e.g.,createEventHandler(exploration, agent, textOutputStartedRef)) to reduce duplication and ensure consistent behavior across modes.Also applies to: 495-593
| /** | ||
| * Interview command - conversational comprehension evaluation | ||
| * | ||
| * Inspired by Anthropic Interviewer patterns: | ||
| * - Three stages: Setup → Interview → Grade | ||
| * - Conversational, not rigid | ||
| * - Human-in-the-loop for quality | ||
| */ | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
fd 'interview\.ts' src/cli/commands/ --type fRepository: AnswerLayer/sniffbench
Length of output: 132
File has a .ts.backup extension — it won't be compiled or bundled.
Both interview.ts and interview.ts.backup exist in src/cli/commands/. Since .ts.backup files are ignored by the TypeScript compiler, any code in this file won't be included in the build output. Ensure the actual implementation is in interview.ts instead.
🤖 Prompt for AI Agents
In `@src/cli/commands/interview.ts.backup` around lines 1 - 9, The file
interview.ts.backup contains the Interview command copy and won't be compiled
due to its .ts.backup extension; remove or archive that backup and ensure the
real implementation lives in src/cli/commands/interview.ts (or merge any missing
code from interview.ts.backup into interview.ts), then verify the Interview
command (the conversational "Interview" implementation and its exported
function/class) is correctly exported and registered with your CLI entry (e.g.,
the commands index or CLI registry) so the build includes it.
| async function ask(rl: readline.Interface, question: string, timeoutMs: number = USER_INPUT_TIMEOUT_MS): Promise<string> { | ||
| return new Promise((resolve, _reject) => { | ||
| let answered = false; | ||
|
|
||
| const timeout = setTimeout(() => { | ||
| if (!answered) { | ||
| answered = true; | ||
| resolve(''); // Return empty on timeout | ||
| } | ||
| }, timeoutMs); | ||
|
|
||
| // Handle readline close (e.g., stdin EOF) | ||
| const onClose = () => { | ||
| if (!answered) { | ||
| answered = true; | ||
| clearTimeout(timeout); | ||
| resolve(''); | ||
| } | ||
| }; | ||
| rl.once('close', onClose); | ||
|
|
||
| rl.question(question, (answer) => { | ||
| if (!answered) { | ||
| answered = true; | ||
| clearTimeout(timeout); | ||
| rl.removeListener('close', onClose); | ||
| resolve(answer.trim()); | ||
| } | ||
| }); | ||
| }); | ||
| } |
There was a problem hiding this comment.
Timeout silently returns empty string — caller never knows it timed out.
ask() returns '' on both timeout and EOF, making it indistinguishable from the user pressing Enter. In askResult() this causes an infinite loop (empty string doesn't match y/n, so it re-prompts forever after a 5-minute timeout or EOF). Consider returning a sentinel or throwing on timeout so callers can handle it gracefully.
🤖 Prompt for AI Agents
In `@src/cli/commands/interview.ts.backup` around lines 298 - 328, ask() currently
resolves to '' for both timeout and EOF which is indistinguishable from user
pressing Enter and causes askResult() to loop forever; change ask() so that on
timeout it rejects with a distinct TimeoutError (or throws a descriptive Error)
and on stdin close (rl 'close') resolve to null (or another distinct sentinel)
instead of ''. Update all callers such as askResult() to catch the timeout
rejection and handle it (e.g., abort/re-prompt with a message) and to treat null
as EOF so they don't re-prompt infinitely; ensure you remove the previous logic
that relied on '' as the timeout/EOF signal and adapt any
rl.once/rl.removeListener handling accordingly.
| // Show the question | ||
| console.log(box(caseData.prompt, `Question: ${caseData.title}`)); | ||
| const resultDisplay = baseline.result ? chalk.green('✓ Pass') : chalk.red('✗ Fail'); | ||
| const dateDisplay = baseline.resultAt ? baseline.resultAt.split('T')[0] : 'unknown'; | ||
| console.log(chalk.dim(` Baseline: ${resultDisplay} ${chalk.dim(`(evaluated: ${dateDisplay})`)}`)); |
There was a problem hiding this comment.
Undefined baseline.result silently displayed as "✗ Fail".
Line 684 uses baseline.result ? ... : chalk.red('✗ Fail'), which treats undefined the same as false. However, in runInterviewQuestion (Lines 468–472), undefined is correctly distinguished as "• Unevaluated". Since compare mode filters to cases with baselines (not necessarily evaluated baselines), an unevaluated baseline would be misleadingly shown as "Fail" here.
Proposed fix
- const resultDisplay = baseline.result ? chalk.green('✓ Pass') : chalk.red('✗ Fail');
+ const resultDisplay = baseline.result === undefined
+ ? chalk.yellow('• Unevaluated')
+ : baseline.result
+ ? chalk.green('✓ Pass')
+ : chalk.red('✗ Fail');📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Show the question | |
| console.log(box(caseData.prompt, `Question: ${caseData.title}`)); | |
| const resultDisplay = baseline.result ? chalk.green('✓ Pass') : chalk.red('✗ Fail'); | |
| const dateDisplay = baseline.resultAt ? baseline.resultAt.split('T')[0] : 'unknown'; | |
| console.log(chalk.dim(` Baseline: ${resultDisplay} ${chalk.dim(`(evaluated: ${dateDisplay})`)}`)); | |
| // Show the question | |
| console.log(box(caseData.prompt, `Question: ${caseData.title}`)); | |
| const resultDisplay = baseline.result === undefined | |
| ? chalk.yellow('• Unevaluated') | |
| : baseline.result | |
| ? chalk.green('✓ Pass') | |
| : chalk.red('✗ Fail'); | |
| const dateDisplay = baseline.resultAt ? baseline.resultAt.split('T')[0] : 'unknown'; | |
| console.log(chalk.dim(` Baseline: ${resultDisplay} ${chalk.dim(`(evaluated: ${dateDisplay})`)}`)); |
🤖 Prompt for AI Agents
In `@src/cli/commands/interview.ts.backup` around lines 682 - 686, The baseline
display logic treats undefined the same as false and shows "✗ Fail"; update the
block that computes resultDisplay (and dateDisplay) to explicitly check for
baseline.result === undefined and show a distinct "• Unevaluated" state
(matching runInterviewQuestion behavior), only using chalk.green('✓ Pass') when
true and chalk.red('✗ Fail') when false; ensure dateDisplay still handles
missing baseline.resultAt (keep existing split fallback) and update the string
printed (around console.log(box(caseData.prompt, `Question: ${caseData.title}`))
and the following console.log that references resultDisplay and dateDisplay) so
unevaluated baselines are displayed as "• Unevaluated" instead of "✗ Fail".
| for (const c of casesToRun) { | ||
| const hasBaseline = store.baselines[c.id]; | ||
| const status = hasBaseline | ||
| ? (hasBaseline.result ? chalk.green('✓ Pass') : chalk.red('✗ Fail')) | ||
| : chalk.dim('○ not evaluated'); | ||
| console.log(` ${status} ${chalk.bold(c.id)}: ${c.title}`); | ||
| } |
There was a problem hiding this comment.
Same undefined result display issue in question listing.
Line 1053 treats hasBaseline.result as truthy/falsy, so undefined (unevaluated) renders as "✗ Fail" instead of the "○ not evaluated" shown for missing baselines. An unevaluated existing baseline falls through incorrectly.
Proposed fix
const hasBaseline = store.baselines[c.id];
- const status = hasBaseline
- ? (hasBaseline.result ? chalk.green('✓ Pass') : chalk.red('✗ Fail'))
- : chalk.dim('○ not evaluated');
+ const status = hasBaseline
+ ? (hasBaseline.result === undefined
+ ? chalk.yellow('• Unevaluated')
+ : hasBaseline.result
+ ? chalk.green('✓ Pass')
+ : chalk.red('✗ Fail'))
+ : chalk.dim('○ not evaluated');📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for (const c of casesToRun) { | |
| const hasBaseline = store.baselines[c.id]; | |
| const status = hasBaseline | |
| ? (hasBaseline.result ? chalk.green('✓ Pass') : chalk.red('✗ Fail')) | |
| : chalk.dim('○ not evaluated'); | |
| console.log(` ${status} ${chalk.bold(c.id)}: ${c.title}`); | |
| } | |
| for (const c of casesToRun) { | |
| const hasBaseline = store.baselines[c.id]; | |
| const status = hasBaseline | |
| ? (hasBaseline.result === undefined | |
| ? chalk.yellow('• Unevaluated') | |
| : hasBaseline.result | |
| ? chalk.green('✓ Pass') | |
| : chalk.red('✗ Fail')) | |
| : chalk.dim('○ not evaluated'); | |
| console.log(` ${status} ${chalk.bold(c.id)}: ${c.title}`); | |
| } |
🤖 Prompt for AI Agents
In `@src/cli/commands/interview.ts.backup` around lines 1050 - 1056, The status
computation incorrectly treats an existing baseline with result === undefined as
a failure; update the logic in the loop over casesToRun (where hasBaseline =
store.baselines[c.id]) so that "not evaluated" is shown when hasBaseline is
missing OR hasBaseline.result === undefined; otherwise display "✓ Pass" when
hasBaseline.result is truthy and "✗ Fail" when falsey. Modify the status
assignment around hasBaseline/result to explicitly check for undefined rather
than relying on truthy/falsy.
Issue
Closes #44
Status: ✓ verified
Build, tests, and lint all pass locally.
Summary
Automated implementation by Ralph (rlmkit + MiniMax M2.5).
Review the changes carefully — this was generated by a local model.
Summary by CodeRabbit
New Features
Documentation