
Conversation

@ericelliott ericelliott (Collaborator) commented Jan 23, 2026

Implement AI test runner foundation following TDD process:

  • readTestFile(): Read test file contents (any extension)
  • calculateRequiredPasses(): Ceiling math for threshold calculation

Architecture decisions documented:

  • Agent-agnostic design via configurable agentConfig
  • Default to Claude Code CLI: claude -p --output-format json
  • Subprocess per run = automatic context isolation
  • Support for OpenCode and Cursor CLI alternatives
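
A minimal sketch of what that default configuration could look like, inferred from the notes above (the property names follow the agentConfig object discussed later in this thread; flags for the OpenCode and Cursor CLIs are not specified here, so only the Claude Code default is shown):

// Sketch of the default agent configuration: "claude -p --output-format json".
const defaultAgentConfig = {
  command: 'claude',
  args: ['-p', '--output-format', 'json']
};

Swapping agents then means swapping this object rather than changing the runner code.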

Files added:

  • source/ai-runner.js (core module)
  • source/ai-runner.test.js (4 passing tests)

Next steps documented in epic:

  • executeAgent() - spawn CLI subprocess
  • aggregateResults() - aggregate pass/fail
  • runAITests() - orchestrate parallel runs

Note

Introduces an AI prompt testing workflow and CLI, plus supporting docs and agent rules.

  • New riteway ai CLI: Parallel, thresholded AI prompt runs with agent selection (claude|opencode|cursor), structured errors, and help output
  • Core modules: source/ai-runner.js (file read, agent exec, aggregation) and source/test-output.js (TAP output, date/slug naming, browser open)
  • CLI integration: Extends bin/riteway with ai subcommand, argument parsing, getAgentConfig(), and error-causes handling; adds comprehensive unit and E2E tests
  • Docs & config: README section for riteway ai, new AI agent guidelines/rules under ai/, plan.md epic completion, vitest.config.js update
  • Deps: Adds @paralleldrive/cuid2 and open

Written by Cursor Bugbot for commit 3dc4912.

Implement AI test runner foundation following TDD process:
- readTestFile(): Read test file contents (any extension)
- calculateRequiredPasses(): Ceiling math for threshold calculation

Architecture decisions documented:
- Agent-agnostic design via configurable agentConfig
- Default to Claude Code CLI: `claude -p --output-format json`
- Subprocess per run = automatic context isolation
- Support for OpenCode and Cursor CLI alternatives

Files added:
- source/ai-runner.js (core module)
- source/ai-runner.test.js (4 passing tests)

Next steps documented in epic:
- executeAgent() - spawn CLI subprocess
- aggregateResults() - aggregate pass/fail
- runAITests() - orchestrate parallel runs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove unused imports (vi, aggregateResults, runAITests)
- Add threshold validation (0-100 range check)
- Fix test race condition with unique directory names
- Fix resource leak by moving file ops into try block
- Add tests for threshold validation edge cases

Resolves all bug bot comments from PR #394

Implement core AI testing framework modules:

- Add executeAgent() with 5-minute default timeout and enhanced error messages

- Add aggregateResults() for multi-run pass/fail calculation

- Add runAITests() orchestrating parallel test execution

- Add test output recording with TAP v13 format

- Add browser auto-open for test results

- Add slug generation via cuid2 for unique output files

- Include comprehensive test coverage (31 tests)

Enhanced error handling includes command context, stderr, and stdout previews for debugging.

Task 2 and Task 3 complete from epic 2026-01-22-riteway-ai-testing-framework.md

Implement error-causes library pattern for structured error handling

Add --agent flag to support claude, opencode, and cursor agents

Add getAgentConfig() function with agent name validation

Consolidate path imports into single statement

Expand test coverage from 40 to 49 TAP tests

Document code quality improvements in Task 6

Complete AI testing framework implementation with:

- Comprehensive E2E test suite (13 assertions)

- Full workflow testing with mock agent

- TAP output format verification

- AI testing documentation in README

- CLI usage examples and agent configuration docs

- ESLint configuration update (ES2022 support)

- Linter fixes (unused imports, catch parameters)

- Vitest exclusion for Riteway/TAP tests

All 62 TAP tests + 37 Vitest tests passing

Epic: tasks/archive/2026-01-22-riteway-ai-testing-framework.md

@ericelliott ericelliott (Collaborator, Author) commented:

@cursor please /review 889429e

@cursor cursor bot commented Jan 23, 2026

🔬 Code Review

🎯 Restatement

Reviewing commit 889429e which implements the AI test runner foundation with:

  • Core module (source/ai-runner.js) with readTestFile() and calculateRequiredPasses()
  • Test coverage (source/ai-runner.test.js) with initial TDD approach
  • Architecture decisions for agent-agnostic design

💡 Initial Analysis

Strengths:

  • Clean TDD approach with tests written first
  • Good test coverage for implemented functions
  • Clear JSDoc documentation
  • Proper use of async/await patterns
  • Agent-agnostic architecture via configurable agentConfig

Areas for Improvement:
The review bot correctly identified several issues that need addressing.

🪞 Critical Reflection

1. Code Structure and Organization

Positive:

  • Functions are focused and follow single responsibility principle
  • Clean separation between file I/O and calculations
  • Good use of ES6 modules

Issues:
None - structure is solid.

2. Adherence to Coding Standards

JavaScript Guide Violations:

export const calculateRequiredPasses = ({ runs = 4, threshold = 75 } = {}) => {
  return Math.ceil((runs * threshold) / 100);
};

Issue: No validation on threshold parameter as the bot noted. With threshold=150 and runs=4, this returns 6 required passes (impossible). The latest code shows this was fixed by adding validation:

export const calculateRequiredPasses = ({ runs = 4, threshold = 75 } = {}) => {
  if (threshold < 0 || threshold > 100) {
    throw new Error('threshold must be between 0 and 100');
  }
  return Math.ceil((runs * threshold) / 100);
};

Status: ✅ Fixed - Good validation added with comprehensive tests.
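
For concreteness, here is how the fixed function behaves with the defaults and with an out-of-range threshold (the values follow directly from the ceiling math above; the import path assumes the module layout listed in this PR):

import { calculateRequiredPasses } from './ai-runner.js';

calculateRequiredPasses();                             // ceil(4 * 75 / 100)  -> 3
calculateRequiredPasses({ runs: 10, threshold: 50 });  // ceil(10 * 50 / 100) -> 5
calculateRequiredPasses({ runs: 4, threshold: 150 });  // throws: threshold must be between 0 and 100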

3. Test Coverage and Quality

TDD Compliance:

The tests answer the 5 key questions:

  1. ✅ What is the unit under test? (Named describe blocks)
  2. ✅ What is the expected behavior? (Clear given and should)
  3. ✅ What is the actual output? (Units exercised)
  4. ✅ What is the expected output? (Clear expectations)
  5. ✅ How can we find the bug? (Test isolation)

Test Isolation Issues:

const testDir = join(tmpdir(), 'riteway-test-' + Date.now() + '-' + Math.random().toString(36).slice(2));

Status: ✅ Fixed - The addition of Math.random().toString(36).slice(2) creates unique directory names, preventing race conditions from Date.now() alone.

Resource Leak Issues:

test('reads file contents from path', async () => {
  const testDir = join(tmpdir(), 'riteway-test-' + Date.now() + '-' + Math.random().toString(36).slice(2));
  let testFile;
  
  try {
    mkdirSync(testDir, { recursive: true });
    testFile = join(testDir, 'test.sudo');
    const contents = 'describe("test", { requirements: ["should work"] })';
    writeFileSync(testFile, contents);

Issue: Directory creation and file write happen outside try block. If writeFileSync throws, the directory is not cleaned up.

Recommended Fix:

test('reads file contents from path', async () => {
  const testDir = join(tmpdir(), 'riteway-test-' + Date.now() + '-' + Math.random().toString(36).slice(2));
  
  try {
    mkdirSync(testDir, { recursive: true });
    const testFile = join(testDir, 'test.sudo');
    const contents = 'describe("test", { requirements: ["should work"] })';
    writeFileSync(testFile, contents);
    
    // ... rest of test
  } finally {
    rmSync(testDir, { recursive: true, force: true });
  }
});

Move ALL file operations inside the try block to ensure cleanup on any failure.

4. Performance Considerations

Good:

  • readTestFile uses async I/O (fs/promises)
  • Math operations are efficient
  • No blocking operations

Concerns:
None for current implementation.

5. Security Review (OWASP Top 10)

A01:2021 – Broken Access Control:

  • ✅ No user authentication/authorization in this module
  • ⚠️ Path Traversal Risk: readTestFile(filePath) accepts arbitrary paths without validation
    • User could supply ../../../../etc/passwd
    • Should validate path is within expected directories
    • Recommendation: Add path validation in CLI layer before calling this function
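
One possible shape for that CLI-layer guard, as a hedged sketch (assertPathWithin, its parameters, and the error message are hypothetical and not code from this PR):

import { resolve, sep } from 'node:path';

// Reject any path that resolves outside the allowed base directory before
// readTestFile() is ever called.
export const assertPathWithin = ({ filePath, allowedDir = process.cwd() }) => {
  const resolved = resolve(filePath);
  const base = resolve(allowedDir);
  if (resolved !== base && !resolved.startsWith(base + sep)) {
    throw new Error(`Refusing to read test file outside ${base}`);
  }
  return resolved;
};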

A02:2021 – Cryptographic Failures:

  • ✅ N/A - No cryptographic operations

A03:2021 – Injection:

  • ✅ No SQL/command injection in current code
  • ⚠️ Future Risk: When executeAgent() is implemented, need to sanitize prompt inputs
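
To make that future risk concrete, here is a hedged sketch of a subprocess call that avoids shell interpretation entirely (executeAgentSketch is a hypothetical name; the real executeAgent() in this PR may be implemented differently):

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Passing the prompt as an argv entry (no shell) sidesteps command injection.
// The agentConfig shape mirrors the one discussed elsewhere in this thread.
export const executeAgentSketch = async ({ agentConfig, prompt, timeout = 300000 }) => {
  const { stdout } = await execFileAsync(
    agentConfig.command,
    [...agentConfig.args, prompt],
    { timeout }
  );
  return stdout;
};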

A04:2021 – Insecure Design:

  • ✅ Agent-agnostic design is solid
  • ✅ Subprocess isolation pattern is secure

A05:2021 – Security Misconfiguration:

  • ✅ No configuration issues in current code

A06:2021 – Vulnerable and Outdated Components:

  • ✅ Using Node.js built-in modules (fs, child_process)

A07:2021 – Identification and Authentication Failures:

  • ✅ N/A - No auth in this module

A08:2021 – Software and Data Integrity Failures:

  • ✅ No integrity checks needed for current scope

A09:2021 – Security Logging and Monitoring Failures:

  • ⚠️ Missing: No logging for file access attempts
  • Recommendation: Add logging in CLI layer

A10:2021 – Server-Side Request Forgery:

  • ✅ N/A - No server requests in current code

6. Architectural Patterns

Excellent Decisions:

  • ✅ Agent-agnostic via agentConfig
  • ✅ Subprocess per run = automatic context isolation
  • ✅ Pure functions for calculations
  • ✅ Separation of concerns

Adherence to Epic Requirements:

  • ✅ Extension-agnostic file reading
  • ✅ Threshold calculation with ceiling
  • ✅ Agent delegation pattern (ready for subprocess implementation)
  • ✅ TDD approach followed

7. Documentation Quality

JSDoc Comments:

/**
 * Read the contents of a test file.
 * @param {string} filePath - Path to the test file
 * @returns {Promise<string>} File contents
 */
export const readTestFile = (filePath) => readFile(filePath, 'utf-8');

Good: Clear, minimal documentation following JavaScript guide principles.

/**
 * Calculate the number of passes required to meet the threshold.
 * Uses ceiling to ensure threshold is met or exceeded.
 * @param {Object} options
 * @param {number} [options.runs=4] - Total number of test runs
 * @param {number} [options.threshold=75] - Required pass percentage (0-100)
 * @returns {number} Number of passes required
 * @throws {Error} If threshold is not between 0 and 100
 */

Excellent: Documents parameters, defaults, return values, and error conditions.

🔭 Broader Context

Integration Readiness:
The current implementation provides solid foundation functions. When integrating with remaining tasks:

  1. Path validation should happen in CLI layer (Task 4)
  2. Logging should be added to runAICommand() (Task 4)
  3. Error handling patterns should propagate through the stack

Comparison to Epic:
Task 2 requirements are being met systematically:

  • ✅ Read test file (any extension)
  • ✅ Calculate required passes with threshold
  • ⏳ Execute agent (next function)
  • ⏳ Aggregate results (next function)
  • ⏳ Parallel runs (next function)

⚖️ Severity Assessment

Critical Issues: None

High Severity:

  • None currently, but path traversal should be addressed before production

Medium Severity:

  1. ✅ Test race condition - FIXED with random suffix
  2. Resource leak in test cleanup - SHOULD FIX

Low Severity:

  1. ✅ Unused import - Can be addressed when needed
  2. ✅ Threshold validation - FIXED

💬 Actionable Recommendations

Must Fix Before Merge:

  1. Fix resource leak in tests (Medium):
    • Move mkdirSync and writeFileSync inside try block
    • Ensure cleanup happens on all failure paths
    • Apply to both readTestFile() tests (lines 16-35 and 35-56)

Should Address Soon:

  1. Add path validation (Security):

    • Add in CLI layer (Task 4) before calling readTestFile()
    • Validate paths are within allowed directories
    • Prevent path traversal attacks
  2. Consider adding tests for edge cases:

    • calculateRequiredPasses({ runs: 0, threshold: 75 }) - What should happen?
    • calculateRequiredPasses({ runs: 4, threshold: 0 }) - Returns 0, but is this valid?
    • readTestFile() with non-existent file - Error handling test?
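
A hedged sketch of what those edge-case tests might look like (the assertion style is assumed to match the existing vitest suite in source/ai-runner.test.js; for threshold: 0 only the currently defined behavior is asserted, since the expected behavior is still an open question):

import { describe, test, expect } from 'vitest';
import { readTestFile, calculateRequiredPasses } from './ai-runner.js';

describe('calculateRequiredPasses() edge cases', () => {
  test('threshold of 0 currently requires zero passes', () => {
    expect(calculateRequiredPasses({ runs: 4, threshold: 0 })).toBe(0);
  });
});

describe('readTestFile() with a non-existent file', () => {
  test('rejects with a filesystem error', async () => {
    await expect(readTestFile('/no/such/file.sudo')).rejects.toThrow();
  });
});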

Nice to Have:

  1. Remove unused imports when not needed:
    • Currently no issue since tests will use vi later

Summary

This is high-quality TDD implementation that follows project standards well. The core functions are solid, well-tested, and properly documented. The architectural decisions align perfectly with the epic vision.

Key Strengths:

  • Excellent TDD discipline
  • Clean functional design
  • Good documentation
  • Smart architectural choices

Required Fixes:

  • Resource leak in test cleanup (1 issue, affects 2 tests)

Recommended Enhancements:

  • Path validation in CLI layer (security)
  • Edge case test coverage

Overall Assessment: APPROVE with minor fixes required

The implementation is on the right track and ready to proceed once the resource leak is addressed. The threshold validation issue was already fixed, and the race condition was resolved with the random suffix addition.


@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

throw new Error('threshold must be between 0 and 100');
}
return Math.ceil((runs * threshold) / 100);
};

Zero runs parameter causes false-positive test results

Medium Severity

The calculateRequiredPasses() function validates threshold but not runs. When runs is 0 or negative, Array.from({ length: runs }) creates an empty array, requiredPasses becomes 0, and passed: 0 >= 0 evaluates to true. This causes the test suite to report success without executing any tests. Additionally, Math.round(results.passCount / results.totalRuns * 100) in runAICommand produces NaN due to division by zero.
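
A minimal sketch of the missing guard (the exact rule, a positive integer, and the error message are assumptions to confirm against the epic):

export const calculateRequiredPasses = ({ runs = 4, threshold = 75 } = {}) => {
  if (!Number.isInteger(runs) || runs < 1) {
    throw new Error('runs must be a positive integer');
  }
  if (threshold < 0 || threshold > 100) {
    throw new Error('threshold must be between 0 and 100');
  }
  return Math.ceil((runs * threshold) / 100);
};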


OutputError: {
code: 'OUTPUT_ERROR',
message: 'Test output recording failed'
}

Unused OutputError type and handler are dead code

Low Severity

The OutputError error type is defined in errorCauses and destructured, and a handler is implemented in handleAIRunnerErrors, but it is never thrown anywhere in the codebase. All errors from recordTestOutput are caught by the generic catch block in runAICommand and wrapped as AITestError instead. This leaves dead code that adds maintenance burden and could confuse future developers about the error handling design.


@ericelliott ericelliott (Collaborator, Author) commented Jan 24, 2026

  1. const testDir = join(tmpdir(), 'riteway-test-' + Date.now() + '-' + Math.random().toString(36).slice(2));

     For filename disambiguation, we should use a 5-digit unique slug generated by cuid2 (see the sketch after this list).

  2. Please record video screencasts of each subagent successfully spawning, and each subagent should be exercised in the e2e tests.

  3. e2e tests should not mock the agents - they should call real agents.

  4. The e2e test should load 1 real script from the aidd Framework (NOT the /review script, NOT productmanager). Pick something somewhat simple, but complex enough to be actually representative of potential end user prompts.
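
A minimal sketch of that slug idea, assuming the init({ length }) helper exported by @paralleldrive/cuid2 (the dependency this PR already adds); the directory-name format is illustrative:

import { init } from '@paralleldrive/cuid2';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// 5-character slug generator (cuid2 emits lowercase letters and digits).
const createSlug = init({ length: 5 });

const testDir = join(tmpdir(), `riteway-test-${createSlug()}`);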

CRITICAL

Unit tests need to run in isolation, and the AI agent needs to judge the results. We need the agent to intelligently extract each individual test so that the subagents run them in isolation from each other; otherwise, the AI attention mechanism creates shared mutable state between tests and they don't run in proper isolation.

Proposal: we write the test scripts in SudoLang and have a pre-processing agent extract each test in isolation from the rest of the tests, inserting the prompt under test at the top and appending JUST ONE assertion per call. The assembled prompt is then sent to the sub-agent.

Sample code thinking:

Ideal test script looks like this:

import @promptUnderTest

# Sample AI Test

Test the basic math capability of the AI.

userPrompt = """
  What is 2 + 2? Please respond with JSON in this format:
  
      {
        "passed": true,
        "output": "2 + 2 = 4"
      }
"""

// In simple cases, we can just pass requirements like this:
- Given a simple addition prompt, should correctly add 2 + 2
- Given the result, should output the number 4
- Given the response format, should provide a JSON response with "passed" and "output" fields

// If you want to customize anything, you can do this:
assert(
  userPrompt: userPrompt2
  should: return the correct square root
)

Ideal Transformation

We'll need the LLM to call a sub-agent to extract tests, which would produce this shape of ideal transformation on a per-test basis:

promptUnderTest = <contents of prompt file>

userPrompt = """
  What is 2 + 2? Please respond with JSON in this format:
      
        {
           "passed": true,
            "output": "2 + 2 = 4"
        }
"""

The output would be the AI agent's response to the user, which we can then score against the assertion.

SudoLang Assert Type

type assert = (userPrompt?, promptUnderTest, requirement?, given?, should?, actual?, expected?)

Isolation

In order to invoke each assertion in isolation, we need to extract these into smaller prompts which only contain the context needed for the assertion we care about in the moment:

export const runAITests = async ({
  filePath,
  runs = 4,
  threshold = 75,
  timeout = 300000,
  agentConfig = {
    command: 'claude',
    args: ['-p', '--output-format', 'json', '--no-session-persistence']
  }
}) => {
  // Split the context so each extracted prompt contains only what we need
  // for a single assertion.
  const tests = await asyncPipe(
    readTestFile,
    extractTests
  )(filePath);

  // Run each isolated prompt `runs` times in parallel.
  const responses = await Promise.all(tests.map((prompt) =>
    Promise.all(
      Array.from({ length: runs }, () => executeAgent({ agentConfig, prompt, timeout }))
    )
  ));

  // Process and score each response set for prompt coherence.
  const scores = await Promise.all(responses.map((responseSet) => {
    /* process and score all the responses for prompt coherence */
  }));

  return aggregateResults({ runResults: responses, scores, threshold, runs });
};
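
A hypothetical invocation of the sketch above (the file path is made up for illustration):

const results = await runAITests({
  filePath: 'examples/basic-math.test.sudo',
  runs: 4,
  threshold: 75
});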
