
Conversation

@ericelliott ericelliott (Collaborator) commented Jan 23, 2026

Implement AI test runner foundation following TDD process:

  • readTestFile(): Read test file contents (any extension)
  • calculateRequiredPasses(): Ceiling math for threshold calculation

Architecture decisions documented:

  • Agent-agnostic design via configurable agentConfig
  • Default to Claude Code CLI: claude -p --output-format json
  • Subprocess per run = automatic context isolation
  • Support for OpenCode and Cursor CLI alternatives
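
A minimal sketch of what that default configuration could look like, inferred from the notes above (the property names follow the agentConfig object discussed later in this thread; flags for the OpenCode and Cursor CLIs are not specified here, so only the Claude Code default is shown):

// Sketch of the default agent configuration: "claude -p --output-format json".
const defaultAgentConfig = {
  command: 'claude',
  args: ['-p', '--output-format', 'json']
};

Swapping agents then means swapping this object rather than changing the runner code.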

Files added:

  • source/ai-runner.js (core module)
  • source/ai-runner.test.js (4 passing tests)

Next steps documented in epic:

  • executeAgent() - spawn CLI subprocess
  • aggregateResults() - aggregate pass/fail
  • runAITests() - orchestrate parallel runs

Note

Introduces an AI prompt testing workflow and CLI, plus supporting docs and agent rules.

  • New riteway ai CLI: Parallel, thresholded AI prompt runs with agent selection (claude|opencode|cursor), structured errors, and help output
  • Core modules: source/ai-runner.js (file read, agent exec, aggregation) and source/test-output.js (TAP output, date/slug naming, browser open)
  • CLI integration: Extends bin/riteway with ai subcommand, argument parsing, getAgentConfig(), and error-causes handling; adds comprehensive unit and E2E tests
  • Docs & config: README section for riteway ai, new AI agent guidelines/rules under ai/, plan.md epic completion, vitest.config.js update
  • Deps: Adds @paralleldrive/cuid2 and open

Written by Cursor Bugbot for commit 3dc4912.

Implement AI test runner foundation following TDD process:
- readTestFile(): Read test file contents (any extension)
- calculateRequiredPasses(): Ceiling math for threshold calculation

Architecture decisions documented:
- Agent-agnostic design via configurable agentConfig
- Default to Claude Code CLI: `claude -p --output-format json`
- Subprocess per run = automatic context isolation
- Support for OpenCode and Cursor CLI alternatives

Files added:
- source/ai-runner.js (core module)
- source/ai-runner.test.js (4 passing tests)

Next steps documented in epic:
- executeAgent() - spawn CLI subprocess
- aggregateResults() - aggregate pass/fail
- runAITests() - orchestrate parallel runs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove unused imports (vi, aggregateResults, runAITests)
- Add threshold validation (0-100 range check)
- Fix test race condition with unique directory names
- Fix resource leak by moving file ops into try block
- Add tests for threshold validation edge cases

Resolves all bug bot comments from PR #394

Implement core AI testing framework modules:

- Add executeAgent() with 5-minute default timeout and enhanced error messages

- Add aggregateResults() for multi-run pass/fail calculation

- Add runAITests() orchestrating parallel test execution

- Add test output recording with TAP v13 format

- Add browser auto-open for test results

- Add slug generation via cuid2 for unique output files

- Include comprehensive test coverage (31 tests)

Enhanced error handling includes command context, stderr, and stdout previews for debugging.

Task 2 and Task 3 complete from epic 2026-01-22-riteway-ai-testing-framework.md

Implement error-causes library pattern for structured error handling

Add --agent flag to support claude, opencode, and cursor agents

Add getAgentConfig() function with agent name validation

Consolidate path imports into single statement

Expand test coverage from 40 to 49 TAP tests

Document code quality improvements in Task 6

Complete AI testing framework implementation with:

- Comprehensive E2E test suite (13 assertions)

- Full workflow testing with mock agent

- TAP output format verification

- AI testing documentation in README

- CLI usage examples and agent configuration docs

- ESLint configuration update (ES2022 support)

- Linter fixes (unused imports, catch parameters)

- Vitest exclusion for Riteway/TAP tests

All 62 TAP tests + 37 Vitest tests passing

Epic: tasks/archive/2026-01-22-riteway-ai-testing-framework.md

@ericelliott ericelliott (Collaborator, Author) commented:

@cursor please /review 889429e

@cursor cursor bot commented Jan 23, 2026

🔬 Code Review

🎯 Restatement

Reviewing commit 889429e which implements the AI test runner foundation with:

  • Core module (source/ai-runner.js) with readTestFile() and calculateRequiredPasses()
  • Test coverage (source/ai-runner.test.js) with initial TDD approach
  • Architecture decisions for agent-agnostic design

💡 Initial Analysis

Strengths:

  • Clean TDD approach with tests written first
  • Good test coverage for implemented functions
  • Clear JSDoc documentation
  • Proper use of async/await patterns
  • Agent-agnostic architecture via configurable agentConfig

Areas for Improvement:
The review bot correctly identified several issues that need addressing.

🪞 Critical Reflection

1. Code Structure and Organization

Positive:

  • Functions are focused and follow single responsibility principle
  • Clean separation between file I/O and calculations
  • Good use of ES6 modules

Issues:
None - structure is solid.

2. Adherence to Coding Standards

JavaScript Guide Violations:

export const calculateRequiredPasses = ({ runs = 4, threshold = 75 } = {}) => {
  return Math.ceil((runs * threshold) / 100);
};

Issue: No validation on threshold parameter as the bot noted. With threshold=150 and runs=4, this returns 6 required passes (impossible). The latest code shows this was fixed by adding validation:

export const calculateRequiredPasses = ({ runs = 4, threshold = 75 } = {}) => {
  if (threshold < 0 || threshold > 100) {
    throw new Error('threshold must be between 0 and 100');
  }
  return Math.ceil((runs * threshold) / 100);
};

Status: ✅ Fixed - Good validation added with comprehensive tests.
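
For concreteness, here is how the fixed function behaves with the defaults and with an out-of-range threshold (the values follow directly from the ceiling math above; the import path assumes the module layout listed in this PR):

import { calculateRequiredPasses } from './ai-runner.js';

calculateRequiredPasses();                             // ceil(4 * 75 / 100)  -> 3
calculateRequiredPasses({ runs: 10, threshold: 50 });  // ceil(10 * 50 / 100) -> 5
calculateRequiredPasses({ runs: 4, threshold: 150 });  // throws: threshold must be between 0 and 100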

3. Test Coverage and Quality

TDD Compliance:

The tests answer the 5 key questions:

  1. ✅ What is the unit under test? (Named describe blocks)
  2. ✅ What is the expected behavior? (Clear given and should)
  3. ✅ What is the actual output? (Units exercised)
  4. ✅ What is the expected output? (Clear expectations)
  5. ✅ How can we find the bug? (Test isolation)

Test Isolation Issues:

const testDir = join(tmpdir(), 'riteway-test-' + Date.now() + '-' + Math.random().toString(36).slice(2));

Status: ✅ Fixed - The addition of Math.random().toString(36).slice(2) creates unique directory names, preventing race conditions from Date.now() alone.

Resource Leak Issues:

test('reads file contents from path', async () => {
  const testDir = join(tmpdir(), 'riteway-test-' + Date.now() + '-' + Math.random().toString(36).slice(2));
  let testFile;
  
  try {
    mkdirSync(testDir, { recursive: true });
    testFile = join(testDir, 'test.sudo');
    const contents = 'describe("test", { requirements: ["should work"] })';
    writeFileSync(testFile, contents);

Issue: Directory creation and file write happen outside try block. If writeFileSync throws, the directory is not cleaned up.

Recommended Fix:

test('reads file contents from path', async () => {
  const testDir = join(tmpdir(), 'riteway-test-' + Date.now() + '-' + Math.random().toString(36).slice(2));
  
  try {
    mkdirSync(testDir, { recursive: true });
    const testFile = join(testDir, 'test.sudo');
    const contents = 'describe("test", { requirements: ["should work"] })';
    writeFileSync(testFile, contents);
    
    // ... rest of test
  } finally {
    rmSync(testDir, { recursive: true, force: true });
  }
});

Move ALL file operations inside the try block to ensure cleanup on any failure.

4. Performance Considerations

Good:

  • readTestFile uses async I/O (fs/promises)
  • Math operations are efficient
  • No blocking operations

Concerns:
None for current implementation.

5. Security Review (OWASP Top 10)

A01:2021 – Broken Access Control:

  • ✅ No user authentication/authorization in this module
  • ⚠️ Path Traversal Risk: readTestFile(filePath) accepts arbitrary paths without validation
    • User could supply ../../../../etc/passwd
    • Should validate path is within expected directories
    • Recommendation: Add path validation in CLI layer before calling this function
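
One possible shape for that CLI-layer guard, as a hedged sketch (assertPathWithin, its parameters, and the error message are hypothetical and not code from this PR):

import { resolve, sep } from 'node:path';

// Reject any path that resolves outside the allowed base directory before
// readTestFile() is ever called.
export const assertPathWithin = ({ filePath, allowedDir = process.cwd() }) => {
  const resolved = resolve(filePath);
  const base = resolve(allowedDir);
  if (resolved !== base && !resolved.startsWith(base + sep)) {
    throw new Error(`Refusing to read test file outside ${base}`);
  }
  return resolved;
};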

A02:2021 – Cryptographic Failures:

  • ✅ N/A - No cryptographic operations

A03:2021 – Injection:

  • ✅ No SQL/command injection in current code
  • ⚠️ Future Risk: When executeAgent() is implemented, need to sanitize prompt inputs
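
To make that future risk concrete, here is a hedged sketch of a subprocess call that avoids shell interpretation entirely (executeAgentSketch is a hypothetical name; the real executeAgent() in this PR may be implemented differently):

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Passing the prompt as an argv entry (no shell) sidesteps command injection.
// The agentConfig shape mirrors the one discussed elsewhere in this thread.
export const executeAgentSketch = async ({ agentConfig, prompt, timeout = 300000 }) => {
  const { stdout } = await execFileAsync(
    agentConfig.command,
    [...agentConfig.args, prompt],
    { timeout }
  );
  return stdout;
};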

A04:2021 – Insecure Design:

  • ✅ Agent-agnostic design is solid
  • ✅ Subprocess isolation pattern is secure

A05:2021 – Security Misconfiguration:

  • ✅ No configuration issues in current code

A06:2021 – Vulnerable and Outdated Components:

  • ✅ Using Node.js built-in modules (fs, child_process)

A07:2021 – Identification and Authentication Failures:

  • ✅ N/A - No auth in this module

A08:2021 – Software and Data Integrity Failures:

  • ✅ No integrity checks needed for current scope

A09:2021 – Security Logging and Monitoring Failures:

  • ⚠️ Missing: No logging for file access attempts
  • Recommendation: Add logging in CLI layer

A10:2021 – Server-Side Request Forgery:

  • ✅ N/A - No server requests in current code

6. Architectural Patterns

Excellent Decisions:

  • ✅ Agent-agnostic via agentConfig
  • ✅ Subprocess per run = automatic context isolation
  • ✅ Pure functions for calculations
  • ✅ Separation of concerns

Adherence to Epic Requirements:

  • ✅ Extension-agnostic file reading
  • ✅ Threshold calculation with ceiling
  • ✅ Agent delegation pattern (ready for subprocess implementation)
  • ✅ TDD approach followed

7. Documentation Quality

JSDoc Comments:

/**
 * Read the contents of a test file.
 * @param {string} filePath - Path to the test file
 * @returns {Promise<string>} File contents
 */
export const readTestFile = (filePath) => readFile(filePath, 'utf-8');

Good: Clear, minimal documentation following JavaScript guide principles.

/**
 * Calculate the number of passes required to meet the threshold.
 * Uses ceiling to ensure threshold is met or exceeded.
 * @param {Object} options
 * @param {number} [options.runs=4] - Total number of test runs
 * @param {number} [options.threshold=75] - Required pass percentage (0-100)
 * @returns {number} Number of passes required
 * @throws {Error} If threshold is not between 0 and 100
 */

Excellent: Documents parameters, defaults, return values, and error conditions.

🔭 Broader Context

Integration Readiness:
The current implementation provides solid foundation functions. When integrating with remaining tasks:

  1. Path validation should happen in CLI layer (Task 4)
  2. Logging should be added to runAICommand() (Task 4)
  3. Error handling patterns should propagate through the stack

Comparison to Epic:
Task 2 requirements are being met systematically:

  • ✅ Read test file (any extension)
  • ✅ Calculate required passes with threshold
  • ⏳ Execute agent (next function)
  • ⏳ Aggregate results (next function)
  • ⏳ Parallel runs (next function)

⚖️ Severity Assessment

Critical Issues: None

High Severity:

  • None currently, but path traversal should be addressed before production

Medium Severity:

  1. ✅ Test race condition - FIXED with random suffix
  2. Resource leak in test cleanup - SHOULD FIX

Low Severity:

  1. ✅ Unused import - Can be addressed when needed
  2. ✅ Threshold validation - FIXED

💬 Actionable Recommendations

Must Fix Before Merge:

  1. Fix resource leak in tests (Medium):
    • Move mkdirSync and writeFileSync inside try block
    • Ensure cleanup happens on all failure paths
    • Apply to both readTestFile() tests (lines 16-35 and 35-56)

Should Address Soon:

  1. Add path validation (Security):

    • Add in CLI layer (Task 4) before calling readTestFile()
    • Validate paths are within allowed directories
    • Prevent path traversal attacks
  2. Consider adding tests for edge cases:

    • calculateRequiredPasses({ runs: 0, threshold: 75 }) - What should happen?
    • calculateRequiredPasses({ runs: 4, threshold: 0 }) - Returns 0, but is this valid?
    • readTestFile() with non-existent file - Error handling test?
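
A hedged sketch of what those edge-case tests might look like (the assertion style is assumed to match the existing vitest suite in source/ai-runner.test.js; for threshold: 0 only the currently defined behavior is asserted, since the expected behavior is still an open question):

import { describe, test, expect } from 'vitest';
import { readTestFile, calculateRequiredPasses } from './ai-runner.js';

describe('calculateRequiredPasses() edge cases', () => {
  test('threshold of 0 currently requires zero passes', () => {
    expect(calculateRequiredPasses({ runs: 4, threshold: 0 })).toBe(0);
  });
});

describe('readTestFile() with a non-existent file', () => {
  test('rejects with a filesystem error', async () => {
    await expect(readTestFile('/no/such/file.sudo')).rejects.toThrow();
  });
});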

Nice to Have:

  1. Remove unused imports when not needed:
    • Currently no issue since tests will use vi later

Summary

This is high-quality TDD implementation that follows project standards well. The core functions are solid, well-tested, and properly documented. The architectural decisions align perfectly with the epic vision.

Key Strengths:

  • Excellent TDD discipline
  • Clean functional design
  • Good documentation
  • Smart architectural choices

Required Fixes:

  • Resource leak in test cleanup (1 issue, affects 2 tests)

Recommended Enhancements:

  • Path validation in CLI layer (security)
  • Edge case test coverage

Overall Assessment: APPROVE with minor fixes required

The implementation is on the right track and ready to proceed once the resource leak is addressed. The threshold validation issue was already fixed, and the race condition was resolved with the random suffix addition.


@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

throw new Error('threshold must be between 0 and 100');
}
return Math.ceil((runs * threshold) / 100);
};

Zero runs parameter causes false-positive test results

Medium Severity

The calculateRequiredPasses() function validates threshold but not runs. When runs is 0 or negative, Array.from({ length: runs }) creates an empty array, requiredPasses becomes 0, and passed: 0 >= 0 evaluates to true. This causes the test suite to report success without executing any tests. Additionally, Math.round(results.passCount / results.totalRuns * 100) in runAICommand produces NaN due to division by zero.
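
A minimal sketch of the missing guard (the exact rule, a positive integer, and the error message are assumptions to confirm against the epic):

export const calculateRequiredPasses = ({ runs = 4, threshold = 75 } = {}) => {
  if (!Number.isInteger(runs) || runs < 1) {
    throw new Error('runs must be a positive integer');
  }
  if (threshold < 0 || threshold > 100) {
    throw new Error('threshold must be between 0 and 100');
  }
  return Math.ceil((runs * threshold) / 100);
};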


OutputError: {
code: 'OUTPUT_ERROR',
message: 'Test output recording failed'
}

Unused OutputError type and handler are dead code

Low Severity

The OutputError error type is defined in errorCauses and destructured, and a handler is implemented in handleAIRunnerErrors, but it is never thrown anywhere in the codebase. All errors from recordTestOutput are caught by the generic catch block in runAICommand and wrapped as AITestError instead. This leaves dead code that adds maintenance burden and could confuse future developers about the error handling design.


@ericelliott ericelliott (Collaborator, Author) commented Jan 24, 2026

  1. const testDir = join(tmpdir(), 'riteway-test-' + Date.now() + '-' + Math.random().toString(36).slice(2));

     For filename disambiguation, we should use a 5-digit unique slug generated by cuid2 (see the sketch after this list).

  2. Please record video screencasts of each subagent successfully spawning, and each subagent should be exercised in the e2e tests.

  3. e2e tests should not mock the agents - they should call real agents.

  4. The e2e test should load 1 real script from the aidd Framework (NOT the /review script, NOT productmanager). Pick something somewhat simple, but complex enough to be actually representative of potential end user prompts.
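
A minimal sketch of that slug idea, assuming the init({ length }) helper exported by @paralleldrive/cuid2 (the dependency this PR already adds); the directory-name format is illustrative:

import { init } from '@paralleldrive/cuid2';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// 5-character slug generator (cuid2 emits lowercase letters and digits).
const createSlug = init({ length: 5 });

const testDir = join(tmpdir(), `riteway-test-${createSlug()}`);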

CRITICAL

Unit tests need to run in isolation, and the AI agent needs to judge the results. We need the agent to intelligently extract each individual test so that the subagents run them in isolation from each other; otherwise, the AI attention mechanism creates shared mutable state between tests and they don't run in proper isolation.

Proposal: we write the test scripts in SudoLang and have a pre-processing agent extract each test in isolation from the rest of the tests, inserting the prompt under test at the top and appending JUST ONE assertion per call. The assembled prompt is then sent to the sub-agent.

Sample code thinking:

Ideal test script looks like this:

import @promptUnderTest

# Sample AI Test

Test the basic math capability of the AI.

userPrompt = """
  What is 2 + 2? Please respond with JSON in this format:
  
      {
        "passed": true,
        "output": "2 + 2 = 4"
      }
"""

// In simple cases, we can just pass requirements like this:
- Given a simple addition prompt, should correctly add 2 + 2
- Given the result, should output the number 4
- Given the response format, should provide a JSON response with "passed" and "output" fields

// If you want to customize anything, you can do this:
assert(
  userPrompt: userPrompt2
  should: return the correct square root
)

Ideal Transformation

We'll need the LLM to call a sub-agent to extract tests, which would produce this shape of ideal transformation on a per-test basis:

promptUnderTest = <contents of prompt file>

userPrompt = """
  What is 2 + 2? Please respond with JSON in this format:
      
        {
           "passed": true,
            "output": "2 + 2 = 4"
        }
"""

The output would be the AI agent's response to the user, which we can then score against the assertion.

SudoLang Assert Type

type assert = (userPrompt?, promptUnderTest, requirement?, given?, should?, actual?, expected?)

Isolation

In order to invoke each assertion in isolation, we need to extract these into smaller prompts which only contain the context needed for the assertion we care about in the moment:

export const runAITests = async ({
  filePath,
  runs = 4,
  threshold = 75,
  timeout = 300000,
  agentConfig = {
    command: 'claude',
    args: ['-p', '--output-format', 'json', '--no-session-persistence']
  }
}) => {
  // Split the context so each extracted prompt contains only what we need
  // for a single assertion.
  const tests = await asyncPipe(
    readTestFile,
    extractTests
  )(filePath);

  // Run each isolated prompt `runs` times in parallel.
  const responses = await Promise.all(tests.map((prompt) =>
    Promise.all(
      Array.from({ length: runs }, () => executeAgent({ agentConfig, prompt, timeout }))
    )
  ));

  // Process and score each response set for prompt coherence.
  const scores = await Promise.all(responses.map((responseSet) => {
    /* process and score all the responses for prompt coherence */
  }));

  return aggregateResults({ runResults: responses, scores, threshold, runs });
};
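
A hypothetical invocation of the sketch above (the file path is made up for illustration):

const results = await runAITests({
  filePath: 'examples/basic-math.test.sudo',
  runs: 4,
  threshold: 75
});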
