feat: add test-runner autonomous agent for E2E validation by sonesuke · Pull Request #37 · sonesuke/patent-kit

sonesuke · 2026-02-22T05:56:29Z

Description

This PR addresses the testing guidelines for Claude Skills by introducing an autonomous test-runner agent. The runner uses the devcontainer exec context to prompt Claude Code with E2E dummy queries.

Key Features

- Test-Runner Host Script: Added agents/test-runner/runner.sh.
- Agent Prompt: Provided prompt.txt that instructs the agent to read test cases, self-prompt as a user, check token outputs, and write results to e2e/reports/.
- E2E Dummy Test Cases: Created e2e/test_cases/01-targeting-trigger.md to validate that the targeting skill triggers correctly on explicit intents without overstepping into other phases.
- Documentation: Updated AGENTS.md.

- Add init_validation check to verify plugin, skill, and MCP server status
- Update search_patents_called to explicitly check search_patents and search_papers tools
- Change keywords_file_exists path to check data/target.jsonl instead
- Temporarily restrict runner.sh to execute only 01-targeting-trigger case

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Reorganize test cases to use skill/test-type subfolders (e.g., 01-targeting/triggering, 01-targeting/functional)
- Update runner.sh to support nested test structure (skill_dir → test_type_dir)
- Maintain triggering/functional test separation per Zenn article best practices

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- triggering: simple skill invocation check (no setup files)
- functional-no-spec: concept-interview → targeting workflow (specification.md created)
- functional-with-spec: complete targeting workflow with existing specification

Evaluation improvements:
- Add constitution_loaded check
- Add keywords_md_created check
- Explicit search_patents MCP tool invocation check
- Add noise_analysis_performed check
- Fix MCP server status check (connected vs running)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…ucture

- Move detailed instructions to references/instructions.md
- Extract examples to references/examples.md
- Extract troubleshooting to references/troubleshooting.md
- Move templates under references/templates/
- Keep SKILL.md for orchestration, state management, and transitions only
- Fix runner.sh to use container paths for setup directory copy
- Reorganize test cases into functional-no-spec, functional-with-spec, and triggering

This follows Claude's progressive disclosure principle:
- SKILL.md: Prerequisites, orchestration, state management
- references/instructions.md: Detailed process steps
- references/examples.md: Usage examples
- references/troubleshooting.md: Common issues and solutions

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…pt-interview, setup

- Move detailed instructions to references/instructions.md for all three skills
- Extract examples to references/examples.md
- Extract troubleshooting to references/troubleshooting.md
- Move templates to assets/templates/ (correct location per Agent Skills convention)
- Keep SKILL.md for orchestration, state management, and transitions only

Skill structure now follows progressive disclosure principle:
- SKILL.md: Prerequisites, orchestration, state transitions
- references/instructions.md: Detailed process steps
- references/examples.md: Usage examples
- references/troubleshooting.md: Common issues and solutions
- assets/templates/: Template files

Also add E2E tests for all three skills:
- 00-setup: triggering, functional tests
- 00-constitution: triggering, functional tests
- 00-concept-interview: triggering, functional-no-spec, functional-with-spec tests

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Fix jq filters to match actual log structure (message.content[] array)
- Add references_instructions_read check to functional tests
- Update test-prompt.md files to be more specific and avoid interactive questions
- Add NOTE to concept-interview: proceed with available info without asking
- Fix specification template reference path in test data

All tests now passing:
- 00-constitution: triggering, functional ✅
- 00-setup: triggering, functional ✅
- 00-concept-interview: triggering, functional-no-spec, functional-with-spec ✅

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Convert all evaluation.json to evaluation.toml for better readability
- Update runner.sh to use yq instead of jq for reading TOML files
- TOML format provides cleaner syntax for test case definitions
- yq supports both TOML input/output and provides same query capabilities as jq

Test results with new TOML format:
- 01-targeting: 2/3 passing (triggering, functional-no-spec ✅)
- functional-with-spec has pre-existing issues unrelated to TOML conversion

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Reorganize test cases from nested `e2e/test_cases/*/subdir/` to flat `cases/*/*.toml`
- Unify `test-prompt.md` + `evaluation.toml` + `setup/` into single `test.toml` file
- Remove numbering prefixes from skill directories (00-, 01-)
- Extract test setup and check logic from runner.sh into modular tools:
  - `tools/test-setup.sh`: workspace setup with detailed output
  - `tools/test-check.sh`: evaluation checks with result display
- Clarify responsibilities:
  - `runner.sh`: process orchestration only
  - `test-setup.sh`: setup handling and display
  - `test-check.sh`: evaluation and result display

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Update runner.sh to output reports to out/ directory
- Update .gitignore: e2e/reports → out/
- Remove empty e2e/ directory with old reports

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…ools

- Extract summary report generation to tools/test-summary.sh
- Remove unused progress tracking tools from test-runner:
  - load-progress.sh
  - record-progress.sh
- Update runner.sh to delegate summary to test-summary.sh
- Clarify tool responsibilities:
  - test-setup.sh: workspace setup
  - test-check.sh: evaluation checks
  - test-summary.sh: report generation and display
  - runner.sh: process orchestration only

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…ucture

- Change log file structure from hierarchical to flat:
  - Old: out/<report_id>/<skill>/<test>/trial-1.log
  - New: out/<report_id>/<test>-1.log
- Collect trial statistics (duration, tokens) in runner.sh
- Move average calculation from runner.sh to test-summary.sh
- Add per-test statistics to summary report:
  - Average duration per test
  - Average input/output tokens per test
- Clarify responsibilities:
  - runner.sh: collect raw trial data
  - test-summary.sh: calculate averages and display

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add optional 3rd argument TARGET_TEST to specify test file name
- Usage: runner.sh <n_trials> <skill> <test_name>
- Example: runner.sh 1 concept-interview functional-with-spec
- Update help banner to show target filters when specified

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…tching

- Change usage: runner.sh <n_trials> [skill] [test] → runner.sh <n_trials> [pattern]
- Support glob patterns to match test files:
  - cases/*/*.toml - all tests (default)
  - cases/c*/*.toml - skills starting with 'c'
  - cases/concept-interview/*.toml - all concept-interview tests
  - cases/concept-interview/func*.toml - tests starting with 'func'
  - cases/concept-interview/functional-with-spec.toml - specific test
- Simplify filtering logic by collecting matching files upfront
- Update banner to show pattern instead of skill/test

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
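
The "collect matching files upfront" step could look roughly like the sketch below. It is illustrative only and not taken from runner.sh: the function name `collect_tests` is hypothetical, and only the default pattern `cases/*/*.toml` comes from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the PR's code): list test files matching a glob pattern.
# The default pattern is the one named in the commit message.
collect_tests() {
  local pattern="${1:-cases/*/*.toml}"
  local f
  # Leaving $pattern unquoted lets the shell expand the glob it contains.
  # Caveat: this also word-splits, so paths with spaces are not supported here.
  for f in $pattern; do
    # When nothing matches, the loop sees the literal pattern; skip it.
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
  return 0
}
```

Called with no argument it lists every test; called with a quoted pattern such as `'cases/c*/*.toml'` it narrows the set. Quoting the pattern at the call site matters, since otherwise the invoking shell expands it before the function sees it.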

The test-summary.sh was unexpectedly complex and expected additional arguments that runner.sh doesn't provide. Simplified to match the actual interface.

- Restore average duration calculation per test case
- Restore average input/output tokens calculation per test case
- Add conditional display of test results section
- Match expected interface with runner.sh (receives TEST_RESULTS array)

Fixes regression where statistics were accidentally simplified.

- Change from complex log parsing to simple result files
- Create .results file per test case with trial data
- Format: "passed|duration|input_tokens|output_tokens" per line
- test-summary.sh reads .results files and calculates averages
- Fix bash compatibility issues with glob patterns
- Remove unused ALL_LOG_FILES array from runner.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
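
As a rough illustration of how averages can be derived from such a file, here is a minimal sketch. It is not the actual test-summary.sh: the function name `summarize_results` is hypothetical, and it assumes the first field is the literal string `true` for a passing trial, which the commit message does not specify.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the PR's test-summary.sh): summarize one .results file.
# Assumed line format: passed|duration|input_tokens|output_tokens
# Assumption: the first field is the literal "true" when the trial passed.
summarize_results() {
  awk -F'|' '
    NF == 4 {
      n++
      if ($1 == "true") pass++
      dur += $2; tin += $3; tout += $4
    }
    END {
      if (n == 0) { print "no trials"; exit }
      printf "trials=%d passed=%d avg_duration=%.1f avg_input=%.0f avg_output=%.0f\n", n, pass, dur / n, tin / n, tout / n
    }
  ' "$1"
}
```

Keeping one line per trial means the summary step needs no knowledge of the log format; it only has to split on `|` and accumulate.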

- Create skill-specific directories (e.g., constitution/, concept-interview/) under report folder
- Store log files and result files in skill subdirectories
- Place summary.md in skill directory when testing single skill, or in root when testing multiple skills
- Update test-summary.sh to recursively find .results files in subdirectories

Directory structure example:

    out/20260223_142007/
    └── constitution/
        ├── functional-1.log
        ├── functional.results
        ├── triggering-1.log
        ├── triggering.results
        └── summary.md

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add specification_template_read check to functional-no-spec.toml
- Fix template paths in concept-interview instructions (assets/templates/ → assets/)
- Fix template paths in targeting instructions (assets/templates/ → assets/)

This ensures AI reads templates from correct paths and tests verify template files are accessed.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add MCP Server Errors section to concept-interview/troubleshooting.md
- Add MCP Server Errors section to targeting/troubleshooting.md
- Update instructions to check for isError: true and refer to troubleshooting

This ensures skills properly handle MCP tool failures instead of proceeding with fabricated data.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Setting CI=1 environment variable makes google-patent-cli automatically add required Chrome flags (--no-sandbox, --disable-setuid-sandbox, --disable-gpu) for Docker/DevContainer environments. See: https://github.com/sonesuke/google-patent-cli#configuration-file Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add targeting_template_read and keywords_template_read checks
- Fix MCP tool name pattern (mcp__google_patent_mcp__ → google-patent-cli__)

This ensures template files are properly read during targeting phase.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add google_patent_mcp_succeeded check to targeting/functional-with-spec
- Add google_patent_mcp_succeeded check to targeting/functional-no-spec
- Add google_patent_mcp_succeeded_if_called check to concept-interview/functional-with-spec (conditional check)

This ensures MCP tool failures are detected across all test scenarios.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Remove CI=1 environment variable (which skips post-create setup) and instead configure Chrome args via config files:
- google-patent-cli: ~/.config/google-patent-cli/config.toml
- arxiv-cli: ~/.config/arxiv-cli/config.toml

This ensures MCP tools work in Docker without blocking development setup.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Move detailed operational rules from constitution to each skill's instructions:

**Constitution Changes:**
- Remove "X. Output Management" (now in each skill)
- Remove "Templates" rule (now in each skill)
- Remove "Unified Search Scope" (moved to targeting)
- Remove "User Hearing" (moved to evaluating/claim-analyzing)
- Renumber sections (I-VIII instead of I-X)

**Skill-Specific Additions:**
- Output Management sections in targeting, screening, evaluating, claim-analyzing, prior-art-researching
- Template Adherence sections in all template-using skills
- Unified Search Scope in targeting
- User Interview requirements in evaluating and claim-analyzing

**Benefits:**
- Constitution focuses on core principles only
- Each skill defines its own file paths and requirements
- Easier to maintain and update skill-specific rules
- Reduced coupling between constitution and individual skills

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

**Test Runner Fix:**
- Remove `set -e` from runner.sh to allow all tests to run even when some fail
- Previously, test runner would exit on first failure
- Now all tests execute and summary report is generated

**Legal-Checking Test Fixes:**
- Update jq patterns to search for "legal-checking" instead of "legal-checker"
- Update description to improve auto-triggering with keywords like "review", "compliance"
- All 3 legal-checking tests now pass

**Test Results:**
- ✅ functional (21s)
- ✅ triggering (39s)
- ✅ functional-file-review (55s)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
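
One way to get the "run everything, then summarize" behavior is to record each test's exit status explicitly instead of relying on `set -e`. The sketch below shows that general pattern; it is not runner.sh itself, and `run_all` and its arguments are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the PR's runner.sh): run every command passed in,
# count failures, and report the totals at the end.
run_all() {
  local failed=0 total=0 t
  for t in "$@"; do
    total=$((total + 1))
    # Testing the status in an `if` keeps a failure from aborting the loop.
    if ! "$t"; then
      failed=$((failed + 1))
    fi
  done
  echo "ran=$total failed=$failed"
  # Exit status reflects the overall result, so callers can still detect failure.
  [ "$failed" -eq 0 ]
}
```

With this shape the summary is always printed, while the function's exit status still tells a caller such as CI whether any test failed.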

**Concept-Interviewing Fixes:**
- Update constitution skill reference to constitution-reminding
- Remove CLI-style "Arguments: --assignee" syntax from instructions
- Add jq-based histogram generation for assignee name verification
- Process: Extract assignee names from JSON → group_by → count → sort → top 100

**Targeting Fixes:**
- Remove CLI-style "Arguments: --query" syntax from instructions
- Update to proper MCP tool parameter format

**Benefits:**
- Clear separation between CLI commands and MCP tool usage
- Proper JSON processing using jq for assignee analysis
- Consistent documentation across all skills

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Fix init_validation to properly check array of system entries
- Update concept_interview_invoked to search for 'concept-interviewing'
- All 3 concept-interviewing tests now pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Change 'to the `--query`' to 'to the query parameter'
- Consistent with MCP tool parameter naming

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Remove auto-execution of concept-interviewing from targeting
- If specification.md is missing, notify user to run concept-interviewing first
- Proper workflow separation: concept-interviewing → targeting

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Targeting should call concept-interviewing if specification.md is missing
- Update skill name reference

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Fix init_validation to properly check array of system entries
- Fix targeting_skill_invoked to check array and use length > 0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add check-skill-invoked.sh to check if a skill was invoked
- Update all triggering tests to use script type instead of jq patterns
- Makes tests more maintainable and easier to read

Updated tests:
- concept-interviewing/triggering
- constitution-reminding/triggering
- legal-checking/triggering
- targeting/triggering

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Fix check-skill-invoked.sh to work with JSONL log format
- Update all functional test cases to use script-based checks instead of jq patterns
- Update skill names to use gerund form (constitution-reminding, concept-interviewing)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The previous jq-based approach was not working correctly with the JSONL log format. This change uses simple grep to check if both "Skill" tool and the specific skill name are present in the log. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Convert TEST_FILE to absolute path when passing to test-check.sh
- Fix grep pattern to match JSON format "skill":"patent-kit:<name>"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Use proper quote escaping to match JSON format "skill":"patent-kit:<name>" Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Arguments are passed as: $1=skill_name, $2=log_file from test-check.sh Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
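
Taken together, the grep-based check and the stated argument order suggest something like the following sketch. It is a hypothetical re-creation rather than the actual check-skill-invoked.sh, and it assumes the log contains the key and value with no whitespace between them, as in the quoted pattern.

```shell
#!/usr/bin/env bash
# Hypothetical re-creation (not the PR's check-skill-invoked.sh).
# $1 = skill name, $2 = JSONL log file (order as stated in the commit message).
check_skill_invoked() {
  local skill_name="$1" log_file="$2"
  # -F matches a fixed string, so the quotes and colons need no regex escaping;
  # -q suppresses output and reports the result through the exit status.
  grep -qF "\"skill\":\"patent-kit:${skill_name}\"" "$log_file"
}
```

Including the closing quote in the pattern keeps `targeting` from matching a longer skill name that merely starts with the same text.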

Ensure script commands are executed with correct path relative to tools/ directory. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…tion

- Use realpath to get absolute path for test-check.sh
- Add fallback for SCRIPT_DIR detection in test-check.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Read timeout from test case toml file (default 300s)
- Use gtimeout or timeout command to limit trial execution time
- Implement manual timeout check in wait loop as fallback

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
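
The gtimeout/timeout selection can be sketched as below. This is an illustration under assumptions, not the code in runner.sh: `run_with_timeout` is a hypothetical name, and the manual wait-loop fallback from the commit message is left out for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the PR's runner.sh): run a command under a time limit,
# preferring gtimeout (the name GNU coreutils uses when installed via Homebrew
# on macOS) and falling back to timeout.
run_with_timeout() {
  local secs="${1:-300}"   # 300s is the default stated in the commit message
  shift
  if command -v gtimeout >/dev/null 2>&1; then
    gtimeout "$secs" "$@"
  elif command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
  else
    # Neither tool is available: run unbounded and leave enforcement to the caller.
    "$@"
  fi
}
```

GNU `timeout` passes through the command's own exit status and exits with 124 when the limit is hit, which lets a caller tell a timed-out trial apart from an ordinary failure.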

- functional-with-spec: Check that target.jsonl is NOT created (no CSV)
- functional-with-data: New test with pre-loaded CSV data to verify merge

This aligns with the targeting workflow where users manually download CSV files.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The previous 300s timeout was too short for functional tests which take 350-400s. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Change specification to LLM-based chatbot to match CSV patent data
- Add explicit constitution-reminding loading instruction to prompt

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

sonesuke and others added 30 commits February 22, 2026 14:56

feat: add autonomous test-runner agent for E2E validation

970d681

test(test-runner): add progress tools, specification fixture, and functional E2E test

48d8d43

test: refine test strategy to include n-trials, mock interactions, and objective evaluation

ad81717

chore: ignore test runner reports directory

9ec80ac

test: refactor prompt.txt to use sub-agent architecture

7be39e6

test: implement true parallel execution and token tracking for sub-agents

1938ce2

feat(e2e): improve test runner, evaluation logic and devcontainer setup

e50cb7a

refactor(e2e): change report output directory from e2e/reports to out

1e05a61

fix(test-runner): correct test-summary.sh to match runner.sh interface

ceab820

sonesuke and others added 28 commits February 24, 2026 00:11

fix: remove CLI-era --query syntax from targeting instructions

7770d51

fix: update concept-interview to concept-interviewing in targeting

185c23e

fix: update targeting test jq patterns

fd1ff14

fix: use absolute path for test file and improve grep pattern

1e28149

fix: correct grep pattern for skill invocation check

a8d18f4

fix: correct argument order in check-skill-invoked.sh

02c11b8

fix: prepend ./ to script commands in test-check.sh

ded713c

fix: use absolute path for test-check.sh and improve SCRIPT_DIR detection

ecaad0e

feat: add timeout support to test runner

9a19614

fix: increase targeting test timeout to 600 seconds

e58d56e

fix: align specification with CSV data and improve prompt

b73ca5d

fix: change competitors to Google, Microsoft, OpenAI

5592d47

refactor: simplify functional-with-data test to only check merge execution

d1db8e5

fix: add Target Release Date and Cutoff Date to specification

debd02b

refactor: remove merge_executed check, only verify target.jsonl exists

4291849

fix: explicitly instruct to run merge.sh and skip search steps

1913351

fix: detect existing CSV files and run merge.sh directly

d4722da

fix: explicitly instruct to run merge.sh immediately when CSV files exist

ba17357

fix: update targeting skill to trigger when CSV files are detected

6d646bd

sonesuke merged commit acddd26 into main Feb 23, 2026
3 checks passed

sonesuke deleted the feature/test-runner branch February 23, 2026 15:13

sonesuke merged 77 commits into main from feature/test-runner
