feat: add test-runner autonomous agent for E2E validation #37

Merged
sonesuke merged 77 commits into main from feature/test-runner on Feb 23, 2026
@sonesuke
Owner

Description

This PR addresses the testing guidelines for Claude Skills by introducing an autonomous test-runner agent. The runner uses the devcontainer exec context to prompt Claude Code with E2E dummy queries.

Key Features

  • Test-Runner Host Script: Added agents/test-runner/runner.sh.
  • Agent Prompt: Provided prompt.txt that instructs the agent to read test cases, self-prompt as a user, check token outputs, and write results to e2e/reports/.
  • E2E Dummy Test Cases: Created e2e/test_cases/01-targeting-trigger.md to validate that the targeting skill triggers correctly on explicit intents without overstepping into other phases.
  • Documentation: Updated AGENTS.md.

sonesuke and others added 30 commits February 22, 2026 14:56
- Add init_validation check to verify plugin, skill, and MCP server status
- Update search_patents_called to explicitly check search_patents and search_papers tools
- Change keywords_file_exists path to check data/target.jsonl instead
- Temporarily restrict runner.sh to execute only 01-targeting-trigger case

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Reorganize test cases to use skill/test-type subfolders (e.g., 01-targeting/triggering, 01-targeting/functional)
- Update runner.sh to support nested test structure (skill_dir → test_type_dir)
- Maintain triggering/functional test separation per Zenn article best practices

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- triggering: simple skill invocation check (no setup files)
- functional-no-spec: concept-interview → targeting workflow (specification.md created)
- functional-with-spec: complete targeting workflow with existing specification

Evaluation improvements:
- Add constitution_loaded check
- Add keywords_md_created check
- Explicit search_patents MCP tool invocation check
- Add noise_analysis_performed check
- Fix MCP server status check (connected vs running)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ucture

- Move detailed instructions to references/instructions.md
- Extract examples to references/examples.md
- Extract troubleshooting to references/troubleshooting.md
- Move templates under references/templates/
- Keep SKILL.md for orchestration, state management, and transitions only
- Fix runner.sh to use container paths for setup directory copy
- Reorganize test cases into functional-no-spec, functional-with-spec, and triggering

This follows Claude's progressive disclosure principle:
- SKILL.md: Prerequisites, orchestration, state management
- references/instructions.md: Detailed process steps
- references/examples.md: Usage examples
- references/troubleshooting.md: Common issues and solutions

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…pt-interview, setup

- Move detailed instructions to references/instructions.md for all three skills
- Extract examples to references/examples.md
- Extract troubleshooting to references/troubleshooting.md
- Move templates to assets/templates/ (correct location per Agent Skills convention)
- Keep SKILL.md for orchestration, state management, and transitions only

Skill structure now follows progressive disclosure principle:
- SKILL.md: Prerequisites, orchestration, state transitions
- references/instructions.md: Detailed process steps
- references/examples.md: Usage examples
- references/troubleshooting.md: Common issues and solutions
- assets/templates/: Template files

Also add E2E tests for all three skills:
- 00-setup: triggering, functional tests
- 00-constitution: triggering, functional tests
- 00-concept-interview: triggering, functional-no-spec, functional-with-spec tests

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Fix jq filters to match actual log structure (message.content[] array)
- Add references_instructions_read check to functional tests
- Update test-prompt.md files to be more specific and avoid interactive questions
- Add NOTE to concept-interview: proceed with available info without asking
- Fix specification template reference path in test data

All tests now passing:
- 00-constitution: triggering, functional ✅
- 00-setup: triggering, functional ✅
- 00-concept-interview: triggering, functional-no-spec, functional-with-spec ✅

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Convert all evaluation.json to evaluation.toml for better readability
- Update runner.sh to use yq instead of jq for reading TOML files
- TOML format provides cleaner syntax for test case definitions
- yq supports both TOML input/output and provides same query capabilities as jq

Test results with new TOML format:
- 01-targeting: 2/3 passing (triggering, functional-no-spec ✅)
- functional-with-spec has pre-existing issues unrelated to TOML conversion

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Reorganize test cases from nested `e2e/test_cases/*/subdir/` to flat `cases/*/*.toml`
- Unify `test-prompt.md` + `evaluation.toml` + `setup/` into single `test.toml` file
- Remove numbering prefixes from skill directories (00-, 01-)
- Extract test setup and check logic from runner.sh into modular tools:
  - `tools/test-setup.sh`: workspace setup with detailed output
  - `tools/test-check.sh`: evaluation checks with result display
- Clarify responsibilities:
  - `runner.sh`: process orchestration only
  - `test-setup.sh`: setup handling and display
  - `test-check.sh`: evaluation and result display

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Update runner.sh to output reports to out/ directory
- Update .gitignore: e2e/reports → out/
- Remove empty e2e/ directory with old reports

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ools

- Extract summary report generation to tools/test-summary.sh
- Remove unused progress tracking tools from test-runner:
  - load-progress.sh
  - record-progress.sh
- Update runner.sh to delegate summary to test-summary.sh
- Clarify tool responsibilities:
  - test-setup.sh: workspace setup
  - test-check.sh: evaluation checks
  - test-summary.sh: report generation and display
  - runner.sh: process orchestration only

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ucture

- Change log file structure from hierarchical to flat:
  - Old: out/<report_id>/<skill>/<test>/trial-1.log
  - New: out/<report_id>/<test>-1.log
- Collect trial statistics (duration, tokens) in runner.sh
- Move average calculation from runner.sh to test-summary.sh
- Add per-test statistics to summary report:
  - Average duration per test
  - Average input/output tokens per test
- Clarify responsibilities:
  - runner.sh: collect raw trial data
  - test-summary.sh: calculate averages and display

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add optional 3rd argument TARGET_TEST to specify test file name
- Usage: runner.sh <n_trials> <skill> <test_name>
- Example: runner.sh 1 concept-interview functional-with-spec
- Update help banner to show target filters when specified

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…tching

- Change usage: runner.sh <n_trials> [skill] [test] → runner.sh <n_trials> [pattern]
- Support glob patterns to match test files:
  - cases/*/*.toml - all tests (default)
  - cases/c*/*.toml - skills starting with 'c'
  - cases/concept-interview/*.toml - all concept-interview tests
  - cases/concept-interview/func*.toml - tests starting with 'func'
  - cases/concept-interview/functional-with-spec.toml - specific test
- Simplify filtering logic by collecting matching files upfront
- Update banner to show pattern instead of skill/test

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
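
The pattern-based selection described in this commit could be sketched roughly as below. This is not the PR's runner.sh; `collect_tests` is a hypothetical helper name, and only the default pattern `cases/*/*.toml` comes from the commit message.

```shell
# Hypothetical helper: print every test file matching a glob pattern, one per line.
# The default pattern mirrors the "all tests" case from the commit message.
collect_tests() {
    pattern="${1:-cases/*/*.toml}"
    # Leave $pattern unquoted so the shell expands the glob.
    # (Assumes the pattern itself contains no whitespace.)
    for f in $pattern; do
        # An unmatched glob stays literal, so keep only real files.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Calling `collect_tests 'cases/concept-interview/func*.toml'` from the directory that holds `cases/` would list only the matching files. Collecting the matches up front keeps the later filtering logic trivial.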
test-summary.sh was unexpectedly complex and expected additional arguments
that runner.sh doesn't provide. It is now simplified to match the actual interface.
- Restore average duration calculation per test case
- Restore average input/output tokens calculation per test case
- Add conditional display of test results section
- Match expected interface with runner.sh (receives TEST_RESULTS array)

Fixes regression where statistics were accidentally simplified.
- Change from complex log parsing to simple result files
- Create .results file per test case with trial data
- Format: "passed|duration|input_tokens|output_tokens" per line
- test-summary.sh reads .results files and calculates averages
- Fix bash compatibility issues with glob patterns
- Remove unused ALL_LOG_FILES array from runner.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
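
As a rough illustration of the `.results` format above, the averages could be computed with a short awk call. This is a sketch, not the PR's test-summary.sh: `summarize_results` is a hypothetical name, and treating the first field as the literal string `true` for a passing trial is an assumption, since the commit only names the field `passed`.

```shell
# Hypothetical helper: summarize one .results file.
# Assumes each line is "passed|duration|input_tokens|output_tokens" and that
# the first field is the literal string "true" for a passing trial (an assumption).
summarize_results() {
    awk -F'|' '
        NF == 4 {
            n++
            if ($1 == "true") passed++
            dur += $2; tin += $3; tout += $4
        }
        END {
            if (n == 0) { print "no trials"; exit }
            printf "trials=%d passed=%d avg_duration=%.1f avg_input=%.0f avg_output=%.0f\n",
                   n, passed, dur / n, tin / n, tout / n
        }
    ' "$1"
}
```

One line per trial keeps the writer side in runner.sh to a single append, and the reader needs no log parsing at all.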
- Create skill-specific directories (e.g., constitution/, concept-interview/) under report folder
- Store log files and result files in skill subdirectories
- Place summary.md in skill directory when testing single skill, or in root when testing multiple skills
- Update test-summary.sh to recursively find .results files in subdirectories

Directory structure example:
  out/20260223_142007/
  └── constitution/
      ├── functional-1.log
      ├── functional.results
      ├── triggering-1.log
      ├── triggering.results
      └── summary.md

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add specification_template_read check to functional-no-spec.toml
- Fix template paths in concept-interview instructions (assets/templates/ → assets/)
- Fix template paths in targeting instructions (assets/templates/ → assets/)

This ensures the AI reads templates from the correct paths and that the tests verify the template files are actually accessed.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add MCP Server Errors section to concept-interview/troubleshooting.md
- Add MCP Server Errors section to targeting/troubleshooting.md
- Update instructions to check for isError: true and refer to troubleshooting

This ensures skills properly handle MCP tool failures instead of proceeding with fabricated data.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Setting CI=1 environment variable makes google-patent-cli automatically add required Chrome flags (--no-sandbox, --disable-setuid-sandbox, --disable-gpu) for Docker/DevContainer environments.

See: https://github.com/sonesuke/google-patent-cli#configuration-file

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add targeting_template_read and keywords_template_read checks
- Fix MCP tool name pattern (mcp__google_patent_mcp__ → google-patent-cli__)

This ensures template files are properly read during targeting phase.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add google_patent_mcp_succeeded check to targeting/functional-with-spec
- Add google_patent_mcp_succeeded check to targeting/functional-no-spec
- Add google_patent_mcp_succeeded_if_called check to concept-interview/functional-with-spec (conditional check)

This ensures MCP tool failures are detected across all test scenarios.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove CI=1 environment variable (which skips post-create setup) and instead configure Chrome args via config files:
- google-patent-cli: ~/.config/google-patent-cli/config.toml
- arxiv-cli: ~/.config/arxiv-cli/config.toml

This ensures MCP tools work in Docker without blocking development setup.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sonesuke and others added 28 commits February 24, 2026 00:11
Move detailed operational rules from constitution to each skill's instructions:

**Constitution Changes:**
- Remove "X. Output Management" (now in each skill)
- Remove "Templates" rule (now in each skill)
- Remove "Unified Search Scope" (moved to targeting)
- Remove "User Hearing" (moved to evaluating/claim-analyzing)
- Renumber sections (I-VIII instead of I-X)

**Skill-Specific Additions:**
- Output Management sections in targeting, screening, evaluating, claim-analyzing, prior-art-researching
- Template Adherence sections in all template-using skills
- Unified Search Scope in targeting
- User Interview requirements in evaluating and claim-analyzing

**Benefits:**
- Constitution focuses on core principles only
- Each skill defines its own file paths and requirements
- Easier to maintain and update skill-specific rules
- Reduced coupling between constitution and individual skills

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
**Test Runner Fix:**
- Remove `set -e` from runner.sh to allow all tests to run even when some fail
- Previously, test runner would exit on first failure
- Now all tests execute and summary report is generated

**Legal-Checking Test Fixes:**
- Update jq patterns to search for "legal-checking" instead of "legal-checker"
- Update description to improve auto-triggering with keywords like "review", "compliance"
- All 3 legal-checking tests now pass

**Test Results:**
- ✅ functional (21s)
- ✅ triggering (39s)
- ✅ functional-file-review (55s)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
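
The effect of dropping `set -e` can be illustrated with a loop that records failures instead of aborting on the first one. This is a minimal sketch; `run_all` and its command-string arguments are hypothetical and do not reflect how runner.sh actually launches trials.

```shell
# Hypothetical loop: run every test command, count failures, report at the end.
# Each failure is recorded and the loop continues to the next test.
run_all() {
    failed=0
    total=0
    for test_cmd in "$@"; do
        total=$((total + 1))
        if ! sh -c "$test_cmd"; then
            failed=$((failed + 1))
        fi
    done
    echo "ran=$total failed=$failed"
    # Exit status reflects the overall result, but only after all tests ran.
    [ "$failed" -eq 0 ]
}
```

Returning the aggregate status at the end still lets a CI job fail, while the summary report covers every test.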
**Concept-Interviewing Fixes:**
- Update constitution skill reference to constitution-reminding
- Remove CLI-style "Arguments: --assignee" syntax from instructions
- Add jq-based histogram generation for assignee name verification
- Process: Extract assignee names from JSON → group_by → count → sort → top 100

**Targeting Fixes:**
- Remove CLI-style "Arguments: --query" syntax from instructions
- Update to proper MCP tool parameter format

**Benefits:**
- Clear separation between CLI commands and MCP tool usage
- Proper JSON processing using jq for assignee analysis
- Consistent documentation across all skills

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
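
The histogram step named in this commit (extract, group_by, count, sort, top 100) might look like the following jq pipeline. The input shape is an assumption: a JSON array of objects with an `assignee` string field. The real search output may be structured differently, and `assignee_histogram` is a hypothetical name.

```shell
# Hypothetical helper: read a JSON array of patents on stdin and print the
# 100 most frequent assignee names with their counts.
# Assumes each element has an "assignee" string field (not confirmed by the PR).
assignee_histogram() {
    jq -c '[.[].assignee]
           | group_by(.)
           | map({assignee: .[0], count: length})
           | sort_by(-.count)
           | .[:100]'
}
```

Usage would be along the lines of `assignee_histogram < search-results.json`, where the file name is illustrative only.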
- Fix init_validation to properly check array of system entries
- Update concept_interview_invoked to search for 'concept-interviewing'
- All 3 concept-interviewing tests now pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change 'to the `--query`' to 'to the query parameter'
- Consistent with MCP tool parameter naming

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Remove auto-execution of concept-interviewing from targeting
- If specification.md is missing, notify user to run concept-interviewing first
- Proper workflow separation: concept-interviewing → targeting

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Targeting should call concept-interviewing if specification.md is missing
- Update skill name reference

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Fix init_validation to properly check array of system entries
- Fix targeting_skill_invoked to check array and use length > 0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add check-skill-invoked.sh to check if a skill was invoked
- Update all triggering tests to use script type instead of jq patterns
- Makes tests more maintainable and easier to read

Updated tests:
- concept-interviewing/triggering
- constitution-reminding/triggering
- legal-checking/triggering
- targeting/triggering

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Fix check-skill-invoked.sh to work with JSONL log format
- Update all functional test cases to use script-based checks instead of jq patterns
- Update skill names to use gerund form (constitution-reminding, concept-interviewing)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The previous jq-based approach was not working correctly with the JSONL log format.
This change uses simple grep to check if both "Skill" tool and the specific skill name are present in the log.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Convert TEST_FILE to absolute path when passing to test-check.sh
- Fix grep pattern to match JSON format "skill":"patent-kit:<name>"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Use proper quote escaping to match JSON format "skill":"patent-kit:<name>"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Arguments are passed as: $1=skill_name, $2=log_file from test-check.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
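
Taken together, the last few commits describe a grep-based check that receives the skill name and the log file and looks for the JSON fragment `"skill":"patent-kit:<name>"`. A minimal sketch of such a check follows. It is not the PR's check-skill-invoked.sh; the function name is hypothetical, and it assumes the log serializes that key and value with no whitespace around the colon.

```shell
# Hypothetical re-creation of a skill-invocation check; not the PR's actual script.
# Usage: check_skill_invoked <skill_name> <log_file>
check_skill_invoked() {
    skill_name="$1"
    log_file="$2"
    # -F matches the string literally, so quotes and colons need no regex escaping.
    # -q suppresses output; the exit status alone reports the result.
    grep -qF "\"skill\":\"patent-kit:${skill_name}\"" "$log_file"
}
```

A fixed-string match sidesteps the quote-escaping problems mentioned above, at the cost of being sensitive to formatting changes in the log.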
Ensure script commands are executed with correct path relative to tools/ directory.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…tion

- Use realpath to get absolute path for test-check.sh
- Add fallback for SCRIPT_DIR detection in test-check.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Read timeout from test case toml file (default 300s)
- Use gtimeout or timeout command to limit trial execution time
- Implement manual timeout check in wait loop as fallback

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
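
The three-tier timeout strategy in this commit (gtimeout, then timeout, then a manual wait loop) can be sketched as below. `run_with_timeout` is a hypothetical wrapper, and reading the limit from the test case TOML file is left out; the caller passes the number of seconds directly.

```shell
# Hypothetical wrapper: run a command under a time limit, preferring GNU timeout.
# Usage: run_with_timeout <seconds> <command> [args...]
run_with_timeout() {
    limit="$1"
    shift
    if command -v gtimeout >/dev/null 2>&1; then
        gtimeout "$limit" "$@"      # GNU coreutils as installed by Homebrew on macOS
    elif command -v timeout >/dev/null 2>&1; then
        timeout "$limit" "$@"       # GNU coreutils on Linux
    else
        # Fallback: background the command and poll once per second.
        "$@" &
        pid=$!
        elapsed=0
        while kill -0 "$pid" 2>/dev/null; do
            if [ "$elapsed" -ge "$limit" ]; then
                kill "$pid" 2>/dev/null
                wait "$pid" 2>/dev/null || true
                return 124          # same exit code GNU timeout uses on expiry
            fi
            sleep 1
            elapsed=$((elapsed + 1))
        done
        wait "$pid"
    fi
}
```

For example, `run_with_timeout 300 some-trial-command` would apply the 300-second default mentioned above, where `some-trial-command` stands in for whatever launches a trial.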
- functional-with-spec: Check that target.jsonl is NOT created (no CSV)
- functional-with-data: New test with pre-loaded CSV data to verify merge

This aligns with the targeting workflow where users manually download CSV files.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The previous 300s timeout was too short for functional tests, which take 350-400s.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change specification to LLM-based chatbot to match CSV patent data
- Add explicit constitution-reminding loading instruction to prompt

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@sonesuke sonesuke merged commit acddd26 into main Feb 23, 2026
3 checks passed
@sonesuke sonesuke deleted the feature/test-runner branch February 23, 2026 15:13