AI-powered autonomous agent that explores web applications to discover bugs, UX issues, and accessibility problems using LLM-based reasoning and custom browser automation tools.
- Quick Start
- Architecture Overview
- Design Decisions
- Features
- Usage Examples
- Output & Reports
- Known Limitations
- Testing
- Node.js 18+
- OpenAI API key or Anthropic API key
# Clone repository
git clone <repository-url>
cd exploratory-testing-agent
# Install dependencies
npm install
# Configure environment
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY or ANTHROPIC_API_KEY
# Interactive mode with checkpoints (default)
npm run dev -- explore --url https://with-bugs.practicesoftwaretesting.com
# Headless mode with limited actions
npm run dev -- explore --url <target-url> --headless --actions 20
# Quick mode (skips LLM evaluation for faster testing)
npm run dev -- explore --url <target-url> --quick
# Full options
npm run dev -- explore --help
# Options:
# --url <url> Target URL to explore (required)
# --actions <number> Maximum actions to perform (default: 50)
# --checkpoint <number> Actions between checkpoints (default: 10)
# --headless Run browser in headless mode
# --quick Skip LLM evaluation (faster)
# --interactive        Enable checkpoint prompts (default: true)
The agent can also run as an MCP (Model Context Protocol) server, allowing external LLMs like Claude or GPT-4 to invoke it as a tool.
# Run MCP server in development
npm run mcp:dev
# Run MCP server in production
npm run build && npm run mcp:start
What is MCP? MCP is an open standard by Anthropic that allows AI applications to expose tools and services in a standardized way. External LLMs can then use these tools through a unified protocol.
Integration Example (Claude Desktop):
Add to claude_desktop_config.json:
{
"mcpServers": {
"exploratory-testing": {
"command": "node",
"args": ["/path/to/exploratory-testing-agent/dist/mcp/index.js"],
"env": {
"OPENAI_API_KEY": "your_key_here"
}
}
}
}
Then ask Claude: "Test https://example.com for bugs and report findings"
Claude will invoke the explore_website tool with parameters like URL, max actions, and guidance, then return a complete testing report with findings categorized by severity.
Available Tool:
explore_website(url, maxActions?, guidance?, ...) - Performs autonomous testing and returns structured findings
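For orientation, here is a minimal sketch of how such a tool can be registered with the MCP TypeScript SDK over stdio. The runExploration entry point is hypothetical, and the project's actual server (compiled to dist/mcp/index.js above) may be wired differently:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical entry point standing in for the real exploration run.
declare function runExploration(url: string, maxActions: number, guidance?: string): Promise<unknown>;

const server = new McpServer({ name: "exploratory-testing", version: "1.0.0" });

// Parameters are declared as a Zod shape; the SDK validates incoming calls against it.
server.tool(
  "explore_website",
  {
    url: z.string().url(),
    maxActions: z.number().int().positive().optional(),
    guidance: z.string().optional(),
  },
  async ({ url, maxActions, guidance }) => {
    const report = await runExploration(url, maxActions ?? 50, guidance);
    return { content: [{ type: "text" as const, text: JSON.stringify(report, null, 2) }] };
  }
);

// stdio transport lets clients like Claude Desktop spawn the server as a child process.
await server.connect(new StdioServerTransport());
```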
┌─────────────────────────────────────────────────────────────┐
│ LangGraph Workflow │
├─────────────────────────────────────────────────────────────┤
│ │
│ Analyze → Decide → Execute → Evaluate → (loop/checkpoint) │
│ ↓ ↓ ↓ ↓ │
│ LLM LLM Tools LLM │
│ │
└─────────────────────────────────────────────────────────────┘
Services Layer:
├── llm/
│ ├── analyzer-service.ts # Page analysis with fallback
│ ├── decision-service.ts # Action selection with fallback
│ ├── evaluator-service.ts # Finding detection
│ └── schemas/ # Zod validation schemas
├── browser/
│ └── browser-service.ts # Playwright abstraction
└── tools/
├── broken-images.ts # Custom broken image detector
├── navigation.ts # Navigate, back, etc.
├── interaction.ts # Click, fill, submit
└── extraction.ts # Screenshot, extract data
1. LangGraph State Machine (see the wiring sketch after this list)
- Analyze Node: Extracts page content, identifies interactive elements, generates test hypotheses
- Decide Node: Selects next action based on analysis and exploration history
- Execute Node: Performs browser actions (click, fill, navigate, etc.)
- Evaluate Node: Detects bugs, validates hypotheses, generates findings
2. LLM Services (SOLID Refactored)
- Each node delegates LLM communication to dedicated services
- Services handle prompt construction, response parsing, Zod validation, and fallback strategies
- Enables unit testing and reusability
3. Custom Tools
- Broken Images Tool: Detects images failing to load (HTTP errors, network failures, zero dimensions, etc.)
- Integrated into exploration flow as high-priority action
- Returns structured output with severity levels
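To make the state machine concrete, here is a minimal wiring sketch in LangGraph's TypeScript API. Node bodies, the state shape, and the termination condition are placeholders, not the project's actual logic:

```typescript
import { StateGraph, Annotation, START, END } from "@langchain/langgraph";

// Shared state: findings accumulate through an append-only reducer channel.
const AgentState = Annotation.Root({
  url: Annotation<string>(),
  actionCount: Annotation<number>({ reducer: (_prev, next) => next, default: () => 0 }),
  findings: Annotation<string[]>({ reducer: (prev, next) => prev.concat(next), default: () => [] }),
});

// Placeholder node bodies; the real implementations live in src/agent/nodes/.
const analyze = async (_s: typeof AgentState.State) => ({});
const decide = async (_s: typeof AgentState.State) => ({});
const execute = async (s: typeof AgentState.State) => ({ actionCount: s.actionCount + 1 });
const evaluate = async (_s: typeof AgentState.State) => ({ findings: [] as string[] });

const builder = new StateGraph(AgentState)
  .addNode("analyze", analyze)
  .addNode("decide", decide)
  .addNode("execute", execute)
  .addNode("evaluate", evaluate)
  .addEdge(START, "analyze")
  .addEdge("analyze", "decide")
  .addEdge("decide", "execute")
  .addEdge("execute", "evaluate")
  // Loop back for more exploration or finish, mirroring the diagram above.
  .addConditionalEdges("evaluate", (s) => (s.actionCount < 50 ? "analyze" : END));

const app = builder.compile();
```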
Decision: Use LangGraph for agent orchestration
Rationale:
- State Management: Built-in state channels with reducers for append-only arrays
- Checkpointing: Native support for pause/resume, critical for long explorations (see the checkpointer sketch below)
- Conditional Routing: Clean decision logic (explore more vs. finish)
- Human-in-the-Loop: Easy checkpoint prompts for user guidance
Trade-off: Added LangGraph dependency, but gained production-ready orchestration
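The checkpointing point above comes down to compiling the graph with a checkpointer. A sketch using the in-memory MemorySaver (a persistent store would replace it in production):

```typescript
import { MemorySaver } from "@langchain/langgraph";

// `builder` is the un-compiled StateGraph from the wiring sketch above.
declare const builder: {
  compile(opts?: { checkpointer?: MemorySaver }): {
    invoke(input: unknown, config?: unknown): Promise<unknown>;
  };
};

// Compiling with a checkpointer persists state after every node,
// which is what makes pause/resume at checkpoints possible.
const app = builder.compile({ checkpointer: new MemorySaver() });

// The thread_id names the exploration session so a later invoke can resume it.
await app.invoke(
  { url: "https://example.com" },
  { configurable: { thread_id: "session-1" } }
);
```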
Decision: Use OpenAI GPT-4o by default, with Anthropic Claude as option
Rationale:
- Reasoning Quality: GPT-4 excels at structured output and task decomposition
- JSON Mode: More reliable structured output for our schemas
- Cost: More economical for this use case vs Claude Opus
- API Stability: OpenAI API has been more stable in testing
Trade-off: Claude has better context window and safety, but GPT-4 worked better for our structured prompts
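As one illustration of what the JSON-mode point buys, assuming the services go through LangChain's ChatOpenAI (the schema and prompt here are illustrative, not the project's actual ones):

```typescript
import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";

// Illustrative finding shape; the real Zod schemas live in src/llm/schemas/.
const FindingSchema = z.object({
  type: z.enum(["bug", "ux_issue", "accessibility"]),
  severity: z.enum(["low", "medium", "high", "critical"]),
  title: z.string(),
  description: z.string(),
});

const llm = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });

// withStructuredOutput uses the provider's JSON/tool-calling mode and
// parses the response against the Zod schema before returning it.
const evaluator = llm.withStructuredOutput(FindingSchema);
const finding = await evaluator.invoke("Describe the most severe issue on this page: ...");
console.log(finding.severity, finding.title);
```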
Decision: Implement custom find_broken_images tool instead of relying on console errors
Rationale:
- Comprehensive Detection: Catches HTTP errors, network failures, empty src, zero dimensions, CORS issues
- Structured Output: Returns severity levels, selectors, and failure reasons
- Challenge Requirement: Demonstrates ability to create custom tools
- Real Value: Broken images are common bugs that generic tools miss
Implementation: Uses Playwright page evaluation to inspect all <img> elements and validate their loading state
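A condensed sketch of that idea follows. The real tool covers more failure modes (HTTP status codes, CORS), but the DOM-level check looks roughly like this:

```typescript
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");

// Ask the page which <img> elements failed: an image that finished loading
// but has zero natural width never rendered any pixels.
const broken = await page.evaluate(() =>
  Array.from(document.images)
    .filter((img) => !img.getAttribute("src") || (img.complete && img.naturalWidth === 0))
    .map((img) => ({
      src: img.src,
      alt: img.alt,
      reason: !img.getAttribute("src") ? "empty src" : "loaded with zero dimensions",
    }))
);

console.log(`${broken.length} broken image(s)`, broken);
await browser.close();
```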
Decision: Extract LLM logic from nodes into dedicated services (analyzer, decision, evaluator)
Rationale:
- SOLID Compliance: Single Responsibility Principle - nodes orchestrate, services handle LLM
- Testability: Services can be unit tested independently (see the sketch below)
- Reusability: Services can be used outside LangGraph if needed
- Maintainability: Smaller files (reduced from 400+ to ~150 lines per node)
Before/After:
- analyze.ts: 264 → ~130 lines
- decide.ts: 243 → ~120 lines
- evaluate.ts: 425 → ~190 lines
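In outline, the node/service split looks like this (interfaces and names are hypothetical; the real ones live in src/llm/):

```typescript
// Hypothetical shapes illustrating the separation of concerns.
interface PageAnalysis {
  elements: string[];
  hypotheses: string[];
}

interface AnalyzerService {
  analyze(html: string): Promise<PageAnalysis>;
}

// The node only orchestrates: gather input, delegate to the service,
// return a partial state update for LangGraph to merge.
async function analyzeNode(state: { html: string }, analyzer: AnalyzerService) {
  const analysis = await analyzer.analyze(state.html);
  return { analysis };
}

// In unit tests the service is trivially replaceable with a stub:
const stub: AnalyzerService = {
  analyze: async () => ({ elements: ["#login"], hypotheses: ["login form validates email"] }),
};
await analyzeNode({ html: "<form id=login></form>" }, stub);
```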
1. Complete Persistence Layer
- Status: Basic state serialization exists, but not full checkpoint resume
- Why: Time vs value - functional exploration works without it
- What it enables: Resume from exact checkpoint after crash
- Trade-off: Acceptable for MVP, would add for production
2. MCP Server ✅
- Status: Implemented (see MCP Server Mode)
- What it enables: External LLMs can invoke the agent as a tool via Model Context Protocol
- Usage: Integrate with Claude Desktop, custom MCP clients, or other AI applications
- Implementation: Exposes the explore_website tool via stdio transport
3. Extensive Test Coverage
- Status: Unit tests for broken-images tool only
- Why: "Quality over quantity" - focused on working implementation
- What we'd add: Tests for services (analyzer, decision, evaluator)
- Trade-off: Pragmatic choice given time constraints
- Analyzes page structure and identifies testable elements
- Generates hypotheses about application behavior
- Selects actions using LLM reasoning with heuristic fallback
- Adapts strategy based on findings
- Visual Issues: Broken images, layout problems
- Console Errors: JavaScript errors, network failures
- UX Issues: Confusing flows, missing feedback
- Edge Cases: Boundary conditions, validation failures
- Accessibility: Missing alt text, poor contrast (via LLM analysis)
- LLM-Based: Uses GPT-4 to select next action based on context
- Fallback Strategy: Heuristic selection when LLM fails (priority-based)
- Exploration vs Exploitation: Balances testing new areas vs deep-diving
- Recency Avoidance: Doesn't repeat same actions immediately
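A sketch of the LLM-plus-heuristic-fallback idea, with hypothetical types (the actual decision service is richer):

```typescript
type Action = { kind: "click" | "fill" | "navigate"; selector: string };

// Try the LLM first; on failure, fall back to a fixed priority order
// while avoiding selectors acted on recently. Assumes candidates is non-empty.
async function chooseAction(
  askLlm: () => Promise<Action>,
  candidates: Action[],
  recentSelectors: Set<string>
): Promise<Action> {
  try {
    return await askLlm();
  } catch {
    const priority = { fill: 0, click: 1, navigate: 2 } as const; // forms before links
    const fresh = candidates.filter((a) => !recentSelectors.has(a.selector));
    const pool = fresh.length > 0 ? fresh : candidates;
    return [...pool].sort((a, b) => priority[a.kind] - priority[b.kind])[0];
  }
}
```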
- Screenshots on every action
- Console error logs
- Network request monitoring
- Structured finding reports (JSON + Markdown)
- Pause exploration at intervals (default: every 10 actions)
- User can review findings and decide to continue or stop
- Graceful shutdown on Ctrl+C
# Explore application with default settings
npm run dev -- explore --url https://example.com
# Run in headless mode for CI pipeline
npm run dev -- explore \
--url https://staging.example.com \
--actions 30 \
--headless \
--checkpoint 15
# Fast exploration without LLM evaluation
npm run dev -- explore \
--url https://example.com \
--quick \
--actions 10
# Explore with frequent checkpoints for guidance
npm run dev -- explore \
--url https://example.com \
--checkpoint 5 \
--interactive
./
├── reports/
│ ├── session-<timestamp>.json # Structured findings
│ └── session-<timestamp>.md # Human-readable report
├── screenshots/
│ └── action-<timestamp>.png # Evidence screenshots
└── logs/
└── session-<timestamp>.log # Detailed execution logs
JSON Report (reports/session-*.json):
{
"sessionId": "session-1234567890",
"startTime": "2024-01-06T12:00:00Z",
"findings": [
{
"type": "broken_image",
"severity": "high",
"title": "5 broken images found",
"description": "...",
"steps": ["Navigate to homepage", "Check images"],
"evidence": [...]
}
],
"summary": { ... }
}
Markdown Report (reports/session-*.md):
# Exploratory Testing Report
## Session Information
- Session ID: session-1234567890
- Target URL: https://example.com
- Actions Performed: 23
- Findings: 8
## Critical Findings
### 🔴 Broken Images (High Severity)
Found 5 broken images on the homepage...
Limitation: Better suited for traditional multi-page apps than SPAs with client-side routing
Impact: May miss state changes in SPAs that don't trigger URL changes
Workaround: Agent still works but may need more actions to explore full SPA state
Limitation: No built-in support for login flows (credentials, OAuth, etc.)
Impact: Can only test public pages
Workaround: Manually navigate to authenticated state, then start agent
Limitation: LLM may occasionally suggest invalid selectors or actions
Impact: Action fails, agent continues with fallback
Mitigation: Fallback strategies ensure exploration continues even with LLM failures
Limitation: LLM API calls cost money (GPT-4: ~$0.01-0.03 per action)
Impact: 50-action exploration costs ~$0.50-1.50
Mitigation: Use --quick mode to skip evaluation, or use cheaper models
Limitation: LLM calls are slow (~2-5 seconds per action)
Impact: A 50-action exploration takes 5-10 minutes
Mitigation: Acceptable for exploratory testing; use --actions to limit scope
# Run all tests
npm test
# Run specific test
npm test -- broken-images
# Watch mode
npm test -- --watch
# Coverage report
npm run test:coverage
Implemented:
- ✅ tools/broken-images.test.ts - Comprehensive unit tests for broken image detector
Planned:
- llm/analyzer-service.test.ts - Test page analysis logic
- llm/decision-service.test.ts - Test action selection
- llm/evaluator-service.test.ts - Test finding detection
- agent/graph.test.ts - Integration test for full workflow
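A test in this style might look like the following, assuming a Vitest-style runner and a findBrokenImages export (both are assumptions, not the project's confirmed API):

```typescript
import { describe, expect, it } from "vitest";
import { findBrokenImages } from "../src/tools/broken-images"; // hypothetical export name

describe("findBrokenImages", () => {
  it("reports an image that loaded with zero dimensions", async () => {
    // Minimal stand-in for the Playwright page: only evaluate() is exercised.
    const page = {
      evaluate: async () => [
        { src: "https://example.com/missing.png", alt: "", reason: "loaded with zero dimensions" },
      ],
    };
    const findings = await findBrokenImages(page as any);
    expect(findings).toHaveLength(1);
    expect(findings[0].src).toContain("missing.png");
  });
});
```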
npm run build # Compile TypeScript
npm run typecheck    # Type checking only
npm run lint
src/
├── agent/
│ ├── graph.ts # LangGraph workflow
│ └── nodes/ # Analyze, Decide, Execute, Evaluate
├── browser/
│ └── browser-service.ts # Playwright abstraction
├── cli/
│ └── explore.ts # CLI interface
├── llm/
│ ├── *-service.ts # LLM communication services
│ ├── prompts/ # Prompt templates
│ └── schemas/ # Zod validation schemas
├── tools/
│ └── *.ts # Browser automation tools
└── types/
└── state.ts # TypeScript types
Built for an exploratory testing challenge using:
- LangGraph for agent orchestration
- Playwright for browser automation
- OpenAI GPT-4 for reasoning
- TypeScript for type safety
- Zod for runtime validation
MIT