AI-powered autonomous agent that explores web applications to discover bugs, UX issues, and accessibility problems using LLM-based reasoning and custom browser automation tools.
- Quick Start
- Architecture Overview
- Design Decisions
- Features
- Usage Examples
- Output & Reports
- Known Limitations
- Testing
- Node.js 18+
- OpenAI API key or Anthropic API key
# Clone repository
git clone <repository-url>
cd exploratory-testing-agent
# Install dependencies
npm install
# Configure environment
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY or ANTHROPIC_API_KEY
# Interactive mode with checkpoints (default)
npm run dev -- explore --url https://with-bugs.practicesoftwaretesting.com
# Headless mode with limited actions
npm run dev -- explore --url <target-url> --headless --actions 20
# Quick mode (skips LLM evaluation for faster testing)
npm run dev -- explore --url <target-url> --quick
# Full options
npm run dev -- explore --help
# Options:
# --url <url> Target URL to explore (required)
# --actions <number> Maximum actions to perform (default: 50)
# --checkpoint <number> Actions between checkpoints (default: 10)
# --headless Run browser in headless mode
# --quick Skip LLM evaluation (faster)
# --interactive        Enable checkpoint prompts (default: true)
The agent can also run as an MCP (Model Context Protocol) server, allowing external LLMs like Claude or GPT-4 to invoke it as a tool.
# Run MCP server in development
npm run mcp:dev
# Run MCP server in production
npm run build && npm run mcp:start
What is MCP? MCP is an open standard by Anthropic that allows AI applications to expose tools and services in a standardized way. External LLMs can then use these tools through a unified protocol.
Integration Example (Claude Desktop):
Add to claude_desktop_config.json:
{
"mcpServers": {
"exploratory-testing": {
"command": "node",
"args": ["/path/to/exploratory-testing-agent/dist/mcp/index.js"],
"env": {
"OPENAI_API_KEY": "your_key_here"
}
}
}
}
Then ask Claude: "Test https://example.com for bugs and report findings"
Claude will invoke the explore_website tool with parameters like URL, max actions, and guidance, then return a complete testing report with findings categorized by severity.
Available Tool:
explore_website(url, maxActions?, guidance?, ...) - Performs autonomous testing and returns structured findings
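For orientation, here is a minimal sketch of how such a tool can be registered with the MCP TypeScript SDK over stdio. The runExploration entry point is hypothetical, and the project's actual server (compiled to dist/mcp/index.js above) may be wired differently:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical entry point standing in for the real exploration run.
declare function runExploration(url: string, maxActions: number, guidance?: string): Promise<unknown>;

const server = new McpServer({ name: "exploratory-testing", version: "1.0.0" });

// Parameters are declared as a Zod shape; the SDK validates incoming calls against it.
server.tool(
  "explore_website",
  {
    url: z.string().url(),
    maxActions: z.number().int().positive().optional(),
    guidance: z.string().optional(),
  },
  async ({ url, maxActions, guidance }) => {
    const report = await runExploration(url, maxActions ?? 50, guidance);
    return { content: [{ type: "text" as const, text: JSON.stringify(report, null, 2) }] };
  }
);

// stdio transport lets clients like Claude Desktop spawn the server as a child process.
await server.connect(new StdioServerTransport());
```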
┌─────────────────────────────────────────────────────────────┐
│ LangGraph Workflow │
├─────────────────────────────────────────────────────────────┤
│ │
│ Analyze → Decide → Execute → Evaluate → (loop/checkpoint) │
│ ↓ ↓ ↓ ↓ │
│ LLM LLM Tools LLM │
│ │
└─────────────────────────────────────────────────────────────┘
Services Layer:
├── llm/
│ ├── analyzer-service.ts # Page analysis with fallback
│ ├── decision-service.ts # Action selection with fallback
│ ├── evaluator-service.ts # Finding detection
│ └── schemas/ # Zod validation schemas
├── browser/
│ └── browser-service.ts # Playwright abstraction
└── tools/
├── broken-images.ts # Custom broken image detector
├── navigation.ts # Navigate, back, etc.
├── interaction.ts # Click, fill, submit
└── extraction.ts # Screenshot, extract data
1. LangGraph State Machine (see the wiring sketch after this list)
- Analyze Node: Extracts page content, identifies interactive elements, generates test hypotheses
- Decide Node: Selects next action based on analysis and exploration history
- Execute Node: Performs browser actions (click, fill, navigate, etc.)
- Evaluate Node: Detects bugs, validates hypotheses, generates findings
2. LLM Services (SOLID Refactored)
- Each node delegates LLM communication to dedicated services
- Services handle prompt construction, response parsing, Zod validation, and fallback strategies
- Enables unit testing and reusability
3. Custom Tools
- Broken Images Tool: Detects images failing to load (HTTP errors, network failures, zero dimensions, etc.)
- Integrated into exploration flow as high-priority action
- Returns structured output with severity levels
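To make the state machine concrete, here is a minimal wiring sketch in LangGraph's TypeScript API. Node bodies, the state shape, and the termination condition are placeholders, not the project's actual logic:

```typescript
import { StateGraph, Annotation, START, END } from "@langchain/langgraph";

// Shared state: findings accumulate through an append-only reducer channel.
const AgentState = Annotation.Root({
  url: Annotation<string>(),
  actionCount: Annotation<number>({ reducer: (_prev, next) => next, default: () => 0 }),
  findings: Annotation<string[]>({ reducer: (prev, next) => prev.concat(next), default: () => [] }),
});

// Placeholder node bodies; the real implementations live in src/agent/nodes/.
const analyze = async (_s: typeof AgentState.State) => ({});
const decide = async (_s: typeof AgentState.State) => ({});
const execute = async (s: typeof AgentState.State) => ({ actionCount: s.actionCount + 1 });
const evaluate = async (_s: typeof AgentState.State) => ({ findings: [] as string[] });

const builder = new StateGraph(AgentState)
  .addNode("analyze", analyze)
  .addNode("decide", decide)
  .addNode("execute", execute)
  .addNode("evaluate", evaluate)
  .addEdge(START, "analyze")
  .addEdge("analyze", "decide")
  .addEdge("decide", "execute")
  .addEdge("execute", "evaluate")
  // Loop back for more exploration or finish, mirroring the diagram above.
  .addConditionalEdges("evaluate", (s) => (s.actionCount < 50 ? "analyze" : END));

const app = builder.compile();
```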
Decision: Use LangGraph for agent orchestration
Rationale:
- State Management: Built-in state channels with reducers for append-only arrays
- Checkpointing: Native support for pause/resume, critical for long explorations (see the checkpointer sketch below)
- Conditional Routing: Clean decision logic (explore more vs. finish)
- Human-in-the-Loop: Easy checkpoint prompts for user guidance
Trade-off: Added LangGraph dependency, but gained production-ready orchestration
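The checkpointing point above comes down to compiling the graph with a checkpointer. A sketch using the in-memory MemorySaver (a persistent store would replace it in production):

```typescript
import { MemorySaver } from "@langchain/langgraph";

// `builder` is the un-compiled StateGraph from the wiring sketch above.
declare const builder: {
  compile(opts?: { checkpointer?: MemorySaver }): {
    invoke(input: unknown, config?: unknown): Promise<unknown>;
  };
};

// Compiling with a checkpointer persists state after every node,
// which is what makes pause/resume at checkpoints possible.
const app = builder.compile({ checkpointer: new MemorySaver() });

// The thread_id names the exploration session so a later invoke can resume it.
await app.invoke(
  { url: "https://example.com" },
  { configurable: { thread_id: "session-1" } }
);
```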
Decision: Use OpenAI GPT-4o by default, with Anthropic Claude as option
Rationale:
- Reasoning Quality: GPT-4 excels at structured output and task decomposition
- JSON Mode: More reliable structured output for our schemas
- Cost: More economical for this use case vs Claude Opus
- API Stability: OpenAI API has been more stable in testing
Trade-off: Claude has better context window and safety, but GPT-4 worked better for our structured prompts
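As one illustration of what the JSON-mode point buys, assuming the services go through LangChain's ChatOpenAI (the schema and prompt here are illustrative, not the project's actual ones):

```typescript
import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";

// Illustrative finding shape; the real Zod schemas live in src/llm/schemas/.
const FindingSchema = z.object({
  type: z.enum(["bug", "ux_issue", "accessibility"]),
  severity: z.enum(["low", "medium", "high", "critical"]),
  title: z.string(),
  description: z.string(),
});

const llm = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });

// withStructuredOutput uses the provider's JSON/tool-calling mode and
// parses the response against the Zod schema before returning it.
const evaluator = llm.withStructuredOutput(FindingSchema);
const finding = await evaluator.invoke("Describe the most severe issue on this page: ...");
console.log(finding.severity, finding.title);
```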
Decision: Implement custom find_broken_images tool instead of relying on console errors
Rationale:
- Comprehensive Detection: Catches HTTP errors, network failures, empty src, zero dimensions, CORS issues
- Structured Output: Returns severity levels, selectors, and failure reasons
- Challenge Requirement: Demonstrates ability to create custom tools
- Real Value: Broken images are common bugs that generic tools miss
Implementation: Uses Playwright page evaluation to inspect all <img> elements and validate their loading state
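A condensed sketch of that idea follows. The real tool covers more failure modes (HTTP status codes, CORS), but the DOM-level check looks roughly like this:

```typescript
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");

// Ask the page which <img> elements failed: an image that finished loading
// but has zero natural width never rendered any pixels.
const broken = await page.evaluate(() =>
  Array.from(document.images)
    .filter((img) => !img.getAttribute("src") || (img.complete && img.naturalWidth === 0))
    .map((img) => ({
      src: img.src,
      alt: img.alt,
      reason: !img.getAttribute("src") ? "empty src" : "loaded with zero dimensions",
    }))
);

console.log(`${broken.length} broken image(s)`, broken);
await browser.close();
```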
Decision: Extract LLM logic from nodes into dedicated services (analyzer, decision, evaluator)
Rationale:
- SOLID Compliance: Single Responsibility Principle - nodes orchestrate, services handle LLM
- Testability: Services can be unit tested independently (see the sketch below)
- Reusability: Services can be used outside LangGraph if needed
- Maintainability: Smaller files (reduced from 400+ to ~150 lines per node)
Before/After:
- analyze.ts: 264 → ~130 lines
- decide.ts: 243 → ~120 lines
- evaluate.ts: 425 → ~190 lines
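In outline, the node/service split looks like this (interfaces and names are hypothetical; the real ones live in src/llm/):

```typescript
// Hypothetical shapes illustrating the separation of concerns.
interface PageAnalysis {
  elements: string[];
  hypotheses: string[];
}

interface AnalyzerService {
  analyze(html: string): Promise<PageAnalysis>;
}

// The node only orchestrates: gather input, delegate to the service,
// return a partial state update for LangGraph to merge.
async function analyzeNode(state: { html: string }, analyzer: AnalyzerService) {
  const analysis = await analyzer.analyze(state.html);
  return { analysis };
}

// In unit tests the service is trivially replaceable with a stub:
const stub: AnalyzerService = {
  analyze: async () => ({ elements: ["#login"], hypotheses: ["login form validates email"] }),
};
await analyzeNode({ html: "<form id=login></form>" }, stub);
```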
1. Complete Persistence Layer
- Status: Basic state serialization exists, but not full checkpoint resume
- Why: Time vs value - functional exploration works without it
- What it enables: Resume from exact checkpoint after crash
- Trade-off: Acceptable for MVP, would add for production
2. MCP Server ✅
- Status: Implemented (see MCP Server Mode)
- What it enables: External LLMs can invoke the agent as a tool via Model Context Protocol
- Usage: Integrate with Claude Desktop, custom MCP clients, or other AI applications
- Implementation: Exposes the explore_website tool via stdio transport
3. Extensive Test Coverage
- Status: Unit tests for broken-images tool only
- Why: "Quality over quantity" - focused on working implementation
- What we'd add: Tests for services (analyzer, decision, evaluator)
- Trade-off: Pragmatic choice given time constraints
- Analyzes page structure and identifies testable elements
- Generates hypotheses about application behavior
- Selects actions using LLM reasoning with heuristic fallback
- Adapts strategy based on findings
- Visual Issues: Broken images, layout problems
- Console Errors: JavaScript errors, network failures
- UX Issues: Confusing flows, missing feedback
- Edge Cases: Boundary conditions, validation failures
- Accessibility: Missing alt text, poor contrast (via LLM analysis)
- LLM-Based: Uses GPT-4 to select next action based on context
- Fallback Strategy: Heuristic selection when LLM fails (priority-based)
- Exploration vs Exploitation: Balances testing new areas vs deep-diving
- Recency Avoidance: Doesn't repeat same actions immediately
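A sketch of the LLM-plus-heuristic-fallback idea, with hypothetical types (the actual decision service is richer):

```typescript
type Action = { kind: "click" | "fill" | "navigate"; selector: string };

// Try the LLM first; on failure, fall back to a fixed priority order
// while avoiding selectors acted on recently. Assumes candidates is non-empty.
async function chooseAction(
  askLlm: () => Promise<Action>,
  candidates: Action[],
  recentSelectors: Set<string>
): Promise<Action> {
  try {
    return await askLlm();
  } catch {
    const priority = { fill: 0, click: 1, navigate: 2 } as const; // forms before links
    const fresh = candidates.filter((a) => !recentSelectors.has(a.selector));
    const pool = fresh.length > 0 ? fresh : candidates;
    return [...pool].sort((a, b) => priority[a.kind] - priority[b.kind])[0];
  }
}
```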
- Screenshots on every action
- Console error logs
- Network request monitoring
- Structured finding reports (JSON + Markdown)
- Pause exploration at intervals (default: every 10 actions)
- User can review findings and decide to continue or stop
- Graceful shutdown on Ctrl+C
# Explore application with default settings
npm run dev -- explore --url https://example.com
# Run in headless mode for CI pipeline
npm run dev -- explore \
--url https://staging.example.com \
--actions 30 \
--headless \
--checkpoint 15
# Fast exploration without LLM evaluation
npm run dev -- explore \
--url https://example.com \
--quick \
--actions 10
# Explore with frequent checkpoints for guidance
npm run dev -- explore \
--url https://example.com \
--checkpoint 5 \
--interactive
./
├── reports/
│ ├── session-<timestamp>.json # Structured findings
│ └── session-<timestamp>.md # Human-readable report
├── screenshots/
│ └── action-<timestamp>.png # Evidence screenshots
└── logs/
└── session-<timestamp>.log # Detailed execution logs
JSON Report (reports/session-*.json):
{
"sessionId": "session-1234567890",
"startTime": "2024-01-06T12:00:00Z",
"findings": [
{
"type": "broken_image",
"severity": "high",
"title": "5 broken images found",
"description": "...",
"steps": ["Navigate to homepage", "Check images"],
"evidence": [...]
}
],
"summary": { ... }
}
Markdown Report (reports/session-*.md):
# Exploratory Testing Report
## Session Information
- Session ID: session-1234567890
- Target URL: https://example.com
- Actions Performed: 23
- Findings: 8
## Critical Findings
### 🔴 Broken Images (High Severity)
Found 5 broken images on the homepage...
Limitation: Better suited for traditional multi-page apps than SPAs with client-side routing
Impact: May miss state changes in SPAs that don't trigger URL changes
Workaround: Agent still works but may need more actions to explore full SPA state
Limitation: No built-in support for login flows (credentials, OAuth, etc.)
Impact: Can only test public pages
Workaround: Manually navigate to authenticated state, then start agent
Limitation: LLM may occasionally suggest invalid selectors or actions
Impact: Action fails, agent continues with fallback
Mitigation: Fallback strategies ensure exploration continues even with LLM failures
Limitation: LLM API calls cost money (GPT-4: ~$0.01-0.03 per action)
Impact: 50-action exploration costs ~$0.50-1.50
Mitigation: Use --quick mode to skip evaluation, or use cheaper models
Limitation: LLM calls are slow (~2-5 seconds per action)
Impact: A 50-action exploration takes 5-10 minutes
Mitigation: Acceptable for exploratory testing; use --actions to limit scope
# Run all tests
npm test
# Run specific test
npm test -- broken-images
# Watch mode
npm test -- --watch
# Coverage report
npm run test:coverage
Implemented:
- ✅ tools/broken-images.test.ts - Comprehensive unit tests for broken image detector
Planned:
- llm/analyzer-service.test.ts - Test page analysis logic
- llm/decision-service.test.ts - Test action selection
- llm/evaluator-service.test.ts - Test finding detection
- agent/graph.test.ts - Integration test for full workflow
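A test in this style might look like the following, assuming a Vitest-style runner and a findBrokenImages export (both are assumptions, not the project's confirmed API):

```typescript
import { describe, expect, it } from "vitest";
import { findBrokenImages } from "../src/tools/broken-images"; // hypothetical export name

describe("findBrokenImages", () => {
  it("reports an image that loaded with zero dimensions", async () => {
    // Minimal stand-in for the Playwright page: only evaluate() is exercised.
    const page = {
      evaluate: async () => [
        { src: "https://example.com/missing.png", alt: "", reason: "loaded with zero dimensions" },
      ],
    };
    const findings = await findBrokenImages(page as any);
    expect(findings).toHaveLength(1);
    expect(findings[0].src).toContain("missing.png");
  });
});
```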
npm run build # Compile TypeScript
npm run typecheck    # Type checking only
npm run lint
src/
├── agent/
│ ├── graph.ts # LangGraph workflow
│ └── nodes/ # Analyze, Decide, Execute, Evaluate
├── browser/
│ └── browser-service.ts # Playwright abstraction
├── cli/
│ └── explore.ts # CLI interface
├── llm/
│ ├── *-service.ts # LLM communication services
│ ├── prompts/ # Prompt templates
│ └── schemas/ # Zod validation schemas
├── tools/
│ └── *.ts # Browser automation tools
└── types/
└── state.ts # TypeScript types
Built for an exploratory testing challenge using:
- LangGraph for agent orchestration
- Playwright for browser automation
- OpenAI GPT-4 for reasoning
- TypeScript for type safety
- Zod for runtime validation
MIT