Exploratory Testing Agent

AI-powered autonomous agent that explores web applications to discover bugs, UX issues, and accessibility problems using LLM-based reasoning and custom browser automation tools.

Quick Start

Prerequisites

  • Node.js 18+
  • OpenAI API key or Anthropic API key

Installation

# Clone repository
git clone <repository-url>
cd exploratory-testing-agent

# Install dependencies
npm install

# Configure environment
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY or ANTHROPIC_API_KEY

Run Exploration

# Interactive mode with checkpoints (default)
npm run dev -- explore --url https://with-bugs.practicesoftwaretesting.com

# Headless mode with limited actions
npm run dev -- explore --url <target-url> --headless --actions 20

# Quick mode (skips LLM evaluation for faster testing)
npm run dev -- explore --url <target-url> --quick

Available Commands

# Full options
npm run dev -- explore --help

# Options:
#   --url <url>              Target URL to explore (required)
#   --actions <number>       Maximum actions to perform (default: 50)
#   --checkpoint <number>    Actions between checkpoints (default: 10)
#   --headless              Run browser in headless mode
#   --quick                 Skip LLM evaluation (faster)
#   --interactive           Enable checkpoint prompts (default: true)

MCP Server Mode

The agent can also run as an MCP (Model Context Protocol) server, allowing external LLMs like Claude or GPT-4 to invoke it as a tool.

# Run MCP server in development
npm run mcp:dev

# Run MCP server in production
npm run build && npm run mcp:start

What is MCP? MCP is an open standard by Anthropic that allows AI applications to expose tools and services in a standardized way. External LLMs can then use these tools through a unified protocol.

Integration Example (Claude Desktop): Add to claude_desktop_config.json:

{
  "mcpServers": {
    "exploratory-testing": {
      "command": "node",
      "args": ["/path/to/exploratory-testing-agent/dist/mcp/index.js"],
      "env": {
        "OPENAI_API_KEY": "your_key_here"
      }
    }
  }
}

Then ask Claude: "Test https://example.com for bugs and report findings"

Claude will invoke the explore_website tool with parameters like URL, max actions, and guidance, then return a complete testing report with findings categorized by severity.

Available Tool:

  • explore_website(url, maxActions?, guidance?, ...) - Performs autonomous testing and returns structured findings
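
For reference, a call to this tool might pass arguments shaped like the following (the parameter names follow the signature above; the URL, action budget, and guidance text are purely illustrative):

{
  "name": "explore_website",
  "arguments": {
    "url": "https://example.com",
    "maxActions": 20,
    "guidance": "Focus on form validation and broken images"
  }
}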

Architecture Overview

High-Level Design

┌─────────────────────────────────────────────────────────────┐
│                        LangGraph Workflow                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Analyze → Decide → Execute → Evaluate → (loop/checkpoint)  │
│     ↓         ↓        ↓          ↓                          │
│   LLM      LLM      Tools       LLM                          │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Services Layer:
├── llm/
│   ├── analyzer-service.ts     # Page analysis with fallback
│   ├── decision-service.ts     # Action selection with fallback
│   ├── evaluator-service.ts    # Finding detection
│   └── schemas/                # Zod validation schemas
├── browser/
│   └── browser-service.ts      # Playwright abstraction
└── tools/
    ├── broken-images.ts        # Custom broken image detector
    ├── navigation.ts           # Navigate, back, etc.
    ├── interaction.ts          # Click, fill, submit
    └── extraction.ts           # Screenshot, extract data

Key Components

1. LangGraph State Machine

  • Analyze Node: Extracts page content, identifies interactive elements, generates test hypotheses
  • Decide Node: Selects next action based on analysis and exploration history
  • Execute Node: Performs browser actions (click, fill, navigate, etc.)
  • Evaluate Node: Detects bugs, validates hypotheses, generates findings
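
As a rough illustration of how these nodes fit together, the sketch below wires four placeholder nodes with the @langchain/langgraph JS API and uses an append-only reducer for findings; the state fields, node bodies, and routing predicate are simplified stand-ins, not the project's actual implementation:

import { StateGraph, Annotation, START, END } from "@langchain/langgraph";

const MAX_ACTIONS = 50;

// Simplified state: `findings` accumulates via an append-only reducer.
const ExplorationState = Annotation.Root({
  actionsTaken: Annotation<number>({
    reducer: (_prev, next) => next,
    default: () => 0,
  }),
  findings: Annotation<string[]>({
    reducer: (prev, next) => prev.concat(next),
    default: () => [],
  }),
});
type State = typeof ExplorationState.State;

// Placeholder nodes; the real ones delegate to LLM services and browser tools.
const analyzeNode = async (_state: State) => ({});
const decideNode = async (_state: State) => ({});
const executeNode = async (state: State) => ({ actionsTaken: state.actionsTaken + 1 });
const evaluateNode = async (_state: State) => ({ findings: ["example finding"] });

export const workflow = new StateGraph(ExplorationState)
  .addNode("analyze", analyzeNode)
  .addNode("decide", decideNode)
  .addNode("execute", executeNode)
  .addNode("evaluate", evaluateNode)
  .addEdge(START, "analyze")
  .addEdge("analyze", "decide")
  .addEdge("decide", "execute")
  .addEdge("execute", "evaluate")
  // Loop back for more exploration, or finish once the action budget is spent.
  .addConditionalEdges("evaluate", (state) =>
    state.actionsTaken < MAX_ACTIONS ? "analyze" : END
  )
  .compile();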

2. LLM Services (SOLID Refactored)

  • Each node delegates LLM communication to dedicated services
  • Services handle prompt construction, response parsing, Zod validation, and fallback strategies
  • Enables unit testing and reusability
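
A minimal sketch of that pattern is shown below; the schema fields, prompt handling, and fallback action are invented for illustration and do not mirror the real services:

import { z } from "zod";

// Hypothetical decision schema; the real schemas live in llm/schemas/.
const DecisionSchema = z.object({
  action: z.enum(["click", "fill", "navigate", "find_broken_images"]),
  selector: z.string().optional(),
  reasoning: z.string(),
});
type Decision = z.infer<typeof DecisionSchema>;

export async function decideNextAction(
  callLlm: (prompt: string) => Promise<string>,
  prompt: string
): Promise<Decision> {
  try {
    const raw = await callLlm(prompt);
    return DecisionSchema.parse(JSON.parse(raw)); // throws on malformed output
  } catch {
    // Fallback: a safe, always-valid default so exploration can continue.
    return {
      action: "find_broken_images",
      reasoning: "LLM response was invalid; using heuristic default",
    };
  }
}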

3. Custom Tools

  • Broken Images Tool: Detects images failing to load (HTTP errors, network failures, zero dimensions, etc.)
  • Integrated into exploration flow as high-priority action
  • Returns structured output with severity levels

Design Decisions

Why LangGraph over Custom Implementation?

Decision: Use LangGraph for agent orchestration

Rationale:

  • State Management: Built-in state channels with reducers for append-only arrays
  • Checkpointing: Native support for pause/resume (critical for long explorations)
  • Conditional Routing: Clean decision logic (explore more vs. finish)
  • Human-in-the-Loop: Easy checkpoint prompts for user guidance

Trade-off: Added LangGraph dependency, but gained production-ready orchestration


Why OpenAI GPT-4 as Default LLM?

Decision: Use OpenAI GPT-4o by default, with Anthropic Claude as option

Rationale:

  • Reasoning Quality: GPT-4 excels at structured output and task decomposition
  • JSON Mode: More reliable structured output for our schemas
  • Cost: More economical for this use case than Claude Opus
  • API Stability: OpenAI API has been more stable in testing

Trade-off: Claude offers a larger context window and stronger safety behavior, but GPT-4 worked better with our structured prompts


Why Custom Broken Images Tool?

Decision: Implement custom find_broken_images tool instead of relying on console errors

Rationale:

  • Comprehensive Detection: Catches HTTP errors, network failures, empty src, zero dimensions, CORS issues
  • Structured Output: Returns severity levels, selectors, and failure reasons
  • Challenge Requirement: Demonstrates ability to create custom tools
  • Real Value: Broken images are common bugs that generic tools miss

Implementation: Uses Playwright page evaluation to inspect all <img> elements and validate their loading state
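
A minimal sketch of that idea is shown below; it is deliberately simpler than the real tool, which also classifies HTTP/CORS failures, tracks network responses, and assigns severity levels:

import { Page } from "playwright";

interface BrokenImage {
  src: string;
  reason: string;
}

// Inspect every <img> in the page and flag ones with an empty src
// or ones that finished loading with zero natural width.
export async function findBrokenImages(page: Page): Promise<BrokenImage[]> {
  return page.evaluate(() =>
    Array.from(document.images)
      .filter(
        (img) =>
          !img.getAttribute("src") ||
          (img.complete && img.naturalWidth === 0)
      )
      .map((img) => ({
        src: img.currentSrc || img.getAttribute("src") || "(empty)",
        reason: img.getAttribute("src")
          ? "completed loading with zero natural width"
          : "empty src attribute",
      }))
  );
}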


Why Service Layer Extraction?

Decision: Extract LLM logic from nodes into dedicated services (analyzer, decision, evaluator)

Rationale:

  • SOLID Compliance: Single Responsibility Principle - nodes orchestrate, services handle LLM communication
  • Testability: Services can be unit tested independently
  • Reusability: Services can be used outside LangGraph if needed
  • Maintainability: Smaller files (reduced from 400+ to ~150 lines per node)

Before/After:

  • analyze.ts: 264 → ~130 lines
  • decide.ts: 243 → ~120 lines
  • evaluate.ts: 425 → ~190 lines
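
To make the split concrete, a post-refactor node is mostly orchestration; the shape below is illustrative and the names are hypothetical:

// Hypothetical service interface: the node never builds prompts or parses responses itself.
interface DecisionService {
  selectAction(input: { analysis: string; history: string[] }): Promise<string>;
}

interface ExplorationState {
  currentAnalysis: string;
  actionHistory: string[];
  pendingAction?: string;
}

// The node gathers inputs, delegates the LLM work, and maps the result back into state.
async function decideNode(
  state: ExplorationState,
  decisionService: DecisionService
): Promise<Partial<ExplorationState>> {
  const decision = await decisionService.selectAction({
    analysis: state.currentAnalysis,
    history: state.actionHistory,
  });
  return { pendingAction: decision };
}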

Scope Decisions (And Why)

1. Complete Persistence Layer

  • Status: Basic state serialization exists, but not full checkpoint resume
  • Why: Time vs value - functional exploration works without it
  • What it enables: Resume from exact checkpoint after crash
  • Trade-off: Acceptable for MVP, would add for production

2. MCP Server

  • Status: Implemented (see MCP Server Mode)
  • What it enables: External LLMs can invoke the agent as a tool via Model Context Protocol
  • Usage: Integrate with Claude Desktop, custom MCP clients, or other AI applications
  • Implementation: Exposes explore_website tool via stdio transport
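
For orientation, registering such a tool over stdio with the TypeScript MCP SDK looks roughly like the sketch below; it assumes @modelcontextprotocol/sdk, and runExploration stands in for the real entry point, so the actual wiring and result shape may differ:

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical entry point; the real agent drives the LangGraph workflow here.
async function runExploration(url: string, maxActions?: number, guidance?: string) {
  return { url, maxActions, guidance, findings: [] };
}

const server = new McpServer({ name: "exploratory-testing", version: "1.0.0" });

// Expose the exploration agent as a single tool.
server.tool(
  "explore_website",
  {
    url: z.string().url(),
    maxActions: z.number().optional(),
    guidance: z.string().optional(),
  },
  async ({ url, maxActions, guidance }) => {
    const report = await runExploration(url, maxActions, guidance);
    return { content: [{ type: "text", text: JSON.stringify(report) }] };
  }
);

// Communicate with the client (e.g. Claude Desktop) over stdin/stdout.
await server.connect(new StdioServerTransport());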

3. Extensive Test Coverage

  • Status: Unit tests for broken-images tool only
  • Why: "Quality over quantity" - focused on a working implementation
  • What we'd add: Tests for services (analyzer, decision, evaluator)
  • Trade-off: Pragmatic choice given time constraints

Features

Autonomous Exploration

  • Analyzes page structure and identifies testable elements
  • Generates hypotheses about application behavior
  • Selects actions using LLM reasoning with heuristic fallback
  • Adapts strategy based on findings

Bug Detection

  • Visual Issues: Broken images, layout problems
  • Console Errors: JavaScript errors, network failures
  • UX Issues: Confusing flows, missing feedback
  • Edge Cases: Boundary conditions, validation failures
  • Accessibility: Missing alt text, poor contrast (via LLM analysis)

Smart Decision Making

  • LLM-Based: Uses GPT-4 to select next action based on context
  • Fallback Strategy: Heuristic selection when LLM fails (priority-based)
  • Exploration vs. Exploitation: Balances testing new areas against deep-diving into known ones
  • Recency Avoidance: Avoids immediately repeating the same actions
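
The sketch below shows what a priority-based fallback of this kind could look like; the priority ordering, action shape, and recency check are illustrative rather than the project's actual heuristics:

interface CandidateAction {
  type: "find_broken_images" | "fill" | "click" | "navigate";
  selector?: string;
}

// Illustrative priority order: custom checks first, then forms, then links.
const PRIORITY: CandidateAction["type"][] = ["find_broken_images", "fill", "click", "navigate"];

export function pickFallbackAction(
  candidates: CandidateAction[],
  recent: CandidateAction[]
): CandidateAction | undefined {
  // Skip actions performed recently to avoid tight loops.
  const fresh = candidates.filter(
    (c) => !recent.some((r) => r.type === c.type && r.selector === c.selector)
  );
  // Pick the highest-priority remaining candidate.
  return [...fresh].sort(
    (a, b) => PRIORITY.indexOf(a.type) - PRIORITY.indexOf(b.type)
  )[0];
}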

Evidence Collection

  • Screenshots on every action
  • Console error logs
  • Network request monitoring
  • Structured finding reports (JSON + Markdown)
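
The sketch below shows how this kind of evidence can be captured with Playwright; the file naming and the exact fields the agent records are assumptions for the example:

import { Page } from "playwright";

// Wire up listeners once per session: console errors and failed network requests.
export function collectEvidence(page: Page, consoleErrors: string[], failedRequests: string[]): void {
  page.on("console", (msg) => {
    if (msg.type() === "error") consoleErrors.push(msg.text());
  });
  page.on("requestfailed", (req) => {
    failedRequests.push(`${req.method()} ${req.url()} - ${req.failure()?.errorText}`);
  });
}

// Screenshot taken after each action as visual evidence.
export async function screenshotAction(page: Page, dir = "screenshots"): Promise<string> {
  const path = `${dir}/action-${Date.now()}.png`;
  await page.screenshot({ path, fullPage: true });
  return path;
}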

Checkpointing

  • Pause exploration at intervals (default: every 10 actions)
  • User can review findings and decide to continue or stop
  • Graceful shutdown on Ctrl+C
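
Graceful shutdown of this kind can be handled with a SIGINT hook along the lines below; saveReport and closeBrowser are hypothetical stand-ins for the CLI's real cleanup steps:

// Placeholders for the real cleanup logic (write report, close Playwright browser).
async function saveReport(): Promise<void> { /* write reports/session-<timestamp>.json */ }
async function closeBrowser(): Promise<void> { /* browserService.close() */ }

let stopRequested = false;

process.on("SIGINT", async () => {
  if (stopRequested) process.exit(1); // second Ctrl+C forces an immediate exit
  stopRequested = true;
  console.log("Interrupt received: finishing the current action and writing the report...");
  await saveReport();
  await closeBrowser();
  process.exit(0);
});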

Usage Examples

Basic Exploration

# Explore application with default settings
npm run dev -- explore --url https://example.com

Headless CI/CD Mode

# Run in headless mode for CI pipeline
npm run dev -- explore \
  --url https://staging.example.com \
  --actions 30 \
  --headless \
  --checkpoint 15

Quick Smoke Test

# Fast exploration without LLM evaluation
npm run dev -- explore \
  --url https://example.com \
  --quick \
  --actions 10

Interactive Deep Dive

# Explore with frequent checkpoints for guidance
npm run dev -- explore \
  --url https://example.com \
  --checkpoint 5 \
  --interactive

Output & Reports

Directory Structure

./
├── reports/
│   ├── session-<timestamp>.json      # Structured findings
│   └── session-<timestamp>.md        # Human-readable report
├── screenshots/
│   └── action-<timestamp>.png        # Evidence screenshots
└── logs/
    └── session-<timestamp>.log       # Detailed execution logs

Report Contents

JSON Report (reports/session-*.json):

{
  "sessionId": "session-1234567890",
  "startTime": "2024-01-06T12:00:00Z",
  "findings": [
    {
      "type": "broken_image",
      "severity": "high",
      "title": "5 broken images found",
      "description": "...",
      "steps": ["Navigate to homepage", "Check images"],
      "evidence": [...]
    }
  ],
  "summary": { ... }
}
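
A TypeScript shape matching this report might look like the sketch below; it is inferred from the JSON example, so the severity values and field types here are assumptions rather than the project's actual definitions in types/state.ts:

// Assumed severity scale; only "high" appears in the example above.
type Severity = "low" | "medium" | "high" | "critical";

interface Finding {
  type: string;        // e.g. "broken_image"
  severity: Severity;
  title: string;
  description: string;
  steps: string[];     // reproduction steps
  evidence: unknown[]; // e.g. screenshot paths or structured evidence entries
}

interface SessionReport {
  sessionId: string;
  startTime: string;   // ISO 8601 timestamp
  findings: Finding[];
  summary: Record<string, unknown>;
}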

Markdown Report (reports/session-*.md):

# Exploratory Testing Report

## Session Information
- Session ID: session-1234567890
- Target URL: https://example.com
- Actions Performed: 23
- Findings: 8

## Critical Findings
### 🔴 Broken Images (High Severity)
Found 5 broken images on the homepage...

Known Limitations

1. Single-Page Application (SPA) Support

Limitation: Better suited for traditional multi-page apps than SPAs with client-side routing

Impact: May miss state changes in SPAs that don't trigger URL changes

Workaround: The agent still works but may need more actions to cover the full SPA state

2. Authentication Flows

Limitation: No built-in support for login flows (credentials, OAuth, etc.)

Impact: Can only test public pages

Workaround: Manually navigate to the authenticated state, then start the agent

3. LLM Hallucination

Limitation: LLM may occasionally suggest invalid selectors or actions

Impact: The action fails and the agent continues with a fallback

Mitigation: Fallback strategies ensure exploration continues even with LLM failures

4. Cost

Limitation: LLM API calls cost money (GPT-4: ~$0.01-0.03 per action)

Impact: 50-action exploration costs ~$0.50-1.50

Mitigation: Use --quick mode to skip evaluation, or use cheaper models

5. Performance

Limitation: LLM calls are slow (~2-5 seconds per action)

Impact: A 50-action run takes 5-10 minutes

Mitigation: Acceptable for exploratory testing; use --actions to limit scope


Testing

Run Tests

# Run all tests
npm test

# Run specific test
npm test -- broken-images

# Watch mode
npm test -- --watch

Test Coverage

npm run test:coverage

Current Tests

  • tools/broken-images.test.ts - Comprehensive unit tests for broken image detector

Future Tests (Recommended)

  • llm/analyzer-service.test.ts - Test page analysis logic
  • llm/decision-service.test.ts - Test action selection
  • llm/evaluator-service.test.ts - Test finding detection
  • agent/graph.test.ts - Integration test for full workflow

Development

Build

npm run build        # Compile TypeScript
npm run typecheck    # Type checking only

Linting

npm run lint

Code Structure

src/
├── agent/
│   ├── graph.ts              # LangGraph workflow
│   └── nodes/                # Analyze, Decide, Execute, Evaluate
├── browser/
│   └── browser-service.ts    # Playwright abstraction
├── cli/
│   └── explore.ts            # CLI interface
├── llm/
│   ├── *-service.ts          # LLM communication services
│   ├── prompts/              # Prompt templates
│   └── schemas/              # Zod validation schemas
├── tools/
│   └── *.ts                  # Browser automation tools
└── types/
    └── state.ts              # TypeScript types

Acknowledgments

Built for an exploratory testing challenge using:

  • LangGraph for agent orchestration
  • Playwright for browser automation
  • OpenAI GPT-4 for reasoning
  • TypeScript for type safety
  • Zod for runtime validation

License

MIT
