AirCrawl Logo

AirCrawl

AirCrawl is an experimental browser automation system that uses LLMs to autonomously navigate websites, extract data, and complete complex web tasks.

AirCrawl Screenshot

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        AGENT LOOP                               │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐       │
│  │ OBSERVE │───▶│  PLAN   │───▶│ EXECUTE │───▶│ VERIFY  │──┐    │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘  │    │
│       ▲                                                    │    │
│       └────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     MCP SERVER (Tools)                          │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐            │
│  │ navigate │ │  click   │ │   type   │ │screenshot│  ...       │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘            │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      SELENIUMBASE                               │
│              (Undetected Chrome Browser)                        │
└─────────────────────────────────────────────────────────────────┘
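The loop in the diagram can be sketched in a few lines of Python. This is an illustrative simplification, not the actual implementation in agent/agent.py; the observe/plan/execute/verify callables are hypothetical stand-ins for the real page-state capture, LLM planning, and MCP tool calls:

```python
# Simplified sketch of the OBSERVE -> PLAN -> EXECUTE -> VERIFY loop.
# The four callables are hypothetical stand-ins for the agent's real
# screenshot/LLM/MCP machinery.
def run_agent_loop(observe, plan, execute, verify, max_steps=30):
    """Run the loop until verify() reports success or steps run out."""
    for step in range(max_steps):
        state = observe()          # OBSERVE: page state + screenshot
        action = plan(state)       # PLAN: LLM picks the next tool call
        result = execute(action)   # EXECUTE: run the tool via MCP
        if verify(state, result):  # VERIFY: did we reach the goal?
            return {"status": "completed", "steps": step + 1}
    return {"status": "max_steps_reached", "steps": max_steps}
```

The VERIFY arrow looping back to OBSERVE corresponds to the `for` loop continuing when `verify` returns False; AGENT_MAX_STEPS in .env plays the role of `max_steps`.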

Features

  • Autonomous Navigation: LLM-powered decision making for complex web tasks
  • Vision-Based Analysis: Multimodal LLM analyzes screenshots to understand page state
  • Element Annotation: Visual labeling of interactive elements for precise targeting
  • Undetected Browser: SeleniumBase with UC mode bypasses bot detection
  • MCP Protocol: Standard Model Context Protocol for tool communication
  • REST API: Full-featured FastAPI server for integration
  • Interactive Sessions: Persistent browser sessions for step-by-step control
  • Quick Actions: Pre-built endpoints for common tasks (search, login, extract)
  • Agent System: Persistent agents with custom prompts and encrypted secrets

Installation

# Clone the repository
git clone https://github.com/Arkel-ai/aircrawl.git
cd aircrawl

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template
cp .env.template .env
# Edit .env with your API keys

Configuration

Edit .env file:

# LLM Configuration (OpenAI-compatible)
OPENAI_API_KEY=your-api-key
OPENAI_BASE_URL=https://api.openai.com/v1  # Or other compatible endpoint
MODEL=gpt-4o  # Or any compatible model

# Browser Settings
BROWSER_UC_MODE=true
BROWSER_TIMEOUT=30

# Agent Settings
AGENT_MAX_STEPS=30

# API Server
API_HOST=0.0.0.0
API_PORT=8000
DEBUG=false

# Database (PostgreSQL)
POSTGRES_USER=aircrawl
POSTGRES_PASSWORD=your_secure_password
POSTGRES_DB=aircrawl
DATABASE_URL=postgresql+asyncpg://aircrawl:your_secure_password@localhost:5432/aircrawl

# Encryption key for secrets
ENCRYPTION_KEY=  # Generate with: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

# Browser Proxy (optional) - see "Using Gost Proxy" section below
BROWSER_PROXY=

Docker Compose

The easiest way to run AirCrawl is with Docker Compose:

# Copy environment template
cp .env.template .env
# Edit .env with your API keys and configuration

# Start all services (PostgreSQL, Backend, Frontend)
docker compose up -d

# View logs
docker compose logs -f

# Stop services
docker compose down

Services:

  • PostgreSQL: Database on port 5432
  • Backend API: FastAPI server on port 8000
  • Frontend: SvelteKit app on port 3000

Using Gost Proxy (Anti-Bot)

Gost is a tunneling tool that runs a local SOCKS5 proxy and helps bypass anti-bot detection by routing browser traffic through a residential IP. This is useful when websites block datacenter IP ranges.

How it works:

  1. Gost runs as a local SOCKS5 proxy (port 1081)
  2. It forwards traffic to your upstream proxy (e.g., a residential proxy service)
  3. The browser connects through the local proxy, appearing to come from a residential IP

Setup:

  1. Set up your upstream proxy (residential IP, home server, etc.)
  2. Configure .env:
# Local proxy that the browser will use
BROWSER_PROXY=socks5://127.0.0.1:1081

# Upstream proxy (your residential/home proxy)
GOST_UPSTREAM_PROXY=socks5://username:password@your-proxy-ip:1080
  3. Start with the proxy profile:
docker compose --profile proxy up -d

Without proxy: If BROWSER_PROXY is not set, the browser connects directly without any proxy.

Usage

Command Line

# Run a single task
python main.py task "Go to news.ycombinator.com and get the top 5 headlines"

# Interactive demo
python main.py demo

# Start API server
python main.py server --port 8000

# Start MCP server (for external clients)
python main.py mcp-server

API Server

Start the server:

python main.py server

Create async task:

curl -X POST http://localhost:8000/api/tasks \
  -H "Content-Type: application/json" \
  -d '{"task": "Go to wikipedia.org and search for Python programming", "max_steps": 20}'

Check task status:

curl http://localhost:8000/api/tasks/{task_id}
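In Python, the same create-then-poll flow looks roughly like this (a stdlib-only sketch; the `task_id` and `status` response fields are assumptions inferred from the endpoints above, so adjust them to the actual response shape):

```python
import json
import time
import urllib.request

BASE = "http://localhost:8000"

def wait_for_task(fetch_status, poll_interval=2.0, timeout=120.0):
    """Poll fetch_status() until the task leaves a pending/running state.
    fetch_status must return a dict with a 'status' key (assumed field)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        task = fetch_status()
        if task.get("status") not in ("pending", "running"):
            return task
        time.sleep(poll_interval)
    raise TimeoutError("task did not finish in time")

def _post_json(url, payload):
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)

def _get_json(url):
    with urllib.request.urlopen(url) as r:
        return json.load(r)

if __name__ == "__main__":
    # Create the task, then poll its status endpoint until it settles.
    created = _post_json(f"{BASE}/api/tasks", {
        "task": "Go to wikipedia.org and search for Python programming",
        "max_steps": 20,
    })
    task_id = created["task_id"]  # assumed response field
    result = wait_for_task(
        lambda: _get_json(f"{BASE}/api/tasks/{task_id}"))
    print(result)
```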

Quick search:

curl -X POST http://localhost:8000/api/quick/search \
  -H "Content-Type: application/json" \
  -d '{"query": "latest AI news", "engine": "duckduckgo", "num_results": 5}'

Quick navigate and extract:

curl -X POST http://localhost:8000/api/quick/navigate \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com", "extract": "headlines"}'

Interactive Sessions

Create persistent browser sessions for step-by-step control:

# Create session
curl -X POST http://localhost:8000/api/sessions \
  -H "Content-Type: application/json" \
  -d '{}'

# Execute action
curl -X POST http://localhost:8000/api/sessions/{session_id}/action \
  -H "Content-Type: application/json" \
  -d '{"action": "navigate", "parameters": {"url": "https://google.com"}}'

# Close session
curl -X DELETE http://localhost:8000/api/sessions/{session_id}
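For programmatic use, the create/act/close lifecycle fits naturally into a context manager so the session is always closed, even if an action raises. A minimal sketch, where the `create`/`act`/`close` callables are hypothetical wrappers around the three REST calls above:

```python
import contextlib

@contextlib.contextmanager
def browser_session(create, act, close):
    """Wrap the session lifecycle: create -> actions -> guaranteed close.
    create() returns a session id; act(session_id, action, params) runs one
    action; close(session_id) tears the session down (all hypothetical
    wrappers around the REST endpoints shown above)."""
    session_id = create()
    try:
        # Yield a convenience callable bound to this session.
        yield lambda action, **params: act(session_id, action, params)
    finally:
        close(session_id)
```

Usage would look like `with browser_session(create, act, close) as do: do("navigate", url="https://google.com")`, mirroring the curl sequence above.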

Python SDK

import asyncio
from agent.agent import WebAgent
from agent.mcp_client import create_direct_client

async def main():
    async with create_direct_client(headless=True) as browser:
        agent = WebAgent(mcp_client=browser)
  
        result = await agent.execute_task(
            "Go to bbc.com/news and extract the top 5 headlines"
        )
  
        print(f"Status: {result['status']}")
        print(f"Data: {result['extracted_data']}")

asyncio.run(main())

API Endpoints

Tasks

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/tasks | Create async task |
| POST | /api/tasks/sync | Create sync task (blocking) |
| GET | /api/tasks | List all tasks |
| GET | /api/tasks/{id} | Get task details |
| GET | /api/tasks/{id}/history | Get execution history |
| DELETE | /api/tasks/{id} | Delete task |

Quick Actions

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/quick/navigate | Navigate and extract content |
| POST | /api/quick/search | Web search |
| POST | /api/quick/login | Log in to a website |

Sessions

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/sessions | Create browser session |
| POST | /api/sessions/{id}/action | Execute action |
| GET | /api/sessions | List sessions |
| DELETE | /api/sessions/{id} | Close session |

Vision & Screenshot Analysis

AirCrawl uses multimodal LLM vision capabilities to understand web pages visually, enabling more intelligent automation:

How It Works

  1. Page State Observation: When the agent calls get_page_state, a screenshot is automatically captured
  2. Visual Analysis: The screenshot is sent to the vision-capable LLM (e.g., GPT-4 Vision, Claude 3) along with the task context
  3. Element Annotation: Interactive elements are visually labeled with numbered overlays
  4. Intelligent Decision Making: The LLM analyzes the visual layout to determine the next action
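Step 2 typically means packing the screenshot into a multimodal chat message. A sketch of that payload, assuming the standard base64 data-URI image format used by OpenAI-compatible chat completion APIs (not necessarily the exact prompt AirCrawl builds internally):

```python
import base64

def build_vision_message(task: str, screenshot_png: bytes) -> dict:
    """Build one OpenAI-style user message pairing the task text with a
    base64-encoded PNG screenshot (standard data-URI image format)."""
    b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": task},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }
```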

Annotated Screenshots

The system can overlay visual labels on interactive elements to help the LLM identify targets:

# Annotated elements are numbered and highlighted
# Example output:
{
  "annotated_elements": [
    {"id": 1, "selector": "#search-input", "type": "input", "text": "Search..."},
    {"id": 2, "selector": ".login-btn", "type": "button", "text": "Login"},
    {"id": 3, "selector": "a.nav-link", "type": "link", "text": "Home"}
  ]
}

The LLM can then reference elements by their visual ID, e.g. "Click element #2 to log in."

API Endpoints for Screenshots

# Get current session screenshot (base64)
curl http://localhost:8000/api/sessions/{session_id}/screenshot

# Get annotated screenshot with labeled elements
curl http://localhost:8000/api/sessions/{session_id}/screenshot/annotated
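From Python, the screenshot response can be decoded and written to disk. A small stdlib-only sketch, assuming the endpoint returns JSON with a base64-encoded `screenshot` field (an assumed field name; adjust the key to the real response shape):

```python
import base64
import json
import urllib.request

def decode_screenshot(payload: dict) -> bytes:
    """Decode the base64 'screenshot' field (assumed field name)."""
    return base64.b64decode(payload["screenshot"])

def save_screenshot(session_id: str, path: str,
                    base: str = "http://localhost:8000") -> None:
    """Fetch the current session screenshot and write the PNG to disk."""
    url = f"{base}/api/sessions/{session_id}/screenshot"
    with urllib.request.urlopen(url) as r:
        payload = json.load(r)
    with open(path, "wb") as f:
        f.write(decode_screenshot(payload))
```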

Benefits

  • Visual Context: Understands page layout, not just DOM structure
  • Dynamic Content: Handles JavaScript-rendered content that may not be in initial HTML
  • Complex UIs: Navigates modern SPAs with dynamic components
  • Error Detection: Visually identifies error messages, popups, and unexpected states
  • Selector Resilience: Works even when selectors are obfuscated or change frequently

Browser Tools

The agent has access to these browser automation tools:

Core Navigation

  • browser_start / browser_stop - Session lifecycle
  • navigate - Go to URL
  • go_back / go_forward / refresh - Browser navigation

Element Interaction

  • click - Click element by CSS selector
  • type_text - Enter text in input fields
  • press_key - Keyboard input (Enter, Tab, Escape, etc.)
  • hover / double_click / right_click - Mouse actions
  • select_option - Dropdown selection

Page Analysis

  • get_page_state - Observe page (elements, URL, title, screenshot)
  • annotate_elements - Add visual labels to interactive elements
  • remove_annotations - Remove visual labels
  • take_annotated_screenshot - Capture screenshot with element labels
  • get_text / get_attribute - Extract element data
  • extract_content - Get headlines, links, article, forms

Utilities

  • scroll - Scroll page
  • wait_for_element / wait_for_text - Wait for content
  • execute_script - Run JavaScript
  • switch_to_frame - Handle iframes
  • upload_file - File uploads
  • accept_alert / dismiss_alert - Handle dialogs

Project Structure

aircrawl/
├── mcp_server/
│   ├── __init__.py
│   ├── server.py          # MCP server with browser tools
│   └── browser_manager.py # SeleniumBase wrapper
├── agent/
│   ├── __init__.py
│   ├── agent.py           # Main agent loop
│   ├── mcp_client.py      # MCP client wrapper
│   └── state.py           # State management
├── api/
│   ├── __init__.py
│   └── server.py          # FastAPI REST API
├── config.py              # Configuration
├── main.py                # CLI entry point
├── requirements.txt
└── README.md

License

MIT License
