AirCrawl is an experimental browser automation system that uses LLMs to autonomously navigate websites, extract data, and complete complex web tasks.
```
┌─────────────────────────────────────────────────────────────────┐
│                           AGENT LOOP                            │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐       │
│  │ OBSERVE │───▶│  PLAN   │───▶│ EXECUTE │───▶│ VERIFY  │──┐    │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘  │    │
│       ▲                                                    │    │
│       └────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       MCP SERVER (Tools)                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ navigate │  │  click   │  │   type   │  │screenshot│  ...    │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘         │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          SELENIUMBASE                           │
│                   (Undetected Chrome Browser)                   │
└─────────────────────────────────────────────────────────────────┘
```
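The observe → plan → execute → verify cycle above can be sketched in plain Python. This is an illustrative skeleton only; the function names (`observe`, `plan`, `execute`, `verify`) are placeholders for the LLM and browser calls, not AirCrawl's actual internals:

```python
def run_agent(task, observe, plan, execute, verify, max_steps=30):
    """Run the observe → plan → execute → verify loop until the task
    is judged complete or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        state = observe()                    # e.g. screenshot + DOM summary
        action = plan(task, state, history)  # LLM chooses the next browser action
        result = execute(action)             # dispatch to an MCP tool
        history.append((action, result))
        if verify(task, state, result):      # LLM (or heuristic) checks completion
            return {"status": "completed", "steps": len(history)}
    return {"status": "max_steps_reached", "steps": len(history)}
```

The `max_steps` budget corresponds to the `AGENT_MAX_STEPS` setting described in the configuration section below.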
- Autonomous Navigation: LLM-powered decision making for complex web tasks
- Vision-Based Analysis: Multimodal LLM analyzes screenshots to understand page state
- Element Annotation: Visual labeling of interactive elements for precise targeting
- Undetected Browser: SeleniumBase with UC mode bypasses bot detection
- MCP Protocol: Standard Model Context Protocol for tool communication
- REST API: Full-featured FastAPI server for integration
- Interactive Sessions: Persistent browser sessions for step-by-step control
- Quick Actions: Pre-built endpoints for common tasks (search, login, extract)
- Agent System: Persistent agents with custom prompts and encrypted secrets
```bash
# Clone the repository
git clone https://github.com/yourusername/aircrawl.git
cd aircrawl

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template
cp .env.template .env

# Edit .env with your API keys
```

Edit `.env` file:
```bash
# LLM Configuration (OpenAI-compatible)
OPENAI_API_KEY=your-api-key
OPENAI_BASE_URL=https://api.openai.com/v1  # Or other compatible endpoint
MODEL=gpt-4o  # Or any compatible model

# Browser Settings
BROWSER_UC_MODE=true
BROWSER_TIMEOUT=30

# Agent Settings
AGENT_MAX_STEPS=30

# API Server
API_HOST=0.0.0.0
API_PORT=8000
DEBUG=false

# Database (PostgreSQL)
POSTGRES_USER=aircrawl
POSTGRES_PASSWORD=your_secure_password
POSTGRES_DB=aircrawl
DATABASE_URL=postgresql+asyncpg://aircrawl:your_secure_password@localhost:5432/aircrawl

# Encryption key for secrets
ENCRYPTION_KEY=  # Generate with: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

# Browser Proxy (optional) - see "Using Gost Proxy" section below
BROWSER_PROXY=
```

The easiest way to run AirCrawl is with Docker Compose:
```bash
# Copy environment template
cp .env.template .env

# Edit .env with your API keys and configuration

# Start all services (PostgreSQL, Backend, Frontend)
docker compose up -d

# View logs
docker compose logs -f

# Stop services
docker compose down
```

Services:
- PostgreSQL: Database on port 5432
- Backend API: FastAPI server on port 8000
- Frontend: SvelteKit app on port 3000
Gost is a SOCKS5 proxy that helps bypass anti-bot detection by routing browser traffic through a residential IP. This is useful when websites block datacenter IPs.
How it works:
- Gost runs as a local SOCKS5 proxy (port 1081)
- It forwards traffic to your upstream proxy (e.g., a residential proxy service)
- The browser connects through the local proxy, appearing to come from a residential IP
Setup:

- Set up your upstream proxy (residential IP, home server, etc.)
- Configure `.env`:

  ```bash
  # Local proxy that the browser will use
  BROWSER_PROXY=socks5://127.0.0.1:1081

  # Upstream proxy (your residential/home proxy)
  GOST_UPSTREAM_PROXY=socks5://username:password@your-proxy-ip:1080
  ```

- Start with the proxy profile:

  ```bash
  docker compose --profile proxy up -d
  ```

Without proxy: If `BROWSER_PROXY` is not set, the browser connects directly without any proxy.
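A quick sanity check before starting the browser is to confirm that the local Gost listener is actually accepting connections. This helper is not part of AirCrawl; it is a small standard-library sketch, and the `127.0.0.1:1081` address simply mirrors the `BROWSER_PROXY` example above:

```python
import socket

def proxy_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check the local Gost SOCKS5 listener before launching the browser
if not proxy_reachable("127.0.0.1", 1081):
    print("Gost proxy is not reachable; the browser would connect directly")
```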
```bash
# Run a single task
python main.py task "Go to news.ycombinator.com and get the top 5 headlines"

# Interactive demo
python main.py demo

# Start API server
python main.py server --port 8000

# Start MCP server (for external clients)
python main.py mcp-server
```

Start the server:
```bash
python main.py server
```

```bash
curl -X POST http://localhost:8000/api/tasks \
  -H "Content-Type: application/json" \
  -d '{"task": "Go to wikipedia.org and search for Python programming", "max_steps": 20}'
```

```bash
curl http://localhost:8000/api/tasks/{task_id}
```

```bash
curl -X POST http://localhost:8000/api/quick/search \
  -H "Content-Type: application/json" \
  -d '{"query": "latest AI news", "engine": "duckduckgo", "num_results": 5}'
```

```bash
curl -X POST http://localhost:8000/api/quick/navigate \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com", "extract": "headlines"}'
```

Create persistent browser sessions for step-by-step control:
```bash
# Create session
curl -X POST http://localhost:8000/api/sessions \
  -H "Content-Type: application/json" \
  -d '{}'

# Execute action
curl -X POST http://localhost:8000/api/sessions/{session_id}/action \
  -H "Content-Type: application/json" \
  -d '{"action": "navigate", "parameters": {"url": "https://google.com"}}'

# Close session
curl -X DELETE http://localhost:8000/api/sessions/{session_id}
```

```python
import asyncio

from agent.agent import WebAgent
from agent.mcp_client import create_direct_client

async def main():
    async with create_direct_client(headless=True) as browser:
        agent = WebAgent(mcp_client=browser)
        result = await agent.execute_task(
            "Go to bbc.com/news and extract the top 5 headlines"
        )
        print(f"Status: {result['status']}")
        print(f"Data: {result['extracted_data']}")

asyncio.run(main())
```

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/tasks` | Create async task |
| POST | `/api/tasks/sync` | Create sync task (blocking) |
| GET | `/api/tasks` | List all tasks |
| GET | `/api/tasks/{id}` | Get task details |
| GET | `/api/tasks/{id}/history` | Get execution history |
| DELETE | `/api/tasks/{id}` | Delete task |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/quick/navigate` | Navigate and extract content |
| POST | `/api/quick/search` | Web search |
| POST | `/api/quick/login` | Login to website |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/sessions` | Create browser session |
| POST | `/api/sessions/{id}/action` | Execute action |
| GET | `/api/sessions` | List sessions |
| DELETE | `/api/sessions/{id}` | Close session |
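The session endpoints above compose naturally into a small Python client. The sketch below is hypothetical (the `SessionClient` class and the injectable `opener` are not part of AirCrawl); it uses only the standard library and the endpoint paths documented in the table:

```python
import json
import urllib.request

class SessionClient:
    """Minimal client for the session endpoints listed above."""

    def __init__(self, base_url="http://localhost:8000", opener=urllib.request.urlopen):
        self.base_url = base_url
        self._open = opener  # injectable for testing; defaults to real HTTP

    def _request(self, method, path, body=None):
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            f"{self.base_url}{path}",
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with self._open(req) as resp:
            return json.loads(resp.read())

    def create(self):
        return self._request("POST", "/api/sessions", {})

    def action(self, session_id, action, **parameters):
        return self._request(
            "POST",
            f"/api/sessions/{session_id}/action",
            {"action": action, "parameters": parameters},
        )

    def close(self, session_id):
        return self._request("DELETE", f"/api/sessions/{session_id}")
```

Usage mirrors the curl examples: `client.create()`, then `client.action(sid, "navigate", url="https://google.com")`, then `client.close(sid)`.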
AirCrawl uses multimodal LLM vision capabilities to understand web pages visually, enabling more intelligent automation:
- Page State Observation: When the agent calls `get_page_state`, a screenshot is automatically captured
- Visual Analysis: The screenshot is sent to the vision-capable LLM (e.g., GPT-4 Vision, Claude 3) along with the task context
- Element Annotation: Interactive elements are visually labeled with numbered overlays
- Intelligent Decision Making: The LLM analyzes the visual layout to determine the next action
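The "screenshot plus task context" step typically maps onto an OpenAI-style multimodal chat message. The helper below is an illustrative sketch, not AirCrawl's internals; the prompt wording and function name are placeholders, while the `image_url`/base64 data-URL shape is the standard OpenAI-compatible format:

```python
import base64

def build_vision_messages(task: str, screenshot_png: bytes) -> list:
    """Build an OpenAI-compatible multimodal chat payload pairing the
    task description with a base64-encoded screenshot."""
    b64 = base64.b64encode(screenshot_png).decode()
    return [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task}\nDescribe the page state and pick the next action."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ]
```

The returned list would be passed as `messages` to a chat-completions call against the `OPENAI_BASE_URL` configured in `.env`.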
The system can overlay visual labels on interactive elements to help the LLM identify targets:
```python
# Annotated elements are numbered and highlighted
# Example output:
{
    "annotated_elements": [
        {"id": 1, "selector": "#search-input", "type": "input", "text": "Search..."},
        {"id": 2, "selector": ".login-btn", "type": "button", "text": "Login"},
        {"id": 3, "selector": "a.nav-link", "type": "link", "text": "Home"}
    ]
}
```

The LLM can then reference elements by their visual ID: "Click element #2 to login".
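Resolving a reference like "element #2" back to something clickable is then a lookup over `annotated_elements`. A hypothetical helper (not AirCrawl's actual code) makes the mapping concrete:

```python
def selector_for(annotated_elements: list, element_id: int) -> str:
    """Resolve a visual label id (as referenced by the LLM) to a CSS selector."""
    for el in annotated_elements:
        if el["id"] == element_id:
            return el["selector"]
    raise KeyError(f"no annotated element with id {element_id}")

elements = [
    {"id": 1, "selector": "#search-input", "type": "input", "text": "Search..."},
    {"id": 2, "selector": ".login-btn", "type": "button", "text": "Login"},
]
print(selector_for(elements, 2))  # → .login-btn
```

The resulting selector can be fed straight into the `click` tool described in the Tools section.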
```bash
# Get current session screenshot (base64)
curl http://localhost:8000/api/sessions/{session_id}/screenshot

# Get annotated screenshot with labeled elements
curl http://localhost:8000/api/sessions/{session_id}/screenshot/annotated
```

- Visual Context: Understands page layout, not just DOM structure
- Dynamic Content: Handles JavaScript-rendered content that may not be in initial HTML
- Complex UIs: Navigates modern SPAs with dynamic components
- Error Detection: Visually identifies error messages, popups, and unexpected states
- Accessibility: Works even when selectors are obfuscated or change frequently
The agent has access to these browser automation tools:
- `browser_start` / `browser_stop` - Session lifecycle
- `navigate` - Go to URL
- `go_back` / `go_forward` / `refresh` - Browser navigation
- `click` - Click element by CSS selector
- `type_text` - Enter text in input fields
- `press_key` - Keyboard input (Enter, Tab, Escape, etc.)
- `hover` / `double_click` / `right_click` - Mouse actions
- `select_option` - Dropdown selection
- `get_page_state` - Observe page (elements, URL, title, screenshot)
- `annotate_elements` - Add visual labels to interactive elements
- `remove_annotations` - Remove visual labels
- `take_annotated_screenshot` - Capture screenshot with element labels
- `get_text` / `get_attribute` - Extract element data
- `extract_content` - Get headlines, links, article, forms
- `scroll` - Scroll page
- `wait_for_element` / `wait_for_text` - Wait for content
- `execute_script` - Run JavaScript
- `switch_to_frame` - Handle iframes
- `upload_file` - File uploads
- `accept_alert` / `dismiss_alert` - Handle dialogs
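A tool list like this usually boils down to a name → function dispatch table on the server side. The sketch below illustrates that pattern only; it is not the MCP server's real implementation, and the stub bodies stand in for actual SeleniumBase calls:

```python
TOOLS = {}

def tool(fn):
    """Register a function as a callable browser tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def navigate(url: str) -> str:
    # A real implementation would drive SeleniumBase here
    return f"navigated to {url}"

@tool
def click(selector: str) -> str:
    return f"clicked {selector}"

def dispatch(name: str, **kwargs):
    """Execute a tool call that the agent requested by name."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)

print(dispatch("navigate", url="https://example.com"))  # → navigated to https://example.com
```

Keeping tools in a registry like this is what lets the agent's LLM output ("call `click` with `selector=...`") be validated and executed uniformly.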
```
aircrawl/
├── mcp_server/
│   ├── __init__.py
│   ├── server.py           # MCP server with browser tools
│   └── browser_manager.py  # SeleniumBase wrapper
├── agent/
│   ├── __init__.py
│   ├── agent.py            # Main agent loop
│   ├── mcp_client.py       # MCP client wrapper
│   └── state.py            # State management
├── api/
│   ├── __init__.py
│   └── server.py           # FastAPI REST API
├── config.py               # Configuration
├── main.py                 # CLI entry point
├── requirements.txt
└── README.md
```
MIT License
