🥷 The FREE, Self-hosted alternative to Firecrawl. Pure Rust Stealth Scraping for AI Agents. Bypass Cloudflare, DataDome, Akamai, and Kasada via HITL Collaboration. 99.99% Success Rate. Private, Fast, and Sovereign. 🦀

DevsHero/ShadowCrawl

🥷 ShadowCrawl MCP

Bypass Anything. Scrape Everything.

The 99.99% Success Rate Stealth Engine for AI Agents

The Sovereign, Self-Hosted Alternative to Firecrawl, Jina, and Tavily.


ShadowCrawl is not just a scraper; it's a Cyborg Intelligence Layer. While other APIs fail against Cloudflare, Akamai, and PerimeterX, ShadowCrawl leverages a unique Human-AI Collaboration model to achieve a near-perfect bypass rate on even the most guarded "Boss Level" sites (LinkedIn, Airbnb, Ticketmaster).

🚀 Why ShadowCrawl?

  • 99.99% Bot Bypass: Featuring the "Non-Robot Search" engine. When automation hits a wall, ShadowCrawl bridges the gap with Human-In-The-Loop (HITL) interaction, allowing you to solve CAPTCHAs and login walls manually while the agent continues its work.
  • Total Sovereignty: 100% Private. Self-hosted via Docker. No API keys, no monthly fees, and no third-party data tracking.
  • Agent-Native (MCP): Deeply integrated with Cursor, Claude Desktop, and IDEs via the Model Context Protocol. Your AI agent now has eyes and hands in the real web.
  • Universal Noise Reduction: Advanced Rust-based filtering that collapses "Skeleton Screens" and repeats, delivering clean, semantic Markdown that reduces LLM token costs.
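
The noise-reduction idea in the last bullet can be illustrated with a minimal Python sketch. This is not ShadowCrawl's Rust implementation: `collapse_repeats` is a hypothetical helper, and it assumes that "repeats" means consecutive identical lines (such as skeleton-screen placeholders) and runs of blank lines.

```python
def collapse_repeats(markdown: str) -> str:
    """Drop consecutive duplicate lines and squeeze runs of blank lines.

    Hypothetical helper for illustration only; not part of ShadowCrawl.
    """
    out: list[str] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        # Skip a line identical to the previously kept line.
        # This also squeezes consecutive blank lines into one.
        if out and stripped == out[-1].strip():
            continue
        out.append(line.rstrip())
    return "\n".join(out)


if __name__ == "__main__":
    page = "Loading...\nLoading...\nLoading...\n\n\n# Title\nBody text\nBody text\n"
    print(collapse_repeats(page))
```

Fewer repeated lines means fewer tokens sent to the LLM, which is the cost saving the bullet describes.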

💎 The "Nuclear Option": Stealth Scrape (HITL)

Most scrapers try to "act" like a human and fail. ShadowCrawl uses a human when it matters.

stealth_scrape is our flagship tool for high-fidelity rendering. It launches a visible, native Brave Browser instance on your machine.

  • Manual Intervention: If a site asks for a Login or a Puzzle, you solve it once; the agent scrapes the rest.
  • Brave Integration: Uses your actual browser profiles (cookies/sessions) to look like a legitimate user, not a headless bot.
  • Stealth Cleanup: Automatically strips automation markers (navigator.webdriver, etc.) before extraction.

💥 Shattering the "Unscrapable" (Anti-Bot Bypass)

Most scraping APIs surrender when facing enterprise-grade shields. ShadowCrawl is the Hammer that breaks through. We successfully bypass and extract data from:

  • Cloudflare 🛡️ (Turnstile / Challenge Pages)
  • DataDome 🤖 (Interstitial & Behavioral blocks)
  • Akamai 🏰 (Advanced Bot Manager)
  • PerimeterX / HUMAN 👀
  • Kasada & Shape Security 🔐

The Secret? The Cyborg Approach (HITL). ShadowCrawl doesn't just "imitate" a human; it bridges your real, native Brave/Chrome session into the agent's workflow. If a human can see it, ShadowCrawl can scrape it.


📂 Verified Evidence (Boss-Level Targets)

We don't just claim to bypass; we provide the receipts. All evidence below was captured using stealth_scrape (feature flag: non_robot_search) with the Safety Kill Switch enabled (2026-02-14).

| Target Site | Protection | Evidence Size | Data Extracted | Status |
|---|---|---|---|---|
| LinkedIn | Cloudflare + Auth | 413KB (📄 JSON · 📝 Snippet) | 60+ job IDs, listings | ✅ |
| Ticketmaster | Cloudflare Turnstile | 1.1MB (📄 JSON · 📝 Snippet) | Tour dates, venues | ✅ |
| Airbnb | DataDome | 1.8MB (📄 JSON · 📝 Snippet) | 1000+ Tokyo listings | ✅ |
| Upwork | reCAPTCHA | 300KB (📄 JSON · 📝 Snippet) | 160K+ job postings | ✅ |
| Amazon | AWS Shield | 814KB (📄 JSON · 📝 Snippet) | RTX 5070 Ti results | ✅ |
| nowsecure.nl | Cloudflare | 168KB (📄 JSON · 📸 Screenshot) | Manual button tested | ✅ |

📖 Full Documentation: See proof/README.md for verification steps, protection analysis, and quality metrics.


🛠 Features at a Glance

| Feature | Description |
|---|---|
| Search & Discovery | Federated search via SearXNG. Finds what Google hides. |
| Deep Crawling | Recursive, bounded crawling to map entire subdomains. |
| Semantic Memory | (Optional) Qdrant integration for long-term research recall. |
| Proxy Master | Native rotation logic for HTTP/SOCKS5 pools. |
| Hydration Scraper | Specialized logic to extract "hidden" JSON data from React/Next.js sites. |
| Universal Janitor | Automatic removal of popups, cookie banners, and overlays. |
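
To make the Hydration Scraper row concrete, here is a minimal Python sketch of the general technique. It assumes the target is a Next.js pages-router site, which embeds its page props as JSON in a `<script id="__NEXT_DATA__">` tag. The README does not say which hydration formats ShadowCrawl handles or how, so `extract_next_data` is a hypothetical helper, not the project's Rust code.

```python
import json
import re
from typing import Any, Optional

# Next.js (pages router) serializes page props into this script tag.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[Any]:
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid.

    Hypothetical helper for illustration only; not part of ShadowCrawl.
    """
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    html = (
        '<html><body><script id="__NEXT_DATA__" type="application/json">'
        '{"props": {"pageProps": {"title": "Hello"}}}'
        "</script></body></html>"
    )
    data = extract_next_data(html)
    print(data["props"]["pageProps"]["title"])
```

Reading the embedded JSON avoids scraping the rendered DOM, so the data arrives already structured.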

πŸ† Comparison

| Feature | Firecrawl / Jina | ShadowCrawl |
|---|---|---|
| Cost | Monthly Subscription | $0 (Self-hosted) |
| Privacy | They see your data | 100% Private |
| LinkedIn/Airbnb | Often Blocked | 99.99% Success (via HITL) |
| JS Rendering | Cloud-only | Native Brave / Browserless |
| Memory | None | Semantic Research History |

📦 Quick Start (Bypass in 60 Seconds)

1. The Docker Way (Full Stack)

Docker is the fastest way to bring up the full stack (SearXNG, proxy manager, etc.).

Important: Docker mode cannot use the HITL/GUI renderer (stealth_scrape) because containers cannot reliably access your host's native Brave/Chrome window, keyboard hooks, and OS permissions. Use the Native Rust Way below when you want boss-level bypass.

# Clone and Launch
git clone https://github.com/DevsHero/shadowcrawl.git
cd shadowcrawl
docker compose -f docker-compose-local.yml up -d --build

2. The Native Rust Way (Required for stealth_scrape / HITL)

For the 99.99% bypass (HITL), you must run natively (tested on macOS; Windows supported via a verified install guide below).

Build the MCP stdio server with the HITL feature enabled:

cd mcp-server
cargo build --release --bin shadowcrawl-mcp --features non_robot_search

This produces the local MCP binary at:

  • mcp-server/target/release/shadowcrawl-mcp

Prereqs:

  • Install Brave Browser (recommended) or Google Chrome
  • Grant Accessibility permissions (required for the emergency ESC hold-to-abort kill switch)

Windows:

  • Verified setup guide (tested): docs/WINDOWS_SETUP.md

🧩 MCP Integration (Cursor / Claude / VS Code)

ShadowCrawl can run as an MCP server in two modes:

  • Docker MCP server: great for normal scraping/search tools, but cannot do HITL/GUI (stealth_scrape).
  • Local MCP server (shadowcrawl-local): required for HITL tools (a visible Brave/Chrome window).

Option A: Docker MCP server (no stealth_scrape)

Add this to your MCP config to use the Dockerized server:

{
  "mcpServers": {
    "shadowcrawl": {
      "command": "docker",
      "args": [
        "compose",
        "-f",
        "/YOUR_PATH/shadowcrawl/docker-compose-local.yml",
        "exec",
        "-i",
        "-T",
        "shadowcrawl",
        "shadowcrawl-mcp"
      ]
    }
  }
}

Option B: Local MCP server (required for stealth_scrape)

If you want to use HITL tools like stealth_scrape, configure a local MCP server that launches the native binary.

VS Code MCP config example ("servers" format):

{
  "servers": {
    "shadowcrawl-local": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=info",

        // Optional (only if you run the full stack locally):
        "SEARXNG_URL=http://localhost:8890",
        "BROWSERLESS_URL=http://localhost:3010",
        "BROWSERLESS_TOKEN=mcp_stealth_session",
        "QDRANT_URL=http://localhost:6344",

        // Network + limits:
        "HTTP_TIMEOUT_SECS=30",
        "HTTP_CONNECT_TIMEOUT_SECS=10",
        "OUTBOUND_LIMIT=32",
        "MAX_CONTENT_CHARS=10000",
        "MAX_LINKS=100",

        // Optional (proxy manager):
        "IP_LIST_PATH=/YOUR_PATH/shadowcrawl/ip.txt",
        "PROXY_SOURCE_PATH=/YOUR_PATH/shadowcrawl/proxy_source.json",

        // HITL / stealth_scrape quality-of-life:
        // "SHADOWCRAWL_NON_ROBOT_AUTO_ALLOW=1",
        // "SHADOWCRAWL_RENDER_PROFILE_DIR=/YOUR_PROFILE_DIR",
        // "CHROME_EXECUTABLE=/Applications/Brave Browser.app/Contents/MacOS/Brave Browser",

        "/YOUR_PATH/shadowcrawl/mcp-server/target/release/shadowcrawl-mcp"
      ]
    }
  }
}

Notes:

  • MCP tool name: stealth_scrape (internal handler + feature flag name: non_robot_search).
  • For HITL, prefer Brave + a real profile dir (SHADOWCRAWL_RENDER_PROFILE_DIR) so cookies/sessions persist.
  • If you're running via Docker MCP server, HITL tools will either be unavailable or fail (no host GUI).
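
The Option B config relies on `env KEY=VALUE ... program`, which sets the variables and then runs the program, so the binary path must be the last entry in "args". The Python sketch below generates such an entry from a dictionary. `build_server_entry` is a hypothetical helper, not something shipped with ShadowCrawl, and the paths are the same placeholders used above.

```python
import json


def build_server_entry(binary_path: str, env: dict[str, str]) -> dict:
    """Build a stdio server entry that launches `binary_path` through `env`.

    Hypothetical helper for illustration only; not part of ShadowCrawl.
    """
    # Each variable becomes one KEY=VALUE argument, in insertion order.
    args = [f"{key}={value}" for key, value in env.items()]
    # The program to run must come after all KEY=VALUE arguments.
    args.append(binary_path)
    return {"type": "stdio", "command": "env", "args": args}


if __name__ == "__main__":
    entry = build_server_entry(
        "/YOUR_PATH/shadowcrawl/mcp-server/target/release/shadowcrawl-mcp",
        {
            "RUST_LOG": "info",
            "HTTP_TIMEOUT_SECS": "30",
            "SHADOWCRAWL_RENDER_PROFILE_DIR": "/YOUR_PROFILE_DIR",
        },
    )
    print(json.dumps({"servers": {"shadowcrawl-local": entry}}, indent=2))
```

Generating the file this way produces strict JSON without comments, which is convenient for clients that reject `//` lines in their MCP config.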

☕ Acknowledgments & Support

ShadowCrawl is built with ❤️ by a Solo Developer for the open-source community. If this tool helped you bypass a $500/mo API, consider supporting its growth!

  • Found a bug? Open an Issue.
  • Want a feature? Submit a request!
  • Love the project? Star the repo ⭐ or buy me a coffee to fuel more updates!

License: MIT. Free for personal and commercial use.

