Crawl4AI Docs Crawler (with Resume + Retries + S3 Upload)

Crawl full documentation sites into Markdown, with:

  • Sitemap-driven URL index (deterministic, no depth guessing)
  • Checkpointing & resume (safe to stop/restart)
  • Retry logic (up to 3 attempts)
  • Selective re-run of failed pages
  • Optional browser rendering for JS-heavy pages
  • S3 upload support (automatic or manual upload to AWS S3)
  • S3-only workflow (no local files, everything in S3)
  • Domain-based organization (files organized by source domain in S3)
  • Complete state management (URL index and progress stored in S3)

This repository contains several scripts:

  • build_url_index.py — collects in-domain URLs from the site's sitemaps → url_index.json.
  • crawl_test.py — crawls those URLs to Markdown with resume + retry support + optional S3 upload.
  • s3_crawler.py — NEW: S3-only crawler (no local output folders, everything in S3).
  • s3_upload.py — S3 upload utilities and configuration.
  • upload_to_s3.py — standalone script to upload existing crawled files to S3.

1) Requirements

  • Python 3.10+ (3.10, 3.11, 3.12, 3.13 supported)
  • macOS / Linux / Windows supported
  • Disk space: Markdown output + Playwright browsers (if you use --use-browser)

2) Setup

2.1 Create & activate a virtual environment

macOS/Linux

python3 -m venv venv
source venv/bin/activate

Windows (PowerShell)

py -m venv venv
.\venv\Scripts\Activate.ps1

2.2 Install dependencies

Start with the core (fastest, best for most docs/GitHub):

pip install --upgrade pip
pip install crawl4ai boto3

If you expect JS-heavy pages (React/Vue/Angular docs) and want headless browser rendering:

pip install "crawl4ai[full]" boto3
python -m playwright install
# On Linux CI or fresh servers:
# python -m playwright install --with-deps

You can switch browser mode on/off per run with a flag; no code changes needed.


3) Build the URL index (once per docs site)

This step discovers all in-domain pages via sitemaps.

python build_url_index.py https://docs.example.com

Output:

  • url_index.json — sorted list of in-domain URLs

Tip: If the domain has multiple sitemap indexes, the script follows them automatically. If no sitemap is found, it falls back to the base URL.
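
For reference, the discovery amounts to walking sitemap indexes recursively and keeping only in-domain page URLs. A minimal sketch using only the standard library (illustrative, not the exact code in build_url_index.py):

# Minimal sketch of sitemap-driven URL discovery (illustrative only).
# Follows sitemap indexes recursively and keeps in-domain URLs.
import json
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_xml(url: str) -> ET.Element:
    with urllib.request.urlopen(url, timeout=30) as resp:
        return ET.fromstring(resp.read())

def collect_urls(sitemap_url: str, domain: str, found: set) -> None:
    root = fetch_xml(sitemap_url)
    if root.tag.endswith("sitemapindex"):          # sitemap index -> recurse into child sitemaps
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            collect_urls(loc.text.strip(), domain, found)
    else:                                          # regular urlset -> collect page URLs
        for loc in root.findall("sm:url/sm:loc", NS):
            url = loc.text.strip()
            if urlparse(url).netloc == domain:     # keep in-domain URLs only
                found.add(url)

base = "https://docs.example.com"
domain = urlparse(base).netloc
urls: set = set()
try:
    collect_urls(f"{base}/sitemap.xml", domain, urls)
except Exception:
    urls.add(base)                                 # no sitemap found -> fall back to the base URL

with open("url_index.json", "w") as f:
    json.dump(sorted(urls), f, indent=2)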


4) Crawl the URLs to Markdown

4.1 First run (static fetch, no browser)

python crawl_test.py

What happens:

  • Creates a new run folder: output/YYYY-MM-DD_HH-MM-SS/

  • Saves each page to *.md

  • Writes progress to progress.json:

    • completed: list of succeeded URLs
    • pending: remaining URLs
    • failed: exhausted retries
    • attempts: per-URL retry counts

4.2 Resume automatically (after crash/stop)

Just run the same command again:

python crawl_test.py

The script auto-resumes the most recent unfinished run.
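
Under the hood, "most recent unfinished run" can be determined from the run folders' progress.json files. A sketch of that selection logic (illustrative; the actual logic lives in crawl_test.py):

# Illustrative sketch of auto-resume: pick the newest run folder whose
# progress.json still has pending URLs; otherwise a new run is started.
import json
from pathlib import Path

def find_unfinished_run(base: str = "output") -> Path | None:
    runs = sorted(Path(base).glob("*/progress.json"), reverse=True)  # newest first (timestamped names)
    for progress_file in runs:
        prog = json.loads(progress_file.read_text())
        if prog.get("pending"):          # work left -> resume this run
            return progress_file.parent
    return None                          # nothing unfinished -> create a new run folder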

4.3 Force a specific run folder (optional)

RUN_FOLDER=output/2025-09-12_14-47-29 python crawl_test.py

4.4 Retry only failed pages

python crawl_test.py --retry-failed

This moves failed → pending (and clears attempts), then crawls just those.
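
In terms of progress.json, the reset amounts to something like this (a sketch with an example run path, not the script's exact code):

# Illustrative sketch of --retry-failed: failed URLs go back to pending
# and their attempt counters are cleared.
import json

path = "output/2025-09-12_14-47-29/progress.json"   # example run path
with open(path) as f:
    prog = json.load(f)

prog["pending"].extend(prog["failed"])
for url in prog["failed"]:
    prog["attempts"].pop(url, None)      # reset per-URL retry counts
prog["failed"] = []

with open(path, "w") as f:
    json.dump(prog, f, indent=2)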

4.5 Use browser rendering (for JS-heavy pages)

First ensure you installed browsers (see 2.2), then:

python crawl_test.py --use-browser

Combine with failed-only retry:

python crawl_test.py --retry-failed --use-browser

5) S3 Upload Support

The crawler now supports automatic upload of crawled documents to AWS S3. You can enable S3 uploads during crawling or upload existing files separately.

5.1 S3 Configuration

Set these environment variables to configure S3 uploads:

# Required
export S3_BUCKET_NAME="your-bucket-name"

# Optional (uses AWS credentials from environment/IAM roles if not set)
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="us-east-1"  # default

# Optional S3 settings
export S3_PREFIX="crawled-docs/"  # default
export S3_CONTENT_TYPE="text/markdown"  # default
export S3_OVERWRITE="true"  # default: true
export S3_UPLOAD_METADATA="true"  # default: true

5.2 Automatic S3 Upload During Crawling

Enable S3 uploads during crawling:

# Set environment variable
export UPLOAD_TO_S3=true

# Or use command line flag
python crawl_test.py --upload-s3

The crawler will:

  • Upload files to S3 in batches (default: every 50 completed files)
  • Organize files by domain in S3 folders (e.g., docs_example_com/, github_com/)
  • Track upload progress in progress.json
  • Resume uploads if interrupted
  • Skip files that already exist in S3 when S3_OVERWRITE=false (the default is true, i.e. overwrite); see the sketch below
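
A minimal boto3 sketch of such a batched, skip-aware upload, assuming the environment variables from 5.1; the helper names are illustrative:

# Illustrative boto3 sketch of a batched, skip-aware upload.
import os
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name=os.getenv("AWS_REGION", "us-east-1"))
bucket = os.environ["S3_BUCKET_NAME"]
prefix = os.getenv("S3_PREFIX", "crawled-docs/")
overwrite = os.getenv("S3_OVERWRITE", "true").lower() == "true"

def object_exists(key: str) -> bool:
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:
        return False

def upload_batch(local_files: list[str], domain_folder: str) -> None:
    for path in local_files:
        key = f"{prefix}{domain_folder}/{os.path.basename(path)}"
        if not overwrite and object_exists(key):
            continue                                   # skip objects that already exist
        s3.upload_file(path, bucket, key,
                       ExtraArgs={"ContentType": "text/markdown"})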

5.3 Manual S3 Upload (Existing Files)

Upload files from existing crawl runs:

# Upload from specific run folder
python upload_to_s3.py output/2025-01-15_14-30-00

# Upload from most recent run folder
python upload_to_s3.py --latest

# Upload from all run folders
python upload_to_s3.py --all

# Dry run to see what would be uploaded
python upload_to_s3.py --latest --dry-run

5.4 S3 Upload Modes

During Crawling:

# Crawl with automatic S3 upload
python crawl_test.py --upload-s3

# Crawl normally, then upload to S3 only
python crawl_test.py --upload-s3-only

Standalone Upload:

# Upload specific run folder
python upload_to_s3.py output/2025-01-15_14-30-00

# Upload most recent run
python upload_to_s3.py --latest

# Upload all runs
python upload_to_s3.py --all

5.5 Domain-Based Folder Structure

Files are automatically organized by source domain in S3:

s3://your-bucket/crawled-docs/
├── docs_example_com/
│   ├── getting-started.md
│   ├── installation.md
│   └── api-reference.md
├── github_com/
│   ├── README.md
│   └── CONTRIBUTING.md
└── stackoverflow_com/
    ├── question-123.md
    └── answer-456.md

Domain Processing:

  • www.example.com → example_com/
  • docs.example.com → docs_example_com/
  • github.com → github_com/
  • Dots are replaced with underscores for S3 key safety
  • The www. prefix is automatically removed (see the conversion sketch below)
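
A sketch of that conversion (illustrative; the scripts contain the actual logic):

# Illustrative sketch: turn a URL's domain into an S3 folder name.
from urllib.parse import urlparse

def domain_folder(url: str) -> str:
    netloc = urlparse(url).netloc          # empty if the URL lacks http(s)://
    if not netloc:
        return "unknown_domain"            # see the FAQ entry on unknown_domain/
    if netloc.startswith("www."):
        netloc = netloc[len("www."):]      # drop the www. prefix
    return netloc.replace(".", "_")        # dots -> underscores for S3 key safety

print(domain_folder("https://docs.example.com/getting-started"))  # docs_example_com
print(domain_folder("https://www.example.com/"))                  # example_com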

5.6 S3 Upload Progress Tracking

The crawler tracks S3 uploads in progress.json:

{
  "completed": [...],
  "failed": [...],
  "pending": [...],
  "s3_uploads": {
    "uploaded": [
      {
        "file": "page1.md",
        "s3_key": "crawled-docs/page1.md",
        "file_size": 1234,
        "message": "Upload successful"
      }
    ],
    "skipped": [...],
    "failed": [...],
    "last_upload": {
      "timestamp": "1640995200",
      "bucket": "my-bucket",
      "prefix": "crawled-docs/",
      "summary": {
        "uploaded": 10,
        "skipped": 2,
        "failed": 0
      }
    }
  }
}
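
To summarize that section quickly, a few lines of Python are enough (field names as shown above; the run path is an example):

# Summarize the s3_uploads section of a run's progress.json.
import json

with open("output/2025-09-12_14-47-29/progress.json") as f:   # example run folder
    s3 = json.load(f).get("s3_uploads", {})

print("uploaded:", len(s3.get("uploaded", [])))
print("skipped: ", len(s3.get("skipped", [])))
print("failed:  ", len(s3.get("failed", [])))
print("last batch summary:", s3.get("last_upload", {}).get("summary"))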

5.7 S3 Upload Configuration Options

Environment Variable Default Description
S3_BUCKET_NAME required S3 bucket name
AWS_REGION us-east-1 AWS region
S3_PREFIX crawled-docs/ S3 prefix for uploaded files
S3_CONTENT_TYPE text/markdown Content type for uploaded files
S3_OVERWRITE true Overwrite existing files in S3
S3_UPLOAD_METADATA true Upload metadata with files
UPLOAD_TO_S3 false Enable S3 uploads during crawling
S3_UPLOAD_BATCH_SIZE 50 Upload to S3 every N completed files
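
A sketch of how these variables could be collected into one config object, with the defaults from the table (illustrative; the repo's own parsing presumably lives in s3_upload.py):

# Illustrative sketch: read the S3 settings from the environment with the
# defaults listed in the table above (not the repo's actual config code).
import os
from dataclasses import dataclass

def env_bool(name: str, default: bool) -> bool:
    return os.getenv(name, str(default)).strip().lower() in ("1", "true", "yes")

@dataclass
class S3Config:
    bucket: str = os.environ.get("S3_BUCKET_NAME", "")        # required
    region: str = os.getenv("AWS_REGION", "us-east-1")
    prefix: str = os.getenv("S3_PREFIX", "crawled-docs/")
    content_type: str = os.getenv("S3_CONTENT_TYPE", "text/markdown")
    overwrite: bool = env_bool("S3_OVERWRITE", True)
    upload_metadata: bool = env_bool("S3_UPLOAD_METADATA", True)
    upload_to_s3: bool = env_bool("UPLOAD_TO_S3", False)
    batch_size: int = int(os.getenv("S3_UPLOAD_BATCH_SIZE", "50"))

cfg = S3Config()
if not cfg.bucket:
    raise SystemExit("S3_BUCKET_NAME environment variable is required")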

5.8 S3-Only Workflow (No Local Output Folders)

NEW: Use s3_crawler.py for a completely S3-based workflow:

# Build URL index and crawl (everything in S3)
python s3_crawler.py https://docs.example.com

# Build index only
python s3_crawler.py --build-index https://docs.example.com

# Crawl only (using existing index from S3)
python s3_crawler.py --crawl-only

# Retry failed URLs
python s3_crawler.py --crawl-only --retry-failed

Benefits:

  • No local output folders - everything stored in S3
  • URL index in S3 - shared across runs, stored in domain folder
  • Progress tracking in S3 - resume from anywhere, stored in domain folder
  • Domain-based folders - organized by source domain
  • Automatic updates - overwrite existing files
  • Complete isolation - each domain gets its own folder with all state files

S3 Structure:

s3://your-bucket/crawled-docs/
├── docs_example_com/           # Domain folder
│   ├── url_index.json         # URL list for this domain
│   ├── progress.json          # Crawl progress for this domain
│   ├── page1.md               # Crawled content
│   └── page2.md
├── github_com/                # Another domain
│   ├── url_index.json
│   ├── progress.json
│   └── README.md
└── stackoverflow_com/         # Yet another domain
    ├── url_index.json
    ├── progress.json
    └── question-123.md
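
With this layout, per-domain state is just a pair of JSON objects stored next to the Markdown files. A hedged boto3 sketch of loading and saving that state (key names match the tree above; helper names are illustrative):

# Illustrative sketch of S3-only state handling: url_index.json and
# progress.json live inside each domain folder alongside the crawled pages.
import json
import os
import boto3

s3 = boto3.client("s3", region_name=os.getenv("AWS_REGION", "us-east-1"))
bucket = os.environ["S3_BUCKET_NAME"]
prefix = os.getenv("S3_PREFIX", "crawled-docs/")

def load_state(domain_folder: str, name: str, default):
    key = f"{prefix}{domain_folder}/{name}"
    try:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        return json.loads(body)
    except s3.exceptions.NoSuchKey:
        return default

def save_state(domain_folder: str, name: str, data) -> None:
    key = f"{prefix}{domain_folder}/{name}"
    s3.put_object(Bucket=bucket, Key=key,
                  Body=json.dumps(data, indent=2).encode(),
                  ContentType="application/json")

urls = load_state("docs_example_com", "url_index.json", [])
progress = load_state("docs_example_com", "progress.json",
                      {"completed": [], "pending": urls, "failed": [], "attempts": {}})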

5.9 Traditional S3 Upload Examples

Basic crawl with S3 upload:

export S3_BUCKET_NAME="my-docs-bucket"
export UPLOAD_TO_S3=true
python crawl_test.py

Custom S3 settings:

export S3_BUCKET_NAME="my-bucket"
export S3_PREFIX="docs/2025/"
export S3_OVERWRITE=false
python crawl_test.py --upload-s3

Upload existing files:

export S3_BUCKET_NAME="my-bucket"
python upload_to_s3.py --latest

Dry run to test configuration:

export S3_BUCKET_NAME="my-bucket"
python upload_to_s3.py --latest --dry-run

6) Tuning knobs (in crawl_test.py)

At the top of the file:

OUTPUT_BASE = "output"
BATCH_SIZE = 10        # smaller = gentler for flaky sites (try 5 or 1 if needed)
SAVE_EVERY = 10        # checkpoint frequency
USE_BROWSER = False    # overridden by --use-browser flag
MAX_RETRIES = 3        # per-URL retries across runs

Recommendations

  • Start with USE_BROWSER=False (faster).
  • If some pages return empty or rely on JS → re-run those with --use-browser.
  • For stubborn domains, set BATCH_SIZE = 5 or even 1 to reduce concurrency.

7) Verifying results

7.1 Quick count

# macOS/Linux
jq '(.completed | length), (.failed | length), (.pending | length)' output/*/progress.json | xargs -n3 printf 'completed=%s failed=%s pending=%s\n'

If you don’t have jq, use Python:

python - << 'PY'
import json, glob
p = sorted(glob.glob('output/*/progress.json'))[-1]
d = json.load(open(p))
print("Run:", p)
print("completed:", len(d["completed"]), "failed:", len(d["failed"]), "pending:", len(d["pending"]))
PY

7.2 Spot-check a file

Open a few *.md files under output/<run>/ to verify formatting (headings, lists, code blocks).


8) Common tasks & recipes

8.1 Finish a handful of failed pages

# Retry only failed, without browser (often works for docs)
python crawl_test.py --retry-failed

# If still failing, use browser + small batches:
# (Set BATCH_SIZE=5 or 1 in the script first if necessary)
python crawl_test.py --retry-failed --use-browser

8.2 Re-run everything from scratch (keeping outputs)

  • Delete progress.json inside your run folder, or just start a new run (default behavior if no unfinished run is found).

8.3 Run only a subset of URLs

Edit url_index.json to keep a subset (backup the original first).


9) Troubleshooting

A) TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Symptoms

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Cause

You're using Python 3.9 or earlier, but crawl4ai requires Python 3.10+ (it uses the X | Y union type syntax introduced in 3.10).

Fix

  1. Check your Python version:

    python3 --version
  2. If you have Python 3.10+ available, use it:

    # Use specific Python version
    python3.13 crawl_test.py
    
    # Or activate a virtual environment created with Python 3.10+
    source venv/bin/activate
    python crawl_test.py
  3. If you only have Python 3.9, upgrade:

    # macOS with Homebrew
    brew install python@3.13
    
    # Then recreate your virtual environment
    python3.13 -m venv venv
    source venv/bin/activate
    pip install crawl4ai

B) Playwright missing / "Executable doesn't exist… playwright install"

Symptoms

BrowserType.launch: Executable doesn't exist...
Looks like Playwright was just installed... run:
playwright install

Fix

pip install "crawl4ai[full]"
python -m playwright install

Then re-run with --use-browser only if needed.


C) 'AsyncWebCrawler' object has no attribute 'abatch_run'

You’re on Crawl4AI 0.7.x. Use:

await crawler.arun_many(urls=urls, config=CrawlerRunConfig(stream=False))

(Your current script already uses arun_many.)
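
For reference, a minimal end-to-end example built from the calls shown in this README (it assumes the result objects expose .url, .success, and .markdown, which matches recent crawl4ai releases; verify against your installed version):

# Hedged minimal example of batch crawling with crawl4ai 0.7.x, based on the
# calls shown in this README. Check the result attributes against your install.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main(urls: list[str]) -> None:
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=CrawlerRunConfig(stream=False))
        for result in results:
            if result.success:
                print(result.url, "->", len(str(result.markdown)), "chars of Markdown")
            else:
                print("FAILED:", result.url)

asyncio.run(main(["https://docs.example.com/"]))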


D) CrawlerRunConfig.__init__() got an unexpected keyword argument 'browser'

In 0.7.x, browser is not a CrawlerRunConfig arg. Control the browser only here:

async with AsyncWebCrawler(use_browser=USE_BROWSER) as crawler:
    ...

E) CrawlerRunConfig.__init__() got an unexpected keyword argument 'navigation_timeout_ms'

0.7.x doesn’t expose that arg in CrawlerRunConfig. Use defaults, smaller batches, and choose browser vs. non-browser mode. (Your current script already removed that arg.)


F) Playwright navigation timeout (e.g., Arbitrum docs)

Symptoms

Page.goto: Timeout 60000ms exceeded... waiting until "domcontentloaded"

Fix options

  1. Non-browser pass (often succeeds for docs):

    python crawl_test.py --retry-failed
  2. Browser pass with tiny batches (reduces contention):

    • Set BATCH_SIZE = 5 (or 1) in the script.

    • Run:

      python crawl_test.py --retry-failed --use-browser
  3. If a few still fail: run the retry multiple times; they’ll either move to completed or stay in failed after 3 attempts.


G) Job stopped “early” (e.g., completed 550/1050)

This happens if you loop over a shrinking list with index math. Fix (already in your script): take each batch from a snapshot of the pending list:

while prog["pending"]:
    batch_urls = prog["pending"][:BATCH_SIZE]
    await process_batch(...)
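
Spelled out a little more, the pattern looks like this (a sketch; process_batch and save_progress stand in for the script's own routines, and process_batch is assumed to fill completed/failed/attempts):

# Sketch of the snapshot-batching pattern: slice a fixed batch off the front,
# process it, then rebuild pending from what is no longer outstanding, so the
# loop terminates even though the list shrinks between iterations.
while prog["pending"]:
    batch_urls = prog["pending"][:BATCH_SIZE]                      # snapshot, no index math
    await process_batch(batch_urls, prog)                          # stand-in: updates completed/failed/attempts
    finished = set(prog["completed"]) | set(prog["failed"])
    prog["pending"] = [u for u in prog["pending"] if u not in finished]
    save_progress(prog)                                            # checkpoint after each batch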

H) Resuming didn’t pick the right folder

Force it:

RUN_FOLDER=output/2025-09-12_14-47-29 python crawl_test.py

10) File structure

.
├── venv/
├── output/
│   └── 2025-09-12_14-47-29/
│       ├── <page-1>.md
│       ├── <page-2>.md
│       └── progress.json
├── url_index.json
├── build_url_index.py
└── crawl_test.py

11) FAQ

Q: Do I need crawl4ai[full]? A: Only for JS-heavy pages. Start with crawl4ai (static). Use --use-browser for stubborn pages.

Q: Do hash links like #section-3 create multiple files? A: No. They’re anchor sections within the same page; you’ll get one Markdown file for the page.

Q: How do retries work? A: Each failing URL increments a counter in progress.json. After MAX_RETRIES (default 3), it’s moved to failed. --retry-failed moves them back to pending and clears attempts.

Q: Can I run multiple sites? A: Yes. Build a different url_index.json per site (or stash them with different names), and run the crawler per site.

Q: S3 upload fails with "S3_BUCKET_NAME environment variable is required"? A: Make sure you have a .env file with your S3 configuration, or set environment variables directly:

export S3_BUCKET_NAME="your-bucket-name"
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"

Q: Files go to "unknown_domain/" folder in S3? A: This happens when the domain extraction fails. Make sure your URL starts with http:// or https://. The crawler extracts the domain from the base URL to create the folder name.

Q: How do I find my AWS credentials? A:

  • AWS Console: IAM → Users → Your User → Security Credentials → Create Access Key
  • AWS CLI: aws configure (stores in ~/.aws/credentials)
  • Environment: Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
  • IAM Roles: If running on EC2, use IAM roles (no credentials needed)

Q: How do I make my S3 bucket public? A: Add this bucket policy (replace your-bucket-name):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}

Q: What's the S3_PREFIX for? A: The prefix creates a folder structure in your S3 bucket. With S3_PREFIX="crawled-docs/", files go to s3://bucket/crawled-docs/domain_folder/file.md. Useful for organizing multiple crawls or projects in the same bucket.
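
For example, the final object key is simply prefix + domain folder + filename:

# Illustrative key construction: prefix + domain folder + filename.
prefix = "crawled-docs/"
domain_folder = "docs_example_com"
filename = "getting-started.md"
key = f"{prefix}{domain_folder}/{filename}"
print(f"s3://my-bucket/{key}")   # s3://my-bucket/crawled-docs/docs_example_com/getting-started.md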


12) One-liners (handy)

Create venv + install core:

python3 -m venv venv && source venv/bin/activate && pip install --upgrade pip && pip install crawl4ai boto3

Install full (browser) stack:

pip install "crawl4ai[full]" boto3 && python -m playwright install

Build index + crawl (static):

python build_url_index.py https://docs.example.com && python crawl_test.py

S3-only crawl (no local folders):

python s3_crawler.py https://docs.example.com

S3-only with environment file:

# Create .env file with S3 config, then:
python s3_crawler.py https://docs.example.com

Crawl with S3 upload (traditional):

export S3_BUCKET_NAME="my-bucket" && python crawl_test.py --upload-s3

Upload existing files to S3:

export S3_BUCKET_NAME="my-bucket" && python upload_to_s3.py --latest

Complete S3 setup (create .env + crawl):

echo "S3_BUCKET_NAME=my-bucket" > .env && echo "AWS_REGION=us-east-1" >> .env && python s3_crawler.py https://docs.example.com

Retry only failed (browser on):

python crawl_test.py --retry-failed --use-browser

Force resume a specific run:

RUN_FOLDER=output/2025-09-12_14-47-29 python crawl_test.py

Dry run S3 upload:

export S3_BUCKET_NAME="my-bucket" && python upload_to_s3.py --latest --dry-run
