Crawl full documentation sites into Markdown, with:
- ✅ Sitemap-driven URL index (deterministic, no depth guessing)
- ✅ Checkpointing & resume (safe to stop/restart)
- ✅ Retry logic (up to 3 attempts)
- ✅ Selective re-run of failed pages
- ✅ Optional browser rendering for JS-heavy pages
- ✅ S3 upload support (automatic or manual upload to AWS S3)
- ✅ S3-only workflow (no local files, everything in S3)
- ✅ Domain-based organization (files organized by source domain in S3)
- ✅ Complete state management (URL index and progress stored in S3)
This folder contains several scripts:
- `build_url_index.py` — collects in-domain URLs from the site's sitemaps → `url_index.json`.
- `crawl_test.py` — crawls those URLs to Markdown with resume, retry, and optional S3 upload.
- `s3_crawler.py` — NEW: S3-only crawler (no local output folders, everything in S3).
- `s3_upload.py` — S3 upload utilities and configuration.
- `upload_to_s3.py` — standalone script to upload existing crawled files to S3.
- Python 3.10+ (3.10, 3.11, 3.12, 3.13 supported)
- macOS / Linux / Windows supported
- Disk space: Markdown output + Playwright browsers (if you use `--use-browser`)
macOS/Linux
```bash
python3 -m venv venv
source venv/bin/activate
```

Windows (PowerShell)

```powershell
py -m venv venv
.\venv\Scripts\Activate.ps1
```

Start with the core (fastest, best for most docs/GitHub):

```bash
pip install --upgrade pip
pip install crawl4ai boto3
```

If you expect JS-heavy pages (React/Vue/Angular docs) and want headless browser rendering:

```bash
pip install "crawl4ai[full]" boto3
python -m playwright install
# On Linux CI or fresh servers:
# python -m playwright install --with-deps
```

You can switch browser mode on/off per run with a flag; no code changes needed.
This step discovers all in-domain pages via sitemaps.
```bash
python build_url_index.py https://docs.example.com
```

Output:

- `url_index.json` — sorted list of in-domain URLs

Tip: If the domain has multiple sitemap indexes, the script follows them automatically. If no sitemap is found, it falls back to the base URL.
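For reference, the sitemap walk can be sketched roughly like this. This is an illustrative sketch only, not the actual contents of `build_url_index.py`; the function and variable names are invented for the example:

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def collect_urls(base_url: str) -> list[str]:
    """Follow sitemap indexes recursively and return sorted in-domain page URLs."""
    domain = urlparse(base_url).netloc
    to_visit = [urljoin(base_url, "/sitemap.xml")]
    pages: set[str] = set()
    while to_visit:
        sitemap_url = to_visit.pop()
        try:
            root = ET.fromstring(urlopen(sitemap_url, timeout=30).read())
        except Exception:
            continue  # unreachable or malformed sitemap: skip it
        locs = [loc.text for loc in root.findall(".//sm:loc", NS) if loc.text]
        if root.tag.endswith("sitemapindex"):   # an index of further sitemaps
            to_visit.extend(locs)
        else:                                   # a regular urlset of pages
            pages |= {u for u in locs if urlparse(u).netloc == domain}
    return sorted(pages) or [base_url]          # fall back to the base URL

if __name__ == "__main__":
    urls = collect_urls("https://docs.example.com")
    with open("url_index.json", "w") as f:
        json.dump(urls, f, indent=2)
```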
```bash
python crawl_test.py
```

What happens:

- Creates a new run folder: `output/YYYY-MM-DD_HH-MM-SS/`
- Saves each page to `*.md`
- Writes progress to `progress.json`:
  - `completed`: list of succeeded URLs
  - `pending`: remaining URLs
  - `failed`: exhausted retries
  - `attempts`: per-URL retry counts
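Conceptually, the bookkeeping after each batch looks something like this. This is a hedged sketch of the idea, not the exact code in `crawl_test.py`:

```python
MAX_RETRIES = 3  # matches the default described in this README

def record_result(prog: dict, url: str, ok: bool) -> None:
    """Move a URL between pending/completed/failed and count its attempts."""
    prog["pending"].remove(url)
    if ok:
        prog["completed"].append(url)
        return
    attempts = prog["attempts"].get(url, 0) + 1
    prog["attempts"][url] = attempts
    if attempts >= MAX_RETRIES:
        prog["failed"].append(url)      # exhausted retries
    else:
        prog["pending"].append(url)     # will be retried on a later pass
```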
Just run the same command again:
```bash
python crawl_test.py
```

The script auto-resumes the most recent unfinished run.

To force a specific run folder:

```bash
RUN_FOLDER=output/2025-09-12_14-47-29 python crawl_test.py
```

To re-crawl only the pages that failed:

```bash
python crawl_test.py --retry-failed
```

This moves failed → pending (and clears attempts), then crawls just those.
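The reset itself amounts to a small bookkeeping step on `progress.json`. The snippet below is a hedged sketch; the real script may differ in details:

```python
import json

def reset_failed(progress_path: str) -> None:
    """Move failed URLs back to pending and clear their retry counters."""
    with open(progress_path) as f:
        prog = json.load(f)
    for url in prog["failed"]:
        prog["pending"].append(url)
        prog["attempts"].pop(url, None)   # forget previous attempts
    prog["failed"] = []
    with open(progress_path, "w") as f:
        json.dump(prog, f, indent=2)

# reset_failed("output/2025-09-12_14-47-29/progress.json")
```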
First ensure you installed browsers (see 2.2), then:
```bash
python crawl_test.py --use-browser
```

Combine with failed-only retry:

```bash
python crawl_test.py --retry-failed --use-browser
```

The crawler now supports automatic upload of crawled documents to AWS S3. You can enable S3 uploads during crawling or upload existing files separately.
Set these environment variables to configure S3 uploads:
```bash
# Required
export S3_BUCKET_NAME="your-bucket-name"

# Optional (uses AWS credentials from environment/IAM roles if not set)
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="us-east-1"            # default

# Optional S3 settings
export S3_PREFIX="crawled-docs/"         # default
export S3_CONTENT_TYPE="text/markdown"   # default
export S3_OVERWRITE="true"               # default: true
export S3_UPLOAD_METADATA="true"         # default: true
```

Enable S3 uploads during crawling:
```bash
# Set environment variable
export UPLOAD_TO_S3=true

# Or use command line flag
python crawl_test.py --upload-s3
```

```bash
# Crawl with S3 upload enabled
python crawl_test.py --upload-s3
```

The crawler will:
- Upload files to S3 in batches (default: every 50 completed files)
- Organize files by domain in S3 folders (e.g., `docs_example_com/`, `github_com/`)
- Track upload progress in `progress.json`
- Resume uploads if interrupted
- Skip files that already exist in S3 (unless `S3_OVERWRITE=true`)
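As a rough illustration of the batch-upload step, the sketch below uses boto3 directly. It is not the actual `s3_upload.py` code; the helper names are invented for the example:

```python
import os
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name=os.environ.get("AWS_REGION", "us-east-1"))
BUCKET = os.environ["S3_BUCKET_NAME"]
PREFIX = os.environ.get("S3_PREFIX", "crawled-docs/")
OVERWRITE = os.environ.get("S3_OVERWRITE", "true").lower() == "true"

def s3_object_exists(key: str) -> bool:
    """Check whether an object is already present in the bucket."""
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False

def upload_batch(local_paths: list[str], domain_folder: str) -> dict:
    """Upload a batch of Markdown files under <prefix>/<domain_folder>/."""
    summary = {"uploaded": 0, "skipped": 0}
    for path in local_paths:
        key = f"{PREFIX}{domain_folder}/{os.path.basename(path)}"
        if not OVERWRITE and s3_object_exists(key):
            summary["skipped"] += 1
            continue
        s3.upload_file(path, BUCKET, key,
                       ExtraArgs={"ContentType": "text/markdown"})
        summary["uploaded"] += 1
    return summary
```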
Upload files from existing crawl runs:
```bash
# Upload from specific run folder
python upload_to_s3.py output/2025-01-15_14-30-00

# Upload from most recent run folder
python upload_to_s3.py --latest

# Upload from all run folders
python upload_to_s3.py --all

# Dry run to see what would be uploaded
python upload_to_s3.py --latest --dry-run
```

During Crawling:
```bash
# Crawl with automatic S3 upload
python crawl_test.py --upload-s3

# Crawl normally, then upload to S3 only
python crawl_test.py --upload-s3-only
```

Standalone Upload:

```bash
# Upload specific run folder
python upload_to_s3.py output/2025-01-15_14-30-00

# Upload most recent run
python upload_to_s3.py --latest

# Upload all runs
python upload_to_s3.py --all
```

Files are automatically organized by source domain in S3:
```
s3://your-bucket/crawled-docs/
├── docs_example_com/
│   ├── getting-started.md
│   ├── installation.md
│   └── api-reference.md
├── github_com/
│   ├── README.md
│   └── CONTRIBUTING.md
└── stackoverflow_com/
    ├── question-123.md
    └── answer-456.md
```
Domain Processing:

- `www.example.com` → `example_com/`
- `docs.example.com` → `docs_example_com/`
- `github.com` → `github_com/`
- Dots are replaced with underscores for S3 key safety
- The `www.` prefix is automatically removed
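The folder name can be derived with a few lines of standard-library code. This is an illustrative sketch, not necessarily how the crawler implements it:

```python
from urllib.parse import urlparse

def domain_folder(url: str) -> str:
    """Turn https://docs.example.com/page into 'docs_example_com'."""
    netloc = urlparse(url).netloc.lower()
    if not netloc:                     # e.g. URL missing the http(s):// scheme
        return "unknown_domain"
    if netloc.startswith("www."):      # drop the www. prefix
        netloc = netloc[len("www."):]
    return netloc.replace(".", "_")    # dots -> underscores for S3 key safety

assert domain_folder("https://www.example.com/") == "example_com"
assert domain_folder("https://docs.example.com/guide") == "docs_example_com"
```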
The crawler tracks S3 uploads in progress.json:
```json
{
  "completed": [...],
  "failed": [...],
  "pending": [...],
  "s3_uploads": {
    "uploaded": [
      {
        "file": "page1.md",
        "s3_key": "crawled-docs/page1.md",
        "file_size": 1234,
        "message": "Upload successful"
      }
    ],
    "skipped": [...],
    "failed": [...],
    "last_upload": {
      "timestamp": "1640995200",
      "bucket": "my-bucket",
      "prefix": "crawled-docs/",
      "summary": {
        "uploaded": 10,
        "skipped": 2,
        "failed": 0
      }
    }
  }
}
```

| Environment Variable | Default | Description |
|---|---|---|
| `S3_BUCKET_NAME` | required | S3 bucket name |
| `AWS_REGION` | `us-east-1` | AWS region |
| `S3_PREFIX` | `crawled-docs/` | S3 prefix for uploaded files |
| `S3_CONTENT_TYPE` | `text/markdown` | Content type for uploaded files |
| `S3_OVERWRITE` | `true` | Overwrite existing files in S3 |
| `S3_UPLOAD_METADATA` | `true` | Upload metadata with files |
| `UPLOAD_TO_S3` | `false` | Enable S3 uploads during crawling |
| `S3_UPLOAD_BATCH_SIZE` | `50` | Upload to S3 every N completed files |
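These variables might be read into a config object along the following lines. This is a hedged sketch; `s3_upload.py` may structure its configuration differently:

```python
import os
from dataclasses import dataclass

@dataclass
class S3Config:
    bucket: str
    region: str = "us-east-1"
    prefix: str = "crawled-docs/"
    content_type: str = "text/markdown"
    overwrite: bool = True
    upload_metadata: bool = True
    batch_size: int = 50

def load_s3_config() -> S3Config:
    """Build the config from environment variables, applying the defaults above."""
    bucket = os.environ.get("S3_BUCKET_NAME")
    if not bucket:
        raise RuntimeError("S3_BUCKET_NAME environment variable is required")
    def truthy(name: str, default: str) -> bool:
        return os.environ.get(name, default).lower() == "true"
    return S3Config(
        bucket=bucket,
        region=os.environ.get("AWS_REGION", "us-east-1"),
        prefix=os.environ.get("S3_PREFIX", "crawled-docs/"),
        content_type=os.environ.get("S3_CONTENT_TYPE", "text/markdown"),
        overwrite=truthy("S3_OVERWRITE", "true"),
        upload_metadata=truthy("S3_UPLOAD_METADATA", "true"),
        batch_size=int(os.environ.get("S3_UPLOAD_BATCH_SIZE", "50")),
    )
```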
NEW: Use s3_crawler.py for a completely S3-based workflow:
```bash
# Build URL index and crawl (everything in S3)
python s3_crawler.py https://docs.example.com

# Build index only
python s3_crawler.py --build-index https://docs.example.com

# Crawl only (using existing index from S3)
python s3_crawler.py --crawl-only

# Retry failed URLs
python s3_crawler.py --crawl-only --retry-failed
```

Benefits:
- ✅ No local output folders - everything stored in S3
- ✅ URL index in S3 - shared across runs, stored in domain folder
- ✅ Progress tracking in S3 - resume from anywhere, stored in domain folder
- ✅ Domain-based folders - organized by source domain
- ✅ Automatic updates - overwrite existing files
- ✅ Complete isolation - each domain gets its own folder with all state files
S3 Structure:
```
s3://your-bucket/crawled-docs/
├── docs_example_com/          # Domain folder
│   ├── url_index.json         # URL list for this domain
│   ├── progress.json          # Crawl progress for this domain
│   ├── page1.md               # Crawled content
│   └── page2.md
├── github_com/                # Another domain
│   ├── url_index.json
│   ├── progress.json
│   └── README.md
└── stackoverflow_com/         # Yet another domain
    ├── url_index.json
    ├── progress.json
    └── question-123.md
```
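Keeping state in S3 boils down to reading and writing small JSON objects under the domain folder, roughly like this. This is an illustrative sketch with boto3; the function names are invented and not the actual `s3_crawler.py` API:

```python
import json
import os
import boto3

s3 = boto3.client("s3")
BUCKET = os.environ["S3_BUCKET_NAME"]
PREFIX = os.environ.get("S3_PREFIX", "crawled-docs/")

def load_state(domain_folder: str, name: str, default):
    """Fetch e.g. progress.json or url_index.json for a domain, or return a default."""
    key = f"{PREFIX}{domain_folder}/{name}"
    try:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        return json.loads(body)
    except s3.exceptions.NoSuchKey:
        return default

def save_state(domain_folder: str, name: str, data) -> None:
    """Write the state file back to the same key, overwriting the old version."""
    key = f"{PREFIX}{domain_folder}/{name}"
    s3.put_object(Bucket=BUCKET, Key=key,
                  Body=json.dumps(data, indent=2).encode("utf-8"),
                  ContentType="application/json")

# progress = load_state("docs_example_com", "progress.json",
#                       {"completed": [], "pending": [], "failed": [], "attempts": {}})
```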
Basic crawl with S3 upload:
export S3_BUCKET_NAME="my-docs-bucket"
export UPLOAD_TO_S3=true
python crawl_test.pyCustom S3 settings:
export S3_BUCKET_NAME="my-bucket"
export S3_PREFIX="docs/2025/"
export S3_OVERWRITE=false
python crawl_test.py --upload-s3Upload existing files:
export S3_BUCKET_NAME="my-bucket"
python upload_to_s3.py --latestDry run to test configuration:
export S3_BUCKET_NAME="my-bucket"
python upload_to_s3.py --latest --dry-runAt the top of the file:
```python
OUTPUT_BASE = "output"
BATCH_SIZE = 10     # smaller = gentler for flaky sites (try 5 or 1 if needed)
SAVE_EVERY = 10     # checkpoint frequency
USE_BROWSER = False # overridden by --use-browser flag
MAX_RETRIES = 3     # per-URL retries across runs
```

Recommendations

- Start with `USE_BROWSER = False` (faster).
- If some pages return empty or rely on JS → re-run those with `--use-browser`.
- For stubborn domains, set `BATCH_SIZE = 5` or even `1` to reduce concurrency.
```bash
# macOS/Linux
jq '.completed | length, .failed | length, .pending | length' output/*/progress.json | xargs -n3 echo "completed= failed= pending="
```

If you don't have jq, use Python:
```bash
python - << 'PY'
import json, glob
p = sorted(glob.glob('output/*/progress.json'))[-1]
d = json.load(open(p))
print("Run:", p)
print("completed:", len(d["completed"]), "failed:", len(d["failed"]), "pending:", len(d["pending"]))
PY
```

Open a few `*.md` files under `output/<run>/` to verify formatting (headings, lists, code blocks).
```bash
# Retry only failed, without browser (often works for docs)
python crawl_test.py --retry-failed

# If still failing, use browser + small batches:
# (Set BATCH_SIZE=5 or 1 in the script first if necessary)
python crawl_test.py --retry-failed --use-browser
```

- Delete `progress.json` inside your run folder, or just start a new run (default behavior if no unfinished run is found).

Edit `url_index.json` to keep a subset (back up the original first).
Symptoms

```
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
```

Cause

You're using Python 3.9 or earlier, but crawl4ai requires Python 3.10+ (due to the union type syntax `|`).

Fix

- Check your Python version:

  ```bash
  python3 --version
  ```

- If you have Python 3.10+ available, use it:

  ```bash
  # Use specific Python version
  python3.13 crawl_test.py

  # Or activate virtual environment properly
  source venv/bin/activate
  python3.13 crawl_test.py
  ```

- If you only have Python 3.9, upgrade:

  ```bash
  # macOS with Homebrew
  brew install python@3.13

  # Then recreate your virtual environment
  python3.13 -m venv venv
  source venv/bin/activate
  pip install crawl4ai
  ```
Symptoms
```
BrowserType.launch: Executable doesn't exist...
Looks like Playwright was just installed... run:
playwright install
```

Fix

```bash
pip install "crawl4ai[full]"
python -m playwright install
```

Then re-run with `--use-browser` only if needed.
You're on Crawl4AI 0.7.x. Use:

```python
await crawler.arun_many(urls=urls, config=CrawlerRunConfig(stream=False))
```

(Your current script already uses `arun_many`.)

In 0.7.x, `browser` is not a `CrawlerRunConfig` arg. Control the browser only here:

```python
async with AsyncWebCrawler(use_browser=USE_BROWSER) as crawler:
    ...
```

0.7.x doesn't expose that arg in `CrawlerRunConfig`. Use defaults, smaller batches, and choose browser vs. non-browser mode. (Your current script already removed that arg.)
Symptoms
```
Page.goto: Timeout 60000ms exceeded... waiting until "domcontentloaded"
```

Fix options

- Non-browser pass (often succeeds for docs):

  ```bash
  python crawl_test.py --retry-failed
  ```

- Browser pass with tiny batches (reduces contention):
  - Set `BATCH_SIZE = 5` (or `1`) in the script.
  - Run:

    ```bash
    python crawl_test.py --retry-failed --use-browser
    ```

- If a few still fail: run the retry multiple times; they'll either move to `completed` or stay in `failed` after 3 attempts.
This happens if you loop over a shrinking list with index math. Fix (already in your script): batch using a snapshot:

```python
while prog["pending"]:
    batch_urls = prog["pending"][:BATCH_SIZE]
    await process_batch(...)
```

Force it:
```bash
RUN_FOLDER=output/2025-09-12_14-47-29 python crawl_test.py
```
```
├── venv/
├── output/
│   └── 2025-09-12_14-47-29/
│       ├── <page-1>.md
│       ├── <page-2>.md
│       └── progress.json
├── url_index.json
├── build_url_index.py
└── crawl_test.py
```
Q: Do I need crawl4ai[full]?
A: Only for JS-heavy pages. Start with crawl4ai (static). Use --use-browser for stubborn pages.
Q: Do hash links like #section-3 create multiple files?
A: No. They’re anchor sections within the same page; you’ll get one Markdown file for the page.
Q: How do retries work?
A: Each failing URL increments a counter in progress.json. After MAX_RETRIES (default 3), it’s moved to failed. --retry-failed moves them back to pending and clears attempts.
Q: Can I run multiple sites?
A: Yes. Build a different url_index.json per site (or stash them with different names), and run the crawler per site.
Q: S3 upload fails with "S3_BUCKET_NAME environment variable is required"?
A: Make sure you have a .env file with your S3 configuration, or set environment variables directly:
```bash
export S3_BUCKET_NAME="your-bucket-name"
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"
```

Q: Files go to the "unknown_domain/" folder in S3?
A: This happens when the domain extraction fails. Make sure your URL starts with http:// or https://. The crawler extracts the domain from the base URL to create the folder name.
Q: How do I find my AWS credentials?
A:

- AWS Console: IAM → Users → Your User → Security Credentials → Create Access Key
- AWS CLI: `aws configure` (stores them in `~/.aws/credentials`)
- Environment: Set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
- IAM Roles: If running on EC2, use IAM roles (no credentials needed)
Q: How do I make my S3 bucket public?
A: Add this bucket policy (replace your-bucket-name):
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}
```

Q: What's the S3_PREFIX for?
A: The prefix creates a folder structure in your S3 bucket. With S3_PREFIX="crawled-docs/", files go to s3://bucket/crawled-docs/domain_folder/file.md. Useful for organizing multiple crawls or projects in the same bucket.
Create venv + install core:

```bash
python3 -m venv venv && source venv/bin/activate && pip install --upgrade pip && pip install crawl4ai boto3
```

Install full (browser) stack:

```bash
pip install "crawl4ai[full]" boto3 && python -m playwright install
```

Build index + crawl (static):

```bash
python build_url_index.py https://docs.example.com && python crawl_test.py
```

S3-only crawl (no local folders):

```bash
python s3_crawler.py https://docs.example.com
```

S3-only with environment file:

```bash
# Create .env file with S3 config, then:
python s3_crawler.py https://docs.example.com
```

Crawl with S3 upload (traditional):

```bash
export S3_BUCKET_NAME="my-bucket" && python crawl_test.py --upload-s3
```

Upload existing files to S3:

```bash
export S3_BUCKET_NAME="my-bucket" && python upload_to_s3.py --latest
```

Complete S3 setup (create .env + crawl):

```bash
echo "S3_BUCKET_NAME=my-bucket" > .env && echo "AWS_REGION=us-east-1" >> .env && python s3_crawler.py https://docs.example.com
```

Retry only failed (browser on):

```bash
python crawl_test.py --retry-failed --use-browser
```

Force resume a specific run:

```bash
RUN_FOLDER=output/2025-09-12_14-47-29 python crawl_test.py
```

Dry run S3 upload:

```bash
export S3_BUCKET_NAME="my-bucket" && python upload_to_s3.py --latest --dry-run
```