A command-line tool for acquiring, organizing, and searching FOIA documents from government archives and other sources.
- Multi-source scraping - Configurable scrapers for FBI Vault, CIA Reading Room, MuckRock, DocumentCloud, and custom sources
- Privacy by default - Routes traffic through Tor with pluggable transports; supports external SOCKS proxies
- Smart rate limiting - Adaptive delays with exponential backoff to avoid blocks, with optional Redis backend for distributed deployments
- Content-addressable storage - Documents stored by SHA-256 + BLAKE3 hash for deduplication (see the sketch after this list)
- Multiple OCR backends - Tesseract (default), OCRS (pure Rust), PaddleOCR (GPU), or DeepSeek (VLM)
- Browser automation - Chromium-based scraping for JavaScript-heavy sites with stealth mode for bot detection bypass
- Browser pool - Load balance across multiple browser instances with round-robin, random, or per-domain strategies
- WARC import - Import documents from Web Archive files with filtering and checkpointing
- Full-text search - Search across document content and metadata
- Web UI - Browse, search, and view documents through a local web interface
- LLM annotation - Generate summaries and tags using Ollama, Groq, OpenAI, or Together.ai
- Database flexibility - SQLite (default) or PostgreSQL for larger deployments
- Docker support - Pre-built images for easy deployment
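
As a rough sketch of what content-addressable storage buys you (the layout below is illustrative, not foiacquire's actual on-disk scheme), identical bytes hash to the same address, so a re-downloaded duplicate maps to a path that already exists:

```bash
# Illustrative only: derive a content address from a document's digests.
# Assumes coreutils `sha256sum` and the `b3sum` tool are on PATH;
# foiacquire's real storage layout may differ.
sha=$(sha256sum document.pdf | awk '{print $1}')
b3=$(b3sum document.pdf | awk '{print $1}')
echo "store/${sha:0:2}/${sha}-${b3:0:8}"  # same content => same path => dedup
```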
Download a pre-built binary from Releases, or build from source:
```bash
# Default build (SQLite + browser automation)
cargo install --git https://github.com/monokrome/foiacquire

# With PostgreSQL support
cargo install --git https://github.com/monokrome/foiacquire --features postgres

# With all OCR backends
cargo install --git https://github.com/monokrome/foiacquire --features ocr-all
```

| Feature | Description |
|---|---|
| `browser` | Browser automation via Chromium (default) |
| `postgres` | PostgreSQL database support |
| `redis-backend` | Redis for distributed rate limiting |
| `ocr-ocrs` | OCRS pure-Rust OCR backend |
| `ocr-paddle` | PaddleOCR CNN-based backend |
| `ocr-all` | All OCR backends |
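
Feature flags can be combined in a single install using standard cargo syntax; the flag names below come from the table above:

```bash
# PostgreSQL plus distributed rate limiting and the pure-Rust OCR backend
cargo install --git https://github.com/monokrome/foiacquire \
  --features postgres,redis-backend,ocr-ocrs
```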
```bash
# Initialize with a target directory
foiacquire init --target ./foia-data

# Or use an existing config
cp etc/example.json foiacquire.json
foiacquire init

# List configured sources
foiacquire source list

# Scrape documents (crawl + download)
foiacquire scrape fbi_vault --limit 100

# Run OCR on downloaded documents
foiacquire analyze --workers 4

# Generate summaries with LLM (requires Ollama)
foiacquire annotate --limit 50

# Start web UI
foiacquire serve

# Open http://localhost:3030
```

| Command | Description |
|---|---|
| `scrape <source>` | Crawl and download documents from a source |
| `crawl <source>` | Discover document URLs without downloading |
| `download [source]` | Download documents from the crawl queue |
| `import <files...>` | Import from WARC archive files |
| `refresh [source]` | Re-fetch metadata for existing documents |
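
Because `crawl` and `download` are separate steps, discovery can run ahead of (or on a different schedule from) the actual fetching. A minimal split workflow using only the commands above:

```bash
# Discover document URLs now...
foiacquire crawl fbi_vault

# ...and drain the crawl queue later
foiacquire download fbi_vault
```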
| Command | Description |
|---|---|
| `analyze [source]` | Extract text and run OCR on documents |
| `analyze-check` | Verify OCR tools are installed |
| `analyze-compare <file>` | Compare OCR backends on a file |
| `annotate [source]` | Generate summaries/tags with LLM |
| `detect-dates [source]` | Detect publication dates in documents |
| `archive [source]` | Extract contents from ZIP/email attachments |
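
Before a large OCR run it can be worth verifying the toolchain and comparing backends on a representative document (the file path here is a placeholder):

```bash
# Confirm OCR tools are installed, then compare backends on one file
foiacquire analyze-check
foiacquire analyze-compare ./samples/scan.pdf  # placeholder path
```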
| Command | Description |
|---|---|
| `ls` | List documents with filtering |
| `info <doc_id>` | Show document metadata |
| `read <doc_id>` | Output document content |
| `search <query>` | Full-text search |
| `serve [bind]` | Start web interface (default: 127.0.0.1:3030) |
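
A typical lookup flow chains these together, feeding a document ID from `search` results into `info` or `read` (the query and `<doc_id>` are placeholders):

```bash
foiacquire search "surveillance"   # find matching documents
foiacquire info <doc_id>           # inspect one result's metadata
foiacquire read <doc_id> | less    # page through its extracted text
```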
| Command | Description |
|---|---|
| `init` | Initialize database and directories |
| `source list` | List configured sources |
| `source rename` | Rename a source |
| `config recover` | Recover config from database |
| `config history` | Show configuration history |
| `db copy <from> <to>` | Migrate between SQLite/PostgreSQL |
| `state status` | Show crawl state |
| `state clear <source>` | Reset crawl state |
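
As a hedged sketch of a SQLite-to-PostgreSQL migration (the exact `<from>`/`<to>` argument format is an assumption; verify against `foiacquire db copy --help`):

```bash
# Assumed connection-string arguments; check the command's help output
foiacquire db copy sqlite://./foiacquire.db \
  postgres://user:pass@host/foiacquire
```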
Create a `foiacquire.json` in your data directory, or supply one with `--config`:
```json
{
  "target": "./foia_documents/",
  "scrapers": {
    "my_source": {
      "discovery": {
        "type": "html_crawl",
        "base_url": "https://example.gov/foia",
        "start_paths": ["/documents"],
        "document_links": ["a[href*='/doc/']"],
        "document_patterns": ["\\.pdf$"],
        "pagination": {
          "next_selectors": ["a.next-page"]
        }
      },
      "fetch": {
        "use_browser": false
      }
    }
  }
}
```

See docs/configuration.md for full options.
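
To run against a config stored elsewhere, pass it with `--config` (the flag placement shown here is an assumption; check `foiacquire --help`):

```bash
foiacquire --config ./configs/foiacquire.json source list
```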
| Variable | Description |
|---|---|
| `DATABASE_URL` | Database connection (e.g., `postgres://user:pass@host/db`) |
| `BROWSER_URL` | Remote Chrome DevTools URL (e.g., `ws://localhost:9222`); comma-separated for a pool |
| `SOCKS_PROXY` | External SOCKS5 proxy (e.g., `socks5://localhost:9050`) |
| `FOIACQUIRE_DIRECT` | Set to `1` to disable Tor routing |
| `LLM_PROVIDER` | LLM provider: `ollama`, `openai`, `groq`, or `together` |
| `LLM_MODEL` | Model for annotation |
| `GROQ_API_KEY` | Groq API key (auto-selects the Groq provider) |
| `RUST_LOG` | Log level (e.g., `debug`, `info`) |
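
Environment variables compose with any command; for instance, combining an external SOCKS proxy with verbose logging for a single run:

```bash
# One-off run through an external SOCKS5 proxy with debug logging
SOCKS_PROXY=socks5://localhost:9050 RUST_LOG=debug \
  foiacquire scrape fbi_vault --limit 10
```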
```bash
# Run with local data directory
docker run -v ./foia-data:/opt/foiacquire \
  -e USER_ID=$(id -u) -e GROUP_ID=$(id -g) \
  monokrome/foiacquire:latest scrape fbi_vault

# With PostgreSQL
docker run -v ./foia-data:/opt/foiacquire \
  -e DATABASE_URL=postgres://user:pass@host/foiacquire \
  monokrome/foiacquire:latest scrape fbi_vault

# Start web UI
docker run -v ./foia-data:/opt/foiacquire \
  -p 3030:3030 \
  monokrome/foiacquire:latest serve 0.0.0.0:3030

# With browser automation (stealth mode for bot detection bypass)
docker run -d --name chromium --shm-size=2g monokrome/chromium:stealth
docker run -v ./foia-data:/opt/foiacquire \
  -e BROWSER_URL=ws://chromium:9222 \
  --link chromium \
  monokrome/foiacquire:latest scrape cia_foia
```

See docs/docker.md for Docker Compose examples, VNC setup, and Synology configuration.
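
For the browser pool, `BROWSER_URL` accepts a comma-separated list of DevTools URLs; a sketch with two stealth Chromium containers (the container names are illustrative):

```bash
# Two browser instances load-balanced by the pool
docker run -d --name chromium1 --shm-size=2g monokrome/chromium:stealth
docker run -d --name chromium2 --shm-size=2g monokrome/chromium:stealth
docker run -v ./foia-data:/opt/foiacquire \
  -e BROWSER_URL=ws://chromium1:9222,ws://chromium2:9222 \
  --link chromium1 --link chromium2 \
  monokrome/foiacquire:latest scrape cia_foia
```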
- Getting Started - First-time setup guide
- Configuration - All configuration options
- Commands - Detailed command reference
- Scrapers - Writing custom scraper configs
- Docker Deployment - Container deployment guide
MIT