foiacquire

A command-line tool for acquiring, organizing, and searching FOIA documents from government archives and other sources.

Features

  • Multi-source scraping - Configurable scrapers for FBI Vault, CIA Reading Room, MuckRock, DocumentCloud, and custom sources
  • Privacy by default - Routes traffic through Tor with pluggable transports; supports external SOCKS proxies
  • Smart rate limiting - Adaptive delays with exponential backoff to avoid blocks, with optional Redis backend for distributed deployments
  • Content-addressable storage - Documents stored by SHA-256 + BLAKE3 hash for deduplication
  • Multiple OCR backends - Tesseract (default), OCRS (pure Rust), PaddleOCR (GPU), or DeepSeek (VLM)
  • Browser automation - Chromium-based scraping for JavaScript-heavy sites, with a stealth mode to bypass bot detection
  • Browser pool - Load-balance across multiple browser instances using round-robin, random, or per-domain strategies
  • WARC import - Import documents from Web Archive files with filtering and checkpointing
  • Full-text search - Search across document content and metadata
  • Web UI - Browse, search, and view documents through a local web interface
  • LLM annotation - Generate summaries and tags using Ollama, Groq, OpenAI, or Together.ai
  • Database flexibility - SQLite (default) or PostgreSQL for larger deployments
  • Docker support - Pre-built images for easy deployment

Installation

Download a pre-built binary from Releases, or build from source:

# Default build (SQLite + browser automation)
cargo install --git https://github.com/monokrome/foiacquire

# With PostgreSQL support
cargo install --git https://github.com/monokrome/foiacquire --features postgres

# With all OCR backends
cargo install --git https://github.com/monokrome/foiacquire --features ocr-all

Feature Flags

| Feature | Description |
|---|---|
| browser | Browser automation via Chromium (default) |
| postgres | PostgreSQL database support |
| redis-backend | Redis for distributed rate limiting |
| ocr-ocrs | OCRS pure-Rust OCR backend |
| ocr-paddle | PaddleOCR CNN-based backend |
| ocr-all | All OCR backends |
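
Feature flags can be combined in a single build; the combination below is only illustrative:

# Build with PostgreSQL, Redis-backed rate limiting, and the OCRS backend
cargo install --git https://github.com/monokrome/foiacquire \
  --features "postgres,redis-backend,ocr-ocrs"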

Quick Start

# Initialize with a target directory
foiacquire init --target ./foia-data

# Or use an existing config
cp etc/example.json foiacquire.json
foiacquire init

# List configured sources
foiacquire source list

# Scrape documents (crawl + download)
foiacquire scrape fbi_vault --limit 100

# Run OCR on downloaded documents
foiacquire analyze --workers 4

# Generate summaries with LLM (requires Ollama)
foiacquire annotate --limit 50

# Start web UI
foiacquire serve
# Open http://localhost:3030

Commands

Document Acquisition

| Command | Description |
|---|---|
| scrape <source> | Crawl and download documents from a source |
| crawl <source> | Discover document URLs without downloading |
| download [source] | Download documents from the crawl queue |
| import <files...> | Import from WARC archive files |
| refresh [source] | Re-fetch metadata for existing documents |
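
The crawl and download commands split the work that scrape does in one pass; a sketch of that workflow (the WARC path is a placeholder):

# Discover document URLs first, then download from the queue
foiacquire crawl fbi_vault
foiacquire download fbi_vault

# Import documents captured in a WARC file (path is illustrative)
foiacquire import ./archives/session.warc.gz

# Re-fetch metadata for documents already stored
foiacquire refresh fbi_vault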

Document Processing

| Command | Description |
|---|---|
| analyze [source] | Extract text and run OCR on documents |
| analyze-check | Verify that OCR tools are installed |
| analyze-compare <file> | Compare OCR backends on a single file |
| annotate [source] | Generate summaries and tags with an LLM |
| detect-dates [source] | Detect publication dates in documents |
| archive [source] | Extract contents from ZIP/email attachments |
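
A sketch of a processing pass, reusing the flags shown in the Quick Start (the PDF path is a placeholder):

# Confirm OCR tooling is installed, then compare backends on one file
foiacquire analyze-check
foiacquire analyze-compare ./samples/scan.pdf

# Extract text, generate LLM annotations, and detect publication dates
foiacquire analyze --workers 4
foiacquire annotate --limit 50
foiacquire detect-dates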

Browsing & Search

| Command | Description |
|---|---|
| ls | List documents with filtering |
| info <doc_id> | Show document metadata |
| read <doc_id> | Output document content |
| search <query> | Full-text search |
| serve [bind] | Start web interface (default: 127.0.0.1:3030) |
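
For example (the document ID is a placeholder taken from ls or search output):

# List and search the local collection
foiacquire ls
foiacquire search "surveillance program"

# Inspect and read a single document (placeholder ID)
foiacquire info 1a2b3c
foiacquire read 1a2b3c

# Serve the web UI on all interfaces instead of the default bind
foiacquire serve 0.0.0.0:3030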

Management

| Command | Description |
|---|---|
| init | Initialize database and directories |
| source list | List configured sources |
| source rename | Rename a source |
| config recover | Recover config from database |
| config history | Show configuration history |
| db copy <from> <to> | Migrate between SQLite/PostgreSQL |
| state status | Show crawl state |
| state clear <source> | Reset crawl state |
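
A few illustrative invocations; the db copy arguments are assumed (not confirmed) to be database URLs in the same format as DATABASE_URL:

# Inspect and reset crawl state for a source
foiacquire state status
foiacquire state clear fbi_vault

# Review configuration changes recorded in the database
foiacquire config history

# Migrate from SQLite to PostgreSQL (URL formats assumed)
foiacquire db copy sqlite://./foia-data/foiacquire.db postgres://user:pass@host/foiacquire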

Configuration

Create a foiacquire.json in your data directory or use --config:

{
  "target": "./foia_documents/",
  "scrapers": {
    "my_source": {
      "discovery": {
        "type": "html_crawl",
        "base_url": "https://example.gov/foia",
        "start_paths": ["/documents"],
        "document_links": ["a[href*='/doc/']"],
        "document_patterns": ["\\.pdf$"],
        "pagination": {
          "next_selectors": ["a.next-page"]
        }
      },
      "fetch": {
        "use_browser": false
      }
    }
  }
}

See docs/configuration.md for full options.
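
Once defined, the custom source works with the same commands as the built-in ones, e.g. with the --limit flag from the Quick Start:

# Crawl and download from the source defined above
foiacquire scrape my_source --limit 10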

Environment Variables

| Variable | Description |
|---|---|
| DATABASE_URL | Database connection (e.g., postgres://user:pass@host/db) |
| BROWSER_URL | Remote Chrome DevTools URL (e.g., ws://localhost:9222); comma-separated for a pool |
| SOCKS_PROXY | External SOCKS5 proxy (e.g., socks5://localhost:9050) |
| FOIACQUIRE_DIRECT | Set to 1 to disable Tor routing |
| LLM_PROVIDER | LLM provider: ollama, openai, groq, or together |
| LLM_MODEL | Model used for annotation |
| GROQ_API_KEY | Groq API key (auto-selects the Groq provider) |
| RUST_LOG | Log level (e.g., debug, info) |
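
For example, pointing at PostgreSQL, routing through a local SOCKS proxy, and raising the log level (all values are placeholders):

# Placeholder values; adjust for your deployment
export DATABASE_URL=postgres://foia:secret@localhost/foiacquire
export SOCKS_PROXY=socks5://localhost:9050
export LLM_PROVIDER=ollama
export RUST_LOG=info

foiacquire scrape fbi_vault --limit 100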

Docker

# Run with local data directory
docker run -v ./foia-data:/opt/foiacquire \
  -e USER_ID=$(id -u) -e GROUP_ID=$(id -g) \
  monokrome/foiacquire:latest scrape fbi_vault

# With PostgreSQL
docker run -v ./foia-data:/opt/foiacquire \
  -e DATABASE_URL=postgres://user:pass@host/foiacquire \
  monokrome/foiacquire:latest scrape fbi_vault

# Start web UI
docker run -v ./foia-data:/opt/foiacquire \
  -p 3030:3030 \
  monokrome/foiacquire:latest serve 0.0.0.0:3030

# With browser automation (stealth mode for bot detection bypass)
docker run -d --name chromium --shm-size=2g monokrome/chromium:stealth
docker run -v ./foia-data:/opt/foiacquire \
  -e BROWSER_URL=ws://chromium:9222 \
  --link chromium \
  monokrome/foiacquire:latest scrape cia_foia

See docs/docker.md for Docker Compose examples, VNC setup, and Synology configuration.

Documentation

Additional guides live in the docs/ directory: docs/configuration.md covers the full configuration options, and docs/docker.md covers Docker Compose examples, VNC setup, and Synology configuration.

License

MIT