Stop feeding your RAG garbage.
This is a powerful, self-hosted web crawler designed to ingest documentation into LLM pipelines (like AnythingLLM).
Unlike generic crawlers, it uses trafilatura to strip away navigation bars, ads, and footers, leaving you with clean, high-quality markdown that LLMs actually understand.
The dashboard provides three views: New Crawl, Crawl Status, and Crawl History & Download.
- Clean Data = Better Answers: Most crawlers dump raw HTML. This one extracts the article text automatically.
- Ready for Vector DBs: Automatically splits large exports into 20MB chunks (JSON/Markdown) to fit upload limits.
- Self-Hosted & Private: Runs on your NAS/Server via Docker. Your data stays yours.
- Resilient: Background threading with persistent SQLite history. Resume or stop crawls at any time.
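The 20MB export splitting mentioned above can be sketched roughly as follows. This is a minimal stdlib sketch, not the project's actual code; the function name `split_into_chunks` and the document shape (`url`/`markdown` dicts) are illustrative assumptions:

```python
import json

# Illustrative sketch: group extracted documents into JSON batches that
# each stay under a byte budget (e.g. 20 MB, to fit upload limits).
CHUNK_LIMIT = 20 * 1024 * 1024  # 20 MB

def split_into_chunks(docs, limit=CHUNK_LIMIT):
    """Group docs (dicts with 'url' and 'markdown') into size-capped batches."""
    chunks, current, current_size = [], [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        if current and current_size + size > limit:
            chunks.append(current)           # current batch is full: flush it
            current, current_size = [], 0
        current.append(doc)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# Tiny limit just to demonstrate the splitting behavior.
docs = [{"url": f"https://example.com/p{i}", "markdown": "x" * 100} for i in range(5)]
for i, chunk in enumerate(split_into_chunks(docs, limit=300)):
    print(f"chunk_{i:03}.json holds {len(chunk)} docs")
```

Each batch would then be written out as its own `.json` (or `.md`) file, so no single upload exceeds the vector database's size limit.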
- Smart Extraction: Uses `trafilatura` to extract only the main article content, removing navbars, footers, and ads.
- RAG-Ready Exports: Automatically splits large datasets into 20MB chunks (JSON/Markdown) for easy upload to vector databases.
- Live Dashboard: Real-time progress tracking, speed metrics (pages/sec), and recent activity logs.
- Resilient: Background threading with persistent SQLite history. Resume or stop crawls at any time.
- Dockerized: Simple deployment with Docker Compose.
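The resilience claim (a background crawl thread plus a persistent SQLite history) can be sketched like this. The schema, table name, and function names are illustrative assumptions, not the project's actual implementation:

```python
import sqlite3
import threading

# Illustrative sketch: record each crawled page in SQLite so a crawl can be
# stopped and resumed. Already-done URLs are skipped on the next run.
def init_db(path=":memory:"):
    db = sqlite3.connect(path, check_same_thread=False)
    db.execute("""CREATE TABLE IF NOT EXISTS pages (
                    url TEXT PRIMARY KEY,
                    status TEXT NOT NULL)""")
    return db

def crawl(db, urls, stop_event, lock):
    for url in urls:
        if stop_event.is_set():   # cooperative stop: progress stays on disk
            break
        with lock:                # serialize access to the shared connection
            done = db.execute(
                "SELECT 1 FROM pages WHERE url=? AND status='done'",
                (url,)).fetchone()
            if done:              # resumed crawl: skip pages already fetched
                continue
            # (real code would fetch and extract the page here)
            db.execute("INSERT OR REPLACE INTO pages VALUES (?, 'done')", (url,))
            db.commit()

db = init_db()
stop, lock = threading.Event(), threading.Lock()
urls = [f"https://docs.example.com/page{i}" for i in range(3)]
worker = threading.Thread(target=crawl, args=(db, urls, stop, lock), daemon=True)
worker.start()
worker.join()
count = db.execute("SELECT COUNT(*) FROM pages WHERE status='done'").fetchone()[0]
print(f"{count} pages recorded")
```

Because state lives in SQLite rather than in memory, setting the stop event (or even killing the container) loses nothing: rerunning the crawl picks up where it left off.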
```bash
git clone https://github.com/warezfr/doc-crawler-rag.git
cd doc-crawler-rag
docker compose up -d
```

Access the dashboard at http://localhost:18510.
Edit `docker-compose.yml` to persist data or change ports:

```yaml
volumes:
  - ./data:/app/data  # Persistent storage for crawls
```

- Frontend: Streamlit (UI, File Exports)
- Backend: Python `threading` + `sqlite3`
- Engine: `Trafilatura` (Extraction) + `Requests` (Fetching)
- Docker & Docker Compose
- (Optional) AnythingLLM instance for direct uploads