Stop feeding your RAG garbage.
This is a powerful, self-hosted web crawler designed to ingest documentation into LLM pipelines (like AnythingLLM).
Unlike generic crawlers, it uses trafilatura to strip away navigation bars, ads, and footers, leaving you with clean, high-quality markdown that LLMs actually understand.
The dashboard provides three views: New Crawl, Crawl Status, and Crawl History & Download.
- Clean Data = Better Answers: Most crawlers dump raw HTML. This one extracts the article text automatically.
- Ready for Vector DBs: Automatically splits large exports into 20MB chunks (JSON/Markdown) to fit upload limits.
- Self-Hosted & Private: Runs on your NAS/Server via Docker. Your data stays yours.
- Resilient: Background threading with persistent SQLite history. Resume or stop crawls at any time.
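The 20MB export splitting mentioned above can be sketched roughly as follows. This is a minimal stdlib sketch, not the project's actual code; the function name `split_into_chunks` and the document shape (`url`/`markdown` dicts) are illustrative assumptions:

```python
import json

# Illustrative sketch: group extracted documents into JSON batches that
# each stay under a byte budget (e.g. 20 MB, to fit upload limits).
CHUNK_LIMIT = 20 * 1024 * 1024  # 20 MB

def split_into_chunks(docs, limit=CHUNK_LIMIT):
    """Group docs (dicts with 'url' and 'markdown') into size-capped batches."""
    chunks, current, current_size = [], [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        if current and current_size + size > limit:
            chunks.append(current)           # current batch is full: flush it
            current, current_size = [], 0
        current.append(doc)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# Tiny limit just to demonstrate the splitting behavior.
docs = [{"url": f"https://example.com/p{i}", "markdown": "x" * 100} for i in range(5)]
for i, chunk in enumerate(split_into_chunks(docs, limit=300)):
    print(f"chunk_{i:03}.json holds {len(chunk)} docs")
```

Each batch would then be written out as its own `.json` (or `.md`) file, so no single upload exceeds the vector database's size limit.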
- Smart Extraction: Uses `trafilatura` to extract only the main article content, removing navbars, footers, and ads.
- RAG-Ready Exports: Automatically splits large datasets into 20MB chunks (JSON/Markdown) for easy upload to vector databases.
- Live Dashboard: Real-time progress tracking, speed metrics (pages/sec), and recent activity logs.
- Resilient: Background threading with persistent SQLite history. Resume or stop crawls at any time.
- Dockerized: Simple deployment with Docker Compose.
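The resilience claim (a background crawl thread plus a persistent SQLite history) can be sketched like this. The schema, table name, and function names are illustrative assumptions, not the project's actual implementation:

```python
import sqlite3
import threading

# Illustrative sketch: record each crawled page in SQLite so a crawl can be
# stopped and resumed. Already-done URLs are skipped on the next run.
def init_db(path=":memory:"):
    db = sqlite3.connect(path, check_same_thread=False)
    db.execute("""CREATE TABLE IF NOT EXISTS pages (
                    url TEXT PRIMARY KEY,
                    status TEXT NOT NULL)""")
    return db

def crawl(db, urls, stop_event, lock):
    for url in urls:
        if stop_event.is_set():   # cooperative stop: progress stays on disk
            break
        with lock:                # serialize access to the shared connection
            done = db.execute(
                "SELECT 1 FROM pages WHERE url=? AND status='done'",
                (url,)).fetchone()
            if done:              # resumed crawl: skip pages already fetched
                continue
            # (real code would fetch and extract the page here)
            db.execute("INSERT OR REPLACE INTO pages VALUES (?, 'done')", (url,))
            db.commit()

db = init_db()
stop, lock = threading.Event(), threading.Lock()
urls = [f"https://docs.example.com/page{i}" for i in range(3)]
worker = threading.Thread(target=crawl, args=(db, urls, stop, lock), daemon=True)
worker.start()
worker.join()
count = db.execute("SELECT COUNT(*) FROM pages WHERE status='done'").fetchone()[0]
print(f"{count} pages recorded")
```

Because state lives in SQLite rather than in memory, setting the stop event (or even killing the container) loses nothing: rerunning the crawl picks up where it left off.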
```bash
git clone https://github.com/warezfr/doc-crawler-rag.git
cd doc-crawler-rag
docker compose up -d
```

Access the dashboard at http://localhost:18510.
Edit `docker-compose.yml` to persist data or change ports:

```yaml
volumes:
  - ./data:/app/data  # Persistent storage for crawls
```

- Frontend: Streamlit (UI, File Exports)
- Backend: Python `threading` + `sqlite3`
- Engine: `Trafilatura` (Extraction) + `Requests` (Fetching)
- Docker & Docker Compose
- (Optional) AnythingLLM instance for direct uploads