A powerful, offline RAG (Retrieval-Augmented Generation) system that lets you chat with your local documents. It uses a hybrid architecture to handle everything from small files to massive reports with high precision.
Privacy First: 100% Local. No data leaves your machine. Powered by Ollama.
- Hybrid Retrieval Engine:
  - RAM Mode: Small files are kept in memory for 100% accuracy and zero latency.
  - Vector Mode: Large files (>4000 chars) are automatically chunked and indexed in ChromaDB for scalable semantic search.
- Multi-Format Intelligence:
  - 📄 PDFs: Extracts text and parses structure.
  - 📊 CSVs: Auto-calculates row counts and formats data into readable Markdown tables.
  - 🗃️ NoSQL/JSON: Flattens nested JSON data into analyzable records (great for logs & MongoDB dumps).
- 🧠 Deep Summarizer (Map-Reduce):
  - Summarizes massive documents (500+ pages) by breaking them down, summarizing the chunks, and combining them into a master report.
- Interactive CLI:
  - Focus Mode: Lock the chat to a specific file to eliminate noise.
  - Session Recorder: Save your entire analysis to a `.txt` report.
```mermaid
%%{init: {'theme': 'dark'}}%%
graph TD
    %% Styling
    classDef storage fill:#006064,stroke:#4dd0e1,stroke-width:2px,color:#ffffff;
    classDef logic fill:#bf360c,stroke:#ffab91,stroke-width:2px,color:#ffffff;
    classDef newFeat fill:#4a148c,stroke:#e1bee7,stroke-width:2px,color:#ffffff;
    subgraph Startup ["Phase 1: Ingestion"]
        Loader["File Loader"] -->|"PDF/Txt"| TextData["Raw Text"]
        Loader -->|"CSV"| CSVData["Metadata + Tables"]
        Loader -->|"JSON"| JSONData["Flattened JSONL"]:::newFeat
        TextData & CSVData & JSONData --> Router{"Size < 4000 chars?"}
        Router -->|Yes| RAM["RAM Storage"]:::storage
        Router -->|No| VectorDB[("ChromaDB")]:::storage
    end
    subgraph Loop ["Phase 2: Interaction"]
        User([User]) --> CmdCheck{"Command?"}
        CmdCheck -->|":deep"| MapReduce["Deep Map-Reduce Engine"]:::newFeat
        CmdCheck -->|":focus"| FocusedSearch["Targeted Search"]
        CmdCheck -->|"Query"| HybridSearch["Hybrid Context Builder"]:::logic
        MapReduce & FocusedSearch & HybridSearch --> Agent["Ollama Agent"]:::logic
        Agent --> User
    end
```
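The "Flattened JSONL" step in the ingestion diagram turns nested JSON into flat key/value records. A minimal sketch of that idea, where the function name and the `.` separator are illustrative choices rather than what `file_loader.py` actually does:

```python
# Minimal sketch of flattening nested JSON into flat records, as in
# the ingestion phase above. Function name and "." separator are
# illustrative, not necessarily the project's real implementation.

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts/lists into a single-level dict."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat

record = {"user": {"name": "ada", "tags": ["admin", "dev"]}}
print(flatten(record))
# {'user.name': 'ada', 'user.tags.0': 'admin', 'user.tags.1': 'dev'}
```

Flat records like these embed and retrieve much better than raw nested JSON, which is what makes log files and MongoDB dumps searchable.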
🛠️ Installation
- Python 3.10+
- Ollama installed and running (if using the included Ollama agent)
- Git
Ensure Ollama is running, then pull the default model (for example Llama 3):
```
ollama pull llama3
```

- Prepare data: place your documents (PDF, CSV, JSON, TXT) under the `data/` folder.
- Run the CLI:

```
python main.py --files ./data
```

- `:files`: List all loaded files and their storage status (RAM vs Vector).
- `:focus [filename]`: Lock the chat to a specific file (e.g., `:focus sales.csv`).
- `:deep [filename]`: Run a Deep Map-Reduce summary on a large document.
- `:save [name]`: Export the current chat history to a text file.
- `:all`: Return to global search mode (search all files).
- `exit`: Quit the application.
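The `:deep` command's Map-Reduce flow can be sketched as follows. The chunk size and function names are illustrative, and `call_llm` stands in for the Ollama call rather than mirroring `summarizer.py`'s real interface:

```python
# Sketch of the :deep Map-Reduce flow: split a large document into
# chunks, summarize each one ("map"), then summarize the summaries
# ("reduce"). `call_llm` is a placeholder for the Ollama call.

def split_into_chunks(text: str, size: int = 2000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def deep_summarize(text: str, call_llm, chunk_size: int = 2000) -> str:
    chunks = split_into_chunks(text, chunk_size)
    # Map: one summary per chunk.
    partials = [call_llm(f"Summarize:\n{chunk}") for chunk in chunks]
    # Reduce: combine the partial summaries into a master report.
    combined = "\n".join(partials)
    return call_llm(f"Combine these summaries into one report:\n{combined}")
```

This is why documents far larger than the model's context window can still be summarized: each LLM call only ever sees one chunk (or the already-compressed partial summaries).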
```
local_file_intelligence/
├── core/
│   ├── agent.py            # LLM Interaction Handler
│   ├── hybrid_manager.py   # Router for RAM vs Vector DB
│   └── summarizer.py       # Map-Reduce Engine for large docs
├── utils/
│   ├── file_loader.py      # Parsers for PDF, CSV, JSON
│   ├── chunker.py          # Semantic Text Splitter
│   └── vector_store.py     # ChromaDB wrapper
├── data/                   # Your documents go here
├── main.py                 # CLI Entry Point
├── requirements.txt        # Dependencies
├── LICENSE                 # MIT License
└── README.md               # Documentation
```
Contributions are welcome!
- Fork the project
- Create your feature branch: `git checkout -b feature/AmazingFeature`
- Commit your changes: `git commit -m 'Add some AmazingFeature'`
- Push to the branch: `git push origin feature/AmazingFeature`
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.