A modular Python framework for building advanced document and website Q&A chatbots with Retrieval-Augmented Generation (RAG). The system ingests, chunks, embeds, and searches large volumes of text, including OCR-extracted website data, to deliver instant, context-aware answers to user queries.
- Flexible Document Ingestion: Add text from files, web scraping, or OCR (e.g., Google Vision API).
- ML-Based Chunking: Uses spaCy for intelligent sentence segmentation and chunking, optimized for noisy OCR/web data.
- Embeddings: Supports both local (`all-MiniLM-L6-v2` via sentence-transformers) and Ollama-based embeddings.
- Vector Search: Stores and retrieves document chunks with ChromaDB for fast similarity search; the sketch after this list shows the underlying flow.
- LLM Integration: Uses Ollama (e.g., Llama 3.2 3B) for generating answers from retrieved context.
- Interactive Q&A: CLI for asking questions about your ingested data.
- Extensible: Easily adapt to new data sources, chunking strategies, or LLMs.
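The embedding and vector-search features boil down to an embed, store, and query loop. Below is a minimal standalone sketch using sentence-transformers and ChromaDB directly; the framework wraps this logic in `embeddings.py` and `vector_store.py`, whose exact interfaces may differ.

```python
# Minimal embed-store-query sketch; collection and chunk names are examples.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["Ollama serves local LLMs.", "ChromaDB stores vectors."]

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.create_collection("docs")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=model.encode(chunks).tolist(),
)

# Retrieve the most similar chunk for a query.
hits = collection.query(
    query_embeddings=model.encode(["Which component stores vectors?"]).tolist(),
    n_results=1,
)
print(hits["documents"][0][0])  # -> "ChromaDB stores vectors."
```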
- Clone the repository and set up a Python 3.10+ environment (recommended: use `venv`).
- Install dependencies:
  `pip install chromadb langchain langchain_community langchain_ollama sentence-transformers ollama pytest python-dotenv spacy`
- Download the spaCy English model:
  `python -m spacy download en_core_web_sm`
- (Optional) Set up Ollama for LLM and embedding support (a quick smoke test follows this list):
  - Install Ollama and start the server: `ollama serve`
  - Pull the required models:
    `ollama pull llama3.2:3b`
    `ollama pull nomic-embed-text`
- (Optional) Set up the Google Vision API for OCR if you need to extract text from images.
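If you set up Ollama, this quick smoke test (assuming the `ollama` Python package and a running `ollama serve`) confirms both models respond:

```python
# Verify the Ollama server and both pulled models are reachable.
import ollama

emb = ollama.embeddings(model="nomic-embed-text", prompt="hello world")
print(f"embedding length: {len(emb['embedding'])}")

reply = ollama.chat(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Say hi in five words."}],
)
print(reply["message"]["content"])
```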
- Run the demo: `python main.py`. This loads a sample text file, chunks and embeds it, and starts an interactive Q&A loop.
- Place your extracted text files in the appropriate directory.
- Edit `main.py` or use `text_loader.py` to load your data.
- By default, the system uses spaCy-based sentence segmentation for robust chunking of OCR/web data (see the sketch after this list).
- You can adjust the chunking mode and parameters in `rag_system.py` or `document_processor.py`.
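As a rough illustration of the default strategy, here is a minimal sentence-aware chunker built on spaCy; the function name and `max_chars` parameter are illustrative, not the actual `document_processor.py` API.

```python
# Greedy sentence-aware chunking: pack spaCy sentences into chunks
# of at most max_chars characters. Names here are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    chunks, current = [], ""
    for sent in nlp(text).sents:
        if current and len(current) + len(sent.text) + 1 > max_chars:
            chunks.append(current)      # current chunk is full; start a new one
            current = sent.text
        else:
            current = f"{current} {sent.text}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("OCR output is noisy. Sentences may be broken. spaCy helps."))
```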
    new_try/
        main.py                # Entry point for demo/interactive Q&A
        rag_system.py          # Main RAG system logic
        document_processor.py  # Chunking (ML-based and default)
        embeddings.py          # Embedding logic (Ollama/local)
        vector_store.py        # ChromaDB vector storage
        llm_client.py          # LLM interface (Ollama)
        config.py              # Configuration
        ...
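`config.py` centralizes model and chunking settings. A plausible shape, with every name and default here being illustrative rather than the file's actual contents:

```python
# Hypothetical config.py contents; names and values are illustrative.
OLLAMA_LLM_MODEL = "llama3.2:3b"        # LLM used for answer generation
OLLAMA_EMBED_MODEL = "nomic-embed-text" # Ollama embedding model
LOCAL_EMBED_MODEL = "all-MiniLM-L6-v2"  # local sentence-transformers model
CHROMA_COLLECTION = "documents"         # ChromaDB collection name
CHUNK_MAX_CHARS = 500                   # chunking parameter
```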
- Extract text from a website using OCR (e.g., Google Vision API + Crawl4AI).
- Ingest text into the RAG system.
- Ask questions interactively or via API.
- Get answers with sources and confidence scores.
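Expressed as code, this workflow might look like the following; `RAGSystem`, `add_document()`, `ask()`, and the result fields are hypothetical names, so check `rag_system.py` for the real interface.

```python
# Hypothetical end-to-end workflow; the API names are illustrative.
from rag_system import RAGSystem

rag = RAGSystem()

# Ingest OCR-extracted website text (file path is an example).
with open("extracted_site_text.txt", encoding="utf-8") as f:
    rag.add_document(f.read())

# Ask a question and inspect the answer, sources, and confidence
# (the return shape is assumed here).
result = rag.ask("What services does the website describe?")
print(result["answer"], result["sources"], result["confidence"])
```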
- Change chunking: Edit `DocumentProcessor` or its parameters.
- Switch the embedding model or LLM: Update `config.py` and make sure the required models are available.
- Integrate with a web UI: Wrap the Q&A logic in a web server such as FastAPI or Flask (see the sketch after this list).
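For the web-UI route, a minimal FastAPI wrapper might look like this, again treating `RAGSystem` and `ask()` as stand-ins for whatever `rag_system.py` actually exposes:

```python
# Minimal FastAPI wrapper around the Q&A logic.
from fastapi import FastAPI
from pydantic import BaseModel

from rag_system import RAGSystem  # hypothetical import

app = FastAPI()
rag = RAGSystem()

class Question(BaseModel):
    text: str

@app.post("/ask")
def ask(question: Question):
    result = rag.ask(question.text)  # assumed to return a dict-like answer
    return {"answer": result["answer"], "sources": result["sources"]}

# Run with: uvicorn app:app --reload
```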
- Unicode errors: Ensure files are read with `encoding='utf-8'` (see the snippet below).
- spaCy errors: Run `python -m spacy download en_core_web_sm`.
- Ollama errors: Ensure Ollama is running and the required models have been pulled.
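For the Unicode case, an explicit, defensive read looks like this (file name is an example):

```python
# Explicit encoding avoids UnicodeDecodeError on OCR output;
# errors="replace" substitutes undecodable bytes instead of crashing.
with open("ocr_output.txt", encoding="utf-8", errors="replace") as f:
    text = f.read()
```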
MIT License (or specify your license here)
For questions or contributions, open an issue or pull request!