
REFRAG - Reinforcement Learning Enhanced RAG System

A RAG (Retrieval-Augmented Generation) system that uses reinforcement learning to decide which retrieved chunks actually need full decompression before being sent to the LLM. The idea is simple: not every chunk is equally important, so why waste tokens on all of them?

Overview

REFRAG works in three stages:

Stage 1: Compress (Lightweight Encoder)

Documents get chunked and turned into vector embeddings using a sentence-transformer model. These compressed representations are stored in Pinecone and used for retrieval. Think of it as pre-processing your documents into a format that's cheap to search and compare.
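
A rough sketch of what this stage amounts to, assuming character-window chunking (chunk_text and its window sizes are illustrative, not the repo's exact API):

# Chunk a document, then batch-encode the chunks with BGE-M3.
from sentence_transformers import SentenceTransformer

def chunk_text(text, size=512, overlap=64):
    """Split text into overlapping character windows (illustrative chunker)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

encoder = SentenceTransformer("BAAI/bge-m3")        # 1024-dim embeddings
chunks = chunk_text("Your document content here...")
embeddings = encoder.encode(chunks, batch_size=32)  # one vector per chunk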

Stage 2: Perceive (RL Policy)

Once relevant chunks are retrieved, a PPO-trained reinforcement learning policy looks at each one and decides: does this chunk need to be fully decompressed (i.e., sent as raw text to the LLM), or can it stay compressed? The goal is to keep the important stuff intact while cutting down on unnecessary context.
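
In code terms, the decision loop plausibly looks like this; only the PPO/stable-baselines3 part is stated by the repo, while the observation layout and checkpoint path are assumptions:

# Score each retrieved chunk with a trained PPO policy.
# Observation layout ([query_emb ; chunk_emb]) and "policy.zip" are assumed.
import numpy as np
from stable_baselines3 import PPO

policy = PPO.load("policy.zip")  # hypothetical trained checkpoint

def decide(query_emb, chunk_embs):
    decisions = []
    for chunk_emb in chunk_embs:
        obs = np.concatenate([query_emb, chunk_emb]).astype(np.float32)
        action, _ = policy.predict(obs, deterministic=True)
        decisions.append(bool(action))  # True -> decompress to raw text
    return decisions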

Stage 3: Expand (Decoder / LLM)

The LLM (served via Groq) receives a hybrid context: fully decompressed chunks for the most relevant content, and compressed representations for the rest. This keeps the token count down without sacrificing answer quality.
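
One plausible shape for that hybrid context, assuming compressed chunks are passed as short textual surrogates (the surrogate format is an assumption, not the repo's actual hybrid_context.py):

# Full raw text for decompressed chunks, a brief stand-in for the rest.
def build_hybrid_context(chunks, decisions):
    parts = []
    for chunk, decompress in zip(chunks, decisions):
        if decompress:
            parts.append(chunk["text"])                          # full raw text
        else:
            parts.append("[compressed] " + chunk["text"][:100])  # brief stand-in
    return "\n\n".join(parts)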

Architecture

+------------------------------------------------------------------+
|                         REFRAG Pipeline                          |
+------------------------------------------------------------------+
|                                                                  |
|   +-----------+      +-----------+      +---------------------+  |
|   |  COMPRESS |      |  PERCEIVE |      |       EXPAND        |  |
|   |           |      |           |      |                     |  |
|   | Sentence  | ---> | RL Policy | ---> |    Groq LLM         |  |
|   |Transformer|      |   (PPO)   |      |                     |  |
|   +-----------+      +-----------+      +---------------------+  |
|        |                   |                      |              |
|        v                   v                      v              |
|   +-----------+      +-----------+          +-----------+        |
|   |  Chunk    |      | Decompress|          |  Hybrid   |        |
|   | Embeddings|      | Decisions |          |  Context  |        |
|   +-----------+      +-----------+          +-----------+        |
|        |                                                         |
|        v                                                         |
|   +------------------------------------------+                   |
|   |            Pinecone Vector DB            |                   |
|   +------------------------------------------+                   |
|                                                                  |
+------------------------------------------------------------------+

Getting Started

Installation

  1. Clone the repository and navigate into it:

    git clone https://github.com/AnasAmchaar/REFRAG-SYSTEM.git RAG-system
    cd RAG-system
  2. Set up a virtual environment:

    python -m venv .venv
    
    # Windows
    .venv\Scripts\activate
    
    # Linux/Mac
    source .venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up your environment variables:

    cp .env.example .env

    Then open .env and fill in your API keys.

Configuration

You'll need the following in your .env file:

PINECONE_API_KEY=your_pinecone_api_key
PINECONE_INDEX_NAME=refrag-index
PINECONE_ENVIRONMENT=us-east-1
GROQ_API_KEY=your_groq_api_key
GROQ_MODEL=openai/gpt-oss-120b
ENCODER_MODEL=BAAI/bge-m3
EMBEDDING_DIMENSION=1024
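
Since settings.py is Pydantic-based (see Project Structure below), loading these variables presumably looks roughly like the following; the field names are inferred from the .env keys above:

# Sketch of a pydantic-settings config mirroring the .env keys.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    pinecone_api_key: str
    pinecone_index_name: str = "refrag-index"
    pinecone_environment: str = "us-east-1"
    groq_api_key: str
    groq_model: str = "openai/gpt-oss-120b"
    encoder_model: str = "BAAI/bge-m3"
    embedding_dimension: int = 1024

settings = Settings()  # raises a validation error if required keys are missing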

Using the Python API

from src import Pipeline

pipeline = Pipeline()

# Index some documents
documents = [
    {"id": "doc1", "text": "Your document content here..."},
    {"id": "doc2", "text": "Another document..."}
]
pipeline.index_documents(documents)

# Query
result = pipeline.query("What is machine learning?")
print(result["answer"])

# Stream the response
for token in pipeline.query_stream("Explain neural networks"):
    print(token, end="", flush=True)

CLI

# Index a single file
python main.py index --file document.txt

# Index all .txt files in a directory
python main.py index --dir ./documents

# Query (one-shot)
python main.py query "What are the benefits of RAG?"

# Query with streaming output
python main.py query "Explain neural networks" --stream

# Skip the RL policy and use heuristic scoring instead
python main.py query "What is deep learning?" --no-rl

# Check system stats (vector count, model info, etc.)
python main.py stats

Web UI (Streamlit)

There's a full Streamlit app for interactive use:

streamlit run app.py

The web interface lets you:

  • Upload and index documents (txt, md, json, jsonl, csv, xlsx, and more)
  • Chat with your documents using streaming responses
  • Toggle the RL policy on/off
  • View per-query stats (chunks retrieved, decompression ratio, token savings)
  • Monitor latency in real time via the sidebar, with a breakdown by pipeline stage (encode, retrieve, RL policy, LLM) and a trend chart across queries

Project Structure

RAG-system/
├── config/
│   ├── __init__.py
│   └── settings.py            # Pydantic-based config, reads from .env
├── src/
│   ├── __init__.py
│   ├── pipeline.py            # Main pipeline orchestration
│   ├── compress/
│   │   ├── __init__.py
│   │   ├── encoder.py         # Sentence-transformer encoder
│   │   └── chunker.py         # Text chunking
│   ├── perceive/
│   │   ├── __init__.py
│   │   ├── policy.py          # PPO-based RL policy
│   │   ├── environment.py     # Gymnasium environment for training
│   │   └── importance_scorer.py  # Heuristic fallback scorer
│   ├── expand/
│   │   ├── __init__.py
│   │   ├── decoder.py         # Groq LLM decoder
│   │   └── hybrid_context.py  # Builds mixed context for the LLM
│   └── vectordb/
│       ├── __init__.py
│       └── pinecone_store.py  # Pinecone vector store wrapper
├── app.py                     # Streamlit web UI
├── main.py                    # CLI entry point
├── data.json                  # Sample data
├── requirements.txt
├── .env.example
└── README.md

Components

LightweightEncoder

  • Wraps sentence-transformers (default: BGE-M3, 1024-dim)
  • Handles chunking, batch encoding, and query embedding
  • Uses GPU when available

RLPolicy

  • PPO policy (via stable-baselines3) that decides which chunks to decompress
  • Falls back to a cosine-similarity-based importance scorer when no trained model is loaded (sketched after this list)
  • Configurable max decompression count
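
The fallback is conceptually simple; a minimal sketch, assuming top-k selection by cosine similarity (the function shape is illustrative):

# Score chunks by cosine similarity to the query; decompress the top-k.
import numpy as np

def heuristic_decisions(query_emb, chunk_embs, max_decompress=3):
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q                                  # cosine similarity per chunk
    top = np.argsort(scores)[::-1][:max_decompress]
    decisions = np.zeros(len(chunk_embs), dtype=bool)
    decisions[top] = True                           # True -> send raw text
    return decisions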

Decoder

  • Calls the Groq API for fast LLM inference
  • Supports both one-shot and streaming generation (see the sketch below)
  • Tracks token usage and latency per request
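
A minimal streaming call against the Groq SDK, roughly what the decoder wraps (prompt construction and the usage/latency bookkeeping are omitted):

# Stream tokens from Groq; the model name comes from GROQ_MODEL above.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Explain neural networks"}],
    stream=True,
)
for event in stream:
    delta = event.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)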

PineconeStore

  • Serverless Pinecone integration
  • Handles upsert, query, and stats (see the sketch below)
  • Supports namespaces for logical partitioning
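
The underlying Pinecone calls are roughly these (the id scheme, namespace name, and placeholder vector are illustrative):

# Upsert and query with the Pinecone v3+ SDK.
import numpy as np
from pinecone import Pinecone

pc = Pinecone(api_key="your_pinecone_api_key")
index = pc.Index("refrag-index")

vec = np.random.rand(1024).astype("float32")  # stand-in for a real embedding
index.upsert(
    vectors=[{"id": "doc1#chunk0", "values": vec.tolist(),
              "metadata": {"text": "Your document content here..."}}],
    namespace="default",
)

results = index.query(vector=vec.tolist(), top_k=5,
                      include_metadata=True, namespace="default")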

Latency Monitoring

The Streamlit app includes a built-in latency monitor in the sidebar that tracks every query. It breaks down total latency into four stages:

Stage      What it measures
--------   --------------------------------------------
Encode     Time to embed the query
Retrieve   Time to search Pinecone
Policy     Time for the RL policy to make decisions
LLM        Time for the LLM to generate the full answer

After two or more queries, a trend chart appears so you can spot patterns or regressions. The monitor also shows a running average and a delta compared to the previous query.
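
For reference, one simple way to collect per-stage timings like these (illustrative; not necessarily how app.py implements it):

# Time each pipeline stage with a small context manager.
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Usage inside the pipeline:
#   with stage("encode"):   query_emb = encoder.encode(query)
#   with stage("retrieve"): hits = store.query(query_emb)
#   with stage("policy"):   decisions = policy.decide(query_emb, hits)
#   with stage("llm"):      answer = decoder.generate(context)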

Performance Metrics

Each query response includes:

  • Latency - end-to-end time with per-stage breakdown
  • Token savings - percentage of context tokens saved by selective decompression
  • Decompression ratio - how many chunks were fully decompressed vs. kept compressed
  • Chunks retrieved - number of chunks pulled from Pinecone
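
Reading these off a query result might look like the following; the exact key names are assumptions, since only result["answer"] appears in the Python API example above:

# Hypothetical keys; only "answer" is confirmed by the API example.
result = pipeline.query("What are the benefits of RAG?")
print(result["answer"])
print(result.get("latency"))              # end-to-end time with stage breakdown
print(result.get("token_savings"))        # % of context tokens saved
print(result.get("decompression_ratio"))  # decompressed vs. kept compressed
print(result.get("chunks_retrieved"))     # chunks pulled from Pinecone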

Contributing

Contributions are welcome. Feel free to open an issue or submit a pull request.

License

MIT License - see LICENSE file for details.
