A RAG (Retrieval-Augmented Generation) system that uses reinforcement learning to decide which retrieved chunks actually need full decompression before being sent to the LLM. The idea is simple: not every chunk is equally important, so why waste tokens on all of them?
REFRAG works in three stages:
1. **Compress** - Documents get chunked and turned into vector embeddings using a sentence-transformer model. These compressed representations are stored in Pinecone and used for retrieval. Think of it as pre-processing your documents into a format that's cheap to search and compare.
2. **Perceive** - Once relevant chunks are retrieved, a PPO-trained reinforcement learning policy looks at each one and decides: does this chunk need to be fully decompressed (i.e., sent as raw text to the LLM), or can it stay compressed? The goal is to keep the important stuff intact while cutting down on unnecessary context.
3. **Expand** - The LLM (served via Groq) receives a hybrid context: fully decompressed chunks for the most relevant content, and compressed representations for the rest. This keeps the token count down without sacrificing answer quality.

```
+--------------------------------------------------------------------+
|                          REFRAG Pipeline                            |
+--------------------------------------------------------------------+
|                                                                      |
|  +-------------+      +-------------+      +---------------------+  |
|  |  COMPRESS   |      |  PERCEIVE   |      |       EXPAND        |  |
|  |             |      |             |      |                     |  |
|  |  Sentence   | ---> |  RL Policy  | ---> |      Groq LLM       |  |
|  | Transformer |      |    (PPO)    |      |                     |  |
|  +-------------+      +-------------+      +---------------------+  |
|         |                    |                        |             |
|         v                    v                        v             |
|  +-------------+      +-------------+          +-------------+      |
|  |   Chunk     |      | Decompress  |          |   Hybrid    |      |
|  | Embeddings  |      |  Decisions  |          |   Context   |      |
|  +-------------+      +-------------+          +-------------+      |
|         |                                                            |
|         v                                                            |
|  +------------------------------------------+                       |
|  |            Pinecone Vector DB            |                        |
|  +------------------------------------------+                        |
|                                                                      |
+--------------------------------------------------------------------+
```
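
To make the three stages concrete, here is a minimal sketch of how a query could flow through them. Everything below is illustrative: the object names (`encoder`, `store`, `policy`, `llm`) and the truncated-text placeholder for compressed chunks are assumptions of this sketch, not the project's actual API (see `src/pipeline.py` for the real orchestration).

```python
# Illustrative REFRAG flow: names and the placeholder format are assumptions, not the real API.
def answer(query: str, encoder, store, policy, llm, top_k: int = 8) -> str:
    # COMPRESS: embed the query with the same sentence-transformer used for the chunks.
    query_vec = encoder.encode_query(query)

    # Retrieve candidate chunks from Pinecone (vectors plus the original text in metadata).
    chunks = store.query(query_vec, top_k=top_k)

    # PERCEIVE: the RL policy flags which chunks deserve full decompression.
    keep_raw = policy.decide(query_vec, chunks)  # e.g. [True, False, True, ...]

    # EXPAND: decompressed chunks contribute raw text; the rest only a cheap compressed stand-in.
    parts = []
    for chunk, raw in zip(chunks, keep_raw):
        if raw:
            parts.append(chunk["text"])
        else:
            parts.append("[compressed] " + chunk["text"][:80] + "...")
    context = "\n\n".join(parts)

    return llm.generate(question=query, context=context)
```
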
To get the system running locally:

- Clone and navigate to the project:

  ```bash
  cd RAG-system
  ```

- Set up a virtual environment:

  ```bash
  python -m venv .venv

  # Windows
  .venv\Scripts\activate

  # Linux/Mac
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your environment variables:

  ```bash
  cp .env.example .env
  ```

  Then open `.env` and fill in your API keys. You'll need the following in your `.env` file:

  ```env
  PINECONE_API_KEY=your_pinecone_api_key
  PINECONE_INDEX_NAME=refrag-index
  PINECONE_ENVIRONMENT=us-east-1
  GROQ_API_KEY=your_groq_api_key
  GROQ_MODEL=openai/gpt-oss-120b
  ENCODER_MODEL=BAAI/bge-m3
  EMBEDDING_DIMENSION=1024
  ```

Once the environment is configured, you can use the pipeline directly from Python:

```python
from src import Pipeline

pipeline = Pipeline()
# Index some documents
documents = [
{"id": "doc1", "text": "Your document content here..."},
{"id": "doc2", "text": "Another document..."}
]
pipeline.index_documents(documents)
# Query
result = pipeline.query("What is machine learning?")
print(result["answer"])
# Stream the response
for token in pipeline.query_stream("Explain neural networks"):
    print(token, end="", flush=True)
```

The CLI exposes the same operations:

```bash
# Index a single file
python main.py index --file document.txt
# Index all .txt files in a directory
python main.py index --dir ./documents
# Query (one-shot)
python main.py query "What are the benefits of RAG?"
# Query with streaming output
python main.py query "Explain neural networks" --stream
# Skip the RL policy and use heuristic scoring instead
python main.py query "What is deep learning?" --no-rl
# Check system stats (vector count, model info, etc.)
python main.py stats
```

There's a full Streamlit app for interactive use:

```bash
streamlit run app.py
```

The web interface lets you:
- Upload and index documents (txt, md, json, jsonl, csv, xlsx, and more)
- Chat with your documents using streaming responses
- Toggle the RL policy on/off
- View per-query stats (chunks retrieved, decompression ratio, token savings)
- Monitor latency in real time via the sidebar, with a breakdown by pipeline stage (encode, retrieve, RL policy, LLM) and a trend chart across queries

The project is organized as follows:

```
RAG-system/
├── config/
│   ├── __init__.py
│   └── settings.py                # Pydantic-based config, reads from .env
├── src/
│   ├── __init__.py
│   ├── pipeline.py                # Main pipeline orchestration
│   ├── compress/
│   │   ├── __init__.py
│   │   ├── encoder.py             # Sentence-transformer encoder
│   │   └── chunker.py             # Text chunking
│   ├── perceive/
│   │   ├── __init__.py
│   │   ├── policy.py              # PPO-based RL policy
│   │   ├── environment.py         # Gymnasium environment for training
│   │   └── importance_scorer.py   # Heuristic fallback scorer
│   ├── expand/
│   │   ├── __init__.py
│   │   ├── decoder.py             # Groq LLM decoder
│   │   └── hybrid_context.py      # Builds mixed context for the LLM
│   └── vectordb/
│       ├── __init__.py
│       └── pinecone_store.py      # Pinecone vector store wrapper
├── app.py                         # Streamlit web UI
├── main.py                        # CLI entry point
├── data.json                      # Sample data
├── requirements.txt
├── .env.example
└── README.md
```

**Encoder (`src/compress/encoder.py`)**

- Wraps sentence-transformers (default: BGE-M3, 1024-dim)
- Handles chunking, batch encoding, and query embedding
- Uses GPU when available
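
For orientation, an encoder along these lines can be built directly on `sentence-transformers`; the class below is a minimal sketch with assumed method names, not the exact contents of `encoder.py`.

```python
import torch
from sentence_transformers import SentenceTransformer

class ChunkEncoder:
    """Minimal sketch of a BGE-M3 encoder (illustrative, not the repo's actual class)."""

    def __init__(self, model_name: str = "BAAI/bge-m3"):
        # Use the GPU when one is available, otherwise fall back to CPU.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = SentenceTransformer(model_name, device=device)

    def encode_chunks(self, texts: list[str]) -> list[list[float]]:
        # Batch-encode chunk texts into normalized 1024-dim vectors.
        return self.model.encode(texts, batch_size=32, normalize_embeddings=True).tolist()

    def encode_query(self, query: str) -> list[float]:
        return self.model.encode([query], normalize_embeddings=True)[0].tolist()
```
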

**RL policy (`src/perceive/policy.py`)**

- PPO policy (via stable-baselines3) that decides which chunks to decompress
- Falls back to a cosine-similarity-based importance scorer when no trained model is loaded
- Configurable max decompression count
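
The fallback path is easy to picture: score each retrieved chunk by cosine similarity to the query and decompress only the top few, up to the configured budget. The sketch below shows that heuristic only (the function name and default budget are assumptions); the trained PPO policy replaces the scoring step with a learned decision.

```python
import numpy as np

def heuristic_decompress_flags(query_vec, chunk_vecs, max_decompress: int = 3) -> list[bool]:
    """Cosine-similarity fallback: flag the most query-relevant chunks for full decompression."""
    q = np.asarray(query_vec, dtype=np.float32)
    c = np.asarray(chunk_vecs, dtype=np.float32)

    # Cosine similarity between the query and every chunk embedding.
    scores = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q) + 1e-9)

    # Decompress only the top-scoring chunks, respecting the max decompression count.
    top = np.argsort(scores)[::-1][:max_decompress]
    flags = np.zeros(len(scores), dtype=bool)
    flags[top] = True
    return flags.tolist()
```
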

**Decoder (`src/expand/decoder.py`)**

- Calls the Groq API for fast LLM inference
- Supports both one-shot and streaming generation
- Tracks token usage and latency per request
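
Under the hood this is a standard Groq chat-completions call; the sketch below shows the general shape for the one-shot case (the prompt wording and function name are placeholders, and streaming would pass `stream=True` and iterate over the returned chunks instead).

```python
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def generate_answer(question: str, context: str, model: str = "openai/gpt-oss-120b") -> str:
    # One-shot generation over the hybrid context built by the expand stage.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```
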

**Vector store (`src/vectordb/pinecone_store.py`)**

- Serverless Pinecone integration
- Handles upsert, query, and stats
- Supports namespaces for logical partitioning
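
The wrapper boils down to the standard Pinecone client calls; a rough sketch (the ids, vectors, and namespace here are placeholders) looks like this:

```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ.get("PINECONE_INDEX_NAME", "refrag-index"))

# Upsert chunk embeddings, keeping the raw text in metadata so it can be "decompressed" later.
index.upsert(
    vectors=[{"id": "doc1-chunk0", "values": [0.1] * 1024, "metadata": {"text": "..."}}],
    namespace="default",
)

# Query by vector; matched metadata comes back alongside the similarity scores.
results = index.query(vector=[0.1] * 1024, top_k=5, include_metadata=True, namespace="default")
```
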
The Streamlit app includes a built-in latency monitor in the sidebar that tracks every query. It breaks down total latency into four stages:
| Stage | What it measures |
|---|---|
| Encode | Time to embed the query |
| Retrieve | Time to search Pinecone |
| Policy | Time for the RL policy to make decisions |
| LLM | Time for the LLM to generate the full answer |
After two or more queries, a trend chart appears so you can spot patterns or regressions. The monitor also shows a running average and a delta compared to the previous query.
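
Conceptually, the breakdown is just wall-clock timing taken around each pipeline step, roughly like the sketch below (the stage method names and dictionary keys are assumptions that mirror the sidebar labels).

```python
import time

def timed_query(pipeline, question: str) -> dict:
    """Illustrative per-stage latency breakdown matching the sidebar's four stages."""
    timings = {}
    start = time.perf_counter()

    t = time.perf_counter()
    query_vec = pipeline.encode(question)                     # Encode
    timings["encode"] = time.perf_counter() - t

    t = time.perf_counter()
    chunks = pipeline.retrieve(query_vec)                     # Retrieve
    timings["retrieve"] = time.perf_counter() - t

    t = time.perf_counter()
    decisions = pipeline.decide(query_vec, chunks)            # Policy
    timings["policy"] = time.perf_counter() - t

    t = time.perf_counter()
    answer = pipeline.generate(question, chunks, decisions)   # LLM
    timings["llm"] = time.perf_counter() - t

    timings["total"] = time.perf_counter() - start
    return {"answer": answer, "latency": timings}
```
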
Each query response includes:
- Latency - end-to-end time with per-stage breakdown
- Token savings - percentage of context tokens saved by selective decompression
- Decompression ratio - how many chunks were fully decompressed vs. kept compressed
- Chunks retrieved - number of chunks pulled from Pinecone
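
The two ratio metrics fall out directly from the policy's decisions and the token counts; here is a small sketch with hypothetical field names:

```python
def context_stats(chunks, decompress_flags, full_tokens: int, hybrid_tokens: int) -> dict:
    """Hypothetical computation of the per-query stats reported by the UI."""
    decompressed = sum(decompress_flags)
    return {
        "chunks_retrieved": len(chunks),
        # Fraction of retrieved chunks that were fully decompressed.
        "decompression_ratio": decompressed / max(len(chunks), 1),
        # Context tokens saved relative to sending every chunk as raw text.
        "token_savings_pct": 100.0 * (1 - hybrid_tokens / max(full_tokens, 1)),
    }
```
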
Contributions are welcome. Feel free to open an issue or submit a pull request.
MIT License - see LICENSE file for details.