A RAG (Retrieval-Augmented Generation) system that uses reinforcement learning to decide which retrieved chunks actually need full decompression before being sent to the LLM. The idea is simple: not every chunk is equally important, so why waste tokens on all of them?
REFRAG works in three stages:
1. **Compress** - Documents get chunked and turned into vector embeddings using a sentence-transformer model. These compressed representations are stored in Pinecone and used for retrieval. Think of it as pre-processing your documents into a format that's cheap to search and compare.
2. **Perceive** - Once relevant chunks are retrieved, a PPO-trained reinforcement learning policy looks at each one and decides: does this chunk need to be fully decompressed (i.e., sent as raw text to the LLM), or can it stay compressed? The goal is to keep the important stuff intact while cutting down on unnecessary context.
3. **Expand** - The LLM (served via Groq) receives a hybrid context: fully decompressed chunks for the most relevant content, and compressed representations for the rest. This keeps the token count down without sacrificing answer quality.

```
+--------------------------------------------------------------------+
|                          REFRAG Pipeline                            |
+--------------------------------------------------------------------+
|                                                                      |
|  +-------------+      +-------------+      +---------------------+  |
|  |  COMPRESS   |      |  PERCEIVE   |      |       EXPAND        |  |
|  |             |      |             |      |                     |  |
|  |  Sentence   | ---> |  RL Policy  | ---> |      Groq LLM       |  |
|  | Transformer |      |    (PPO)    |      |                     |  |
|  +-------------+      +-------------+      +---------------------+  |
|         |                    |                        |             |
|         v                    v                        v             |
|  +-------------+      +-------------+          +-------------+      |
|  |   Chunk     |      | Decompress  |          |   Hybrid    |      |
|  | Embeddings  |      |  Decisions  |          |   Context   |      |
|  +-------------+      +-------------+          +-------------+      |
|         |                                                            |
|         v                                                            |
|  +------------------------------------------+                       |
|  |            Pinecone Vector DB            |                        |
|  +------------------------------------------+                        |
|                                                                      |
+--------------------------------------------------------------------+
```
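
To make the three stages concrete, here is a minimal sketch of how a query could flow through them. Everything below is illustrative: the object names (`encoder`, `store`, `policy`, `llm`) and the truncated-text placeholder for compressed chunks are assumptions of this sketch, not the project's actual API (see `src/pipeline.py` for the real orchestration).

```python
# Illustrative REFRAG flow: names and the placeholder format are assumptions, not the real API.
def answer(query: str, encoder, store, policy, llm, top_k: int = 8) -> str:
    # COMPRESS: embed the query with the same sentence-transformer used for the chunks.
    query_vec = encoder.encode_query(query)

    # Retrieve candidate chunks from Pinecone (vectors plus the original text in metadata).
    chunks = store.query(query_vec, top_k=top_k)

    # PERCEIVE: the RL policy flags which chunks deserve full decompression.
    keep_raw = policy.decide(query_vec, chunks)  # e.g. [True, False, True, ...]

    # EXPAND: decompressed chunks contribute raw text; the rest only a cheap compressed stand-in.
    parts = []
    for chunk, raw in zip(chunks, keep_raw):
        if raw:
            parts.append(chunk["text"])
        else:
            parts.append("[compressed] " + chunk["text"][:80] + "...")
    context = "\n\n".join(parts)

    return llm.generate(question=query, context=context)
```
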
To get the system running locally:

- Clone and navigate to the project:

  ```bash
  cd RAG-system
  ```

- Set up a virtual environment:

  ```bash
  python -m venv .venv

  # Windows
  .venv\Scripts\activate

  # Linux/Mac
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your environment variables:

  ```bash
  cp .env.example .env
  ```

  Then open `.env` and fill in your API keys. You'll need the following in your `.env` file:

  ```env
  PINECONE_API_KEY=your_pinecone_api_key
  PINECONE_INDEX_NAME=refrag-index
  PINECONE_ENVIRONMENT=us-east-1
  GROQ_API_KEY=your_groq_api_key
  GROQ_MODEL=openai/gpt-oss-120b
  ENCODER_MODEL=BAAI/bge-m3
  EMBEDDING_DIMENSION=1024
  ```

Once the environment is configured, you can use the pipeline directly from Python:

```python
from src import Pipeline

pipeline = Pipeline()
# Index some documents
documents = [
{"id": "doc1", "text": "Your document content here..."},
{"id": "doc2", "text": "Another document..."}
]
pipeline.index_documents(documents)
# Query
result = pipeline.query("What is machine learning?")
print(result["answer"])
# Stream the response
for token in pipeline.query_stream("Explain neural networks"):
    print(token, end="", flush=True)
```

The CLI exposes the same operations:

```bash
# Index a single file
python main.py index --file document.txt
# Index all .txt files in a directory
python main.py index --dir ./documents
# Query (one-shot)
python main.py query "What are the benefits of RAG?"
# Query with streaming output
python main.py query "Explain neural networks" --stream
# Skip the RL policy and use heuristic scoring instead
python main.py query "What is deep learning?" --no-rl
# Check system stats (vector count, model info, etc.)
python main.py stats
```

There's a full Streamlit app for interactive use:

```bash
streamlit run app.py
```

The web interface lets you:
- Upload and index documents (txt, md, json, jsonl, csv, xlsx, and more)
- Chat with your documents using streaming responses
- Toggle the RL policy on/off
- View per-query stats (chunks retrieved, decompression ratio, token savings)
- Monitor latency in real time via the sidebar, with a breakdown by pipeline stage (encode, retrieve, RL policy, LLM) and a trend chart across queries

The project is organized as follows:

```
RAG-system/
├── config/
│   ├── __init__.py
│   └── settings.py                # Pydantic-based config, reads from .env
├── src/
│   ├── __init__.py
│   ├── pipeline.py                # Main pipeline orchestration
│   ├── compress/
│   │   ├── __init__.py
│   │   ├── encoder.py             # Sentence-transformer encoder
│   │   └── chunker.py             # Text chunking
│   ├── perceive/
│   │   ├── __init__.py
│   │   ├── policy.py              # PPO-based RL policy
│   │   ├── environment.py         # Gymnasium environment for training
│   │   └── importance_scorer.py   # Heuristic fallback scorer
│   ├── expand/
│   │   ├── __init__.py
│   │   ├── decoder.py             # Groq LLM decoder
│   │   └── hybrid_context.py      # Builds mixed context for the LLM
│   └── vectordb/
│       ├── __init__.py
│       └── pinecone_store.py      # Pinecone vector store wrapper
├── app.py                         # Streamlit web UI
├── main.py                        # CLI entry point
├── data.json                      # Sample data
├── requirements.txt
├── .env.example
└── README.md
```

**Encoder (`src/compress/encoder.py`)**

- Wraps sentence-transformers (default: BGE-M3, 1024-dim)
- Handles chunking, batch encoding, and query embedding
- Uses GPU when available
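
For orientation, an encoder along these lines can be built directly on `sentence-transformers`; the class below is a minimal sketch with assumed method names, not the exact contents of `encoder.py`.

```python
import torch
from sentence_transformers import SentenceTransformer

class ChunkEncoder:
    """Minimal sketch of a BGE-M3 encoder (illustrative, not the repo's actual class)."""

    def __init__(self, model_name: str = "BAAI/bge-m3"):
        # Use the GPU when one is available, otherwise fall back to CPU.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = SentenceTransformer(model_name, device=device)

    def encode_chunks(self, texts: list[str]) -> list[list[float]]:
        # Batch-encode chunk texts into normalized 1024-dim vectors.
        return self.model.encode(texts, batch_size=32, normalize_embeddings=True).tolist()

    def encode_query(self, query: str) -> list[float]:
        return self.model.encode([query], normalize_embeddings=True)[0].tolist()
```
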

**RL policy (`src/perceive/policy.py`)**

- PPO policy (via stable-baselines3) that decides which chunks to decompress
- Falls back to a cosine-similarity-based importance scorer when no trained model is loaded
- Configurable max decompression count
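
The fallback path is easy to picture: score each retrieved chunk by cosine similarity to the query and decompress only the top few, up to the configured budget. The sketch below shows that heuristic only (the function name and default budget are assumptions); the trained PPO policy replaces the scoring step with a learned decision.

```python
import numpy as np

def heuristic_decompress_flags(query_vec, chunk_vecs, max_decompress: int = 3) -> list[bool]:
    """Cosine-similarity fallback: flag the most query-relevant chunks for full decompression."""
    q = np.asarray(query_vec, dtype=np.float32)
    c = np.asarray(chunk_vecs, dtype=np.float32)

    # Cosine similarity between the query and every chunk embedding.
    scores = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q) + 1e-9)

    # Decompress only the top-scoring chunks, respecting the max decompression count.
    top = np.argsort(scores)[::-1][:max_decompress]
    flags = np.zeros(len(scores), dtype=bool)
    flags[top] = True
    return flags.tolist()
```
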

**Decoder (`src/expand/decoder.py`)**

- Calls the Groq API for fast LLM inference
- Supports both one-shot and streaming generation
- Tracks token usage and latency per request
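
Under the hood this is a standard Groq chat-completions call; the sketch below shows the general shape for the one-shot case (the prompt wording and function name are placeholders, and streaming would pass `stream=True` and iterate over the returned chunks instead).

```python
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def generate_answer(question: str, context: str, model: str = "openai/gpt-oss-120b") -> str:
    # One-shot generation over the hybrid context built by the expand stage.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```
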

**Vector store (`src/vectordb/pinecone_store.py`)**

- Serverless Pinecone integration
- Handles upsert, query, and stats
- Supports namespaces for logical partitioning
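
The wrapper boils down to the standard Pinecone client calls; a rough sketch (the ids, vectors, and namespace here are placeholders) looks like this:

```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ.get("PINECONE_INDEX_NAME", "refrag-index"))

# Upsert chunk embeddings, keeping the raw text in metadata so it can be "decompressed" later.
index.upsert(
    vectors=[{"id": "doc1-chunk0", "values": [0.1] * 1024, "metadata": {"text": "..."}}],
    namespace="default",
)

# Query by vector; matched metadata comes back alongside the similarity scores.
results = index.query(vector=[0.1] * 1024, top_k=5, include_metadata=True, namespace="default")
```
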
The Streamlit app includes a built-in latency monitor in the sidebar that tracks every query. It breaks down total latency into four stages:
| Stage | What it measures |
|---|---|
| Encode | Time to embed the query |
| Retrieve | Time to search Pinecone |
| Policy | Time for the RL policy to make decisions |
| LLM | Time for the LLM to generate the full answer |
After two or more queries, a trend chart appears so you can spot patterns or regressions. The monitor also shows a running average and a delta compared to the previous query.
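
Conceptually, the breakdown is just wall-clock timing taken around each pipeline step, roughly like the sketch below (the stage method names and dictionary keys are assumptions that mirror the sidebar labels).

```python
import time

def timed_query(pipeline, question: str) -> dict:
    """Illustrative per-stage latency breakdown matching the sidebar's four stages."""
    timings = {}
    start = time.perf_counter()

    t = time.perf_counter()
    query_vec = pipeline.encode(question)                     # Encode
    timings["encode"] = time.perf_counter() - t

    t = time.perf_counter()
    chunks = pipeline.retrieve(query_vec)                     # Retrieve
    timings["retrieve"] = time.perf_counter() - t

    t = time.perf_counter()
    decisions = pipeline.decide(query_vec, chunks)            # Policy
    timings["policy"] = time.perf_counter() - t

    t = time.perf_counter()
    answer = pipeline.generate(question, chunks, decisions)   # LLM
    timings["llm"] = time.perf_counter() - t

    timings["total"] = time.perf_counter() - start
    return {"answer": answer, "latency": timings}
```
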
Each query response includes:
- Latency - end-to-end time with per-stage breakdown
- Token savings - percentage of context tokens saved by selective decompression
- Decompression ratio - how many chunks were fully decompressed vs. kept compressed
- Chunks retrieved - number of chunks pulled from Pinecone
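
The two ratio metrics fall out directly from the policy's decisions and the token counts; here is a small sketch with hypothetical field names:

```python
def context_stats(chunks, decompress_flags, full_tokens: int, hybrid_tokens: int) -> dict:
    """Hypothetical computation of the per-query stats reported by the UI."""
    decompressed = sum(decompress_flags)
    return {
        "chunks_retrieved": len(chunks),
        # Fraction of retrieved chunks that were fully decompressed.
        "decompression_ratio": decompressed / max(len(chunks), 1),
        # Context tokens saved relative to sending every chunk as raw text.
        "token_savings_pct": 100.0 * (1 - hybrid_tokens / max(full_tokens, 1)),
    }
```
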
Contributions are welcome. Feel free to open an issue or submit a pull request.
MIT License - see LICENSE file for details.