A comprehensive evaluation system for a chat-with-research-paper RAG (Retrieval-Augmented Generation) application, built on synthetic question-answer pairs generated from PDF documents.
- Synthetic Data Generation: Auto-generates QA pairs from PDFs using GPT-4o-mini
- RAG System Evaluation: Tests your RAG system against generated QA pairs
- Comprehensive Metrics: Retrieval accuracy, answer quality, and response time
- Multiple Chunking Strategies: Fixed character, recursive (LangChain), and semantic chunking
Directory structure:
```
evals/
├── data/                          # PDF documents for evaluation
│   └── *.pdf                      # Your PDF files
├── output/                        # Generated datasets and results
│   ├── synthetic_qa_dataset.json  # Full synthetic dataset
│   ├── synthetic_qa_quick.json    # Quick evaluation dataset
│   └── evaluation_results.json    # Evaluation results
├── synthetic_data.py              # QA pair generation
├── evaluate_rag.py                # RAG system evaluation
├── run_evaluation.py              # Complete pipeline runner
└── README.md                      # This file
```
Install dependencies from the existing lock file:
```bash
uv sync
```

Create a `.env` file:

```
OPENAI_API_KEY=your_openai_api_key_here
DATABASE_URL=your_postgresql_connection_string
```

Quick generation (5 chunks per doc, 2 questions per chunk):
```bash
uv run evals/synthetic_data.py --quick
```

Full generation (20 chunks per doc, 3 questions per chunk):
```bash
uv run evals/synthetic_data.py --full
```

With a specific chunker:
```bash
# Recursive chunking (default, uses LangChain)
uv run evals/synthetic_data.py --quick --chunker recursive

# Fixed character chunking
uv run evals/synthetic_data.py --quick --chunker fixed_character

# Semantic chunking
uv run evals/synthetic_data.py --quick --chunker semantic
```

Quick evaluation:
```bash
uv run evals/run_evaluation.py --quick
```

Full evaluation:
```bash
uv run evals/run_evaluation.py --full
```

With additional options:
```bash
# Force regenerate QA pairs even if they exist
uv run evals/run_evaluation.py --quick --force-regenerate

# Use specific chunker type
uv run evals/run_evaluation.py --quick --chunker semantic

# Set number of top documents to retrieve
uv run evals/run_evaluation.py --quick --top-k 5

# Ingest documents into vector store during evaluation
uv run evals/run_evaluation.py --quick --ingest

# Combine multiple options
uv run evals/run_evaluation.py --full --chunker recursive --top-k 5 --force-regenerate --ingest
```

The `run_evaluation.py` script supports the following arguments:
| Argument | Type | Default | Description |
|---|---|---|---|
| `--quick` | flag | - | Run quick evaluation with limited data (1 chunk per doc, 1 question per chunk) |
| `--full` | flag | - | Run full evaluation with comprehensive data (20 chunks per doc, 3 questions per chunk) |
| `--chunker` | string | `fixed_character` | Type of chunker to use: `fixed_character`, `recursive`, or `semantic` |
| `--force-regenerate` | flag | - | Force regeneration of synthetic QA pairs even if they already exist |
| `--top-k` | integer | 3 | Number of top documents to retrieve during RAG evaluation |
| `--ingest` | flag | - | Ingest documents into vector store during evaluation |
Note: If neither `--quick` nor `--full` is specified, the script defaults to quick evaluation.
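For reference, the flags in the table above map onto an argument parser roughly like the sketch below; this is an assumption based on the documented options, not the actual `run_evaluation.py` source.

```python
# Hypothetical sketch of the CLI described above; the real parser wiring may differ.
import argparse

parser = argparse.ArgumentParser(description="Run the RAG evaluation pipeline.")
parser.add_argument("--quick", action="store_true",
                    help="Quick evaluation (1 chunk per doc, 1 question per chunk)")
parser.add_argument("--full", action="store_true",
                    help="Full evaluation (20 chunks per doc, 3 questions per chunk)")
parser.add_argument("--chunker", default="fixed_character",
                    choices=["fixed_character", "recursive", "semantic"])
parser.add_argument("--force-regenerate", action="store_true")
parser.add_argument("--top-k", type=int, default=3)
parser.add_argument("--ingest", action="store_true")

args = parser.parse_args()
quick_mode = args.quick or not args.full  # neither flag given -> quick evaluation (per the note above)
```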
Synthetic data generation (`synthetic_data.py`):
- Text Extraction: Extracts text from PDF files
- Text Chunking: Splits text using configurable strategies (fixed, recursive, semantic)
- QA Generation: Uses GPT-4o-mini to generate diverse question-answer pairs
- Dataset Creation: Saves structured QA pairs with metadata
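To make the QA-generation step concrete, a single call to GPT-4o-mini might look like the sketch below. The prompt wording and the `generate_qa_pairs` helper are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch of generating QA pairs from one text chunk with GPT-4o-mini.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa_pairs(chunk: str, n_questions: int = 2) -> str:
    """Ask the model for question-answer pairs answerable from this chunk alone."""
    prompt = (
        f"Generate {n_questions} question-answer pairs that can be answered "
        f"solely from the following text:\n\n{chunk}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```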
RAG evaluation (`evaluate_rag.py`):
- Question Processing: Sends questions to your RAG system
- Answer Quality Assessment: Uses GPT-4o-mini to evaluate answer quality
- Retrieval Accuracy: Measures context retrieval performance
- Performance Metrics: Calculates response times and overall scores
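Conceptually, the evaluation loop resembles the following sketch; `rag_system.query` and `judge_answer` are hypothetical stand-ins for the project's real interfaces, and the retrieval check shown is a deliberately crude substring match.

```python
# Illustrative evaluation loop: query the RAG system, time it, and score the answer.
import time

def evaluate(qa_pairs, rag_system, judge_answer):
    results = []
    for pair in qa_pairs:
        start = time.perf_counter()
        answer, retrieved_docs = rag_system.query(pair["question"])  # assumed interface
        elapsed = time.perf_counter() - start
        results.append({
            "question": pair["question"],
            "answer_quality": judge_answer(pair["answer"], answer),  # LLM-as-judge score
            "retrieval_hit": any(pair["source_chunk"] in doc for doc in retrieved_docs),
            "response_time_s": elapsed,
        })
    return results
```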
Metrics:
- Retrieval Score: How well the system finds relevant information
- Answer Quality Score: Accuracy and completeness of answers
- Response Time: System performance
- Question Type Breakdown: Performance across different question types
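One plausible way to roll these metrics into a single number is shown below; the equal weighting of retrieval and answer quality is an assumption, not the project's documented formula.

```python
# Hypothetical aggregation of per-question scores into one overall score in [0, 1].
def overall_score(retrieval_scores: list[float], answer_scores: list[float]) -> float:
    retrieval = sum(retrieval_scores) / len(retrieval_scores)
    quality = sum(answer_scores) / len(answer_scores)
    return 0.5 * retrieval + 0.5 * quality  # assumed 50/50 weighting
```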
Fixed character chunking (`fixed_character`):
- Splits text into fixed-size chunks
- Tries to break at sentence boundaries
- Fastest and most predictable
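A minimal sketch of this strategy, assuming a simple "back up to the last sentence end" rule; the real chunker's parameters and edge-case handling may differ.

```python
# Fixed-character chunking that prefers to break at sentence boundaries.
def fixed_character_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        boundary = text.rfind(". ", start, end)  # last sentence end inside the window
        if boundary > start:
            end = boundary + 1
        chunks.append(text[start:end].strip())
        start = end
    return chunks
```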
Recursive chunking (`recursive`):
- Uses LangChain's `RecursiveCharacterTextSplitter`
- Applies separators in order: paragraphs → lines → sentences → words → characters
- Enhanced multilingual support
- Recommended for most use cases
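A minimal usage example of the splitter named above; the chunk size, overlap, and separator list are illustrative values, and depending on your LangChain version the import may be `langchain.text_splitter` instead.

```python
# Example use of LangChain's RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],  # paragraphs -> lines -> sentences -> words -> characters
)

document_text = "Your extracted PDF text goes here."
chunks = splitter.split_text(document_text)
```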
Semantic chunking (`semantic`):
- Uses embeddings to find semantic break points
- Maintains semantic coherence within chunks
- Most computationally expensive
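A rough sketch of the idea, assuming sentence-level OpenAI embeddings and a fixed similarity threshold; the actual semantic chunker may pick break points differently.

```python
# Illustrative semantic chunking: start a new chunk where adjacent-sentence similarity drops.
import numpy as np
from openai import OpenAI

client = OpenAI()

def semantic_chunks(sentences: list[str], threshold: float = 0.75) -> list[str]:
    response = client.embeddings.create(model="text-embedding-3-small", input=sentences)
    vectors = [np.array(item.embedding) for item in response.data]
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        similarity = float(prev @ vec / (np.linalg.norm(prev) * np.linalg.norm(vec)))
        if similarity < threshold:  # semantic break point
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```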
Output files:
- `synthetic_qa_dataset.json`: Complete dataset with all QA pairs
- `synthetic_qa_quick.json`: Smaller dataset for quick testing
- `evaluation_results.json`: Detailed evaluation results
Score interpretation:
- EXCELLENT (≥0.8): Outstanding performance
- GOOD (≥0.6): Solid performance with room for improvement
- FAIR (≥0.4): Acceptable but needs optimization
- NEEDS IMPROVEMENT (<0.4): Significant improvements required
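Applying these bands to a result file could look like the sketch below; the `overall_score` field name is an assumption about the JSON layout.

```python
# Map an overall score onto the rating bands listed above.
import json

def rating(score: float) -> str:
    if score >= 0.8:
        return "EXCELLENT"
    if score >= 0.6:
        return "GOOD"
    if score >= 0.4:
        return "FAIR"
    return "NEEDS IMPROVEMENT"

with open("evals/output/evaluation_results.json") as fh:
    results = json.load(fh)
print(rating(results["overall_score"]))  # hypothetical field name
```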
Common issues:
- OpenAI API Errors: Check API key, credits, and rate limits
- PDF Processing: Ensure PDFs are text-based (not scanned images)
- Database Connection: Verify PostgreSQL connection and pgvector extension
- Semantic Chunker: Requires OpenAI API key, more expensive
For more detailed output, enable debug logging in the scripts:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

Or run with debug logging:

```bash
uv run evals/run_evaluation.py --quick --chunker semantic
```