A comprehensive evaluation framework for comparing baseline LLM performance against Retrieval-Augmented Generation (RAG) systems on ETH Zurich-specific questions.
This project evaluates various Large Language Models (LLMs) on their ability to answer institution-specific questions from the ETH Zurich web archive. The evaluation compares:
- Baseline Performance: Models without RAG (parametric knowledge only)
- RAG Performance: Models with access to retrieved ETH documents (non-parametric + parametric knowledge)
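To make the distinction concrete, here is a minimal sketch of how the two conditions differ at prompt-construction time: the baseline prompt contains only the question, while the RAG prompt prepends retrieved ETH documents as context. The prompt wording and the helper names are illustrative assumptions, not the templates used by the actual pipeline in src/warc_tools/rag/.

```python
# Illustrative sketch of the two evaluation conditions (prompt templates are assumptions).

def build_baseline_prompt(question):
    # Baseline: the model must answer from parametric knowledge alone.
    return f"Answer the following question about ETH Zurich.\n\nQuestion: {question}"

def build_rag_prompt(question, retrieved_docs):
    # RAG: retrieved ETH web-archive passages are supplied as non-parametric context.
    context = "\n\n".join(f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "Answer the following question about ETH Zurich using the provided documents.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```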
- Apertus-8B: swiss-ai/Apertus-8B-Instruct-2509 (self-hosted on CSCS)
- Qwen3-8B: Qwen/Qwen3-8B (self-hosted on CSCS)
- Claude Sonnet 4.5: Anthropic (manual collection via official interface)
- GPT-5.2: OpenAI (manual collection via official interface)
ml4science-apertus-rag-evaluation/
├── docs/ # Documentation
│ ├── evaluation_method.md # Detailed evaluation methodology
│ ├── baseline_evaluation.md # Baseline results summary
│ ├── rag_vs_baseline_report.md # RAG vs Baseline comparison
│ ├── rag_evaluation_plan.md # RAG evaluation methodology
│ └── plot_analysis.md # Plot analysis and insights
├── scripts/ # Evaluation scripts
│ ├── run_evaluation.py # Run baseline evaluation
│ ├── score_responses.py # Score baseline responses (LLM-as-Judge)
│ ├── run_rag_evaluation.py # Run RAG evaluation
│ ├── score_rag_responses.py # Score RAG responses (LLM-as-Judge)
│ ├── compare_models.py # Compare baseline models
│ ├── compare_baseline_rag.py # Compare baseline vs RAG
│ └── generate_improved_plots.py # Generate visualization plots
├── prompts/ # Judge prompts
│ ├── judge_prompt_baseline.txt # Prompt for baseline scoring
│ └── judge_prompt_rag.txt # Prompt for RAG scoring
├── data/test_set/ # Test dataset
│ ├── eth_questions_100.json # 100 ETH questions (JSON)
│ └── eth_questions_100.xlsx # 100 ETH questions (Excel)
├── results/ # Evaluation results
│ ├── baseline_evaluation/ # Baseline responses and scores
│ ├── rag_evaluation/ # RAG responses and scores
│ └── plots/ # Generated visualization plots
└── src/ # Source code (RAG pipeline, etc.)
└── warc_tools/
├── rag/ # RAG pipeline implementation
├── extractor/ # WARC extraction tools
├── indexer/ # Elasticsearch indexing
└── baseline/ # Baseline tools
Note: The evaluation framework does not directly use the source code in src/warc_tools/baseline/ and src/warc_tools/rag/.
- src/warc_tools/baseline/: Library code for baseline LLM calls (not used directly in evaluation)
- scripts/run_evaluation.py and results/baseline_evaluation/: Evaluation scripts and results used for baseline evaluation
- src/warc_tools/rag/: Library code for RAG pipeline (not used directly in evaluation)
- scripts/run_rag_evaluation.py and results/rag_evaluation/: Evaluation scripts and results used for RAG evaluation
All responses are automatically scored using LLM-as-Judge with:
- Judge Model: moonshotai/Kimi-K2-Thinking
- API: CSCS OpenAI-compatible endpoint
- Temperature: 0 (deterministic scoring)
- Prompts: Structured prompts with clear scoring criteria
- Factual Correctness (0-2 points): Accuracy of the information provided
- Completeness (0-2 points): How fully the question is answered
- Aggregate Score: (Correctness + Completeness) / 4 (0-1 scale)
- Result Tags: Correct, Partial, Generic, Refusal, Hallucination
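For illustration, a minimal sketch of how a judge call and the aggregate score could be computed against the CSCS OpenAI-compatible endpoint. The judge prompt file comes from prompts/, but the score_response helper, the reference-answer argument, and the JSON field names in the judge's reply are assumptions, not the repository's actual scoring code.

```python
# Minimal sketch of LLM-as-Judge scoring (assumed judge output format, not the repo's exact code).
import json
import os

from openai import OpenAI  # OpenAI-compatible client pointed at the CSCS endpoint

client = OpenAI(
    api_key=os.environ["CSCS_API_KEY"],
    base_url=os.getenv("CSCS_BASE_URL", "https://api.swissai.cscs.ch/v1"),
)

JUDGE_PROMPT = open("prompts/judge_prompt_baseline.txt", encoding="utf-8").read()

def score_response(question, reference, answer):
    """Ask the judge for 0-2 correctness/completeness scores and return the 0-1 aggregate."""
    completion = client.chat.completions.create(
        model="moonshotai/Kimi-K2-Thinking",
        temperature=0,    # deterministic scoring
        max_tokens=2000,  # judge token budget
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {
                "role": "user",
                "content": f"Question: {question}\nReference: {reference}\nAnswer: {answer}",
            },
        ],
    )
    # Assumed: the judge is instructed to reply with JSON like
    # {"factual_correctness": 2, "completeness": 1, "tag": "Partial"}.
    verdict = json.loads(completion.choices[0].message.content)
    return (verdict["factual_correctness"] + verdict["completeness"]) / 4
```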
- Temperature: 0
- Top_p: 1
- Max tokens: 500 (baseline), 2000 (judge)
- Random seed: 42
- Python 3.8 or higher (Python 3.9+ recommended)
- pip (Python package manager)
- Access to CSCS cluster (for self-hosted models)
- API keys for cloud models (if evaluating)
- Clone the repository:
git clone <repository-url>
cd ml4science-apertus-rag-evaluation
- Create and activate virtual environment:
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
- Install required Python packages:
# Upgrade pip first
pip install --upgrade pip
# Install all dependencies from requirements.txt
pip install -r requirements.txt
# Or install manually (if requirements.txt is not available):
pip install openai>=1.0.0
pip install python-dotenv>=1.0.0
pip install numpy>=1.24.0
pip install matplotlib>=3.7.0
# Optional: For cloud model evaluation (uncomment in requirements.txt)
# pip install anthropic>=0.18.0 # For Claude models
# pip install google-generativeai>=0.3.0 # For Gemini models
# Optional: For RAG evaluation (uncomment in requirements.txt if using src/warc_tools)
# pip install llama-index>=0.9.0
# pip install llama-index-vector-stores-elasticsearch>=0.1.0
pip install pandas>=2.0.0
- Configure environment variables:
# Copy example environment file
cp env.example .env
# Edit .env file with your API keys and configuration
# Required variables:
# - CSCS_API_KEY (for self-hosted models and judge)
# - ES_URL (for RAG evaluation)
# - ES_INDEX_NAME (for RAG evaluation)
Note: See the Environment Variables section below for a complete list of required and optional variables.
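As an illustration of how these variables might be consumed, here is a minimal sketch that loads .env with python-dotenv and builds an OpenAI-compatible client for the CSCS endpoint. It mirrors the configuration above but is not the repository's actual client code, and the smoke-test question is only an example.

```python
# Minimal sketch: load .env and create an OpenAI-compatible client for the CSCS endpoint.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads CSCS_API_KEY, CSCS_BASE_URL, ES_URL, ... from .env

client = OpenAI(
    api_key=os.environ["CSCS_API_KEY"],
    base_url=os.getenv("CSCS_BASE_URL", "https://api.swissai.cscs.ch/v1"),
)

# Quick smoke test against one of the self-hosted models listed above.
reply = client.chat.completions.create(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    messages=[{"role": "user", "content": "In one sentence, what is ETH Zurich?"}],
    temperature=0,
    max_tokens=100,
)
print(reply.choices[0].message.content)
```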
- Run evaluation for a model:
python scripts/run_evaluation.py \
--model swiss-ai/Apertus-8B-Instruct-2509 \
--output results/baseline_evaluation/
- Score responses using LLM-as-Judge:
python scripts/score_responses.py \
--model swiss-ai/Apertus-8B-Instruct-2509 \
--input results/baseline_evaluation/swiss-ai_Apertus-8B-Instruct-2509_responses.json
- Compare all baseline models:
python scripts/compare_models.py
- Run RAG evaluation:
python scripts/run_rag_evaluation.py \
--model swiss-ai/Apertus-8B-Instruct-2509 \
--test_set data/test_set/eth_questions_100.json \
--output results/rag_evaluation/
- Score RAG responses:
python scripts/score_rag_responses.py \
--model swiss-ai/Apertus-8B-Instruct-2509 \
--responses results/rag_evaluation/swiss-ai_Apertus-8B-Instruct-2509_rag_responses.json
- Compare baseline vs RAG:
python scripts/compare_baseline_rag.py
- Generate visualization plots:
python scripts/generate_improved_plots.py
Plots are saved to results/plots/:
- performance_comparison.png: Multi-panel comparison
- language_analysis.png: Performance by language (DE/EN)
- rag_improvement_analysis.png: RAG improvement analysis
- tag_distribution.png: Result tag distribution
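To give a sense of what the baseline-vs-RAG comparison involves, here is a minimal sketch that loads two per-question score files and reports mean aggregate scores and the share of improved questions. The score file names, the question_id key, and the aggregate_score field are assumptions about the results format, not guaranteed to match the files in results/ exactly.

```python
# Minimal sketch of a baseline-vs-RAG comparison (assumed results schema).
import json

MODEL = "swiss-ai_Apertus-8B-Instruct-2509"
baseline = json.load(open(f"results/baseline_evaluation/{MODEL}_scores.json", encoding="utf-8"))
rag = json.load(open(f"results/rag_evaluation/{MODEL}_rag_scores.json", encoding="utf-8"))

# Assumed: each file is a list of per-question records with "question_id" and "aggregate_score".
base_by_id = {item["question_id"]: item["aggregate_score"] for item in baseline}
rag_by_id = {item["question_id"]: item["aggregate_score"] for item in rag}

common = sorted(base_by_id.keys() & rag_by_id.keys())
improved = sum(rag_by_id[q] > base_by_id[q] for q in common)

print(f"Baseline mean: {sum(base_by_id[q] for q in common) / len(common):.3f}")
print(f"RAG mean:      {sum(rag_by_id[q] for q in common) / len(common):.3f}")
print(f"Improved on {improved}/{len(common)} questions ({improved / len(common):.0%})")
```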
Baseline evaluation results are available in:
- docs/baseline_evaluation.md: Summary statistics and insights
- results/baseline_evaluation/: Individual model scores
Key Findings:
- Models show high reasoning ability but are "institutionally blind" without RAG
- Claude/GPT show high refusal rates (honest uncertainty)
- Qwen/Apertus attempt to answer but often provide generic advice
Comprehensive comparison available in:
- docs/rag_vs_baseline_report.md: Detailed analysis
- results/plots/: Visualization plots
Key Findings:
- RAG significantly improves performance (60-70% of questions improved)
- Apertus-8B shows better RAG integration than Qwen3-8B
- Retrieved documents enable models to provide ETH-specific information
- Evaluation Methodology: Detailed methodology, metrics, and protocols
- Baseline Results: Baseline evaluation summary
- RAG Evaluation Plan: RAG evaluation methodology
- RAG vs Baseline Report: Comprehensive comparison analysis
- Plot Analysis: Analysis of generated visualizations
Required environment variables (see env.example):
# CSCS API (for self-hosted models and judge)
CSCS_API_KEY=your_cscs_api_key
CSCS_BASE_URL=https://api.swissai.cscs.ch/v1
# Judge model (optional, defaults to Kimi-K2-Thinking)
JUDGE_API_KEY=your_judge_api_key
JUDGE_BASE_URL=https://api.swissai.cscs.ch/v1
# Cloud model APIs (optional, for Claude/GPT evaluation)
ANTHROPIC_API_KEY=your_anthropic_key
OPENAI_API_KEY=your_openai_key
GOOGLE_API_KEY=your_google_key
# RAG pipeline
LLM_API_KEY=your_llm_api_key
LLM_BASE_URL=https://api.swissai.cscs.ch/v1
- Cloud Models: Claude and GPT responses were collected manually via official model interfaces (web search/tools disabled)
- Judge Model: All scoring uses moonshotai/Kimi-K2-Thinking via the CSCS API
- Reproducibility: All evaluations use temperature=0 for deterministic results
- RAG Pipeline: Uses the existing RAG implementation from src/warc_tools/rag/
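To illustrate what that pipeline does at retrieval time, here is a minimal sketch that queries the Elasticsearch index referenced by ES_URL and ES_INDEX_NAME for ETH documents. It uses the elasticsearch Python client directly for clarity; the "text" field name and the query shape are assumptions and do not necessarily match the index schema or the code in src/warc_tools/rag/.

```python
# Minimal sketch of document retrieval against the ETH web-archive index
# (illustrative only; the real pipeline lives in src/warc_tools/rag/).
import os

from elasticsearch import Elasticsearch  # assumed available alongside the ES vector store

es = Elasticsearch(os.environ["ES_URL"])

def retrieve(question, k=5):
    """Return the top-k document snippets matching the question (field names are assumptions)."""
    hits = es.search(
        index=os.environ["ES_INDEX_NAME"],
        query={"match": {"text": question}},  # "text" is an assumed field name
        size=k,
    )["hits"]["hits"]
    return [hit["_source"]["text"] for hit in hits]

for snippet in retrieve("How do I register for exams at ETH Zurich?"):
    print(snippet[:200], "...")
```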
[Add your license information here]
If you use this evaluation framework, please cite:
[Add citation information]
[Add contact information]