ML4Science Apertus RAG Evaluation

A comprehensive evaluation framework for comparing baseline LLM performance against Retrieval-Augmented Generation (RAG) systems on ETH Zurich-specific questions.

Overview

This project evaluates various Large Language Models (LLMs) on their ability to answer institution-specific questions from the ETH Zurich web archive. The evaluation compares:

  • Baseline Performance: Models without RAG (parametric knowledge only)
  • RAG Performance: Models with access to retrieved ETH documents (non-parametric + parametric knowledge)

Evaluated Models

  • Apertus-8B: swiss-ai/Apertus-8B-Instruct-2509 (self-hosted on CSCS)
  • Qwen3-8B: Qwen/Qwen3-8B (self-hosted on CSCS)
  • Claude Sonnet 4.5: Anthropic (manual collection via official interface)
  • GPT-5.2: OpenAI (manual collection via official interface)

Project Structure

ml4science-apertus-rag-evaluation/
├── docs/                          # Documentation
│   ├── evaluation_method.md       # Detailed evaluation methodology
│   ├── baseline_evaluation.md     # Baseline results summary
│   ├── rag_vs_baseline_report.md  # RAG vs Baseline comparison
│   ├── rag_evaluation_plan.md     # RAG evaluation methodology
│   └── plot_analysis.md           # Plot analysis and insights
├── scripts/                       # Evaluation scripts
│   ├── run_evaluation.py          # Run baseline evaluation
│   ├── score_responses.py         # Score baseline responses (LLM-as-Judge)
│   ├── run_rag_evaluation.py      # Run RAG evaluation
│   ├── score_rag_responses.py     # Score RAG responses (LLM-as-Judge)
│   ├── compare_models.py          # Compare baseline models
│   ├── compare_baseline_rag.py    # Compare baseline vs RAG
│   └── generate_improved_plots.py # Generate visualization plots
├── prompts/                       # Judge prompts
│   ├── judge_prompt_baseline.txt  # Prompt for baseline scoring
│   └── judge_prompt_rag.txt       # Prompt for RAG scoring
├── data/test_set/                 # Test dataset
│   ├── eth_questions_100.json     # 100 ETH questions (JSON)
│   └── eth_questions_100.xlsx     # 100 ETH questions (Excel)
├── results/                       # Evaluation results
│   ├── baseline_evaluation/       # Baseline responses and scores
│   ├── rag_evaluation/            # RAG responses and scores
│   └── plots/                     # Generated visualization plots
└── src/                           # Source code (RAG pipeline, etc.)
    └── warc_tools/
        ├── rag/                   # RAG pipeline implementation
        ├── extractor/             # WARC extraction tools
        ├── indexer/               # Elasticsearch indexing
        └── baseline/              # Baseline tools

Important: Source Code vs Evaluation Results

Note: The evaluation scripts do not directly use the library code in src/warc_tools/baseline/ and src/warc_tools/rag/.

  • src/warc_tools/baseline/: library code for baseline LLM calls (not used directly in the evaluation)
  • scripts/run_evaluation.py and results/baseline_evaluation/: the script and results used for the baseline evaluation
  • src/warc_tools/rag/: library code for the RAG pipeline (not used directly in the evaluation)
  • scripts/run_rag_evaluation.py and results/rag_evaluation/: the script and results used for the RAG evaluation

Evaluation Methodology

LLM-as-Judge Approach

All responses are automatically scored using LLM-as-Judge with the following setup (a sketch of the judge call follows the list):

  • Judge Model: moonshotai/Kimi-K2-Thinking
  • API: CSCS OpenAI-compatible endpoint
  • Temperature: 0 (deterministic scoring)
  • Prompts: Structured prompts with clear scoring criteria
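
A minimal sketch of what a judge call with this setup can look like, using the openai Python client against the CSCS OpenAI-compatible endpoint. The function name and the prompt-template placeholders are illustrative assumptions; the actual scoring logic lives in scripts/score_responses.py and scripts/score_rag_responses.py.

import os
from openai import OpenAI

# Judge client for the CSCS OpenAI-compatible endpoint
# (base URL as listed in the Environment Variables section below).
judge = OpenAI(
    api_key=os.environ["CSCS_API_KEY"],
    base_url=os.environ.get("CSCS_BASE_URL", "https://api.swissai.cscs.ch/v1"),
)

def judge_answer(template: str, question: str, answer: str) -> str:
    """Score one (question, answer) pair and return the judge's raw verdict text."""
    # Assumes the judge prompt template exposes {question} and {answer} placeholders.
    prompt = template.format(question=question, answer=answer)
    completion = judge.chat.completions.create(
        model="moonshotai/Kimi-K2-Thinking",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,    # deterministic scoring
        max_tokens=2000,  # judge token budget (see Reproducibility Settings)
    )
    return completion.choices[0].message.content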

Metrics

  1. Factual Correctness (0-2 points): Accuracy of the information provided
  2. Completeness (0-2 points): How fully the question is answered
  3. Aggregate Score: (Correctness + Completeness) / 4, giving a 0-1 scale (see the sketch after this list)
  4. Result Tags: Correct, Partial, Generic, Refusal, Hallucination
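
For concreteness, the aggregate score is a simple normalization of the two sub-scores; a minimal illustration (names here are illustrative, not the repository's actual code):

def aggregate_score(correctness: int, completeness: int) -> float:
    """Combine the two 0-2 sub-scores into the 0-1 aggregate scale."""
    assert 0 <= correctness <= 2 and 0 <= completeness <= 2
    return (correctness + completeness) / 4

# Example: factually correct (2) but only partially complete (1) -> 0.75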

Reproducibility Settings

  • Temperature: 0
  • Top_p: 1
  • Max tokens: 500 (baseline), 2000 (judge)
  • Random seed: 42
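
Expressed as keyword arguments for an OpenAI-compatible chat.completions.create call, these settings look roughly as follows (whether the serving backend honors seed is an assumption; the authoritative values are in the evaluation scripts):

# Generation parameters for baseline runs; the judge uses max_tokens=2000 instead.
BASELINE_GENERATION_KWARGS = {
    "temperature": 0,   # deterministic decoding
    "top_p": 1,
    "max_tokens": 500,  # answer budget for the evaluated model
    "seed": 42,         # best-effort reproducibility
}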

Quick Start

Prerequisites

  • Python 3.8 or higher (Python 3.9+ recommended)
  • pip (Python package manager)
  • Access to CSCS cluster (for self-hosted models)
  • API keys for cloud models (if evaluating)

Installation

  1. Clone the repository:
git clone <repository-url>
cd ml4science-apertus-rag-evaluation
  2. Create and activate a virtual environment:
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
  3. Install required Python packages:
# Upgrade pip first
pip install --upgrade pip

# Install all dependencies from requirements.txt
pip install -r requirements.txt

# Or install manually (if requirements.txt is not available):
pip install openai>=1.0.0
pip install python-dotenv>=1.0.0
pip install numpy>=1.24.0
pip install matplotlib>=3.7.0

# Optional: For cloud model evaluation (uncomment in requirements.txt)
# pip install anthropic>=0.18.0  # For Claude models
# pip install google-generativeai>=0.3.0  # For Gemini models

# Optional: For RAG evaluation (uncomment in requirements.txt if using src/warc_tools)
# pip install llama-index>=0.9.0
# pip install llama-index-vector-stores-elasticsearch>=0.1.0
# pip install pandas>=2.0.0
  4. Configure environment variables:
# Copy example environment file
cp env.example .env

# Edit .env file with your API keys and configuration
# Required variables:
#   - CSCS_API_KEY (for self-hosted models and judge)
#   - ES_URL (for RAG evaluation)
#   - ES_INDEX_NAME (for RAG evaluation)

Note: See the Environment Variables section below for a complete list of required and optional variables.

Running Baseline Evaluation

  1. Run evaluation for a model:
python scripts/run_evaluation.py \
    --model swiss-ai/Apertus-8B-Instruct-2509 \
    --output results/baseline_evaluation/
  2. Score responses using LLM-as-Judge:
python scripts/score_responses.py \
    --model swiss-ai/Apertus-8B-Instruct-2509 \
    --input results/baseline_evaluation/swiss-ai_Apertus-8B-Instruct-2509_responses.json
  3. Compare all baseline models:
python scripts/compare_models.py
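
To run steps 1 and 2 back-to-back for both self-hosted models, a small wrapper along these lines can help. The response filename pattern is inferred from the example above and may differ from what scripts/run_evaluation.py actually writes:

import subprocess

MODELS = ["swiss-ai/Apertus-8B-Instruct-2509", "Qwen/Qwen3-8B"]
OUT_DIR = "results/baseline_evaluation/"

for model in MODELS:
    # Step 1: generate baseline responses for this model.
    subprocess.run(["python", "scripts/run_evaluation.py",
                    "--model", model, "--output", OUT_DIR], check=True)
    # Step 2: score them with the LLM judge (filename pattern assumed).
    responses = f"{OUT_DIR}{model.replace('/', '_')}_responses.json"
    subprocess.run(["python", "scripts/score_responses.py",
                    "--model", model, "--input", responses], check=True)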

Running RAG Evaluation

  1. Run RAG evaluation:
python scripts/run_rag_evaluation.py \
    --model swiss-ai/Apertus-8B-Instruct-2509 \
    --test_set data/test_set/eth_questions_100.json \
    --output results/rag_evaluation/
  2. Score RAG responses:
python scripts/score_rag_responses.py \
    --model swiss-ai/Apertus-8B-Instruct-2509 \
    --responses results/rag_evaluation/swiss-ai_Apertus-8B-Instruct-2509_rag_responses.json
  3. Compare baseline vs RAG:
python scripts/compare_baseline_rag.py

Generating Plots

python scripts/generate_improved_plots.py

Plots are saved to results/plots/:

  • performance_comparison.png: Multi-panel comparison
  • language_analysis.png: Performance by language (DE/EN)
  • rag_improvement_analysis.png: RAG improvement analysis
  • tag_distribution.png: Result tag distribution

Results

Baseline Results

Baseline evaluation results are available in:

  • docs/baseline_evaluation.md: Summary statistics and insights
  • results/baseline_evaluation/: Individual model scores

Key Findings:

  • Models show high reasoning ability but are "institutionally blind" without RAG
  • Claude/GPT show high refusal rates (honest uncertainty)
  • Qwen/Apertus attempt to answer but often provide generic advice

RAG vs Baseline Comparison

Comprehensive comparison available in:

  • docs/rag_vs_baseline_report.md: Detailed analysis
  • results/plots/: Visualization plots

Key Findings:

  • RAG significantly improves performance (60-70% of questions improved; see the illustrative snippet after this list)
  • Apertus-8B shows better RAG integration than Qwen3-8B
  • Retrieved documents enable models to provide ETH-specific information
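
Purely as an illustration of the improvement metric quoted above, the fraction of improved questions can be computed from two aligned lists of per-question aggregate scores; the actual comparison logic lives in scripts/compare_baseline_rag.py:

from typing import Sequence

def improvement_rate(baseline: Sequence[float], rag: Sequence[float]) -> float:
    """Fraction of questions whose aggregate score is strictly higher with RAG."""
    assert len(baseline) == len(rag)
    improved = sum(r > b for b, r in zip(baseline, rag))
    return improved / len(baseline)

# Example: improvement_rate([0.25, 0.5, 0.0], [0.75, 0.5, 1.0]) == 2/3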

Documentation

Detailed write-ups live in the docs/ directory: evaluation_method.md (methodology), baseline_evaluation.md (baseline results), rag_evaluation_plan.md (RAG methodology), rag_vs_baseline_report.md (RAG vs baseline comparison), and plot_analysis.md (plot analysis).

Environment Variables

Environment variables (see env.example for the full template; optional variables are marked in the comments):

# CSCS API (for self-hosted models and judge)
CSCS_API_KEY=your_cscs_api_key
CSCS_BASE_URL=https://api.swissai.cscs.ch/v1

# Judge model (optional, defaults to Kimi-K2-Thinking)
JUDGE_API_KEY=your_judge_api_key
JUDGE_BASE_URL=https://api.swissai.cscs.ch/v1

# Cloud model APIs (optional, for Claude/GPT evaluation)
ANTHROPIC_API_KEY=your_anthropic_key
OPENAI_API_KEY=your_openai_key
GOOGLE_API_KEY=your_google_key

# RAG pipeline (retrieval + generation)
ES_URL=your_elasticsearch_url
ES_INDEX_NAME=your_elasticsearch_index
LLM_API_KEY=your_llm_api_key
LLM_BASE_URL=https://api.swissai.cscs.ch/v1
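
A short sketch of loading these variables in Python with python-dotenv (already listed in the dependencies); the exact set of variables each script checks is an assumption here:

import os
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env in the working directory

# Fail early if the variables needed for the planned run are missing.
required = ["CSCS_API_KEY"]              # self-hosted models and the judge
required += ["ES_URL", "ES_INDEX_NAME"]  # add these when running the RAG evaluation

missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")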

Notes

  • Cloud Models: Claude and GPT responses were collected manually via official model interfaces (web search/tools disabled)
  • Judge Model: All scoring uses moonshotai/Kimi-K2-Thinking via CSCS API
  • Reproducibility: All evaluations use temperature=0 for deterministic results
  • RAG Pipeline: Uses existing RAG implementation from src/warc_tools/rag/

License

[Add your license information here]

Citation

If you use this evaluation framework, please cite:

[Add citation information]

Contact

[Add contact information]
