“Fresh from ArXiv, served once, and never reheated.”
📌 TL;DR: ArxivRoll tells you “How much of your score is real, and how much is cheating?”
ArxivRoll is a dynamic, one-time-pad-inspired evaluation framework 🛡️ that audits how much Large Language Models (LLMs) over-estimate their true abilities on public benchmarks.
- 📥 Data contamination: public benchmarks (MMLU, GSM8K, etc.) often sneak into pre-training data → inflated scores.
- 🎯 Biased overtraining: developers may “teach to the test,” tuning models only on popular domains.
- 🕵️ Transparency crisis: private leaderboards (SEAL, Chatbot Arena) are opaque & hard to reproduce.
ArxivRollBench is the suite of private benchmarks built by ArxivRoll.
- 🌱 Fresh Test Cases: every 6 months we scrape the latest ArXiv preprints (Apr–Sep 2024 → ArxivRollBench-2024b). 🏷️ Domains: CS, Math, Physics, Bio, Econ, Finance, Statistics, EE.
- 🎲 SCP Tasks: articles are auto-converted into three symbolic tasks:
  - Sequencing 🔀 → re-order shuffled sentences.
  - Cloze 🕳️ → fill in masked sentences.
  - Prediction 🔮 → choose the correct next sentence.
- 📈 Rugged Scores (RS), illustrated by the sketch right after this list:
  - RS-I 🧪 = % inflation on public vs. private benchmarks.
  - RS-II ⚖️ = performance variance across domains (biased-training detector).
- 🕐 One-Time Use: private benchmarks are used once, then expired & open-sourced.
- ✅ High Quality: filtered for length and complexity, with minimal math/tables.
- 🌍 Broad Coverage: 8 domains, ~100-word contexts, 1k+ samples per domain.
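
The exact RS formulas are defined by ArxivRoll itself; the snippet below is only a rough, hypothetical sketch of the two ideas (public-vs-private inflation and cross-domain spread), with made-up numbers and helper names.

```python
# Hypothetical sketch of the two Rugged Scores; treat this only as an illustration.
from statistics import pvariance

def rs_i(public_score: float, private_score: float) -> float:
    """RS-I (one plausible reading): % inflation of the public-benchmark
    score over the contamination-free private score."""
    return 100.0 * (public_score - private_score) / private_score

def rs_ii(domain_scores: dict[str, float]) -> float:
    """RS-II (one plausible reading): variance of per-domain private scores;
    a large spread hints at biased, domain-skewed training."""
    return pvariance(domain_scores.values())

# Toy numbers, purely illustrative.
print(f"RS-I:  {rs_i(public_score=78.0, private_score=61.0):.1f}%")
print(f"RS-II: {rs_ii({'cs': 64.0, 'math': 52.0, 'physics': 58.0, 'econ': 70.0}):.1f}")
```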
The easiest way is to use lm-evaluation-harness (lm-eval). Just install lm-eval from here, and then evaluate a HuggingFace model with:
```bash
export task_ls=(
  "arxivrollbench2024b"
)
# Pass an ArxivRollBench task name (e.g. one from task_ls above) to --tasks.
lm_eval \
  --model hf \
  --model_args pretrained=your-model-name,parallelize=True \
  --tasks any-arxiv-roll-bench-task \
  --verbosity DEBUG \
  --log_samples \
  --output_path your-log-path
```

You can also evaluate LLMs via APIs, with examples detailed in `./eval/`.
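
Besides the shell scripts in `./eval/`, newer lm-eval releases also expose a Python entry point. The sketch below is an assumption-laden example (lm-eval ≥ 0.4, an `OPENAI_API_KEY` in the environment, placeholder model and task names), not the repo's own recipe.

```python
# Assumption-laden sketch: lm-eval's Python entry point with its built-in
# OpenAI chat backend; model and task names below are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="openai-chat-completions",   # lm-eval backend; needs OPENAI_API_KEY set
    model_args="model=gpt-4o-mini",    # placeholder model name
    tasks=["arxivrollbench2024b"],     # task name from task_ls above
)
print(results["results"])
```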
```bash
pip install -r re.txt
conda env create -f robench.yaml
conda activate robench
git clone https://github.com/XXXXXX/XXXXXXXXXXXXXXXXXXX
cd harness-4-robench
pip install -e .
```

| 📁 Path | 🎯 Purpose |
|---|---|
| `./1.run_vanilla_construct.py` | 🏗️ One-click generator of your private benchmark from fresh ArXiv papers. |
| `constructor.py` | 🔧 Engine room: all SCP logic (Sequencing / Cloze / Prediction) lives here. |
| `data/` | 📚 Static assets. |
| `./data/INSTRUCTION.py` | 📝 Prompt templates fed into the LLM during evaluation. |
| `eval/` | 🧪 Evaluation scripts. |
| `./eval/0.1.vanilla_harness_test.sh` | 🤗 Evaluate open-source HuggingFace models. |
| `./eval/0.2.harness_eval_closeAIs.sh` | 🔐 Evaluate OpenAI / Claude / Gemini APIs. |
| `post_process_paper_text.py` | ✂️ Clean & segment raw ArXiv LaTeX → plain English sentences. |
| `spider_arxiv.py` | 🕷️ Crawler that downloads the latest ArXiv PDFs & abstracts. |
| `SearchBySomething.py` | 🔍 TF-IDF retriever to mine distractor sentences for Prediction tasks (sketched just after this table). |
| `Vectorize.py` | 🧮 Convert any text into dense embeddings for retrieval. |
| `utils.py` | 🧰 Tiny helpers (date parsing, logging, etc.). |
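
For the Prediction task, distractor mining in the spirit of `SearchBySomething.py` can be pictured as a plain TF-IDF similarity search. The snippet below is a hypothetical sketch using scikit-learn, not the repo's actual retriever; all sentences are made up.

```python
# Hypothetical TF-IDF distractor mining; the repo's actual retriever may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pool = [  # candidate sentences harvested from other papers (made up here)
    "Our method scales linearly with sequence length.",
    "The dataset is no longer publicly available.",
    "Experiments span several scientific disciplines.",
    "We defer all proofs to the appendix.",
]
gold_next = "Experiments cover eight scientific domains."

matrix = TfidfVectorizer().fit_transform(pool + [gold_next])
scores = cosine_similarity(matrix[len(pool)], matrix[:len(pool)]).ravel()

# Keep the pool sentences most similar to the gold continuation:
# plausible but wrong options for the Prediction task.
distractors = [pool[i] for i in scores.argsort()[::-1][:2]]
print(distractors)
```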
```bash
python spider_arxiv.py
```

🗂️ Drops papers into `./data/raw/`.
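
The crawler's details are repository-specific; purely as an illustration of the kind of request a spider like `spider_arxiv.py` issues, here is a hypothetical query against the public ArXiv Atom API (the endpoint and parameters are standard ArXiv API usage, everything else is made up).

```python
# Hypothetical illustration of a crawler request; not the repo's actual code.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
query = urllib.parse.urlencode({
    "search_query": "cat:cs.CL",     # one category; the real spider covers 8 domains
    "start": 0,
    "max_results": 5,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
})

with urllib.request.urlopen(f"http://export.arxiv.org/api/query?{query}") as resp:
    feed = ET.fromstring(resp.read())

for entry in feed.findall(f"{ATOM}entry"):
    title = entry.findtext(f"{ATOM}title", default="").strip()
    abstract = entry.findtext(f"{ATOM}summary", default="").strip()
    print(f"{title!r} -> {abstract[:60]}...")
```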
```bash
python 1.run_vanilla_construct.py
```

🎲 Generates Sequencing / Cloze / Prediction tasks → `./benchmarks/`.
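
For intuition about what lands in `./benchmarks/`, the toy sketch below shows the three SCP task shapes on a made-up passage; the real pipeline in `constructor.py` (sentence segmentation, filtering, TF-IDF distractors) is more involved and its output format may differ.

```python
# Toy illustration of the three SCP task shapes; not the repo's actual format.
import random

sentences = [
    "We introduce a new sampling scheme.",
    "It reduces variance without extra compute.",
    "Experiments cover eight scientific domains.",
    "Results improve over strong baselines.",
]

# Sequencing 🔀: shuffle the sentences; the model must recover the order.
order = list(range(len(sentences)))
random.shuffle(order)
sequencing = {"shuffled": [sentences[i] for i in order],
              "original_index": order}

# Cloze 🕳️: mask one sentence; the model must fill it in.
masked = random.randrange(len(sentences))
cloze = {"context": ["[MASK]" if i == masked else s for i, s in enumerate(sentences)],
         "answer": sentences[masked]}

# Prediction 🔮: given a prefix, pick the true next sentence among distractors.
prefix, gold_next = sentences[:2], sentences[2]
options = [gold_next, "The dataset is no longer available.", "We defer proofs to future work."]
random.shuffle(options)
prediction = {"context": prefix, "options": options, "label": options.index(gold_next)}

print(sequencing, cloze, prediction, sep="\n")
```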
```bash
# Open-source models (Llama, Qwen, ...)
bash ./eval/0.1.vanilla_harness_test.sh

# Proprietary APIs (GPT-4, Claude, Gemini, ...)
bash ./eval/0.2.harness_eval_closeAIs.sh
```

📊 Results are saved as JSON, and RS scores are auto-plotted.
Happy benchmarking! 🪄