🚀 ArxivRoll

“Fresh from ArXiv, served once, and never reheated.”

📌 TL;DR: ArxivRoll tells you “How much of your score is real, and how much is cheating?”


1. 📊 What is ArxivRoll?

ArxivRoll is a dynamic, one-time-pad-inspired evaluation framework 🛡️ that audits how much Large Language Models (LLMs) overestimate their true abilities on public benchmarks.

⚠️ Key Problems ArxivRoll Tackles

  • 📥 Data contamination
    Public benchmarks (MMLU, GSM8K, etc.) often sneak into pre-training data → inflated scores.
  • 🎯 Biased overtraining
    Developers may “teach to the test,” tuning models only on popular domains.
  • 🕵️ Transparency crisis
    Private leaderboards (SEAL, Chatbot Arena) are opaque & hard to reproduce.

What is ArxivRollBench?

ArxivRollBench is the family of private benchmarks built by ArxivRoll.

🧪 How ArxivRollBench Works

  1. 🌱 Fresh Test Cases
    Every 6 months we scrape the latest ArXiv preprints (Apr–Sep 2024 → ArxivRollBench-2024b).

    🏷️ Domains: CS, Math, Physics, Bio, Econ, Finance, Statistics, EE.

  2. 🎲 SCP Tasks
    Articles are auto-converted into three symbolic tasks (a minimal sketch follows after this list):

    • Sequencing 🔀 → Re-order shuffled sentences.
    • Cloze 🕳️ → Fill masked sentences.
    • Prediction 🔮 → Choose the correct next sentence.
  3. 📈 Rugged Scores (RS) (sketched below)

    • RS-I 🧪 = relative inflation of public-benchmark scores over private ones.
    • RS-II ⚖️ = performance variance across domains (a biased-overtraining detector).
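
Below is a minimal, hypothetical Python sketch of how a short article could be turned into the three SCP tasks; the repository's real construction logic lives in constructor.py and differs in its details.

# Hypothetical illustration of the three SCP task types (not the repo's actual code).
import random

sentences = [
    "We study contamination in LLM evaluation.",
    "Public benchmarks can leak into pre-training corpora.",
    "We therefore build one-time private benchmarks from fresh preprints.",
    "Scores on these benchmarks better reflect true ability.",
]

# Sequencing: shuffle the sentences; the model must recover the original order.
order = list(range(len(sentences)))
random.shuffle(order)
sequencing = {"shuffled": [sentences[i] for i in order], "answer": order}

# Cloze: mask one sentence; the model must fill it in.
masked = random.randrange(len(sentences))
cloze = {"context": ["[MASK]" if i == masked else s for i, s in enumerate(sentences)],
         "answer": sentences[masked]}

# Prediction: given a prefix, pick the correct next sentence among distractors.
prefix, gold = sentences[:2], sentences[2]
options = [gold, "An unrelated distractor sentence mined from another paper."]
random.shuffle(options)
prediction = {"prefix": prefix, "options": options, "answer": gold}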

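And here is a minimal sketch of the two Rugged Scores, assuming you already have accuracy numbers in hand; the exact definitions are given in the paper.

# Minimal sketch of RS-I and RS-II (the precise formulas are in the paper).
from statistics import mean, pstdev

def rs_i(public_score: float, private_score: float) -> float:
    """RS-I: relative inflation of the public score over the private one (0 = no inflation)."""
    return (public_score - private_score) / private_score

def rs_ii(domain_scores: dict[str, float]) -> float:
    """RS-II: spread of private scores across domains, normalized by their mean."""
    scores = list(domain_scores.values())
    return pstdev(scores) / mean(scores)

print(f"RS-I : {rs_i(public_score=0.82, private_score=0.61):.2f}")  # ≈ 0.34, i.e. ~34% inflation
print(f"RS-II: {rs_ii({'cs': 0.58, 'math': 0.49, 'physics': 0.55, 'econ': 0.40}):.2f}")
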
🌟 Unique Features

  • 🕐 One-Time Use: private benchmarks are used once, then retired & open-sourced.
  • ✅ High Quality: filtered for length and complexity, with minimal math/tables.
  • 🌍 Broad Coverage: 8 domains, ~100-word contexts, 1k+ samples per domain.

👩‍💻 2. How Do I Evaluate My Model?

The easiest way is to use lm-evaluation-harness.

Just install lm-eval, and then evaluate a Hugging Face model with:

export task_ls="arxivrollbench2024b"   # or any other ArxivRollBench task

lm_eval \
    --model hf \
    --model_args pretrained=your-model-name,parallelize=True \
    --tasks ${task_ls} \
    --verbosity DEBUG \
    --log_samples \
    --output_path your-log-path
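
If you prefer calling the harness from Python rather than the CLI, a sketch along these lines should work (it uses lm-eval's simple_evaluate; the task name is the one defined above):

# Sketch: running the same evaluation through lm-eval's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-model-name,parallelize=True",
    tasks=["arxivrollbench2024b"],
    log_samples=True,
)
print(results["results"])  # per-task metrics, mirroring the CLI output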

You can also evaluate LLMs served via APIs; examples are detailed in ./eval/.

👩‍💻 3. How to Use & Read the Code


📦 1. Install Environment

Via pip

pip install -r re.txt

Via conda (recommended)

conda env create -f robench.yaml
conda activate robench

Clone & editable install

git clone https://github.com/XXXXXX/XXXXXXXXXXXXXXXXXXX
cd harness-4-robench
pip install -e .

🗂️ 2. File Map – “Where is what?”

📁 Path → 🎯 Purpose
./1.run_vanilla_construct.py 🏗️ One-click generator of your private benchmark from fresh ArXiv papers.
constructor.py 🔧 Engine room: all SCP logic (Sequencing / Cloze / Prediction) lives here.
data/ 📚 Static assets
    ./data/INSTRUCTION.py 📝 Prompt templates fed into the LLM during evaluation.
eval/ 🧪 Evaluation scripts
    ./eval/0.1.vanilla_harness_test.sh 🤗 Evaluate open-source Hugging Face models.
    ./eval/0.2.harness_eval_closeAIs.sh 🔐 Evaluate OpenAI / Claude / Gemini APIs.
post_process_paper_text.py ✂️ Clean & segment raw ArXiv LaTeX → plain English sentences.
spider_arxiv.py 🕷️ Crawler that downloads the latest ArXiv PDFs & abstracts.
SearchBySomething.py 🔍 TF-IDF retriever that mines distractor sentences for Prediction tasks (see the sketch below).
Vectorize.py 🧮 Converts any text into dense embeddings for retrieval.
utils.py 🧰 Tiny helpers (date parsing, logging, etc.).
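
For intuition, here is a rough, hypothetical sketch of the kind of TF-IDF distractor mining SearchBySomething.py performs, written with scikit-learn; it is an illustration, not the repository's implementation.

# Hypothetical TF-IDF distractor mining for Prediction tasks.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "We propose a one-time-pad-inspired evaluation protocol.",
    "Contamination inflates scores on public benchmarks.",
    "The model must choose the correct next sentence.",
    "Fresh preprints provide unseen test material.",
]
gold_next_sentence = "Benchmark contamination inflates public scores."

# Rank corpus sentences by similarity to the gold continuation and keep the
# closest ones (dropping near-duplicates) as hard distractors.
vectorizer = TfidfVectorizer().fit(corpus + [gold_next_sentence])
sims = cosine_similarity(vectorizer.transform([gold_next_sentence]),
                         vectorizer.transform(corpus))[0]
ranked = sorted(zip(corpus, sims), key=lambda pair: -pair[1])
distractors = [sentence for sentence, sim in ranked if sim < 0.95][:3]
print(distractors)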

🚀 3. Quick Start – 3 Commands to Glory

① Download ArXiv articles

python spider_arxiv.py

🗂️ Drops papers into ./data/raw/.
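
The crawler presumably pulls listings from arXiv; for intuition, a minimal query against the public arXiv API (not the script's actual code) looks like:

# Hypothetical sketch of querying the public arXiv API for recent preprints.
import urllib.request
import xml.etree.ElementTree as ET

URL = ("http://export.arxiv.org/api/query"
       "?search_query=cat:cs.CL&start=0&max_results=5"
       "&sortBy=submittedDate&sortOrder=descending")

with urllib.request.urlopen(URL) as resp:
    feed = ET.fromstring(resp.read())

ATOM = "{http://www.w3.org/2005/Atom}"
for entry in feed.findall(f"{ATOM}entry"):
    title = entry.find(f"{ATOM}title").text.strip()
    url = entry.find(f"{ATOM}id").text.strip()
    print(f"{title} -> {url}")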

② Build a private benchmark

python 1.run_vanilla_construct.py

🎲 Generates Sequencing / Cloze / Prediction tasks → ./benchmarks/.

③ Reproduce our Results

# Open-source models (Llama, Qwen, ...)
bash ./eval/0.1.vanilla_harness_test.sh

# Proprietary APIs (GPT-4, Claude, Gemini, ...)
bash ./eval/0.2.harness_eval_closeAIs.sh

📊 Results are saved as JSON, and RS scores are plotted automatically.


Happy benchmarking! 🪄

📄 Paper

This repository accompanies the AAAI'26 paper: How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework.
