Kyu Won Kim*, Suhwan Choi*, Myeongho Jeon
ICML 2025 Workshop on Long-Context Foundation Models
This repository provides the official implementation of the VerbatimEval framework, introduced in the paper "Say as It Is: Verbatim Fidelity Evaluation of Long-Context Language Models".
Accurately processing long texts and generating precise responses remains a significant challenge for large language models (LLMs). While existing benchmarks evaluate long-text comprehension, they often overlook the models’ ability to faithfully preserve the exact wording, formatting, and sequence of prompts in their responses. To address this gap, we propose a novel evaluation framework with two key advantages: (i) adaptability across diverse domains and data sources, and (ii) tunable difficulty through dynamic variation of text length. Across three tasks—mathematical, contextual, and semantic reasoning—we find that even state-of-the-art long-context LLMs exhibit notable difficulty in maintaining verbatim fidelity during long-text generation.
The repository is organized as follows:
- `experiments.py`: Defines the core experiment classes, including number sorting, sentence shuffling, and entity grouping.
- `llm.py`: Provides a modular interface for interacting with different LLMs (e.g., OpenAI, Gemini).
- `exp.py`: The main script for running experiments.
- `exp.yaml`: Configuration file for the experiments.
- `requirements.txt`: A list of Python dependencies.
- `run.sh`: An example shell script to run experiments.
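As a rough illustration of the task structure, here is a minimal, hypothetical sketch of a number-sorting experiment. The class name, interface, and exact-match scoring below are assumptions for illustration, not the actual code in `experiments.py`:

```python
import random

# Hypothetical sketch: the real classes in experiments.py may differ.
class NumberSortExperiment:
    """Ask the model to sort a list of numbers and check verbatim fidelity."""

    def __init__(self, sample_size: int = 100, seed: int = 0):
        self.rng = random.Random(seed)
        self.sample_size = sample_size  # controls prompt length / difficulty

    def build_prompt(self) -> tuple[str, str]:
        """Generate a prompt and the expected verbatim answer."""
        numbers = [self.rng.randint(0, 10**6) for _ in range(self.sample_size)]
        expected = " ".join(str(n) for n in sorted(numbers))
        prompt = (
            "Sort the following numbers in ascending order and output them "
            "exactly, separated by single spaces:\n"
            + " ".join(str(n) for n in numbers)
        )
        return prompt, expected

    def score(self, response: str, expected: str) -> float:
        # Exact match: credit is given only when wording, formatting,
        # and sequence are all preserved.
        return float(response.strip() == expected)
```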
First, create a Conda environment with Python 3.10:
```bash
conda create -n verbatimeval python=3.10
conda activate verbatimeval
```

Install the required Python packages using pip:

```bash
pip install -r requirements.txt
```

The framework requires API keys for the language models you intend to use (e.g., OpenAI, Google Gemini). Create a `.env` file in the root directory and add your keys:
```
OPENAI_API_KEY="your_openai_api_key"
GOOGLE_API_KEY="your_google_api_key"
```
The `exp.py` script will load these environment variables.
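If you want to verify that your keys are picked up, the loading step presumably looks something like the following sketch, which assumes the `python-dotenv` package (check `requirements.txt` for the actual dependency):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Read OPENAI_API_KEY / GOOGLE_API_KEY from the .env file in the root directory.
load_dotenv()

openai_key = os.environ.get("OPENAI_API_KEY")
google_key = os.environ.get("GOOGLE_API_KEY")
if openai_key is None and google_key is None:
    raise RuntimeError("No API keys found; check your .env file.")
```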
You can run experiments using the `exp.py` script, which takes several arguments to specify the experiment, model, and other parameters.
To run the number sorting experiment with the `gpt-4o` model, use the following command:

```bash
python exp.py --save_name="my_first_exp" --exp_name="num_sort" --model_name="gpt-4o"
```

You can also use the provided shell script `run.sh` as a template for running experiments.
- `--save_name` (str): A unique name for the experiment run.
- `--exp_name` (str): The name of the experiment. Options: `num_sort`, `sentence_shuffle`, `grouping`.
- `--model_name` (str): The name of the model to use. See `llm.py` for supported models.
- `--config_path` (str, optional): Path to the configuration file. Defaults to `exp.yaml`.
- `--sample_size` (int, optional): Override the sample size in the config.
- `--num_test` (int, optional): Override the number of tests in the config.
- `--dataset_name` (str, optional): Override the dataset name in the config.
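These flags map naturally onto a standard `argparse` setup. The following is a hypothetical sketch of how `exp.py` might declare them, not the actual parser:

```python
import argparse

# Hypothetical parser mirroring the documented flags; exp.py may differ.
parser = argparse.ArgumentParser(description="Run a VerbatimEval experiment.")
parser.add_argument("--save_name", type=str, required=True,
                    help="Unique name for the experiment run.")
parser.add_argument("--exp_name", type=str, required=True,
                    choices=["num_sort", "sentence_shuffle", "grouping"],
                    help="Which experiment to run.")
parser.add_argument("--model_name", type=str, required=True,
                    help="Model identifier; see llm.py for supported models.")
parser.add_argument("--config_path", type=str, default="exp.yaml",
                    help="Path to the configuration file.")
parser.add_argument("--sample_size", type=int, default=None,
                    help="Override the sample size in the config.")
parser.add_argument("--num_test", type=int, default=None,
                    help="Override the number of tests in the config.")
parser.add_argument("--dataset_name", type=str, default=None,
                    help="Override the dataset name in the config.")
args = parser.parse_args()
```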