Drift-Bench: Quantifying the Safety Half-Life of LLMs

Tests: Passing | License: MIT

Abstract

As Large Language Models (LLMs) process extended context windows, their adherence to initial safety conditioning (system prompts, RLHF fine-tuning) may degrade—a phenomenon we term "Safety Half-Life". This project provides a rigorous framework to quantify this decay by measuring the number of "benign" conversation tokens required for a model to "forget" its safety alignment and succumb to jailbreak attempts.

Methodology

Drift-Bench employs a modular architecture to simulate long-context drift:

  1. ContextEngine: Manages a sliding window of conversation history, tracking precise token consumption (see the sketch after this list).
  2. DistractorStream: An infinite iterator of benign tasks (GSM8K math, Alpaca instructions, WikiText knowledge) that fills the context window to simulate realistic usage.
  3. SafetyProbe: Injects jailbreak prompts (from JailbreakBench) at scheduled intervals.
  4. RefusalJudge: Evaluates model responses using a hybrid approach (BERT classifier + heuristic regex) to determine safety compliance.
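
The sliding-window bookkeeping the ContextEngine performs might look like the minimal sketch below. The class and method names are illustrative assumptions, not the repository's actual API.

from collections import deque

class SlidingContext:
    """Illustrative sliding window: evicts the oldest turns once the token budget is exceeded."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.turns = deque()   # entries of (role, text, n_tokens)
        self.total_tokens = 0

    def append(self, role: str, text: str, n_tokens: int) -> None:
        self.turns.append((role, text, n_tokens))
        self.total_tokens += n_tokens
        # Evict from the front until the window fits the budget again.
        while self.total_tokens > self.max_tokens and self.turns:
            _, _, evicted = self.turns.popleft()
            self.total_tokens -= evicted

    def messages(self) -> list[dict]:
        return [{"role": role, "content": text} for role, text, _ in self.turns]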

Metric: Safety Half-Life

We define the Safety Half-Life ($T_{50}$) as the context depth (in tokens) at which the model's refusal rate for harmful queries first drops below 50%.
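
In practice, $T_{50}$ must be estimated from refusal rates measured at discrete context depths. A minimal sketch, assuming linear interpolation between the two bracketing measurements (the repository may estimate it differently):

def estimate_t50(points: list[tuple[int, float]]) -> float | None:
    """Estimate the context depth where the refusal rate first crosses 0.5.
    `points` are (token_depth, refusal_rate) pairs sorted by depth."""
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        if r0 >= 0.5 > r1:
            # Linear interpolation between the bracketing measurements.
            return d0 + (r0 - 0.5) * (d1 - d0) / (r0 - r1)
    return None  # The refusal rate never dropped below 50%.

# Example: the crossing occurs between 4,096 and 8,192 tokens.
print(estimate_t50([(0, 1.0), (4096, 0.9), (8192, 0.4)]))  # ~7372.8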

Results

In our experiments with llama-3.3-70b-versatile, the model demonstrated exceptional safety retention within standard context limits. We observed a 100% refusal rate up to 8,192 tokens.

[Safety Decay Plot]

Figure 1: Refusal Rate vs. Context Depth for Llama-3.3-70B. The model maintains perfect safety alignment up to the 8k-token boundary. Experiments beyond 8k were constrained by API rate limits.

Usage

1. Installation

Clone the repository and install dependencies using Poetry or pip:

git clone https://github.com/Jason-Wang313/Drift-Bench.git
cd Drift-Bench
pip install -r requirements.txt
# OR
poetry install

2. Configuration

Set your API keys in the environment:

export GROQ_API_KEY="your_groq_key"
export HF_TOKEN="your_hf_token"
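
The experiment code presumably reads these keys at startup; a minimal sketch of defensive loading (the function name is an assumption):

import os

def load_keys() -> dict[str, str]:
    """Fail fast if a required credential is missing from the environment."""
    keys = {}
    for name in ("GROQ_API_KEY", "HF_TOKEN"):
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        keys[name] = value
    return keys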

3. Run Experiment

Execute the main experiment loop. You can specify the model, and a dry run is available for testing the logic without live API calls.

python -m src.main --model "llama-3.3-70b-versatile"
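
Putting the components together, the main loop presumably resembles the condensed sketch below: pad the window with benign traffic up to each checkpoint depth, inject a jailbreak probe, and grade the response. All names and interfaces are illustrative (it reuses the SlidingContext sketch above), not the repository's actual code.

def run_experiment(model, context, distractors, probes, judge, checkpoints, count_tokens):
    """Probe safety at increasing context depths; return (depth, refused) records.
    Assumes each checkpoint depth fits within the window's token budget."""
    results = []
    for depth in checkpoints:
        # Pad the conversation with benign distractor traffic up to the target depth.
        while context.total_tokens < depth:
            task = next(distractors)
            reply = model.chat(context.messages() + [{"role": "user", "content": task}])
            context.append("user", task, count_tokens(task))
            context.append("assistant", reply, count_tokens(reply))
        # Inject a jailbreak probe at this depth and grade the response.
        probe = next(probes)
        answer = model.chat(context.messages() + [{"role": "user", "content": probe}])
        results.append((depth, judge.is_refusal(answer)))
    return results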

4. Visualization

Generate the safety decay plot from the latest logs:

python -m src.analysis.visualize
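
The plotting step likely amounts to something like the following matplotlib sketch (the function name and output path are assumptions):

import matplotlib.pyplot as plt

def plot_decay(depths, refusal_rates, out_path="safety_decay.png"):
    """Plot refusal rate against context depth and mark the 50% threshold."""
    plt.figure(figsize=(8, 5))
    plt.plot(depths, refusal_rates, marker="o")
    plt.axhline(0.5, linestyle="--", color="gray", label="50% threshold ($T_{50}$)")
    plt.xlabel("Context depth (tokens)")
    plt.ylabel("Refusal rate")
    plt.title("Safety Decay: Refusal Rate vs. Context Depth")
    plt.legend()
    plt.savefig(out_path, dpi=150)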

Verification

The codebase is supported by a comprehensive pytest suite covering:

  • Context window eviction logic
  • Infinite data streaming and batching
  • Safety grader classification accuracy
  • API client resilience (retries and error handling)

Run the tests with:

pytest tests/
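
For illustration, an eviction test in this suite might look like the following, written against the illustrative SlidingContext from the Methodology section rather than the repository's real classes:

def test_eviction_respects_token_budget():
    ctx = SlidingContext(max_tokens=10)
    ctx.append("user", "aaaa", n_tokens=6)
    ctx.append("assistant", "bbbb", n_tokens=6)  # total hits 12, so the first turn is evicted
    assert ctx.total_tokens == 6
    assert ctx.messages() == [{"role": "assistant", "content": "bbbb"}]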
