Drift-Bench: Quantifying the Safety Half-Life of LLMs

Tests: Passing | License: MIT

Abstract

As Large Language Models (LLMs) process extended context windows, their adherence to initial safety conditioning (system prompts, RLHF fine-tuning) may degrade—a phenomenon we term "Safety Half-Life". This project provides a rigorous framework to quantify this decay by measuring the number of "benign" conversation tokens required for a model to "forget" its safety alignment and succumb to jailbreak attempts.

Methodology

Drift-Bench employs a modular architecture to simulate long-context drift:

  1. ContextEngine: Manages a sliding window of conversation history, tracking precise token consumption (see the sketch after this list).
  2. DistractorStream: An infinite iterator of benign tasks (GSM8K math, Alpaca instructions, WikiText knowledge) that fills the context window to simulate realistic usage.
  3. SafetyProbe: Injects jailbreak prompts (from JailbreakBench) at scheduled intervals.
  4. RefusalJudge: Evaluates model responses using a hybrid approach (BERT classifier + heuristic regex) to determine safety compliance.
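
The sliding-window bookkeeping the ContextEngine performs might look like the minimal sketch below. The class and method names are illustrative assumptions, not the repository's actual API.

from collections import deque

class SlidingContext:
    """Illustrative sliding window: evicts the oldest turns once the token budget is exceeded."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.turns = deque()   # entries of (role, text, n_tokens)
        self.total_tokens = 0

    def append(self, role: str, text: str, n_tokens: int) -> None:
        self.turns.append((role, text, n_tokens))
        self.total_tokens += n_tokens
        # Evict from the front until the window fits the budget again.
        while self.total_tokens > self.max_tokens and self.turns:
            _, _, evicted = self.turns.popleft()
            self.total_tokens -= evicted

    def messages(self) -> list[dict]:
        return [{"role": role, "content": text} for role, text, _ in self.turns]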

Metric: Safety Half-Life

We define the Safety Half-Life ($T_{50}$) as the context depth (in tokens) at which the model's refusal rate for harmful queries first drops below 50%.
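
In practice, $T_{50}$ must be estimated from refusal rates measured at discrete context depths. A minimal sketch, assuming linear interpolation between the two bracketing measurements (the repository may estimate it differently):

def estimate_t50(points: list[tuple[int, float]]) -> float | None:
    """Estimate the context depth where the refusal rate first crosses 0.5.
    `points` are (token_depth, refusal_rate) pairs sorted by depth."""
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        if r0 >= 0.5 > r1:
            # Linear interpolation between the bracketing measurements.
            return d0 + (r0 - 0.5) * (d1 - d0) / (r0 - r1)
    return None  # The refusal rate never dropped below 50%.

# Example: the crossing occurs between 4,096 and 8,192 tokens.
print(estimate_t50([(0, 1.0), (4096, 0.9), (8192, 0.4)]))  # ~7372.8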

Results

In our experiments with llama-3.3-70b-versatile, the model demonstrated exceptional safety retention within standard context limits. We observed a 100% refusal rate up to 8,192 tokens.

[Safety Decay Plot]

Figure 1: Refusal Rate vs. Context Depth for Llama-3.3-70B. The model maintains perfect safety alignment up to the 8k-token boundary. Experiments beyond 8k were constrained by API rate limits.

Usage

1. Installation

Clone the repository and install dependencies using Poetry or pip:

git clone https://github.com/Jason-Wang313/Drift-Bench.git
cd Drift-Bench
pip install -r requirements.txt
# OR
poetry install

2. Configuration

Set your API keys in the environment:

export GROQ_API_KEY="your_groq_key"
export HF_TOKEN="your_hf_token"
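
The experiment code presumably reads these keys at startup; a minimal sketch of defensive loading (the function name is an assumption):

import os

def load_keys() -> dict[str, str]:
    """Fail fast if a required credential is missing from the environment."""
    keys = {}
    for name in ("GROQ_API_KEY", "HF_TOKEN"):
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        keys[name] = value
    return keys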

3. Run Experiment

Execute the main experiment loop. You can specify the model, and a dry run is available for testing the logic without live API calls.

python -m src.main --model "llama-3.3-70b-versatile"
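
Putting the components together, the main loop presumably resembles the condensed sketch below: pad the window with benign traffic up to each checkpoint depth, inject a jailbreak probe, and grade the response. All names and interfaces are illustrative (it reuses the SlidingContext sketch above), not the repository's actual code.

def run_experiment(model, context, distractors, probes, judge, checkpoints, count_tokens):
    """Probe safety at increasing context depths; return (depth, refused) records.
    Assumes each checkpoint depth fits within the window's token budget."""
    results = []
    for depth in checkpoints:
        # Pad the conversation with benign distractor traffic up to the target depth.
        while context.total_tokens < depth:
            task = next(distractors)
            reply = model.chat(context.messages() + [{"role": "user", "content": task}])
            context.append("user", task, count_tokens(task))
            context.append("assistant", reply, count_tokens(reply))
        # Inject a jailbreak probe at this depth and grade the response.
        probe = next(probes)
        answer = model.chat(context.messages() + [{"role": "user", "content": probe}])
        results.append((depth, judge.is_refusal(answer)))
    return results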

4. Visualization

Generate the safety decay plot from the latest logs:

python -m src.analysis.visualize
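
The plotting step likely amounts to something like the following matplotlib sketch (the function name and output path are assumptions):

import matplotlib.pyplot as plt

def plot_decay(depths, refusal_rates, out_path="safety_decay.png"):
    """Plot refusal rate against context depth and mark the 50% threshold."""
    plt.figure(figsize=(8, 5))
    plt.plot(depths, refusal_rates, marker="o")
    plt.axhline(0.5, linestyle="--", color="gray", label="50% threshold ($T_{50}$)")
    plt.xlabel("Context depth (tokens)")
    plt.ylabel("Refusal rate")
    plt.title("Safety Decay: Refusal Rate vs. Context Depth")
    plt.legend()
    plt.savefig(out_path, dpi=150)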

Verification

The codebase is supported by a comprehensive pytest suite covering:

  • Context window eviction logic
  • Infinite data streaming and batching
  • Safety grader classification accuracy
  • API client resilience (retries and error handling)

Run the tests with:

pytest tests/
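
For illustration, an eviction test in this suite might look like the following, written against the illustrative SlidingContext from the Methodology section rather than the repository's real classes:

def test_eviction_respects_token_budget():
    ctx = SlidingContext(max_tokens=10)
    ctx.append("user", "aaaa", n_tokens=6)
    ctx.append("assistant", "bbbb", n_tokens=6)  # total hits 12, so the first turn is evicted
    assert ctx.total_tokens == 6
    assert ctx.messages() == [{"role": "assistant", "content": "bbbb"}]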
