As Large Language Models (LLMs) process extended context windows, their adherence to initial safety conditioning (system prompts, RLHF fine-tuning) may degrade—a phenomenon we term "Safety Half-Life". This project provides a rigorous framework to quantify this decay by measuring the number of "benign" conversation tokens required for a model to "forget" its safety alignment and succumb to jailbreak attempts.
Drift-Bench employs a modular architecture to simulate long-context drift:
- ContextEngine: Manages a sliding window of conversation history, tracking precise token consumption.
- DistractorStream: An infinite iterator of benign tasks (GSM8k math, Alpaca instructions, Wikitext knowledge) that fills the context window to simulate realistic usage.
- SafetyProbe: Injects jailbreak prompts (from JailbreakBench) at scheduled intervals.
- RefusalJudge: evaluates model responses using a hybrid approach (BERT classifier + heuristic regex) to determine safety compliance.
We define the Safety Half-Life (
In our experiments with llama-3.3-70b-versatile, the model demonstrated exceptional safety retention within standard context limits. We observed a 100% refusal rate up to 8,192 tokens.
Figure 1: Refusal Rate vs. Context Depth for Llama-3.3-70B. The model maintains perfect safety alignment up to the 8k token boundary. Experiments beyond 8k were constrained by API rate limits.
Clone the repository and install dependencies using Poetry or pip:
git clone https://github.com/yourusername/drift-bench.git
cd drift-bench
pip install -r requirements.txt
# OR
poetry installSet your API keys in the environment:
export GROQ_API_KEY="your_groq_key"
export HF_TOKEN="your_hf_token"Execute the main experiment loop. You can specify the model and choose a dry-run for testing logic.
python -m src.main --model "llama-3.3-70b-versatile"Generate the safety decay plot from the latest logs:
python -m src.analysis.visualizeThe codebase is supported by a comprehensive pytest suite covering:
- Context window eviction logic
- Infinite data streaming and batching
- Safety grader classification accuracy
- API client resilience (retries and error handling)
Run the tests with:
pytest tests/