A mechanistic interpretability project studying misgeneralization in small transformer models by identifying "panic neurons" whose activations change significantly under adversarial inputs.
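The core ranking idea — score each neuron by how much its activation shifts between clean and adversarial inputs, then keep the top-k — can be sketched as follows. This is a minimal illustration on random arrays, not the project's actual `dn.py` implementation; shapes and names are assumptions.

```python
import numpy as np

def delta_norms(clean_acts, adv_acts):
    """Per-neuron L2 norm of the activation difference across prompts.

    clean_acts, adv_acts: arrays of shape (n_prompts, d_model) for one layer.
    Returns one score per neuron (shape (d_model,)).
    """
    diff = adv_acts - clean_acts             # (n_prompts, d_model)
    return np.linalg.norm(diff, axis=0)      # aggregate over prompts

def top_k_panic_neurons(clean_acts, adv_acts, k=20):
    scores = delta_norms(clean_acts, adv_acts)
    order = np.argsort(scores)[::-1][:k]     # indices of the largest deltas
    return order, scores[order]

# Toy demo: shift one neuron under the "adversarial" condition.
rng = np.random.default_rng(42)
clean = rng.normal(size=(200, 768))
adv = clean.copy()
adv[:, 5] += 3.0                             # neuron 5 reacts to the trigger
idx, vals = top_k_panic_neurons(clean, adv, k=3)
print(idx[0])  # 5 — the shifted neuron ranks first
```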
```
uv venv
uv pip install -e .
```

Dependencies:

- `transformer_lens` - Model loading and activation extraction
- `torch` - PyTorch for tensor operations
- `datasets` - Dataset loading (SST-2)
- `matplotlib` & `seaborn` - Plotting
- `pandas` & `numpy` - Data manipulation
```
python run.py [OPTIONS]
```

Options:

- `--model`: HuggingFace model path (default: `EleutherAI/pythia-160m-deduped`)
- `--num-prompts`: Number of prompts to process (default: `200`)
- `--k`: Number of panic components to rank (default: `20`)
- `--batch-size`: Batch size for processing (default: `16`)
- `--device`: Device to use (default: `cuda`)
- `--save`: Output directory (default: `artifacts/`)
- `--n-candidates`: Number of candidate tokens for adversarial generation (default: `500`)
- `--n-random`: Number of random trials for patching baseline (default: `5`)
- `--seed`: Random seed (default: `42`)
- `-v`, `--verbose`: Enable verbose output
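For reference, the option list above corresponds to a CLI roughly like the following. This is a hypothetical reconstruction with `argparse` — names and defaults are taken from the list above, but the actual parser in `run.py` may be structured differently.

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of run.py's CLI; not the project's actual code.
    p = argparse.ArgumentParser(description="Panic-neuron experiment")
    p.add_argument("--model", default="EleutherAI/pythia-160m-deduped")
    p.add_argument("--num-prompts", type=int, default=200)
    p.add_argument("--k", type=int, default=20)
    p.add_argument("--batch-size", type=int, default=16)
    p.add_argument("--device", default="cuda")
    p.add_argument("--save", default="artifacts/")
    p.add_argument("--n-candidates", type=int, default=500)
    p.add_argument("--n-random", type=int, default=5)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("-v", "--verbose", action="store_true")
    return p

args = build_parser().parse_args([])         # defaults only
print(args.model)  # EleutherAI/pythia-160m-deduped
```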
Example:

```
python run.py --num-prompts 200 --k 20 --verbose
```

The notebooks (`experiment.ipynb` and `sanity_checks.ipynb`) import functions from the `src/` modules. They can be run top-to-bottom and produce the same results as `run.py`.
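Activation extraction in a pipeline like this typically relies on PyTorch forward hooks that cache intermediate outputs during a forward pass. A minimal sketch on a toy module follows — the real project hooks pythia-160m, and the module and hook names here are illustrative assumptions, not the contents of `src/acts.py`.

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Stand-in for a transformer block's MLP."""
    def __init__(self, d=8):
        super().__init__()
        self.fc = nn.Linear(d, d)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.fc(x))

cache = {}

def save_hook(name):
    def hook(module, inputs, output):
        cache[name] = output.detach()        # stash activations for later analysis
    return hook

model = TinyMLP()
handle = model.act.register_forward_hook(save_hook("mlp.act"))
x = torch.randn(4, 8)
model(x)                                     # forward pass fills the cache
handle.remove()                              # always remove hooks after capture
print(cache["mlp.act"].shape)                # torch.Size([4, 8])
```

The same pattern scales to a full model: register one hook per layer of interest, run the clean and adversarial batches, and compare the two caches.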
```
src/
├── data.py     # Loading prompts from SST-2
├── model.py    # Loading model + tokenizer
├── adv.py      # Creating adversarial triggers
├── acts.py     # Activation extraction + hooks
├── dn.py       # Delta-norm computation + ranking
├── patch.py    # Causal patching
└── plot.py     # Visualizations
```
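The causal patching step (`patch.py`) tests whether a ranked neuron actually mediates the adversarial behavior: overwrite its adversarial activation with the clean value and check how much of the output gap closes. A toy sketch of the logic, with a hypothetical linear readout standing in for the model's downstream computation:

```python
import numpy as np

def patch_neuron(adv_acts, clean_acts, neuron):
    """Overwrite one neuron's adversarial activation with its clean value."""
    patched = adv_acts.copy()
    patched[:, neuron] = clean_acts[:, neuron]
    return patched

def output_proxy(acts, readout):
    # Hypothetical linear readout standing in for downstream logits.
    return acts @ readout

rng = np.random.default_rng(0)
readout = rng.normal(size=16)
clean = rng.normal(size=(50, 16))
adv = clean.copy()
adv[:, 3] += 5.0                             # neuron 3 drives the adversarial shift

gap_before = np.abs(output_proxy(adv, readout) - output_proxy(clean, readout)).mean()
patched = patch_neuron(adv, clean, neuron=3)
gap_after = np.abs(output_proxy(patched, readout) - output_proxy(clean, readout)).mean()
assert gap_after < gap_before                # patching the panic neuron closes the gap
```

Comparing against patches of randomly chosen neurons (the `--n-random` baseline) distinguishes a genuinely causal neuron from one whose delta norm is large by chance.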
- Main Experiment Notebook: `experiment.ipynb`
- Sanity Checks Notebook: `sanity_checks.ipynb`
- Supplementary Materials: Google Drive