A mechanistic interpretability project studying misgeneralization in small transformer models by identifying "panic neurons" whose activations change significantly under adversarial inputs.
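The core ranking idea — score each neuron by how much its activation shifts between clean and adversarial inputs, then keep the top-k — can be sketched as follows. This is a minimal illustration on random arrays, not the project's actual `dn.py` implementation; shapes and names are assumptions.

```python
import numpy as np

def delta_norms(clean_acts, adv_acts):
    """Per-neuron L2 norm of the activation difference across prompts.

    clean_acts, adv_acts: arrays of shape (n_prompts, d_model) for one layer.
    Returns one score per neuron (shape (d_model,)).
    """
    diff = adv_acts - clean_acts             # (n_prompts, d_model)
    return np.linalg.norm(diff, axis=0)      # aggregate over prompts

def top_k_panic_neurons(clean_acts, adv_acts, k=20):
    scores = delta_norms(clean_acts, adv_acts)
    order = np.argsort(scores)[::-1][:k]     # indices of the largest deltas
    return order, scores[order]

# Toy demo: shift one neuron under the "adversarial" condition.
rng = np.random.default_rng(42)
clean = rng.normal(size=(200, 768))
adv = clean.copy()
adv[:, 5] += 3.0                             # neuron 5 reacts to the trigger
idx, vals = top_k_panic_neurons(clean, adv, k=3)
print(idx[0])  # 5 — the shifted neuron ranks first
```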
```
uv venv
uv pip install -e .
```

Dependencies:

- `transformer_lens` - Model loading and activation extraction
- `torch` - PyTorch for tensor operations
- `datasets` - Dataset loading (SST-2)
- `matplotlib` & `seaborn` - Plotting
- `pandas` & `numpy` - Data manipulation
```
python run.py [OPTIONS]
```

Options:

- `--model`: HuggingFace model path (default: `EleutherAI/pythia-160m-deduped`)
- `--num-prompts`: Number of prompts to process (default: `200`)
- `--k`: Number of panic components to rank (default: `20`)
- `--batch-size`: Batch size for processing (default: `16`)
- `--device`: Device to use (default: `cuda`)
- `--save`: Output directory (default: `artifacts/`)
- `--n-candidates`: Number of candidate tokens for adversarial generation (default: `500`)
- `--n-random`: Number of random trials for patching baseline (default: `5`)
- `--seed`: Random seed (default: `42`)
- `-v`, `--verbose`: Enable verbose output
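For reference, the option list above corresponds to a CLI roughly like the following. This is a hypothetical reconstruction with `argparse` — names and defaults are taken from the list above, but the actual parser in `run.py` may be structured differently.

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of run.py's CLI; not the project's actual code.
    p = argparse.ArgumentParser(description="Panic-neuron experiment")
    p.add_argument("--model", default="EleutherAI/pythia-160m-deduped")
    p.add_argument("--num-prompts", type=int, default=200)
    p.add_argument("--k", type=int, default=20)
    p.add_argument("--batch-size", type=int, default=16)
    p.add_argument("--device", default="cuda")
    p.add_argument("--save", default="artifacts/")
    p.add_argument("--n-candidates", type=int, default=500)
    p.add_argument("--n-random", type=int, default=5)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("-v", "--verbose", action="store_true")
    return p

args = build_parser().parse_args([])         # defaults only
print(args.model)  # EleutherAI/pythia-160m-deduped
```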
Example:

```
python run.py --num-prompts 200 --k 20 --verbose
```

The notebooks (`experiment.ipynb` and `sanity_checks.ipynb`) import functions from the `src/` modules. They can be run top-to-bottom and produce the same results as `run.py`.
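Activation extraction in a pipeline like this typically relies on PyTorch forward hooks that cache intermediate outputs during a forward pass. A minimal sketch on a toy module follows — the real project hooks pythia-160m, and the module and hook names here are illustrative assumptions, not the contents of `src/acts.py`.

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Stand-in for a transformer block's MLP."""
    def __init__(self, d=8):
        super().__init__()
        self.fc = nn.Linear(d, d)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.fc(x))

cache = {}

def save_hook(name):
    def hook(module, inputs, output):
        cache[name] = output.detach()        # stash activations for later analysis
    return hook

model = TinyMLP()
handle = model.act.register_forward_hook(save_hook("mlp.act"))
x = torch.randn(4, 8)
model(x)                                     # forward pass fills the cache
handle.remove()                              # always remove hooks after capture
print(cache["mlp.act"].shape)                # torch.Size([4, 8])
```

The same pattern scales to a full model: register one hook per layer of interest, run the clean and adversarial batches, and compare the two caches.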
```
src/
├── data.py     # Loading prompts from SST-2
├── model.py    # Loading model + tokenizer
├── adv.py      # Creating adversarial triggers
├── acts.py     # Activation extraction + hooks
├── dn.py       # Delta-norm computation + ranking
├── patch.py    # Causal patching
└── plot.py     # Visualizations
```
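The causal patching step (`patch.py`) tests whether a ranked neuron actually mediates the adversarial behavior: overwrite its adversarial activation with the clean value and check how much of the output gap closes. A toy sketch of the logic, with a hypothetical linear readout standing in for the model's downstream computation:

```python
import numpy as np

def patch_neuron(adv_acts, clean_acts, neuron):
    """Overwrite one neuron's adversarial activation with its clean value."""
    patched = adv_acts.copy()
    patched[:, neuron] = clean_acts[:, neuron]
    return patched

def output_proxy(acts, readout):
    # Hypothetical linear readout standing in for downstream logits.
    return acts @ readout

rng = np.random.default_rng(0)
readout = rng.normal(size=16)
clean = rng.normal(size=(50, 16))
adv = clean.copy()
adv[:, 3] += 5.0                             # neuron 3 drives the adversarial shift

gap_before = np.abs(output_proxy(adv, readout) - output_proxy(clean, readout)).mean()
patched = patch_neuron(adv, clean, neuron=3)
gap_after = np.abs(output_proxy(patched, readout) - output_proxy(clean, readout)).mean()
assert gap_after < gap_before                # patching the panic neuron closes the gap
```

Comparing against patches of randomly chosen neurons (the `--n-random` baseline) distinguishes a genuinely causal neuron from one whose delta norm is large by chance.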
- Main Experiment Notebook: `experiment.ipynb`
- Sanity Checks Notebook: `sanity_checks.ipynb`
- Supplementary Materials: Google Drive