A mechanistic interpretability tool that detects and analyzes induction heads in GPT-2 Small using TransformerLens.
Detected 12 induction heads in GPT-2 Small (top 5 by score shown below):
| Layer | Head | Score |
|---|---|---|
| 5 | 5 | 0.932 |
| 6 | 9 | 0.913 |
| 7 | 10 | 0.910 |
| 5 | 1 | 0.908 |
| 7 | 2 | 0.833 |
Found the induction circuit: previous-token heads (layers 2-3) compose with induction heads (layers 5-7). Strongest composition pairs (a scoring sketch follows this list):
- L3H3 → L5H1: 30.68
- L2H2 → L5H1: 29.66
- L3H3 → L6H9: 27.07
- L3H3 → L5H5: 26.56
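The scores above appear to be unnormalized. For reference, a minimal sketch of the *normalized* K-composition score (as defined in the Mathematical Framework paper), using TransformerLens weight accessors, might look like this; the `k_composition` helper is illustrative, not the repo's code:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

def k_composition(prev_layer: int, prev_head: int, ind_layer: int, ind_head: int) -> torch.Tensor:
    """Normalized K-composition between an earlier head's output and a later head's key input."""
    # Output circuit of the earlier (previous-token) head: d_model -> d_model
    W_OV = model.W_V[prev_layer, prev_head] @ model.W_O[prev_layer, prev_head]
    # Query-key circuit of the later (induction) head: d_model x d_model bilinear form
    W_QK = model.W_Q[ind_layer, ind_head] @ model.W_K[ind_layer, ind_head].T
    # How much of the induction head's key input is explained by what the earlier head wrote.
    return (W_QK @ W_OV.T).norm() / (W_QK.norm() * W_OV.norm())

print(f"L3H3 -> L5H1: {k_composition(3, 3, 5, 1).item():.3f}")
```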
Induction heads implement in-context learning by completing patterns:
[A][B] ... [A] → predicts [B]
Example: "Harry Potter... Harry" → predicts "Potter"
The mechanism requires two attention heads working together:
- Previous Token Head (Layer L): Writes "what came before me" into the residual stream
- Induction Head (Layer L+k): Finds previous occurrences of the current token and copies what came after (sketched below)
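A minimal sketch of checking this on real text with TransformerLens, assuming the layer/head indices in the table above are 0-based (as in TransformerLens) and that L5H5 behaves as an induction head:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

# Both occurrences of "Harry" are preceded by a space, so they map to the same token.
text = "The boy Harry Potter lived under the stairs. Everyone knew Harry"
tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)

layer, head = 5, 5                              # top-scoring head from the detection table
pattern = cache["pattern", layer][0, head]      # [dest_pos, src_pos]
str_tokens = model.to_str_tokens(text)

# Where does the final " Harry" attend? An induction head should put most of its
# weight on the earlier " Potter", the token that followed " Harry" the first time.
final_attn = pattern[-1]
top_src = final_attn.argmax().item()
print(f"{str_tokens[-1]!r} attends most to {str_tokens[top_src]!r} "
      f"(weight {final_attn[top_src].item():.2f})")
```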
```bash
git clone https://github.com/designer-coderajay/induction-head-detector.git
cd induction-head-detector
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

```bash
python induction_detector.py
```

Outputs:
- `induction_heatmap.png` - 12×12 grid of induction scores (plotting sketch below)
- `top_induction_head_attention.png` - Attention pattern of the top-scoring head
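A minimal sketch of how a heatmap like `induction_heatmap.png` could be rendered; the placeholder data stands in for the real [n_layers, n_heads] score matrix (see the score formula further down), and the styling is not necessarily what `induction_detector.py` produces:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; substitute the real [n_layers, n_heads] induction-score matrix.
scores = np.random.rand(12, 12)

plt.imshow(scores, cmap="viridis", origin="lower")
plt.xlabel("Head")
plt.ylabel("Layer")
plt.colorbar(label="Induction score")
plt.title("GPT-2 Small induction scores")
plt.savefig("induction_heatmap.png", dpi=150)
```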
```bash
python deeper_analysis.py
```

Runs:
- Real text visualization ("Harry Potter... Harry")
- Previous-token head detection (sketched after this list)
- K-composition scoring between head pairs
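A minimal sketch of the previous-token head detection step; the score definition (mean attention from each position to the one directly before it) is the standard one and an assumption about what `deeper_analysis.py` measures:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

# Any natural text works: previous-token heads attend to position i-1 regardless of content.
tokens = model.to_tokens("Harry Potter and the Philosopher's Stone, by J. K. Rowling.")
_, cache = model.run_with_cache(tokens)

prev_scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]                        # [n_heads, dest, src]
    # Mean attention from each position to the position directly before it.
    prev_scores[layer] = pattern.diagonal(-1, dim1=-2, dim2=-1).mean(dim=-1)

idx = prev_scores.flatten().argmax().item()
layer, head = divmod(idx, model.cfg.n_heads)
print(f"Strongest previous-token head: L{layer}H{head} ({prev_scores[layer, head].item():.3f})")
```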
```
induction-head-detector/
├── induction_detector.py    # Main detection script
├── deeper_analysis.py       # Ablation & composition analysis
├── theory_deep_dive.py      # Educational explanations
├── test_detector.py         # Unit tests (7 passed)
├── requirements.txt
├── induction_heatmap.png
├── previous_token_heads.png
└── induction_real_text.png
```
Induction Score Calculation:
For a repeated sequence [r1, r2, ..., r50, r1, r2, ..., r50]:
At position i in the second half, measure attention to position i - seq_len + 1 (what followed the previous occurrence of the current token).
```
induction_score = mean(
    attention[i, i - seq_len + 1]
    for i in range(seq_len, 2 * seq_len)
)
```

- Score > 0.4 → Induction head
- Score < 0.2 → Not an induction head
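A minimal sketch of this calculation with TransformerLens; the random-token range and seed are arbitrary choices, and `induction_detector.py` may differ in details:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small
seq_len = 50

# Repeated random tokens: [r1 ... r50, r1 ... r50]
torch.manual_seed(0)
rand = torch.randint(1000, model.cfg.d_vocab, (1, seq_len))
tokens = torch.cat([rand, rand], dim=1)

_, cache = model.run_with_cache(tokens)

dest = torch.arange(seq_len, 2 * seq_len)  # positions in the second half
src = dest - seq_len + 1                   # what followed the previous occurrence

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]   # [n_heads, dest_pos, src_pos]
    scores[layer] = pattern[:, dest, src].mean(dim=-1)

# Heads above the 0.4 threshold are flagged as induction heads.
for layer, head in (scores > 0.4).nonzero().tolist():
    print(f"L{layer}H{head}: {scores[layer, head].item():.3f}")
```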
- In-context Learning and Induction Heads - Anthropic, 2022
- A Mathematical Framework for Transformer Circuits - Anthropic, 2021
- TransformerLens
MIT
