# Induction Head Detector

Mechanistic interpretability tool that detects and analyzes induction heads in GPT-2 Small using TransformerLens.

## Results

Detected 12 induction heads in GPT-2 Small. Top 5 by induction score:

| Layer | Head | Score |
|------:|-----:|------:|
| 5 | 5 | 0.932 |
| 6 | 9 | 0.913 |
| 7 | 10 | 0.910 |
| 5 | 1 | 0.908 |
| 7 | 2 | 0.833 |

![Induction heatmap](induction_heatmap.png)

## K-Composition Analysis

The analysis recovers the induction circuit: previous-token heads (layers 2-3) compose with induction heads (layers 5-7). Strongest pairs:

```
L3H3 → L5H1: 30.68
L2H2 → L5H1: 29.66
L3H3 → L6H9: 27.07
L3H3 → L5H5: 26.56
```

## What Are Induction Heads?

Induction heads implement in-context learning by completing patterns:

```
[A][B] ... [A] → predicts [B]
```

Example: "Harry Potter... Harry" → predicts "Potter"

The mechanism requires two attention heads working together:

1. **Previous-token head** (layer L): writes "what token came before me" into the residual stream at each position.
2. **Induction head** (layer L+k): reads that signal to find earlier occurrences of the current token, then copies the token that came after them.
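
As a sanity check, here is a minimal sketch (assuming TransformerLens is installed, as in the requirements; the prompt is illustrative) that asks GPT-2 Small to complete the pattern from the example above:

```python
from transformer_lens import HookedTransformer

# Load GPT-2 Small, the model analyzed by this repo
model = HookedTransformer.from_pretrained("gpt2")

# [A][B] ... [A] with A = " Harry", B = " Potter"
prompt = "Harry Potter went to Hogwarts. Years later, Harry"
logits = model(prompt)              # [batch, seq, d_vocab]
top_token = logits[0, -1].argmax()  # greedy next-token prediction
print(repr(model.tokenizer.decode(top_token)))  # typically ' Potter'
```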

## Installation

```bash
git clone https://github.com/designer-coderajay/induction-head-detector.git
cd induction-head-detector

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## Usage

### Basic Detection

```bash
python induction_detector.py
```

Outputs:

- `induction_heatmap.png` - 12×12 grid of induction scores
- `top_induction_head_attention.png` - attention pattern of the top head

### Deeper Analysis

```bash
python deeper_analysis.py
```

Runs:

- Real-text visualization ("Harry Potter... Harry")
- Previous-token head detection
- K-composition scoring between head pairs (sketched below)
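
The exact metric used by `deeper_analysis.py` isn't reproduced here (the scores above are unnormalized), but a minimal sketch of a standard Frobenius-norm K-composition score, computed from TransformerLens weight matrices, looks like this:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

def k_composition(prev_layer: int, prev_head: int,
                  ind_layer: int, ind_head: int) -> float:
    """Normalized K-composition between an earlier and a later head."""
    # Output-value circuit of the earlier (previous-token) head: d_model x d_model
    W_OV = model.W_V[prev_layer, prev_head] @ model.W_O[prev_layer, prev_head]
    # Query-key circuit of the later (induction) head: d_model x d_model
    W_QK = model.W_Q[ind_layer, ind_head] @ model.W_K[ind_layer, ind_head].T
    # How strongly the earlier head's output feeds the later head's keys
    comp = W_QK.T @ W_OV
    return (comp.norm() / (W_QK.norm() * W_OV.norm())).item()

print(k_composition(3, 3, 5, 1))  # L3H3 -> L5H1, the strongest pair above
```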

## Project Structure

```
induction-head-detector/
├── induction_detector.py    # Main detection script
├── deeper_analysis.py       # Ablation & composition analysis
├── theory_deep_dive.py      # Educational explanations
├── test_detector.py         # Unit tests (7 passed)
├── requirements.txt
├── induction_heatmap.png
├── previous_token_heads.png
└── induction_real_text.png
```

## How It Works

### Induction Score Calculation

For a repeated sequence `[r1, r2, ..., r50, r1, r2, ..., r50]`, at each position `i` in the second half, measure the attention paid to position `i - seq_len + 1`, the token that followed the previous occurrence of the current token:

```
induction_score = mean(attention[i, i - seq_len + 1]
                       for i in range(seq_len, 2 * seq_len))
```

- Score > 0.4 → induction head
- Score < 0.2 → not an induction head
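
A minimal sketch of this computation with TransformerLens (details such as seeding, batching, or prepending a BOS token may differ from `induction_detector.py`):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

seq_len = 50
half = torch.randint(0, model.cfg.d_vocab, (1, seq_len))
tokens = torch.cat([half, half], dim=1)  # [r1..r50, r1..r50]

_, cache = model.run_with_cache(tokens, remove_batch_dim=True)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [n_heads, 2*seq_len, 2*seq_len]
    # attention[i, i - seq_len + 1] lies on the diagonal at offset -(seq_len - 1);
    # drop its first entry so i ranges over the second half only
    stripe = pattern.diagonal(-(seq_len - 1), dim1=-2, dim2=-1)
    scores[layer] = stripe[:, 1:].mean(dim=-1)

layer, head = divmod(scores.argmax().item(), model.cfg.n_heads)
print(f"Top head: L{layer}H{head}, score {scores.max():.3f}")
```

Plotting `scores` with `matplotlib.pyplot.imshow` reproduces the 12×12 heatmap saved by the main script.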


## License

MIT
