A from-scratch implementation of the GPT language model architecture, trained on the Tiny Shakespeare dataset.
This is not a library wrapper. Every component, from LayerNorm to the training loop with gradient accumulation, is implemented from first principles to demonstrate deep understanding of transformer architectures.
Using `model = GPT.from_pretrained('gpt2')` tells you nothing about whether someone understands transformers. This implementation proves I can:
- Build the architecture — Positional embeddings, multi-head causal attention, transformer blocks, the whole stack
- Train it properly — Learning rate scheduling, weight decay, gradient accumulation for memory-constrained hardware (the LR schedule is sketched just after this list)
- Debug it systematically — Comprehensive test suite comparing against PyTorch's implementations
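For a flavour of the training-side details, here is a minimal sketch of the warmup-plus-cosine learning-rate schedule described later in this README. The 3e-4 peak, the warmup, the cosine decay, and the minimum floor are all documented in the sections below; the function name, warmup length, and floor value here are illustrative assumptions, not necessarily the repo's exact defaults.

```python
import math

def get_lr(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, max_steps=5000):
    """Linear warmup to max_lr, then cosine decay down to a minimum floor."""
    if step < warmup_steps:                              # 1) linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:                                # 2) past the schedule: hold the floor
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))   # 3) cosine from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)
```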
       Input Tokens
             │
             ▼
┌─────────────────────────┐
│  Token Embedding (tied) │──────────────────────────┐
│  + Positional Embedding │                          │
└─────────────────────────┘                          │
             │                                       │
             ▼                                       │
┌─────────────────────────┐                          │
│    Transformer Block    │                          │
│  ┌───────────────────┐  │                          │
│  │     LayerNorm     │  │                          │
│  │  Multi-Head Attn  │◄─┼── Causal mask            │
│  │    + Residual     │  │                          │
│  ├───────────────────┤  │                          │
│  │     LayerNorm     │  │                          │
│  │  MLP (4x expand)  │  │                          │
│  │    + Residual     │  │                          │
│  └───────────────────┘  │                          │
└─────────────────────────┘                          │
             │  × N layers                           │
             ▼                                       │
┌─────────────────────────┐                          │
│     Final LayerNorm     │                          │
└─────────────────────────┘                          │
             │                                       │
             ▼                                       │
┌─────────────────────────┐                          │
│   Linear Head (tied) ◄──┼──────────────────────────┘
│   → vocab_size logits   │        Weight tying
└─────────────────────────┘
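In code, each block in the diagram reduces to a few dozen lines of PyTorch. The sketch below is illustrative rather than a copy of minigpt.py: the class names, GELU activation, and dropout placement are assumptions, but the pre-norm residual structure and the causal mask match the diagram.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head attention in which each position sees only itself and earlier positions."""
    def __init__(self, n_embd, n_head, dropout):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # joint query/key/value projection
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # split heads: (B, T, C) -> (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))          # attention scores
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~causal, float("-inf"))                    # hide future tokens
        att = self.drop(F.softmax(att, dim=-1))
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)         # merge heads
        return self.proj(y)

class Block(nn.Module):
    """One pre-norm transformer block, as drawn above."""
    def __init__(self, n_embd=384, n_head=6, dropout=0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, dropout)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                  # 4x hidden expansion
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # LayerNorm -> attention -> residual
        x = x + self.mlp(self.ln2(x))    # LayerNorm -> MLP -> residual
        return x
```

The full model stacks N of these blocks between the tied embeddings and the final LayerNorm shown in the diagram.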
When you can't fit large batches in GPU memory, you simulate them:
# Instead of one batch of 256:
optimizer.zero_grad()
for micro_step in range(4):            # 4 micro-batches of 64
    loss = model(micro_batch) / 4      # scale the loss so the gradients average
    loss.backward()                    # gradients accumulate across micro-batches
optimizer.step()                       # update once with the averaged gradients

Mathematically identical to large-batch training. Essential for training on consumer hardware.
Unlike the original transformer (LayerNorm after attention), GPT-2 and later models use Pre-Norm:

# Pre-Norm (used here)
x = x + attention(layernorm(x))

# Post-Norm (original transformer)
x = layernorm(x + attention(x))

Pre-Norm has better gradient flow in deep networks.
The token embedding and output projection share the same weight matrix. The embedding answers "what vector represents token X" and the output projection answers "what token does this vector represent": conceptually inverse operations, which is why they can share parameters.
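In PyTorch the tying itself is a single assignment: `nn.Embedding(vocab_size, n_embd)` and `nn.Linear(n_embd, vocab_size)` both store a `(vocab_size, n_embd)` weight, so the two modules can literally share one parameter. A minimal sketch (the attribute names and sizes here are illustrative, not necessarily the repo's):

```python
import torch.nn as nn

vocab_size, n_embd = 65, 384                          # illustrative sizes
tok_emb = nn.Embedding(vocab_size, n_embd)            # token id -> vector
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # vector -> logits over tokens

# Both weights have shape (vocab_size, n_embd), so they can be one shared parameter:
lm_head.weight = tok_emb.weight
```

Besides the conceptual symmetry, tying also keeps one large (vocab_size × n_embd) matrix out of the parameter count.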
# Clone and setup
git clone https://github.com/designer-coderajay/minigpt-shakespeare.git
cd minigpt-shakespeare
python -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt
# Download dataset
python download_data.py
# Run tests (verify implementation correctness)
pytest test_minigpt.py -v
# Train
python minigpt.py --mode train
# Generate text
python minigpt.py --mode generate --prompt "ROMEO:" --max_tokens 200

Default hyperparameters (tuned for laptop/small GPU):
| Parameter | Value | Notes |
|---|---|---|
| Layers | 6 | Number of transformer blocks |
| Heads | 6 | Attention heads per block |
| Embedding | 384 | Hidden dimension |
| Context | 256 | Maximum sequence length |
| Batch size | 64 | Micro-batch size |
| Gradient accum | 4 | Effective batch = 256 |
| Learning rate | 3e-4 | Peak LR with warmup |
| Dropout | 0.2 | Regularization |
~10.75M parameters. Trains in ~5-6 hours on M1 MacBook Pro (CPU).
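That figure can be sanity-checked from the table alone. The sketch below assumes a character-level vocabulary of roughly 65 tokens and ignores bias terms, so treat the result as an estimate rather than the exact count:

```python
n_layer, n_embd, block_size, vocab = 6, 384, 256, 65    # vocab size is an assumption

embeddings = vocab * n_embd + block_size * n_embd       # token (tied with the head) + positional
per_block = (
    4 * n_embd * n_embd        # attention: q, k, v and output projections
    + 8 * n_embd * n_embd      # MLP: expand to 4x and project back down
    + 4 * n_embd               # two LayerNorms (scale and shift each)
)
total = embeddings + n_layer * per_block + 2 * n_embd   # plus the final LayerNorm
print(f"~{total / 1e6:.2f}M parameters")                # prints ~10.75M
```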
Integrated with Weights & Biases:
# First time: login to wandb
wandb login
# Training automatically logs to wandb
python minigpt.py --mode train
# Disable if you prefer
python minigpt.py --mode train --no_wandb

The test suite verifies:
- LayerNorm: Output statistics, matches PyTorch implementation
- Attention: Causality (future tokens invisible), variable sequence lengths
- MLP: Shape preservation, nonlinearity verification
- Full model: Loss computation, generation, weight tying
- Gradient accumulation: Accumulated gradients = large-batch gradients (a sketch of this check follows the list)
- LR schedule: Warmup, cosine decay, minimum floor
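For example, the gradient-accumulation property can be checked against a tiny stand-in model rather than the full GPT. A hedged sketch of that style of test (names, sizes, and tolerances are illustrative, not the repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def test_grad_accumulation_matches_full_batch():
    torch.manual_seed(0)
    model_a = nn.Linear(8, 4)                       # tiny stand-in model
    model_b = nn.Linear(8, 4)
    model_b.load_state_dict(model_a.state_dict())   # identical starting weights
    x, y = torch.randn(16, 8), torch.randint(0, 4, (16,))

    # Full batch: one forward/backward over all 16 examples.
    F.cross_entropy(model_a(x), y).backward()

    # Accumulated: 4 micro-batches of 4, each loss scaled by 1/4.
    for xb, yb in zip(x.split(4), y.split(4)):
        (F.cross_entropy(model_b(xb), yb) / 4).backward()

    for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
        assert torch.allclose(p_a.grad, p_b.grad, atol=1e-6)
```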
pytest test_minigpt.py -v

After training (5000 iterations, ~12 hours on M1 MacBook Pro CPU):
ROMEO:
Art thou shalt be thy rest?
My dear lord? a little word again?
MERCUTIO:
Well, belike.
TYBALT:
Why, lady, is done. But when this devil's day
The affection of my affairs and thee?
KING:
The king of Buckingham, the king is taken.
QUEEN ELIZABETH:
Had his reasons, and his brotherhood more.
Not perfect Shakespeare, but the output shows recognizable dialogue structure, character names, and a rough poetic rhythm.
minigpt-shakespeare/
├── minigpt.py # Full implementation (model + training)
├── test_minigpt.py # Comprehensive test suite
├── download_data.py # Dataset download script
├── requirements.txt # Dependencies
├── data/
│ └── input.txt # Tiny Shakespeare dataset
└── README.md
- Attention Is All You Need — Original transformer paper
- Language Models are Unsupervised Multitask Learners — GPT-2 paper
- Karpathy's NanoGPT — Reference implementation
- makemore series — Educational background
MIT
Part of my from-scratch ML portfolio. See also: micrograd-numpy