MiniGPT-Shakespeare

A from-scratch implementation of the GPT language-model architecture, trained on Shakespeare's works (the Tiny Shakespeare dataset).

This is not a library wrapper. Every component, from LayerNorm to the training loop with gradient accumulation, is implemented from first principles to demonstrate deep understanding of transformer architectures.

Why This Project Exists

Using model = GPT.from_pretrained('gpt2') tells you nothing about whether someone understands transformers. This implementation proves I can:

  1. Build the architecture — Positional embeddings, multi-head causal attention, transformer blocks, the whole stack
  2. Train it properly — Learning rate scheduling, weight decay, gradient accumulation for memory-constrained hardware
  3. Debug it systematically — Comprehensive test suite comparing against PyTorch's implementations

Architecture

Input Tokens
    │
    ▼
┌─────────────────────────┐
│ Token Embedding (tied)  │──────────────────────────┐
│ + Positional Embedding  │                          │
└─────────────────────────┘                          │
    │                                                │
    ▼                                                │
┌─────────────────────────┐                          │
│   Transformer Block     │                          │
│  ┌───────────────────┐  │                          │
│  │ LayerNorm         │  │                          │
│  │ Multi-Head Attn   │◄─┼── Causal mask            │
│  │ + Residual        │  │                          │
│  ├───────────────────┤  │                          │
│  │ LayerNorm         │  │                          │
│  │ MLP (4x expand)   │  │                          │
│  │ + Residual        │  │                          │
│  └───────────────────┘  │                          │
└─────────────────────────┘                          │
    │ × N layers                                     │
    ▼                                                │
┌─────────────────────────┐                          │
│    Final LayerNorm      │                          │
└─────────────────────────┘                          │
    │                                                │
    ▼                                                │
┌─────────────────────────┐                          │
│   Linear Head (tied) ◄──┼──────────────────────────┘
│   → vocab_size logits   │   Weight tying
└─────────────────────────┘

Key Implementation Details

Gradient Accumulation

When you can't fit large batches in GPU memory, you simulate them:

# Instead of one batch of 256, accumulate over 4 micro-batches of 64:
optimizer.zero_grad()
for micro_step in range(4):
    loss = model(micro_batch) / 4  # scale so the accumulated gradient matches the mean over 256
    loss.backward()                # gradients add up in .grad across micro-batches
optimizer.step()                   # one update using the accumulated (averaged) gradients

Up to floating-point rounding, this is mathematically identical to training with the full batch of 256; it trades extra forward/backward passes for memory, which makes it essential on consumer hardware.

Pre-Norm Architecture

Unlike the original transformer (Post-Norm: LayerNorm applied after each sub-layer), GPT-2 and later models apply LayerNorm before each sub-layer (Pre-Norm):

# Pre-Norm (used here)
x = x + attention(layernorm(x))

# Post-Norm (original transformer)
x = layernorm(x + attention(x))

Pre-Norm keeps the residual path free of normalization, which gives better gradient flow and more stable training in deep networks.
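
The repository implements multi-head causal attention by hand; purely to illustrate the pre-norm wiring, the sketch below leans on PyTorch's built-in nn.MultiheadAttention instead (module and argument names are illustrative, not the repo's):

import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: LayerNorm feeds each sub-layer, a residual wraps it."""
    def __init__(self, n_embd, n_head, dropout):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # 4x expansion
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):                    # x: (B, T, n_embd)
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)                                                 # normalize before attention
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                                                # residual around attention
        x = x + self.mlp(self.ln2(x))                                   # normalize before MLP, residual
        return x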

Weight Tying

The token embedding and the output projection share the same weight matrix. The embedding answers "which vector represents token X?", while the output projection asks "which token does this vector look like?". These are conceptually inverse operations, so a single matrix can serve both roles and a large block of parameters is saved.
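
In code the tie is one assignment: the output Linear's (vocab_size, n_embd) weight is pointed at the embedding's matrix, so both always update together. A sketch of how it sits in the full stack from the diagram above, reusing the illustrative Block from the previous section (class and attribute names are not necessarily the repo's):

import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, n_layer, n_head, n_embd, block_size, dropout):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)      # token embedding
        self.pos_emb = nn.Embedding(block_size, n_embd)      # learned positional embedding
        self.blocks = nn.ModuleList([Block(n_embd, n_head, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)                     # final LayerNorm
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight            # weight tying: one matrix, two roles

    def forward(self, idx):                                  # idx: (B, T) token ids
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)            # (B, T, n_embd)
        for block in self.blocks:                            # N pre-norm transformer blocks
            x = block(x)
        return self.lm_head(self.ln_f(x))                    # (B, T, vocab_size) logits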

Quick Start

# Clone and setup
git clone https://github.com/designer-coderajay/minigpt-shakespeare.git
cd minigpt-shakespeare
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt

# Download dataset
python download_data.py

# Run tests (verify implementation correctness)
pytest test_minigpt.py -v

# Train
python minigpt.py --mode train

# Generate text
python minigpt.py --mode generate --prompt "ROMEO:" --max_tokens 200
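
Under the hood, --mode generate is ordinary autoregressive sampling: crop the prompt to the last block_size tokens, take the logits at the final position, sample a token, append it, repeat. A minimal sketch, assuming the model returns raw logits (helper and argument names are illustrative):

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, idx, max_new_tokens, block_size=256, temperature=1.0):
    # idx: (B, T) tensor of prompt token ids
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                      # keep only the context window
        logits = model(idx_cond)[:, -1, :] / temperature     # logits for the last position
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)    # sample one token per sequence
        idx = torch.cat([idx, next_id], dim=1)               # append and continue
    return idx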

Training Configuration

Default hyperparameters (tuned for laptop/small GPU):

| Parameter      | Value | Notes                        |
| -------------- | ----- | ---------------------------- |
| Layers         | 6     | Number of transformer blocks |
| Heads          | 6     | Attention heads per block    |
| Embedding      | 384   | Hidden dimension             |
| Context        | 256   | Maximum sequence length      |
| Batch size     | 64    | Micro-batch size             |
| Gradient accum | 4     | Effective batch = 256        |
| Learning rate  | 3e-4  | Peak LR with warmup          |
| Dropout        | 0.2   | Regularization               |

~10.75M parameters. Trains in ~5-6 hours on M1 MacBook Pro (CPU).
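
"Peak LR with warmup" in the table refers to the usual shape: linear warmup to 3e-4, cosine decay toward a floor, then hold. A sketch of that schedule (the warmup length and floor are illustrative assumptions, not the repo's exact values):

import math

def get_lr(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=200, max_steps=5000):
    if step < warmup_steps:                                  # linear warmup to the peak LR
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:                                    # past the decay horizon: hold the floor
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)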

Experiment Tracking

Integrated with Weights & Biases:

# First time: login to wandb
wandb login

# Training automatically logs to wandb
python minigpt.py --mode train

# Disable if you prefer
python minigpt.py --mode train --no_wandb
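
Inside the training loop the integration boils down to wandb.init once and wandb.log per step; a minimal sketch (project name, config keys, and metric names are illustrative):

import wandb

run = wandb.init(project="minigpt-shakespeare",
                 config={"n_layer": 6, "n_embd": 384, "lr": 3e-4})
for step in range(10):                                       # stand-in for the real training loop
    loss = 4.0 - 0.1 * step                                  # dummy metric, just for illustration
    wandb.log({"train/loss": loss, "lr": 3e-4}, step=step)
run.finish()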

Test Coverage

The test suite verifies:

  • LayerNorm: Output statistics, matches PyTorch implementation
  • Attention: Causality (future tokens invisible), variable sequence lengths
  • MLP: Shape preservation, nonlinearity verification
  • Full model: Loss computation, generation, weight tying
  • Gradient accumulation: Accumulated gradients = large batch gradients
  • LR schedule: Warmup, cosine decay, minimum floor
Run the full suite with:

pytest test_minigpt.py -v
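
As an example, causality can be checked by perturbing only future tokens and confirming that the logits at earlier positions do not move. A sketch of that kind of check (function and argument names are illustrative, not the suite's):

import torch

def check_causality(model, vocab_size=65, seq_len=16):
    # Two inputs that agree on the first half and differ afterwards must give
    # identical logits over the first half if the attention mask is truly causal.
    torch.manual_seed(0)
    model.eval()
    a = torch.randint(vocab_size, (1, seq_len))
    b = a.clone()
    b[:, seq_len // 2:] = torch.randint(vocab_size, (1, seq_len - seq_len // 2))
    with torch.no_grad():
        la, lb = model(a), model(b)
    assert torch.allclose(la[:, :seq_len // 2, :], lb[:, :seq_len // 2, :], atol=1e-5)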

Sample Output

After training (5000 iterations, ~12 hours on M1 MacBook Pro CPU):

ROMEO:
Art thou shalt be thy rest?
My dear lord? a little word again?

MERCUTIO:
Well, belike.

TYBALT:
Why, lady, is done. But when this devil's day
The affection of my affairs and thee?
KING:
The king of Buckingham, the king is taken.

QUEEN ELIZABETH:
Had his reasons, and his brotherhood more.

Not perfect Shakespeare, but the model clearly picks up the dialogue structure, character names, and rough poetic rhythm of the source.

File Structure

minigpt-shakespeare/
├── minigpt.py          # Full implementation (model + training)
├── test_minigpt.py     # Comprehensive test suite
├── download_data.py    # Dataset download script
├── requirements.txt    # Dependencies
├── data/
│   └── input.txt       # Tiny Shakespeare dataset
└── README.md

License

MIT


Part of my from-scratch ML portfolio. See also: micrograd-numpy
