This repository contains a complete implementation of GPT-2 (Generative Pre-trained Transformer 2) built from scratch using PyTorch. The implementation follows the original GPT-2 architecture and includes training, evaluation, and inference capabilities.
The GPT-2 model is implemented as a decoder-only transformer with the following key components:
- GPT (`gpt.py`): Main model class that combines all components
  - Token embeddings (`wte`) and positional embeddings (`wpe`)
  - Stack of transformer blocks
  - Final layer normalization and language modeling head
  - Weight sharing between token embeddings and output projection
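To make the wiring concrete, here is a minimal sketch of such a model class; it is not the exact code in `gpt.py`, and the config fields (`vocab_size`, `block_size`, `n_layer`, `n_embd`) and the `Block` class it references are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)   # token embeddings
        self.wpe = nn.Embedding(config.block_size, config.n_embd)   # positional embeddings
        self.blocks = nn.ModuleList(Block(config) for _ in range(config.n_layer))
        self.ln_f = nn.LayerNorm(config.n_embd)                     # final layer normalization
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                       # weight sharing with wte

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)            # (B, T, n_embd)
        for block in self.blocks:
            x = block(x)
        logits = self.lm_head(self.ln_f(x))          # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss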
- Block (`block.py`): Individual transformer block implementing:
  - Pre-normalization architecture (LayerNorm before attention/MLP)
  - Residual connections around attention and MLP layers
  - Standard transformer block: `x = x + attn(ln(x))` then `x = x + mlp(ln(x))`
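A pre-norm block of this shape can be sketched as follows (attribute names such as `ln_1`/`ln_2` are illustrative rather than copied from `block.py`):

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: LayerNorm is applied before attention and the MLP."""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual connection around attention
        x = x + self.mlp(self.ln_2(x))   # residual connection around the MLP
        return x
```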
- CausalSelfAttention (`casualattention.py`): Multi-head causal attention
  - Scaled dot-product attention with causal masking
  - Uses PyTorch's `F.scaled_dot_product_attention` for efficient computation
  - Multi-head attention with configurable number of heads
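A sketch of multi-head causal attention built on `F.scaled_dot_product_attention`; the fused QKV projection and attribute names are assumptions, not necessarily how `casualattention.py` is written:

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # fused Q, K, V projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # fused scaled dot-product attention with a causal mask
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```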
- MLP (`mlp.py`): Feed-forward network
  - Two linear layers with GELU activation
  - 4x expansion ratio (hidden dimension = 4 * embedding dimension)
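The feed-forward network with the 4x expansion can be sketched as below (the tanh-approximate GELU matches the original GPT-2; the repository may use the default GELU instead):

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)   # 4x expansion
        self.gelu = nn.GELU(approximate="tanh")                   # GPT-2-style GELU
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```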
- TrainingConfig (`config.py`): Hyperparameters and learning rate scheduling
  - Cosine annealing with linear warmup
  - AdamW optimizer configuration
  - Batch size: 524,288 tokens (~0.5M tokens per batch)
  - Learning rate: 6e-4 max, 6e-5 min
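The schedule described above, cosine annealing from 6e-4 down to 6e-5 with a linear warmup, can be sketched as a function of the step index (the warmup and total-step values are the ones listed in the training configuration further below):

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_STEPS, MAX_STEPS = 715, 19073

def get_lr(step):
    # linear warmup to MAX_LR
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # after the schedule ends, stay at the minimum
    if step > MAX_STEPS:
        return MIN_LR
    # cosine decay from MAX_LR to MIN_LR
    decay_ratio = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```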
- Trainer (`trainer.py`): Training loop implementation
  - Gradient accumulation for large effective batch sizes
  - Mixed precision training (bfloat16)
  - Gradient clipping and learning rate scheduling
  - Distributed training support
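A sketch of one optimizer step combining these pieces; it assumes the model returns `(logits, loss)` and that the data loader exposes a `next_batch()` method, so the real `trainer.py` may differ:

```python
import torch

def train_step(model, optimizer, loader, grad_accum_steps, lr, device="cuda"):
    """One optimizer step with gradient accumulation and bf16 mixed precision (sketch)."""
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):
        x, y = loader.next_batch()
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            _, loss = model(x, y)                 # assumes the model returns (logits, loss)
        (loss / grad_accum_steps).backward()      # average gradients over micro-batches
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    for group in optimizer.param_groups:          # learning rate from the schedule above
        group["lr"] = lr
    optimizer.step()
```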
- Evaluator (`evaluate.py`): Model evaluation
  - Validation loss computation
  - Text generation capabilities
  - HellaSwag evaluation (optional)
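Validation loss can be computed with a sketch like the following (the fixed number of validation micro-batches and the `next_batch()` interface are assumptions):

```python
import torch

@torch.no_grad()
def validation_loss(model, val_loader, device="cuda", steps=20):
    """Average loss over a fixed number of validation micro-batches (sketch)."""
    model.eval()
    total = 0.0
    for _ in range(steps):
        x, y = val_loader.next_batch()
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            _, loss = model(x, y)
        total += loss.item() / steps
    model.train()
    return total
```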
- DDPConfig (`distributed.py`): Distributed training setup
  - Multi-GPU support using DistributedDataParallel
  - Automatic device detection (CUDA/MPS/CPU)
  - Process group initialization for distributed training
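A sketch of the setup logic: detect whether the process was launched with `torchrun`, initialize the process group if so, and otherwise fall back to a single CUDA/MPS/CPU device (function and variable names are illustrative, not taken from `distributed.py`):

```python
import os
import torch
import torch.distributed as dist

def setup_distributed():
    """Pick a device and initialize DDP when launched with torchrun (sketch)."""
    if int(os.environ.get("RANK", -1)) != -1:          # launched with torchrun
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        device = f"cuda:{local_rank}"
        torch.cuda.set_device(device)
    else:                                              # single-process fallback
        local_rank = 0
        if torch.cuda.is_available():
            device = "cuda"
        elif torch.backends.mps.is_available():
            device = "mps"
        else:
            device = "cpu"
    return device, local_rank
```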
- FineWeb Dataset (`data_scripts/fineweb.py`):
  - Downloads and processes the FineWeb-Edu dataset
  - Tokenizes text using GPT-2's tiktoken tokenizer
  - Creates training shards with 10M tokens each
  - Supports multiprocessing for efficient tokenization
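A sketch of the per-document tokenization step, assuming the same tiktoken GPT-2 encoding and 10M-token shards described above (the exact signatures in `fineweb.py` may differ):

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
EOT = enc.eot_token                 # end-of-text token id, used as a document delimiter
SHARD_SIZE = 10_000_000             # 10M tokens per shard

def tokenize(doc):
    """Tokenize one document, prefixed with the end-of-text delimiter, as uint16."""
    tokens = [EOT] + enc.encode_ordinary(doc["text"])
    return np.array(tokens, dtype=np.uint16)   # GPT-2 vocab (50,257) fits in uint16
```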
- HellaSwag Dataset (`data_scripts/hellaswag.py`):
  - Downloads and processes the HellaSwag evaluation dataset
  - Commonsense reasoning benchmark for language models
  - Used for evaluation during training to track model performance
  - Provides 4-choice multiple-choice questions for validation
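A common way to score this 4-choice format with a language model is to compute the average token loss of each candidate ending given the context and pick the lowest; a sketch (the `(logits, loss)` model output is an assumption):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_ending(model, ctx_tokens, ending_token_lists, device="cuda"):
    """Score each candidate ending by its average token loss and pick the lowest (sketch)."""
    losses = []
    for ending in ending_token_lists:
        tokens = torch.tensor(ctx_tokens + ending, device=device).unsqueeze(0)
        logits, _ = model(tokens[:, :-1])                 # assumes (logits, loss) output
        # per-token cross-entropy restricted to the ending region
        ending_logits = logits[0, len(ctx_tokens) - 1:]
        targets = tokens[0, len(ctx_tokens):]
        losses.append(F.cross_entropy(ending_logits, targets).item())
    return int(torch.tensor(losses).argmin())
```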
- DataLoader (`data_scripts/dataload.py`):
  - Efficient data loading from pre-tokenized shards
  - Supports distributed training with proper data sharding
  - Memory-efficient streaming of large datasets
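A sketch of a shard-backed loader that starts each rank at a different offset so ranks see disjoint data (class and attribute names are illustrative, not taken from `dataload.py`):

```python
import numpy as np
import torch

class ShardedDataLoader:
    """Stream (x, y) batches from pre-tokenized uint16 shards (sketch)."""
    def __init__(self, shard_paths, B, T, rank=0, world_size=1):
        self.shard_paths, self.B, self.T = shard_paths, B, T
        self.rank, self.world_size = rank, world_size
        self.shard_idx = 0
        self.tokens = self._load(shard_paths[0])
        self.pos = B * T * rank            # each rank starts at a different offset

    def _load(self, path):
        return torch.from_numpy(np.load(path).astype(np.int64))

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)            # inputs
        y = buf[1:].view(B, T)             # targets, shifted by one token
        self.pos += B * T * self.world_size
        if self.pos + B * T * self.world_size + 1 > len(self.tokens):
            # move on to the next shard and reset the per-rank offset
            self.shard_idx = (self.shard_idx + 1) % len(self.shard_paths)
            self.tokens = self._load(self.shard_paths[self.shard_idx])
            self.pos = B * T * self.rank
        return x, y
```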
- Configurable model size (layers, heads, embedding dimension)
- Default configuration: 12 layers, 12 heads, 768 embedding dimension
- ~124M parameters (GPT-2 small configuration)
- Mixed precision training with bfloat16
- Gradient accumulation for large effective batch sizes
- Distributed training across multiple GPUs
- Cosine learning rate schedule with warmup
- Uses PyTorch's native scaled dot-product attention
- Pre-normalization (LayerNorm before attention/MLP)
- Weight sharing between input and output embeddings
- Efficient data loading and preprocessing
- FineWeb-Edu: High-quality educational web content
- ~10 billion tokens from web pages
- Pre-tokenized and sharded for efficient training
- GPT-2 tokenization (50,257 vocab size)
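The ~124M figure quoted above can be checked with a quick back-of-the-envelope count for the default 12-layer, 768-dimensional configuration (bias terms included, no extra parameters for the weight-tied output head):

```python
def gpt2_small_params(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    """Rough parameter count for the default GPT-2 small configuration."""
    embed = vocab_size * n_embd + block_size * n_embd                         # wte + wpe
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)    # qkv + output proj
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd) # 4x expansion
    layer_norms = 2 * 2 * n_embd                                              # two LayerNorms per block
    final_ln = 2 * n_embd
    # the language modeling head shares weights with wte, so it adds nothing extra
    return embed + n_layer * (attn + mlp + layer_norms) + final_ln

print(gpt2_small_params())  # 124439808, i.e. ~124.4M parameters
```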
To train the model locally:

```bash
cd src
python train_gpt.py
```

For automated deployment to cloud GPU instances, use the deployment script:
```bash
# Make script executable
chmod +x scripts/train_gpu_lambda.sh

# Deploy to remote machine (1 GPU)
./scripts/train_gpu_lambda.sh paperspace@184.105.3.177 1

# Deploy with multiple GPUs
./scripts/train_gpu_lambda.sh paperspace@184.105.3.177 4
```

The deployment script handles:

- Environment Setup: Installs Docker, NVIDIA drivers, and the container toolkit
- System Configuration: Configures Docker for GPU access and reboots
- Credential Transfer: Copies `gcp-key.json` for data access
- Training Execution: Pulls the Docker image and runs distributed training
- Automatic Cleanup: Handles permissions and directory setup

Requirements:

- SSH access to the remote machine
- `gcp-key.json` file in the current directory
- Remote machine with GPU(s) and Ubuntu or similar
```bash
# After training completes
scp -r paperspace@184.105.3.177:~/my-gpu-project/checkpoints ./
```

The project includes an optimized multi-stage Docker setup:
```bash
# Build image (uses BuildKit for faster builds)
DOCKER_BUILDKIT=1 docker build -t gpt2-training .

# Run locally with GPU
docker run --runtime=nvidia \
  -v $(pwd)/gcp-key.json:/app/gcp-key.json \
  -v $(pwd)/checkpoints:/app/checkpoints \
  --rm gpt2-training \
  bash -c "torchrun --nproc_per_node=1 train_gpt.py"
```

To download and tokenize the FineWeb-Edu training data:

```bash
cd src/data/data_scripts
python fineweb.py
```

Training uses the following hyperparameters:

- Batch Size: 524,288 tokens (~128 gradient accumulation steps)
- Sequence Length: 512 tokens
- Learning Rate: 6e-4 (max) with cosine decay to 6e-5 (min)
- Warmup: 715 steps
- Total Steps: ~19,073 (approximately 1 epoch on 10B tokens)
- Weight Decay: 0.1
- Optimizer: AdamW
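The batch-size and step counts above fit together as follows; the micro-batch of 8 sequences per GPU is an assumption chosen so that the 128 accumulation steps work out on a single GPU:

```python
total_batch_size = 524_288          # tokens per optimizer step (2**19)
B, T = 8, 512                       # assumed micro-batch: 8 sequences of 512 tokens per GPU
world_size = 1                      # number of GPUs

grad_accum_steps = total_batch_size // (B * T * world_size)
print(grad_accum_steps)             # 128 micro-batches per optimizer step on 1 GPU

total_tokens = 10_000_000_000       # ~10B FineWeb-Edu tokens
print(total_tokens // total_batch_size)  # 19073 steps, roughly one epoch
```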
The implementation follows the GPT-2 paper specifications:
- Vocabulary: 50,257 tokens (50k BPE merges + 256 byte tokens + 1 special token)
- Context Length: 1024 tokens maximum (configurable)
- Architecture: Decoder-only transformer with causal attention
- Normalization: Pre-normalization with LayerNorm
- Activation: GELU in MLP layers
- Attention: Multi-head causal self-attention with head dimension 64
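These specifications can be summarized in a small config; the field names here are illustrative, not necessarily those used in `config.py`:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024    # maximum context length
    vocab_size: int = 50257   # 50,000 BPE merges + 256 byte tokens + <|endoftext|>
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

cfg = GPTConfig()
assert cfg.n_embd // cfg.n_head == 64   # head dimension of 64
```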
This implementation provides a complete, trainable GPT-2 model suitable for research and educational purposes.
The project uses an optimized multi-stage Docker build for faster development:
- Builder Stage: Installs dependencies and build tools
- Runtime Stage: Contains only the application and runtime dependencies
- Benefits: 30-50% smaller final images, faster rebuilds for code changes
GitHub Actions automatically builds and pushes Docker images with intelligent caching:
- Dependency Changes: Full rebuild when `pyproject.toml` or `Dockerfile` changes
- Code Changes: Fast runtime-only builds for source code modifications
- Conditional Building: Automatically detects what changed and chooses optimal build strategy
- BuildKit Support: Parallel stage execution and advanced caching
- Layer Caching: Dependencies cached separately from source code
- Cache Mounts: Persistent package manager caches (optional)
- Smart Rebuilds: Only rebuilds necessary layers based on file changes