My implementation of GPT-2 (124M), inspired by Andrej Karpathy's "Let's Reproduce GPT-2 (124M)" video, with explanatory notes and my own additional implementations throughout.
What I gained from this project:
- Reinforced my understanding of the transformer architecture and how to implement it in PyTorch/Python (a minimal block sketch follows this list).
- Learned how to train on large-scale datasets by loading the Hugging Face dataset in Parquet, tokenizing it, and saving the tokenized arrays as .npy files (see the preprocessing sketch below).
- Incrementally loaded training data through a custom dataloader class, which avoids holding unreasonably long tensors in memory (see the dataloader sketch below).
- Created a custom cosine-decay LR scheduler (see the scheduler sketch below).
- Integrated wandb into training runs for clear and convenient logging (see the logging sketch below).
- TODO: add gradient-norm clipping + optimizations (bf16/tf32, torch.compile, flash attention, etc.). Decide whether to review DDP; I don't have multiple GPUs to make use of it, but it could be useful in the future. Also: view vs. reshape, handling multiple-choice validation data like HellaSwag, and fused AdamW. (A rough sketch of some of these items closes this section.)
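For reference, here is a minimal sketch of a GPT-2-style pre-norm transformer block in PyTorch. This is not the repo's exact code: the class names, the default sizes (n_embd=768, n_head=12 for the 124M config), and the plain masked-attention implementation are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # joint query/key/value projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # split heads: (B, T, C) -> (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # plain masked attention; swapping this for F.scaled_dot_product_attention
        # is the flash-attention item on the TODO list
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~mask, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # merge heads back
        return self.c_proj(y)

class Block(nn.Module):
    """Pre-norm GPT-2 block: attention and MLP sublayers, each with a residual connection."""
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(approximate="tanh"),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

if __name__ == "__main__":
    out = Block()(torch.randn(2, 16, 768))
    print(out.shape)  # torch.Size([2, 16, 768])
```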
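A rough sketch of the preprocessing step, assuming tiktoken's GPT-2 tokenizer and Hugging Face `datasets` for reading the Parquet files; the data path, output directory, and shard size are placeholders rather than the repo's actual values.

```python
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # <|endoftext|>, used as a document separator

# placeholder path to the downloaded Parquet files
ds = load_dataset("parquet", data_files="data/*.parquet", split="train")

SHARD_SIZE = 100_000_000  # tokens per .npy shard (placeholder)
buf, shard_idx = [], 0
for doc in ds:
    buf.append(eot)
    buf.extend(enc.encode_ordinary(doc["text"]))
    while len(buf) >= SHARD_SIZE:
        np.save(f"tokens/shard_{shard_idx:04d}.npy",
                np.array(buf[:SHARD_SIZE], dtype=np.uint16))
        buf = buf[SHARD_SIZE:]
        shard_idx += 1
if buf:  # flush the remainder
    np.save(f"tokens/shard_{shard_idx:04d}.npy", np.array(buf, dtype=np.uint16))
```

Storing the token ids as uint16 is enough for GPT-2's 50257-token vocabulary and halves the size on disk compared to int32.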
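The custom dataloader works along these lines: one tokenized shard is resident at a time and batches are sliced out of it, so the full dataset never has to live in a single tensor. The class name and the shard-rotation policy here are assumptions, not the repo's exact implementation.

```python
import numpy as np
import torch

class ShardedDataLoader:
    """Streams (x, y) batches from tokenized .npy shards, one shard in memory at a time."""
    def __init__(self, shard_paths, B, T):
        self.shard_paths, self.B, self.T = shard_paths, B, T
        self.shard_idx = 0
        self._load_shard(0)

    def _load_shard(self, idx):
        # tokens are stored as uint16 on disk; widen to int64 for embedding lookups
        self.tokens = torch.from_numpy(np.load(self.shard_paths[idx]).astype(np.int64))
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        chunk = self.tokens[self.pos : self.pos + B * T + 1]
        x = chunk[:-1].view(B, T)  # inputs
        y = chunk[1:].view(B, T)   # targets, shifted right by one token
        self.pos += B * T
        # advance to the next shard when the current one can't yield another batch
        if self.pos + B * T + 1 > len(self.tokens):
            self.shard_idx = (self.shard_idx + 1) % len(self.shard_paths)
            self._load_shard(self.shard_idx)
        return x, y
```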
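The cosine-decay scheduler can be expressed as a small function that the training loop queries every step; the maximum/minimum learning rates and step count below are placeholders.

```python
import math

def cosine_lr(step, max_steps, max_lr=6e-4, min_lr=6e-5):
    """Decay the learning rate from max_lr to min_lr over max_steps with a half-cosine."""
    progress = min(step / max_steps, 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)

# usage inside the training loop: set the optimizer's lr each step
# for group in optimizer.param_groups:
#     group["lr"] = cosine_lr(step, max_steps)
```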
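The wandb integration boils down to an `init` at the start of the run and a `log` call per step; the project name, config keys, and the dummy loop below are illustrative only.

```python
import math
import wandb

# placeholder project name and config; the real values live in the training script
run = wandb.init(project="gpt2-124m", config={"max_lr": 6e-4, "batch": "16x1024"})

max_steps = 100
for step in range(max_steps):
    # in the real loop these come from the forward pass and the LR schedule
    loss = 4.0 * math.exp(-step / 50)  # dummy loss for illustration
    lr = 6e-4 * 0.5 * (1 + math.cos(math.pi * step / max_steps))
    wandb.log({"train/loss": loss, "lr": lr}, step=step)

run.finish()
```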
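None of the TODO items are implemented yet; the sketch below only shows how gradient-norm clipping, TF32, bf16 autocast, torch.compile, and fused AdamW are typically wired together in PyTorch, using a stand-in model and a dummy batch as placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.set_float32_matmul_precision("high")  # TF32 matmuls on Ampere+ GPUs

device = "cuda" if torch.cuda.is_available() else "cpu"
# tiny stand-in for the GPT model, just to make the loop runnable
model = nn.Sequential(nn.Embedding(50257, 64), nn.Linear(64, 50257)).to(device)
model = torch.compile(model)  # kernel fusion via torch.compile

use_fused = device == "cuda"  # fused AdamW is CUDA-only
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, fused=use_fused)

x = torch.randint(0, 50257, (4, 32), device=device)  # dummy input tokens
y = torch.randint(0, 50257, (4, 32), device=device)  # dummy targets

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type=device, dtype=torch.bfloat16):  # bf16 mixed precision
    logits = model(x)  # (B, T, vocab)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
loss.backward()
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient-norm clipping
optimizer.step()
```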