GPT-2

My implementation of GPT-2 (124M), inspired by Andrej Karpathy's "Let's Reproduce GPT-2 (124M)" video, with explanatory notes and my own additional implementations throughout.

What I gained from this project:

  • Reinforced my understanding of the transformer architecture and how to implement it in PyTorch/Python.

  • Learned how to train on large-scale datasets by loading a Hugging Face dataset from Parquet files, tokenizing it, and saving the tokenized arrays as .npy shards (see the tokenization sketch after this list).

  • Incrementally loaded training data through a custom dataloader class, which avoids holding unreasonably large tensors in memory (see the dataloader sketch below).

  • Created a custom cosine-decay learning-rate scheduler (see the scheduler sketch below).

  • Integrated wandb into training runs for clear and convenient logging (see the logging sketch below).

  • TODO: add gradient-norm clipping and performance optimizations (bf16/tf32, torch.compile, flash attention, fused AdamW, etc.), as sketched at the end of this list; decide whether to cover DDP (I don't have multiple GPUs to make use of it, but it could be useful in the future); revisit view vs. reshape; handle multiple-choice validation data such as HellaSwag.
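
The data-preparation step looks roughly like the sketch below. This is a minimal outline, assuming tiktoken's GPT-2 encoding and the Hugging Face `datasets` library; the dataset path, shard size, and output directory are placeholders, not necessarily the values used in this repo.

```python
# Minimal sketch: tokenize a Parquet dataset and save uint16 token shards.
# Paths and shard size below are placeholders.
import os
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # <|endoftext|> id, used to separate documents

def tokenize(doc):
    tokens = [eot] + enc.encode_ordinary(doc["text"])
    return np.array(tokens, dtype=np.uint16)  # GPT-2 vocab (50257) fits in uint16

if __name__ == "__main__":
    ds = load_dataset("parquet", data_files="data/*.parquet", split="train")
    os.makedirs("shards", exist_ok=True)
    shard_size = 100_000_000  # tokens per shard (placeholder)
    buf, count, shard_idx = [], 0, 0
    for doc in ds:
        toks = tokenize(doc)
        buf.append(toks)
        count += len(toks)
        if count >= shard_size:
            np.save(f"shards/shard_{shard_idx:04d}.npy", np.concatenate(buf))
            buf, count, shard_idx = [], 0, shard_idx + 1
    if buf:  # flush the last partial shard
        np.save(f"shards/shard_{shard_idx:04d}.npy", np.concatenate(buf))
```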

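The dataloader can then stream those shards one at a time, so only a single shard's tokens sit in memory. A minimal sketch, assuming the shard files produced above; the batch size and sequence length are placeholders.

```python
# Minimal sketch: stream .npy token shards and yield (input, target) batches.
import glob
import numpy as np
import torch

class ShardedDataLoader:
    def __init__(self, shard_glob, B, T):
        self.B, self.T = B, T                      # batch size, sequence length
        self.shards = sorted(glob.glob(shard_glob))
        assert self.shards, "no shards found"
        self.shard_idx = 0
        self._load_shard(self.shard_idx)

    def _load_shard(self, idx):
        # only one shard's tokens are held in memory at a time
        self.tokens = torch.from_numpy(np.load(self.shards[idx]).astype(np.int64))
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        # advance to the next shard when the current one is exhausted
        if self.pos + B * T + 1 > len(self.tokens):
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self._load_shard(self.shard_idx)
        chunk = self.tokens[self.pos : self.pos + B * T + 1]
        x = chunk[:-1].view(B, T)                  # inputs
        y = chunk[1:].view(B, T)                   # targets, shifted by one token
        self.pos += B * T
        return x, y

# usage: loader = ShardedDataLoader("shards/*.npy", B=16, T=1024)
#        x, y = loader.next_batch()
```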

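The learning-rate schedule is cosine decay. The sketch below also adds a short linear warmup, which is an assumption on top of the bullet above; all constants are placeholders and may differ from what the training script actually uses.

```python
# Minimal sketch: linear warmup followed by cosine decay to a minimum LR.
import math

max_lr, min_lr = 6e-4, 6e-5        # placeholder values
warmup_steps, max_steps = 100, 5000

def get_lr(step):
    if step < warmup_steps:                        # linear warmup (assumed)
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:                          # after decay, hold at min_lr
        return min_lr
    # cosine decay from max_lr down to min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)

# applied once per step, e.g.:
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(step)
```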

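wandb hooks into the training loop in a few lines. The project name, config, and the dummy loss below are placeholders standing in for the real run.

```python
# Minimal sketch: log metrics from a training loop to Weights & Biases.
import math
import wandb

wandb.init(
    project="gpt2-124m",                          # placeholder project name
    config={"batch_size": 16, "seq_len": 1024, "max_lr": 6e-4},
)

for step in range(100):
    loss = 10.0 * math.exp(-step / 50)            # stand-in for the real training loss
    wandb.log({"train/loss": loss, "lr": 6e-4}, step=step)

wandb.finish()
```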

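For the TODO items, here is a rough sketch of how gradient-norm clipping and the listed optimizations (tf32, bf16 autocast, torch.compile, fused AdamW) could be wired together; it assumes a CUDA device and uses a tiny placeholder model in place of GPT-2. Flash attention would come separately, via torch.nn.functional.scaled_dot_product_attention inside the attention block.

```python
# Sketch of the planned training-loop optimizations; the model and data are placeholders.
import torch

torch.set_float32_matmul_precision("high")         # enable tf32 matmuls

model = torch.nn.Linear(768, 768).cuda()            # placeholder model, not GPT-2
model = torch.compile(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, fused=True)

x = torch.randn(16, 768, device="cuda")
y = torch.randn(16, 768, device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # bf16 forward
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    # clip the global gradient norm to 1.0 before the optimizer step
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
```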