[for informative purposes only] compiled cuda train-gpt.py #1
Open
152334H wants to merge 6 commits into modula-systems:main from 152334H:mfu
Commits (6, all by 152334H):
03f6418  fix typo
5ce1f71  gpt flops counters
4333a24  compilable cuda bf16 train-gpt.py
439351d  tokens-per-sec and model-flops-utilization logging
1246cf8  change hparams to just barely fit on 3090
37669ed  avoid logging first 2 steps && insert commented failed compiles
Conversations
Is this correct? Wikipedia only lists the 2:1 sparse TFLOPs, but the source cited in the article shows it as 143 TFLOPs. I think that would explain why you're seeing high MFU. On an A10 (with num_blocks = 3 because of a memory spike when Adam initialises, and batch_size = 8), assuming the A10 has a peak of 125 16-bit TFLOPs, the numbers I see are pretty closely in line with what I've seen in nanoGPT experiments. The code matches nanoGPT closely, so that would make sense.
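For context, the MFU number under discussion is presumably computed nanoGPT-style (per the PaLM appendix): estimate the model FLOPs per token, multiply by achieved tokens/sec, and divide by an assumed hardware peak. A minimal sketch, where the model-shape parameters and the 125e12 peak are illustrative assumptions rather than the exact values in this PR:

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 tokens_per_sec, peak_flops=125e12):
    """nanoGPT-style MFU estimate (PaLM appendix); every input here is an assumption."""
    # FLOPs per token: 6*N for the weights, plus the attention term.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    achieved_flops = flops_per_token * tokens_per_sec
    return achieved_flops / peak_flops

# Hypothetical small GPT at 50k tokens/sec against an assumed 125 TFLOPs peak.
mfu = estimate_mfu(n_params=124e6, n_layer=12, n_head=12, head_dim=64,
                   seq_len=1024, tokens_per_sec=50_000)
print(f"estimated MFU: {mfu:.1%}")
```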
Please always consult the appropriate NVIDIA whitepaper when looking for GPU performance numbers. If you have trouble finding these PDFs in the future, I have developed a simple frontend that aggregates a number of manually collected sources.
Regarding the observed MFU of the 3090, I hope this addresses your concerns.
OK, so if I use a Triton matmul that accumulates in fp16, should I see the MFU on the A10 increase to around what you're seeing with the 3090?
The A10 has equivalent fp16 and fp32 accumulation performance, unlike the 3090. This means the A10 will always appear to have a lower MFU than the 3090, regardless of accumulation precision.
Please note that the assumed peak TFLOPs used in the MFU calculation is based on the reported spec-sheet values for fp32 accumulation. You may interpret this as the 3090 having an "easier target" to hit.
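To make the "easier target" point concrete with purely made-up numbers: the same achieved throughput yields a very different MFU depending on whether the denominator is the fp32-accumulate or the fp16-accumulate spec figure.

```python
# Purely illustrative numbers, not real spec-sheet values.
achieved_tflops = 40.0    # hypothetical measured training throughput
peak_fp32_accum = 70.0    # hypothetical peak with fp32 accumulation
peak_fp16_accum = 140.0   # hypothetical peak with fp16 accumulation (2x on some GPUs)

print(f"MFU vs fp32-accum peak: {achieved_tflops / peak_fp32_accum:.0%}")  # 57%
print(f"MFU vs fp16-accum peak: {achieved_tflops / peak_fp16_accum:.0%}")  # 29%
```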
How about this for a "lazy coder's MFU": measure the FLOPs ceiling using just PyTorch matmuls, and assume that's the peak I can get without doing extra work in Triton or C++. I wrote this script and it measures the A10 at 69 TFLOPs; with that as the denominator, the results are pretty much exactly what you see on the 3090. Out of interest, what FLOPs ceiling do you see on the 3090 with the same script?
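(Not the actual script referenced above, which isn't reproduced here, but a minimal sketch of that kind of measurement: time a large half-precision torch.matmul and convert to TFLOPs. The matrix size, dtype, and iteration counts are arbitrary assumptions.)

```python
import time
import torch

def matmul_tflops_ceiling(n=8192, dtype=torch.bfloat16, warmup=10, iters=50):
    """Rough achievable-FLOPs ceiling from a plain PyTorch matmul (sketch only)."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(warmup):           # warm up kernels and clocks
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    return 2 * n**3 * iters / dt / 1e12   # 2*n^3 FLOPs per n-by-n matmul

if __name__ == "__main__":
    print(f"{matmul_tflops_ceiling():.1f} TFLOPs")
```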
I've done the simple lazy coder's approach in the past; it's usually within 95% of the spec on gamer GPUs. These are my 3090 results, which are not unexpected.
However, your observation of the A10 achieving only 69 TFLOPs surprised me, so I spun up A10 (Lambdalabs) and A10G (AWS) instances to test Triton fp16 vs torch fp32 accumulation.
A10
I am also able to replicate your script's results.
While running your script, I noticed in nvtop that the A10's GPU clock speed is drastically limited at 100% utilization: it initially boosts to close to 1700 MHz, but after a period of continuous execution it settles around 1000 MHz.
I believe the A10's 150 W TDP explains its reduced performance relative to the expected 125 TFLOPs spec.
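(For anyone reproducing the throttling observation without nvtop, here is a small sketch that polls nvidia-smi for SM clock, power draw, and utilization while the benchmark runs; the query fields are standard nvidia-smi ones, and the one-second interval is arbitrary.)

```python
import subprocess, time

# Poll SM clock, power draw, and utilization once per second (Ctrl-C to stop);
# the same fields one would watch in nvtop / nvidia-smi.
while True:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=clocks.sm,power.draw,utilization.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip()
    print(out)
    time.sleep(1)
```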
A10G
There is a separate spec sheet for the A10G, which indicates that it has only 70 peak TFLOPs and a TDP of 300 W.
At 300 W, it achieves this much:

Limited to 150 W, it performs as follows:

Special thanks to @neggles for quickly setting up the A10G tests, and for pointing out the A10 GPU clock throttling.