README.md (2 changes: 1 addition & 1 deletion)
@@ -24,7 +24,7 @@ python examples/data/shakespeare.py

And finally, let's train a GPT:
```bash
-python examples/train-GPT.py
+python examples/train-gpt.py
```

This runs on CPU and should get train loss: 1.65 and test loss: 1.80 after 2000 iterations.
examples/train-gpt.py (100 changes: 91 additions & 9 deletions)
@@ -1,3 +1,4 @@
import time
import torch
import numpy as np

@@ -11,14 +12,66 @@
d_value = 32
num_blocks = 4

# Llama-7b-like values, excluding the vocabulary size.
vocab_size = 256
context = 1024
num_heads = 32
d_embed = 4096
d_query = 128
d_value = 128
num_blocks = 4

GPU_16BIT_FLOPS = {
    "h100-sxm": 1.979e15 / 2,
    "h100-pcie": 1.513e15 / 2,
    "a100": 312e12,
    "v100-sxm": 125e12,
    "6000A": 364.25e12,
    "4090": 165.2 * 10**12,
    "3090": 71 * 10**12,
    "t4": 65e12,
}

Review thread on the "3090" entry:

Reviewer:
Is this correct? Wikipedia only lists the 2:1 sparse TFLOPs, but the source cited in the article shows it as 143 TFLOPs. I think that would explain why you're seeing high MFU. On an A10 (with num_blocks = 3 because of a memory spike when Adam initialises, batch_size=8) I see:

...
step: 100        train loss: 2.77        test loss: 2.81         tokens/gpu/sec: 14982.88  MFU: 43.46%

That assumes the A10 has a peak of 125 16-bit TFLOPs. This is closely in line with what I've seen in nanoGPT experiments; the code matches nanoGPT closely, so that would make sense.

Author (@152334H):

Please always consult the appropriate NVIDIA whitepaper when looking for GPU performance numbers. If you have trouble finding these PDFs in the future, I have developed a simple frontend that aggregates a number of manually collected sources.

Regarding the observed MFU of the 3090:

  1. All gamer GPUs have crippled fp32 accumulation, such that the effective performance of any tensor core is half of what it should be; see the table from the link above:
     [image: tensor-core throughput table from the NVIDIA whitepaper]
  2. All PyTorch matmuls use fp32 accumulation for all GEMMs; this is impossible to change at runtime, so all gamer GPUs on PyTorch get half the performance they would get with fp16 accumulation.
  3. Because of this disadvantage, gamer GPUs also have a much lower FLOPs-to-memory-bandwidth ratio than datacenter GPUs do, and it is much easier to achieve high MFU on them than on their corresponding datacenter equivalents.

I hope this addresses your concerns.

Reviewer:

OK, so if I use a Triton matmul that accumulates in fp16, should I see the MFU on the A10 increase to around what you're seeing with the 3090?
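
Editor's aside: no such kernel appears in the thread, but a matmul that accumulates in fp16 could be sketched in Triton roughly as below. This is an illustrative reconstruction, not code from the PR; it assumes fp16 inputs whose shapes divide the block sizes evenly, and a Triton version whose tl.dot accepts an out_dtype argument.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_fp16_acc_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    # fp16 accumulator: the part the default torch/cuBLAS path will not give you.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float16)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b, out_dtype=tl.float16)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)

def matmul_fp16_acc(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32
    # Keep the sketch simple: no boundary masks, so shapes must divide evenly.
    assert M % BLOCK_M == 0 and N % BLOCK_N == 0 and K % BLOCK_K == 0
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (M // BLOCK_M, N // BLOCK_N)
    matmul_fp16_acc_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return c
```

Whether this raises measured throughput depends on the GPU: as the reply below notes, the A10 runs fp16 and fp32 accumulation at the same rate, so only the gamer cards stand to gain.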

Author:

The A10 has equivalent fp16 and fp32 accumulation performance, unlike the 3090. This means the A10 will always appear to have lower MFU relative to the 3090, regardless of accumulation precision.

Please note that the assumed peak TFLOPs used in the MFU calculation is based on the reported spec-sheet values for fp32 accumulation. You may interpret this as the 3090 having an "easier target" to hit.

Reviewer:

How about this for a "lazy coder's MFU": measure the FLOP ceiling using just PyTorch matmuls and assume that's the peak I can get without doing extra work in Triton or C++. I wrote a script that measures the A10 at 69 TFLOPs, and then the results are pretty much exactly what you see on the 3090. Out of interest, what FLOP ceiling do you see on the 3090 with the same script?
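
Editor's aside: the script itself is not reproduced in the thread. A minimal sketch consistent with the quoted output below (torch.utils.benchmark timing a 32768x32768 bfloat16 matmul) might look like this; the names matrix1, matrix2, and matmul come from the setup line in that output, and everything else is an assumption.

```python
# Time one large bfloat16 matmul and report the implied TFLOP/s.
# Assumes a CUDA device; the three 32768x32768 bf16 matrices need about 6 GB.
import torch
import torch.utils.benchmark as benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on device: {device}")

n = 32768  # matrix size quoted in the measurements below
matrix1 = torch.randn(n, n, device=device, dtype=torch.bfloat16)
matrix2 = torch.randn(n, n, device=device, dtype=torch.bfloat16)

def matmul():
    return matrix1 @ matrix2

timer = benchmark.Timer(
    stmt="matmul()",
    setup="from __main__ import matrix1, matrix2, matmul",
    label="FLOP Ceiling Measurement: Precision: bfloat16",
    description=f"Matrix size: {n}x{n}",
)
measurement = timer.blocked_autorange(min_run_time=10.0)
print(measurement)

flops_per_matmul = 2 * n**3  # one multiply and one add per inner-product term
print(f"FLOP Ceiling: {flops_per_matmul / measurement.median / 1e12:.2f} TFLOP/s")
```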

Author (@152334H), Jun 20, 2024:

I've done the simple lazy coder's approach in the past; it's usually within 95% of the spec on gamer GPUs. These are my 3090 results, which are not unexpected:

Running on device: cuda
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa483f67b20>
FLOP Ceiling Measurement: Precision: bfloat16
Matrix size: 32768x32768
setup: from __main__ import matrix1, matrix2, matmul
  Median: 1.08 s
  IQR:    0.00 s (1.08 to 1.08)
  10 measurements, 1 runs per measurement, 64 threads
FLOP Ceiling: 65.30 TFLOP/s

However, your observation of the A10 achieving only 69 TFLOPs surprised me, so I spun up A10 (Lambda Labs) and A10G (AWS) instances to test Triton fp16 vs. torch fp32 accumulation.

A10

[image: A10 results, Triton fp16 accumulation vs. torch fp32 accumulation]

I am also able to replicate your script's results:

Running on device: cuda
<torch.utils.benchmark.utils.common.Measurement object at 0x7f3bfe2635b0>
FLOP Ceiling Measurement: Precision: bfloat16
Matrix size: 32768x32768
setup: from __main__ import matrix1, matrix2, matmul
  Median: 1.01 s
  IQR:    0.07 s (0.98 to 1.05)
  99 measurements, 1 runs per measurement, 30 threads
FLOP Ceiling: 69.80 TFLOP/s

During the execution of your script, I noticed that the A10's GPU clock speed in nvtop is drastically limited at 100% utilization. When initially boosted, it goes up close to 1700MHz, but after a period of continuous execution, it settles around 1000MHz.

I believe the A10's 150 W TDP explains its reduced performance relative to the expected 125 TFLOPs spec.

A10G

There is a separate spec sheet for the A10G, which indicates it has only 70 peak TFLOPs and a TDP of 300 W.

[image: A10G spec sheet]

At 300 W, it achieves this much:
[image: A10G benchmark results at 300 W]

Limited to 150 W, it performs like so:
[image: A10G benchmark results at 150 W]


Special thanks to @neggles for quickly setting up tests on the A10G, and for pointing out the A10 GPU clock throttling.

"t4": 65e12,
}
def xf_layer_fwd_flops(slen: int, bs: int=1, causal=True) -> int:
    # Forward FLOPs for one transformer block on a sequence of length slen.
    # MLP: two d_embed x (4 * d_embed) matrices, 2 FLOPs per parameter per token.
    p_mlp = d_embed * 4 * d_embed * 2
    f_mlp = p_mlp * 2 * slen

    assert d_query == d_value, "Dq != Dv not implemented"
    # Attention projections (Q, K, V, O), again 2 FLOPs per parameter per token.
    p_att = 4 * d_embed * d_embed
    f_att = p_att * 2 * slen
    # Attention scores (QK^T) plus weighted values (AV), halved under causal masking.
    f_sdpa = 4 * slen * slen * d_embed // (2 if causal else 1) # approximation

    return (f_mlp + f_att + f_sdpa) * bs

def gpt_train_flops(slen: int, bs: int, causal=True) -> int:
    # lmhead layer: 2 * d_embed * vocab_size FLOPs per token forward, x3 for training
    flops = 6 * slen * bs * d_embed * vocab_size
    # assume no activation checkpointing; backward ~= 2x forward, hence the factor 3
    flops += num_blocks * xf_layer_fwd_flops(slen, bs, causal) * 3
    return flops
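
# [Editor's note, not part of the PR diff: as a sanity check, if the f_sdpa
#  (attention-score) term is ignored, the count above reduces to the familiar
#  "6 * parameters * tokens" rule of thumb, with
#      params = num_blocks * 12 * d_embed**2 + d_embed * vocab_size
#  (4*d_embed**2 attention params plus 8*d_embed**2 MLP params per block).
#  For the Llama-7b-like dims above, gpt_train_flops(context, 1) divided by
#  6 * params * context comes out around 1.02, the excess being the f_sdpa term.]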

class SpeedLogger:
    def __init__(self, ideal_flops_per_sec: float):
        self.tps = []  # tokens per second, one entry per timed step
        self.mfu = []  # model FLOPs utilization, one entry per timed step
        self.fps = ideal_flops_per_sec

    def add(self, slen: int, bs: int, duration: float) -> tuple[float,float]:
        flops = gpt_train_flops(slen, bs)
        self.tps.append(slen*bs / duration)
        self.mfu.append(flops / duration / self.fps)
        return self.tps[-1], self.mfu[-1]

    def ave(self):
        return sum(self.tps) / len(self.tps), sum(self.mfu) / len(self.mfu)

# training hparams

init_lr = 0.5
wd = 0.01
-batch_size = 12
+batch_size = 2 # 12
steps = 2001
eval_steps = 100
-log_interval = 200
+log_interval = 10 # 200

# let's start by defining our GPT architecture
# (we could instead just import GPT from modula.compound)
@@ -80,8 +133,9 @@ def __len__(self):

# now let's start doing stuff

-if __name__ == "__main__":

+@torch.cuda.amp.autocast(dtype=torch.bfloat16)
+def train(device, ideal_flops_per_sec):
    # load the data

    trainset = SimpleLLMDataset(np.memmap("examples/data/shakespeare/train.bin", dtype=np.uint16, mode='r'), context)
@@ -96,12 +150,18 @@ def __len__(self):
    train_iterator = iter(train_loader)
    test_iterator = iter(test_loader)

-    getBatch = lambda train: next(train_iterator if train else test_iterator)
+    def getBatch(train: bool) -> list:
+        res = next(train_iterator if train else test_iterator)
+        return [t.to(device=device) for t in res]

    # load the model

    gpt = GPT(vocab_size, context, num_heads, d_embed, d_query, d_value, num_blocks)
weights = gpt.initialize(device="cpu")
weights = gpt.initialize(device=device)
gpt.forward = torch.compile(gpt.forward)
# gpt.normalize = torch.compile(gpt.normalize)
# gpt.regularize = torch.compile(gpt.regularize)
# init_lr_t = torch.tensor(init_lr, device=device)

    # initialize the Adam state

@@ -114,6 +174,8 @@ def __len__(self):

    # train the model

    speed_logger = SpeedLogger(ideal_flops_per_sec)

    for step in range(steps):

        if step % log_interval == 0:
@@ -131,6 +193,7 @@ def __len__(self):
            test_loss /= eval_steps
            test_acc /= eval_steps

        t0 = time.time()
        data, target = getBatch(train = True)
        output = gpt.forward(data, weights)
        output = output.view(-1, output.size(-1))
@@ -159,7 +222,26 @@ def __len__(self):
        gpt.regularize(weights, strength = init_lr * schedule * wd)
        weights.zero_grad()

-        if step % log_interval == 0:
-            print( "step:", step,
-                   "\t train loss:", "%.2f" % train_loss.item(),
-                   "\t test loss:", "%.2f" % test_loss.item() )
+        # avoid first compile && first recompile
+        if step > 1:
+            speed_logger.add(*data.shape, time.time() - t0)
+
+        if step > 1 and step % log_interval == 0:
+            tps, mfu = speed_logger.ave()
+            print(
+                "step:", step,
+                "\t train loss:", "%.2f" % train_loss.item(),
+                "\t test loss:", "%.2f" % test_loss.item(),
+                f"\t tokens/gpu/sec: {tps:.2f}",
+                f"\t MFU: {mfu*100:.2f}%",
+            )


if __name__ == "__main__":
import argparse
ap = argparse.ArgumentParser()
ap.add_argument('--cuda', action='store_true')
args = ap.parse_args()

torch.set_float32_matmul_precision("medium")
train('cuda' if args.cuda else 'cpu', GPU_16BIT_FLOPS['3090'])