Use `llvm.fma` for `tt.dot` lowering by abrown · Pull Request #254 · kernelize-ai/triton-cpu

abrown · 2026-01-26T23:28:10Z

This change replaces the `llvm.fmul` and `llvm.fadd` instructions with the fused `llvm.fma` operation. This should have no downstream impact on the emitted machine code which, due to auto-vectorization and other LLVM magic, already ends up using `VFMADD213PS`.

What _is_ unclear about this change is that we materialize some fastmath flags from thin air: it seems like we should be able to configure this somewhere at the user level (TODO).
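
For readers less familiar with why fused and unfused multiply-add are distinct operations at the IR level, here is a minimal sketch of the rounding difference. It is not code from this PR: it uses Python doubles rather than the `f32` values the lowering handles, the function names are made up for illustration, and the fused result is emulated with exact rational arithmetic instead of a hardware FMA.

```python
from fractions import Fraction


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    # Two roundings: the product is rounded to a double, then the sum is rounded.
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    # One rounding: compute a*b + c exactly as a rational number, then
    # convert to the nearest double. This emulates what a fused
    # multiply-add produces; it is an illustration, not a hardware FMA.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Inputs chosen so that the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 in the unfused version.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print("unfused:", unfused_multiply_add(a, b, c))
print("fused:  ", fused_multiply_add(a, b, c))
```

With these inputs the unfused version loses the low-order bits of the product before the add, while the fused version keeps them. Whether that difference is observable in the `tt.dot` lowering depends on the flags and passes involved, which is the open question above.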

abrown · 2026-01-26T23:29:58Z

This is a draft for now until we can discuss what to do about the fastmath flags.

alexbaden · 2026-01-28T02:49:35Z

cpu/lib/TritonCPUToLLVM/DotOpToLLVM.cpp

      // Multiply and accumulate.
-      auto mul = LLVM::FMulOp::create(builder, loc, tgtTy, aElem, bElem);
-      accum = LLVM::FAddOp::create(builder, loc, tgtTy, accum, mul);
+      auto flags = LLVM::FastmathFlagsAttr::get(builder.getContext(),


`tl.dot_scaled` has a fast math flag, but Triton typically prefers fast math to be off.

abrown · 2026-01-29T17:26:02Z

This has no effect on performance. I still see `vfmadd231ps` being used in the emitted machine code and my benchmarking infrastructure shows the same results as before:

$ for i in {1..5}; do python bench-triton/matmul.py --size 1024 --device cpu --provider triton --block-size-m 8 --block-size-n 256 --block-size-k 1; done
0.4107300620226039
0.41122981901199657
0.41164295082399166
0.4109849791469111
0.41048660226802114
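
As a rough way to read the five numbers above, the sketch below computes their mean and relative spread. The values are copied from the output above; the unit is not stated in this thread, and the earlier results they are compared against are not shown here, so this only summarizes run-to-run variation within this one batch.

```python
from statistics import mean

# The five values printed by the benchmark loop above (unit not stated).
timings = [
    0.4107300620226039,
    0.41122981901199657,
    0.41164295082399166,
    0.4109849791469111,
    0.41048660226802114,
]

avg = mean(timings)
# Relative spread: distance between the largest and smallest run,
# expressed as a fraction of the mean.
rel_spread = (max(timings) - min(timings)) / avg

print(f"mean={avg:.5f} relative spread={rel_spread:.2%}")
```

A difference between two configurations that is smaller than this spread would be hard to distinguish from noise with only five runs.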

abrown requested a review from alexbaden January 27, 2026 21:31

alexbaden reviewed Jan 28, 2026

alexbaden mentioned this pull request Jan 28, 2026

Add lit tests; shows initial bfloat16 support for tt.dot #253

Merged

abrown force-pushed the use-fma-op branch from fa56f6e to edf8bf6 Compare January 28, 2026 22:41

abrown added 4 commits January 29, 2026 09:16

Apply clang formatting

23c84ab

review: remove fastmath flags

f17f266

Update lit tests

df6cae7

abrown force-pushed the use-fma-op branch from edf8bf6 to df6cae7 Compare January 29, 2026 17:22

abrown marked this pull request as ready for review January 29, 2026 17:24

abrown requested a review from alexbaden January 29, 2026 18:38

abrown wants to merge 4 commits into kernelize-ai:main from abrown:use-fma-op
