Blockwise scaling linear quantization recipe #1559
Merged
timmoon10 merged 68 commits into NVIDIA:main on Apr 10, 2025
Conversation
Force-pushed from c42f005 to 16cc925
Collaborator (Author): @ptrendx here is a parallel review from my fork with only the recipe changes.
Force-pushed from 16cc925 to db411ce
Force-pushed from 8bb7d63 to 365a4d9
yaox12 reviewed on Mar 13, 2025
Force-pushed from 51fbe41 to 78c194d
Force-pushed from 78c194d to 8f4f0f0
BestJuly reviewed on Mar 28, 2025: transformer_engine/pytorch/tensor/_internal/float8_blockwise_tensor_base.py (outdated, resolved)
Force-pushed from 22828fe to 413331d
zhongbozhu reviewed on Apr 2, 2025
Force-pushed from dbb6f63 to 7bfea52
kunlunl reviewed on Apr 2, 2025
Force-pushed from 7bfea52 to 44843bb
zhongbozhu reviewed on Apr 2, 2025
Force-pushed from d7775fc to b62d555
Apply MR comment change. Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>
timmoon10 reviewed on Apr 8, 2025
zhongbozhu reviewed on Apr 9, 2025
Force-pushed from 8fc753d to 67e790b
ptrendx reviewed on Apr 9, 2025: transformer_engine/common/include/transformer_engine/transformer_engine.h (outdated, resolved)
timmoon10 reviewed on Apr 9, 2025
zhongbozhu reviewed on Apr 9, 2025
Collaborator: Float8BlockQuantizer is added here to set columnwise usage to True. However, this logic does nothing unless we do an FP8 gather. I am okay with keeping it there; just taking note of the case so we don't forget.

Collaborator (Author): PTAL. Added an assert.
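The guard discussed above can be sketched as a defensive assert. This is a hypothetical illustration — the class and attribute names below are placeholders, not the actual Transformer Engine API:

```python
# Hypothetical sketch: if columnwise usage is requested on a blockwise
# quantizer while FP8 gather is disabled, the setting would be a silent
# no-op, so fail loudly instead. All names here are illustrative.
class BlockQuantizerSketch:
    def __init__(self, fp8_gather_enabled: bool = False):
        self.fp8_gather_enabled = fp8_gather_enabled
        self.columnwise_usage = False

    def set_usage(self, columnwise: bool) -> None:
        if columnwise:
            # Columnwise data is only consumed on the FP8-gather path.
            assert self.fp8_gather_enabled, (
                "columnwise usage requested, but FP8 gather is disabled; "
                "the setting would have no effect"
            )
        self.columnwise_usage = columnwise
```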
Collaborator: /te-ci L1
Force-pushed from 6948759 to ea9e46b
yaox12 reviewed on Apr 9, 2025
Collaborator: /te-ci L1
Member: /te-ci pytorch L1
Member: /te-ci pytorch
Member: JAX L0/L1 tests passed with build 26726340.
Collaborator: /te-ci pytorch L1
wdykas pushed a commit to wdykas/TransformerEngine that referenced this pull request on Apr 14, 2025
* Add GEMM logic for blockwise quantized tensors. GEMM test cases included in pytorch integration.
* Update NVTE_BLOCK_SCALING for GEMM.
* Gate feature on CUDA 12.9.
* Fix GEMM typo.
* Remove unnecessary type converter change.
* Reflect epilogue availability and test supported epilogues.
* GEMM simplifications from recipe branch.
* Format py code.
* Update GEMM DGelu tests to match support depending on output dtype.
* Force pow2Scales in GEMM.
* Add GEMM test to pytorch test suite.
* Add copyright to GEMM test.
* Update import for GEMM test.
* Add license.
* Update test GEMM supported predicate.
* Use sgemm-like interfaces and naming.
* Rewrite GEMM comment.
* MR feedback.
* Recipe setup for Linear modules.
* Use 12.9 feature test.
* Run against tensor dumps from internal library.
* Update FIXME to TODO with linked issue.
* Update full recompute feature to save recipe. The recompute context uses the same recipe and fp8 settings as the original fwd pass.
* MR feedback. Avoid reusing quantizer objects.
* Update logic in module.
* Format py.
* Update for PP bug.
* Update test numerics.
* Update force_power_of_2 scales in the recipe.
* Update usage method to satisfy upstream changes.
* Fix subchannel recipe in distributed test with bf16 gather.
* Edit and clean up BF16 gather code.
* Update test import.
* Support columnwise-only mode in 1D quantize kernel.
* Format and move enum.
* Skip alloc.
* Try async bf16 gather.
* Format python code.
* Document and type code.
* Update pytorch lint errors.
* Don't set high precision dtype.
* Add test for sanity and CG; fix CG for sequential.
* Keep make_quantizers API stable. Update num_quantizers instead to pass cuda_graph tests.
* Fix import name.
* Rename recipe method.
* Skip grouped linear sanity test.
* Set usage before BF16 gather.
* Refactor for nvte_quantize_v2.
* Format code.
* Clean up nvte_quantize_v2.
* Test fp32 scales.
* Disable CUDA graph.
* Simplify layernorm linear.
* Clean up layernorm linear.
* LayerNorm linear bwd gather logic.
* Communication updates.
* Update transformer_engine/pytorch/ops/op.py. Apply MR comment change.
* Lint fix.
* MR feedback.
* Enable cuda graph tests.
* Reduce chance of spurious failure and reword.
* Review suggestions from @timmoon10.
* Update CPP tests.
* Update common.h.
* Update test_float8blockwisetensor.py.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Xin Yao <yaox12@outlook.com>
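Several commits above mention forcing power-of-2 scales ("Force pow2Scales in GEMM", "Update force_power_of_2 scales in the recipe"). The idea can be sketched in a few lines of Python; the function name and the default fp8_max of 448 (the FP8 E4M3 maximum) are illustrative, not the library's API:

```python
import math

def compute_block_scale(block_amax: float, fp8_max: float = 448.0,
                        force_pow_2: bool = True) -> float:
    """Per-block scale mapping the block's absolute max into FP8 range.

    Rounding the scale down to a power of two makes scaling and unscaling
    an exact exponent shift in binary floating point, so the scale itself
    introduces no rounding error.
    """
    if block_amax == 0.0:
        return 1.0  # nothing to scale; avoid division by zero
    scale = fp8_max / block_amax
    if force_pow_2:
        scale = 2.0 ** math.floor(math.log2(scale))
    return scale
```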
Description
Add a recipe configuration to quantizers and GEMM for float8 blockwise scaling.
Quantizer MR: #1513
GEMM MR: #1545
Checked against golden outputs from an internal tool.
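For readers unfamiliar with the recipe: blockwise scaling quantizes each tile of a tensor with its own scale factor (the commits above mention both 2D blockwise and 1D "subchannel" quantize kernels). A minimal pure-Python sketch of the 2D case follows; the block size and fp8_max of 448 (E4M3 maximum) are illustrative, and the real kernels additionally round values to FP8 and run in CUDA:

```python
def quantize_blockwise(x, block, fp8_max=448.0):
    """Quantize a 2D list of floats with one scale per (block x block) tile.

    This only models the scaling granularity of the recipe: each tile is
    scaled so that its absolute max maps to fp8_max.
    """
    rows, cols = len(x), len(x[0])
    assert rows % block == 0 and cols % block == 0, "pad to a block multiple"
    n_br, n_bc = rows // block, cols // block
    scales = [[1.0] * n_bc for _ in range(n_br)]
    q = [row[:] for row in x]
    for bi in range(n_br):
        for bj in range(n_bc):
            # Absolute max over the (block x block) tile.
            amax = max(
                abs(x[i][j])
                for i in range(bi * block, (bi + 1) * block)
                for j in range(bj * block, (bj + 1) * block)
            )
            s = fp8_max / amax if amax > 0.0 else 1.0
            scales[bi][bj] = s
            for i in range(bi * block, (bi + 1) * block):
                for j in range(bj * block, (bj + 1) * block):
                    q[i][j] = x[i][j] * s
    return q, scales
```

Dequantization divides each element by its tile's scale, so up to FP8 rounding (not modeled here) the transform is lossless within each tile.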