
Blockwise scaling linear quantization recipe#1559

Merged
timmoon10 merged 68 commits into NVIDIA:main from kwyss-nvidia:kwyss/subchannel_recipe_linear
Apr 10, 2025

Conversation

@kwyss-nvidia (Collaborator) commented Mar 11, 2025

Description

Add a recipe configuration to quantizers and GEMM for float8 blockwise scaling.

Quantizer MR: #1513
GEMM MR: #1545

Checked against golden outputs from an internal tool.
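For readers unfamiliar with the scheme, blockwise scaling can be illustrated with a small NumPy sketch (my own illustration, not this PR's CUDA implementation; integer rounding stands in for real e4m3 rounding): each contiguous block of 128 elements shares one scale derived from the block's amax.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8 e4m3 value
BLOCK = 128           # number of elements sharing one scale

def quantize_blockwise(x, block=BLOCK):
    """Quantize a 1D array with one scale per contiguous block.
    Integer rounding stands in for real e4m3 rounding here; the
    point is the per-block scale computation, not the fp8 format."""
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = FP8_E4M3_MAX / np.maximum(amax, 1e-12)  # amax maps to fp8 max
    q = np.clip(np.round(blocks * scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequantize_blockwise(q, scale):
    return (q / scale).reshape(-1)

x = np.linspace(-3.0, 3.0, 512)
q, scale = quantize_blockwise(x)       # scale has shape (4, 1)
x_rt = dequantize_blockwise(q, scale)  # per-block error <= 0.5 / scale
```

Because each block's scale is set from its own amax, an outlier in one block does not inflate the quantization step of the others, which is the motivation for blockwise over per-tensor scaling.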

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add recipe and recipe state classes to create quantizers.
  • Modify linear, layer norm linear, and layer_norm_mlp to use the quantized data paths.
  • Tests for linear and layer_norm.
  • Revert recent change that breaks test_numerics for all recipes.
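One option that appears later in this PR's commit history is forcing power-of-two scales ("Force pow2Scales in GEMM", "force_power_of_2"). A minimal sketch of that idea (my own illustration, not the recipe's code): rounding a scale down to a power of two keeps the scaled values inside the fp8 range while making scaling and unscaling exact in binary floating point, since multiplying by a power of two only shifts the exponent.

```python
import math

def force_pow2(scale: float) -> float:
    """Round a positive quantization scale down to the nearest power
    of two. Rounding down (never up) keeps amax * scale within the
    fp8 range; a power-of-two scale makes the scale multiply/divide
    exact in binary floating point."""
    assert scale > 0.0
    return 2.0 ** math.floor(math.log2(scale))

print(force_pow2(448.0 / 3.0))  # 128.0
```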

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@kwyss-nvidia force-pushed the kwyss/subchannel_recipe_linear branch 2 times, most recently from c42f005 to 16cc925 on March 11, 2025 22:23
@kwyss-nvidia (Collaborator, Author) commented:

@ptrendx here is a parallel review from my fork with only the recipe changes.
kwyss-nvidia#2

@kwyss-nvidia force-pushed the kwyss/subchannel_recipe_linear branch from 16cc925 to db411ce on March 12, 2025 00:51
@kwyss-nvidia changed the title from "Draft: Blockwise scaling linear quantization recipe" to "Blockwise scaling linear quantization recipe" on Mar 12, 2025
@kwyss-nvidia force-pushed the kwyss/subchannel_recipe_linear branch 2 times, most recently from 8bb7d63 to 365a4d9 on March 13, 2025 00:06
@kwyss-nvidia force-pushed the kwyss/subchannel_recipe_linear branch 5 times, most recently from 22828fe to 413331d on April 1, 2025 23:23
@kwyss-nvidia force-pushed the kwyss/subchannel_recipe_linear branch 2 times, most recently from dbb6f63 to 7bfea52 on April 2, 2025 02:22
@kwyss-nvidia force-pushed the kwyss/subchannel_recipe_linear branch from 7bfea52 to 44843bb on April 2, 2025 17:34
@kwyss-nvidia force-pushed the kwyss/subchannel_recipe_linear branch from d7775fc to b62d555 on April 8, 2025 23:35
Apply MR comment change.

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
@kwyss-nvidia force-pushed the kwyss/subchannel_recipe_linear branch from 8fc753d to 67e790b on April 9, 2025 00:05
Collaborator commented:

Float8BlockQuantizer is added here to set columnwise usage to True.

But this logic also does nothing unless we do an fp8 gather. I am okay with keeping it there, just taking a note in case we forget.

Collaborator (Author) replied:

PTAL. Added an assert.
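For context on the rowwise/columnwise usage discussed above, here is a sketch of how 1D scaling blocks tile a 2D tensor (shapes and naming are my assumptions for illustration, not the Float8BlockQuantizer API): rowwise usage tiles 1x128 blocks along the last dimension, while columnwise usage tiles them along the first, as needed by the transposed GEMMs in the backward pass.

```python
import numpy as np

FP8_E4M3_MAX = 448.0

def block_scales_1d(x, columnwise=False, block=128):
    """Per-block scales for 1 x `block` scaling blocks of a 2D tensor.
    Rowwise: blocks run along the last dim  -> scales (m, k // block).
    Columnwise: blocks run along the first dim -> scales (m // block, k).
    Illustrative only; mirrors the rowwise/columnwise usage flags."""
    m, k = x.shape
    if columnwise:
        amax = np.abs(x.reshape(m // block, block, k)).max(axis=1)
    else:
        amax = np.abs(x.reshape(m, k // block, block)).max(axis=2)
    return FP8_E4M3_MAX / np.maximum(amax, 1e-12)

x = np.random.rand(256, 128)
assert block_scales_1d(x).shape == (256, 1)
assert block_scales_1d(x, columnwise=True).shape == (2, 128)
```

This also shows why columnwise usage matters only when data must be re-quantized after an fp8 gather: the two layouts carry different scale metadata for the same tensor.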

@timmoon10 (Collaborator):

/te-ci L1

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
@kwyss-nvidia force-pushed the kwyss/subchannel_recipe_linear branch from 6948759 to ea9e46b on April 9, 2025 01:32
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
@timmoon10 (Collaborator) left a comment:

LGTM

@timmoon10 (Collaborator):

/te-ci L1

kwyss-nvidia and others added 2 commits April 9, 2025 17:56
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
@yaox12 (Member) commented Apr 10, 2025:

/te-ci pytorch L1

Signed-off-by: Xin Yao <yaox12@outlook.com>
@yaox12 (Member) commented Apr 10, 2025:

/te-ci pytorch

@yaox12 (Member) commented Apr 10, 2025:

JAX L0/L1 tests passed with build 26726340.
PyTorch L1 tests passed with build 26734338 except for irrelevant failures.
PyTorch L0 tests passed with build 26736222 except for irrelevant failures.

@zhongbozhu (Collaborator):

/te-ci pytorch L1

@timmoon10 timmoon10 merged commit a8f0fe0 into NVIDIA:main Apr 10, 2025
19 of 23 checks passed
wdykas pushed a commit to wdykas/TransformerEngine that referenced this pull request Apr 14, 2025
* Add GEMM logic for blockwise quantized tensors.

GEMM test cases included in pytorch integration.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update NVTE_BLOCK_SCALING for GEMM.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gate feature on CUDA 12.9

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gemm typo.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove unnecessary type converter change.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reflect epilogue availability and test supported epilogues.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* GEMM simplifications from recipe branch.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Format py code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update GEMM DGelu tests to match support depending on output dtype.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Force pow2Scales in GEMM

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add GEMM test to pytorch test suite.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add copyright to GEMM test.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update import for GEMM test.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add license.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test gemm supported predicate.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use sgemm like interfaces and naming.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Rewrite GEMM comment.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR Feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Recipe setup for Linear modules.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use 12.9 feature test.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Run against tensor dumps from internal library.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update FIXME to TODO with linked issue.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update full recompute feature to save recipe.

The recompute context uses the same recipe
and fp8 settings as the original fwd pass.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR Feedback. Avoid reusing quantizer objects.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update logic in module.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Format py.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update for PP bug.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test numerics.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update force_power_of_2 scales in the recipe.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update usage method to satisfy upstream changes.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* fix subchannel recipe in distributed test with bf16 gather

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Edit and cleanup BF16 gather code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test import.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* support columnwise only mode to 1D quantize kernel

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format and move enum

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Skip alloc.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* try async bf16 gather

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format python code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Document and type code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update pytorch lint errors.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Don't set high precision dtype.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add test for sanity and CG; fix CG for sequential?

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Keep make_quantizers API stable

Update num_quantizers instead to pass cuda_graph tests.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix import name.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Rename recipe method.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Skip grouped linear sanity test.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Set usage before BF16 gather.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* refactor for nvte_quantize_v2

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Cleanup nvte_quantize_v2

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Test fp32 scales.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Disable CUDA graph.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Simplify layernorm linear

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Cleanup layernorm linear.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* LayerNorm linear bwd gather logic.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Communication updates.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update transformer_engine/pytorch/ops/op.py

Apply MR comment change.

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>

* Lint fix.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Enable cuda graph tests.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reduce chance of spurious failure and reword.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Review suggestions from @timmoon10

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update CPP tests.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update common.h

Signed-off-by: Xin Yao <yaox12@outlook.com>

* Update test_float8blockwisetensor.py

Signed-off-by: Xin Yao <yaox12@outlook.com>

---------

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Xin Yao <yaox12@outlook.com>


8 participants