
Hongbinl/split wgrad new #1

Open
lhb8125 wants to merge 61 commits into hongbinl/split_wgrad from hongbinl/split_wgrad_new

Conversation

@lhb8125 (Owner) commented Apr 8, 2025

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

lhb8125 and others added 16 commits April 7, 2025 18:55
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
for more information, see https://pre-commit.ci

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
* Use dummy wgrads for lower memory consumption

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Bug fix to avoid sharing gradients.

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Disable automatic use of batch_p2p_comm for CP2

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Change weight to origin_weight for LN_LINEAR

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Minor stylistic tweaks and typo fixes

Review suggestions from @ptrendx

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix incorrect col strides for MXFP8 matrices

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* scaling enum abstract

* rm NVTE_ from ScalingMode names

* rework scaling mode enum in grouped gemm

* fix norm sharding

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…r op backward (NVIDIA#1646)

Explicitly specify quantized tensor usages needed for linear op backward

Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Debug checkpointing with te.Sequential

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add GEMM logic for blockwise quantized tensors.

GEMM test cases included in pytorch integration.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update NVTE_BLOCK_SCALING for GEMM.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gate feature on CUDA 12.9

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gemm typo.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove unnecessary type converter change.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reflect epilogue availability and test supported epilogues.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* GEMM simplifications from recipe branch.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Format py code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update GEMM DGelu tests to match support depending on output dtype.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Force pow2Scales in GEMM

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add GEMM test to pytorch test suite.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add copyright to GEMM test.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update import for GEMM test.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add license.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test gemm supported predicate.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use sgemm-like interfaces and naming.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Rewrite GEMM comment.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR Feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Recipe setup for Linear modules.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use 12.9 feature test.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Run against tensor dumps from internal library.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update FIXME to TODO with linked issue.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update full recompute feature to save recipe.

The recompute context uses the same recipe and fp8 settings
as the original fwd pass; a sketch of the idea follows this
commit's trailer block.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR Feedback. Avoid reusing quantizer objects.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update logic in module.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Format py.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update for PP bug.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test numerics.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update force_power_of_2 scales in the recipe.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update usage method to satisfy upstream changes.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* fix subchannel recipe in distributed test with bf16 gather

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Edit and cleanup BF16 gather code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test import.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* support columnwise only mode to 1D quantize kernel

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format and move enum

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Skip alloc.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* try async bf16 gather

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format python code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Document and type code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update pytorch lint errors.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Don't set high-precision dtype.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add test for sanity and CG; fix CG for sequential?

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Keep make_quantizers API stable

Update num_quantizers instead to pass cuda_graph tests.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix import name.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Rename recipe method.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Skip grouped linear sanity test.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Set usage before BF16 gather.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* refactor for nvte_quantize_v2

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Cleanup nvte_quantize_v2

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Test fp32 scales.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Disable CUDA graph.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Simplify layernorm linear

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Cleanup layernorm linear.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* LayerNorm linear bwd gather logic.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Communication updates.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update transformer_engine/pytorch/ops/op.py

Apply MR comment change.

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>

* Lint fix.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Enable cuda graph tests.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reduce chance of spurious failure and reword.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Review suggestions from @timmoon10

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update CPP tests.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update common.h

Signed-off-by: Xin Yao <yaox12@outlook.com>

* Update test_float8blockwisetensor.py

Signed-off-by: Xin Yao <yaox12@outlook.com>

---------

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Xin Yao <yaox12@outlook.com>
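
To illustrate the "save recipe for recompute" change above: a minimal sketch, assuming a
generic checkpoint wrapper (`RecomputeState`, `run_forward`, and `run_recompute` are
illustrative names, not TE's checkpoint API); only `te.fp8_autocast` and its
`fp8_recipe` argument are the real interface.

```python
import transformer_engine.pytorch as te

class RecomputeState:
    """Hypothetical checkpoint helper: stashes the active fp8 recipe at
    forward time so the recompute pass replays identical fp8 settings."""

    def __init__(self):
        self.saved_input = None
        self.saved_recipe = None

    def run_forward(self, module, inp, recipe):
        self.saved_input = inp.detach()
        self.saved_recipe = recipe  # the recipe is checkpointed with the input
        with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
            return module(inp)

    def run_recompute(self, module):
        # recompute under the same recipe and fp8 settings as the fwd pass
        with te.fp8_autocast(enabled=True, fp8_recipe=self.saved_recipe):
            return module(self.saved_input)
```
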
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
lhb8125 force-pushed the hongbinl/split_wgrad_new branch from a718320 to 76eea17 on April 11, 2025 14:43
lhb8125 and others added 9 commits April 11, 2025 07:44
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Repeated calls to nvte_shape should not invalidate
previous data pointers.

It would be possible to avoid unnecessary comparisons
by duplicating some of the logic from shape() so that
the cache is only relevant when columnwise shapes are
involved. Whether that code duplication is preferable
to the comparisons that arise from the by-value
semantics of reusing shape() is a judgment call.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
…1665)

* Add option to cache activation input in FP8

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid casting to FP8 transpose

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Skip input caching if device is not supported

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add documentation that FP8 input caching is experimental

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
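
The activation-caching option above trades accuracy for memory; a minimal sketch of the
idea in generic autograd code (`CachedFP8Linear`, `quantize`, and `dequantize` are
illustrative, not the TE implementation):

```python
import torch

class CachedFP8Linear(torch.autograd.Function):
    """Save the quantized input for backward instead of the full-precision one."""

    @staticmethod
    def forward(ctx, inp, weight, quantize, dequantize):
        inp_q = quantize(inp)  # e.g. cast to an fp8 dtype
        ctx.save_for_backward(inp_q, weight)
        ctx.dequantize = dequantize
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        inp_q, weight = ctx.saved_tensors
        grad_inp = grad_out @ weight
        # wgrad sees the cached (quantized) input: lower activation memory,
        # slightly different numerics -- hence "experimental"
        grad_w = grad_out.t() @ ctx.dequantize(inp_q)
        return grad_inp, grad_w, None, None
```
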
[QA] Add error handling

- Standardize test failure handling using the unified 'test_fail' and 'error_exit' functions

Signed-off-by: Linxi Ding <linxid@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
…VIDIA#1671)

* Added attention activation offloading support for TE v2.0

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
* add support for new recipe on permute_fusion, rm fp unpermute

Signed-off-by: tongliu <tongliu@nvidia.com>

* fix lint

Signed-off-by: Xin Yao <xiny@nvidia.com>

* remove fp8 from index map

Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* skip unsupported tests

Signed-off-by: Xin Yao <xiny@nvidia.com>

---------

Signed-off-by: tongliu <tongliu@nvidia.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: tongliu <tongliu@nvidia.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
lhb8125 force-pushed the hongbinl/split_wgrad_new branch from b80a842 to 7ec4182 on April 14, 2025 12:00
timmoon10 and others added 30 commits April 14, 2025 10:57
…backward (NVIDIA#1676)

* Avoid unnecessary tensor usages when caching for linear op backward

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug test failure

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
… dgrad backward (NVIDIA#1648)

* Add fp8 weight transpose cache check in backward, and regenerate it if it does not exist

Signed-off-by: jianbinc <shjwudp@gmail.com>

* Properly handle fsdp shard model weight input.

Signed-off-by: jianbinc <shjwudp@gmail.com>

* move Float8Tensor to QuantizedTensor in cast_master_weights_to_fp8 UT

Signed-off-by: jianbinc <shjwudp@gmail.com>

* handle Float8TensorBase issue

Signed-off-by: jianbinc <shjwudp@gmail.com>

* fix bug in activation recompute

Signed-off-by: jianbinc <shjwudp@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
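
A hedged sketch of the transpose-cache check described above (`_rowwise_data` and
`_transpose_cache` are illustrative attribute names, not the actual Float8Tensor
internals):

```python
def get_columnwise_weight(fp8_weight):
    """Return the transposed fp8 weight for the dgrad GEMM, rebuilding it
    from the rowwise fp8 data if an earlier step dropped the cache."""
    if getattr(fp8_weight, "_transpose_cache", None) is None:
        # regenerate the fp8 transpose instead of erroring out
        fp8_weight._transpose_cache = fp8_weight._rowwise_data.t().contiguous()
    return fp8_weight._transpose_cache
```
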
…A#1673)

* script improvement

* add wait

* add return code back

* relax tols for FP8 test in test_multiprocessing_ by 0.001

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* added test

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* changed the test

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* test change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* test fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* small changes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* small changes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clear

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* base

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* support adam bf16 state

Signed-off-by: XiaobingSuper <xiaobingzhangupc@gmail.com>

* use fp32 kernel but keep bf16 optimizer states to save memory

Signed-off-by: lit <lit@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Signed-off-by: lit <lit@nvidia.com>
Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
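
The memory-saving scheme above ("fp32 kernel, bf16 states") is straightforward to sketch
in plain PyTorch (illustrative reference only; bias correction omitted and the actual
change lives in the fused Adam kernel). Master params and grads stay fp32; only
exp_avg/exp_avg_sq live in bf16:

```python
import torch

def adam_step_bf16_state(p, grad, exp_avg_bf16, exp_avg_sq_bf16,
                         lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # upcast the bf16 optimizer states and do the math in fp32
    m = exp_avg_bf16.float()
    v = exp_avg_sq_bf16.float()
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    p.data.addcdiv_(m, v.sqrt().add_(eps), value=-lr)
    # store the states back in bf16 to halve their memory footprint
    exp_avg_bf16.copy_(m)
    exp_avg_sq_bf16.copy_(v)
```
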
* Add test cases for full coverage in jax/test_layer.py
- causal and window size None
- causal and window size default (-1,1)
- no_mask and window size default (-1,1)
- no_mask and window size (2,2)
- padding and window size None
- padding_causal and window_size (2,2)

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Correct the condition where padding_causal_mask was being mapped to scaled upper triangle

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Fix Issue NVIDIA#1524

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Add a runner and test cases for jax.flax.module.Softmax class for fwd pass only
Segregate runner classes for Softmax module and softmax primitives

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Simplify logic when picking softmax primitives and softmax jax framework calls

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Simplify the logic for performing jax based softmax

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Code clean up

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support table for mask, SWA and Softmax type. Code linting

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Explicit SWA conditions in comments. Fix typo

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve typo to remove None in SWA comments section

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…#1614)

* add

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* weight workspace fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* docs fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* file I forgot

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* lint fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update transformer_engine/debug/pytorch/utils.py

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* setup fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* setup fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update transformer_engine/pytorch/tensor/_internal/float8_tensor_base.py

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* all tensor types

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removed check

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* move error

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* _reset

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update transformer_engine/pytorch/module/linear.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* name documentation

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* added blockwise quantizer

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make debug option optional

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update transformer_engine/pytorch/tensor/quantized_tensor.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* names fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
* Update README.rst - Installation

Update installation section with comprehensive guidelines

- Add detailed system requirements
- Include Conda installation method (experimental)
- Document environment variables for customizing build process
- Update FlashAttention support to cover both version 2 and 3
- Add troubleshooting section with solutions for common installation issues

Signed-off-by: Santosh Bhavani <sbhavani@nvidia.com>

* Update README.rst - Installation

removed conda section

Signed-off-by: Santosh Bhavani <sbhavani@nvidia.com>

* Update README.rst - Installation

added all gpu archs that support FP8

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update README.rst

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update README.rst

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update README.rst

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update installation.rst

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix docs and add troubleshooting

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Santosh Bhavani <sbhavani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
* drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* [QA] Add error handling

- Standardize test failure handling using the unified 'test_fail' and 'error_exit' functions

Signed-off-by: Linxi Ding <linxid@nvidia.com>

* Add XML log generation for pytest results

- Add `--junitxml` option to pytest command to generate JUnit XML format logs

Signed-off-by: Linxi Ding <linxid@nvidia.com>

* Add $XML_LOG_DIR

Signed-off-by: Linxi Ding <linxid@nvidia.com>

* mkdir

Signed-off-by: Linxi Ding <linxid@nvidia.com>

* Update qa/L0_pytorch_unittest/test.sh

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------

Signed-off-by: Linxi Ding <linxid@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
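
(For context: `--junitxml` is pytest's standard flag for machine-readable results, e.g.
`pytest --junitxml="$XML_LOG_DIR/pytest.xml" ...` — the log filename here is illustrative;
`$XML_LOG_DIR` comes from the commits above. CI dashboards can then ingest per-test
results.)
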
…DIA#1690)

* Add a flag to support computing zero-centered gamma in weight dtype or compute dtype for CuDNN

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Address comments

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
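
To make the flag above concrete: with zero-centered gamma, the stored LayerNorm weight is
gamma − 1, and the kernel adds the 1 back either before or after casting to the compute
dtype. A minimal reference in plain PyTorch (illustrative, not the cuDNN path;
`offset_in_weight_dtype` is a hypothetical parameter name):

```python
import torch

def layernorm_zero_centered(x, gamma, beta, eps=1e-5,
                            offset_in_weight_dtype=True):
    # gamma is stored zero-centered: the effective scale is (gamma + 1)
    if offset_in_weight_dtype:
        g = (gamma + 1).to(torch.float32)   # add 1 in the weight dtype
    else:
        g = gamma.to(torch.float32) + 1     # add 1 in the compute dtype
    xf = x.to(torch.float32)
    mu = xf.mean(-1, keepdim=True)
    var = xf.var(-1, unbiased=False, keepdim=True)
    y = (xf - mu) / torch.sqrt(var + eps) * g + beta.to(torch.float32)
    return y.to(x.dtype)
```
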
* Allow NVTEShape to own data.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Convert repeated copy paths to nvte_make_shape calls.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Apply suggestions from code review

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Build fixes.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

---------

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* move swizzle scaling factor to cpp

Signed-off-by: Xin Yao <xiny@nvidia.com>

* resolve comments

Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* re merge request

Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add docstring

Signed-off-by: Peter Dykas <wdykas@nvidia.com>

---------

Signed-off-by: Peter Dykas <wdykas@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
rm pax/praxis

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>