Highlights
CUDA 13 and Blackwell Support
- Full CUDA 13 build support for OSS and GenAI components
- NVIDIA Blackwell (B200) architecture support with Cutlass FMHA kernels
- GB200 and H100 optimizations for CUTLASS grouped GEMM in pretraining
Table Batched Embedding (TBE) Improvements
- FP8 embedding weights support (backend and frontend)
- Feature score eviction with TTL-based and ID_COUNT trigger conditions
- TBE data configuration reporter for forward passes
- Streaming and tracking logic for SplitTableBatchedEmbeddingBagsCodegen
- ZCH identities support in raw embedding streamer
Optimizer Enhancements
- Full Adam optimizer support with state offloading
- Partial Rowwise Adam enabled in SSD TBE
- CPU support for rowwise adagrad with counter
- Multiple optimizer states streaming to backend
GenAI and GEMM Performance
- CUTLASS v4.2.1 upgrade with BF16 re-enabled
- CUTLASS grouped GEMM for Llama4x pretraining (fprop, dgrad, wgrad)
- Fast-simplicial attention implementation
- Sliding Window Attention (SWA) backward optimization
- Deterministic mode support in Cutlass attention kernels
- MXFP8 grouped GEMM with 2D-2D support
Quantization Improvements
- FP4 grouped API for torch with Blackwell FP4 refactoring
- MXFP4 quantization with inline PTX performance boost
- Quantization with min & max bounds support (CPU ops for n-bit, 4-bit, 2-bit, 8-bit)
- Symmetric quantization for prefill token-wise FP8
- AVX512-based int8 to bf16 dequantization
Platform Improvements
- ARM/AArch64 builds enabled (Apple Silicon, ARM64)
- ROCm 7 support with MI350X FP8 Triton patches
- ROCm 6.4.2 upgrade
- gfx950 added to ROCm 7.0
Better Engineering
- Extensive kernel migration to the FBGEMM_LAUNCH_KERNEL pattern
- Code refactoring and C++ modernization (C++17/C++20 features)
- Reduced build size by removing old architectures
- Build time optimizations
- Reusable CI workflows
Software Requirements
FBGEMM_GPU v1.4.0 has been tested and is known to work on the following setups:
- PyTorch: v2.9
- CUDA: v12.6[1], 12.8, 12.9[1]
- Python: v3.10, 3.11, 3.12, 3.13, 3.14t[1]
- ROCm: v6.3[1], 6.4[1]
[1] Available only via the --index-url option; see below for an installation example.
It is recommended to prepare an isolated environment, such as Conda and/or Docker, for installing and running FBGEMM_GPU; a minimal Conda setup is sketched below.
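As a minimal sketch (the environment name, Python version, and CUDA index URL below are illustrative choices, not requirements), a Conda-based setup could look like:

```bash
# Create and activate an isolated environment (name and versions are examples only)
conda create -n fbgemm_env python=3.12 -y
conda activate fbgemm_env

# Install a PyTorch build (v2.9 per the requirements above) matching the CUDA variant you intend to use
pip install torch --index-url https://download.pytorch.org/whl/cu128
```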
Note

The table below lists the CUDA architectures (compute capabilities) included in each build, per distribution channel:

| Default CUDA | Version | FBGEMM GPU (PyPI) | FBGEMM GPU (Nova) | GenAI (PyPI) | GenAI (Nova) |
|---|---|---|---|---|---|
| 12.8 | 1.4.0 | 8.0;9.0;10.0a [2] | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a |
| 12.8 | 1.4.1 (GenAI release) | N/A [3] | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a |
[2] Due to PyPI binary size limits, some CUDA architectures are dropped from the fbgemm_gpu release on PyPI.
[3] fbgemm_gpu 1.4.1 is not released on PyPI due to project size limits, as the 1.4.1 minor fix applies to GenAI only.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
```bash
# FBGEMM_GPU CUDA variant (only the default CUDA 12.8 variant is available on PyPI)
pip install fbgemm-gpu==1.4.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.4.0
```

Alternatively, it can be fetched from the PyTorch PIP index:
```bash
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cu128/
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cu129/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cpu
```
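After installation, a quick sanity check is to import the package and confirm that the FBGEMM operators are registered; this one-liner is a minimal sketch that only verifies the import succeeds:

```bash
# Importing fbgemm_gpu after torch loads the native libraries and registers ops under torch.ops.fbgemm
python -c "import torch; import fbgemm_gpu; print(torch.ops.fbgemm)"
```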
Changes

Table Batched Embedding (TBE) Operators
For GPU
- [New] FP8 embedding weights support - backend (#4614) and frontend (#4615)
- [New] TBE data configuration reporter to TBE forward (#4672)
- [New] Tracking and streaming logic for SplitTableBatchedEmbeddingBagsCodegen (#4741)
- [New] ZCH identities to raw embedding streamer (#4765)
- [New] Direct write embedding method (#4800)
- [New] Add Inference Feature to Skip Pinned Memory Creation (#4924, #4966)
- [Improvement] Merged pool embedding optimization (#4985)
- [Improvement] Remove guard_size_oblivious from embedding_backward_split_meta_template (#4973)
- [Improvement] Monitoring metrics for DRAM cache performance (#4997)
- [Improvement] Add read load metrics (#4956)
- [Improvement] TBE reporter fixes (#4882)
For CPU
- [New] CPU support for rowwise adagrad with counter (#4986)
- [New] Scale_bias_last and quant_padding_float_type support for CPU dequant kernel (#4943)
- [Improvement] Skip AVX branches for EmbeddingSpMDM when in_type == BFLOAT16 && out_type == FLOAT (#4685)
- [Improvement] EmbeddingSpMDM8Bit_Sve support (#3911)
SSD Table Batched Embedding (TBE) Operators
- [New] Feature score eviction backend and frontend support (#4681)
- [New] TTL-based eviction in feature score eviction policy (#4752)
- [New] ID_COUNT eviction trigger condition (#4829)
- [New] Virtual table eviction policy (#4433)
- [New] Partial Rowwise Adam enabled in SSD TBE (#4525)
- [New] Functions to destruct RocksDB checkpoint from ReadOnlyEmbeddingKVDB (#4361)
- [New] Backend API to get KVZCH eviction metadata (#4594)
- [New] Multiple paths support in SSD TBE passed_in_path (#4690)
- [New] Eviction metadata tensor FQN (#4611)
- [New] Inference eviction interfaces and implementation (#4500, #4504)
- [Improvement] Emit eviction metrics per TBE (#4563)
- [Improvement] Update cache row dim calculation in TBE SSD (#4480)
- [Improvement] Simplify SplitState application for optimizers TBE SSD (#4492)
- [Improvement] KV embedding DRAM delta loading in predictor (#4438)
- [Improvement] Fix feature score eviction policy in different trigger mode (#4952)
- [Improvement] Fix loading weight with metaheader for eviction (#4769)
- [Improvement] Fix fetch eviction metadata bug (#4701)
Optimizer Support
- [New] Full Adam optimizer support with state offloading (#4702)
- [New] Multiple optimizer states streaming to backend (#4507)
- [New] Optimizer state offloading for partial rowwise adam (#4405, #4477)
- [New] Extended backend_return_whole_row to support multiple optimizer states (#4620)
- [New] API to set backend_return_whole_row (#4607)
- [Improvement] Expand split_optimizer_states() to support multiple optimizer states (#4495)
- [Improvement] Replace optimizer dtype() method with optimizer_state_dtypes dictionary (#4523)
- [Improvement] Handle fetch optimizer states for KV ZCH cases (#4498)
- [Improvement] Fix partial rowwise adam optimizer offloading memory alignment (#4645)
GenAI Support and Operators
Attention Kernels
- [New] Enable Cutlass Blackwell attention in OSS (#4793)
- [New] Fast-simplicial attention in fbgemm (#4797)
- [New] Enable left mask in Cutlass kernels (#4799)
- [New] Enable deterministic mode in Cutlass attention (#4840)
- [New] Add head_dim = 64 in B200 Attention (#4935)
- [New] Add softmax_scale to blackwell cutlass fmha (#4897)
- [New] Padded KV Partial Prefill Case (#4848)
- [Improvement] SWA attention backward optimization (#4808, #4795)
- [Improvement] Fix overflow in attention workspace calculation (#4939)
CUTLASS/GEMM Support
- [New] Enable CUTLASS grouped GEMM for pretraining wgrad on GB200 and H100 (#4886, #4913)
- [New] Enable CUTLASS grouped GEMM for Llama4x pretraining grad (#4856)
- [New] MXFP8 grouped GEMM enabled (#4710)
- [New] 2D-2D support for MXFP8 Grouped GEMM (#4816)
- [New] FP4 grouped API for torch (#4958)
- [New] Add output as an option in CUTLASS grouped GEMM (#4931)
- [New] DeepGemm Style Groupwise Group Gemm (#4464, #4365)
- [Improvement] Upgrade CUTLASS to v4.2.1 (#4964)
- [Improvement] BF16 re-enabled with Cutlass update (#5013)
- [Improvement] Modernize FP8 Blockwise GEMM (#5002)
- [Improvement] Enable control of number of SMs in cutlass grouped GEMM (#4991)
- [Improvement] Improve general shape performance for dgrad/fprop cutlass grouped gemm (#4988, #4989)
- [Improvement] Optimize wgrad CUTLASS grouped gemm (#4891)
- [Improvement] Improve CUTLASS GMM for Llama4x pretraining fprop (#4855)
- [Improvement] MXFP8 Grouped GEMM tuning (#4821)
- [Improvement] FP4 grouped refactor (#4847, #4934)
- [Improvement] Migrate to new device TMA API for grouped_gemm.py (#4866)
Triton GEMM Support
- [New] Triton FP8 GEMM to produce fp16 output (#4719)
- [New] MI350X FP8 Triton patch (#4889)
- [Improvement] Zen LLATTE CoFormer Triton FP8 tune (#4951)
- [Improvement] CMF v0 FP8 Triton tune (#4873)
- [Improvement] Align output type in Triton FP8 GEMM meta kernel (#4698)
- [Improvement] Tune FP8 GEMM output and bias type to reduce NE diffs (#4638)
- [Improvement] Triton early prune config fix (#4917)
- [Improvement] Auto-tune for 500x shapes to boost flops (#4626, #4635)
KV Cache Support
- [New] Enable building KV cache ops in OSS (#4578)
- [Improvement] Split dequant functions from kv_cache.cu (#4787)
- [Improvement] Pad dequantized paged FP8 KV with zeros (#4780)
- [Improvement] Remove potentially leaky debug values from convert_e4m3fn_kv_cache_to_e4m3fnuz_inplace (#4748)
Quantization Operators
FP8/FP4 Quantization
- [New] Symmetric quantization to prefill token-wise FP8 (#4661)
- [Improvement] Restrict to FP8 and patch 4.2.0 release changes (#4898)
- [Improvement] Remove e5m2 from f8f8bf16_rowwise and f8f8bf16_rowwise_batched (#4907, #4908)
- [Improvement] Fix scaled input issue (#4884)
MX4/MXFP Quantization
- [New] Enable Triton-MTIA MX4 quantization (#4619)
- [New] Boost MXFP4 quantization with inline PTX (#4694)
- [Improvement] Mimic MX4 in NVFP4 native (#4970)
- [Improvement] Minor dtype change to MX4 quantize (#4571)
- [Improvement] Fix MXFP8 test due to OSS FBGEMM version mismatch (#4749)
Other Quantization
- [New] Quantization with min & max bounds support - CPU ops for n-bit (#4860)
- [New] Quantization with min & max bounds support - 4-bit & 2-bit on X86-64 (#4833)
- [New] Quantization with min & max bounds support - float to 8-bit on X86-64 (#4790)
- [New] AVX512-based int8 to bf16 dequantization (#4912)
- [New] Add ability to pad rowwise quantized tensors (#4877)
- [New] Stacked quantization with RMS and SiLU fusion (#4487)
- [New] Groupwise quantization kernel (#4439)
- [Improvement] Improve aarch64 Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf (#4995)
- [Improvement] Expose rounding_mode in quantization for performance (#4862)
- [Improvement] Add unit testing to quantize_qkv_per_head (#4789)
Sparse Operators
- [New] Add block_bucketize_2d_weights kernel (#4778)
- [New] Support 2D weights in permute kernel (#4723)
- [New] Add kernel to accumulate weight value and counts based on jagged unique inverse indices (#4654)
- [Improvement] Use guard_or_true instead of guard_size_oblivious in sparse_ops.py (#4974)
- [Improvement] Fix sparse_permute_1d kernel to support double dtype (#4969)
- [Improvement] Correctly handle non-contiguous permute tensors in sparse permute kernel (#4593)
- [Improvement] Make length check on mask numel in masked_select_jagged_1d optional (#4621)
- [Improvement] Handle M == 0 but N != 0 for scatter_add (#4731)
- [Improvement] Support general num experts in scatter_add_padded_tokens (#4627)
- [Improvement] Remove assertion check inside scatter_add when in_tokens and out_tokens have different shape (#4639)
Comm Operators
- [New] Enable Comm Ops in OSS GenAI (#4548)
- [New] Comms for many2one/one2many (#4528)
- [New] Support any EP in combine or split shuffling (#4649)
- [New] Support non power-of-2 experts in combine or split shuffling (#4634)
- [New] Enable index_shuffling with top-K > 1 (#4605)
- [New] Enable index_shuffling with more experts (#4588)
Platform Support
CUDA 13 / Blackwell / GB200
- [New] Enable CUDA 13 builds in OSS (#5143)
- [New] Prepare FBGEMM_GPU for CUDA 13 builds (#5100)
- [New] Enable CUDA 13 for GenAI builds (#4837)
- [New] Add back cu129 support for 1.4.x release (#5)
- [Improvement] Fix gen_ai:quantize_ops_gpu build for B200 (#4683)
- [Improvement] Fix tuning cache for f8f8bf16_rowwise_grouped on SM100 (#4843)
ARM / AArch64
- [New] Enable ARM builds (#4604, #4632)
- [New] Enable Apple Silicon builds (#4673)
- [New] Enable more tests and benchmarks on ARM64 (#4721)
- [Improvement] Upgrade ARM CI instances (#4693)
- [Improvement] Use armv8.2-a+fp16fml (#4942)
- [Improvement] Use reduceAVX2 trivial implementation for aarch64 (#4677)
- [Improvement] Check arm_neon_sve_bridge.h header (#4967)
- [Improvement] Fix Arm64 OSS PyTorch build with FBGEMM (#4775)
- [Improvement] Fix fp32 ARM builds (#4699)
- [Improvement] Fix arm64 issues (#4711)
ROCm / AMD
- [New] Support ROCm 7 in Nova (#4953)
- [New] Add gfx950 to ROCm 7.0 (#4443)
- [Improvement] Upgrade ROCm to 6.4.2 (#4876)
- [Improvement] Fix quantize kernels on ROCm 6.4 (#4708)
- [Improvement] Improve assertions on ROCm fp8_rowwise_grouped_gemm (#4812)
- [Improvement] Invoke AMD specific kernel reorder_batched_ad_indices_kernel_vec (#4412)
- [Improvement] Fix OSS performance on FP8 AMD kernels (#4462)
- [Improvement] Optimize tbe_input_combine_with_length_cuda on AMD
- [Improvement] Mitigate IG 500x seg-fault on AMD (#4554)
- [Improvement] More hipify v2 fixes (#4947)
- [Improvement] Fix ROCm build issues (#4902)
- [Improvement] Add flags to compile CK GEMMs under PyTorch (#4878)
Build / CI Improvements and Better Engineering
- [New] Remove old architectures from build to reduce build size (#5172)
- [New] Reusable CI workflows (#4646)
- [New] Enable USE_FBGEMM_GENAI (#4703)
- [New] Add version compatibility checks on library load (#4875)
- [Improvement] Extensive kernel migration to the FBGEMM_LAUNCH_KERNEL pattern (#4751, #4738, #4724, #4692, #4662, #4618, #4595, #4572, #4562, #4556)
- [Improvement] Migrate to Python 3.9 syntax (#4909)
- [Improvement] Update to use C++17/C++20 features (#4558, #4590)
- [Improvement] Modernize C++ code by clang-tidy (#4415, #4466, #4436, #4437)
- [Improvement] Save binary size (#4900)
- [Improvement] Decouple ATEN from Feature Gates Deps (#4965)
- [Improvement] Support loading libraries when installing multiple FBGEMM targets (#4993)
- [Improvement] General OSS fixes and script improvements
- [Improvement] Add tbb as library dependency explicitly in CMake (#4859)
- [Improvement] Modularize FBGEMM CMake Target Definitions (#4509, #4518, #4524, #4553)
- [Improvement] Fix various build issues for different platforms
Tests and Benchmarks
- [New] Add export trace to UVM and cache benchmark (#4777)
- [New] Add tests to decode and padded jagged_vs_padded_kv (#4806)
- [New] Unit tests for TBE API v1 compatibility (#4926)
- [New] Unit tests for Full Adam (#4712, #4713)
- [New] Unit tests for Partial Rowwise Adam (#4642)
- [Improvement] BWD test improvement (#4981)
- [Improvement] SSD TBE test cleanups (#4758)
- [Improvement] Update TBE training benchmark (#4455)
- [Improvement] Modify VBE bench to support SSD and trace export (#4658)
- [Improvement] Support multiple total-k and total-m in quantize bench (#4890)
- [Improvement] Re-enable decode tests (#4946)