
FBGEMM_GPU v1.4.0 Release Notes

Released by @spcyppt on 09 Jan 04:50

Highlights

CUDA 13 and Blackwell Support

  • Full CUDA 13 build support for OSS and GenAI components
  • NVIDIA Blackwell (B200) architecture support with Cutlass FMHA kernels
  • GB200 and H100 optimizations for CUTLASS grouped GEMM in pretraining

Table Batched Embedding (TBE) Improvements

  • FP8 embedding weights support (backend and frontend)
  • Feature score eviction with TTL-based and ID_COUNT trigger conditions
  • TBE data configuration reporter for forward passes
  • Streaming and tracking logic for SplitTableBatchedEmbeddingBagsCodegen
  • ZCH identities support in raw embedding streamer

Optimizer Enhancements

  • Full Adam optimizer support with state offloading
  • Partial Rowwise Adam enabled in SSD TBE
  • CPU support for rowwise adagrad with counter
  • Multiple optimizer states streaming to backend

GenAI and GEMM Performance

  • CUTLASS v4.2.1 upgrade with BF16 re-enabled
  • CUTLASS grouped GEMM for Llama4x pretraining (fprop, dgrad, wgrad)
  • Fast-simplicial attention implementation
  • Sliding Window Attention (SWA) backward optimization
  • Deterministic mode support in Cutlass attention kernels
  • MXFP8 grouped GEMM with 2D-2D support

Quantization Improvements

  • FP4 grouped API for torch with Blackwell FP4 refactoring
  • MXFP4 quantization with inline PTX performance boost
  • Quantization with min & max bounds support (CPU ops for n-bit, 4-bit, 2-bit, 8-bit)
  • Symmetric quantization for prefill token-wise FP8
  • AVX512-based int8 to bf16 dequantization

Platform Improvements

  • ARM/AArch64 builds enabled (Apple Silicon, ARM64)
  • ROCm 7 support with MI350X FP8 Triton patches
  • ROCm 6.4.2 upgrade
  • gfx950 added to ROCm 7.0

Better Engineering

  • Extensive kernel migration to FBGEMM_LAUNCH_KERNEL pattern
  • Code refactoring and C++ modernization (C++17/C++20 features)
  • Reduced build size by removing old architectures
  • Build time optimizations
  • Reusable CI workflows

Software Requirements

FBGEMM_GPU v1.4.0 has been tested and is known to work with the following setups:

  • PyTorch: v2.9
  • CUDA: v12.6[1], 12.8, 12.9[1]
  • Python: v3.10, 3.11, 3.12, 3.13, 3.14t[1]
  • ROCm: v6.3[1], 6.4[1]

[1] Available via the --index-url option only; see the installation examples below.

It is recommended to install and run FBGEMM_GPU in an isolated environment, such as Conda and/or Docker.

Note

The table below lists the GPU target architectures (CUDA compute capabilities) included in each published package:

| Default CUDA | Version | FBGEMM GPU (PyPI) | FBGEMM GPU (Nova) | GenAI (PyPI) | GenAI (Nova) |
|--------------|---------|-------------------|-------------------|--------------|--------------|
| 12.8 | 1.4.0 | 8.0;9.0;10.0a [2] | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a |
| 12.8 | 1.4.1 (for GenAI release) | N/A [3] | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a |

[2] Due to PyPI binary size limits, some architecture support had to be dropped from the fbgemm_gpu release on PyPI.
[3] Due to project size limitations, 1.4.1 is not released on PyPI; the 1.4.1 minor fix was for GenAI only.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the default CUDA 12.8 variant is available)
pip install fbgemm-gpu==1.4.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.4.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cu128/
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cu129/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cpu
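
Once installed, the package can be sanity-checked from Python by importing it (which registers the FBGEMM operators with PyTorch) and invoking a simple operator. A minimal sketch, runnable on the CPU variant as well:

# Verify that the FBGEMM operators are registered and callable
import torch
import fbgemm_gpu  # noqa: F401  # importing the package registers torch.ops.fbgemm.*

lengths = torch.tensor([1, 2, 3], dtype=torch.int32)
offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
print(offsets)  # expected: tensor([0, 1, 3, 6], dtype=torch.int32)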

Changes

Table Batched Embedding (TBE) Operators

For GPU

  • [New] FP8 embedding weights support - backend (#4614) and frontend (#4615)
  • [New] TBE data configuration reporter to TBE forward (#4672)
  • [New] Tracking and streaming logic for SplitTableBatchedEmbeddingBagsCodegen (#4741)
  • [New] ZCH identities to raw embedding streamer (#4765)
  • [New] Direct write embedding method (#4800)
  • [New] Add Inference Feature to Skip Pinned Memory Creation (#4924, #4966)
  • [Improvement] Merged pool embedding optimization (#4985)
  • [Improvement] Remove guard_size_oblivious from embedding_backward_split_meta_template (#4973)
  • [Improvement] Monitoring metrics for DRAM cache performance (#4997)
  • [Improvement] Add read load metrics (#4956)
  • [Improvement] TBE reporter fixes (#4882)
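
For orientation, the sketch below shows the basic usage pattern of the SplitTableBatchedEmbeddingBagsCodegen module that several of the items above extend. The table sizes, optimizer, and inputs are arbitrary illustration values, and a CUDA device is assumed:

import torch
from fbgemm_gpu.split_embedding_configs import EmbOptimType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# Two embedding tables: (num_embeddings, embedding_dim, location, compute device)
emb = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (1000, 64, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
        (2000, 32, EmbeddingLocation.MANAGED, ComputeDevice.CUDA),
    ],
    optimizer=EmbOptimType.EXACT_ROWWISE_ADAGRAD,
    learning_rate=0.05,
)

# Lookups are passed in CSR form, table-major: offsets has T * B + 1 entries
# (T = 2 tables, B = 2 samples here) and indices holds the concatenated bags.
indices = torch.tensor([3, 7, 11, 0, 5, 9], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 2, 3, 4, 6], dtype=torch.int64, device="cuda")
out = emb(indices=indices, offsets=offsets)  # pooled output of shape [2, 64 + 32]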

For CPU

  • [New] CPU support for rowwise adagrad with counter (#4986)
  • [New] Scale_bias_last and quant_padding_float_type support for CPU dequant kernel (#4943)
  • [Improvement] Skip AVX branches for EmbeddingSpMDM when in_type == BFLOAT16 && out_type == FLOAT (#4685)
  • [Improvement] EmbeddingSpMDM8Bit_Sve support (#3911)

SSD Table Batched Embedding (TBE) Operators

  • [New] Feature score eviction backend and frontend support (#4681)
  • [New] TTL-based eviction in feature score eviction policy (#4752)
  • [New] ID_COUNT eviction trigger condition (#4829)
  • [New] Virtual table eviction policy (#4433)
  • [New] Partial Rowwise Adam enabled in SSD TBE (#4525)
  • [New] Functions to destruct RocksDB checkpoint from ReadOnlyEmbeddingKVDB (#4361)
  • [New] Backend API to get KVZCH eviction metadata (#4594)
  • [New] Multiple paths support in SSD TBE passed_in_path (#4690)
  • [New] Eviction metadata tensor FQN (#4611)
  • [New] Inference eviction interfaces and implementation (#4500, #4504)
  • [Improvement] Emit eviction metrics per TBE (#4563)
  • [Improvement] Update cache row dim calculation in TBE SSD (#4480)
  • [Improvement] Simplify SplitState application for optimizers in TBE SSD (#4492)
  • [Improvement] KV embedding DRAM delta loading in predictor (#4438)
  • [Improvement] Fix feature score eviction policy under different trigger modes (#4952)
  • [Improvement] Fix loading weight with metaheader for eviction (#4769)
  • [Improvement] Fix fetch eviction metadata bug (#4701)

Optimizer Support

  • [New] Full Adam optimizer support with state offloading (#4702)
  • [New] Multiple optimizer states streaming to backend (#4507)
  • [New] Optimizer state offloading for partial rowwise adam (#4405, #4477)
  • [New] Extended backend_return_whole_row to support multiple optimizer states (#4620)
  • [New] API to set backend_return_whole_row (#4607)
  • [Improvement] Expand split_optimizer_states() to support multiple optimizer states (#4495)
  • [Improvement] Replace optimizer dtype() method with optimizer_state_dtypes dictionary (#4523)
  • [Improvement] Handle fetch optimizer states for KV ZCH cases (#4498)
  • [Improvement] Fix partial rowwise adam optimizer offloading memory alignment (#4645)
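
To show where these optimizer states surface at the Python level, here is a hedged sketch that configures a TBE module with Partial Rowwise Adam, runs one fused update (the update is applied inside the TBE backward pass rather than by a separate torch.optim step), and then inspects the per-table state via split_optimizer_states(). The sizes are arbitrary, a CUDA device is assumed, and the exact state layout depends on the optimizer chosen:

import torch
from fbgemm_gpu.split_embedding_configs import EmbOptimType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

emb = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[(100, 16, EmbeddingLocation.DEVICE, ComputeDevice.CUDA)],
    optimizer=EmbOptimType.PARTIAL_ROWWISE_ADAM,
    learning_rate=1e-3,
)

indices = torch.tensor([0, 1, 2], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 2, 3], dtype=torch.int64, device="cuda")
emb(indices=indices, offsets=offsets).sum().backward()  # fused optimizer update

# One entry per table; each entry holds that table's optimizer state tensors.
for table_id, states in enumerate(emb.split_optimizer_states()):
    print(table_id, [s.shape for s in states])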

GenAI Support and Operators

Attention Kernels

  • [New] Enable Cutlass Blackwell attention in OSS (#4793)
  • [New] Fast-simplicial attention in fbgemm (#4797)
  • [New] Enable left mask in Cutlass kernels (#4799)
  • [New] Enable deterministic mode in Cutlass attention (#4840)
  • [New] Add head_dim = 64 in B200 Attention (#4935)
  • [New] Add softmax_scale to blackwell cutlass fmha (#4897)
  • [New] Padded KV Partial Prefill Case (#4848)
  • [Improvement] SWA attention backward optimization (#4808, #4795)
  • [Improvement] Fix overflow in attention workspace calculation (#4939)

CUTLASS/GEMM Support

  • [New] Enable CUTLASS grouped GEMM for pretraining wgrad on GB200 and H100 (#4886, #4913)
  • [New] Enable CUTLASS grouped GEMM for Llama4x pretraining grad (#4856)
  • [New] MXFP8 grouped GEMM enabled (#4710)
  • [New] 2D-2D support for MXFP8 Grouped GEMM (#4816)
  • [New] FP4 grouped API for torch (#4958)
  • [New] Add output as an option in CUTLASS grouped GEMM (#4931)
  • [New] DeepGemm Style Groupwise Group Gemm (#4464, #4365)
  • [Improvement] Upgrade CUTLASS to v4.2.1 (#4964)
  • [Improvement] BF16 re-enabled with Cutlass update (#5013)
  • [Improvement] Modernize FP8 Blockwise GEMM (#5002)
  • [Improvement] Enable control of number of SMs in cutlass grouped GEMM (#4991)
  • [Improvement] Improve general shape performance for dgrad/fprop cutlass grouped gemm (#4988, #4989)
  • [Improvement] Optimize wgrad CUTLASS grouped gemm (#4891)
  • [Improvement] Improve CUTLASS GMM for Llama4x pretraining fprop (#4855)
  • [Improvement] MXFP8 Grouped GEMM tuning (#4821)
  • [Improvement] FP4 grouped refactor (#4847, #4934)
  • [Improvement] Migrate to new device TMA API for grouped_gemm.py (#4866)

Triton GEMM Support

  • [New] Triton FP8 GEMM to produce fp16 output (#4719)
  • [New] MI350X FP8 Triton patch (#4889)
  • [Improvement] Zen LLATTE CoFormer Triton FP8 tune (#4951)
  • [Improvement] CMF v0 FP8 Triton tune (#4873)
  • [Improvement] Align output type in Triton FP8 GEMM meta kernel (#4698)
  • [Improvement] Tune FP8 GEMM output and bias type to reduce NE diffs (#4638)
  • [Improvement] Triton early prune config fix (#4917)
  • [Improvement] Auto-tune for 500x shapes to boost flops (#4626, #4635)

KV Cache Support

  • [New] Enable building KV cache ops in OSS (#4578)
  • [Improvement] Split dequant functions from kv_cache.cu (#4787)
  • [Improvement] Pad dequantized paged FP8 KV with zeros (#4780)
  • [Improvement] Remove potentially leaky debug values from convert_e4m3fn_kv_cache_to_e4m3fnuz_inplace (#4748)

Quantization Operators

FP8/FP4 Quantization

  • [New] Symmetric quantization to prefill token-wise FP8 (#4661)
  • [Improvement] Restrict to FP8 and patch 4.2.0 release changes (#4898)
  • [Improvement] Remove e5m2 from f8f8bf16_rowwise and f8f8bf16_rowwise_batched (#4907, #4908)
  • [Improvement] Fix scaled input issue (#4884)
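
As a point of reference for the f8f8bf16_rowwise changes above, the sketch below calls the row-wise FP8 GEMM with hand-rolled per-row scales. It assumes the GenAI build (fbgemm_gpu.experimental.gen_ai) and a GPU with FP8 support; the to_fp8_rowwise helper is written here for illustration and is not an FBGEMM API:

import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401  # registers the GenAI quantize/GEMM ops

M, N, K = 16, 32, 64
x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
w = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)

def to_fp8_rowwise(t):
    # One dequantization scale per row, targeting the e4m3 representable range.
    scale = t.abs().amax(dim=1, keepdim=True).float() / torch.finfo(torch.float8_e4m3fn).max
    return (t.float() / scale).to(torch.float8_e4m3fn), scale.squeeze(1)

xq, x_scale = to_fp8_rowwise(x)
wq, w_scale = to_fp8_rowwise(w)

# Row-wise scaled FP8 x FP8 GEMM producing a BF16 result of shape [M, N].
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)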

MX4/MXFP Quantization

  • [New] Enable Triton-MTIA MX4 quantization (#4619)
  • [New] Boost MXFP4 quantization with inline PTX (#4694)
  • [Improvement] Mimic MX4 in NVFP4 native (#4970)
  • [Improvement] Minor dtype change to MX4 quantize (#4571)
  • [Improvement] Fix MXFP8 test due to OSS FBGEMM version mismatch (#4749)

Other Quantization

  • [New] Quantization with min & max bounds support - CPU ops for n-bit (#4860)
  • [New] Quantization with min & max bounds support - 4-bit & 2-bit on X86-64 (#4833)
  • [New] Quantization with min & max bounds support - float to 8-bit on X86-64 (#4790)
  • [New] AVX512-based int8 to bf16 dequantization (#4912)
  • [New] Add ability to pad rowwise quantized tensors (#4877)
  • [New] Stacked quantization with RMS and SiLU fusion (#4487)
  • [New] Groupwise quantization kernel (#4439)
  • [Improvement] Improve aarch64 Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf (#4995)
  • [Improvement] Expose rounding_mode in quantization for performance (#4862)
  • [Improvement] Add unit testing to quantize_qkv_per_head (#4789)
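
For context, FBGEMM's existing fused 8-bit row-wise quantize/dequantize ops (a close relative of the bounded n-bit variants listed above) can be exercised as follows; a minimal CPU sketch:

import torch
import fbgemm_gpu  # noqa: F401  # registers the quantization ops

x = torch.randn(4, 128, dtype=torch.float32)
# Each quantized row packs the int8 payload plus an fp32 scale and bias.
q = torch.ops.fbgemm.FloatToFused8BitRowwiseQuantized(x)
y = torch.ops.fbgemm.Fused8BitRowwiseQuantizedToFloat(q)
print((x - y).abs().max())  # small row-wise quantization error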

Sparse Operators

  • [New] Add block_bucketize_2d_weights kernel (#4778)
  • [New] Support 2D weights in permute kernel (#4723)
  • [New] Add kernel to accumulate weight value and counts based on jagged unique inverse indices (#4654)
  • [Improvement] Use guard_or_true instead of guard_size_oblivious in sparse_ops.py (#4974)
  • [Improvement] Fix sparse_permute_1d kernel to support double dtype (#4969)
  • [Improvement] Correctly handle non-contiguous permute tensors in sparse permute kernel (#4593)
  • [Improvement] Make length check on mask numel in masked_select_jagged_1d optional (#4621)
  • [Improvement] Handle M == 0 but N != 0 for scatter_add (#4731)
  • [Improvement] Support general num experts in scatter_add_padded_tokens (#4627)
  • [Improvement] Remove assertion check inside scatter_add when in_tokens and out_tokens have different shape (#4639)
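
Several of the jagged-tensor items above operate on FBGEMM's jagged (values + offsets) layout. The short CPU sketch below shows that layout and densifies it with jagged_to_padded_dense; the sizes are arbitrary:

import torch
import fbgemm_gpu  # noqa: F401  # registers the sparse/jagged ops

# A jagged batch of 3 rows with lengths [2, 0, 3], stored as values + offsets.
values = torch.arange(5, dtype=torch.float32).unsqueeze(1)  # shape [total_len, 1]
offsets = torch.tensor([0, 2, 2, 5], dtype=torch.int64)

# Pad every row out to length 4, filling the missing slots with zeros.
dense = torch.ops.fbgemm.jagged_to_padded_dense(values, [offsets], [4], padding_value=0.0)
print(dense.shape)  # torch.Size([3, 4, 1])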

Comm Operators

  • [New] Enable Comm Ops in OSS GenAI (#4548)
  • [New] Comms for many2one/one2many (#4528)
  • [New] Support any EP in combine or split shuffling (#4649)
  • [New] Support non power-of-2 experts in combine or split shuffling (#4634)
  • [New] Enable index_shuffling with top-K > 1 (#4605)
  • [New] Enable index_shuffling with more experts (#4588)

Platform Support

CUDA 13 / Blackwell / GB200

  • [New] Enable CUDA 13 builds in OSS (#5143)
  • [New] Prepare FBGEMM_GPU for CUDA 13 builds (#5100)
  • [New] Enable CUDA 13 for GenAI builds (#4837)
  • [New] Add back cu129 support for 1.4.x release (#5)
  • [Improvement] Fix gen_ai:quantize_ops_gpu build for B200 (#4683)
  • [Improvement] Fix tuning cache for f8f8bf16_rowwise_grouped on SM100 (#4843)

ARM / AArch64

  • [New] Enable ARM builds (#4604, #4632)
  • [New] Enable Apple Silicon builds (#4673)
  • [New] Enable more tests and benchmarks on ARM64 (#4721)
  • [Improvement] Upgrade ARM CI instances (#4693)
  • [Improvement] Use armv8.2-a+fp16fml (#4942)
  • [Improvement] Use reduceAVX2 trivial implementation for aarch64 (#4677)
  • [Improvement] Check arm_neon_sve_bridge.h header (#4967)
  • [Improvement] Fix Arm64 OSS PyTorch build with FBGEMM (#4775)
  • [Improvement] Fix fp32 ARM builds (#4699)
  • [Improvement] Fix arm64 issues (#4711)

ROCm / AMD

  • [New] Support ROCm 7 in Nova (#4953)
  • [New] Add gfx950 to ROCm 7.0 (#4443)
  • [Improvement] Upgrade ROCm to 6.4.2 (#4876)
  • [Improvement] Fix quantize kernels on ROCm 6.4 (#4708)
  • [Improvement] Improve assertions on ROCm fp8_rowwise_grouped_gemm (#4812)
  • [Improvement] Invoke AMD specific kernel reorder_batched_ad_indices_kernel_vec (#4412)
  • [Improvement] Fix OSS performance on FP8 AMD kernels (#4462)
  • [Improvement] Optimize tbe_input_combine_with_length_cuda on AMD
  • [Improvement] Mitigate IG 500x seg-fault on AMD (#4554)
  • [Improvement] More hipify v2 fixes (#4947)
  • [Improvement] Fix ROCm build issues (#4902)
  • [Improvement] Add flags to compile CK GEMMs under PyTorch (#4878)

Build / CI Improvements and Better Engineering

  • [New] Remove old architectures from build to reduce build size (#5172)
  • [New] Reusable CI workflows (#4646)
  • [New] Enable USE_FBGEMM_GENAI (#4703)
  • [New] Add version compatibility checks on library load (#4875)
  • [Improvement] Extensive kernel migration to FBGEMM_LAUNCH_KERNEL pattern (#4751, #4738, #4724, #4692, #4662, #4618, #4595, #4572, #4562, #4556)
  • [Improvement] Migrate to Python 3.9 syntax (#4909)
  • [Improvement] Update to use C++17/C++20 features (#4558, #4590)
  • [Improvement] Modernize C++ code by clang-tidy (#4415, #4466, #4436, #4437)
  • [Improvement] Save binary size (#4900)
  • [Improvement] Decouple ATEN from Feature Gates Deps (#4965)
  • [Improvement] Support loading libraries when installing multiple FBGEMM targets (#4993)
  • [Improvement] General OSS fixes and script improvements
  • [Improvement] Add tbb as library dependency explicitly in CMake (#4859)
  • [Improvement] Modularize FBGEMM CMake Target Definitions (#4509, #4518, #4524, #4553)
  • [Improvement] Fix various build issues for different platforms

Tests and Benchmarks

  • [New] Add export trace to UVM and cache benchmark (#4777)
  • [New] Add tests to decode and padded jagged_vs_padded_kv (#4806)
  • [New] Unit tests for TBE API v1 compatibility (#4926)
  • [New] Unit tests for Full Adam (#4712, #4713)
  • [New] Unit tests for Partial Rowwise Adam (#4642)
  • [Improvement] BWD test improvement (#4981)
  • [Improvement] SSD TBE test cleanups (#4758)
  • [Improvement] Update TBE training benchmark (#4455)
  • [Improvement] Modify VBE bench to support SSD and trace export (#4658)
  • [Improvement] Support multiple total-k and total-m in quantize bench (#4890)
  • [Improvement] Re-enable decode tests (#4946)