Highlights
CUDA 13 and Blackwell Support
- Full CUDA 13 build support for OSS and GenAI components
- NVIDIA Blackwell (B200) architecture support with Cutlass FMHA kernels
- GB200 and H100 optimizations for CUTLASS grouped GEMM in pretraining
Table Batched Embedding (TBE) Improvements
- FP8 embedding weights support (backend and frontend)
- Feature score eviction with TTL-based and ID_COUNT trigger conditions
- TBE data configuration reporter for forward passes
- Streaming and tracking logic for SplitTableBatchedEmbeddingBagsCodegen
- ZCH identities support in raw embedding streamer
Optimizer Enhancements
- Full Adam optimizer support with state offloading
- Partial Rowwise Adam enabled in SSD TBE
- CPU support for rowwise adagrad with counter
- Multiple optimizer states streaming to backend
GenAI and GEMM Performance
- CUTLASS v4.2.1 upgrade with BF16 re-enabled
- CUTLASS grouped GEMM for Llama4x pretraining (fprop, dgrad, wgrad)
- Fast-simplicial attention implementation
- Sliding Window Attention (SWA) backward optimization
- Deterministic mode support in Cutlass attention kernels
- MXFP8 grouped GEMM with 2D-2D support
Quantization Improvements
- FP4 grouped API for torch with Blackwell FP4 refactoring
- MXFP4 quantization with inline PTX performance boost
- Quantization with min & max bounds support (CPU ops for n-bit, 4-bit, 2-bit, 8-bit)
- Symmetric quantization for prefill token-wise FP8
- AVX512-based int8 to bf16 dequantization
Platform Improvements
- ARM/AArch64 builds enabled (Apple Silicon, ARM64)
- ROCm 7 support with MI350X FP8 Triton patches
- ROCm 6.4.2 upgrade
- gfx950 added to ROCm 7.0
Better Engineering
- Extensive kernel migration to the FBGEMM_LAUNCH_KERNEL pattern
- Code refactoring and C++ modernization (C++17/C++20 features)
- Reduced build size by removing old architectures
- Build time optimizations
- Reusable CI workflows
Software Requirements
FBGEMM_GPU v1.4.0 has been tested and is known to work on the following setups:
- PyTorch: v2.9
- CUDA: v12.6[1], 12.8, 12.9[1]
- Python: v3.10, 3.11, 3.12, 3.13, 3.14t[1]
- ROCm: v6.3[1], 6.4[1]
[1] Available only via the --index-url option; see below for an installation example.
It is recommended to prepare an isolated environment, such as Conda and/or Docker, for installing and running FBGEMM_GPU; a minimal Conda setup is sketched below.
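As a minimal sketch (the environment name, Python version, and CUDA index URL below are illustrative choices, not requirements), a Conda-based setup could look like:

```bash
# Create and activate an isolated environment (name and versions are examples only)
conda create -n fbgemm_env python=3.12 -y
conda activate fbgemm_env

# Install a PyTorch build (v2.9 per the requirements above) matching the CUDA variant you intend to use
pip install torch --index-url https://download.pytorch.org/whl/cu128
```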
Note

The table below lists the CUDA architectures (compute capabilities) included in each build, per distribution channel:

| Default CUDA | Version | FBGEMM GPU (PyPI) | FBGEMM GPU (Nova) | GenAI (PyPI) | GenAI (Nova) |
|---|---|---|---|---|---|
| 12.8 | 1.4.0 | 8.0;9.0;10.0a [2] | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a |
| 12.8 | 1.4.1 (GenAI release) | N/A [3] | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a | 7.5;8.0;9.0a;10.0a;12.0a |
[2] Due to PyPI binary size limits, some CUDA architectures are dropped from the fbgemm_gpu release on PyPI.
[3] fbgemm_gpu 1.4.1 is not released on PyPI due to project size limits, as the 1.4.1 minor fix applies to GenAI only.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
```bash
# FBGEMM_GPU CUDA variant (only the default CUDA 12.8 variant is available on PyPI)
pip install fbgemm-gpu==1.4.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.4.0
```

Alternatively, it can be fetched from the PyTorch PIP index:
```bash
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cu128/
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cu129/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.4.0 --index-url https://download.pytorch.org/whl/cpu
```
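After installation, a quick sanity check is to import the package and confirm that the FBGEMM operators are registered; this one-liner is a minimal sketch that only verifies the import succeeds:

```bash
# Importing fbgemm_gpu after torch loads the native libraries and registers ops under torch.ops.fbgemm
python -c "import torch; import fbgemm_gpu; print(torch.ops.fbgemm)"
```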
Changes

Table Batched Embedding (TBE) Operators
For GPU
- [New] FP8 embedding weights support - backend (#4614) and frontend (#4615)
- [New] TBE data configuration reporter to TBE forward (#4672)
- [New] Tracking and streaming logic for SplitTableBatchedEmbeddingBagsCodegen (#4741)
- [New] ZCH identities to raw embedding streamer (#4765)
- [New] Direct write embedding method (#4800)
- [New] Add Inference Feature to Skip Pinned Memory Creation (#4924, #4966)
- [Improvement] Merged pool embedding optimization (#4985)
- [Improvement] Remove guard_size_oblivious from embedding_backward_split_meta_template (#4973)
- [Improvement] Monitoring metrics for DRAM cache performance (#4997)
- [Improvement] Add read load metrics (#4956)
- [Improvement] TBE reporter fixes (#4882)
For CPU
- [New] CPU support for rowwise adagrad with counter (#4986)
- [New] Scale_bias_last and quant_padding_float_type support for CPU dequant kernel (#4943)
- [Improvement] Skip AVX branches for EmbeddingSpMDM when in_type == BFLOAT16 && out_type == FLOAT (#4685)
- [Improvement] EmbeddingSpMDM8Bit_Sve support (#3911)
SSD Table Batched Embedding (TBE) Operators
- [New] Feature score eviction backend and frontend support (#4681)
- [New] TTL-based eviction in feature score eviction policy (#4752)
- [New] ID_COUNT eviction trigger condition (#4829)
- [New] Virtual table eviction policy (#4433)
- [New] Partial Rowwise Adam enabled in SSD TBE (#4525)
- [New] Functions to destruct RocksDB checkpoint from ReadOnlyEmbeddingKVDB (#4361)
- [New] Backend API to get KVZCH eviction metadata (#4594)
- [New] Multiple paths support in SSD TBE passed_in_path (#4690)
- [New] Eviction metadata tensor FQN (#4611)
- [New] Inference eviction interfaces and implementation (#4500, #4504)
- [Improvement] Emit eviction metrics per TBE (#4563)
- [Improvement] Update cache row dim calculation in TBE SSD (#4480)
- [Improvement] Simplify SplitState application for optimizers TBE SSD (#4492)
- [Improvement] KV embedding DRAM delta loading in predictor (#4438)
- [Improvement] Fix feature score eviction policy in different trigger mode (#4952)
- [Improvement] Fix loading weight with metaheader for eviction (#4769)
- [Improvement] Fix fetch eviction metadata bug (#4701)
Optimizer Support
- [New] Full Adam optimizer support with state offloading (#4702)
- [New] Multiple optimizer states streaming to backend (#4507)
- [New] Optimizer state offloading for partial rowwise adam (#4405, #4477)
- [New] Extended backend_return_whole_row to support multiple optimizer states (#4620)
- [New] API to set backend_return_whole_row (#4607)
- [Improvement] Expand split_optimizer_states() to support multiple optimizer states (#4495)
- [Improvement] Replace optimizer dtype() method with optimizer_state_dtypes dictionary (#4523)
- [Improvement] Handle fetch optimizer states for KV ZCH cases (#4498)
- [Improvement] Fix partial rowwise adam optimizer offloading memory alignment (#4645)
GenAI Support and Operators
Attention Kernels
- [New] Enable Cutlass Blackwell attention in OSS (#4793)
- [New] Fast-simplicial attention in fbgemm (#4797)
- [New] Enable left mask in Cutlass kernels (#4799)
- [New] Enable deterministic mode in Cutlass attention (#4840)
- [New] Add head_dim = 64 in B200 Attention (#4935)
- [New] Add softmax_scale to blackwell cutlass fmha (#4897)
- [New] Padded KV Partial Prefill Case (#4848)
- [Improvement] SWA attention backward optimization (#4808, #4795)
- [Improvement] Fix overflow in attention workspace calculation (#4939)
CUTLASS/GEMM Support
- [New] Enable CUTLASS grouped GEMM for pretraining wgrad on GB200 and H100 (#4886, #4913)
- [New] Enable CUTLASS grouped GEMM for Llama4x pretraining grad (#4856)
- [New] MXFP8 grouped GEMM enabled (#4710)
- [New] 2D-2D support for MXFP8 Grouped GEMM (#4816)
- [New] FP4 grouped API for torch (#4958)
- [New] Add output as an option in CUTLASS grouped GEMM (#4931)
- [New] DeepGemm Style Groupwise Group Gemm (#4464, #4365)
- [Improvement] Upgrade CUTLASS to v4.2.1 (#4964)
- [Improvement] BF16 re-enabled with Cutlass update (#5013)
- [Improvement] Modernize FP8 Blockwise GEMM (#5002)
- [Improvement] Enable control of number of SMs in cutlass grouped GEMM (#4991)
- [Improvement] Improve general shape performance for dgrad/fprop cutlass grouped gemm (#4988, #4989)
- [Improvement] Optimize wgrad CUTLASS grouped gemm (#4891)
- [Improvement] Improve CUTLASS GMM for Llama4x pretraining fprop (#4855)
- [Improvement] MXFP8 Grouped GEMM tuning (#4821)
- [Improvement] FP4 grouped refactor (#4847, #4934)
- [Improvement] Migrate to new device TMA API for grouped_gemm.py (#4866)
Triton GEMM Support
- [New] Triton FP8 GEMM to produce fp16 output (#4719)
- [New] MI350X FP8 Triton patch (#4889)
- [Improvement] Zen LLATTE CoFormer Triton FP8 tune (#4951)
- [Improvement] CMF v0 FP8 Triton tune (#4873)
- [Improvement] Align output type in Triton FP8 GEMM meta kernel (#4698)
- [Improvement] Tune FP8 GEMM output and bias type to reduce NE diffs (#4638)
- [Improvement] Triton early prune config fix (#4917)
- [Improvement] Auto-tune for 500x shapes to boost flops (#4626, #4635)
KV Cache Support
- [New] Enable building KV cache ops in OSS (#4578)
- [Improvement] Split dequant functions from kv_cache.cu (#4787)
- [Improvement] Pad dequantized paged FP8 KV with zeros (#4780)
- [Improvement] Remove potentially leaky debug values from convert_e4m3fn_kv_cache_to_e4m3fnuz_inplace (#4748)
Quantization Operators
FP8/FP4 Quantization
- [New] Symmetric quantization to prefill token-wise FP8 (#4661)
- [Improvement] Restrict to FP8 and patch 4.2.0 release changes (#4898)
- [Improvement] Remove e5m2 from f8f8bf16_rowwise and f8f8bf16_rowwise_batched (#4907, #4908)
- [Improvement] Fix scaled input issue (#4884)
MX4/MXFP Quantization
- [New] Enable Triton-MTIA MX4 quantization (#4619)
- [New] Boost MXFP4 quantization with inline PTX (#4694)
- [Improvement] Mimic MX4 in NVFP4 native (#4970)
- [Improvement] Minor dtype change to MX4 quantize (#4571)
- [Improvement] Fix MXFP8 test due to OSS FBGEMM version mismatch (#4749)
Other Quantization
- [New] Quantization with min & max bounds support - CPU ops for n-bit (#4860)
- [New] Quantization with min & max bounds support - 4-bit & 2-bit on X86-64 (#4833)
- [New] Quantization with min & max bounds support - float to 8-bit on X86-64 (#4790)
- [New] AVX512-based int8 to bf16 dequantization (#4912)
- [New] Add ability to pad rowwise quantized tensors (#4877)
- [New] Stacked quantization with RMS and SiLU fusion (#4487)
- [New] Groupwise quantization kernel (#4439)
- [Improvement] Improve aarch64 Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf (#4995)
- [Improvement] Expose rounding_mode in quantization for performance (#4862)
- [Improvement] Add unit testing to quantize_qkv_per_head (#4789)
Sparse Operators
- [New] Add block_bucketize_2d_weights kernel (#4778)
- [New] Support 2D weights in permute kernel (#4723)
- [New] Add kernel to accumulate weight value and counts based on jagged unique inverse indices (#4654)
- [Improvement] Use guard_or_true instead of guard_size_oblivious in sparse_ops.py (#4974)
- [Improvement] Fix sparse_permute_1d kernel to support double dtype (#4969)
- [Improvement] Correctly handle non-contiguous permute tensors in sparse permute kernel (#4593)
- [Improvement] Make length check on mask numel in masked_select_jagged_1d optional (#4621)
- [Improvement] Handle M == 0 but N != 0 for scatter_add (#4731)
- [Improvement] Support general num experts in scatter_add_padded_tokens (#4627)
- [Improvement] Remove assertion check inside scatter_add when in_tokens and out_tokens have different shape (#4639)
Comm Operators
- [New] Enable Comm Ops in OSS GenAI (#4548)
- [New] Comms for many2one/one2many (#4528)
- [New] Support any EP in combine or split shuffling (#4649)
- [New] Support non power-of-2 experts in combine or split shuffling (#4634)
- [New] Enable index_shuffling with top-K > 1 (#4605)
- [New] Enable index_shuffling with more experts (#4588)
Platform Support
CUDA 13 / Blackwell / GB200
- [New] Enable CUDA 13 builds in OSS (#5143)
- [New] Prepare FBGEMM_GPU for CUDA 13 builds (#5100)
- [New] Enable CUDA 13 for GenAI builds (#4837)
- [New] Add back cu129 support for 1.4.x release (#5)
- [Improvement] Fix gen_ai:quantize_ops_gpu build for B200 (#4683)
- [Improvement] Fix tuning cache for f8f8bf16_rowwise_grouped on SM100 (#4843)
ARM / AArch64
- [New] Enable ARM builds (#4604, #4632)
- [New] Enable Apple Silicon builds (#4673)
- [New] Enable more tests and benchmarks on ARM64 (#4721)
- [Improvement] Upgrade ARM CI instances (#4693)
- [Improvement] Use armv8.2-a+fp16fml (#4942)
- [Improvement] Use reduceAVX2 trivial implementation for aarch64 (#4677)
- [Improvement] Check arm_neon_sve_bridge.h header (#4967)
- [Improvement] Fix Arm64 OSS PyTorch build with FBGEMM (#4775)
- [Improvement] Fix fp32 ARM builds (#4699)
- [Improvement] Fix arm64 issues (#4711)
ROCm / AMD
- [New] Support ROCm 7 in Nova (#4953)
- [New] Add gfx950 to ROCm 7.0 (#4443)
- [Improvement] Upgrade ROCm to 6.4.2 (#4876)
- [Improvement] Fix quantize kernels on ROCm 6.4 (#4708)
- [Improvement] Improve assertions on ROCm fp8_rowwise_grouped_gemm (#4812)
- [Improvement] Invoke AMD specific kernel reorder_batched_ad_indices_kernel_vec (#4412)
- [Improvement] Fix OSS performance on FP8 AMD kernels (#4462)
- [Improvement] Optimize tbe_input_combine_with_length_cuda on AMD
- [Improvement] Mitigate IG 500x seg-fault on AMD (#4554)
- [Improvement] More hipify v2 fixes (#4947)
- [Improvement] Fix ROCm build issues (#4902)
- [Improvement] Add flags to compile CK GEMMs under PyTorch (#4878)
Build / CI Improvements and Better Engineering
- [New] Remove old architectures from build to reduce build size (#5172)
- [New] Reusable CI workflows (#4646)
- [New] Enable USE_FBGEMM_GENAI (#4703)
- [New] Add version compatibility checks on library load (#4875)
- [Improvement] Extensive kernel migration to the FBGEMM_LAUNCH_KERNEL pattern (#4751, #4738, #4724, #4692, #4662, #4618, #4595, #4572, #4562, #4556)
- [Improvement] Migrate to Python 3.9 syntax (#4909)
- [Improvement] Update to use C++17/C++20 features (#4558, #4590)
- [Improvement] Modernize C++ code by clang-tidy (#4415, #4466, #4436, #4437)
- [Improvement] Save binary size (#4900)
- [Improvement] Decouple ATEN from Feature Gates Deps (#4965)
- [Improvement] Support loading libraries when installing multiple FBGEMM targets (#4993)
- [Improvement] General OSS fixes and script improvements
- [Improvement] Add tbb as library dependency explicitly in CMake (#4859)
- [Improvement] Modularize FBGEMM CMake Target Definitions (#4509, #4518, #4524, #4553)
- [Improvement] Fix various build issues for different platforms
Tests and Benchmarks
- [New] Add export trace to UVM and cache benchmark (#4777)
- [New] Add tests to decode and padded jagged_vs_padded_kv (#4806)
- [New] Unit tests for TBE API v1 compatibility (#4926)
- [New] Unit tests for Full Adam (#4712, #4713)
- [New] Unit tests for Partial Rowwise Adam (#4642)
- [Improvement] BWD test improvement (#4981)
- [Improvement] SSD TBE test cleanups (#4758)
- [Improvement] Update TBE training benchmark (#4455)
- [Improvement] Modify VBE bench to support SSD and trace export (#4658)
- [Improvement] Support multiple total-k and total-m in quantize bench (#4890)
- [Improvement] Re-enable decode tests (#4946)