Releases: flashinfer-ai/flashinfer

Release v0.6.0

08 Jan 02:50
4230a48

What's Changed

  • feat: Add backend='auto' to mm_fp4 and enable autotune for backend='cudnn' by @bkryu in #1979
  • fix: Fix bench_mm_fp8.py by @bkryu in #2129
  • feat: Enable API Logging for Better Debugging POC by @bkryu in #2108
  • fix: add a check for int32 indices in sampling.py by @raayandhar in #2127
  • update autotuner input tensor random range by @jiahanc in #2116
  • enable xqa speculative decoding by @qsang-nv in #2105
  • Add custom communicator for trtllm_mnnvl_ar by @wenscarl in #2056
  • fix: DeepSeek activation uninitialized data by @nekorobov in #2128
  • chore: Update CODEOWNERS by @flashinfer-bot in #2135
  • bugfix: fix unittest error introduced in #2056 by @yzh119 in #2136
  • fix flaky xqa test by @qsang-nv in #2126
  • fix: some bugs of headDim 256 trtllm-gen fmha kernels. by @PerkzZheng in #2137
  • fix(trtllm): reset negative strideBatch to 0 for ragged KV layout to … by @YAMY1234 in #2134
  • feat: add trtllm-gen per-tensor sparseMla kernels. by @PerkzZheng in #2138
  • Use global TuningConfig, to fix memory leak caused by AutoTuner LRU cache and dynamic lambda TuningConfig by @juju812 in #2140
  • feat: add seed offset args to sampler to allow cuda graph support by @ksukrit in #2132
  • ci: Reduce test time by moving compilation off-line by @kahyunnam in #2089
  • feat: TRTLLM FMHAv2 backend for ctx attention by @jimmyzho in #2142
  • refactor: pass hopper deepgemm include directory through python by @yzh119 in #2090
  • bugfix: add driver support to CUPTI benchmark function, issue #2145 by @nv-yunzheq in #2154
  • Bump tvm ffi version to 0.1.4 by @cyx-6 in #2155
  • Update Docker CI tags to 20251202-23ff744 by @flashinfer-bot in #2158
  • misc: Label APIs for Logging by @bkryu in #2153
  • Update nvidia-cutlass-dsl version to 4.3.1 by @aleozlx in #2161
  • chore: Update CODEOWNERS by @flashinfer-bot in #2152
  • feat: C++ side tensor validation by @raayandhar in #2160
  • Update Docker CI tags to 20251203-4efb7bb by @flashinfer-bot in #2164
  • ci: Install CUDA version specified torch first during container building. by @bkryu in #2167
  • fix xqa mha_sm90.cu by @qsang-nv in #2157
  • Update Docker CI tags to 20251203-1e15fed by @flashinfer-bot in #2172
  • enable sm103 moe dsl backend by @aleozlx in #2149
  • ci: Use stable Torch Release for cu130 by @bkryu in #2174
  • tiny upd mm_fp4 docstring by @b8zhong in #2177
  • fix: compile flags for trtllm fmha_v2 by @jimmyzho in #2175
  • Fix/dsl smem query by @aleozlx in #2178
  • Update Docker CI tags to 20251204-cdc5fb7 by @flashinfer-bot in #2176
  • feat: MxInt4 x Bf16 TRT-LLM Gen MoE support by @nekorobov in #2159
  • refactor: Move mla code from decode.py to mla.py and add to documentation by @bkryu in #2163
  • Fix gemm allreduce two shot by @aleozlx in #2171
  • Update Docker CI tags to 20251205-54c1678 by @flashinfer-bot in #2179
  • Rename noauxtc to fused_topk_deepseek by @nv-yunzheq in #2181
  • refactor: update fa3 codebase and fix hopper unittest [part 1] by @yzh119 in #2111
  • Add data type check for deepseek fp4 moe by @samuellees in #2165
  • benchmark: Make use_cupti the default in microbenchmarks. by @bkryu in #2180
  • ci: Specify MPI implementation to mpich by @bkryu in #2182
  • Update Docker CI tags to 20251206-185d63a by @flashinfer-bot in #2184
  • test: Skip sm90 test in test_jit_warmup.py if not on sm90 by @bkryu in #2189
  • ci: Update sm12X minimum cuda capability to 12.9 in aot.py by @bkryu in #2188
  • Super tiny fix version by @fzyzcjy in #2199
  • docs: Document CUDA version support in README and installation page by @bkryu in #2197
  • docs: Fix inaccurate API docstrings for attention prefill by @bkryu in #2196
  • feat: unit-test and api change, w4a8 grouped-gemm fused MoE for SM90 by @jimmyzho in #2193
  • Permute page table in benchmarking by @jhjpark in #2194
  • Fix for moe on sm110 by @jhalabi-nv in #2190
  • chore: update authorized codeowners by @jimmyzho in #2210
  • perf: bunch of features and optimizations for top-k (sampling + sparse attention) by @yzh119 in #2119
  • Refactor trtllm_mnnvl_allreduce by @timlee0212 in #2118
  • chore: Update CODEOWNERS by @flashinfer-bot in #2186
  • feat: support more head dim in RoPE kernel by @raayandhar in #2109
  • Port TRT-LLM communication kernels to flashinfer by @djns99 in #2102
  • cicd: Add sanity test script by @kahyunnam in #2212
  • feat: add memcpy and memset to CUPTI timing method by @nv-yunzheq in #2223
  • Added an initial implementation of Q and KV Cache in fp8 and to use t… by @Anerudhan in #2035
  • feat: Support unpadded output hidden size for trtllm_fp4_block_scale_moe by @elvischenv in #2217
  • fix: Eliminate the usage of CUDA ARCH macro in host function. by @timlee0212 in #2228
  • misc: support checks for gemm by @jimmyzho in #2214
  • feat: Cold L2 Cache Benchmarking with Rotating Buffers by @bkryu in #2213
  • Move the run function definition out of BatchedGemmInterface by @jhalabi-nv in #2211
  • make DeepGEMM swapAB available for linear gemm SM90 by @katec846 in #2131
  • misc: upgrade tvm-ffi dependency to 0.1.6 by @yzh119 in #2229
  • A unified API for the MNNVL and single-node/multi-GPU AllReduce kernels. by @nvmbreughe in #2130
  • Update Docker CI tags to 20251217-f059241 by @flashinfer-bot in #2231
  • Rebase FP8 SM100 Cutlass FMHA Attention to main (original PR#1238) by @pavanimajety in #2047
  • [feat] Integrate SGLang concat_mla_k kernel into flashinfer by @jiahanc in #2237
  • fix: add DeepSeek routing for Bf16xBf16 and MxIntxBf16 TRT-LLM Gen MoE by @nekorobov in #2234
  • fix: Fix compilation with GCC 11 by @dbari in #2242
  • feat: RMSNorm/Fused RMSNorm + FP8 Quantization kernels by @BLaZeKiLL in #2243
  • feat: further optimize top-k and add fused top-k page construction kernels for DSA by @yzh119 in #2215
  • test: Fix MNNVL tests to skip when container lacks SYS_PTRACE capability by @bkryu in #2245
  • Remove cudaStreamSynchronize from gemm_groupwise_sm120.cuh for CUDA graph compatibility by @Copilot in #2244
  • feat: support variable sequence length in decode kernel of trtllm-gen attention by @yaoyaoding in #2125
  • feat: Fused RMSNorm + FP4 Quantization Kernels in CuTe-DSL by @bkryu in #2233
  • A...
Nightly Release v0.5.3-20260107

07 Jan 05:20
edb37cd

Pre-release

Automated nightly build for version 0.5.3 (dev20260107)
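These nightly tags follow PEP 440 dev-release versioning: a build labeled 0.5.3 (dev20260107) installs as version 0.5.3.dev20260107, which sorts after earlier nightlies by date suffix but before the final 0.5.3 and before the v0.6.0 release above. A minimal sketch of that ordering, assuming the widely used packaging library is available (it is not part of the flashinfer release itself):

```python
from packaging.version import Version

# PEP 440: a ".devN" release precedes its corresponding final release.
nightly_new = Version("0.5.3.dev20260107")
nightly_old = Version("0.5.3.dev20260106")

assert nightly_old < nightly_new           # nightlies order by date suffix
assert nightly_new < Version("0.5.3")      # every 0.5.3 nightly precedes 0.5.3
assert Version("0.5.3") < Version("0.6.0") # and all precede the 0.6.0 release
```

This is why installing a nightly requires opting into pre-releases: a resolver that prefers the highest stable version will never pick a .dev build on its own.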

Nightly Release v0.5.3-20260106

06 Jan 05:19
5a8bcf6

Pre-release

Automated nightly build for version 0.5.3 (dev20260106)

Nightly Release v0.5.3-20260105

05 Jan 05:32
30203fb

Pre-release

Automated nightly build for version 0.5.3 (dev20260105)

Nightly Release v0.5.3-20260104

04 Jan 05:33
30203fb

Pre-release

Automated nightly build for version 0.5.3 (dev20260104)

Nightly Release v0.5.3-20260103

03 Jan 05:04
b09fbcb

Pre-release

Automated nightly build for version 0.5.3 (dev20260103)

Nightly Release v0.5.3-20260102

02 Jan 05:18
6f1624c

Pre-release

Automated nightly build for version 0.5.3 (dev20260102)

Nightly Release v0.5.3-20260101

01 Jan 05:28
747b0cb

Pre-release

Automated nightly build for version 0.5.3 (dev20260101)

Nightly Release v0.5.3-20251231

31 Dec 05:23
835a015

Pre-release

Automated nightly build for version 0.5.3 (dev20251231)

Nightly Release v0.5.3-20251230

30 Dec 05:16
790321b

Pre-release

Automated nightly build for version 0.5.3 (dev20251230)