Add split reduction kernel for Flash Attention decoding #671

wuxun-zhang · 2025-12-18T03:12:29Z

Description

This PR adds a new split reduction kernel for Flash Attention, which benefits long-context-length scenarios.

Note: The code has not been cleaned up yet, but it is ready for testing.
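For readers new to the idea, here is a minimal pure-Python sketch of split-KV decoding followed by a reduction step. It is not the kernel code from this PR. The function names (`attend_partial`, `reduce_splits`, `split_kv_attention`) are hypothetical, and the sketch assumes the usual formulation in which each split returns an unnormalized output together with its score maximum and exp-sum, and the reduction rescales every partial to the global maximum.

```python
import math

def attend_partial(q, keys, values, scale):
    """Attention over one KV split: (unnormalized output, score max, exp-sum)."""
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, m, l

def reduce_splits(partials):
    """Combine per-split partials by rescaling each one to the global max."""
    m_global = max(m for _, m, _ in partials)
    dim = len(partials[0][0])
    acc = [0.0] * dim
    l_global = 0.0
    for out, m, l in partials:
        alpha = math.exp(m - m_global)  # correction factor for this split
        l_global += alpha * l
        for d in range(dim):
            acc[d] += alpha * out[d]
    return [a / l_global for a in acc]

def split_kv_attention(q, keys, values, num_splits):
    """Single-query (decoding) attention computed in num_splits KV chunks."""
    scale = 1.0 / math.sqrt(len(q))
    chunk = -(-len(keys) // num_splits)  # ceiling division
    partials = [
        attend_partial(q, keys[i:i + chunk], values[i:i + chunk], scale)
        for i in range(0, len(keys), chunk)
    ]
    return reduce_splits(partials)
```

With `num_splits=1` this reduces to ordinary softmax attention, so the split and non-split results should agree up to floating-point rounding. In a decoding workload the query length is 1, so splitting along the KV dimension is the main way to expose parallelism when the context is long.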

What's newly added in this PR

new FMHAFwdKernel named XeFMHAFwdSplitKVKernel
new split reduce kernel named ReduceSplitK
new tile scheduler named XeReduceSplitKTileScheduler
support for variable-length sequences
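As an illustration of how KV splits and variable-length sequences might interact, the sketch below computes per-sequence, per-split ranges into a packed KV buffer. Everything in it is an assumption made for illustration: the helper name `kv_split_ranges`, the packed (concatenated) layout, and the `max_splits` parameter, which stands in for the cap introduced by the "limit num kv splits to wg size" commit. The actual scheduler in this PR may partition the work differently.

```python
def kv_split_ranges(seq_lens_kv, num_kv_splits, max_splits):
    """Return, per sequence, a list of [start, end) ranges into a packed KV buffer.

    Hypothetical helper: the number of splits is clamped to max_splits, and
    short sequences may produce empty ranges (start == end).
    """
    splits = max(1, min(num_kv_splits, max_splits))
    ranges = []
    offset = 0  # start of the current sequence in the packed buffer (assumed layout)
    for seq_len in seq_lens_kv:
        chunk = -(-seq_len // splits) if seq_len > 0 else 0  # ceiling division
        per_seq = []
        for s in range(splits):
            start = min(s * chunk, seq_len)
            end = min(start + chunk, seq_len)
            per_seq.append((offset + start, offset + end))
        ranges.append(per_seq)
        offset += seq_len
    return ranges
```

For example, `kv_split_ranges([5, 2], 4, 8)` gives the second sequence two empty ranges, because it has fewer KV positions than splits. A reduction step has to skip or neutralize such empty splits, since a split with no keys has no finite score maximum.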

Limitation

decoding only
GQA ratio (num_heads_q/num_heads_kv) <= 8
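To make the GQA limitation concrete, here is a small sketch of the ratio check and of what "packing" a query-head group can look like in a decoding step. The helper names are hypothetical, and the head mapping (query head `h` uses KV head `h // ratio`) is the common convention, assumed here rather than taken from this PR. The nested loops in `packed_scores` stand in for a single matrix product over the stacked query rows.

```python
def gqa_group_size(num_heads_q, num_heads_kv, max_ratio=8):
    """Return num_heads_q / num_heads_kv, enforcing the ratio limit stated above."""
    if num_heads_q % num_heads_kv != 0:
        raise ValueError("num_heads_q must be a multiple of num_heads_kv")
    ratio = num_heads_q // num_heads_kv
    if ratio > max_ratio:
        raise ValueError(f"GQA ratio {ratio} exceeds the supported maximum {max_ratio}")
    return ratio

def packed_scores(q_heads, keys, num_heads_kv, kv_head):
    """Score all query heads that share kv_head against that head's keys.

    q_heads holds one query vector per query head (decoding: one token each).
    The group's vectors are stacked as rows of a single Q tile, so one
    (ratio x head_dim) by (head_dim x num_keys) product covers the whole group.
    """
    ratio = gqa_group_size(len(q_heads), num_heads_kv)
    group = q_heads[kv_head * ratio:(kv_head + 1) * ratio]  # assumed head mapping
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in group]
```

Because each query head contributes a single row during decoding, stacking the group turns several one-row products into one taller tile, which is the motivation usually given for this kind of packing.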

Type

Bug - [x] Feature - [ ] Performance - [ ] Refactor

Testing

Tests pass - [ ] Xe12 - [x] Xe20

Performance

Metric	Before	After

References

Fixes #

Checklist

Copyright - [ ] Co-pilot Review - [ ] Deprecated APIs not used


wuxun-zhang added 12 commits December 17, 2025 17:27

Add split reduction kernel

3c89889

fix return type

1f06931

debugging

7e784be

accuracy passed performance WIP

GQA packing

015bba4

each work group handles whole group query heads and packing group query heads into single MMA call

fix NaN issue

4d6ed2f

remove redundant barrier

e16bc34

Add variable length support

f516fef

GQA packing as default

dbab7fb

limit num kv splits to wg size

f337155

fix tile shceduler

61649ce

fix return order

386b288

Merge remote-tracking branch 'origin/main' into wuxun/split-reduction-kernel

2ad4764

wuxun-zhang force-pushed the wuxun/split-reduction-kernel branch from 11ab8d0 to 2ad4764 Compare December 22, 2025 04:05

extend to all head_dim

c126c71
