@wuxun-zhang wuxun-zhang commented Dec 18, 2025

Description

This PR adds a new split-reduction kernel for flash attention, which benefits long-context scenarios.

Note: The code has not been cleaned up yet, but it is ready for testing.

What's newly added in this PR

  • new FMHAFwdKernel named XeFMHAFwdSplitKVKernel
  • new split reduce kernel named ReduceSplitK
  • new tile scheduler named XeReduceSplitKTileScheduler
  • support variable length
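
The description does not spell out the math behind the split and reduce steps, so the following is only a minimal Python sketch of the usual split-KV ("flash decoding") idea, not the code in this PR. It assumes that each split computes an unnormalized partial output together with its running maximum and exponential sum, and that the reduce step rescales and merges those partials. All function names here (`attend_chunk`, `reduce_splits`, `split_kv_attention`, `reference_attention`) are hypothetical and do not correspond to XeFMHAFwdSplitKVKernel or ReduceSplitK.

```python
import math


def attend_chunk(q, keys, values):
    """Attention of one query vector over one KV chunk.

    Returns (unnormalized output, max score, sum of exponentials), which is
    the per-split state a reduce step needs. Hypothetical helper.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return acc, m, l


def reduce_splits(partials):
    """Merge per-split partial results into the final attention output."""
    m_global = max(m for _, m, _ in partials)
    dim = len(partials[0][0])
    out = [0.0] * dim
    denom = 0.0
    for acc, m, l in partials:
        # Rescale each split from its local max to the global max.
        rescale = math.exp(m - m_global)
        denom += rescale * l
        for d in range(dim):
            out[d] += rescale * acc[d]
    return [o / denom for o in out]


def split_kv_attention(q, keys, values, num_splits):
    """Split the KV sequence into chunks, attend per chunk, then reduce."""
    n = len(keys)
    chunk = (n + num_splits - 1) // num_splits  # ceil division
    partials = [
        attend_chunk(q, keys[start:start + chunk], values[start:start + chunk])
        for start in range(0, n, chunk)
    ]
    return reduce_splits(partials)


def reference_attention(q, keys, values):
    """Unsplit softmax attention, used only to check the split version."""
    acc, _, l = attend_chunk(q, keys, values)
    return [a / l for a in acc]
```

In a decoding workload there is a single query token per sequence, so splitting along the KV length is the main way to expose parallelism for long contexts. The extra reduce pass is the cost of that parallelism.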

Limitation

  • decoding only
  • GQA ratio (num_heads_q/num_heads_kv) <= 8
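
The two limitations above can be expressed as a small dispatch guard. This is a hedged sketch only: the function name `split_kv_supported` is hypothetical, and it assumes that "decoding" means exactly one query token per sequence and that the number of query heads is a multiple of the number of KV heads. Neither assumption is stated in the PR.

```python
MAX_GQA_RATIO = 8  # limit stated in the PR description


def split_kv_supported(seq_len_q, num_heads_q, num_heads_kv):
    """Return True if a problem fits the stated limitations (hypothetical check)."""
    # Assumption: query heads must divide evenly across KV heads.
    if num_heads_kv <= 0 or num_heads_q % num_heads_kv != 0:
        return False
    # Assumption: "decoding only" means a single query token per sequence.
    is_decode = seq_len_q == 1
    gqa_ratio = num_heads_q // num_heads_kv
    return is_decode and gqa_ratio <= MAX_GQA_RATIO
```

For example, 32 query heads over 8 KV heads gives a ratio of 4 and passes, while 64 query heads over 4 KV heads gives a ratio of 16 and does not.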

Type

  • Bug
  • [x] Feature
  • [ ] Performance
  • [ ] Refactor

Testing

  • Tests pass
  • [ ] Xe12
  • [x] Xe20

Performance

Metric Before After

References

Fixes #

Checklist

  • Copyright
  • [ ] Co-pilot Review
  • [ ] Deprecated APIs not used

@wuxun-zhang wuxun-zhang force-pushed the wuxun/split-reduction-kernel branch from 11ab8d0 to 2ad4764 Compare December 22, 2025 04:05