Support for cached KV and paged KV in the New Flash Attention Kernel #661
base: main
Conversation
Pull request overview
This PR adds support for cached KV and paged KV cache to the Flash Attention implementation for Intel Xe GPUs. The changes enable more efficient memory management for attention mechanisms by allowing:
- Cached KV support: Appending new KV tokens to existing cached tokens
- Paged KV support: Non-contiguous memory layout for KV cache using a page table mapping
Key changes include:
- New command-line options: --seq_len_kv_cache, --use_paged_kv, --page_size, and --verify
- Extension of the problem shape to include the cached KV sequence length
- Page table infrastructure for mapping logical pages to physical pages (see the sketch after this list)
- Mainloop modifications to handle both cached and new KV tensors, with optional paging
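As a rough illustration of what the page-table mapping does (this is not code from the PR; names such as `PagedKVView`, `page_table`, and `physical_row` are hypothetical), a logical KV token position can be translated to a row in a non-contiguous physical KV pool like this:

```cpp
#include <cstdint>

// Hypothetical sketch of a paged KV lookup: page_table[b * max_pages_per_seq + p]
// holds the physical page backing logical page p of batch b, and each page
// stores page_size consecutive KV tokens.
struct PagedKVView {
  const int32_t* page_table;        // flattened [num_batches, max_pages_per_seq]
  int            max_pages_per_seq;
  int            page_size;         // tokens per physical page

  // Physical row in the KV pool that stores logical token kv_idx of batch b.
  int64_t physical_row(int b, int kv_idx) const {
    int logical_page  = kv_idx / page_size;
    int in_page       = kv_idx % page_size;
    int physical_page = page_table[b * max_pages_per_seq + logical_page];
    return static_cast<int64_t>(physical_page) * page_size + in_page;
  }
};
```

When paging is disabled, the cached KV is addressed contiguously, so this lookup presumably reduces to an identity mapping.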
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| examples/06_bmg_flash_attention/xe_fmha_fwd_runner.hpp | Adds command-line options, page table initialization, verification logic for cached/paged KV, and extends problem shape to 8 dimensions |
| applications/flash_attention_v2/kernel/xe_fmha_fwd_kernel.hpp | Adds K_cache and V_cache tensors to kernel arguments, extends the sequence length shape to a 3D tuple, computes offsets for cached tensors |
| applications/flash_attention_v2/collective/xe_fmha_fwd_mainloop.hpp | Adds a PagedKV template parameter, implements page table lookup for physical tile indexing, handles prefetch/copy for both cached and new KV tensors (see the sketch after this table) |
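The last table row summarizes the mainloop change. A simplified, self-contained sketch of the control flow (not the actual CUTLASS mainloop; `for_each_kv_tile`, `process_cached_tile`, and `process_new_tile` are stand-ins for the real prefetch/copy and GEMM steps) could look like:

```cpp
#include <algorithm>

// Hypothetical sketch: the KV sequence seen by attention is the cached part
// (seq_len_kv_cache tokens, possibly paged) followed by the newly appended
// part (seq_len_kv tokens), so the mainloop walks two tile streams in order.
template <int TileKV, class CachedFn, class NewFn>
void for_each_kv_tile(int seq_len_kv_cache, int seq_len_kv,
                      CachedFn&& process_cached_tile, NewFn&& process_new_tile) {
  // Tiles taken from K_cache / V_cache; their physical addresses may go
  // through the page table when paging is enabled.
  for (int kv0 = 0; kv0 < seq_len_kv_cache; kv0 += TileKV)
    process_cached_tile(kv0, std::min(TileKV, seq_len_kv_cache - kv0));

  // Tiles taken from the new K / V tensors appended after the cache.
  for (int kv0 = 0; kv0 < seq_len_kv; kv0 += TileKV)
    process_new_tile(kv0, std::min(TileKV, seq_len_kv - kv0));
}
```

Typically all cached tokens precede the new query tokens, so a causal mask only constrains tiles taken from the new K/V.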
Description
This PR introduces support for cached KV and paged KV in the New Flash Attention Kernel.
Type
[x] Cached KV on fixed sequence lengths with multi-batch and GQA support (Good accuracy and performance).
[x] Paged KV (non-contiguous) on fixed sequence lengths with multi-batch and GQA support (Good accuracy and performance).
[x] Cached KV/Paged KV on variable sequence lengths (Good accuracy and performance).
[x] Cached KV/Paged KV with Causal Mask enabled, supporting multi-batch and GQA (Good accuracy and performance).
These features compose: the most demanding configuration (variable sequence lengths, cached plus paged KV, causal masking, multi-batch, and GQA) runs with good accuracy and performance. Example command:
./examples/06_bmg_flash_attention/06_xe_fmha_fwd_prefill_bfloat16_t_hdim128 --iterations=1 --batch=2 --num_heads_q=32 --seq_len_kv=1024 --seq_len_qo=1024 --num_heads_kv=8 --varlen --seq_len_kv_cache=512 --is_causal --use_paged_kv
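For reference (not the runner's actual code; make_page_table and its shuffling policy are hypothetical), a host-side page table for a run like the one above could be built by splitting each batch's cached sequence into page_size-token pages and scattering them over the physical pool:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical sketch: build a page table mapping logical pages to shuffled
// physical pages, so each batch's cached KV ends up non-contiguous in memory.
// page_table[b * pages_per_seq + p] = physical page backing logical page p of batch b.
std::vector<int32_t> make_page_table(int batch, int seq_len_kv_cache, int page_size) {
  int pages_per_seq = (seq_len_kv_cache + page_size - 1) / page_size;
  std::vector<int32_t> page_table(batch * pages_per_seq);
  std::iota(page_table.begin(), page_table.end(), 0);      // identity mapping first
  std::mt19937 rng(0);
  std::shuffle(page_table.begin(), page_table.end(), rng); // then scatter the pages
  return page_table;
}
```

The runner's --verify option can then be used to check the cached/paged run against a reference computation.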