
Conversation


maxyanghu commented Jan 23, 2026

This branch adds cuDNN backend support for ViT (Vision Transformer) attention in multimodal models like Qwen2.5-VL and Qwen3-VL.

Key File Changes

1. New FlashInfer/cuDNN Wrapper (vllm/v1/attention/ops/vit_attn_wrappers.py)

Added flashinfer_wrapper and vit_flashinfer_wrapper functions that:

  • Call cudnn_batch_prefill_with_kv_cache from FlashInfer
  • Handle the special cu_seqlens format (3x longer, containing batch_offsets_qk, batch_offsets_v, batch_offsets_o)
  • Support both 3D and 4D tensor inputs (with automatic reshape)
  • Are registered as a custom op for torch.compile compatibility (a minimal sketch of the tensor handling follows below)
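
A minimal sketch of the tensor plumbing described above, with illustrative helper names (the real wrapper then passes these pieces, plus the workspace buffer, to FlashInfer's cudnn_batch_prefill_with_kv_cache; its exact argument list is not reproduced here):

```python
import torch

def split_cudnn_cu_seqlens(cu_seqlens: torch.Tensor):
    """Split the packed 3-section cu_seqlens into Q/K, V, and O batch offsets."""
    n = cu_seqlens.shape[0] // 3
    return cu_seqlens[:n], cu_seqlens[n:2 * n], cu_seqlens[2 * n:]

def flatten_if_4d(x: torch.Tensor) -> torch.Tensor:
    """Accept 3D (tokens, heads, head_dim) or 4D (batch, seq, heads, head_dim) input."""
    if x.dim() == 4:
        # Collapse (batch, seq) into a single token dimension before the kernel call.
        return x.reshape(-1, x.shape[-2], x.shape[-1])
    return x
```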

2. MMEncoderAttention Updates (vllm/model_executor/layers/attention/mm_encoder_attention.py)

  • Added workspace_buffer parameter for cuDNN backend
  • New _forward_flashinfer() method
  • Updated forward_cuda() to dispatch to FlashInfer when the FLASHINFER backend is selected (see the dispatch sketch below)
  • Added a sequence_lengths parameter throughout the call chain
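
A rough illustration of the dispatch described above; everything other than the forward_cuda()/_forward_flashinfer() names is a placeholder, not the actual MMEncoderAttention code:

```python
from enum import Enum, auto

class _Backend(Enum):
    TORCH_SDPA = auto()
    FLASHINFER = auto()

class MMEncoderAttentionSketch:
    """Toy stand-in showing only the new dispatch, not the real layer."""

    def __init__(self, backend: _Backend, workspace_buffer=None):
        self.backend = backend
        # The workspace buffer is only required by the FlashInfer/cuDNN path.
        self.workspace_buffer = workspace_buffer

    def forward_cuda(self, q, k, v, cu_seqlens, max_seqlen=None, sequence_lengths=None):
        if self.backend == _Backend.FLASHINFER:
            # New path: hand off to the cuDNN-backed FlashInfer kernel.
            return self._forward_flashinfer(q, k, v, cu_seqlens, max_seqlen, sequence_lengths)
        # Existing paths remain unchanged.
        return self._forward_default(q, k, v, cu_seqlens, max_seqlen)

    def _forward_flashinfer(self, q, k, v, cu_seqlens, max_seqlen, sequence_lengths):
        raise NotImplementedError("sketch only; the real method calls the custom op")

    def _forward_default(self, q, k, v, cu_seqlens, max_seqlen):
        raise NotImplementedError("sketch only")
```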

3. Qwen2.5-VL Model Updates (vllm/model_executor/models/qwen2_5_vl.py)

  • Added workspace_buffer allocation (128 MB) when using the FlashInfer backend (a sketch of the allocation follows below)
  • Propagated workspace_buffer and sequence_lengths through attention layers
  • Added FlashInfer to supported backends
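
A sketch of the kind of one-time allocation this refers to; the 128 MB figure comes from this PR, while the dtype and constant name are assumptions:

```python
import torch

# 128 MB byte buffer shared by the ViT attention layers on the FlashInfer path,
# allocated once at model construction rather than per forward pass.
VIT_FI_WORKSPACE_BYTES = 128 * 1024 * 1024  # hypothetical constant name
workspace_buffer = torch.zeros(VIT_FI_WORKSPACE_BYTES, dtype=torch.uint8, device="cuda")
```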

4. Qwen3-VL Model Updates (vllm/model_executor/models/qwen3_vl.py)

More extensive changes for cuDNN compatibility:

  • New add_padding_to_fi_seqlens() method to pad the sequence lengths to a batch size of 8
  • New compute_flashinfer_cu_seqlens() method to compute the 3-section cu_seqlens format (both helpers are sketched below)
  • Fixed max_seqlen at 128K for the FlashInfer backend
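
Hedged reimplementations of the two helpers, just to make the shapes concrete; the real methods in qwen3_vl.py may differ (in particular, padding with zero-length entries and using identical prefix sums for Q/K, V, and O are assumptions):

```python
import torch

def add_padding_to_fi_seqlens(seq_lens: torch.Tensor, batch_size: int = 8) -> torch.Tensor:
    # Pad the per-image sequence-length tensor so the batch dimension seen by
    # cuDNN is always `batch_size` (8 in this PR). Zero-length padding entries
    # are an assumption about how the real method pads.
    pad = batch_size - seq_lens.shape[0]
    if pad <= 0:
        return seq_lens
    return torch.cat([seq_lens, seq_lens.new_zeros(pad)])

def compute_flashinfer_cu_seqlens(seq_lens: torch.Tensor) -> torch.Tensor:
    # Build the 3-section cu_seqlens expected by the cuDNN path: batch offsets
    # for Q/K, V, and O concatenated into one tensor. Identical prefix sums are
    # assumed here because ViT Q/K, V, and O share the same token layout.
    prefix = torch.zeros(seq_lens.shape[0] + 1, dtype=seq_lens.dtype, device=seq_lens.device)
    prefix[1:] = torch.cumsum(seq_lens, dim=0)
    return torch.cat([prefix, prefix, prefix])
```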

Technical Details

The cuDNN batch attention API requires:

| Parameter | Description |
| --- | --- |
| cu_seqlens | Split into 3 sections: Q/K offsets, V offsets, O offsets |
| sequence_lengths | Padded to batch size 8 |
| max_seqlen | Fixed at 128K for FlashInfer |
| workspace_buffer | 128 MB pre-allocated buffer |

cu_seqlens Format

  • cu_seqlens is 3x the normal length and is split into three equal sections:

```python
cu_seqlength = len(cu_seqlens) // 3
batch_offsets_qk = cu_seqlens[:cu_seqlength]                  # Q/K batch offsets
batch_offsets_v = cu_seqlens[cu_seqlength:cu_seqlength * 2]   # V batch offsets
batch_offsets_o = cu_seqlens[cu_seqlength * 2:]               # O batch offsets
```
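
For example, with toy numbers and assuming Q/K, V, and O share one token layout (so all three sections carry the same prefix sums), the split recovers:

```python
import torch

seq_lens = torch.tensor([4, 6])                      # two image sequences
prefix = torch.cat([torch.zeros(1, dtype=seq_lens.dtype), seq_lens.cumsum(0)])
cu_seqlens = torch.cat([prefix, prefix, prefix])     # length 9 == 3 * (batch + 1)

cu_seqlength = len(cu_seqlens) // 3                  # 3
print(cu_seqlens[:cu_seqlength])                     # tensor([ 0,  4, 10])  Q/K offsets
print(cu_seqlens[cu_seqlength:cu_seqlength * 2])     # tensor([ 0,  4, 10])  V offsets
print(cu_seqlens[cu_seqlength * 2:])                 # tensor([ 0,  4, 10])  O offsets
```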

@maxyanghu maxyanghu self-assigned this Jan 23, 2026
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@maxyanghu maxyanghu marked this pull request as draft January 23, 2026 22:19
@maxyanghu maxyanghu force-pushed the vit-attn-cudnn-backend branch from 3372e3a to 28d139b Compare January 26, 2026 00:26
@maxyanghu maxyanghu changed the base branch from main to mlperf-inf-mm-q3vl-v6.0 January 26, 2026 00:26
@maxyanghu maxyanghu marked this pull request as ready for review January 26, 2026 00:27
@maxyanghu maxyanghu force-pushed the vit-attn-cudnn-backend branch from 28d139b to 47af3e1 Compare January 26, 2026 00:34
@wangshangsam wangshangsam added the enhancement New feature or request label Jan 26, 2026
zhandaz previously approved these changes Jan 26, 2026

zhandaz left a comment


Overall LGTM.

The Dockerfile could be polished. Please run a docker image build and an end-to-end functionality run; then it is good to be merged!


I am not quite familiar with the flashinfer build. Please correct me if I'm wrong.

  1. I found there is a script tool/flashinfer-build.sh. While the usage may differ, I feel we should also set FI_TORCH_CUDA_ARCH_LIST in our case, which could reduce our image build time a lot.
  2. python3 is used throughout the rest of this Dockerfile. Let's switch to python3 for consistency; python alone may not even work.

f"Qwen2.5-VL does not support {self.attn_backend} backend now."
)

workspace_buffer = (


I vaguely remember that this is specified through VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE (it was you who introduced this env var)?

maxyanghu (Author)


VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE controls the text model's FlashInfer workspace buffer size. It's different from this one.


wangshangsam Jan 26, 2026


Should the ViT FI workspace size be specifiable via (a different) env var too? Do you imagine this number to be different on different GPUs?


For fprop this is count_of_ragged_tensors (Q, K, V = 3 in this case) * batch_size * sizeof(TMADescriptor) (128 B) + 16 bytes for alignment + 4 bytes for tile size. So 128 MB should be more than sufficient across architectures.
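
Plugging numbers into that formula (the batch size below is chosen arbitrarily for illustration) shows how much headroom the 128 MB buffer leaves:

```python
ragged_tensors = 3           # Q, K, V
batch_size = 1024            # assumed batch size, purely illustrative
tma_descriptor_bytes = 128   # sizeof(TMADescriptor)

workspace_bytes = ragged_tensors * batch_size * tma_descriptor_bytes + 16 + 4
print(workspace_bytes)                # 393236 bytes, roughly 0.38 MB
print(workspace_bytes / (128 << 20))  # ~0.003 of the pre-allocated 128 MB buffer
```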

```diff
 self.num_heads = num_heads
 self.head_size = head_size
-self.scale = scale
+self.scale = 1.0 / (head_size**0.5) if scale is None else scale
```


What is this default scale factor based on?


zhandaz commented Jan 26, 2026

@maxyanghu One more thing: flashinfer-python==0.5.3 is specified in requirements/cuda.txt and is installed during the docker build. While I feel it won't have any effect, since our built version should also be 0.5.3 and that install may be skipped, please double-check.


b-mu commented Jan 26, 2026

I think we should document that once the upcoming cuDNN FE release is available, those hard-coded max_seqlen and padding values should be removed.
