Add Flashinfer cudnn Backend for ViT #30
base: mlperf-inf-mm-q3vl-v6.0
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
zhandaz left a comment:
Overall LGTM.
The Dockerfile could be polished. Please run a Docker image build and an end-to-end functionality run; then it is good to be merged!
I am not quite familiar with the flashinfer build. Please correct me if I'm wrong.
- I found there is a script `tool/flashinfer-build.sh`. While the usage may be different, I feel like we should also set `FI_TORCH_CUDA_ARCH_LIST` in our case, which can reduce our image build time a lot.
- `python3` is used throughout the other places in this Dockerfile. Let's switch to `python3` for consistency; also, `python` may not work.
| f"Qwen2.5-VL does not support {self.attn_backend} backend now." | ||
| ) | ||
|
|
||
| workspace_buffer = ( |
I vaguely remember that this is specified through VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE (it was you who introduced this env var)?
VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE controls the text model's FlashInfer workspace buffer size. It's different from this one.
Should the ViT FI workspace size be specifiable via (a different) env var too? Do you imagine this number to be different on different GPUs?
For fprop this is count_of_ragged_tensors (Q, K, V = 3 in this case) * batch_size * sizeof(TMADescriptor) (128 B) + 16 bytes for alignment + 4 bytes for tile size. So 128 MB should be more than sufficient across architectures.
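As a quick sanity check of that estimate, here is a minimal sketch; the constant names are illustrative, and the numbers are taken directly from the formula above:

```python
# Rough sanity check of the cuDNN fprop workspace estimate described above:
# count_of_ragged_tensors * batch_size * sizeof(TMADescriptor) + alignment + tile size.
NUM_RAGGED_TENSORS = 3      # Q, K, V
TMA_DESCRIPTOR_BYTES = 128
ALIGNMENT_BYTES = 16
TILE_SIZE_BYTES = 4


def estimated_fprop_workspace_bytes(batch_size: int) -> int:
    return (NUM_RAGGED_TENSORS * batch_size * TMA_DESCRIPTOR_BYTES
            + ALIGNMENT_BYTES + TILE_SIZE_BYTES)


workspace_buffer_bytes = 128 * 1024 * 1024  # the 128 MB buffer allocated in this PR
for batch_size in (8, 1024, 100_000):
    need = estimated_fprop_workspace_bytes(batch_size)
    print(f"batch_size={batch_size}: ~{need} B needed, "
          f"fits in 128 MB: {need <= workspace_buffer_bytes}")
```

Even at very large batch sizes, the estimate stays far below 128 MB, which matches the "more than sufficient across architectures" claim.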
self.num_heads = num_heads
self.head_size = head_size
self.scale = scale
self.scale = 1.0 / (head_size**0.5) if scale is None else scale
What is this default scale factor based on?
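For reference, this default matches the conventional 1/sqrt(head_size) factor used in scaled dot-product attention; a minimal sketch of that scaling, not the model code:

```python
import math

import torch

head_size = 64
scale = 1.0 / math.sqrt(head_size)  # same default as in the diff above

# (batch, heads, tokens, head_size)
q = torch.randn(1, 4, 16, head_size)
k = torch.randn(1, 4, 16, head_size)
v = torch.randn(1, 4, 16, head_size)

# Scaled dot-product attention: softmax(q @ k^T * scale) @ v
out = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v
```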
@maxyanghu One more thing,

I think we should document that when we have the upcoming cuDNN FE release, those hard-coded
This branch adds cuDNN backend support for ViT (Vision Transformer) attention in multimodal models like Qwen2.5-VL and Qwen3-VL.
Key File Changes
1. New FlashInfer/cuDNN Wrapper (`vllm/v1/attention/ops/vit_attn_wrappers.py`)

Added `flashinfer_wrapper` and `vit_flashinfer_wrapper` functions that:
- call `cudnn_batch_prefill_with_kv_cache` from FlashInfer
- use a special `cu_seqlens` format (3x longer, containing `batch_offsets_qk`, `batch_offsets_v`, `batch_offsets_o`)
- keep `torch.compile` compatibility

2. MMEncoderAttention Updates (`vllm/model_executor/layers/attention/mm_encoder_attention.py`)

- Added a `workspace_buffer` parameter for the cuDNN backend
- Added a `_forward_flashinfer()` method
- Updated `forward_cuda()` to dispatch to FlashInfer when the `FLASHINFER` backend is selected
- Passes the `sequence_lengths` parameter throughout the call chain

3. Qwen2.5-VL Model Updates (`vllm/model_executor/models/qwen2_5_vl.py`)

- `workspace_buffer` allocation (128 MB) when using the FlashInfer backend
- Passes `workspace_buffer` and `sequence_lengths` through the attention layers

4. Qwen3-VL Model Updates (`vllm/model_executor/models/qwen3_vl.py`)

More extensive changes for cuDNN compatibility:
- `add_padding_to_fi_seqlens()` method to pad sequence lengths to a batch size of 8
- `compute_flashinfer_cu_seqlens()` method to compute the 3-section `cu_seqlens` format
- `max_seqlen` set to 128K for the FlashInfer backend

Technical Details
The cuDNN batch attention API requires:
- `cu_seqlens`
- `sequence_lengths`
- `max_seqlen`
- `workspace_buffer`

cu_seqlens Format
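The real layout is defined in `vit_attn_wrappers.py` and `compute_flashinfer_cu_seqlens()`; purely as an illustration of the idea described above (the helper names mirror the PR's methods, but the pad-to-a-multiple-of-8 rule and the identical per-section offsets here are assumptions, not copied from the diff), building the 3x-long `cu_seqlens` from per-image sequence lengths could look roughly like this:

```python
import torch


def pad_seqlens_for_fi(seqlens: torch.Tensor, multiple: int = 8) -> torch.Tensor:
    """Illustrative stand-in for add_padding_to_fi_seqlens(): append
    zero-length sequences until the batch size is a multiple of `multiple`."""
    remainder = seqlens.numel() % multiple
    if remainder == 0:
        return seqlens
    pad = torch.zeros(multiple - remainder, dtype=seqlens.dtype,
                      device=seqlens.device)
    return torch.cat([seqlens, pad])


def build_flashinfer_cu_seqlens(seqlens: torch.Tensor) -> torch.Tensor:
    """Illustrative stand-in for compute_flashinfer_cu_seqlens(): concatenate
    the three prefix-sum sections (batch_offsets_qk, batch_offsets_v,
    batch_offsets_o) into one tensor 3x the usual cu_seqlens length."""
    cu = torch.zeros(seqlens.numel() + 1, dtype=torch.int32,
                     device=seqlens.device)
    cu[1:] = torch.cumsum(seqlens, dim=0)
    # In this sketch Q/K, V, and O all share the same offsets; the real
    # wrapper may compute each section differently.
    return torch.cat([cu, cu, cu])


seqlens = torch.tensor([100, 250, 64], dtype=torch.int32)   # tokens per image
padded = pad_seqlens_for_fi(seqlens)              # padded to a batch of 8
cu_seqlens = build_flashinfer_cu_seqlens(padded)  # 3 * (8 + 1) = 27 entries
```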