Enable qwen3 vl moe quant and load #1182
base: main
Conversation
…fp UT Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
for more information, see https://pre-commit.ci
Pull request overview
This PR enables quantization and loading support for the Qwen3-VL-MoE model by implementing expert-to-linear conversion and adding comprehensive test coverage.
Key Changes:
- Added Qwen3-VL-MoE model handler with expert conversion logic similar to existing MoE models
- Implemented device-aware E2M1 tensor caching to improve performance
- Added test fixtures and test cases for both CPU and CUDA environments
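The device-aware caching mentioned above can be sketched as follows. This is a pure-Python stand-in with illustrative names, not the actual `fp4_utils.py` code; in the real module the cached object would be a torch tensor allocated on the requested device.

```python
# Hypothetical sketch of device-aware caching for the E2M1 lookup table.
# Names and structure are illustrative; the real fp4_utils.py may differ.

# The 16 representable E2M1 (FP4) values: 1 sign bit, 2 exponent bits,
# 1 mantissa bit.
_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0)

_lookup_cache: dict = {}

def get_e2m1_lookup(device: str = "cpu"):
    """Return the E2M1 lookup for `device`, building it once per device.

    In the real module this would return a torch.Tensor placed on
    `device`; caching it avoids rebuilding the table (and re-copying it
    to the accelerator) on every dequantization call.
    """
    if device not in _lookup_cache:
        # Stand-in for torch.tensor(_E2M1_VALUES, device=device).
        _lookup_cache[device] = list(_E2M1_VALUES)
    return _lookup_cache[device]
```

Subsequent calls with the same device string return the cached object, so the per-call cost drops to a dictionary lookup.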
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| auto_round/modelling/qwen3_vl_moe.py | New module implementing LinearQwen3VLMoeTextSparseMoeBlock for expert-to-linear conversion during quantization |
| auto_round/special_model_handler.py | Registered qwen3_vl_moe in supported models list and expert conversion mapping |
| test/test_cuda/test_moe_model.py | Added fixture and test case for Qwen3-VL-MoE MXFP4 quantization on CUDA |
| test/test_cpu/test_moe_model.py | Added fixture and test case for Qwen3-VL-MoE MXFP4 quantization on CPU |
| auto_round/experimental/qmodules/fp4_utils.py | Refactored E2M1 lookup tensor to use device-aware caching mechanism |
| auto_round/data_type/utils.py | Added trailing newline for consistency |
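The expert-to-linear conversion referenced in the table can be sketched roughly like this. All names here are hypothetical stand-ins (no torch dependency); the actual `LinearQwen3VLMoeTextSparseMoeBlock` in `auto_round/modelling/qwen3_vl_moe.py` operates on real `nn.Linear` modules and will differ in detail.

```python
# Hypothetical sketch of expert-to-linear conversion for a MoE block.
# Fused expert weights of shape (num_experts, out, in) are split into
# per-expert 2-D "linear" layers so that quantizers which only handle
# plain linear layers can process each expert independently.

class Linear:
    """Minimal stand-in for nn.Linear holding a 2-D weight."""
    def __init__(self, weight):
        self.weight = weight  # nested lists, shape (out_features, in_features)

def split_experts_to_linears(fused_weight):
    """Split a fused (num_experts, out, in) weight into per-expert Linears."""
    return [Linear(expert_w) for expert_w in fused_weight]

# Example: 2 experts, out_features=2, in_features=3.
fused = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
]
linears = split_experts_to_linears(fused)
```

After conversion, each expert looks like an ordinary linear layer, which is what lets the existing quantization path (and the MXFP4 tests above) treat the MoE experts uniformly.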
…auto-round into enable_qwen3_vl_moe_quant
yiliu30 left a comment
Others LGTM
Co-authored-by: Yi Liu <yi4.liu@intel.com>