discussion about model quantization and MatMulInteger in quantized model #26735
nyamuko-Amoris asked this question in API Q&A (unanswered; 1 comment, 2 replies)
-
You are right – there is currently a mismatch between the ONNX spec and the ONNX Runtime implementation of MatMulInteger. It would be useful to open a dedicated issue in the onnxruntime repository so this can be tracked and handled. In the meantime, if it fits your use case, you can use the op_types_to_quantize argument of quantize_dynamic to quantize different op types with different weight types.
For example, you can do the quantization in two passes. First, quantize only the MatMul nodes, using int8 weights:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize only MatMul nodes, using int8 weights
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_matmul_int8.onnx",
    op_types_to_quantize=["MatMul"],
    weight_type=QuantType.QInt8,  # int8 weights for MatMul
    # other arguments as needed...
)
Then, starting from the MatMul-quantized model, quantize the remaining op types to uint8:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Example: quantize other ops (e.g., Conv, Gemm, Attention) to uint8
quantize_dynamic(
    model_input="model_matmul_int8.onnx",
    model_output="quantize_model.onnx",
    op_types_to_quantize=["Conv", "Gemm", "Attention"],  # adjust to your model
    weight_type=QuantType.QUInt8,
    # other arguments as needed...
)
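As a quick sanity check after the two passes, you can count which op types appear in the resulting model to confirm that only the intended ops were quantized. This is a minimal sketch that reuses the output file name from the example above; which quantized op types show up (e.g., MatMulInteger, DynamicQuantizeLinear) depends on your model and settings:

import onnx
from collections import Counter

# Load the two-pass quantized model and tally its op types.
model = onnx.load("quantize_model.onnx")
print(Counter(node.op_type for node in model.graph.node))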
-
Hi, I'm using ONNX Runtime to deploy a transformer-based model on an NVIDIA GPU with only 4-8 GB of memory, so I need to quantize the model with quantize_dynamic() to UINT8.
However, inference performance turned out to be very poor, and I eventually found that an op named 'MatMulInteger' was running on the CPU rather than the GPU.
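For reference, one way to see which nodes fall back to the CPU provider is to create the session with verbose logging enabled. This is a minimal sketch assuming the GPU build of onnxruntime; the model file name is just a placeholder:

import onnxruntime as ort

so = ort.SessionOptions()
# Verbose logging reports, among other things, nodes that could not be
# placed on the CUDA provider and fall back to the CPU provider.
so.log_severity_level = 0

sess = ort.InferenceSession(
    "model_uint8.onnx",  # placeholder name for the quantized model
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)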
Reading the ONNX Runtime source code, I found that the CUDA execution provider implements MatMulInteger only for int8 inputs, not uint8, while DynamicQuantizeLinear, the node immediately preceding MatMulInteger in the ONNX graph, produces UINT8 output.
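A minimal sketch of how this can be checked in the graph itself (the file name is again a placeholder for the dynamically quantized model): it maps each tensor to its producing node and prints what feeds the A input of every MatMulInteger node, which in a dynamically quantized model is typically a DynamicQuantizeLinear output.

import onnx

model = onnx.load("model_uint8.onnx")

# Map each tensor name to the node that produces it.
producers = {out: n for n in model.graph.node for out in n.output}

for node in model.graph.node:
    if node.op_type == "MatMulInteger":
        src = producers.get(node.input[0])
        print(node.name,
              "A input produced by:",
              src.op_type if src else "graph input/initializer")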
So my question is: why does the CUDA MatMulInteger kernel only accept INT8 tensors as input, but not UINT8? And since I don't have enough data for static quantization and calibration, is it possible to implement a UINT8 MatMulInteger op to accelerate my model inference?