discussion about model quantization and MatMulInteger in quantized model #26735
nyamuko-Amoris asked this question in API Q&A (unanswered; 1 comment, 2 replies)
-
You are right – there is currently a mismatch between the ONNX spec and the ONNX Runtime implementation of MatMulInteger. It would be useful to open a dedicated issue in the onnxruntime repository so this can be tracked and handled. In the meantime, if it fits your use case, you can use the op_types_to_quantize argument of quantize_dynamic to quantize different op types with different weight types.
For example, you can do the quantization in two passes. First, quantize only the MatMul nodes, using int8 weights:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize only MatMul nodes, using int8 weights
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_matmul_int8.onnx",
    op_types_to_quantize=["MatMul"],
    weight_type=QuantType.QInt8,  # int8 weights for MatMul
    # other arguments as needed...
)
Then, starting from the MatMul-quantized model, quantize the remaining op types to uint8:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Example: quantize other ops (e.g., Conv, Gemm, Attention) to uint8
quantize_dynamic(
    model_input="model_matmul_int8.onnx",
    model_output="quantize_model.onnx",
    op_types_to_quantize=["Conv", "Gemm", "Attention"],  # adjust to your model
    weight_type=QuantType.QUInt8,
    # other arguments as needed...
)
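As a quick sanity check after the two passes, you can count which op types appear in the resulting model to confirm that only the intended ops were quantized. This is a minimal sketch that reuses the output file name from the example above; which quantized op types show up (e.g., MatMulInteger, DynamicQuantizeLinear) depends on your model and settings:

import onnx
from collections import Counter

# Load the two-pass quantized model and tally its op types.
model = onnx.load("quantize_model.onnx")
print(Counter(node.op_type for node in model.graph.node))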
-
Hi, I'm using ONNX Runtime to deploy a transformer-based model on an NVIDIA GPU with only 4-8 GB of memory, so I need to quantize the model with quantize_dynamic() to UINT8.
However, inference performance turned out to be very poor, and I eventually found that an op named 'MatMulInteger' was running on the CPU rather than the GPU.
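For reference, one way to see which nodes fall back to the CPU provider is to create the session with verbose logging enabled. This is a minimal sketch assuming the GPU build of onnxruntime; the model file name is just a placeholder:

import onnxruntime as ort

so = ort.SessionOptions()
# Verbose logging reports, among other things, nodes that could not be
# placed on the CUDA provider and fall back to the CPU provider.
so.log_severity_level = 0

sess = ort.InferenceSession(
    "model_uint8.onnx",  # placeholder name for the quantized model
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)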
Reading the ONNX Runtime source code, I found that the CUDA execution provider implements MatMulInteger only for int8 inputs, not uint8, while DynamicQuantizeLinear, the node immediately preceding MatMulInteger in the ONNX graph, produces UINT8 output.
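A minimal sketch of how this can be checked in the graph itself (the file name is again a placeholder for the dynamically quantized model): it maps each tensor to its producing node and prints what feeds the A input of every MatMulInteger node, which in a dynamically quantized model is typically a DynamicQuantizeLinear output.

import onnx

model = onnx.load("model_uint8.onnx")

# Map each tensor name to the node that produces it.
producers = {out: n for n in model.graph.node for out in n.output}

for node in model.graph.node:
    if node.op_type == "MatMulInteger":
        src = producers.get(node.input[0])
        print(node.name,
              "A input produced by:",
              src.op_type if src else "graph input/initializer")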
So my question is: why does the CUDA MatMulInteger kernel only accept INT8 tensors as input, but not UINT8? And since I don't have enough data for static quantization and calibration, is it possible to implement a UINT8 MatMulInteger op to accelerate my model inference?