60 changes: 60 additions & 0 deletions microsoft-Phi-4-mini-instruct/NvTensorRtRtx/README.md
# Phi-4-mini-instruct optimization

This folder contains examples of Olive recipes for `Phi-4-mini-instruct` optimization.

## NVMO PTQ Mixed Precision Quantization

The Olive recipe `microsoft-Phi-4-mini-instruct_nvmo_ptq_mixed_precision_awq_lite.json` produces an INT4 + INT8 mixed-precision quantized model using NVIDIA's TensorRT Model Optimizer toolkit with the AWQ algorithm.

### Setup

1. Install Olive with NVIDIA TensorRT Model Optimizer toolkit

- Run the following command to install Olive with the TensorRT Model Optimizer extra.
```bash
pip install olive-ai[nvmo]
```

- If TensorRT Model Optimizer needs to be installed from a local wheel, follow the steps below.

```bash
pip install olive-ai
pip install <modelopt-wheel>[onnx]
```

- Make sure that TensorRT Model Optimizer is installed correctly.
```bash
python -c "from modelopt.onnx.quantization.int4 import quantize as quantize_int4"
```
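
If a hard failure on import is undesirable (for example, inside a setup script), the same check can be wrapped so it reports the problem instead of raising. This is an optional sketch, not part of the official setup:

```python
# Non-failing variant of the import check above: reports whether the
# INT4 quantization entry point from TensorRT Model Optimizer imports.
try:
    from modelopt.onnx.quantization.int4 import quantize as quantize_int4  # noqa: F401
    status = "ok"
except ImportError as exc:
    status = f"missing: {exc}"
print("modelopt int4 quantize:", status)
```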

- Refer to the TensorRT Model Optimizer [documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/windows/_installation_with_olive.html) for detailed installation instructions and dependency setup.

2. Install suitable onnxruntime and onnxruntime-genai packages

- Install onnxruntime and onnxruntime-genai packages that support `NvTensorRTRTXExecutionProvider`. Refer to the [NvTensorRtRtx execution provider](https://onnxruntime.ai/docs/execution-providers/TensorRTRTX-ExecutionProvider) documentation to set up its dependencies and requirements.
- Note that TensorRT Model Optimizer installs onnxruntime-directml by default, and the onnxruntime-genai-cuda package pulls in onnxruntime-gpu. To use an onnxruntime package with `NvTensorRTRTXExecutionProvider` support, you may need to uninstall the other onnxruntime packages first.
- Make sure that, in the end, exactly one onnxruntime package is installed. Use a command like the following to validate the installation.
```bash
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
```
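
To see which onnxruntime distributions are currently installed (and therefore which ones to uninstall), a small stdlib-only sketch like the following can help; after cleanup the printed list should contain exactly one entry:

```python
# List every installed distribution whose name starts with "onnxruntime".
from importlib import metadata

ort_dists = sorted(
    dist.metadata["Name"]
    for dist in metadata.distributions()
    if (dist.metadata["Name"] or "").lower().startswith("onnxruntime")
)
print(ort_dists)
```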

3. Install additional requirements.

- Install the packages listed in the requirements file.
```bash
pip install -r requirements-nvmo.txt
```

### Steps to run

```bash
olive run --config microsoft-Phi-4-mini-instruct_nvmo_ptq_mixed_precision_awq_lite.json
```

### Recipe details

The Olive recipe `microsoft-Phi-4-mini-instruct_nvmo_ptq_mixed_precision_awq_lite.json` has two passes: (a) `ModelBuilder` and (b) `NVModelOptQuantization`. The `ModelBuilder` pass generates the FP16 model for `NvTensorRTRTXExecutionProvider` (aka the `NvTensorRtRtx` EP). The `NVModelOptQuantization` pass then performs INT4 + INT8 mixed-precision quantization using the AWQ algorithm with the AWQ Lite calibration method to produce the optimized model.
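
As a rough illustration of the two-pass flow, the pass ordering can be inspected by parsing the recipe JSON; the snippet below embeds a trimmed copy of the `passes` section for demonstration only:

```python
import json

# Trimmed copy of the "passes" section from the recipe JSON.
recipe = json.loads("""
{
  "passes": {
    "builder": { "type": "ModelBuilder", "precision": "fp16" },
    "quantization": { "type": "NVModelOptQuantization", "algorithm": "awq" }
  }
}
""")

# Olive runs passes in declaration order: ModelBuilder first, then quantization.
order = [(name, p["type"]) for name, p in recipe["passes"].items()]
print(order)
```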

### Troubleshoot

If you run into issues with quantization using the TensorRT Model Optimizer toolkit, refer to its [FAQs](https://nvidia.github.io/TensorRT-Model-Optimizer/support/2_faqs.html) for help and suggestions.
6 changes: 6 additions & 0 deletions microsoft-Phi-4-mini-instruct/NvTensorRtRtx/info.yml
arch: phi4
recipes:
- name: microsoft-Phi-4-mini-instruct_nvmo_ptq_mixed_precision_awq_lite
file: microsoft-Phi-4-mini-instruct_nvmo_ptq_mixed_precision_awq_lite.json
devices: gpu
eps: NvTensorRTRTXExecutionProvider
30 changes: 30 additions & 0 deletions microsoft-Phi-4-mini-instruct/NvTensorRtRtx/microsoft-Phi-4-mini-instruct_nvmo_ptq_mixed_precision_awq_lite.json
{
"input_model": {
"type": "HfModel",
"model_path": "microsoft/Phi-4-mini-instruct",
"task": "text-generation"
},
"systems": {
"local_system": {
"type": "LocalSystem",
"accelerators": [ { "device": "gpu", "execution_providers": [ "NvTensorRTRTXExecutionProvider" ] } ]
}
},
"engine": { "target": "local_system" },
"passes": {
"builder": { "type": "ModelBuilder", "precision": "fp16" },
"quantization": {
"type": "NVModelOptQuantization",
"algorithm": "awq",
"int4_block_size": 32,
"tokenizer_dir": "microsoft/Phi-4-mini-instruct",
"calibration_method": "awq_lite",
"enable_mixed_quant": true,
"calibration_providers": ["NvTensorRtRtx"],
"calibration_params": {
"add_position_ids": false
}
}
},
"log_severity_level": 0
}
3 changes: 3 additions & 0 deletions microsoft-Phi-4-mini-instruct/NvTensorRtRtx/requirements-nvmo.txt
datasets>=2.14.4
torch
transformers