Before submitting an issue, please make sure it hasn't been already addressed by searching through the existing and past issues.
Describe the bug
GLM-5 NVFP4 quantization GPU OOM on B200
I want to check whether NVFP4 quantization is supported for this model on B200, or whether we need a bigger machine.
Steps/Code to reproduce bug
python examples/llm_ptq/hf_ptq.py --pyt_ckpt_path zai-org/GLM-5 --qformat nvfp4_mlp_only --export_path /home/prithudasgupta_google_com/models/glm-5-nvfp4-modelopt --kv_cache_qformat none --trust_remote_code --calib_size 64 --dataset cnn_dailymail
Multiple distributions found for package modelopt. Picked distribution: nvidia-modelopt
/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/__init__.py:36: UserWarning: transformers version 5.2.0 is not tested with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
ModelOpt save/restore enabled for `transformers` library.
ModelOpt save/restore enabled for `diffusers` library.
ModelOpt save/restore enabled for `peft` library.
Initializing model from zai-org/GLM-5
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
model.safetensors.index.json: 5.43MB [00:00, 48.2MB/s]
Fetching 282 files: 100%|███████████████████████████████████| 282/282 [13:07<00:00, 2.79s/it]
Download complete: 100%|█████████████████████████████████| 1.51T/1.51T [13:08<00:00, 1.91GB/s]
Loading weights: 95%|▉| 1544/1629 [04:21<00:14, 5.91it/s, Materializing param=model.layers.7
Traceback (most recent call last):
File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1179, in <module>
main(args)
File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1150, in main
) = load_model(args)
^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 320, in load_model
full_model = get_model(
^^^^^^^^^^
File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/example_utils.py", line 577, in get_model
model = auto_model_module.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/opt/plugins/transformers.py", line 68, in _new_from_pretrained
model = types.MethodType(cls._modelopt_cache["from_pretrained"].__func__, cls)(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4072, in from_pretrained
loading_info, disk_offload_index = cls._load_pretrained_model(model, state_dict, checkpoint_files, load_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4191, in _load_pretrained_model
loading_info, disk_offload_index = convert_and_load_state_dict_in_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/core_model_loading.py", line 1209, in convert_and_load_state_dict_in_model
realized_value = mapping.convert(
^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/core_model_loading.py", line 677, in convert
collected_tensors = self.materialize_tensors()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/core_model_loading.py", line 653, in materialize_tensors
tensors = [future.result() for future in tensors if future.result() is not None]
^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/core_model_loading.py", line 799, in _job
return _materialize_copy(tensor, device, dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/core_model_loading.py", line 788, in _materialize_copy
tensor = tensor.to(device=device, dtype=dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 7 has a total capacity of 178.35 GiB of which 57.94 MiB is free. Including non-PyTorch memory, this process has 178.28 GiB memory in use. Of the allocated memory 177.52 GiB is allocated by PyTorch, and 64.85 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
########
GPU 0: Peak memory usage = 176.62 GB for all processes on the GPU
GPU 1: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 2: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 3: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 4: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 5: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 6: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 7: Peak memory usage = 179.00 GB for all processes on the GPU
########
(venv) prithudasgupta_google_com@prithu-b200-dev-1201:~/Model-Optimizer$
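A quick capacity check using the numbers from the log above suggests the failure is expected rather than a fragmentation issue; the binary-to-decimal conversion below is my own:

# Back-of-envelope: can the BF16 GLM-5 checkpoint fit across 8 GPUs at all?
ckpt_tb = 1.51                 # safetensors download size in TB (decimal), from the log
per_gpu_gib = 178.35           # per-GPU capacity in GiB, from the OOM message
total_tb = per_gpu_gib * 8 * 2**30 / 1e12
print(f"aggregate GPU memory: {total_tb:.2f} TB, checkpoint: {ckpt_tb:.2f} TB")
# -> ~1.53 TB vs 1.51 TB: the weights alone consume ~99% of all eight GPUs,
#    leaving essentially nothing for CUDA context, activations, or calibration buffers.

Note that the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint in the error message targets fragmentation; with only ~65 MiB reserved-but-unallocated here, it is unlikely to help on its own.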
Expected behavior
The model is quantized to NVFP4 and saved successfully.
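For reference, a minimal, unverified sketch of how the checkpoint could be loaded with per-GPU memory caps and CPU/disk offload through transformers' device_map machinery; the memory budgets and offload path are assumptions, and hf_ptq.py's own loading path (get_model) would need to expose these options:

import torch
from transformers import AutoModelForCausalLM

# Assumption: cap each GPU below its ~178 GiB capacity and spill the rest to host RAM/disk.
max_memory = {i: "160GiB" for i in range(8)}
max_memory["cpu"] = "1TiB"  # assumes sufficient host RAM; adjust to the machine

model = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-5",
    dtype=torch.bfloat16,      # transformers 5.x: `dtype` replaces the deprecated `torch_dtype`
    device_map="auto",         # let accelerate shard layers across GPUs, then CPU, then disk
    max_memory=max_memory,
    offload_folder="offload",  # weights that still don't fit are memory-mapped from here
    trust_remote_code=True,
)

Calibration with offloaded weights will be slow, so a node with more aggregate GPU memory may still be the practical answer.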
Who can help?
System information
- Container used (if applicable): ?
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ?
- CPU architecture (x86_64, aarch64): ?
- GPU name (e.g. H100, A100, L40S): ?
- GPU memory size: ?
- Number of GPUs: ?
- Library versions (if applicable):
- Python: ?
- ModelOpt version or commit hash: ?
- CUDA: ?
- PyTorch: ?
- Transformers: ?
- TensorRT-LLM: ?
- ONNXRuntime: ?
- TensorRT: ?
- Any other details that may help: ?