Before submitting an issue, please make sure it hasn't been already addressed by searching through the existing and past issues.
Describe the bug
GLM-5 NVFP4 quantization GPU OOM on B200
I want to check whether NVFP4 quantization is supported for this model on B200, or whether we need a bigger machine.
Steps/Code to reproduce bug
python examples/llm_ptq/hf_ptq.py --pyt_ckpt_path zai-org/GLM-5 --qformat nvfp4_mlp_only --export_path /home/prithudasgupta_google_com/models/glm-5-nvfp4-modelopt --kv_cache_qformat none --trust_remote_code --calib_size 64 --dataset cnn_dailymail
Multiple distributions found for package modelopt. Picked distribution: nvidia-modelopt
/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/__init__.py:36: UserWarning: transformers version 5.2.0 is not tested with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
ModelOpt save/restore enabled for `transformers` library.
ModelOpt save/restore enabled for `diffusers` library.
ModelOpt save/restore enabled for `peft` library.
Initializing model from zai-org/GLM-5
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
model.safetensors.index.json: 5.43MB [00:00, 48.2MB/s]
Fetching 282 files: 100%|███████████████████████████████████| 282/282 [13:07<00:00, 2.79s/it]
Download complete: 100%|█████████████████████████████████| 1.51T/1.51T [13:08<00:00, 1.91GB/s]
Loading weights: 95%|▉| 1544/1629 [04:21<00:14, 5.91it/s, Materializing param=model.layers.7
Traceback (most recent call last):
File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1179, in <module>
main(args)
File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1150, in main
) = load_model(args)
^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 320, in load_model
full_model = get_model(
^^^^^^^^^^
File "/home/prithudasgupta_google_com/Model-Optimizer/examples/llm_ptq/example_utils.py", line 577, in get_model
model = auto_model_module.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/Model-Optimizer/modelopt/torch/opt/plugins/transformers.py", line 68, in _new_from_pretrained
model = types.MethodType(cls._modelopt_cache["from_pretrained"].__func__, cls)(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4072, in from_pretrained
loading_info, disk_offload_index = cls._load_pretrained_model(model, state_dict, checkpoint_files, load_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4191, in _load_pretrained_model
loading_info, disk_offload_index = convert_and_load_state_dict_in_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/core_model_loading.py", line 1209, in convert_and_load_state_dict_in_model
realized_value = mapping.convert(
^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/core_model_loading.py", line 677, in convert
collected_tensors = self.materialize_tensors()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/core_model_loading.py", line 653, in materialize_tensors
tensors = [future.result() for future in tensors if future.result() is not None]
^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/core_model_loading.py", line 799, in _job
return _materialize_copy(tensor, device, dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/prithudasgupta_google_com/venv/lib/python3.12/site-packages/transformers/core_model_loading.py", line 788, in _materialize_copy
tensor = tensor.to(device=device, dtype=dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 7 has a total capacity of 178.35 GiB of which 57.94 MiB is free. Including non-PyTorch memory, this process has 178.28 GiB memory in use. Of the allocated memory 177.52 GiB is allocated by PyTorch, and 64.85 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
########
GPU 0: Peak memory usage = 176.62 GB for all processes on the GPU
GPU 1: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 2: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 3: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 4: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 5: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 6: Peak memory usage = 179.00 GB for all processes on the GPU
GPU 7: Peak memory usage = 179.00 GB for all processes on the GPU
########
(venv) prithudasgupta_google_com@prithu-b200-dev-1201:~/Model-Optimizer$
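A quick capacity check using the numbers from the log above suggests the failure is expected rather than a fragmentation issue; the binary-to-decimal conversion below is my own:

# Back-of-envelope: can the BF16 GLM-5 checkpoint fit across 8 GPUs at all?
ckpt_tb = 1.51                 # safetensors download size in TB (decimal), from the log
per_gpu_gib = 178.35           # per-GPU capacity in GiB, from the OOM message
total_tb = per_gpu_gib * 8 * 2**30 / 1e12
print(f"aggregate GPU memory: {total_tb:.2f} TB, checkpoint: {ckpt_tb:.2f} TB")
# -> ~1.53 TB vs 1.51 TB: the weights alone consume ~99% of all eight GPUs,
#    leaving essentially nothing for CUDA context, activations, or calibration buffers.

Note that the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint in the error message targets fragmentation; with only ~65 MiB reserved-but-unallocated here, it is unlikely to help on its own.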
Expected behavior
The model is quantized to NVFP4 and saved successfully.
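For reference, a minimal, unverified sketch of how the checkpoint could be loaded with per-GPU memory caps and CPU/disk offload through transformers' device_map machinery; the memory budgets and offload path are assumptions, and hf_ptq.py's own loading path (get_model) would need to expose these options:

import torch
from transformers import AutoModelForCausalLM

# Assumption: cap each GPU below its ~178 GiB capacity and spill the rest to host RAM/disk.
max_memory = {i: "160GiB" for i in range(8)}
max_memory["cpu"] = "1TiB"  # assumes sufficient host RAM; adjust to the machine

model = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-5",
    dtype=torch.bfloat16,      # transformers 5.x: `dtype` replaces the deprecated `torch_dtype`
    device_map="auto",         # let accelerate shard layers across GPUs, then CPU, then disk
    max_memory=max_memory,
    offload_folder="offload",  # weights that still don't fit are memory-mapped from here
    trust_remote_code=True,
)

Calibration with offloaded weights will be slow, so a node with more aggregate GPU memory may still be the practical answer.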
Who can help?
System information
- Container used (if applicable): ?
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ?
- CPU architecture (x86_64, aarch64): ?
- GPU name (e.g. H100, A100, L40S): ?
- GPU memory size: ?
- Number of GPUs: ?
- Library versions (if applicable):
- Python: ?
- ModelOpt version or commit hash: ?
- CUDA: ?
- PyTorch: ?
- Transformers: ?
- TensorRT-LLM: ?
- ONNXRuntime: ?
- TensorRT: ?
- Any other details that may help: ?