Support force tokens to % of total experts during calibration #910
base: main
@@ -906,6 +906,7 @@ def quantize_main(
          model_type,
          QUANT_CFG_CHOICES,
          KV_QUANT_CFG_CHOICES,
+         args.moe_calib_experts_ratio,
      )

      # Exclude MTP layers from quantization if detected (e.g., GLM-4.7's layer 92)
@@ -1126,6 +1127,15 @@ def parse_args() -> argparse.Namespace:
              "(sensitivity scores, costs, etc.). Only used when auto_quantize_bits is specified."
          ),
      )
+     parser.add_argument(
+         "--moe_calib_experts_ratio",
+         type=float,
+         default=1.0 / 4,
+         help=(
+             "Percentage of experts to calibrate during forward pass. Only used for MOE models. "
+             "This is used to reduce the number of experts to calibrate during forward pass. "
+         ),
+     )

      return parser.parse_args()
Comment on lines +1130 to +1138

Contributor

The default of `1.0 / 4` silently changes calibration behavior. Since the default is 0.25, every MoE calibration run uses ratio-based routing even when the flag is not passed. Consider defaulting to `None` so that the existing all-experts calibration remains the default:

      parser.add_argument(
          "--moe_calib_experts_ratio",
          type=float,
-         default=1.0 / 4,
+         default=None,
          help=(
-             "Percentage of experts to calibrate during forward pass. Only used for MOE models. "
-             "This is used to reduce the number of experts to calibrate during forward pass. "
+             "Ratio of experts to calibrate during forward pass (0, 1]. Only used for MOE models. "
+             "Default behavior routes to all experts if not specified. "
+             "Example: 0.25 calibrates 25%% of experts. "
          ),
      )
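As a complement to the suggestion above, the (0, 1] constraint could also be enforced at parse time rather than deep inside the forward pass. A minimal sketch, assuming an argparse setup like the one in this hunk; the validator name `ratio_in_unit_interval` is hypothetical and not part of the PR:

```python
import argparse


def ratio_in_unit_interval(value: str) -> float:
    """Hypothetical helper: parse a float and require it to lie in (0, 1]."""
    ratio = float(value)
    if not 0.0 < ratio <= 1.0:
        raise argparse.ArgumentTypeError(
            f"moe_calib_experts_ratio must be in (0, 1], got {ratio}"
        )
    return ratio


parser = argparse.ArgumentParser()
parser.add_argument(
    "--moe_calib_experts_ratio",
    type=ratio_in_unit_interval,
    default=None,  # None keeps the all-experts calibration path, as the reviewer suggests
    help="Ratio of experts to calibrate during forward pass (0, 1]. Only used for MoE models.",
)

print(parser.parse_args(["--moe_calib_experts_ratio", "0.25"]))
# Namespace(moe_calib_experts_ratio=0.25)
```

This rejects invalid values with a clear CLI error instead of asserting later during calibration.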
modelopt/torch/quantization/plugins/huggingface.py
@@ -458,8 +458,9 @@ def _setup(self):
        elif hasattr(self, "experts") and hasattr(self.experts, "num_experts"):
            num_experts = self.experts.num_experts

-       self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cpu")
+       self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cuda")
Contributor
Hardcoded `device="cuda"` assumes a CUDA device is available. Line 461 allocates `expert_token_count` directly on `"cuda"`, which breaks CPU-only runs and can place the counter on the wrong device when the module lives on another GPU.

Infer the device from the gate module's parameters instead:

Proposed fix

-       self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cuda")
+       device = next(self.gate.parameters()).device if hasattr(self, "gate") else "cuda"
+       self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device=device)

Alternatively, defer allocation to the first forward pass to avoid device placement assumptions.
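A minimal sketch of the lazy-allocation alternative mentioned above, assuming the counter can be created inside the gate forward hook where the routing logits' device is known. The class and method names are illustrative, not the PR's code:

```python
import torch


class _LazyExpertCounter:
    """Illustrative stand-in for the MoE wrapper; only the counting logic is sketched."""

    def __init__(self, num_experts: int, top_k: int):
        self.num_experts = num_experts
        self.top_k = top_k
        self.expert_token_count = None  # allocated lazily on the logits' device

    def gate_forward_hook(self, logits: torch.Tensor) -> None:
        if self.expert_token_count is None:
            # First call: allocate on whatever device the router logits live on.
            self.expert_token_count = torch.zeros(
                self.num_experts, dtype=torch.long, device=logits.device
            )
        _, indices = torch.topk(logits.float(), self.top_k, dim=-1)
        counts = torch.bincount(indices.reshape(-1), minlength=self.num_experts)
        self.expert_token_count += counts.to(self.expert_token_count.device)


counter = _LazyExpertCounter(num_experts=8, top_k=2)
counter.gate_forward_hook(torch.randn(4, 8))  # works on CPU; CUDA logits would work the same way
print(counter.expert_token_count)
```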
        self._count_expert_tokens = False
+       self._moe_calib_experts_ratio = None

        if num_experts == 0:
            warnings.warn(
@@ -483,36 +484,50 @@ def _gate_forward_hook(self, module, input, output):
        logits = output if not isinstance(output, tuple) else output[0]
        top_k = self.gate.top_k if hasattr(self.gate, "top_k") else self.top_k
        _, indices = torch.topk(logits.float(), top_k, dim=-1)
-       counts = torch.bincount(
-           indices.reshape(-1).cpu(), minlength=len(self.expert_token_count)
-       )
-       self.expert_token_count += counts
+       counts = torch.bincount(indices.reshape(-1), minlength=len(self.expert_token_count))
+       self.expert_token_count += counts.to(self.expert_token_count.device)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        is_calib = any(getattr(m, "_if_calib", False) for m in self.experts.modules())
-       if is_calib:
-           self._count_expert_tokens = is_calib
+       if is_calib and self._moe_calib_experts_ratio:
+           self._count_expert_tokens = True
+           assert 0 < self._moe_calib_experts_ratio <= 1, (
+               "moe_calib_experts_ratio must be between 0 and 1"
+           )
            # If any of the experts are in calibration mode, we will forward all tokens to all experts
            # This is used only for calibration, we need to re-calculate the actual outputs again using
            # the original top_k
            if TRANSFORMERS_VERSION_GE_5_0:
                assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
                original_top_k = self.gate.top_k
-               self.gate.top_k = self.gate.num_experts
+               self.gate.top_k = round(self.gate.num_experts * self._moe_calib_experts_ratio)
+               assert self.gate.top_k >= original_top_k, (
+                   f"moe_calib_experts_ratio {self._moe_calib_experts_ratio},"
+                   f" calib top_k {self.gate.top_k} smaller than original"
+                   f" top_k {original_top_k}"
+               )
Comment on lines +504 to +509

Contributor

The assertion can fail for valid small ratios. If `round(self.gate.num_experts * self._moe_calib_experts_ratio)` comes out smaller than the original `top_k`, the assertion fires and calibration aborts even though the ratio itself is legal. The assertion message blames the ratio-based calib `top_k` for being smaller than the original `top_k`, but clamping to the original value is friendlier than crashing.

Proposed fix (transformers >= 5.0 path)

-               self.gate.top_k = round(self.gate.num_experts * self._moe_calib_experts_ratio)
-               assert self.gate.top_k >= original_top_k, (
-                   f"moe_calib_experts_ratio {self._moe_calib_experts_ratio},"
-                   f" calib top_k {self.gate.top_k} smaller than original"
-                   f" top_k {original_top_k}"
-               )
+               self.gate.top_k = max(
+                   round(self.gate.num_experts * self._moe_calib_experts_ratio),
+                   original_top_k,
+               )

The same applies to the transformers < 5.0 path at lines 516–525.
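A small standalone illustration of the failure mode and the clamp, using hypothetical numbers (8 experts, an original top_k of 2, a requested ratio of 0.1):

```python
# Hypothetical numbers: a small ratio pushes the calib top_k below the model's real top_k.
num_experts = 8
original_top_k = 2
ratio = 0.1  # user asks for 10% of experts

calib_top_k = round(num_experts * ratio)          # round(0.8) == 1
assert_would_fail = calib_top_k < original_top_k  # True: 1 < 2, so the PR's assert fires

# The reviewer's max() clamp falls back to the real top_k instead of crashing.
clamped_top_k = max(round(num_experts * ratio), original_top_k)  # 2

print(calib_top_k, assert_would_fail, clamped_top_k)  # 1 True 2
```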
                super().forward(hidden_states)
                self.gate.top_k = original_top_k
            else:
                # Path for transformers < 5.0
                original_top_k = self.top_k
                if hasattr(self, "num_experts"):
-                   self.top_k = self.num_experts
+                   self.top_k = round(self.num_experts * self._moe_calib_experts_ratio)
                elif hasattr(self, "experts"):
-                   self.top_k = self.experts.num_experts
+                   self.top_k = round(self.experts.num_experts * self._moe_calib_experts_ratio)
                else:
                    raise ValueError(f"Could not find num_experts in module {self}")
+               assert self.top_k >= original_top_k, (
+                   f"moe_calib_experts_ratio {self._moe_calib_experts_ratio},"
+                   f" calib top_k {self.top_k} smaller than original"
+                   f" top_k {original_top_k}"
+               )
                super().forward(hidden_states)
                self.top_k = original_top_k
-           # Enable counting only for the real-routing forward during calibration
-           self._count_expert_tokens = is_calib
+           self._count_expert_tokens = False
+       else:
+           self._count_expert_tokens = True
        output = super().forward(hidden_states)
        self._count_expert_tokens = False
        return output
Comment on lines 490 to 533

Contributor
Clarify whether all-experts calibration should remain the default during quantization. The class docstring promises "During calibration, we forward all tokens to all experts so that all experts see sufficient tokens to calibrate" (line 445), but this behavior only activates when `_moe_calib_experts_ratio` is explicitly set; with its default of `None`, calibration falls through to normal routing. Additionally, the else block at lines 529-530 enables token counting for both inference (`is_calib` is False) and calibration runs without a ratio, so counts accumulate outside of the intended case.

Either set a default ratio (e.g., 1.0 for all experts) when entering calibration mode, or update the docstring to clarify that expanded-expert forwarding requires explicit configuration.
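A minimal sketch of the first option (fall back to a ratio of 1.0 when calibration starts without an explicit setting), using a hypothetical helper rather than the PR's actual control flow:

```python
# Standalone sketch: default to all experts when calibrating without an explicit ratio.
def effective_calib_ratio(is_calib: bool, configured_ratio: float | None) -> float | None:
    """Hypothetical helper: decide what ratio the forward pass should use."""
    if not is_calib:
        return None  # normal routing, no expansion
    # Fall back to 1.0 (all experts) so calibration matches the docstring's promise.
    return configured_ratio if configured_ratio is not None else 1.0


print(effective_calib_ratio(False, None))  # None -> normal inference routing
print(effective_calib_ratio(True, None))   # 1.0  -> all experts during calibration
print(effective_calib_ratio(True, 0.25))   # 0.25 -> user-requested subset
```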
Crash when `algorithm` is `None`. The code will crash when `moe_calib_experts_ratio` is truthy (the CLI default is 0.25) and the quantization config has `"algorithm": None` (e.g., mxfp8, mxfp6, mxfp4, mxint8, w4a8_mxfp4_fp8). At line 243, the else branch attempts `None["moe_calib_experts_ratio"] = ...`, raising a `TypeError: 'NoneType' object is not subscriptable`. Any user running a None-algorithm format (e.g., `--qformat mxfp8`) with the CLI default will immediately hit this crash.

The assignment needs a guard for the None-algorithm case. Alternatively, only inject the ratio when the model is actually an MoE model, or change the CLI default to `None` and only inject when explicitly provided.
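One possible shape for such a guard, sketched against a config dict whose "algorithm" entry may be `None` (as the comment describes for mxfp8-style formats); the dict contents here are illustrative, not the repository's actual config handling:

```python
# Hypothetical config shapes for illustration only.
quant_cfg = {"quant_cfg": {"*weight_quantizer": {"num_bits": 8}}, "algorithm": None}
moe_calib_experts_ratio = 0.25  # e.g., the CLI value

# Only inject the ratio if there is an algorithm dict to attach it to;
# a None algorithm (mxfp8-style formats) is left untouched instead of crashing.
algorithm = quant_cfg.get("algorithm")
if moe_calib_experts_ratio is not None and isinstance(algorithm, dict):
    algorithm["moe_calib_experts_ratio"] = moe_calib_experts_ratio

print(quant_cfg)  # unchanged when algorithm is None
```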