
[Bugfix] Improve GPU validation logging in Ray fallback scenarios (#2…#29

Open

MitchLewis930 wants to merge 1 commit into gpu_ray_before from gpu_ray_after

Conversation

MitchLewis930 (Collaborator) commented Jan 24, 2026

test

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced error messages when GPU allocation fails, providing specific GPU requirements and remediation guidance.
    • Added pre-validation warnings during Ray cluster initialization to detect insufficient GPU availability and prevent runtime failures.


coderabbitai bot commented Jan 24, 2026

📝 Walkthrough

These changes enhance error handling and resource validation in Ray-based parallel execution. A Ray-availability check is replaced with explicit GPU validation in configuration, while Ray cluster initialization gains improved error messaging and GPU availability pre-checks to provide users with clearer diagnostics.

Changes

Cohort / File(s) — Summary

• GPU Validation in Parallel Config (vllm/config/parallel.py): Removed the conditional Ray availability check for multi-node inference. Replaced it with a direct ValueError when the CUDA tensor world size exceeds the available GPU count, including GPU count details in the error message.
• Ray Utilities Enhanced Diagnostics (vllm/v1/executor/ray_utils.py): Improved _wait_until_pg_ready with GPU requirements computed from the placement group specs and detailed error messages including tensor-parallel guidance. Added pre-validation in initialize_ray_cluster that logs a warning when the world size exceeds the available GPUs before Ray processing begins.
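The pre-validation behavior summarized above can be sketched as a standalone helper. This is an illustrative sketch with hypothetical names, not the vLLM code: the real pre-check queries cuda_device_count_stateless() itself rather than taking the count as a parameter, and it runs inline in the cluster-initialization path.

```python
import logging

logger = logging.getLogger("ray_gpu_precheck_sketch")


def prevalidate_gpu_requirements(world_size: int, available_gpus: int) -> bool:
    """Return True if the requested world size fits on the visible GPUs.

    Logs a warning rather than raising, so Ray can still attempt to
    satisfy the placement group from remote nodes in a multi-node cluster.
    """
    if world_size > 1 and world_size > available_gpus:
        logger.warning(
            "world_size (%d) exceeds available GPUs (%d); Ray placement "
            "group allocation may fail.",
            world_size,
            available_gpus,
        )
        return False
    return True
```

Warning instead of raising matters here: locally visible GPUs are only a lower bound on what the Ray cluster may provide, so a hard error would be wrong for genuinely multi-node setups.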

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 No more silent Ray falls,
Clear GPU counts now on the walls,
Pre-checks warn with wisdom's call,
Better errors catch them all,
Parallel paths need not stall! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
• Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
• Description Check — ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
• Title Check — ✅ Passed: The title accurately describes the main change: improving GPU validation logging in Ray fallback scenarios, which aligns with the modifications to GPU validation and error handling in both config/parallel.py and v1/executor/ray_utils.py.


✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@vllm/v1/executor/ray_utils.py`:
- Around line 323-338: Update the incorrect import and clarify the warning:
change the import of cuda_device_count_stateless to come from
vllm.utils.torch_utils (replace "from vllm.utils import
cuda_device_count_stateless" with the correct module import), and update the
logger.warning message that references parallel_config.world_size to state
"world_size (tensor_parallel_size × pipeline_parallel_size)" instead of "Tensor
parallel size" so it accurately describes parallel_config.world_size when
logging available_gpus vs parallel_config.world_size in the prevalidation block
that uses cuda_device_count_stateless and parallel_config.world_size.
🧹 Nitpick comments (1)
vllm/config/parallel.py (1)

521-528: Redundant call to cuda_device_count_stateless() and misleading error message.

Two issues:

  1. cuda_device_count_stateless() is called on line 522 for the condition check and again on line 524. While the function is cached, this is still redundant.
  2. The error message refers to "Tensor parallel size" but self.world_size is TPxPP (tensor × pipeline parallel), which could mislead users using pipeline parallelism.
Suggested fix

```diff
-            elif (
-                current_platform.is_cuda()
-                and cuda_device_count_stateless() < self.world_size
-            ):
-                gpu_count = cuda_device_count_stateless()
-                raise ValueError(
-                    f"Tensor parallel size ({self.world_size}) cannot be "
-                    f"larger than the number of available GPUs ({gpu_count})."
-                )
+            elif current_platform.is_cuda():
+                gpu_count = cuda_device_count_stateless()
+                if gpu_count < self.world_size:
+                    raise ValueError(
+                        f"world_size ({self.world_size}) cannot be larger than "
+                        f"the number of available GPUs ({gpu_count}). "
+                        f"Reduce tensor_parallel_size or pipeline_parallel_size."
+                    )
+            elif self.data_parallel_backend == "ray":
```

Note: The diff above shows the conceptual change. The actual fix needs to preserve the subsequent elif branches properly.
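In isolation, the hoisted-call pattern the nitpick recommends — query the GPU count once and branch on the local — can be exercised like this. This is an illustrative standalone function with hypothetical names, not the vLLM code itself:

```python
def check_world_size(world_size: int, gpu_count: int) -> None:
    """Raise if the configured world size cannot fit on the visible GPUs.

    The count is taken as a parameter so the (cached) device query runs
    exactly once at the call site, and the same local value is reused in
    both the comparison and the error message.
    """
    if gpu_count < world_size:
        raise ValueError(
            f"world_size ({world_size}) cannot be larger than the number "
            f"of available GPUs ({gpu_count}). Reduce tensor_parallel_size "
            f"or pipeline_parallel_size."
        )


# A configuration that fits passes silently; an oversized one raises.
check_world_size(2, 4)
```

Naming the quantity world_size in the message also sidesteps the "Tensor parallel size" ambiguity the review flags, since the same value covers pipeline parallelism too.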

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c01f6e5 and 7437438.

📒 Files selected for processing (2)
  • vllm/config/parallel.py
  • vllm/v1/executor/ray_utils.py
🧰 Additional context used
🧬 Code graph analysis (2)
vllm/config/parallel.py (1)
vllm/utils/torch_utils.py (1)
  • cuda_device_count_stateless (414-424)
vllm/v1/executor/ray_utils.py (1)
vllm/utils/torch_utils.py (1)
  • cuda_device_count_stateless (414-424)
🪛 Ruff (0.14.13)
vllm/config/parallel.py

525-528: Avoid specifying long messages outside the exception class

(TRY003)

vllm/v1/executor/ray_utils.py

266-276: Avoid specifying long messages outside the exception class

(TRY003)


278-284: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
vllm/v1/executor/ray_utils.py (1)

258-284: LGTM! Well-structured error messaging for GPU resource failures.

The conditional logic to provide more detailed error messages for multi-GPU scenarios (> 1 GPU required) is a good UX improvement. The guidance about tensor parallel size and cluster resources will help users diagnose placement group allocation failures more quickly.


Comment on lines +323 to +338

```python
    # Prevalidate GPU requirements before Ray processing
    if current_platform.is_cuda() and parallel_config.world_size > 1:
        from vllm.utils import cuda_device_count_stateless

        available_gpus = cuda_device_count_stateless()
        if parallel_config.world_size > available_gpus:
            logger.warning(
                "Tensor parallel size (%d) exceeds available GPUs (%d). "
                "This may result in Ray placement group allocation failures. "
                "Consider reducing tensor_parallel_size to %d or less, "
                "or ensure your Ray cluster has %d GPUs available.",
                parallel_config.world_size,
                available_gpus,
                available_gpus,
                parallel_config.world_size,
            )
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Check if cuda_device_count_stateless is exported from vllm.utils
rg -n "cuda_device_count_stateless" --type py vllm/utils/ | head -20

Repository: Signal65/vllm-code-review

Length of output: 349


🏁 Script executed:

# Check the parallel_config definition to understand world_size
fd -e py -p "parallel_config" vllm/ | head -5

Repository: Signal65/vllm-code-review

Length of output: 51


🏁 Script executed:

# Search for world_size definition in parallel_config
rg -n "world_size" --type py vllm/ -A 2 | grep -E "(def|world_size|=)" | head -30

Repository: Signal65/vllm-code-review

Length of output: 2699


🏁 Script executed:

# Check the specific file mentioned in the review
cat -n vllm/v1/executor/ray_utils.py | sed -n '320,340p'

Repository: Signal65/vllm-code-review

Length of output: 1108


🏁 Script executed:

# Check if cuda_device_count_stateless is re-exported from vllm/utils/__init__.py
rg -n "cuda_device_count_stateless" vllm/utils/__init__.py

Repository: Signal65/vllm-code-review

Length of output: 51


🏁 Script executed:

# Find the ParallelConfig class definition
fd -e py "parallel_config" vllm/ -x grep -l "class.*ParallelConfig"

Repository: Signal65/vllm-code-review

Length of output: 51


🏁 Script executed:

# Search for world_size property or attribute definition in parallel config
rg -n "def world_size|world_size\s*=" --type py vllm/ -B 2 -A 2 | head -50

Repository: Signal65/vllm-code-review

Length of output: 3976


🏁 Script executed:

# Search for ParallelConfig class definition more broadly
rg -n "class ParallelConfig" --type py vllm/

Repository: Signal65/vllm-code-review

Length of output: 117


🏁 Script executed:

# Check if cuda_device_count_stateless is available via direct import from torch_utils
cat vllm/utils/torch_utils.py | grep -n "def cuda_device_count_stateless" -A 10

Repository: Signal65/vllm-code-review

Length of output: 628


🏁 Script executed:

# Search vllm/utils/__init__.py to see what is exported
head -100 vllm/utils/__init__.py

Repository: Signal65/vllm-code-review

Length of output: 3202


🏁 Script executed:

# Check how parallel_config.world_size is used/defined
rg -n "parallel_config\.world_size" --type py vllm/ -B 1 -A 1 | head -40

Repository: Signal65/vllm-code-review

Length of output: 2649


🏁 Script executed:

# Check the ParallelConfig class definition to see world_size
sed -n '68,200p' vllm/config/parallel.py

Repository: Signal65/vllm-code-review

Length of output: 6774


🏁 Script executed:

# Verify the import path - try to find if cuda_device_count_stateless is actually accessible via vllm.utils
rg -n "from vllm\.utils import cuda_device_count_stateless" --type py vllm/

Repository: Signal65/vllm-code-review

Length of output: 161


🏁 Script executed:

# Check what's actually exported from vllm/utils or torch_utils
rg -n "__all__" vllm/utils/torch_utils.py

Repository: Signal65/vllm-code-review

Length of output: 51


🏁 Script executed:

# Search for world_size as a property or method in ParallelConfig
rg -n "def world_size|@property" vllm/config/parallel.py -A 3 | head -50

Repository: Signal65/vllm-code-review

Length of output: 637


🏁 Script executed:

# Check if there's a __getattr__ or similar in ParallelConfig
rg -n "__getattr__|world_size" vllm/config/parallel.py

Repository: Signal65/vllm-code-review

Length of output: 982


🏁 Script executed:

# Look at the full ParallelConfig class to find world_size
cat vllm/config/parallel.py | grep -n "world_size" -B 2 -A 2

Repository: Signal65/vllm-code-review

Length of output: 2298


🏁 Script executed:

# Check if cuda_device_count_stateless might be re-exported dynamically or if the import works
python3 << 'EOF'
import sys
try:
    from vllm.utils import cuda_device_count_stateless
    print("✓ Import SUCCESS: cuda_device_count_stateless is accessible from vllm.utils")
except ImportError as e:
    print(f"✗ Import FAILED: {e}")
except Exception as e:
    print(f"✗ Unexpected error: {e}")
EOF

Repository: Signal65/vllm-code-review

Length of output: 155


🏁 Script executed:

# Check the imports at the top of vllm/config/parallel.py to see how cuda_device_count_stateless is imported
head -50 vllm/config/parallel.py | grep -E "^from|^import"

Repository: Signal65/vllm-code-review

Length of output: 647


Fix incorrect import path and misleading warning message.

Two issues:

  1. Import path is incorrect (line 325): The import from vllm.utils import cuda_device_count_stateless will fail. Should be from vllm.utils.torch_utils import cuda_device_count_stateless to match how it's imported elsewhere (e.g., vllm/config/parallel.py).

  2. Warning message is misleading: parallel_config.world_size is defined as tensor_parallel_size × pipeline_parallel_size, not just tensor parallel. The warning says "Tensor parallel size" but should clarify it's the world_size.
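For concreteness, the distinction the second point makes can be shown with hypothetical configuration values (illustrative numbers only, not from this PR):

```python
# world_size is the product of the tensor-parallel and pipeline-parallel
# degrees, not the tensor-parallel degree alone.
tensor_parallel_size = 4
pipeline_parallel_size = 2
world_size = tensor_parallel_size * pipeline_parallel_size  # 8

# With only 4 GPUs visible, the pre-check fires even though
# tensor_parallel_size on its own would fit — so a warning that says
# "Tensor parallel size (8)" would misreport the user's setting.
available_gpus = 4
needs_warning = world_size > available_gpus
```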

Suggested fix

```diff
     if current_platform.is_cuda() and parallel_config.world_size > 1:
-        from vllm.utils import cuda_device_count_stateless
+        from vllm.utils.torch_utils import cuda_device_count_stateless

         available_gpus = cuda_device_count_stateless()
         if parallel_config.world_size > available_gpus:
             logger.warning(
-                "Tensor parallel size (%d) exceeds available GPUs (%d). "
+                "Required GPU count (world_size=%d) exceeds available GPUs (%d). "
                 "This may result in Ray placement group allocation failures. "
-                "Consider reducing tensor_parallel_size to %d or less, "
+                "Consider reducing tensor_parallel_size or pipeline_parallel_size "
+                "so that their product is %d or less, "
                 "or ensure your Ray cluster has %d GPUs available.",
                 parallel_config.world_size,
                 available_gpus,
                 available_gpus,
                 parallel_config.world_size,
             )
```

