25 changes: 23 additions & 2 deletions vllm/v1/worker/gpu_model_runner.py
@@ -316,6 +316,12 @@ def __init__(
# Cached outputs.
self._draft_token_ids: Optional[Union[list[list[int]],
torch.Tensor]] = None
self.transfer_event = torch.cuda.Event()
self.sampled_token_ids_pinned_cpu = torch.empty(
(self.max_model_len, 1),
dtype=torch.int64,
device="cpu",
pin_memory=True)
Comment on lines +319 to +324
⚠️ Potential issue | 🔴 Critical

Incorrect buffer dimension: should use max_num_reqs instead of max_model_len.

The pinned buffer is used to copy `sampled_token_ids`, which has shape `(num_reqs, 1)`, where the first dimension is bounded by `max_num_reqs`, not `max_model_len`. Sizing the buffer with `max_model_len` is semantically incorrect and can cause:

  1. Memory waste: max_model_len can be 128K+ while max_num_reqs is typically ~256
  2. Potential buffer overflow if max_num_reqs > max_model_len (edge case with short context models)
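The memory-waste point can be sketched with simple arithmetic. The 128K and ~256 figures below are the illustrative defaults mentioned above, not values read from any real vLLM config:

```python
# Sketch: bytes needed for a (first_dim, 1) int64 pinned buffer,
# comparing the oversized and right-sized allocations.
ITEMSIZE = 8  # bytes per torch.int64 element


def pinned_buffer_bytes(first_dim: int, itemsize: int = ITEMSIZE) -> int:
    """Size in bytes of a (first_dim, 1) int64 buffer."""
    return first_dim * 1 * itemsize


max_model_len = 131_072  # 128K context, illustrative
max_num_reqs = 256       # typical request bound, illustrative

oversized = pinned_buffer_bytes(max_model_len)   # 1 MiB of pinned memory
right_sized = pinned_buffer_bytes(max_num_reqs)  # 2 KiB of pinned memory
print(oversized // right_sized)  # 512x larger than necessary
```

Pinned (page-locked) host memory is a scarcer resource than pageable memory, which is why over-allocating it by orders of magnitude matters even when 1 MiB sounds small.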
🐛 Proposed fix
         self.transfer_event = torch.cuda.Event()
         self.sampled_token_ids_pinned_cpu = torch.empty(
-            (self.max_model_len, 1),
+            (self.max_num_reqs, 1),
             dtype=torch.int64,
             device="cpu",
             pin_memory=True)


def _make_buffer(self, *args, dtype: torch.dtype) -> CpuGpuBuffer:
return CpuGpuBuffer(*args,
@@ -1691,7 +1697,7 @@ def execute_model(
max_gen_len = sampled_token_ids.shape[-1]
if max_gen_len == 1:
# No spec decode tokens.
valid_sampled_token_ids = sampled_token_ids.tolist()
valid_sampled_token_ids = self._to_list(sampled_token_ids)
else:
# Includes spec decode tokens.
valid_sampled_token_ids = self.rejection_sampler.parse_output(
@@ -2219,7 +2225,7 @@ def _dummy_run(
- CUDAGraphMode.PIECEWISE: Piecewise cudagraph.
- CUDAGraphMode.FULL: Full cudagraph, attention metadata is
needed.
force_attention: If True, always create attention metadata. Used to
force_attention: If True, always create attention metadata. Used to
warm up attention backend when mode is NONE.
uniform_decode: If True, the batch is a uniform decode batch.
skip_eplb: If True, skip EPLB state update.
@@ -3233,3 +3239,18 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]:
mamba_type=mamba_module.mamba_type)

return kv_cache_spec

def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
# This is a short-term mitigation for the issue described in
# https://github.com/vllm-project/vllm/issues/22754.
# `tolist` triggers a device-wide stream sync, which blocks
# copy ops issued on other CUDA streams. Syncing on a CUDA
# event scoped to this transfer avoids that. Since this code
# is on the critical path of every model forward loop, the
# global sync caused a perf regression in disaggregated
# serving setups.
pinned = self.sampled_token_ids_pinned_cpu[:sampled_token_ids.shape[0]]
pinned.copy_(sampled_token_ids, non_blocking=True)
self.transfer_event.record()
self.transfer_event.synchronize()
return pinned.tolist()
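The pattern in `_to_list` — preallocate a reusable buffer, issue an async copy, then wait on an event tied to that one transfer instead of draining the whole stream — can be illustrated without a GPU. The sketch below uses `threading.Event` as a stand-in for `torch.cuda.Event` and a worker thread as a stand-in for a CUDA copy stream; `TransferBuffer` and its methods are illustrative names, not vLLM or PyTorch APIs:

```python
import threading


class TransferBuffer:
    """CPU-side analogue of the pinned-buffer + event pattern above.

    A preallocated buffer receives an asynchronous copy; an event is
    recorded when that copy finishes, and the caller waits on that
    event only, rather than on every outstanding operation (which is
    what a device-wide sync like `tolist` would do).
    """

    def __init__(self, max_rows: int):
        # Stands in for the preallocated pinned CPU tensor.
        self.buffer = [[0] for _ in range(max_rows)]
        # Stands in for torch.cuda.Event.
        self.transfer_event = threading.Event()

    def _async_copy(self, rows):
        # Runs on a worker thread, like a non_blocking D2H copy.
        for i, row in enumerate(rows):
            self.buffer[i] = list(row)
        self.transfer_event.set()  # "record" the event on completion

    def to_list(self, rows):
        self.transfer_event.clear()
        worker = threading.Thread(target=self._async_copy, args=(rows,))
        worker.start()
        self.transfer_event.wait()  # sync on this transfer only
        worker.join()
        # Slice to the live region, like pinned[:shape[0]] above.
        return self.buffer[: len(rows)]


buf = TransferBuffer(max_rows=256)
print(buf.to_list([[7], [42]]))  # [[7], [42]]
```

The key property mirrored here is scoping: the waiter blocks only until its own transfer completes, so unrelated copies (other threads here, other CUDA streams in the real code) proceed unimpeded.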