[Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT #31

MitchLewis930 wants to merge 1 commit into sample_token_ids_before

Conversation

Commit message: …avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT (vllm-project#22760)
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
Signed-off-by: Zijing Liu <liuzijing2014@users.noreply.github.com>
📝 Walkthrough

Added a GPU-to-CPU transfer optimization in GPUModelRunner using pinned memory allocation and CUDA event synchronization. Replaced the direct tensor-to-list conversion with a new helper method that uses a non-blocking copy for sampled token IDs.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
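The pattern the walkthrough describes can be sketched in a standalone form. This is an illustrative version only, not the vLLM implementation: the class name `TokenTransfer` and the helper `to_list` are hypothetical, and a CPU fallback guard is added so the sketch runs without a GPU (on CUDA, the non-blocking copy plus event sync is what avoids the stream-wide stall of a direct `.tolist()`).

```python
import torch

class TokenTransfer:
    """Minimal sketch of the event-synchronized device-to-host transfer."""

    def __init__(self, max_num_reqs: int):
        use_cuda = torch.cuda.is_available()
        self.transfer_event = torch.cuda.Event() if use_cuda else None
        # Pinned (page-locked) host memory enables truly asynchronous
        # device-to-host copies; pinning only applies when CUDA is present.
        self.sampled_token_ids_pinned_cpu = torch.empty(
            (max_num_reqs, 1),
            dtype=torch.int64,
            device="cpu",
            pin_memory=use_cuda)

    def to_list(self, sampled_token_ids: torch.Tensor) -> list:
        num_reqs = sampled_token_ids.shape[0]
        dst = self.sampled_token_ids_pinned_cpu[:num_reqs]
        # Non-blocking copy into the pinned buffer. Only this transfer is
        # awaited via the recorded event, instead of the stream-wide sync
        # that calling .tolist() directly on a GPU tensor would trigger.
        dst.copy_(sampled_token_ids, non_blocking=True)
        if self.transfer_event is not None:
            self.transfer_event.record()
            self.transfer_event.synchronize()
        return dst.tolist()
```

On a CPU-only build this degrades to a plain copy, but the control flow mirrors the PR's approach: copy into a preallocated pinned buffer, record an event on the current stream, wait on that single event, then convert the host-side slice.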
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@vllm/v1/worker/gpu_model_runner.py`, around lines 319-324: the pinned CPU buffer `self.sampled_token_ids_pinned_cpu` is allocated with the wrong first dimension. It uses `max_model_len`, but the buffer holds `sampled_token_ids`, which has shape `(num_reqs, 1)` and is bounded by `max_num_reqs`. Fix the allocation in `GPUModelRunner` (where `self.sampled_token_ids_pinned_cpu` is created) to use `(self.max_num_reqs, 1)` instead of `(self.max_model_len, 1)`, keeping `dtype=torch.int64`, `device="cpu"`, and `pin_memory=True`, so the event-synchronized copy of `sampled_token_ids` fits without waste or overflow.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
vllm/v1/worker/gpu_model_runner.py
🔇 Additional comments (2)
vllm/v1/worker/gpu_model_runner.py (2)
1698-1700: LGTM! The integration of the new `_to_list` helper method is correct. It is appropriately used only in the non-speculative decode path, where `max_gen_len == 1`.
3243-3256: Good optimization approach. The CUDA event synchronization pattern correctly avoids the stream-wide sync that a direct `.tolist()` on a GPU tensor would cause. The comment clearly explains the motivation and references the relevant issue.
```python
self.transfer_event = torch.cuda.Event()
self.sampled_token_ids_pinned_cpu = torch.empty(
    (self.max_model_len, 1),
    dtype=torch.int64,
    device="cpu",
    pin_memory=True)
```
Incorrect buffer dimension: should use `max_num_reqs` instead of `max_model_len`.

The pinned buffer is used to copy `sampled_token_ids`, which has shape `(num_reqs, 1)` where the first dimension is bounded by `max_num_reqs`, not `max_model_len`. Using `max_model_len` is semantically incorrect and can cause:

- Memory waste: `max_model_len` can be 128K+ while `max_num_reqs` is typically ~256
- Potential buffer overflow if `max_num_reqs > max_model_len` (an edge case with short-context models)
🐛 Proposed fix

```diff
 self.transfer_event = torch.cuda.Event()
 self.sampled_token_ids_pinned_cpu = torch.empty(
-    (self.max_model_len, 1),
+    (self.max_num_reqs, 1),
     dtype=torch.int64,
     device="cpu",
     pin_memory=True)
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no indentation issues. Thoroughly test and benchmark the change to ensure it meets the requirements.
```python
self.transfer_event = torch.cuda.Event()
self.sampled_token_ids_pinned_cpu = torch.empty(
    (self.max_num_reqs, 1),
    dtype=torch.int64,
    device="cpu",
    pin_memory=True)
```