Introduce LLM class for offline inference (#115)#28
MitchLewis930 wants to merge 1 commit into `request_id_before`
Conversation
📝 Walkthrough

The PR refactors server configuration management by introducing a centralized `ServerArgs` class.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cacheflow/entrypoints/fastapi_server.py (1)
121-126: Use `parallel_config.use_ray` to avoid mismatched Ray settings.

If tensor/pipeline parallelism forces Ray on (`ParallelConfig` sets `use_ray=True`), passing `args.use_ray` here can keep it False and allocate the actor with the wrong resource spec. Prefer the derived config so behavior matches the actual cluster init. Line 125.

🐛 Proposed fix
```diff
-    server = FastAPIServer(args.use_ray, *server_configs,
+    server = FastAPIServer(parallel_config.use_ray, *server_configs,
                            distributed_init_method, stage_devices,
                            log_stats=not args.disable_log_stats)
```

cacheflow/outputs.py (1)
32-39: Breaking change: `RequestOutput` now requires a `done` parameter with no default value.

All internal instantiations have been properly updated to use the `from_seq_group()` factory method, which correctly passes `seq_group.is_finished()` as the `done` argument. However, since `RequestOutput` is part of the public API (exported in `cacheflow/__init__.py`), external users doing direct instantiation will encounter a `TypeError`. Consider adding a default value for backward compatibility:

💡 Backward compatibility fix
```diff
-        done: bool,
+        done: bool = False,
```
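To illustrate the compatibility point, here is a minimal, self-contained sketch; `RequestOutputSketch` is a hypothetical stand-in for the real class, used only to show that a default on the new field keeps pre-existing call sites working.

```python
from dataclasses import dataclass


@dataclass
class RequestOutputSketch:
    # Hypothetical stand-in for RequestOutput: the shape of the class is
    # invented, only the `done` default illustrates the suggested fix.
    request_id: str
    text: str
    done: bool = False  # default preserves older call signatures


# Call sites written before `done` existed still construct successfully.
old_style = RequestOutputSketch("req-0", "hello")

# New call sites can opt in explicitly.
new_style = RequestOutputSketch("req-1", "world", done=True)
```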
🤖 Fix all issues with AI agents
In `@cacheflow/entrypoints/llm.py`:
- Around lines 1-4: The module imports tqdm at top level, which forces the dependency even when it is not used. Import tqdm lazily, only when needed (e.g., inside the function or block that reads the `use_tqdm` flag), or wrap the import in a try/except and fall back to a no-op iterator. Update any references to tqdm to use the local name where appropriate (search for `use_tqdm` and the functions that produce progress bars) so the module can be imported without tqdm installed unless progress is requested.
🧹 Nitpick comments (3)
cacheflow/config.py (2)
6-6: Make swap-space units explicit at assignment.

Small clarity win now that the stored value is bytes while the input is GiB.
♻️ Proposed tweak
```diff
-        self.swap_space_bytes = swap_space * _GiB
+        self.swap_space_bytes = swap_space * _GiB  # swap_space is in GiB
```

Also applies to: 75-75
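The conversion itself is plain integer arithmetic. A stand-alone illustration (the helper name `swap_space_to_bytes` is made up for this example; the `_GiB` constant matches the one the review cites from `config.py`):

```python
_GiB = 1 << 30  # 1 GiB = 2**30 bytes


def swap_space_to_bytes(swap_space_gib: int) -> int:
    """Convert a swap-space size given in GiB to bytes."""
    return swap_space_gib * _GiB
```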
143-145: Improve unknown-dtype error clarity.

Including supported values makes misconfigurations easier to diagnose.
♻️ Proposed tweak
```diff
-            raise ValueError(f"Unknown dtype: {dtype}")
+            raise ValueError(
+                f"Unknown dtype: {dtype}. Supported: "
+                f"{', '.join(_STR_DTYPE_TO_TORCH_DTYPE)}")
```

cacheflow/entrypoints/llm.py (1)
34-62: Return outputs in prompt order (or document otherwise).
`outputs` are appended in completion order, which can differ from input order and surprise callers. Consider returning results aligned with the input prompt order (or explicitly document ordering).

♻️ Suggested ordering fix
```diff
     # Add requests to the server.
+    request_ids: List[str] = []
     for prompt in prompts:
         request_id = str(next(self.request_counter))
+        request_ids.append(request_id)
         self.llm_server.add_request(request_id, prompt, sampling_params)

-    outputs: List[RequestOutput] = []
+    outputs_by_id: dict[str, RequestOutput] = {}
     while self.llm_server.has_unfinished_requests():
         step_outputs = self.llm_server.step()
         for output in step_outputs:
             if output.done:
-                outputs.append(output)
+                outputs_by_id[output.request_id] = output
                 if use_tqdm:
                     pbar.update(1)
     if use_tqdm:
         pbar.close()
-    return outputs
+    return [outputs_by_id[req_id] for req_id in request_ids]
```
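To show why the re-alignment matters, here is a self-contained toy; the `_FakeServer` class is invented for this sketch. Requests finish in reverse (LIFO) order, yet results come back in submission order.

```python
from typing import Dict, List


class _FakeServer:
    """Toy stand-in for an LLM server that finishes requests LIFO,
    i.e. in the opposite of submission order."""

    def __init__(self) -> None:
        self._pending: List[str] = []

    def add_request(self, request_id: str) -> None:
        self._pending.append(request_id)

    def has_unfinished_requests(self) -> bool:
        return bool(self._pending)

    def step(self) -> List[str]:
        # Finish the most recently added request first.
        return [self._pending.pop()]


def generate_in_order(prompts: List[str]) -> List[str]:
    server = _FakeServer()
    request_ids = [str(i) for i in range(len(prompts))]
    for rid in request_ids:
        server.add_request(rid)
    done_by_id: Dict[str, str] = {}
    while server.has_unfinished_requests():
        for rid in server.step():
            done_by_id[rid] = f"output-for-{rid}"
    # Re-align completion-order results with submission order.
    return [done_by_id[rid] for rid in request_ids]
```

Without the final re-alignment line, this toy would return results in reverse, which is exactly the surprise the comment warns about.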
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- cacheflow/__init__.py
- cacheflow/config.py
- cacheflow/entrypoints/fastapi_server.py
- cacheflow/entrypoints/llm.py
- cacheflow/outputs.py
- cacheflow/server/arg_utils.py
- cacheflow/server/llm_server.py
- examples/offline_inference.py
- examples/simple_server.py
🧰 Additional context used
🧬 Code graph analysis (6)
cacheflow/entrypoints/fastapi_server.py (1)
cacheflow/server/arg_utils.py (4)
ServerArgs (11-59)
add_cli_args (32-35)
from_cli_args (38-43)
create_server_configs (45-59)
cacheflow/outputs.py (1)
cacheflow/sequence.py (2)
SequenceGroup (130-168)
is_finished (162-163)
cacheflow/server/llm_server.py (2)
cacheflow/server/arg_utils.py (2)
ServerArgs (11-59)
create_server_configs (45-59)

cacheflow/server/ray_utils.py (1)

initialize_cluster (14-90)
examples/offline_inference.py (2)
cacheflow/entrypoints/llm.py (2)
LLM (12-62)
generate (34-62)

cacheflow/sampling_params.py (1)

SamplingParams (5-137)
examples/simple_server.py (2)
cacheflow/server/arg_utils.py (3)
ServerArgs (11-59)
from_cli_args (38-43)
add_cli_args (32-35)

cacheflow/server/llm_server.py (2)

LLMServer (25-254)
from_server_args (113-122)
cacheflow/server/arg_utils.py (1)
cacheflow/config.py (4)
CacheConfig (65-79)
ModelConfig (9-62)
ParallelConfig (82-102)
SchedulerConfig (105-113)
🪛 Ruff (0.14.13)
cacheflow/config.py
144-144: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (13)
cacheflow/server/arg_utils.py (4)
10-30: Clean consolidation of server defaults.
31-43: CLI plumbing looks solid.
45-59: Config construction flow is clear and consistent.
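The pattern these comments praise can be sketched in miniature. Everything below (class name, fields, defaults) is illustrative, not the actual `ServerArgs` implementation: one dataclass owns the defaults, mirrors them as CLI flags, and rebuilds itself from the parsed namespace.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class ServerArgsSketch:
    """Miniature, hypothetical version of the ServerArgs pattern."""
    model: str = "dummy-model"
    swap_space: int = 4  # GiB
    disable_log_stats: bool = False

    @staticmethod
    def add_cli_args(
            parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        # Defaults come from the dataclass, so they live in one place.
        parser.add_argument("--model", type=str,
                            default=ServerArgsSketch.model)
        parser.add_argument("--swap-space", type=int,
                            default=ServerArgsSketch.swap_space)
        parser.add_argument("--disable-log-stats", action="store_true")
        return parser

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace) -> "ServerArgsSketch":
        # argparse stores "--swap-space" as "swap_space", matching the
        # dataclass field names, so the round trip is mechanical.
        names = [f.name for f in fields(cls)]
        return cls(**{name: getattr(args, name) for name in names})
```

A caller then does `ServerArgsSketch.from_cli_args(parser.parse_args())` and hands the resulting object to a `create_server_configs`-style builder, which is the flow the comments above describe.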
62-117: CLI argument wiring aligns well with the new ServerArgs defaults.

examples/simple_server.py (1)
4-10: Nice migration to ServerArgs-based initialization.

Also applies to: 42-42
examples/offline_inference.py (1)
1-23: Clear, minimal offline inference example.

cacheflow/entrypoints/fastapi_server.py (1)
15-15: Import update looks good.

cacheflow/__init__.py (1)
1-14: Public exports updated cleanly.

cacheflow/outputs.py (1)
46-74: `from_seq_group` now respects subclasses.

Using `cls(...)` is the right move for extensibility.

cacheflow/server/llm_server.py (3)
15-16: ServerArgs-based constructor is a clean entrypoint.

Centralizing config + cluster init here is a good API improvement.

Also applies to: 112-122
95-95: Unit conversion is correctly implemented—no action needed.

The field `swap_space_bytes` is properly defined in `CacheConfig` and the conversion from GiB to bytes is correctly handled: `ServerArgs.swap_space` (in GiB) is passed to `CacheConfig.__init__()`, which converts it to bytes via `self.swap_space_bytes = swap_space * _GiB` (where `_GiB = 1 << 30`). The usage in `llm_server.py:95` is correct.
35-35: No breaking change—all instantiation already passes `log_stats`.

The factory method `from_server_args()` explicitly passes `log_stats=not server_args.disable_log_stats` (line 120), and all call sites in the codebase use either this factory method or `ray.remote()` class wrapping (which doesn't instantiate directly). No direct constructor calls omit the parameter.

cacheflow/entrypoints/llm.py (1)
14-33: LLM initialization wiring looks solid.

ServerArgs construction + default log-stats disabling are clear and consistent.
```python
from typing import List, Optional

from tqdm import tqdm
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check for tqdm in packaging metadata
git ls-files | grep -E '(pyproject\.toml|setup\.py|setup\.cfg|requirements.*\.txt|Pipfile)' | head -20
```

Repository: Signal65/vllm-code-review
Length of output: 94
🏁 Script executed:

```bash
#!/bin/bash
# Search for tqdm in common package declaration files
rg "tqdm" -t toml -t py -t txt 2>/dev/null | head -30
```

Length of output: 1098
🏁 Script executed:

```bash
#!/bin/bash
# Examine the actual llm.py file
cat cacheflow/entrypoints/llm.py 2>/dev/null | head -70
```

Length of output: 1963
🏁 Script executed:

```bash
#!/bin/bash
# Check requirements.txt
cat requirements.txt
```

Length of output: 256
🏁 Script executed:

```bash
#!/bin/bash
# Check setup.py for tqdm
cat setup.py
```

Length of output: 2635
Add tqdm to dependencies or use lazy import.
tqdm is imported unconditionally at module level but is not declared in requirements.txt or setup.py. This breaks module imports unless tqdm is installed separately, even when use_tqdm=False. Either add tqdm to dependencies or defer the import until needed.
💡 Optional lazy import approach

```diff
-from tqdm import tqdm
+# Defer tqdm import until needed to keep it optional.
```

```diff
-    if use_tqdm:
-        pbar = tqdm(total=len(prompts), desc="Processed prompts")
+    if use_tqdm:
+        try:
+            from tqdm import tqdm
+        except ImportError as exc:
+            raise ImportError(
+                "tqdm is required when use_tqdm=True"
+            ) from exc
+        pbar = tqdm(total=len(prompts), desc="Processed prompts")
```
+ pbar = tqdm(total=len(prompts), desc="Processed prompts")🤖 Prompt for AI Agents
In `@cacheflow/entrypoints/llm.py` around lines 1 - 4, The module imports tqdm at
top-level which forces the dependency even when not used; modify the code to
lazily import tqdm only when needed (e.g., inside the function or block that
reads the use_tqdm flag) or wrap the import in a try/except and fall back to a
no-op iterator; update any references to tqdm in the code to use the
local/imported name where appropriate (search for use_tqdm and the function(s)
that produce progress bars) so the module can be imported without having tqdm
installed unless progress is requested.
Summary by CodeRabbit
Release Notes
New Features
Improvements
API Changes