Introduce LLM class for offline inference (#115) #28

Open
MitchLewis930 wants to merge 1 commit into request_id_before from request_id_after

Conversation


@MitchLewis930 MitchLewis930 commented Jan 24, 2026

test

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced LLM helper class for simplified language model interaction.
    • Added ServerArgs for streamlined server configuration management.
    • Added offline inference example demonstrating library usage.
  • Improvements

    • Enhanced dtype validation with stricter error handling.
    • Refactored server initialization pipeline.
  • API Changes

    • Updated cache configuration representation.
    • Streamlined public API exports.



coderabbitai bot commented Jan 24, 2026

📝 Walkthrough

Walkthrough

The PR refactors server configuration management by introducing a centralized ServerArgs dataclass to replace scattered argument utilities. It adds a new LLM convenience class for batch inference, updates the public API surface, improves type safety with swap_space_bytes and stricter dtype validation, and adds supporting examples.

Changes

Cohort / File(s) Summary
Public API Restructuring
cacheflow/__init__.py
Added LLM and ServerArgs exports; removed add_server_arguments, create_server_configs_from_args, and initialize_server_from_args from public API.
Type Safety & Validation
cacheflow/config.py
Introduced swap_space_bytes (int) attribute in CacheConfig replacing swap_space; added explicit dtype validation in _get_and_verify_dtype() to raise ValueError for unknown dtypes.
RequestOutput Structure
cacheflow/outputs.py
Made done parameter required in RequestOutput.__init__() (removed default); changed from_seq_group() from @staticmethod to @classmethod with updated object instantiation pattern.
Server Configuration & Initialization
cacheflow/server/arg_utils.py, cacheflow/server/llm_server.py, cacheflow/entrypoints/fastapi_server.py
Introduced ServerArgs dataclass consolidating server configuration with add_cli_args(), from_cli_args(), and create_server_configs() methods; added LLMServer.from_server_args() classmethod; removed log_stats default parameter; updated fastapi_server.py to use new ServerArgs interface.
New LLM Interface
cacheflow/entrypoints/llm.py
Added LLM helper class with __init__() (accepting model, tensor_parallel_size, dtype, seed, kwargs) and generate() method (batch processing with optional tqdm progress reporting).
Usage Examples
examples/offline_inference.py, examples/simple_server.py
New offline inference example demonstrating LLM.generate() API; updated server example to use ServerArgs.from_cli_args() and LLMServer.from_server_args().

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A hop and a bound, the configs align,
With ServerArgs gleaming, the structure's divine!
New LLM class hops swift with a generate cheer,
Type safety and clarity—oh what a year! 🌟✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.88%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: introducing a new LLM class for offline inference, which is the primary feature added in cacheflow/entrypoints/llm.py.



@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cacheflow/entrypoints/fastapi_server.py (1)

121-126: Use parallel_config.use_ray to avoid mismatched Ray settings.

If tensor/pipeline parallelism forces Ray on (ParallelConfig sets use_ray=True), passing args.use_ray here can keep it False and allocate the actor with the wrong resource spec. Prefer the derived config so behavior matches the actual cluster init. Line 125.

🐛 Proposed fix
-    server = FastAPIServer(args.use_ray, *server_configs,
+    server = FastAPIServer(parallel_config.use_ray, *server_configs,
                            distributed_init_method, stage_devices,
                            log_stats=not args.disable_log_stats)
cacheflow/outputs.py (1)

32-39: Breaking change: RequestOutput now requires done parameter with no default value.

All internal instantiations have been properly updated to use the from_seq_group() factory method, which correctly passes seq_group.is_finished() as the done argument. However, since RequestOutput is part of the public API (exported in cacheflow/__init__.py), external users doing direct instantiation will encounter TypeError. Consider adding a default value for backward compatibility:

💡 Backward compatibility fix
-        done: bool,
+        done: bool = False,
🤖 Fix all issues with AI agents
In `@cacheflow/entrypoints/llm.py`:
- Around line 1-4: The module imports tqdm at top-level which forces the
dependency even when not used; modify the code to lazily import tqdm only when
needed (e.g., inside the function or block that reads the use_tqdm flag) or wrap
the import in a try/except and fall back to a no-op iterator; update any
references to tqdm in the code to use the local/imported name where appropriate
(search for use_tqdm and the function(s) that produce progress bars) so the
module can be imported without having tqdm installed unless progress is
requested.
🧹 Nitpick comments (3)
cacheflow/config.py (2)

6-6: Make swap-space units explicit at assignment.

Small clarity win now that the stored value is bytes while the input is GiB.

♻️ Proposed tweak
-        self.swap_space_bytes = swap_space * _GiB
+        self.swap_space_bytes = swap_space * _GiB  # swap_space is in GiB

Also applies to: 75-75


143-145: Improve unknown-dtype error clarity.

Including supported values makes misconfigurations easier to diagnose.

♻️ Proposed tweak
-            raise ValueError(f"Unknown dtype: {dtype}")
+            raise ValueError(
+                f"Unknown dtype: {dtype}. Supported: "
+                f"{', '.join(_STR_DTYPE_TO_TORCH_DTYPE)}")
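
A self-contained sketch of the stricter validation the tweak above aims at (the mapping and function name here are illustrative, not the real cacheflow table): look the string up in a supported-dtype map and raise a ValueError that lists the valid options.

```python
# Illustrative dtype table; the real one maps to torch dtypes.
_STR_DTYPE_TO_TORCH_DTYPE = {"half": "float16", "float": "float32"}

def get_and_verify_dtype(dtype: str) -> str:
    # Reject unknown dtypes with a message that names the supported values,
    # so misconfigurations are easy to diagnose.
    if dtype not in _STR_DTYPE_TO_TORCH_DTYPE:
        raise ValueError(
            f"Unknown dtype: {dtype}. Supported: "
            f"{', '.join(_STR_DTYPE_TO_TORCH_DTYPE)}")
    return _STR_DTYPE_TO_TORCH_DTYPE[dtype]

print(get_and_verify_dtype("half"))  # → float16
```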
cacheflow/entrypoints/llm.py (1)

34-62: Return outputs in prompt order (or document otherwise).

outputs are appended in completion order, which can differ from input order and surprise callers. Consider returning results aligned with the input prompt order (or explicitly document ordering).

♻️ Suggested ordering fix
-        # Add requests to the server.
-        for prompt in prompts:
+        # Add requests to the server.
+        request_ids: List[str] = []
+        for prompt in prompts:
             request_id = str(next(self.request_counter))
+            request_ids.append(request_id)
             self.llm_server.add_request(request_id, prompt, sampling_params)

-        outputs: List[RequestOutput] = []
+        outputs_by_id: dict[str, RequestOutput] = {}
         while self.llm_server.has_unfinished_requests():
             step_outputs = self.llm_server.step()
             for output in step_outputs:
                 if output.done:
-                    outputs.append(output)
+                    outputs_by_id[output.request_id] = output
                     if use_tqdm:
                         pbar.update(1)
         if use_tqdm:
             pbar.close()
-        return outputs
+        return [outputs_by_id[req_id] for req_id in request_ids]
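
The reordering idea in the suggested fix can be demonstrated standalone (completion order is simulated here; the names mirror the diff but nothing below is cacheflow code): collect results keyed by request id as they complete, then emit them in submission order.

```python
from typing import Dict, List, Tuple

request_ids: List[str] = ["0", "1", "2"]
# Simulate requests finishing out of submission order.
completed: List[Tuple[str, str]] = [("2", "out-2"), ("0", "out-0"), ("1", "out-1")]

outputs_by_id: Dict[str, str] = {}
for req_id, output in completed:
    outputs_by_id[req_id] = output

# Re-emit in the order the requests were submitted.
ordered = [outputs_by_id[req_id] for req_id in request_ids]
print(ordered)  # → ['out-0', 'out-1', 'out-2']
```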
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f746ced and 655a5e4.

📒 Files selected for processing (9)
  • cacheflow/__init__.py
  • cacheflow/config.py
  • cacheflow/entrypoints/fastapi_server.py
  • cacheflow/entrypoints/llm.py
  • cacheflow/outputs.py
  • cacheflow/server/arg_utils.py
  • cacheflow/server/llm_server.py
  • examples/offline_inference.py
  • examples/simple_server.py
🧰 Additional context used
🧬 Code graph analysis (6)
cacheflow/entrypoints/fastapi_server.py (1)
cacheflow/server/arg_utils.py (4)
  • ServerArgs (11-59)
  • add_cli_args (32-35)
  • from_cli_args (38-43)
  • create_server_configs (45-59)
cacheflow/outputs.py (1)
cacheflow/sequence.py (2)
  • SequenceGroup (130-168)
  • is_finished (162-163)
cacheflow/server/llm_server.py (2)
cacheflow/server/arg_utils.py (2)
  • ServerArgs (11-59)
  • create_server_configs (45-59)
cacheflow/server/ray_utils.py (1)
  • initialize_cluster (14-90)
examples/offline_inference.py (2)
cacheflow/entrypoints/llm.py (2)
  • LLM (12-62)
  • generate (34-62)
cacheflow/sampling_params.py (1)
  • SamplingParams (5-137)
examples/simple_server.py (2)
cacheflow/server/arg_utils.py (3)
  • ServerArgs (11-59)
  • from_cli_args (38-43)
  • add_cli_args (32-35)
cacheflow/server/llm_server.py (2)
  • LLMServer (25-254)
  • from_server_args (113-122)
cacheflow/server/arg_utils.py (1)
cacheflow/config.py (4)
  • CacheConfig (65-79)
  • ModelConfig (9-62)
  • ParallelConfig (82-102)
  • SchedulerConfig (105-113)
🪛 Ruff (0.14.13)
cacheflow/config.py

144-144: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (13)
cacheflow/server/arg_utils.py (4)

10-30: Clean consolidation of server defaults.


31-43: CLI plumbing looks solid.


45-59: Config construction flow is clear and consistent.


62-117: CLI argument wiring aligns well with the new ServerArgs defaults.

examples/simple_server.py (1)

4-10: Nice migration to ServerArgs-based initialization.

Also applies to: 42-42

examples/offline_inference.py (1)

1-23: Clear, minimal offline inference example.

cacheflow/entrypoints/fastapi_server.py (1)

15-15: Import update looks good.

cacheflow/__init__.py (1)

1-14: Public exports updated cleanly.

cacheflow/outputs.py (1)

46-74: from_seq_group now respects subclasses.

Using cls(...) is the right move for extensibility.
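
The extensibility win can be shown with a minimal sketch (class names below are stand-ins, not the actual cacheflow types): a `@classmethod` factory that constructs via `cls(...)` returns correctly-typed instances for subclasses, which a `@staticmethod` hard-coding the class name would not.

```python
class RequestOutputDemo:
    def __init__(self, request_id: str, done: bool) -> None:
        self.request_id = request_id
        self.done = done

    @classmethod
    def from_parts(cls, request_id: str, done: bool) -> "RequestOutputDemo":
        # cls(...) constructs whichever subclass the method is called on.
        return cls(request_id, done)

class TaggedOutput(RequestOutputDemo):
    pass

out = TaggedOutput.from_parts("req-0", done=True)
print(type(out).__name__)  # → TaggedOutput
```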

cacheflow/server/llm_server.py (3)

15-16: ServerArgs-based constructor is a clean entrypoint.

Centralizing config + cluster init here is a good API improvement.

Also applies to: 112-122


95-95: Unit conversion is correctly implemented—no action needed.

The field swap_space_bytes is properly defined in CacheConfig and the conversion from GiB to bytes is correctly handled: ServerArgs.swap_space (in GiB) is passed to CacheConfig.__init__(), which converts it to bytes via self.swap_space_bytes = swap_space * _GiB (where _GiB = 1 << 30). The usage in llm_server.py:95 is correct.
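
The conversion described above is a one-liner; this sketch just verifies the arithmetic under the stated definition `_GiB = 1 << 30` (variable names mirror the review, but this is not the cacheflow source).

```python
_GiB = 1 << 30  # bytes per GiB, as stated in the review

swap_space = 4  # input unit: GiB
swap_space_bytes = swap_space * _GiB  # stored unit: bytes
print(swap_space_bytes)  # → 4294967296
```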


35-35: No breaking change—all instantiation already passes log_stats.

The factory method from_server_args() explicitly passes log_stats=not server_args.disable_log_stats (line 120), and all call sites in the codebase use either this factory method or ray.remote() class wrapping (which doesn't instantiate directly). No direct constructor calls omit the parameter.

cacheflow/entrypoints/llm.py (1)

14-33: LLM initialization wiring looks solid.

ServerArgs construction + default log-stats disabling are clear and consistent.


Comment on lines +1 to +4
from typing import List, Optional

from tqdm import tqdm


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for tqdm in packaging metadata
git ls-files | grep -E '(pyproject\.toml|setup\.py|setup\.cfg|requirements.*\.txt|Pipfile)' | head -20

Repository: Signal65/vllm-code-review

Length of output: 94


🏁 Script executed:

#!/bin/bash
# Search for tqdm in common package declaration files
rg "tqdm" -t toml -t py -t txt 2>/dev/null | head -30

Repository: Signal65/vllm-code-review

Length of output: 1098


🏁 Script executed:

#!/bin/bash
# Examine the actual llm.py file
cat cacheflow/entrypoints/llm.py 2>/dev/null | head -70

Repository: Signal65/vllm-code-review

Length of output: 1963


🏁 Script executed:

#!/bin/bash
# Check requirements.txt
cat requirements.txt

Repository: Signal65/vllm-code-review

Length of output: 256


🏁 Script executed:

#!/bin/bash
# Check setup.py for tqdm
cat setup.py

Repository: Signal65/vllm-code-review

Length of output: 2635


Add tqdm to dependencies or use lazy import.

tqdm is imported unconditionally at module level but is not declared in requirements.txt or setup.py. This breaks module imports unless tqdm is installed separately, even when use_tqdm=False. Either add tqdm to dependencies or defer the import until needed.

💡 Optional lazy import approach
-from tqdm import tqdm
+ # Defer tqdm import until needed to keep it optional.
-        if use_tqdm:
-            pbar = tqdm(total=len(prompts), desc="Processed prompts")
+        if use_tqdm:
+            try:
+                from tqdm import tqdm
+            except ImportError as exc:
+                raise ImportError(
+                    "tqdm is required when use_tqdm=True"
+                ) from exc
+            pbar = tqdm(total=len(prompts), desc="Processed prompts")
🤖 Prompt for AI Agents
In `@cacheflow/entrypoints/llm.py` around lines 1 - 4, The module imports tqdm at
top-level which forces the dependency even when not used; modify the code to
lazily import tqdm only when needed (e.g., inside the function or block that
reads the use_tqdm flag) or wrap the import in a try/except and fall back to a
no-op iterator; update any references to tqdm in the code to use the
local/imported name where appropriate (search for use_tqdm and the function(s)
that produce progress bars) so the module can be imported without having tqdm
installed unless progress is requested.
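
The no-op-fallback variant mentioned in the prompt can be sketched as follows (function and flag names are illustrative; only the lazy-import pattern itself is the point): tqdm is imported only when progress reporting is requested, and a pass-through lambda stands in when it is missing, so the module imports cleanly without tqdm installed.

```python
from typing import List

def process(items: List[str], use_tqdm: bool = False) -> List[str]:
    if use_tqdm:
        try:
            from tqdm import tqdm  # deferred: only needed for progress bars
        except ImportError:
            tqdm = lambda x, **kwargs: x  # no-op fallback iterator
        iterator = tqdm(items, desc="Processed prompts")
    else:
        iterator = items
    # Stand-in for the real per-item work.
    return [item.upper() for item in iterator]

print(process(["a", "b"]))  # → ['A', 'B']
```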
