
vLLM Python Backend #596 (Draft)

jaredoconnell wants to merge 57 commits into vllm-project:main from jaredoconnell:feat/vllm-python

Conversation

@jaredoconnell (Collaborator) commented Feb 12, 2026

Summary

This is an alternative backend that starts a complete vLLM instance in-process and uses vLLM's Python API rather than its HTTP API.

This PR will remain a draft until tests and documentation are added.
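For reviewers less familiar with vLLM's Python API, the core pattern this backend wraps looks roughly like the following. This is a sketch only, not the code in this PR; the model name and sampling values are illustrative.

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def main() -> None:
    # Build an in-process engine; no HTTP server is involved.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="Qwen/Qwen2.5-1.5B-Instruct")  # illustrative model
    )
    params = SamplingParams(temperature=0.0, max_tokens=64)

    # generate() yields RequestOutput snapshots as tokens stream in.
    final = None
    async for output in engine.generate("Hello, world!", params, request_id="req-0"):
        final = output
    if final is not None:
        print(final.outputs[0].text)


asyncio.run(main())
```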

TODO

Details

  • [ ]

Test Plan

  • TODO

Related Issues

  • Resolves #

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

@sjmonson (Collaborator) left a comment

Core handling of vLLM seems good; the rest is mostly cleanup and minor fixes. There are a lot of dead code paths that likely exist because this PR was started before #478. A lot of the converting to/from the /v1/chat/completions format can be removed. Multimodal probably needs a bit more work: audio should be easy to fix, but image/video don't seem implemented; that can be a later follow-up PR. Both the plain and chat template formats need fixes (see the respective comments); neither should be difficult.

Comment on lines 23 to 35
try:
    from vllm import SamplingParams
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.outputs import RequestOutput

    HAS_VLLM = True
except ImportError:
    AsyncLLMEngine = None  # type: ignore[assignment, misc]
    AsyncEngineArgs = None  # type: ignore[assignment, misc]
    SamplingParams = None  # type: ignore[assignment, misc]
    RequestOutput = None  # type: ignore[assignment, misc]
    HAS_VLLM = False

Move this under `src/guidellm/extras/vllm.py` and `from guidellm.extras.vllm import SamplingParams, AsyncLLMEngine, ..., HAS_VLLM` here.
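A minimal sketch of what that extras module could look like, mirroring the imports in this file; this is the reviewer's suggestion, not code that exists in the PR yet:

```python
# src/guidellm/extras/vllm.py (sketch)
# Single home for the optional vLLM import so callers can write
# `from guidellm.extras.vllm import SamplingParams, AsyncLLMEngine, ..., HAS_VLLM`.
try:
    from vllm import SamplingParams
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from vllm.outputs import RequestOutput

    HAS_VLLM = True
except ImportError:
    SamplingParams = None  # type: ignore[assignment, misc]
    AsyncEngineArgs = None  # type: ignore[assignment, misc]
    AsyncLLMEngine = None  # type: ignore[assignment, misc]
    RequestOutput = None  # type: ignore[assignment, misc]
    HAS_VLLM = False

__all__ = ["AsyncEngineArgs", "AsyncLLMEngine", "HAS_VLLM", "RequestOutput", "SamplingParams"]
```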

Comment on lines +314 to +321
arguments = getattr(request, "arguments", None)

if arguments is not None:
    body = getattr(arguments, "body", None) or {}
    stream_override = getattr(arguments, "stream", None)
    stream = self.stream if stream_override is None else bool(stream_override)
    files = getattr(arguments, "files", None) or {}
else:

A GenerationRequest will never have an arguments field as of #478; drop this.

Comment on lines +22 to +32
# Conditionally import VLLM backend if available
try:
    from .vllm_python.vllm import VLLMPythonBackend
    from .vllm_python.vllm_response import VLLMResponseHandler

    HAS_VLLM_BACKEND = True
except ImportError:
    VLLMPythonBackend = None  # type: ignore[assignment, misc]
    VLLMResponseHandler = None  # type: ignore[assignment, misc]
    HAS_VLLM_BACKEND = False


Actually the above comment on vLLM extras is better here:

Suggested change:
-# Conditionally import VLLM backend if available
-try:
-    from .vllm_python.vllm import VLLMPythonBackend
-    from .vllm_python.vllm_response import VLLMResponseHandler
-    HAS_VLLM_BACKEND = True
-except ImportError:
-    VLLMPythonBackend = None  # type: ignore[assignment, misc]
-    VLLMResponseHandler = None  # type: ignore[assignment, misc]
-    HAS_VLLM_BACKEND = False
+from guidellm.extras.vllm import HAS_VLLM
+# Conditionally import VLLM backend if available
+if HAS_VLLM:
+    from .vllm_python import VLLMPythonBackend

Comment on lines +24 to +25
from .vllm_python.vllm import VLLMPythonBackend
from .vllm_python.vllm_response import VLLMResponseHandler

Should import from the next level up, e.g. `from .vllm_python import VLLMPythonBackend, VLLMResponseHandler`.

Comment on lines +413 to +417
if not requires_target and self.target is not None:
    raise ValueError(
        f"Backend '{backend_type}' does not support a target parameter. "
        "Please remove --target as this backend runs locally."
    )

Don't do this; just ignore target if it's not needed.
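A sketch of that alternative, assuming the value can simply be dropped; the `logger` call is illustrative, not an existing helper in this file:

```python
# Accept --target for CLI consistency but ignore it for local backends;
# the vLLM Python backend has nothing to connect to.
if not requires_target and self.target is not None:
    logger.debug(
        "Backend %r runs locally; ignoring target=%r", backend_type, self.target
    )
```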

Comment on lines 955 to 956
if token_delta > 0:
    state["total_output_tokens"] += token_delta

Have you observed token_delta being less than 0 or is this just catching the 0 case?


RE: "less than 0" -- the test is > 0, and I could argue the test probably isn't even worthwhile for 0 if there's no possibility of it being less than 0. (Though, if it can be, > 0 is more efficient than checking >= 0 and adding the 0. 🙂)

Comment on lines +832 to +837
if hasattr(audio_samples.data, "numpy"):
    audio_array = audio_samples.data.numpy()
elif hasattr(audio_samples.data, "cpu"):
    audio_array = audio_samples.data.cpu().numpy()
else:
    audio_array = np.asarray(audio_samples.data)

Be careful with hasattr; it will hide invalid code paths. The only valid option is the first one: audio_samples.data is a Tensor, and all tensors have a numpy() function.
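In other words, something like the following should suffice (assuming, as the comment above implies, that the tensor already lives on the CPU):

```python
# audio_samples.data is a Tensor; a CPU tensor always exposes .numpy().
audio_array = audio_samples.data.numpy()
```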

Comment on lines +752 to +758
max_tokens = (
    body.get("max_tokens")
    if body.get("max_tokens") is not None
    else (max_tokens_override if max_tokens_override is not None else 16)
)
if max_tokens == 0:
    max_tokens = 16

Why set max_tokens to 16 by default? It should be uncapped so we let the model decide when it's done.
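A sketch of the uncapped behavior, assuming vLLM treats `max_tokens=None` as "generate until EOS or the model's context limit"; variable names mirror the snippet above:

```python
# Prefer an explicit request value, then any override; otherwise pass None so
# the engine decides when generation ends instead of capping at 16 tokens.
max_tokens = body.get("max_tokens")
if max_tokens is None:
    max_tokens = max_tokens_override
params["max_tokens"] = max_tokens  # None => no artificial cap
```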


return SamplingParams(**params) # type: ignore[misc]

def _convert_vllm_output_to_openai_format(

Shouldn't the response handler be written to take RequestOutput natively? Seems like a lot of extra code/work to convert to this format first and then parse.

Comment on lines +652 to +663
def _extract_prompt_chat_plain(
    self, formatted_messages: list[dict[str, Any]]
) -> str:
    """Format messages as plain 'Role: content' lines."""
    parts = []
    for msg in formatted_messages:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        if role and content:
            parts.append(f"{role.capitalize()}: {content}")
    parts.append("Assistant: ")
    return "\n".join(parts)

Plain should be equivalent to /v1/completions. I need to read the vLLM code and double-check, but there should not be any "role" here. Prompts should be concatenated with `" ".join(...)`.
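A sketch of that /v1/completions-style behavior, with no roles and simple whitespace joining; the function name is illustrative and string contents are assumed:

```python
from typing import Any


def extract_prompt_plain(formatted_messages: list[dict[str, Any]]) -> str:
    """Join message contents into a single completions-style prompt (no roles)."""
    return " ".join(
        str(msg["content"]) for msg in formatted_messages if msg.get("content")
    )
```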

@dbutenhof (Collaborator) left a comment

My eyes are starting to glaze over, in the middle of vllm.py, so I need a break. Might as well post what I've got so far and start fresh later...

# Run a Benchmark

After [installing GuideLLM](install.md) and [starting a server](server.md), you're ready to run benchmarks to evaluate your LLM deployment's performance.
After [installing GuideLLM](install.md) and [starting a server](server.md), you're ready to run benchmarks to evaluate your LLM deployment's performance. Alternatively, you can run benchmarks with the vLLM Python backend (`--backend vllm_python`) without a separate server; see [vLLM Python backend](../guides/vllm-python-backend.md).

I'd break this up: "after installing and starting a server [...] or run the Python backend" obscures the latter case. E.g.,

  1. Install GuideLLM
  2. You can run GuideLLM two ways:
    a. targeting a running OpenAI-compatible LLM server or
    b. with an installed vLLM package using the vLLM Python backend (...)

@@ -0,0 +1,68 @@
# vLLM Python Backend

The **vLLM Python backend** (`vllm_python`) runs inference in the **same process** as GuideLLM using vLLM's [AsyncLLMEngine](https://docs.vllm.ai/). No HTTP server is involved, reducing overheat and variables. This is useful for isolating performance bottlenecks or simplifying your benchmark setup. You do **not** pass `--target`; you **must** pass `--model`.

The bolding on "same process" feels like overkill here.

Also, I think you want "overhead", not "overheat"; plus I think the "and variables" doesn't mean much in this context. Maybe "variability" would be better... or "variability due to network bandwidth and latency" or whatever else you have in mind here.


## Basic example

Run a benchmark with the vLLM Python backend (no `--target`):

This wording might suggest that it's the lack of --target that triggers the Python backend, when it's really the --backend. You've already said that we don't use --target, and you're not showing it below, so I'm not sure that the explicit reminder that "we don't use --target here" is really necessary.

Comment on lines +77 to +80
raise ImportError(
    "vllm is not installed. Please install it using "
    "'pip install guidellm[vllm]' or 'pip install vllm>=0.6.0'"
)

Except that, since the dependency morass requires installing vLLM first if not together with GuideLLM, suggesting at this point that they just pip install vllm seems potentially unhelpful ...

"""
config = dict(user_config)

# Ensure model is set in config (required; overrides user if they passed it)

This should be documented; there's a lot about discovering vLLM config options and how to pass them as backend args, so we should be careful to document options that can't be passed that way.

I get the necessity, although it seems fairly artificial (but probably too messy to fix now) ... target and model aren't actually required by Backend.create, even though resolve_backend always passes them ... arguably they should be demoted to backend_kwargs especially as we now have concrete examples which conflict in their requirements for those two parameters!

if not self._in_process:
    raise RuntimeError("Backend not started up for process.")

# Shutdown the async engine if it has a shutdown method

Misleading comment ... there's no conditional here for "if" it has a shutdown method -- if there's an _engine we just call it. If it's really an "if" then the call should be conditional or in a captured try block... if not, can we remove that comment phrase?

:param iter_time: Current iteration time
"""
if request_info.timings.first_request_iteration is None:
    request_info.timings.first_request_iteration = iter_time

Without checking other references ... below in token timing, the is None dynamic initialization case sets token_iterations = 0 along with storing the first timestamp. Should this logic be following that same pattern?

@dbutenhof (Collaborator) left a comment

More comments / questions / rambling

self._engine = None
self._in_process = False

async def validate(self):

Apparently the HTTP backend used to do a real request to validate but was "downgraded" to do a /health check. So the question here is: do we want to depart from that convention by doing a real request here, or should we use the check_health() API to be equivalent to the HTTP behavior?
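If parity with the HTTP backend is the goal, a sketch along these lines would keep validate() lightweight; it assumes the engine exposes `check_health()`, as recent vLLM versions do:

```python
async def validate(self):
    """Mirror the HTTP backend's /health-style check rather than a real request."""
    if self._engine is None:
        raise RuntimeError("Backend not started up for process.")
    # AsyncLLMEngine.check_health() raises if the engine is unhealthy.
    await self._engine.check_health()
```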

Comment on lines +581 to +583
block_type = block.get("type", "")
if block_type == "text":
    text = block.get("text", "")

In both cases, it seems unnecessarily "wasteful" to add the get fallback to "" since the default None is perfectly (and more straightforwardly) falsey already. The fallback suggests that the result will be used as a str in a context where None would present a problem ... which represents an unnecessary cognitive load (i.e., it makes me scan the code to try to find such a reference ... which doesn't exist. 😁)

This obviously isn't a big deal, and probably isn't significantly slower -- it just feels like it makes the code slightly less easy to read, and that always bugs me.
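A sketch of the simplification being described (behavior unchanged, since None is already falsey):

```python
# None is already falsey, so the "" fallback adds nothing here.
block_type = block.get("type")
if block_type == "text":
    text = block.get("text")
```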

Comment on lines 869 to 871
json_str_decoded = (
    json_str.decode("utf-8") if isinstance(json_str, bytes) else json_str
)

Ouch ... this is entirely nonsensical unless you take the long journey to figure out that we try to `import orjson as json`. Not thrilled by that obfuscation, which makes this weird phrase necessary; but there you go. 😦

Comment on lines +925 to +926
if engine is None:
    raise RuntimeError("Backend not started up.")

There are several similar checks, some with slightly different exception text. Maybe one of the different patterns is meaningful, but I suspect it'd be better to give a single consistent message if the engine isn't started, probably by consistently calling _validate_backend_initialized rather than checking locally.
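A sketch of that consolidation; the helper name follows the one mentioned above, while the exact internal check and message are assumptions:

```python
def _validate_backend_initialized(self) -> None:
    """Single, consistent guard used wherever the engine is required."""
    if self._engine is None or not self._in_process:
        raise RuntimeError("Backend not started up for process.")
```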

if line == "data: [DONE]":
return None
line = (line or "").strip()
if not line or not line.startswith("data:"):

If line is "" (which is the least it can be after the previous line), then not line.startswith(<anything>). Yeah, checking for an empty line is probably slightly more efficient, but I'd be more tempted to simplify the statement here by dropping the first phrase.

And if we're uninterested in a line that doesn't start with "data:", I'm a bit curious why we're special-casing that by returning {} instead of None?
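Concretely, something like this would drop the redundant empty-line check while keeping the current return values; whether non-data lines should return None instead is the open question above:

```python
line = (line or "").strip()
if line == "data: [DONE]":
    return None
# An empty string fails startswith("data:") anyway, so no separate check is needed.
if not line.startswith("data:"):
    return {}
```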

# nightly: increment to next minor, add 'a' with build iteration
# alpha: increment to next minor, add 'a' with build iteration
# dev: increment to next minor, add 'dev' with build iteration
ARG GUIDELLM_BUILD_TYPE=dev

Do we really want dev default instead of release??


ENV GUIDELLM_BUILD_TYPE=$GUIDELLM_BUILD_TYPE

# Copy repository and install GuideLLM from source with pip (no uv, to avoid

Hmm. "no uv" sounds stilted ... "not [with] uv" would sound a bit better.

Comment on lines +43 to +44
# Ensure that the user home dir can be used by any user
# (OpenShift Pods can't use the cache otherwise)

Similar comment to the one I made on Kevin's PR ... granted, he responded that this is essentially a quote from the OpenShift documentation, but it still bugs me. It can be used by any user *in gid 0*. That includes both the random uid in OpenShift and the root (0) uid in standalone podman, both in gid 0; but it's still not "any user".

VOLUME /results

# Root group for k8s
USER 1001:0

A constant uid? Is that ... wise?

