Conversation
sjmonson
left a comment
Core handling of vLLM seems good; the rest is mostly cleanup and minor fixes. There are a lot of dead code paths that likely exist because this PR was started before #478, and a lot of the converting to/from `/v1/chat/completions` format can be removed. Multimodal probably needs a bit more work: audio should be easy to fix, but image/video don't seem implemented; that can be a later follow-up PR. Both the plain and chat template formats need fixes (see the respective comments); neither should be difficult.
```python
try:
    from vllm import SamplingParams
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.outputs import RequestOutput

    HAS_VLLM = True
except ImportError:
    AsyncLLMEngine = None  # type: ignore[assignment, misc]
    AsyncEngineArgs = None  # type: ignore[assignment, misc]
    SamplingParams = None  # type: ignore[assignment, misc]
    RequestOutput = None  # type: ignore[assignment, misc]
    HAS_VLLM = False
```
Move this under `src/guidellm/extras/vllm.py` and do `from guidellm.extras.vllm import SamplingParams, AsyncLLMEngine, ..., HAS_VLLM` here.
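A minimal sketch of what such an extras module could look like, based on the imports in the block above; the `src/guidellm/extras/vllm.py` path comes from this comment, and the exact layout is an assumption:

```python
# Hypothetical src/guidellm/extras/vllm.py: a single place that owns the
# optional vLLM import so backends can `from guidellm.extras.vllm import ...`.
try:
    from vllm import SamplingParams
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from vllm.outputs import RequestOutput

    HAS_VLLM = True
except ImportError:
    SamplingParams = None  # type: ignore[assignment, misc]
    AsyncEngineArgs = None  # type: ignore[assignment, misc]
    AsyncLLMEngine = None  # type: ignore[assignment, misc]
    RequestOutput = None  # type: ignore[assignment, misc]
    HAS_VLLM = False

__all__ = [
    "HAS_VLLM",
    "AsyncEngineArgs",
    "AsyncLLMEngine",
    "RequestOutput",
    "SamplingParams",
]
```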
```python
arguments = getattr(request, "arguments", None)

if arguments is not None:
    body = getattr(arguments, "body", None) or {}
    stream_override = getattr(arguments, "stream", None)
    stream = self.stream if stream_override is None else bool(stream_override)
    files = getattr(arguments, "files", None) or {}
else:
```
A `GenerationRequest` will never have an `arguments` field as of #478; drop this.
```python
# Conditionally import VLLM backend if available
try:
    from .vllm_python.vllm import VLLMPythonBackend
    from .vllm_python.vllm_response import VLLMResponseHandler

    HAS_VLLM_BACKEND = True
except ImportError:
    VLLMPythonBackend = None  # type: ignore[assignment, misc]
    VLLMResponseHandler = None  # type: ignore[assignment, misc]
    HAS_VLLM_BACKEND = False
```
Actually the above comment on vLLM extras is better here. Suggested change (replacing the conditional try/except import block above):

```python
from guidellm.extras.vllm import HAS_VLLM

# Conditionally import VLLM backend if available
if HAS_VLLM:
    from .vllm_python import VLLMPythonBackend
```
```python
from .vllm_python.vllm import VLLMPythonBackend
from .vllm_python.vllm_response import VLLMResponseHandler
```
Should import from the next level, e.g. `from .vllm_python import VLLMPythonBackend, VLLMResponseHandler`.
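For that shorter path to work, the package's `__init__.py` would need to re-export both names. A minimal sketch (the file contents are an assumption; the module names come from the diff above):

```python
# Hypothetical vllm_python/__init__.py re-exporting the public classes so
# callers can import them from the package level.
from .vllm import VLLMPythonBackend
from .vllm_response import VLLMResponseHandler

__all__ = ["VLLMPythonBackend", "VLLMResponseHandler"]
```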
```python
if not requires_target and self.target is not None:
    raise ValueError(
        f"Backend '{backend_type}' does not support a target parameter. "
        "Please remove --target as this backend runs locally."
    )
```
Don't do this; just ignore `target` if it's not needed.
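A sketch of the suggested behavior under the names from the diff above; the helper function and its signature are hypothetical, added only to make the snippet self-contained, and the debug log is an optional extra:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_target(
    backend_type: str, target: Optional[str], requires_target: bool
) -> Optional[str]:
    # Accept and quietly ignore an unused target instead of raising.
    if not requires_target and target is not None:
        logger.debug(
            "Backend '%s' runs locally; ignoring target %r.", backend_type, target
        )
        return None
    return target
```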
```python
if token_delta > 0:
    state["total_output_tokens"] += token_delta
```
Have you observed `token_delta` being less than 0, or is this just catching the 0 case?

RE: "less than 0" -- the test is `> 0`, and I could argue the test probably isn't even worthwhile for 0 if there's no possibility of it being less than 0. (Though, if it can be, `> 0` is more efficient than checking `>= 0` and adding the 0. 🙂)
```python
if hasattr(audio_samples.data, "numpy"):
    audio_array = audio_samples.data.numpy()
elif hasattr(audio_samples.data, "cpu"):
    audio_array = audio_samples.data.cpu().numpy()
else:
    audio_array = np.asarray(audio_samples.data)
```
Be careful with `hasattr`; it will hide invalid code paths. The only valid option is the first one: `audio_samples.data` is a `Tensor`, and all tensors have a `numpy()` method.
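A sketch of the simplified conversion, assuming `audio_samples.data` is always a CPU `torch.Tensor` as the comment states; the helper name is hypothetical:

```python
import numpy as np
import torch


def tensor_to_numpy(data: torch.Tensor) -> np.ndarray:
    # Per the review: the data is always a Tensor, so .numpy() is the only
    # conversion path needed; no hasattr probing.
    return data.numpy()
```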
```python
max_tokens = (
    body.get("max_tokens")
    if body.get("max_tokens") is not None
    else (max_tokens_override if max_tokens_override is not None else 16)
)
if max_tokens == 0:
    max_tokens = 16
```
Why set `max_tokens` to 16 by default? It should be uncapped so we let the model decide when it's done.
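A sketch of the uncapped default being suggested: only forward a limit when the request body or an override actually sets one, and otherwise leave stopping up to the model. The helper name is hypothetical; `body` and `max_tokens_override` come from the diff above:

```python
from typing import Optional


def resolve_max_tokens(body: dict, max_tokens_override: Optional[int]) -> Optional[int]:
    # None means "uncapped": let EOS / the model's limits decide when
    # generation stops instead of defaulting to 16.
    if body.get("max_tokens") is not None:
        return body["max_tokens"]
    return max_tokens_override
```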
```python
    return SamplingParams(**params)  # type: ignore[misc]


def _convert_vllm_output_to_openai_format(
```
Shouldn't the response handler be written to take RequestOutput natively? Seems like a lot of extra code/work to convert to this format first and then parse.
```python
def _extract_prompt_chat_plain(
    self, formatted_messages: list[dict[str, Any]]
) -> str:
    """Format messages as plain 'Role: content' lines."""
    parts = []
    for msg in formatted_messages:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        if role and content:
            parts.append(f"{role.capitalize()}: {content}")
    parts.append("Assistant: ")
    return "\n".join(parts)
```
Plain should be equivalent to `/v1/completions`. I need to read the vLLM code and double-check, but there should not be any "role" here; prompts should be concatenated with `" ".join(...)`.
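A sketch of the `/v1/completions`-style behavior being described, with no role prefixes; the exact join separator should still be confirmed against the vLLM source, as noted:

```python
from typing import Any


def extract_prompt_plain(formatted_messages: list[dict[str, Any]]) -> str:
    # Drop the "Role:" prefixes entirely and just concatenate the message
    # contents, mirroring a plain /v1/completions prompt.
    return " ".join(
        str(msg["content"]) for msg in formatted_messages if msg.get("content")
    )
```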
dbutenhof
left a comment
My eyes are starting to glaze over, in the middle of vllm.py, so I need a break. Might as well post what I've got so far and start fresh later...
```diff
 # Run a Benchmark

-After [installing GuideLLM](install.md) and [starting a server](server.md), you're ready to run benchmarks to evaluate your LLM deployment's performance.
+After [installing GuideLLM](install.md) and [starting a server](server.md), you're ready to run benchmarks to evaluate your LLM deployment's performance. Alternatively, you can run benchmarks with the vLLM Python backend (`--backend vllm_python`) without a separate server; see [vLLM Python backend](../guides/vllm-python-backend.md).
```
I'd break this up: "after installing and starting a server [...] or run the Python backend" obscures the latter case. E.g.,

- Install GuideLLM
- You can run GuideLLM two ways:
  a. targeting a running OpenAI-compatible LLM server, or
  b. with an installed vLLM package using the vLLM Python backend (...)
```markdown
@@ -0,0 +1,68 @@
# vLLM Python Backend

The **vLLM Python backend** (`vllm_python`) runs inference in the **same process** as GuideLLM using vLLM's [AsyncLLMEngine](https://docs.vllm.ai/). No HTTP server is involved, reducing overheat and variables. This is useful for isolating performance bottlenecks or simplifying your benchmark setup. You do **not** pass `--target`; you **must** pass `--model`.
```
The bolding on "same process" feels like overkill here.
Also, I think you want "overhead", not "overheat"; plus I think the "and variables" doesn't mean much in this context. Maybe "variability" would be better... or "variability due to network bandwidth and latency" or whatever else you have in mind here.
```markdown
## Basic example

Run a benchmark with the vLLM Python backend (no `--target`):
```
This wording might suggest that it's the lack of --target that triggers the Python backend, when it's really the --backend. You've already said that we don't use --target, and you're not showing it below, so I'm not sure that the explicit reminder that "we don't use --target here" is really necessary.
```python
raise ImportError(
    "vllm is not installed. Please install it using "
    "'pip install guidellm[vllm]' or 'pip install vllm>=0.6.0'"
)
```
Except that since the dependency morass requires installing vLLM first if not together, suggesting at this point that they just pip install vllm seems potentially unhelpful ...
| """ | ||
| config = dict(user_config) | ||
|
|
||
| # Ensure model is set in config (required; overrides user if they passed it) |
This should be documented; there's a lot about discovering vLLM config options and how to pass them as backend args, so we should be careful to document options that can't be passed that way.
I get the necessity, although it seems fairly artificial (but probably too messy to fix now) ... target and model aren't actually required by Backend.create, even though resolve_backend always passes them ... arguably they should be demoted to backend_kwargs especially as we now have concrete examples which conflict in their requirements for those two parameters!
```python
if not self._in_process:
    raise RuntimeError("Backend not started up for process.")

# Shutdown the async engine if it has a shutdown method
```
Misleading comment ... there's no conditional here for "if" it has a shutdown method -- if there's an _engine we just call it. If it's really an "if" then the call should be conditional or in a captured try block... if not, can we remove that comment phrase?
```python
:param iter_time: Current iteration time
"""
if request_info.timings.first_request_iteration is None:
    request_info.timings.first_request_iteration = iter_time
```
Without checking other references ... below in token timing, the is None dynamic initialization case sets token_iterations = 0 along with storing the first timestamp. Should this logic be following that same pattern?
dbutenhof
left a comment
More comments / questions / rambling
```python
self._engine = None
self._in_process = False


async def validate(self):
```
Apparently the HTTP backend used to do a real request to validate but was "downgraded" to do a /health check. So the question here is: do we want to depart from that convention by doing a real request here, or should we use the check_health() API to be equivalent to the HTTP behavior?
```python
block_type = block.get("type", "")
if block_type == "text":
    text = block.get("text", "")
```
In both cases, it seems unnecessarily "wasteful" to add the get fallback to "" since the default None is perfectly (and more straightforwardly) falsey already. The fallback suggests that the result will be used as a str in a context where None would present a problem ... that represents an unnecessary cognitive load (i.e., it makes me scan the code to try to find such reference ... which doesn't exist. 😁 )
This obviously isn't a big deal, and probably isn't significantly slower -- it just feels like it makes the code slightly less easy to read, and that always bugs me.
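The simplification being hinted at, as a tiny self-contained example; the `block` dict shape is taken from the diff above:

```python
block = {"type": "text", "text": "hello"}

# dict.get() already returns a falsy None by default, so the explicit ""
# fallback only matters if the value is later used as a str where None would
# break -- which, per the review, it isn't here.
if block.get("type") == "text" and block.get("text"):
    text = block["text"]
```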
```python
json_str_decoded = (
    json_str.decode("utf-8") if isinstance(json_str, bytes) else json_str
)
```
Ouch ... this is entirely nonsensical unless you take the long journey to figure out that we try to import orjson as json. Not thrilled by that obfuscation, which makes this weird phrase necessary; but there you go. 😦
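For readers without that context, the pattern being referred to is roughly the following (a sketch; the module's exact code is an assumption). `orjson.dumps()` returns `bytes` while the stdlib's `json.dumps()` returns `str`, which is why downstream code ends up handling both types:

```python
# Optional speedup: use orjson under the stdlib's name when it is installed.
try:
    import orjson as json  # loads() accepts str or bytes; dumps() returns bytes
except ImportError:
    import json  # type: ignore[no-redef]
```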
```python
if engine is None:
    raise RuntimeError("Backend not started up.")
```
There are several similar checks, some with slightly different exception text. Maybe one of the different patterns is meaningful, but I suspect it'd be better to give a single consistent message if the engine isn't started, probably by consistently calling _validate_backend_initialized rather than checking locally.
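A sketch of that consolidation, reusing the helper name mentioned above and the existing error text; the class wrapper exists only to make the snippet self-contained:

```python
class EngineGuardMixin:
    """Illustrative only: a single guard used everywhere the engine is required."""

    _engine = None  # set by the backend's startup path

    def _validate_backend_initialized(self) -> None:
        # One consistent error message instead of slightly different local checks.
        if self._engine is None:
            raise RuntimeError("Backend not started up.")
```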
| if line == "data: [DONE]": | ||
| return None | ||
| line = (line or "").strip() | ||
| if not line or not line.startswith("data:"): |
If line is "" (which is the least it can be after the previous line), then not line.startswith(<anything>). Yeah, checking for an empty line is probably slightly more efficient, but I'd be more tempted to simplify the statement here by dropping the first phrase.
And if we're uninterested in a line that doesn't start with "data:", I'm a bit curious why we're special-casing that by returning {} instead of None?
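One possible shape of the simplification discussed above (a sketch, not the PR's code, and it assumes the payload after `data:` is JSON as in SSE-style streams): strip first, let a single `startswith("data:")` check cover the empty-line case, and treat non-data lines the same as `[DONE]` by returning `None`:

```python
import json
from typing import Any, Optional


def parse_data_line(line: Optional[str]) -> Optional[dict[str, Any]]:
    # Strip first; "".startswith("data:") is already False, so no separate
    # empty-line check is needed.
    line = (line or "").strip()
    if not line.startswith("data:") or line == "data: [DONE]":
        return None
    return json.loads(line[len("data:"):].strip())
```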
```dockerfile
# nightly: increment to next minor, add 'a' with build iteration
# alpha: increment to next minor, add 'a' with build iteration
# dev: increment to next minor, add 'dev' with build iteration
ARG GUIDELLM_BUILD_TYPE=dev
```
Do we really want dev default instead of release??
```dockerfile
ENV GUIDELLM_BUILD_TYPE=$GUIDELLM_BUILD_TYPE

# Copy repository and install GuideLLM from source with pip (no uv, to avoid
```
Hmm. "no uv" sounds stilted ... "not [with] uv" would sound a bit better.
```dockerfile
# Ensure that the user home dir can be used by any user
# (OpenShift Pods can't use the cache otherwise)
```
Similar comment to the one I made on Kevin's PR ... granted, he responded that this is essentially a quote from the OpenShift documentation, but it still bugs me. It can be used by any user *in gid 0*. That includes both the random uid in OpenShift and the root (0) uid in standalone podman, both in gid 0; but it's still not "any user".
```dockerfile
VOLUME /results

# Root group for k8s
USER 1001:0
```
A constant uid? Is that ... wise?
Summary
This is an alternative backend that starts a complete vLLM instance and uses its Python API rather than the HTTP API.
This PR will remain a draft since it still needs tests and documentation.
TODO
Details
Test Plan
Related Issues
Use of AI
## WRITTEN BY AI ##