vLLM Python Backend #596
Changes from all commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| # Base image: vllm/vllm-openai (has vLLM pre-installed). Override with build arg. | ||
| ARG BASE_IMAGE=vllm/vllm-openai | ||
|
|
||
| FROM $BASE_IMAGE | ||
|
|
||
| # release: take the last version and add a post if build iteration | ||
| # candidate: increment to next minor, add 'rc' with build iteration | ||
| # nightly: increment to next minor, add 'a' with build iteration | ||
| # alpha: increment to next minor, add 'a' with build iteration | ||
| # dev: increment to next minor, add 'dev' with build iteration | ||
| ARG GUIDELLM_BUILD_TYPE=dev | ||
|
|
||
| # Extra dependencies to install (e.g. recommended, all) | ||
| ARG GUIDELLM_BUILD_EXTRAS=recommended,audio | ||
|
|
||
| # Switch to root for installing system deps and pip install | ||
| USER root | ||
|
|
||
| # Install git for setuptools-git-versioning (version discovery) | ||
| RUN apt-get update && apt-get install -y --no-install-recommends git \ | ||
| && rm -rf /var/lib/apt/lists/* | ||
|
|
||
| ENV GUIDELLM_BUILD_TYPE=$GUIDELLM_BUILD_TYPE | ||
|
|
||
| # Copy repository and install GuideLLM from source with pip (no uv, to avoid | ||
|
Collaborator
Hmm. "no uv" sounds stilted ... "not [with] uv" would sound a bit better. |
||
| # conflicting with the pre-installed vLLM in the base image) | ||
| COPY / /src | ||
| WORKDIR /src | ||
| RUN pip install --no-cache-dir ".[${GUIDELLM_BUILD_EXTRAS}]" | ||
|
|
||
| # Metadata | ||
| LABEL io.k8s.display-name="GuideLLM" \ | ||
| org.opencontainers.image.description="GuideLLM Performance Benchmarking Container" \ | ||
| org.opencontainers.image.source="https://github.com/vllm-project/guidellm" \ | ||
| org.opencontainers.image.documentation="https://blog.vllm.ai/guidellm/stable" \ | ||
| org.opencontainers.image.licenses="Apache-2.0" | ||
|
|
||
| ENV HOME="/home/guidellm" \ | ||
| GUIDELLM_OUTPUT_DIR="/results" | ||
|
|
||
| WORKDIR $HOME | ||
|
|
||
| # Ensure that the user home dir can be used by any user | ||
| # (OpenShift Pods can't use the cache otherwise) | ||
|
Comment on lines +43 to +44
Collaborator
Similar comment that I made on Kevin's PR ... granted, he responded that this is essentially a quote from the OpenShift documentation, it still bugs me. It can be used by any user *in gid 0*. That includes both the random uid in OpenShift and the root (0) uid in standalone podman, both in gid 0; but it's still not "any user". |
||
| RUN chgrp -R 0 "$HOME" && chmod -R g=u "$HOME" | ||
|
|
||
| VOLUME /results | ||
|
|
||
| # Root group for k8s | ||
| USER 1001:0 | ||
|
Collaborator
A constant uid? Is that ... wise? |
||
|
|
||
| ENTRYPOINT [ "guidellm" ] | ||
| CMD [ "benchmark", "run" ] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,7 +4,7 @@ weight: -6 | |
|
|
||
| # Run a Benchmark | ||
|
|
||
| After [installing GuideLLM](install.md) and [starting a server](server.md), you're ready to run benchmarks to evaluate your LLM deployment's performance. | ||
| After [installing GuideLLM](install.md) and [starting a server](server.md), you're ready to run benchmarks to evaluate your LLM deployment's performance. Alternatively, you can run benchmarks with the vLLM Python backend (`--backend vllm_python`) without a separate server; see [vLLM Python backend](../guides/vllm-python-backend.md). | ||
|
Collaborator
I'd break this up: "after installing and starting a server [...] or run the Python backend" obscures the latter case. E.g.,
|
||
|
|
||
| Running a GuideLLM benchmark is straightforward. The basic command structure is: | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| # vLLM Python Backend | ||
|
|
||
| The **vLLM Python backend** (`vllm_python`) runs inference in the **same process** as GuideLLM using vLLM's [AsyncLLMEngine](https://docs.vllm.ai/). No HTTP server is involved, reducing overheat and variables. This is useful for isolating performance bottlenecks or simplifying your benchmark setup. You do **not** pass `--target`; you **must** pass `--model`. | ||
|
Collaborator
The bolding on "same process" feels like overkill here. Also, I think you want "overhead", not "overheat"; plus I think the "and variables" doesn't mean much in this context. Maybe "variability" would be better... or "variability due to network bandwidth and latency" or whatever else you have in mind here. |
||
|
|
||
| For all engine options and supported models, see vLLM's [Engine Arguments](https://docs.vllm.ai/en/stable/configuration/engine_args/) and the [vLLM documentation](https://docs.vllm.ai/). | ||
|
|
||
| ## Installation | ||
|
|
||
| ### Recommended methods | ||
|
|
||
| - **Official GuideLLM + vLLM image** | ||
| Build and run the image that uses the vLLM base image (e.g. [Containerfile.vllm](https://github.com/vllm-project/guidellm/blob/main/Containerfile.vllm)). It is based on `vllm/vllm-openai` and installs GuideLLM on top, giving a known-good vLLM + GuideLLM stack with hardware support as provided by the base image. | ||
|
|
||
| **Note:** This method prefers vLLM's requirements over GuideLLM's. Since vLLM is the more complex project, this is the recommended configuration, but it may pin older Python or dependency versions, resulting in sub-optimal GuideLLM performance and behavior in some scenarios. | ||
|
|
||
| - **Existing vLLM installation** | ||
| Install vLLM first for your environment (GPU/CPU, CUDA, etc.), then install GuideLLM in the same environment (e.g. `pip install guidellm` or with extras). You avoid a duplicate vLLM install and reuse your existing acceleration setup. | ||
|
|
||
| **Note:** Using [uv](https://github.com/astral-sh/uv) is not recommended for the vLLM Python backend because of potentially incompatible requirements between the two projects. Prefer pip or the container / existing vLLM environment. | ||
|
|
||
|
|
||
| It is also possible to install GuideLLM and vLLM via pip using `pip install guidellm[vllm]`. This method may make **hardware acceleration** (e.g. CUDA) harder to get working. See [vLLM installation](https://docs.vllm.ai/en/latest/getting_started/installation) and GPU/hardware-specific docs there. For production or GPU use, the container or existing-install path is recommended. | ||
|
|
||
|
|
||
| ## Basic example | ||
|
|
||
| Run a benchmark with the vLLM Python backend (no `--target`): | ||
|
Collaborator
This wording might suggest that it's the lack of |
||
|
|
||
| ```bash | ||
| guidellm benchmark run \ | ||
| --backend vllm_python \ | ||
| --model "Qwen/Qwen3-0.6B" \ | ||
| --data "prompt_tokens=256,output_tokens=128" \ | ||
| --max-seconds 20 \ | ||
| --rate 3 | ||
| ``` | ||
|
|
||
| Engine behavior (device, memory, etc.) follows vLLM defaults unless you override it via `--backend-kwargs` (e.g. `vllm_config`). When running without a GPU (e.g. the GuideLLM + vLLM container without GPU access), the backend automatically uses the CPU device unless you set `device` in `vllm_config`. For engine configuration options, see vLLM's [Engine Arguments](https://docs.vllm.ai/en/stable/configuration/engine_args/). | ||
|
|
||
| ## Request format and backend options | ||
|
|
||
| - **`--request-format`** | ||
| Controls how chat prompts are built. Options: `plain` (no chat template; message content is concatenated as plain text), `default-template` (use the tokenizer’s default chat template), or a file path / single-line template string per vLLM’s supported options. The value is passed through to vLLM's chat template handling. For details, see vLLM's [Chat templates](https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/chat_templates/) documentation. | ||
|
|
||
| - **`--backend-kwargs`** | ||
| Backend-specific options are passed here as a JSON object: pass a `vllm_config` key whose value is a dict of engine option names and values. You can also pass `request_format` here as an alternative to `--request-format`. | ||
|
|
||
| **Using Engine Arguments in `vllm_config`:** The [Engine Arguments](https://docs.vllm.ai/en/stable/configuration/engine_args/) documentation describes options in **CLI form** (e.g. `--gpu-memory-utilization`, `--max-model-len`). For `vllm_config` you must use the **Python parameter names** instead: strip the leading `--` and replace dashes with underscores (e.g. `gpu_memory_utilization`, `max_model_len`). The keys are the same as the field names on vLLM's `EngineArgs` and `AsyncEngineArgs` dataclasses; for the exact list of allowed keys and types, see the [vLLM source: `vllm/engine/arg_utils.py`](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py) (search for `class EngineArgs`). | ||
|
|
||
| Example — limit GPU memory use and context length: | ||
|
|
||
| ```bash | ||
| --backend-kwargs '{"vllm_config": {"gpu_memory_utilization": 0.8, "max_model_len": 4096}}' | ||
| ``` | ||
|
|
||
| For the full list of options and their types, see vLLM's [Engine Arguments](https://docs.vllm.ai/en/stable/configuration/engine_args/) (CLI form) and the [EngineArgs source](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py) (Python field names for `vllm_config`). | ||
|
|
||
| ## See also | ||
|
|
||
| - [Backends](backends.md) — Overview of supported backends. | ||
| - [Run a benchmark](../getting-started/benchmark.md) — General benchmark options. | ||
| - [vLLM Engine Arguments](https://docs.vllm.ai/en/stable/configuration/engine_args/) — CLI-oriented docs; use Python names (e.g. `gpu_memory_utilization`) in `vllm_config`. | ||
| - [vLLM source: `vllm/engine/arg_utils.py`](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py) — `EngineArgs` / `AsyncEngineArgs` field names and types for `vllm_config` keys. | ||
| - [vLLM AsyncEngineArgs API](https://docs.vllm.ai/en/stable/api/vllm/engine/arg_utils/#vllm.engine.arg_utils.AsyncEngineArgs) — API reference for the class that receives these options. | ||
| - [vLLM Chat templates](https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/chat_templates/) — For `--request-format` behavior. | ||
| - [vLLM documentation](https://docs.vllm.ai/) | ||
| - [vLLM installation](https://docs.vllm.ai/en/latest/getting_started/installation) | ||
| - [vLLM OpenAI-compatible server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) — When using the HTTP server instead of the Python backend. | ||
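The CLI-name-to-Python-name mapping that the guide describes for `vllm_config` keys can be sketched in a few lines. This is only an illustration of the rule (strip the leading `--`, replace dashes with underscores); the helper name `cli_flag_to_engine_key` is hypothetical and not part of GuideLLM or vLLM:

```python
import json

def cli_flag_to_engine_key(flag: str) -> str:
    """Map a vLLM CLI flag (e.g. '--gpu-memory-utilization') to the
    Python field name expected inside vllm_config."""
    return flag.lstrip("-").replace("-", "_")

# Build the JSON string that would be passed via --backend-kwargs.
vllm_config = {
    cli_flag_to_engine_key("--gpu-memory-utilization"): 0.8,
    cli_flag_to_engine_key("--max-model-len"): 4096,
}
backend_kwargs = json.dumps({"vllm_config": vllm_config})
print(backend_kwargs)
```

The resulting string matches the `--backend-kwargs` example shown in the guide; whether a given key is accepted still depends on vLLM's `EngineArgs` fields.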
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -46,6 +46,10 @@ | |
| get_builtin_scenarios, | ||
| reimport_benchmarks_report, | ||
| ) | ||
| from guidellm.benchmark.schemas.generative.entrypoints import ( | ||
| backend_requires_model, | ||
| backend_requires_target, | ||
| ) | ||
| from guidellm.mock_server import MockServer, MockServerConfig | ||
| from guidellm.scheduler import StrategyType | ||
| from guidellm.settings import print_config | ||
|
|
@@ -186,7 +190,12 @@ def benchmark(): | |
| default=BenchmarkGenerativeTextArgs.get_default("request_format"), | ||
| help=( | ||
| "Format to use for requests. Options depend on backend. " | ||
| "If not provided, uses backend default." | ||
| "For vLLM backend: plain (no chat template, text appending only), " | ||
| "default-template (use tokenizer default), or a file path / single-line " | ||
| "template per vLLM docs. Default: default-template" | ||
| "For openai backend: http endpoint path (/v1/chat/completions, " | ||
| "/v1/completions, /v1/audio/transcriptions, /v1/audio/translations) or " | ||
| "alias (e.g. chat_completions); default /v1/chat/completions." | ||
| ), | ||
| ) | ||
| @click.option( | ||
|
|
@@ -460,6 +469,40 @@ def run(**kwargs): # noqa: C901 | |
| status="warning", | ||
| ) | ||
|
|
||
| # Early validation: check target and model parameters based on backend requirements | ||
| backend = kwargs.get("backend", BenchmarkGenerativeTextArgs.get_default("backend")) | ||
| target = kwargs.get("target", None) | ||
| model = kwargs.get("model", None) | ||
| requires_target = backend_requires_target(backend) | ||
| requires_model = backend_requires_model(backend) | ||
| backend_type = backend.type_ if hasattr(backend, "type_") else backend | ||
|
|
||
| # Validate target parameter | ||
| if requires_target and target is None: | ||
| raise click.BadParameter( | ||
| f"Backend '{backend_type}' requires a target parameter. " | ||
| "Please provide --target with a valid endpoint URL.", | ||
| ctx=click.get_current_context(), | ||
| param_hint="--target", | ||
| ) | ||
|
|
||
| if not requires_target and target is not None: | ||
| raise click.BadParameter( | ||
| f"Backend '{backend_type}' does not support a target parameter. " | ||
| "Please remove --target as this backend runs locally.", | ||
| ctx=click.get_current_context(), | ||
| param_hint="--target", | ||
| ) | ||
|
|
||
| # Validate model parameter | ||
| if requires_model and model is None: | ||
| raise click.BadParameter( | ||
| f"Backend '{backend_type}' requires a model parameter. " | ||
| "Please provide --model with a valid model identifier.", | ||
| ctx=click.get_current_context(), | ||
| param_hint="--model", | ||
| ) | ||
|
|
||
|
Comment on lines +472 to +505
Collaborator
Duplicate of the |
||
| try: | ||
| args = BenchmarkGenerativeTextArgs.create( | ||
| scenario=kwargs.pop("scenario", None), **kwargs | ||
|
|
||
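The `--target`/`--model` rules enforced in the diff above can be expressed as one pure function. This is a sketch only: the backend names and the requirement matrix here are assumptions for illustration, not the actual `backend_requires_target` / `backend_requires_model` implementations:

```python
from typing import Optional

def validate_backend_args(
    backend_type: str, target: Optional[str], model: Optional[str]
) -> Optional[str]:
    """Return an error message for an invalid --target/--model combination,
    or None if the combination is acceptable (illustrative rules only)."""
    # Assumption: HTTP-style backends need a target URL, while the
    # in-process vllm_python backend forbids --target and requires --model.
    requires_target = backend_type != "vllm_python"
    requires_model = backend_type == "vllm_python"

    if requires_target and target is None:
        return "--target is required for this backend"
    if not requires_target and target is not None:
        return "--target is not supported; this backend runs locally"
    if requires_model and model is None:
        return "--model is required for this backend"
    return None
```

Centralizing the matrix in one function would also address the duplication the reviewer flags, since both the CLI layer and the schema layer could call it.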
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -19,6 +19,17 @@ | |||||||||||||||||||||||||||||||
| TextCompletionsRequestHandler, | ||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
| # Conditionally import VLLM backend if available | ||||||||||||||||||||||||||||||||
| try: | ||||||||||||||||||||||||||||||||
| from .vllm_python.vllm import VLLMPythonBackend | ||||||||||||||||||||||||||||||||
| from .vllm_python.vllm_response import VLLMResponseHandler | ||||||||||||||||||||||||||||||||
|
Comment on lines +24 to +25
Collaborator
Should import from next level. E.g. |
||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
| HAS_VLLM_BACKEND = True | ||||||||||||||||||||||||||||||||
| except ImportError: | ||||||||||||||||||||||||||||||||
| VLLMPythonBackend = None # type: ignore[assignment, misc] | ||||||||||||||||||||||||||||||||
| VLLMResponseHandler = None # type: ignore[assignment, misc] | ||||||||||||||||||||||||||||||||
| HAS_VLLM_BACKEND = False | ||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
|
Comment on lines +22 to +32
Collaborator
Actually the above comment on vLLM extras is better here:
Suggested change
|
||||||||||||||||||||||||||||||||
| __all__ = [ | ||||||||||||||||||||||||||||||||
| "AudioRequestHandler", | ||||||||||||||||||||||||||||||||
| "Backend", | ||||||||||||||||||||||||||||||||
|
|
@@ -29,3 +40,7 @@ | |||||||||||||||||||||||||||||||
| "OpenAIRequestHandlerFactory", | ||||||||||||||||||||||||||||||||
| "TextCompletionsRequestHandler", | ||||||||||||||||||||||||||||||||
| ] | ||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
| # Conditionally add VLLM backend and handler to exports | ||||||||||||||||||||||||||||||||
| if HAS_VLLM_BACKEND: | ||||||||||||||||||||||||||||||||
| __all__.extend(["VLLMPythonBackend", "VLLMResponseHandler"]) | ||||||||||||||||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| """ | ||
| VLLM Python API backend package. | ||
|
|
||
| Provides the VLLM Python backend and response handler for compiling | ||
| OpenAI-style response dicts into GenerationResponse. | ||
| """ | ||
|
|
||
| from .vllm_response import VLLMResponseHandler | ||
|
|
||
| __all__ = ["VLLMResponseHandler"] |
Do we really want `dev` default instead of `release`??