vLLM Python Backend #596
Changes from all commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| # Base image: vllm/vllm-openai (has vLLM pre-installed). Override with build arg. | ||
| ARG BASE_IMAGE=vllm/vllm-openai | ||
|
|
||
| FROM $BASE_IMAGE | ||
|
|
||
| # release: take the last version and add a post if build iteration | ||
| # candidate: increment to next minor, add 'rc' with build iteration | ||
| # nightly: increment to next minor, add 'a' with build iteration | ||
| # alpha: increment to next minor, add 'a' with build iteration | ||
| # dev: increment to next minor, add 'dev' with build iteration | ||
| ARG GUIDELLM_BUILD_TYPE=dev | ||
|
|
||
| # Extra dependencies to install (e.g. recommended, all) | ||
| ARG GUIDELLM_BUILD_EXTRAS=recommended,audio | ||
|
|
||
| # Switch to root for installing system deps and pip install | ||
| USER root | ||
|
|
||
| # Install git for setuptools-git-versioning (version discovery) | ||
| RUN apt-get update && apt-get install -y --no-install-recommends git \ | ||
| && rm -rf /var/lib/apt/lists/* | ||
|
|
||
| ENV GUIDELLM_BUILD_TYPE=$GUIDELLM_BUILD_TYPE | ||
|
|
||
| # Copy repository and install GuideLLM from source with pip (no uv, to avoid | ||
|
Collaborator
Hmm. "no uv" sounds stilted ... "not [with] uv" would sound a bit better. |
||
| # conflicting with the pre-installed vLLM in the base image) | ||
| COPY / /src | ||
| WORKDIR /src | ||
| RUN pip install --no-cache-dir ".[${GUIDELLM_BUILD_EXTRAS}]" | ||
|
|
||
| # Metadata | ||
| LABEL io.k8s.display-name="GuideLLM" \ | ||
| org.opencontainers.image.description="GuideLLM Performance Benchmarking Container" \ | ||
| org.opencontainers.image.source="https://github.com/vllm-project/guidellm" \ | ||
| org.opencontainers.image.documentation="https://blog.vllm.ai/guidellm/stable" \ | ||
| org.opencontainers.image.licenses="Apache-2.0" | ||
|
|
||
| ENV HOME="/home/guidellm" \ | ||
| GUIDELLM_OUTPUT_DIR="/results" | ||
|
|
||
| WORKDIR $HOME | ||
|
|
||
| # Ensure that the user home dir can be used by any user | ||
| # (OpenShift Pods can't use the cache otherwise) | ||
|
Comment on lines +43 to +44
Collaborator
Similar comment that I made on Kevin's PR ... granted, he responded that this is essentially a quote from the OpenShift documentation, it still bugs me. It can be used by any user *in gid 0*. That includes both the random uid in OpenShift and the root (0) uid in standalone podman, both in gid 0; but it's still not "any user". |
||
| RUN chgrp -R 0 "$HOME" && chmod -R g=u "$HOME" | ||
|
|
||
| VOLUME /results | ||
|
|
||
| # Root group for k8s | ||
| USER 1001:0 | ||
|
Collaborator
A constant uid? Is that ... wise? |
||
|
|
||
| ENTRYPOINT [ "guidellm" ] | ||
| CMD [ "benchmark", "run" ] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,7 +4,7 @@ weight: -6 | |
|
|
||
| # Run a Benchmark | ||
|
|
||
| After [installing GuideLLM](install.md) and [starting a server](server.md), you're ready to run benchmarks to evaluate your LLM deployment's performance. | ||
| After [installing GuideLLM](install.md) and [starting a server](server.md), you're ready to run benchmarks to evaluate your LLM deployment's performance. Alternatively, you can run benchmarks with the vLLM Python backend (`--backend vllm_python`) without a separate server; see [vLLM Python backend](../guides/vllm-python-backend.md). | ||
|
Collaborator
I'd break this up: "after installing and starting a server [...] or run the Python backend" obscures the latter case. E.g.,
|
||
|
|
||
| Running a GuideLLM benchmark is straightforward. The basic command structure is: | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| # vLLM Python Backend | ||
|
|
||
| The **vLLM Python backend** (`vllm_python`) runs inference in the **same process** as GuideLLM using vLLM's [AsyncLLMEngine](https://docs.vllm.ai/). No HTTP server is involved, reducing overheat and variables. This is useful for isolating performance bottlenecks or simplifying your benchmark setup. You do **not** pass `--target`; you **must** pass `--model`. | ||
|
Collaborator
The bolding on "same process" feels like overkill here. Also, I think you want "overhead", not "overheat"; plus I think the "and variables" doesn't mean much in this context. Maybe "variability" would be better... or "variability due to network bandwidth and latency" or whatever else you have in mind here. |
||
|
|
||
| For all engine options and supported models, see vLLM's [Engine Arguments](https://docs.vllm.ai/en/stable/configuration/engine_args/) and the [vLLM documentation](https://docs.vllm.ai/). | ||
|
|
||
| ## Installation | ||
|
|
||
| ### Recommended methods | ||
|
|
||
| - **Official GuideLLM + vLLM image** | ||
| Build and run the image that uses the vLLM base image (e.g. [Containerfile.vllm](https://github.com/vllm-project/guidellm/blob/main/Containerfile.vllm)). It is based on `vllm/vllm-openai` and installs GuideLLM on top, giving a known-good vLLM + GuideLLM stack with hardware support as provided by the base image. | ||
|
|
||
| **Note:** This method prefers vLLM's requirements over GuideLLM's. Since vLLM is the more complex project, this is the recommended configuration, but it may pin older Python or dependency versions, resulting in sub-optimal GuideLLM performance and behavior in some scenarios. | ||
|
|
||
| - **Existing vLLM installation** | ||
| Install vLLM first for your environment (GPU/CPU, CUDA, etc.), then install GuideLLM in the same environment (e.g. `pip install guidellm` or with extras). You avoid a duplicate vLLM install and reuse your existing acceleration setup. | ||
|
|
||
| **Note:** Using [uv](https://github.com/astral-sh/uv) is not recommended for the vLLM Python backend because of potentially incompatible requirements between the two projects. Prefer pip or the container / existing vLLM environment. | ||
|
|
||
|
|
||
| It is also possible to install GuideLLM and vLLM via pip using `pip install guidellm[vllm]`. This method may make **hardware acceleration** (e.g. CUDA) harder to get working. See [vLLM installation](https://docs.vllm.ai/en/latest/getting_started/installation) and GPU/hardware-specific docs there. For production or GPU use, the container or existing-install path is recommended. | ||
|
|
||
|
|
||
| ## Basic example | ||
|
|
||
| Run a benchmark with the vLLM Python backend (no `--target`): | ||
|
Collaborator
This wording might suggest that it's the lack of |
||
|
|
||
| ```bash | ||
| guidellm benchmark run \ | ||
| --backend vllm_python \ | ||
| --model "Qwen/Qwen3-0.6B" \ | ||
| --data "prompt_tokens=256,output_tokens=128" \ | ||
| --max-seconds 20 \ | ||
| --rate 3 | ||
| ``` | ||
|
|
||
| Engine behavior (device, memory, etc.) follows vLLM defaults unless you override it via `--backend-kwargs` (e.g. `vllm_config`). When running without a GPU (e.g. the GuideLLM + vLLM container without GPU access), the backend automatically uses the CPU device unless you set `device` in `vllm_config`. For engine configuration options, see vLLM's [Engine Arguments](https://docs.vllm.ai/en/stable/configuration/engine_args/). | ||
|
|
||
| ## Request format and backend options | ||
|
|
||
| - **`--request-format`** | ||
| Controls how chat prompts are built. Options: `plain` (no chat template; message content is concatenated as plain text), `default-template` (use the tokenizer’s default chat template), or a file path / single-line template string per vLLM’s supported options. The value is passed through to vLLM's chat template handling. For details, see vLLM's [Chat templates](https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/chat_templates/) documentation. | ||
|
|
||
| - **`--backend-kwargs`** | ||
| Backend-specific options are passed here as a JSON object: pass a `vllm_config` key whose value is a dict of engine option names and values. You can also pass `request_format` here as an alternative to `--request-format`. | ||
|
|
||
| **Using Engine Arguments in `vllm_config`:** The [Engine Arguments](https://docs.vllm.ai/en/stable/configuration/engine_args/) documentation describes options in **CLI form** (e.g. `--gpu-memory-utilization`, `--max-model-len`). For `vllm_config` you must use the **Python parameter names** instead: strip the leading `--` and replace dashes with underscores (e.g. `gpu_memory_utilization`, `max_model_len`). The keys are the same as the field names on vLLM's `EngineArgs` and `AsyncEngineArgs` dataclasses; for the exact list of allowed keys and types, see the [vLLM source: `vllm/engine/arg_utils.py`](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py) (search for `class EngineArgs`). | ||
|
|
||
| Example — limit GPU memory use and context length: | ||
|
|
||
| ```bash | ||
| --backend-kwargs '{"vllm_config": {"gpu_memory_utilization": 0.8, "max_model_len": 4096}}' | ||
| ``` | ||
|
|
||
| For the full list of options and their types, see vLLM's [Engine Arguments](https://docs.vllm.ai/en/stable/configuration/engine_args/) (CLI form) and the [EngineArgs source](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py) (Python field names for `vllm_config`). | ||
|
|
||
| ## See also | ||
|
|
||
| - [Backends](backends.md) — Overview of supported backends. | ||
| - [Run a benchmark](../getting-started/benchmark.md) — General benchmark options. | ||
| - [vLLM Engine Arguments](https://docs.vllm.ai/en/stable/configuration/engine_args/) — CLI-oriented docs; use Python names (e.g. `gpu_memory_utilization`) in `vllm_config`. | ||
| - [vLLM source: `vllm/engine/arg_utils.py`](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py) — `EngineArgs` / `AsyncEngineArgs` field names and types for `vllm_config` keys. | ||
| - [vLLM AsyncEngineArgs API](https://docs.vllm.ai/en/stable/api/vllm/engine/arg_utils/#vllm.engine.arg_utils.AsyncEngineArgs) — API reference for the class that receives these options. | ||
| - [vLLM Chat templates](https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/chat_templates/) — For `--request-format` behavior. | ||
| - [vLLM documentation](https://docs.vllm.ai/) | ||
| - [vLLM installation](https://docs.vllm.ai/en/latest/getting_started/installation) | ||
| - [vLLM OpenAI-compatible server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) — When using the HTTP server instead of the Python backend. | ||
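The CLI-name-to-Python-name mapping that the guide describes for `vllm_config` keys can be sketched in a few lines. This is only an illustration of the rule (strip the leading `--`, replace dashes with underscores); the helper name `cli_flag_to_engine_key` is hypothetical and not part of GuideLLM or vLLM:

```python
import json

def cli_flag_to_engine_key(flag: str) -> str:
    """Map a vLLM CLI flag (e.g. '--gpu-memory-utilization') to the
    Python field name expected inside vllm_config."""
    return flag.lstrip("-").replace("-", "_")

# Build the JSON string that would be passed via --backend-kwargs.
vllm_config = {
    cli_flag_to_engine_key("--gpu-memory-utilization"): 0.8,
    cli_flag_to_engine_key("--max-model-len"): 4096,
}
backend_kwargs = json.dumps({"vllm_config": vllm_config})
print(backend_kwargs)
```

The resulting string matches the `--backend-kwargs` example shown in the guide; whether a given key is accepted still depends on vLLM's `EngineArgs` fields.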
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -46,6 +46,10 @@ | |
| get_builtin_scenarios, | ||
| reimport_benchmarks_report, | ||
| ) | ||
| from guidellm.benchmark.schemas.generative.entrypoints import ( | ||
| backend_requires_model, | ||
| backend_requires_target, | ||
| ) | ||
| from guidellm.mock_server import MockServer, MockServerConfig | ||
| from guidellm.scheduler import StrategyType | ||
| from guidellm.settings import print_config | ||
|
|
@@ -186,7 +190,12 @@ def benchmark(): | |
| default=BenchmarkGenerativeTextArgs.get_default("request_format"), | ||
| help=( | ||
| "Format to use for requests. Options depend on backend. " | ||
| "If not provided, uses backend default." | ||
| "For vLLM backend: plain (no chat template, text appending only), " | ||
| "default-template (use tokenizer default), or a file path / single-line " | ||
| "template per vLLM docs. Default: default-template" | ||
| "For openai backend: http endpoint path (/v1/chat/completions, " | ||
| "/v1/completions, /v1/audio/transcriptions, /v1/audio/translations) or " | ||
| "alias (e.g. chat_completions); default /v1/chat/completions." | ||
| ), | ||
| ) | ||
| @click.option( | ||
|
|
@@ -460,6 +469,40 @@ def run(**kwargs): # noqa: C901 | |
| status="warning", | ||
| ) | ||
|
|
||
| # Early validation: check target and model parameters based on backend requirements | ||
| backend = kwargs.get("backend", BenchmarkGenerativeTextArgs.get_default("backend")) | ||
| target = kwargs.get("target", None) | ||
| model = kwargs.get("model", None) | ||
| requires_target = backend_requires_target(backend) | ||
| requires_model = backend_requires_model(backend) | ||
| backend_type = backend.type_ if hasattr(backend, "type_") else backend | ||
|
|
||
| # Validate target parameter | ||
| if requires_target and target is None: | ||
| raise click.BadParameter( | ||
| f"Backend '{backend_type}' requires a target parameter. " | ||
| "Please provide --target with a valid endpoint URL.", | ||
| ctx=click.get_current_context(), | ||
| param_hint="--target", | ||
| ) | ||
|
|
||
| if not requires_target and target is not None: | ||
| raise click.BadParameter( | ||
| f"Backend '{backend_type}' does not support a target parameter. " | ||
| "Please remove --target as this backend runs locally.", | ||
| ctx=click.get_current_context(), | ||
| param_hint="--target", | ||
| ) | ||
|
|
||
| # Validate model parameter | ||
| if requires_model and model is None: | ||
| raise click.BadParameter( | ||
| f"Backend '{backend_type}' requires a model parameter. " | ||
| "Please provide --model with a valid model identifier.", | ||
| ctx=click.get_current_context(), | ||
| param_hint="--model", | ||
| ) | ||
|
|
||
|
Comment on lines +472 to +505
Collaborator
Duplicate of the |
||
| try: | ||
| args = BenchmarkGenerativeTextArgs.create( | ||
| scenario=kwargs.pop("scenario", None), **kwargs | ||
|
|
||
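The `--target`/`--model` rules enforced in the diff above can be expressed as one pure function. This is a sketch only: the backend names and the requirement matrix here are assumptions for illustration, not the actual `backend_requires_target` / `backend_requires_model` implementations:

```python
from typing import Optional

def validate_backend_args(
    backend_type: str, target: Optional[str], model: Optional[str]
) -> Optional[str]:
    """Return an error message for an invalid --target/--model combination,
    or None if the combination is acceptable (illustrative rules only)."""
    # Assumption: HTTP-style backends need a target URL, while the
    # in-process vllm_python backend forbids --target and requires --model.
    requires_target = backend_type != "vllm_python"
    requires_model = backend_type == "vllm_python"

    if requires_target and target is None:
        return "--target is required for this backend"
    if not requires_target and target is not None:
        return "--target is not supported; this backend runs locally"
    if requires_model and model is None:
        return "--model is required for this backend"
    return None
```

Centralizing the matrix in one function would also address the duplication the reviewer flags, since both the CLI layer and the schema layer could call it.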
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -19,6 +19,17 @@ | |||||||||||||||||||||||||||||||
| TextCompletionsRequestHandler, | ||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
| # Conditionally import VLLM backend if available | ||||||||||||||||||||||||||||||||
| try: | ||||||||||||||||||||||||||||||||
| from .vllm_python.vllm import VLLMPythonBackend | ||||||||||||||||||||||||||||||||
| from .vllm_python.vllm_response import VLLMResponseHandler | ||||||||||||||||||||||||||||||||
|
Comment on lines +24 to +25
Collaborator
Should import from next level. E.g. |
||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
| HAS_VLLM_BACKEND = True | ||||||||||||||||||||||||||||||||
| except ImportError: | ||||||||||||||||||||||||||||||||
| VLLMPythonBackend = None # type: ignore[assignment, misc] | ||||||||||||||||||||||||||||||||
| VLLMResponseHandler = None # type: ignore[assignment, misc] | ||||||||||||||||||||||||||||||||
| HAS_VLLM_BACKEND = False | ||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
|
Comment on lines +22 to +32
Collaborator
Actually the above comment on vLLM extras is better here:
Suggested change
|
||||||||||||||||||||||||||||||||
| __all__ = [ | ||||||||||||||||||||||||||||||||
| "AudioRequestHandler", | ||||||||||||||||||||||||||||||||
| "Backend", | ||||||||||||||||||||||||||||||||
|
|
@@ -29,3 +40,7 @@ | |||||||||||||||||||||||||||||||
| "OpenAIRequestHandlerFactory", | ||||||||||||||||||||||||||||||||
| "TextCompletionsRequestHandler", | ||||||||||||||||||||||||||||||||
| ] | ||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
| # Conditionally add VLLM backend and handler to exports | ||||||||||||||||||||||||||||||||
| if HAS_VLLM_BACKEND: | ||||||||||||||||||||||||||||||||
| __all__.extend(["VLLMPythonBackend", "VLLMResponseHandler"]) | ||||||||||||||||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| """ | ||
| VLLM Python API backend package. | ||
|
|
||
| Provides the VLLM Python backend and response handler for compiling | ||
| OpenAI-style response dicts into GenerationResponse. | ||
| """ | ||
|
|
||
| from .vllm_response import VLLMResponseHandler | ||
|
|
||
| __all__ = ["VLLMResponseHandler"] |
Do we really want `dev` default instead of `release`??