--max-requests with --rate-type=sweep causes first stage to run all requests synchronously, resulting in extremely long runtimes #588

@cemigo114

Bug Description

When using --rate-type=sweep combined with --max-requests=N, the first (synchronous) stage of the sweep runs all N requests sequentially before moving on to the parallel stages. This results in unexpectedly long benchmark runtimes, especially with larger request counts.

Observed Behavior

A benchmark configured with --max-requests=1000 --rate-type=sweep took ~7 hours to complete on H100 GPUs, whereas it was expected to complete in minutes.

Root Cause

The sweep profile's first stage runs requests synchronously (one at a time). The --max-requests parameter applies to each stage individually, so all 1000 requests are processed sequentially in stage 1 alone before any parallel stage begins.
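To make the cost of the synchronous stage concrete, here is a minimal back-of-the-envelope sketch. It assumes the stage's wall time is simply the request count multiplied by the average per-request latency. The function name and the 25 s latency are illustrative assumptions, not values measured in this report.

```python
def sync_stage_seconds(max_requests: int, avg_latency_s: float) -> float:
    """Estimate wall time of a stage that runs requests one at a time.

    With no concurrency, each request must finish before the next starts,
    so the total is just count * average latency.
    """
    return max_requests * avg_latency_s


# Assumed latency of 25 s per request (hypothetical, for illustration only).
estimate_s = sync_stage_seconds(max_requests=1000, avg_latency_s=25.0)
print(f"synchronous stage: ~{estimate_s / 3600:.2f} hours")
```

Under that assumed latency, the synchronous stage alone would account for most of a multi-hour run, which is consistent with the behavior described above.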

Expected Behavior

One of the following:

1. --max-requests should be the total number of requests across all sweep stages, not a per-stage limit.
2. The synchronous stage should have a separate, smaller limit.
3. Documentation should clearly warn against using high --max-requests values with sweep mode.
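The first two options can be sketched as follows. This is not GuideLLM code: both helper names, the stage count of 10, and the cap of 50 are hypothetical and exist only to show what each option would mean for per-stage request limits.

```python
def split_request_budget(total_requests: int, num_stages: int) -> list[int]:
    """Option 1: treat max-requests as a total and split it across stages.

    The remainder is spread over the earliest stages so the sum is exact.
    """
    if num_stages <= 0:
        raise ValueError("num_stages must be positive")
    if total_requests < 0:
        raise ValueError("total_requests must be non-negative")
    base, extra = divmod(total_requests, num_stages)
    return [base + (1 if i < extra else 0) for i in range(num_stages)]


def capped_stage_limits(max_requests: int, num_stages: int, sync_cap: int) -> list[int]:
    """Option 2: keep the per-stage limit but cap the first (synchronous) stage."""
    if num_stages <= 0:
        raise ValueError("num_stages must be positive")
    return [min(max_requests, sync_cap)] + [max_requests] * (num_stages - 1)


# Hypothetical sweep with 10 stages and --max-requests=1000.
print(split_request_budget(1000, 10))     # 100 requests per stage
print(capped_stage_limits(1000, 10, 50))  # 50 in stage 1, 1000 in the rest
```

Either approach bounds the synchronous stage; the first also changes the meaning of --max-requests for every other stage, which may affect existing users.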

Steps to Reproduce

apiVersion: batch/v1
kind: Job
metadata:
  name: guidellm-benchmark
spec:
  template:
    spec:
      containers:
        - name: guidellm
          image: quay.io/jhurlocker/guidellm:latest
          args:
            - benchmark
            - --target=http://my-llm-service:8080/v1
            - --model=granite-33-8b-instruct
            - --rate-type=sweep
            - --max-requests=1000  # <-- This causes the issue
            - --data=/mnt/prompts/prefix-prompts.csv
            - --output-path=/results/output.json

Operating System

OpenShift (RHOAI 3.0)

Python Version

OpenShift (RHOAI 3.0)

GuideLLM Version

latest

Installation Method

pip install guidellm

Installation Details

No response

Error Messages or Stack Traces

Additional Context

Suggested Improvements

1. Documentation: Add a warning to the README/docs that, in sweep mode, the first stage runs all --max-requests requests synchronously.
2. CLI Warning: Emit a warning when --max-requests > 100 is used with --rate-type=sweep.
3. Design Change: Consider making the synchronous stage use a smaller, fixed request count (e.g., 50-100) regardless of --max-requests.

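The CLI warning suggested above could look roughly like this. It is a sketch only: the function name, its parameters, and where it would be called are assumptions and do not reflect GuideLLM's actual argument handling. The threshold of 100 comes from the suggestion itself.

```python
import warnings
from typing import Optional


def warn_on_large_sweep(
    rate_type: str, max_requests: Optional[int], threshold: int = 100
) -> bool:
    """Warn if a sweep is combined with a large per-stage request limit.

    Returns True when a warning was emitted, False otherwise.
    """
    if rate_type == "sweep" and max_requests is not None and max_requests > threshold:
        warnings.warn(
            f"--max-requests={max_requests} with --rate-type=sweep: the first "
            "sweep stage runs every request synchronously, which can take a "
            "very long time. Consider a smaller value.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Example: the configuration from this report would trigger the warning.
warn_on_large_sweep("sweep", 1000)
```

A warning rather than an error keeps existing invocations working while making the cost visible before the run starts.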
Environment

GuideLLM version: latest
Hardware: NVIDIA H100 GPUs
Platform: OpenShift (RHOAI 3.0)
