This project demonstrates a production-grade deployment of vLLM on Kubernetes, featuring multi-model serving (Vision-Language & Text), LoRA adapters, distributed tracing, and load testing.
- Multi-Model Deployment: Running Qwen3-VL-30B (Vision) and Qwen3-8B (Text) simultaneously.
- LoRA Support: Dynamic loading of LoRA adapters (Translation, Gaia) on the 8B model.
- Distributed Tracing: Full OpenTelemetry (OTEL) integration with Jaeger for visualizing request flows.
- Load Testing: Integrated `llmperf` setup for benchmarking and traffic generation.
- High Performance: Optimized for NVIDIA H100 GPUs.
- Hardware: NVIDIA GPUs (H100 recommended for this specific config).
- Software:
- Docker
- Minikube (with GPU support)
- Helm & Kubectl
- Python 3.10+ (for load testing)
Use the included Makefile to set up the Kubernetes environment:
```bash
# Install tools and set up Minikube with GPU support
make setup-k8s

# Verify GPU availability
make verify-gpu
```
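If you want to inspect GPU capacity yourself rather than relying on the Makefile target, here is a minimal sketch over kubectl's JSON output. It assumes `kubectl` is on your PATH and that the NVIDIA device plugin exposes GPUs under the standard `nvidia.com/gpu` resource name:

```python
# check_gpus.py -- minimal sketch; assumes kubectl is on PATH and the
# NVIDIA device plugin advertises GPUs as the "nvidia.com/gpu" resource
import json
import subprocess

nodes = json.loads(subprocess.check_output(["kubectl", "get", "nodes", "-o", "json"]))
for node in nodes["items"]:
    name = node["metadata"]["name"]
    gpus = node["status"].get("allocatable", {}).get("nvidia.com/gpu", "0")
    print(f"{name}: {gpus} allocatable GPU(s)")
```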
Deploy Jaeger and the OpenTelemetry collector:

```bash
kubectl apply -f configs/tracing/
```

Deploy the Qwen models with tracing enabled:
```bash
# Add the vLLM Helm repo
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

# Deploy the configuration
helm install vllm vllm/vllm-stack -f configs/values-qwen-vl-lora-tracing.yaml
```

Wait for the pods to be ready:

```bash
kubectl get pods -w
```
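If you'd rather script the wait than watch, a minimal readiness poll over kubectl's JSON output (default namespace assumed):

```python
# wait_ready.py -- minimal sketch; assumes kubectl is on PATH and the
# release runs in the default namespace
import json
import subprocess
import time

while True:
    out = subprocess.check_output(["kubectl", "get", "pods", "-o", "json"])
    pods = json.loads(out)["items"]
    ready = sum(
        1 for p in pods
        if p["status"].get("containerStatuses")
        and all(c["ready"] for c in p["status"]["containerStatuses"])
    )
    print(f"{ready}/{len(pods)} pods ready")
    if pods and ready == len(pods):
        break
    time.sleep(10)
```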
Access the Jaeger UI to visualize traces:

```bash
kubectl port-forward svc/jaeger-query 16686:16686
```

Open http://localhost:16686 in your browser.
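You can also pull traces programmatically through Jaeger's HTTP query API. A minimal sketch, assuming the port-forward above is active; the service names returned are whatever the router and model pods report to Jaeger:

```python
# list_traces.py -- minimal sketch against Jaeger's HTTP query API
import json
import urllib.request

BASE = "http://localhost:16686"

# List the service names that have reported spans
with urllib.request.urlopen(f"{BASE}/api/services") as resp:
    services = json.load(resp)["data"] or []
print("services:", services)

# Fetch recent traces for the first service (pick yours explicitly in practice)
if services:
    url = f"{BASE}/api/traces?service={services[0]}&limit=5"
    with urllib.request.urlopen(url) as resp:
        traces = json.load(resp)["data"] or []
    for t in traces:
        print(t["traceID"], f"{len(t['spans'])} span(s)")
```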
We use `llmperf` to benchmark performance and generate trace data.

```bash
git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .
```

Run a quick load test against the Qwen3-8B model:
```bash
export OPENAI_API_BASE="http://localhost:30080/v1"
export OPENAI_API_KEY="test"

# Port-forward the router first
kubectl port-forward svc/vllm-router-service 30080:80 &

# Run the test
python token_benchmark_ray.py \
  --model "Qwen/Qwen3-8B" \
  --mean-input-tokens 256 \
  --mean-output-tokens 128 \
  --num-concurrent-requests 5 \
  --max-num-completed-requests 20 \
  --llm-api openai \
  --results-dir "./results/quick-test"
```

See `configs/LOAD-TESTING-QUICKSTART.md` for more scenarios.
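Before a longer run, a single-request smoke test through the same router endpoint can save time. A minimal sketch with the `openai` Python client, using the served model name from the benchmark above (`max_tokens` is an arbitrary choice):

```python
# smoke_test.py -- one-off request through the router; assumes the
# port-forward above is active and the server accepts any API key
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30080/v1", api_key="test")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```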
- `configs/values-qwen-vl-lora-tracing.yaml`: Main Helm config for the Qwen models + OTEL.
- `configs/tracing/`: Kubernetes manifests for Jaeger and the OTEL collector.
- `Makefile`: Automation for cluster setup and verification.
- Qwen3-VL-30B-A3B-Instruct: Vision-Language model on GPU 0.
- Qwen3-8B: Text model (2 replicas) on GPUs 1 & 2.
- Adapters: `Qwen3-8B-Translator-LoRA`, `Gaia-LLM-8B` (see the request sketch below).
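A sketch of how these models and adapters could be addressed through the OpenAI-compatible API: vLLM exposes LoRA adapters as additional model names, and vision-language models accept `image_url` content parts. The served model names depend on the Helm values, and the image URL is a placeholder:

```python
# adapter_and_vision.py -- minimal sketch; assumes the router port-forward
# from the load-testing section is active on localhost:30080
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30080/v1", api_key="test")

# Target a LoRA adapter by using its registered name as the model
resp = client.chat.completions.create(
    model="Qwen3-8B-Translator-LoRA",
    messages=[{"role": "user", "content": "Translate to French: good morning"}],
)
print(resp.choices[0].message.content)

# Send an image to the vision-language model
resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-30B-A3B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            # placeholder image URL -- replace with your own
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```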
Traces are captured at three levels:
- Router: Traces request routing.
- Models: Traces inference execution.
- LoRA: Traces adapter loading and switching.
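To correlate client activity with these server-side spans, you can start a client span and inject the W3C `traceparent` header into the request. A minimal sketch with the OpenTelemetry Python SDK; it assumes the collector's OTLP gRPC port (4317) is port-forwarded locally and that the server is configured to honor incoming trace context:

```python
# traced_request.py -- minimal sketch; requires opentelemetry-sdk,
# opentelemetry-exporter-otlp, and requests
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export client spans to the OTEL collector (endpoint is an assumption)
provider = TracerProvider(resource=Resource.create({"service.name": "llm-client"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("client-chat-request"):
    headers = {"Authorization": "Bearer test"}
    inject(headers)  # adds the W3C traceparent header for the current span
    resp = requests.post(
        "http://localhost:30080/v1/chat/completions",
        headers=headers,
        json={
            "model": "Qwen/Qwen3-8B",
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 8,
        },
    )
    print(resp.status_code)

provider.shutdown()  # flush spans before exit
```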
Future enhancements planned for this stack:
- Streamlit Chat UI: A web interface to chat with models, switch adapters, and upload images for VL tasks.
- KEDA Autoscaling: Scale Qwen replicas based on request queue depth.
- Secure Serving: Implement API key authentication and TLS termination.
