docs: add k8s guide and Llama-3.2-3b-Instruct example-run #161

sallyom wants to merge 1 commit into vllm-project:main
Conversation
Force-pushed from 5723ff7 to f7965aa
Signed-off-by: sallyom <somalley@redhat.com>
sjmonson left a comment:
Apologies for the late review. We were waiting for our official container image to land; see the comments below about replacing your included Dockerfile with the official image.
```yaml
- name: guidellm
  # TODO: replace this image
  image: quay.io/sallyom/guidellm:latest
  imagePullPolicy: IfNotPresent
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  args:
    - benchmark
    - --target=$(TARGET)
    - --data=$(DATA)
    - --rate-type=sweep
    - --model=$(MODEL)
    - --output-path=/app/data/llama32-3b.yaml
  env:
    # HF_TOKEN is not necessary if you share/use the model PVC. Guidellm needs to access the tokenizer file.
    # You can provide a path to the tokenizer file by passing `--tokenizer=/path/to/model`. If you do not
    # pass the tokenizer path, Guidellm will get the tokenizer file(s) from Huggingface.
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: huggingface-secret
    - name: TARGET
      value: "http://llm-d-inference-gateway.llm-d.svc.cluster.local:80/v1"
    - name: DATA_TYPE
      value: "emulated"
    - name: DATA
      value: "prompt_tokens=512,output_tokens=128"
    - name: MODEL
      value: "meta-llama/Llama-3.2-3B-Instruct"
  volumeMounts:
    - name: output
      mountPath: /app/data
```
We now have our own image. Utilize that and the built-in support for environment variables (see /deploy for more details).
Suggested change:

```yaml
- name: guidellm
  image: ghcr.io/neuralmagic/guidellm:latest
  imagePullPolicy: IfNotPresent
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  env:
    # HF_TOKEN is not necessary if you share/use the model PVC. Guidellm needs to access the tokenizer file.
    # You can provide a path to the tokenizer file by passing `--tokenizer=/path/to/model`. If you do not
    # pass the tokenizer path, Guidellm will get the tokenizer file(s) from Huggingface.
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: huggingface-secret
    - name: GUIDELLM_TARGET
      value: "http://llm-d-inference-gateway.llm-d.svc.cluster.local:80"
    - name: GUIDELLM_RATE_TYPE
      value: "sweep"
    - name: GUIDELLM_DATA
      value: "prompt_tokens=512,output_tokens=128"
    - name: GUIDELLM_MODEL
      value: "meta-llama/Llama-3.2-3B-Instruct"
  volumeMounts:
    - name: output
      mountPath: /app/data
```
Also note that `--data-type` is deprecated; I removed it in the suggestion. Additionally, `--rate-type` was not configurable, so I added it as an env var.
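For readers wiring this up from scratch, the suggested container would sit inside a Job roughly like the sketch below. The Job name, `backoffLimit`, and `output-pvc` claim name are illustrative assumptions, not part of the suggestion, and the HF_TOKEN env and securityContext from the suggestion are elided for brevity:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: guidellm-benchmark   # hypothetical name
spec:
  backoffLimit: 0            # fail fast rather than retrying a long benchmark
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: guidellm
          image: ghcr.io/neuralmagic/guidellm:latest
          imagePullPolicy: IfNotPresent
          env:
            - name: GUIDELLM_TARGET
              value: "http://llm-d-inference-gateway.llm-d.svc.cluster.local:80"
            - name: GUIDELLM_RATE_TYPE
              value: "sweep"
            - name: GUIDELLM_DATA
              value: "prompt_tokens=512,output_tokens=128"
            - name: GUIDELLM_MODEL
              value: "meta-llama/Llama-3.2-3B-Instruct"
          volumeMounts:
            - name: output
              mountPath: /app/data   # benchmark results land here
      volumes:
        - name: output
          persistentVolumeClaim:
            claimName: output-pvc    # assumption: any writable PVC works
```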
Drop this file since we now have an official container.
Suggested change:

```diff
-> **📝 NOTE:** [Dockerfile](./Dockerfile) was used to build the image for the guidellm-job pod.
```
```markdown
> **📝 NOTE:** The HF_TOKEN is passed to the job, but this will not be necessary if you use the same PVC as the one storing your model.
> Guidellm uses the model's tokenizer/processor files in its evaluation. You can pass a path instead with `--tokenizer=/path/to/model`.
> This eliminates the need for Guidellm to download the files from Huggingface.
```
Suggested change:

```diff
-> This eliminates the need for Guidellm to download the files from Huggingface.
+> This eliminates the need for GuideLLM to download the files from Hugging Face.
```
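As a concrete sketch of what the note above describes: mount the model PVC into the benchmark container and point GuideLLM at the on-disk tokenizer, so `HF_TOKEN` can be dropped. The `model-storage` volume name, `/models` mount path, and `llama-model-pvc` claim name are illustrative assumptions:

```yaml
containers:
  - name: guidellm
    image: ghcr.io/neuralmagic/guidellm:latest
    args:
      - benchmark
      - --tokenizer=/models/Llama-3.2-3B-Instruct   # read tokenizer files from the PVC
    volumeMounts:
      - name: model-storage
        mountPath: /models
        readOnly: true
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: llama-model-pvc   # assumption: the same PVC that stores the model
```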
```diff
@@ -0,0 +1,53 @@
+## Run Guidellm with Kubernetes Job
```
There are a bunch of case issues with the GuideLLM name. Please fix them to be consistent with the rest of our documentation.
Suggested change:

```diff
-## Run Guidellm with Kubernetes Job
+## Run GuideLLM with Kubernetes Job
```
Also, please rebase on main for the latest CI.
I'll finally get back to this and will update based on the review; I believe it will still be a nice addition.
I've been running in K8s with the job added in this PR; maybe it's useful to document as an example for everyone. Also adding a simple `analyze_benchmarks.py` script to analyze the results of a GuideLLM run. Here's an example showing the plots generated by the `analyze_benchmarks.py` script, run with output from a minikube/Llama-3.2-3B-Instruct run.