docs: Add guide for local GPU and Gang Scheduling setup #3217

rishikesh737 wants to merge 1 commit into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:

- Join the community
- Feel free to ask questions in the comments if you need any help or clarification!
Signed-off-by: Rishikesh Pednekar <rishikeshpednekar34@gmail.com>
Force-pushed from 0a6389f to 1c6fdb8
Pull request overview
This PR attempts to add documentation for setting up a local development environment with GPU passthrough and Gang Scheduling (Volcano). However, the documentation is written for Kubeflow Training Operator V1, while this repository maintains Kubeflow Trainer V2 (the new architecture).
Changes:

- Added `docs/development/local-gpu-gang-scheduling.md` with V1-specific setup instructions
- Added a link to the new guide in `docs/development/developer_guide.md` (under the Code Style section)
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/development/local-gpu-gang-scheduling.md | New guide for GPU and Gang Scheduling setup (V1-specific, incompatible with current V2 codebase) |
| docs/development/developer_guide.md | Minor edits including duplicate Docker entry, typo fix needed, and link to new guide in unusual location |
```md
## Requirements

- [Go](https://golang.org/) (1.22 or later)
- [Docker](https://docs.docker.com/)
```
Duplicate Docker entry. Line 8 lists Docker without a version, and line 9 specifies Docker (20.10 or later). Remove the duplicate on line 8.
Suggested change:
```diff
- - [Docker](https://docs.docker.com/)
```
```sh
cd ./manifests/overlays/standalone
kustomize edit set image my-username/training-operator=my-username/training-operator:my-pr-01
```
Update the `newTag` key in `./manifests/overlayes/standalone/kustimization.yaml` with the new image.
Typo in filename: "kustimization.yaml" should be "kustomization.yaml".
Suggested change:
```diff
- Update the `newTag` key in `./manifests/overlayes/standalone/kustimization.yaml` with the new image.
+ Update the `newTag` key in `./manifests/overlays/standalone/kustomization.yaml` with the new image.
```
```sh
black --check --exclude '/*kubeflow_org_v1*|__init__.py|api_client.py|configuration.py|exceptions.py|rest.py' sdk/
```
The link to the new guide is placed within a bullet point under the "Code Style" section. This placement is inconsistent with typical documentation structure. Consider either: 1) Creating a new top-level section like "## Additional Guides" or "## See Also" at the end of the document, or 2) Adding it to an existing relevant section like after "Running the Operator Locally" or "Testing changes locally".
Suggested change:
```md
## Additional Guides
```
# Local Development with GPU and Gang Scheduling

This guide explains how to set up a local development environment for the Training Operator that supports both GPU passthrough and Gang Scheduling (Volcano).
## Prerequisites

- **Linux Environment** (Fedora/Ubuntu)
- **Docker** (v20.10+) with NVIDIA Container Runtime configured
- **Minikube** (using the Docker driver)
- **NVIDIA GPU** (drivers installed)
## 1. Cluster Setup

Minikube must be started with the Docker driver to support GPU passthrough. The `podman` driver does not currently support the `--gpus` flag.

```bash
minikube start --driver=docker --gpus=all
```

Verify GPU visibility on the node:

```bash
kubectl get node minikube -o jsonpath='{.status.allocatable}'
# Output should contain "nvidia.com/gpu": "1"
```
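To go beyond the allocatable check above, you can run a throwaway pod that actually requests the GPU. This is a minimal sketch, not part of the PR: the pod name and the CUDA image tag are illustrative placeholders.

```yaml
# gpu-smoke-test.yaml -- hypothetical smoke-test pod (name and image are examples)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # pod stays Pending if no GPU is allocatable
```

If the device plugin is working, `kubectl logs gpu-smoke-test` should show the `nvidia-smi` table for the passed-through GPU.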
## 2. Install Volcano Scheduler

The standalone Training Operator manifests do not include a scheduler. For Gang Scheduling to work, you must install Volcano.

```bash
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
```

Verify that the `PodGroup` CRDs are present:

```bash
kubectl get crds | grep podgroups.scheduling.volcano.sh
```
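For orientation, the `PodGroup` objects these CRDs define look roughly like this (a hand-written sketch; in practice the Training Operator creates one per job, and the name and `minMember` value here are illustrative):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: example-podgroup  # illustrative name
spec:
  # Gang semantics: Volcano schedules none of the group's pods
  # until at least minMember of them can be placed together.
  minMember: 3
```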
## 3. Install Training Operator with Gang Scheduling Enabled

By default, the standalone installation **does not** enable Gang Scheduling. You must enable it by passing the `--gang-scheduler-name` flag to the operator binary.

### Option A: Patching an Existing Installation

If you have already installed the operator via manifests, patch the deployment:

```bash
kubectl patch deployment training-operator -n kubeflow --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--gang-scheduler-name=volcano"]}]'
```

### Verification

Check the logs to ensure the operator initialized the Volcano controller:

```bash
kubectl logs -n kubeflow -l control-plane=kubeflow-training-operator
```
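As an alternative to patching the live deployment, the same flag could be baked into the standalone overlay with a kustomize patch along these lines. This is a sketch, not part of the PR: it assumes the overlay layout referenced elsewhere in this guide and that no other args are already set on the container.

```yaml
# manifests/overlays/standalone/kustomization.yaml (hypothetical excerpt)
patches:
  - target:
      kind: Deployment
      name: training-operator
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/args
        value: ["--gang-scheduler-name=volcano"]
```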
## 4. Verifying Gang Scheduling (Deadlock Prevention)

To verify that gang scheduling is active, you can submit a job that requires resources exceeding the cluster capacity.

1. Create a `PyTorchJob` with `spec.schedulerName: volcano`.
2. Ensure the `PodGroup` is created:
   ```bash
   kubectl get podgroups
   ```
3. If resources are insufficient, the PodGroup should remain `Pending`, and **no pods should be created**. This prevents resource deadlocks where partial jobs consume cluster capacity.
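A job for step 1 might look like the following. This is a sketch against the V1 `PyTorchJob` API (which is what this guide targets); the name, image, and replica count are illustrative, and note that the scheduler name is set on the pod template spec:

```yaml
# pytorchjob-gang-test.yaml -- illustrative job requesting more GPUs
# than the cluster has, to exercise gang scheduling
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: gang-test  # illustrative name
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 2  # e.g. 2 workers x 1 GPU each > 1 GPU available
      template:
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

With gang scheduling active, both worker pods should stay unscheduled together, rather than one starting and holding the only GPU.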
This documentation appears to be for Kubeflow Training Operator V1, but this repository is for Kubeflow Trainer V2 (the new system). According to the README, V1 is maintained at the release-1.9 branch. The concepts referenced in this guide (the --gang-scheduler-name flag, manifests/overlays/standalone, and training-operator deployment) are from the V1 system and don't apply to the current V2 architecture. This documentation should either be: 1) Added to the release-1.9 branch instead, or 2) Completely rewritten to apply to Trainer V2 with its new architecture (TrainJob, TrainingRuntime APIs, and Volcano plugin configuration via RuntimePolicy.PodGroupPolicy.Volcano).
Motivation
Currently, the developer guide does not cover how to set up a local environment with GPU passthrough (Minikube + NVIDIA) or how to enable Gang Scheduling (Volcano) for the standalone operator.

The standalone installation defaults to `default-scheduler` because the `--gang-scheduler-name` flag is missing from the deployment args. This leads to resource deadlocks (split-brain scheduling) during local testing of distributed training jobs when resources are limited.

Changes

- Added `docs/development/local-gpu-gang-scheduling.md` detailing the setup, including the `kubectl patch` command to enable the `--gang-scheduler-name=volcano` flag.
- Added a link to the new guide in `docs/development/developer_guide.md`.

Verification
I have verified these steps on a local Fedora environment with an NVIDIA T1000.

- Confirmed the `PodGroup` resources are created (`kubectl get podgroups`).