
docs: Add guide for local GPU and Gang Scheduling setup #3217

Open
rishikesh737 wants to merge 1 commit into kubeflow:master from rishikesh737:docs/gpu-gang-scheduling-guide

Conversation

@rishikesh737

Motivation

Currently, the developer guide does not cover how to set up a local environment with GPU passthrough (Minikube + NVIDIA) or how to enable Gang Scheduling (Volcano) for the standalone operator.

The standalone installation defaults to default-scheduler because the --gang-scheduler-name flag is missing from the deployment args. This leads to resource deadlocks (split-brain scheduling) during local testing of distributed training jobs when resources are limited.

Changes

  • Added docs/development/local-gpu-gang-scheduling.md detailing:
    • Correct Minikube driver setup for NVIDIA GPUs.
    • Volcano installation steps.
    • The required kubectl patch command to enable the --gang-scheduler-name=volcano flag.
    • Verification steps using PodGroups to prove deadlock prevention.
  • Linked new guide in docs/development/developer_guide.md.

Verification

I have verified these steps on a local Fedora environment with an NVIDIA T1000.

  • Confirmed GPU passthrough works with the Docker driver.
  • Confirmed Gang Scheduling prevents partial pod creation when resources are constrained (Deadlock prevention verified via kubectl get podgroups).

Copilot AI review requested due to automatic review settings February 17, 2026 14:17
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot requested a review from jinchihe February 17, 2026 14:17
@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Signed-off-by: Rishikesh Pednekar <rishikeshpednekar34@gmail.com>
@rishikesh737 rishikesh737 force-pushed the docs/gpu-gang-scheduling-guide branch from 0a6389f to 1c6fdb8 on February 17, 2026 14:20

Copilot AI left a comment


Pull request overview

This PR attempts to add documentation for setting up a local development environment with GPU passthrough and Gang Scheduling (Volcano). However, the documentation is written for Kubeflow Training Operator V1, while this repository maintains Kubeflow Trainer V2 (the new architecture).

Changes:

  • Added docs/development/local-gpu-gang-scheduling.md with V1-specific setup instructions
  • Added a link to the new guide in docs/development/developer_guide.md (under the Code Style section)

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docs/development/local-gpu-gang-scheduling.md | New guide for GPU and Gang Scheduling setup (V1-specific, incompatible with current V2 codebase) |
| docs/development/developer_guide.md | Minor edits including duplicate Docker entry, typo fix needed, and link to new guide in unusual location |

## Requirements

- [Go](https://golang.org/) (1.22 or later)
- [Docker](https://docs.docker.com/)

Copilot AI Feb 17, 2026


Duplicate Docker entry. Line 8 lists Docker without a version, and line 9 specifies Docker (20.10 or later). Remove the duplicate on line 8.

Suggested change
- [Docker](https://docs.docker.com/)

```sh
cd ./manifests/overlays/standalone
kustomize edit set image my-username/training-operator=my-username/training-operator:my-pr-01
```
Update the `newTag` key in `./manifests/overlayes/standalone/kustimization.yaml` with the new image.

Copilot AI Feb 17, 2026


Typos in the path: "overlayes" should be "overlays" and "kustimization.yaml" should be "kustomization.yaml".

Suggested change
Update the `newTag` key in `./manifests/overlayes/standalone/kustimization.yaml` with the new image.
Update the `newTag` key in `./manifests/overlays/standalone/kustomization.yaml` with the new image.

```sh
black --check --exclude '/*kubeflow_org_v1*|__init__.py|api_client.py|configuration.py|exceptions.py|rest.py' sdk/
```


Copilot AI Feb 17, 2026


The link to the new guide is placed within a bullet point under the "Code Style" section. This placement is inconsistent with typical documentation structure. Consider either: 1) Creating a new top-level section like "## Additional Guides" or "## See Also" at the end of the document, or 2) Adding it to an existing relevant section like after "Running the Operator Locally" or "Testing changes locally".

Suggested change
## Additional Guides

Comment on lines +1 to +67
# Local Development with GPU and Gang Scheduling

This guide explains how to set up a local development environment for the Training Operator that supports both GPU passthrough and Gang Scheduling (Volcano).

## Prerequisites

- **Linux Environment** (Fedora/Ubuntu)
- **Docker** (v20.10+) with NVIDIA Container Runtime configured
- **Minikube** (Using Docker driver)
- **NVIDIA GPU** (drivers installed)

## 1. Cluster Setup

Minikube must be started with the Docker driver to support GPU passthrough. The `podman` driver does not currently support the `--gpus` flag.

```bash
minikube start --driver=docker --gpus=all
```
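
If GPU passthrough fails, it is worth first checking that Docker itself can access the GPU outside of Minikube (a quick sanity check; the CUDA image tag below is only illustrative):

```bash
# Docker should be able to run a GPU container before Minikube is involved.
# Any CUDA base image that ships nvidia-smi works; this tag is an example.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```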

Verify GPU visibility in the node:
```bash
kubectl get node minikube -o jsonpath='{.status.allocatable}'
# Output should contain "nvidia.com/gpu": "1"
```
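
As an additional check, one can run a short-lived pod that requests a GPU and executes `nvidia-smi` (a minimal sketch; the pod name and image are illustrative):

```bash
# GPU smoke-test pod: it should reach the Succeeded phase and its logs
# should list the GPU reported by nvidia-smi.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Once the pod has finished, inspect its output:
kubectl logs gpu-smoke-test
```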

## 2. Install Volcano Scheduler

The standalone Training Operator manifests do not include a scheduler. For Gang Scheduling to work, you must install Volcano.

```bash
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
```

Verify that `PodGroup` CRDs are present:
```bash
kubectl get crds | grep podgroups.scheduling.volcano.sh
```
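
It can also help to confirm that the Volcano components themselves are healthy (the development installer typically deploys them into the `volcano-system` namespace):

```bash
# The scheduler, admission webhook, and controller manager should all be Running.
kubectl get pods -n volcano-system
```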

## 3. Install Training Operator with Gang Scheduling Enabled

By default, the standalone installation **does not** enable Gang Scheduling. You must enable it by passing the `--gang-scheduler-name` flag to the operator binary.

### Option A: Patching an Existing Installation
If you have already installed the operator via manifests, patch the deployment:

```bash
kubectl patch deployment training-operator -n kubeflow --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--gang-scheduler-name=volcano"]}]'
```
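
To confirm the patch took effect, one can inspect the container args on the deployment (a quick check, assuming the `training-operator` deployment name used above):

```bash
# Should print an args array containing --gang-scheduler-name=volcano
kubectl get deployment training-operator -n kubeflow \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```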

### Option B: Verification
Check logs to ensure the operator initialized the Volcano controller:

```bash
kubectl logs -n kubeflow -l control-plane=kubeflow-training-operator
```
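
Exact log messages vary between operator versions, so rather than matching specific text it may be easier to filter for scheduler-related lines:

```bash
# Case-insensitive filter, since the gang-scheduling log wording differs by version.
kubectl logs -n kubeflow -l control-plane=kubeflow-training-operator | grep -iE 'gang|volcano'
```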

## 4. Verifying Gang Scheduling (Deadlock Prevention)

To verify that gang scheduling is active, you can submit a job that requires resources exceeding the cluster capacity.

1. Create a `PyTorchJob` with `spec.schedulerName: volcano` (a sketch is shown after this list).
2. Ensure the `PodGroup` is created:
```bash
kubectl get podgroups
```
3. If resources are insufficient, the PodGroup should remain `Pending`, and **no pods should be created**. This prevents resource deadlocks where partial jobs consume cluster capacity.
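
For reference, here is a minimal sketch of such a `PyTorchJob` (assuming the V1 `kubeflow.org/v1` API this guide targets; the name, image, and replica/GPU counts are purely illustrative and deliberately exceed a single-GPU cluster so the gang-scheduling behaviour above is triggered):

```bash
# Illustrative PyTorchJob that over-requests GPUs so the Volcano PodGroup
# stays Pending and no pods are created (names and image are examples only).
kubectl apply -f - <<EOF
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: gang-demo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: volcano
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
            command: ["python", "-c", "print('hello')"]
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: volcano
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
            command: ["python", "-c", "print('hello')"]
            resources:
              limits:
                nvidia.com/gpu: 1
EOF
```

With one allocatable GPU and three requested, `kubectl get podgroups` should show the job's PodGroup stuck in `Pending` and `kubectl get pods` should show no pods created for it.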

Copilot AI Feb 17, 2026


This documentation appears to be for Kubeflow Training Operator V1, but this repository is for Kubeflow Trainer V2 (the new system). According to the README, V1 is maintained at the release-1.9 branch. The concepts referenced in this guide (the --gang-scheduler-name flag, manifests/overlays/standalone, and training-operator deployment) are from the V1 system and don't apply to the current V2 architecture. This documentation should either be: 1) Added to the release-1.9 branch instead, or 2) Completely rewritten to apply to Trainer V2 with its new architecture (TrainJob, TrainingRuntime APIs, and Volcano plugin configuration via RuntimePolicy.PodGroupPolicy.Volcano).

