
docs: Add guide for local GPU and Gang Scheduling setup #3217

Open
rishikesh737 wants to merge 1 commit into kubeflow:master from rishikesh737:docs/gpu-gang-scheduling-guide

Conversation

@rishikesh737

Motivation

Currently, the developer guide does not cover how to set up a local environment with GPU passthrough (Minikube + NVIDIA) or how to enable Gang Scheduling (Volcano) for the standalone operator.

The standalone installation defaults to default-scheduler because the --gang-scheduler-name flag is missing from the deployment args. This leads to resource deadlocks (split-brain scheduling) during local testing of distributed training jobs when resources are limited.

Changes

  • Added docs/development/local-gpu-gang-scheduling.md detailing:
    • Correct Minikube driver setup for NVIDIA GPUs.
    • Volcano installation steps.
    • The required kubectl patch command to enable the --gang-scheduler-name=volcano flag.
    • Verification steps using PodGroups to prove deadlock prevention.
  • Linked new guide in docs/development/developer_guide.md.

Verification

I have verified these steps on a local Fedora environment with an NVIDIA T1000.

  • Confirmed GPU passthrough works with the Docker driver.
  • Confirmed Gang Scheduling prevents partial pod creation when resources are constrained (Deadlock prevention verified via kubectl get podgroups).

Copilot AI review requested due to automatic review settings February 17, 2026 14:17
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot requested a review from jinchihe February 17, 2026 14:17
@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Signed-off-by: Rishikesh Pednekar <rishikeshpednekar34@gmail.com>
@rishikesh737 rishikesh737 force-pushed the docs/gpu-gang-scheduling-guide branch from 0a6389f to 1c6fdb8 on February 17, 2026 14:20

Copilot AI left a comment


Pull request overview

This PR attempts to add documentation for setting up a local development environment with GPU passthrough and Gang Scheduling (Volcano). However, the documentation is written for Kubeflow Training Operator V1, while this repository maintains Kubeflow Trainer V2 (the new architecture).

Changes:

  • Added docs/development/local-gpu-gang-scheduling.md with V1-specific setup instructions
  • Added a link to the new guide in docs/development/developer_guide.md (under the Code Style section)

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docs/development/local-gpu-gang-scheduling.md | New guide for GPU and Gang Scheduling setup (V1-specific, incompatible with current V2 codebase) |
| docs/development/developer_guide.md | Minor edits including duplicate Docker entry, typo fix needed, and link to new guide in unusual location |

## Requirements

- [Go](https://golang.org/) (1.22 or later)
- [Docker](https://docs.docker.com/)

Copilot AI Feb 17, 2026


Duplicate Docker entry. Line 8 lists Docker without a version, and line 9 specifies Docker (20.10 or later). Remove the duplicate on line 8.

Suggested change
- [Docker](https://docs.docker.com/)

```sh
cd ./manifests/overlays/standalone
kustomize edit set image my-username/training-operator=my-username/training-operator:my-pr-01
```
Update the `newTag` key in `./manifests/overlayes/standalone/kustimization.yaml` with the new image.

Copilot AI Feb 17, 2026


Typos in the path: "overlayes" should be "overlays" and "kustimization.yaml" should be "kustomization.yaml".

Suggested change
Update the `newTag` key in `./manifests/overlayes/standalone/kustimization.yaml` with the new image.
Update the `newTag` key in `./manifests/overlays/standalone/kustomization.yaml` with the new image.

```sh
black --check --exclude '/*kubeflow_org_v1*|__init__.py|api_client.py|configuration.py|exceptions.py|rest.py' sdk/
```


Copilot AI Feb 17, 2026


The link to the new guide is placed within a bullet point under the "Code Style" section. This placement is inconsistent with typical documentation structure. Consider either: 1) Creating a new top-level section like "## Additional Guides" or "## See Also" at the end of the document, or 2) Adding it to an existing relevant section like after "Running the Operator Locally" or "Testing changes locally".

Suggested change
## Additional Guides

Comment on lines +1 to +67
# Local Development with GPU and Gang Scheduling

This guide explains how to set up a local development environment for the Training Operator that supports both GPU passthrough and Gang Scheduling (Volcano).

## Prerequisites

- **Linux Environment** (Fedora/Ubuntu)
- **Docker** (v20.10+) with NVIDIA Container Runtime configured
- **Minikube** (Using Docker driver)
- **NVIDIA GPU** (drivers installed)

## 1. Cluster Setup

Minikube must be started with the Docker driver to support GPU passthrough. The `podman` driver does not currently support the `--gpus` flag.

```bash
minikube start --driver=docker --gpus=all
```
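
If GPU passthrough fails, it is worth first checking that Docker itself can access the GPU outside of Minikube (a quick sanity check; the CUDA image tag below is only illustrative):

```bash
# Docker should be able to run a GPU container before Minikube is involved.
# Any CUDA base image that ships nvidia-smi works; this tag is an example.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```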

Verify GPU visibility in the node:
```bash
kubectl get node minikube -o jsonpath='{.status.allocatable}'
# Output should contain "nvidia.com/gpu": "1"
```
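
As an additional check, one can run a short-lived pod that requests a GPU and executes `nvidia-smi` (a minimal sketch; the pod name and image are illustrative):

```bash
# GPU smoke-test pod: it should reach the Succeeded phase and its logs
# should list the GPU reported by nvidia-smi.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Once the pod has finished, inspect its output:
kubectl logs gpu-smoke-test
```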

## 2. Install Volcano Scheduler

The standalone Training Operator manifests do not include a scheduler. For Gang Scheduling to work, you must install Volcano.

```bash
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
```

Verify that `PodGroup` CRDs are present:
```bash
kubectl get crds | grep podgroups.scheduling.volcano.sh
```
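
It can also help to confirm that the Volcano components themselves are healthy (the development installer typically deploys them into the `volcano-system` namespace):

```bash
# The scheduler, admission webhook, and controller manager should all be Running.
kubectl get pods -n volcano-system
```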

## 3. Install Training Operator with Gang Scheduling Enabled

By default, the standalone installation **does not** enable Gang Scheduling. You must enable it by passing the `--gang-scheduler-name` flag to the operator binary.

### Option A: Patching an Existing Installation
If you have already installed the operator via manifests, patch the deployment:

```bash
kubectl patch deployment training-operator -n kubeflow --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--gang-scheduler-name=volcano"]}]'
```
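
To confirm the patch took effect, one can inspect the container args on the deployment (a quick check, assuming the `training-operator` deployment name used above):

```bash
# Should print an args array containing --gang-scheduler-name=volcano
kubectl get deployment training-operator -n kubeflow \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```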

### Option B: Verification
Check logs to ensure the operator initialized the Volcano controller:

```bash
kubectl logs -n kubeflow -l control-plane=kubeflow-training-operator
```
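
Exact log messages vary between operator versions, so rather than matching specific text it may be easier to filter for scheduler-related lines:

```bash
# Case-insensitive filter, since the gang-scheduling log wording differs by version.
kubectl logs -n kubeflow -l control-plane=kubeflow-training-operator | grep -iE 'gang|volcano'
```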

## 4. Verifying Gang Scheduling (Deadlock Prevention)

To verify that gang scheduling is active, you can submit a job that requires resources exceeding the cluster capacity.

1. Create a `PyTorchJob` with `spec.schedulerName: volcano` (a sketch is shown after this list).
2. Ensure the `PodGroup` is created:
```bash
kubectl get podgroups
```
3. If resources are insufficient, the PodGroup should remain `Pending`, and **no pods should be created**. This prevents resource deadlocks where partial jobs consume cluster capacity.
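
For reference, here is a minimal sketch of such a `PyTorchJob` (assuming the V1 `kubeflow.org/v1` API this guide targets; the name, image, and replica/GPU counts are purely illustrative and deliberately exceed a single-GPU cluster so the gang-scheduling behaviour above is triggered):

```bash
# Illustrative PyTorchJob that over-requests GPUs so the Volcano PodGroup
# stays Pending and no pods are created (names and image are examples only).
kubectl apply -f - <<EOF
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: gang-demo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: volcano
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
            command: ["python", "-c", "print('hello')"]
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: volcano
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
            command: ["python", "-c", "print('hello')"]
            resources:
              limits:
                nvidia.com/gpu: 1
EOF
```

With one allocatable GPU and three requested, `kubectl get podgroups` should show the job's PodGroup stuck in `Pending` and `kubectl get pods` should show no pods created for it.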

Copilot AI Feb 17, 2026


This documentation appears to be for Kubeflow Training Operator V1, but this repository is for Kubeflow Trainer V2 (the new system). According to the README, V1 is maintained at the release-1.9 branch. The concepts referenced in this guide (the --gang-scheduler-name flag, manifests/overlays/standalone, and training-operator deployment) are from the V1 system and don't apply to the current V2 architecture. This documentation should either be: 1) Added to the release-1.9 branch instead, or 2) Completely rewritten to apply to Trainer V2 with its new architecture (TrainJob, TrainingRuntime APIs, and Volcano plugin configuration via RuntimePolicy.PodGroupPolicy.Volcano).

