[WIP] feat(api): KEP-3015: Workload Aware Scheduling for TrainJob #3219
andreyvelich wants to merge 6 commits into kubeflow:master
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: mm4tt, vsoch, dom4ha, VassilisVassiliadis, helayoty, wojtek-t, klueska. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Pull request overview
Adds an initial KEP/proposal document for integrating Kubernetes Workload API–based workload-aware (gang) scheduling into Kubeflow Trainer’s TrainJob via a new PodGroupPolicy plugin, as groundwork for future implementation.
Changes:
- Introduces a new KEP (3015) describing Workload/PodGroup orchestration for TrainJob scheduling.
- Documents the intended API surface (`podGroupPolicy.workload`) and controller/plugin responsibilities across the build, enforcement, and watch phases.
- Outlines lifecycle, RBAC, feature gates, and a test plan for the proposed integration.
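For orientation, a hedged configuration sketch of how the new source might be surfaced to users, assuming it sits next to the existing Coscheduling source on the training runtime's `podGroupPolicy` (runtime name and node count below are illustrative, not from this PR):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed        # illustrative runtime name
spec:
  mlPolicy:
    numNodes: 50                 # gang size is derived from this value
  podGroupPolicy:
    workload: {}                 # proposed Workload source (intentionally empty per this KEP)
  template: {}                   # JobSet template omitted for brevity
```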
| scheduling in Kubernetes, with integration planned for: | ||
|
|
||
| - **Job controller**: [KEP-5547](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling): Automatic Workload/PodGroup creation for parallel jobs | ||
| - **JobSet**: [KEP-969](https://github.com/kubernetes-sigs/jobset/pull/1068): Gang scheduling support for job groups |
The "KEP-969" label doesn't match the linked URL (which is a JobSet PR #1068), so the citation is internally inconsistent; rename the reference (e.g., "JobSet PR" / "design doc") or link to the correct KEP/issue number.
| - **JobSet**: [KEP-969](https://github.com/kubernetes-sigs/jobset/pull/1068): Gang scheduling support for job groups | |
| - **JobSet**: [JobSet PR #1068](https://github.com/kubernetes-sigs/jobset/pull/1068): Gang scheduling support for job groups |
KEP number is correct.
| When this plugin is enabled, the TrainJob controller will create the Workload with one PodGroup | ||
| for launcher+node. | ||
|
|
||
| ```yaml | ||
| apiVersion: scheduling.k8s.io/v1alpha2 | ||
| kind: Workload | ||
| metadata: | ||
| name: my-job | ||
| ownerReferences: | ||
| - apiVersion: trainer.kubeflow.org/v1alpha1 | ||
| kind: TrainJob | ||
| name: my-job | ||
| spec: | ||
| controllerRef: | ||
| apiVersion: trainer.kubeflow.org/v1alpha1 | ||
| kind: TrainJob | ||
| name: my-job | ||
| podGroupTemplates: | ||
| - name: trainer | ||
| schedulingPolicy: | ||
| gang: | ||
| minCount: 50 # Equal to trainJob.spec.trainer.numNodes | ||
| ``` |
In the MPI story, the Workload PodGroup is described as covering "launcher+node" pods, but minCount is set to trainJob.spec.trainer.numNodes (50) rather than total pods (typically launcher 1 + nodes 50 = 51 when runLauncherAsNode is false); align minCount with the actual number of pods in the gang as the existing PodGroup plugins do (sum of PodSets).
We only consider use-cases where `runLauncherAsNode: true`.
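For illustration, a minimal sketch (not the actual Trainer types) of deriving the gang size by summing PodSet counts, as the existing PodGroup plugins do; with `runLauncherAsNode: true` the launcher is one of the trainer nodes, so the sum collapses to `numNodes`:

```go
package main

import "fmt"

// PodSet is a simplified stand-in for the runtime framework's pod-set info;
// the names and fields here are illustrative, not the actual Trainer types.
type PodSet struct {
	Name  string
	Count int32
}

// gangMinCount sums replica counts across the PodSets that form the gang,
// which is how the existing PodGroup plugins size their groups.
func gangMinCount(podSets []PodSet) int32 {
	var total int32
	for _, ps := range podSets {
		total += ps.Count
	}
	return total
}

func main() {
	// 50 trainer nodes; the launcher runs as one of them (runLauncherAsNode: true).
	fmt.Println(gangMinCount([]PodSet{{Name: "node", Count: 50}})) // 50
}
```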
| // Workload plugin using native Kubernetes Workload API for gang-scheduling | ||
| // Requires Kubernetes v1.35+ with GenericWorkload and GangScheduling feature gates enabled. | ||
| Workload *WorkloadPodGroupPolicySource `json:"workload,omitempty"` | ||
| } | ||
|
|
||
| // WorkloadPodGroupPolicySource configures scheduling behavior using Kubernetes Workload API. | ||
| type WorkloadPodGroupPolicySource struct {} | ||
| ``` | ||
|
|
||
| The `WorkloadPodGroupPolicySource` struct is intentionally minimal for the initial implementation. | ||
| The `minCount` for gang scheduling is automatically derived from `mlPolicy.numNodes` or | ||
| `trainJob.spec.trainer.numNodes`. | ||
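A small sketch of the derivation order implied here (illustrative helper, not actual Trainer code): the TrainJob-level value wins when set, otherwise the runtime's `mlPolicy.numNodes` is used.

```go
// resolveNumNodes shows the derivation order described above: the
// trainJob.spec.trainer.numNodes override wins when set, otherwise the
// runtime's mlPolicy.numNodes is used. Illustrative only.
func resolveNumNodes(mlPolicyNumNodes, trainerNumNodes *int32) int32 {
	if trainerNumNodes != nil {
		return *trainerNumNodes
	}
	if mlPolicyNumNodes != nil {
		return *mlPolicyNumNodes
	}
	// Assumption: fall back to a single node when neither is specified.
	return 1
}
```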
|
|
||
| ### Workload Runtime Plugin | ||
|
|
||
| Similar to the Coscheduling and Volcano plugins, we implement a Workload plugin in | ||
| `pkg/runtime/framework/plugins/workload/workload.go`. This plugin implements the following interfaces | ||
| from the Pipeline Framework: | ||
|
|
||
| #### Build Phase | ||
|
|
||
| The plugin implements the `ComponentBuilder` interface to build the Workload object: | ||
|
|
||
| ```go | ||
| func (w *Workload) Build(ctx context.Context, info *runtime.Info, trainJob *trainv1alpha1.TrainJob) (client.Object, error) { | ||
| // 1. Extract numNodes from runtime info | ||
| // 2. Build Workload object with: | ||
| // - controllerRef pointing to TrainJob | ||
| // - podGroupTemplate with gang scheduling policy | ||
| // - minCount equal to numNodes | ||
| // 3. Build PodGroup objects. | ||
| // 4. Return Workload object for deployment | ||
| } | ||
| ``` | ||
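As a rough illustration of the build steps above, here is a hedged sketch that assembles a Workload with the same shape as the YAML example earlier in this proposal, using `unstructured` so it does not depend on the not-yet-vendored `scheduling.k8s.io/v1alpha2` Go types (field names follow that example and are not final; owner-reference wiring is omitted here):

```go
package workload

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// buildWorkload is an illustrative sketch only. It mirrors the YAML example in
// this KEP: one podGroupTemplate named "trainer" whose gang minCount equals the
// number of trainer nodes (runLauncherAsNode: true is assumed).
func buildWorkload(trainJobName, namespace string, numNodes int32) (*unstructured.Unstructured, error) {
	w := &unstructured.Unstructured{}
	w.SetAPIVersion("scheduling.k8s.io/v1alpha2")
	w.SetKind("Workload")
	w.SetName(trainJobName)
	w.SetNamespace(namespace)

	// controllerRef points back at the owning TrainJob.
	if err := unstructured.SetNestedMap(w.Object, map[string]interface{}{
		"apiVersion": "trainer.kubeflow.org/v1alpha1",
		"kind":       "TrainJob",
		"name":       trainJobName,
	}, "spec", "controllerRef"); err != nil {
		return nil, err
	}

	// A single PodGroup template carrying the gang scheduling policy.
	if err := unstructured.SetNestedSlice(w.Object, []interface{}{
		map[string]interface{}{
			"name": "trainer",
			"schedulingPolicy": map[string]interface{}{
				"gang": map[string]interface{}{
					"minCount": int64(numNodes), // unstructured values must be int64
				},
			},
		},
	}, "spec", "podGroupTemplates"); err != nil {
		return nil, err
	}
	return w, nil
}
```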
|
|
||
| #### EnforcePodGroupPolicy Phase | ||
|
|
||
| The plugin implements the `EnforcePodGroupPolicy` interface to configure the `schedulingGroup` field in Pod specs: | ||
|
|
||
| ```go | ||
| func (w *Workload) EnforcePodGroupPolicy(info *runtime.Info, trainJob *trainv1alpha1.TrainJob) error { | ||
| // 1. Set schedulingGroup.podGroupName in all Pod templates | ||
| } | ||
| ``` | ||
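A hedged illustration of that injection, operating on a pod template expressed as an unstructured map because the proposed `schedulingGroup` PodSpec field does not exist in released client-go types yet (the field path follows this proposal's wording):

```go
package workload

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// setPodGroupName stamps the proposed spec.schedulingGroup.podGroupName field
// onto a pod template expressed as a plain map. Illustrative only; the real
// plugin would mutate the pod templates held in runtime.Info.
func setPodGroupName(podTemplate map[string]interface{}, podGroupName string) error {
	return unstructured.SetNestedField(podTemplate, podGroupName,
		"spec", "schedulingGroup", "podGroupName")
}
```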
|
|
||
| #### WatchExtension Phase | ||
|
|
||
| The plugin implements `WatchExtension` to watch Workload resources and trigger TrainJob reconciliation: | ||
|
|
||
| ```go | ||
| func (w *Workload) ReconcilerBuilders() []runtime.ReconcilerBuilder { | ||
| // 1. Watch Workload and PodGroup resources owned by TrainJob | ||
| // 2. Trigger reconciliation on PodGroup status changes | ||
| } | ||
| ``` | ||
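A hedged sketch of the controller-runtime wiring this implies, assuming the plugin can register an `Owns()` watch on the Workload GVK (the real plugin would return this through the framework's ReconcilerBuilder hooks; PodGroups could be watched the same way):

```go
package workload

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/builder"
)

// watchWorkloads registers a watch on Workload objects owned by TrainJob, so
// Workload/PodGroup status changes enqueue the owning TrainJob for
// reconciliation. Illustrative only; the GVK follows the YAML example in this KEP.
func watchWorkloads(b *builder.Builder) *builder.Builder {
	workload := &unstructured.Unstructured{}
	workload.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "scheduling.k8s.io",
		Version: "v1alpha2",
		Kind:    "Workload",
	})
	return b.Owns(workload)
}
```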
|
|
||
| ### Resource Lifecycle | ||
|
|
||
| 1. **Creation**: When a TrainJob is created with `podGroupPolicy.workload` configured, the Workload | ||
| plugin creates the Workload and PodGroup objects with `ownerReferences` pointing to the TrainJob. | ||
|
|
||
| 1. **Pod Association**: The plugin injects `schedulingGroup.podGroupName` into Pod specs, linking Pods to their PodGroup. | ||
|
|
||
| 1. **Scheduling**: The kube-scheduler uses the Workload Scheduling Cycle to process entire PodGroups | ||
| atomically, ensuring all Pods in a gang are scheduled together. | ||
|
|
||
| 1. **Suspension**: When the TrainJob is suspended, Workload and PodGroup resources are preserved. | ||
| In the future, we can re-create to ensure elastic TrainJob support. | ||
|
|
||
| 1. **Deletion**: When the TrainJob is deleted, Kubernetes garbage collection automatically cleans up the Workload object (and subsequently the PodGroup). | ||
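The creation and deletion steps above hinge on owner references; a minimal sketch of how the plugin could set them so Kubernetes garbage collection cleans up the Workload with the TrainJob (standard controller-runtime helper, shown here only for illustration):

```go
package workload

import (
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// setOwner marks the TrainJob as the controlling owner of the Workload object,
// so deleting the TrainJob garbage-collects the Workload (and its PodGroups).
func setOwner(trainJob, workload client.Object, scheme *runtime.Scheme) error {
	return controllerutil.SetControllerReference(trainJob, workload, scheme)
}
```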
|
|
||
| The TrainJob controller requires permissions to manage Workload resources: | ||
|
|
||
| ```go | ||
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=workloads,verbs=get;list;watch;create;update;patch | ||
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=workloads/status,verbs=get | ||
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch;list;watch | ||
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups/status,verbs=get | ||
| ``` | ||
|
|
||
| ### Feature Gate Dependencies | ||
|
|
||
| The Workload API requires the following Kubernetes feature gates to be enabled: | ||
|
|
||
| - `GenericWorkload`: Enables the Workload and PodGroup APIs | ||
|
|
||
| The same feature gate is required to be enabled in TrainJob config: |
This section says the Workload plugin requires both GenericWorkload and GangScheduling feature gates, but the later "Feature Gate Dependencies" section only lists GenericWorkload; please make the feature-gate requirements consistent across the doc.
Typo: "in-consistency" should be "inconsistency".
| We can integrate the Workload spec in the TrainJob API directly. That might introduce in-consistency | |
| We can integrate the Workload spec in the TrainJob API directly. That might introduce inconsistency |
The RBAC example for podgroups repeats list;watch twice in the verbs string; remove the duplicate verbs to avoid propagating an invalid/unclear RBAC snippet.
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch;list;watch | |
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch |
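On the feature-gate point above: a hedged example of enabling both gates on a local kind test cluster (the gate names are taken from this KEP and are Kubernetes gates, not Trainer ones; the alpha API group may additionally need to be enabled via the apiserver's runtime-config):

```yaml
# kind cluster configuration sketch for local testing of the Workload plugin.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  GenericWorkload: true   # enables the Workload and PodGroup APIs
  GangScheduling: true    # enables the workload scheduling cycle in kube-scheduler
```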
Pull Request Test Coverage Report for Build 22145283120
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
I'm a little caught up on this discussion: kubernetes-sigs/jobset#1068 (comment). If the Workload is intended to be integrated with JobSet, are there ever cases of creating a TrainJob without a backing JobSet? If not, should we not expose Workload (and gang scheduling) through JobSet instead of adding it separately via PodGroupPolicy? What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?
It's a hard discussion to follow. The general stance now is that the API that a user creates (sometimes known as the "true workload") should probably manage the PodGroups / Workloads. For Trainer we could maybe consider delegating this to JobSet for JobSet-based TrainingRuntimes, but that doesn't solve the problem for Flux or whatever other backend we have.
The main outcome would be that if there exists a workloadRef or PodGroupRef on the pod template, we are to assume that the WAS scheduling is "handled". But we haven't updated the KEP for JobSet. That was at least what we were discussing across all the features.
The FluxPolicy implementation is just tweaking a JobSet, so I think if support were added to that underlying abstraction it would be inherited by Flux. I want to make sure we are careful to (essentially) not implement this layer twice, but rather just once at the most appropriate level (maybe JobSet?).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Yes, @kannon92 is right – we’ve had several discussions about this over the past few months. In general, higher-level controllers should manage Workload and PodGroup objects, since they have the most complete understanding of the AI application they are orchestrating. For example:
For now, I suggest that we add validation in the TrainJob controller to ensure users do not configure the Workload API directly in JobSet or Job objects.
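A minimal sketch of that validation idea (illustrative only; the field path follows this KEP's `schedulingGroup` wording, and the pod template is treated as an unstructured map):

```go
package webhooks

import (
	"errors"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// validateNoDirectWorkloadConfig rejects pod templates that already set the
// proposed schedulingGroup field, keeping the Workload plugin as the single
// owner of gang configuration for TrainJob.
func validateNoDirectWorkloadConfig(podTemplate map[string]interface{}) error {
	if _, found, err := unstructured.NestedMap(podTemplate, "spec", "schedulingGroup"); err != nil {
		return err
	} else if found {
		return errors.New("schedulingGroup must not be set directly; it is managed by the Workload plugin")
	}
	return nil
}
```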
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Related: #3015
Ref issues: kubernetes/enhancements#4671, kubernetes/kubernetes#132192.
This is an initial KEP to support WAS in Kubeflow Trainer and TrainJob. We need to discuss whether we want to allow users to set the Workload spec in TrainJob as well.
The implementation should be started after Kubernetes v1.36.
Project to track WAS in Kubernetes: Workload-aware & Topology-aware Workstream
/assign @tenzen-y @astefanutti @akshaychitneni @robert-bell @kubeflow/kubeflow-trainer-team
/cc @kannon92 @helayoty @wojtek-t @klueska @mm4tt @vsoch @dom4ha @VassilisVassiliadis