[WIP] feat(api): KEP-3015: Workload Aware Scheduling for TrainJob #3219
andreyvelich wants to merge 6 commits into kubeflow:master
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: mm4tt, vsoch, dom4ha, VassilisVassiliadis, helayoty, wojtek-t, klueska. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Pull request overview
Adds an initial KEP/proposal document for integrating Kubernetes Workload API–based workload-aware (gang) scheduling into Kubeflow Trainer’s TrainJob via a new PodGroupPolicy plugin, as groundwork for future implementation.
Changes:
- Introduces a new KEP (3015) describing Workload/PodGroup orchestration for TrainJob scheduling.
- Documents the intended API surface (`podGroupPolicy.workload`) and controller/plugin responsibilities across the build, enforcement, and watch phases.
- Outlines lifecycle, RBAC, feature gates, and a test plan for the proposed integration.
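For orientation, a hedged configuration sketch of how the new source might be surfaced to users, assuming it sits next to the existing Coscheduling source on the training runtime's `podGroupPolicy` (runtime name and node count below are illustrative, not from this PR):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed        # illustrative runtime name
spec:
  mlPolicy:
    numNodes: 50                 # gang size is derived from this value
  podGroupPolicy:
    workload: {}                 # proposed Workload source (intentionally empty per this KEP)
  template: {}                   # JobSet template omitted for brevity
```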
| scheduling in Kubernetes, with integration planned for: | ||
|
|
||
| - **Job controller**: [KEP-5547](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling): Automatic Workload/PodGroup creation for parallel jobs | ||
| - **JobSet**: [KEP-969](https://github.com/kubernetes-sigs/jobset/pull/1068): Gang scheduling support for job groups |
The "KEP-969" label doesn't match the linked URL (which is a JobSet PR #1068), so the citation is internally inconsistent; rename the reference (e.g., "JobSet PR" / "design doc") or link to the correct KEP/issue number.
| - **JobSet**: [KEP-969](https://github.com/kubernetes-sigs/jobset/pull/1068): Gang scheduling support for job groups | |
| - **JobSet**: [JobSet PR #1068](https://github.com/kubernetes-sigs/jobset/pull/1068): Gang scheduling support for job groups |
KEP number is correct.
| When this plugin is enabled, the TrainJob controller will create the Workload with one PodGroup | ||
| for launcher+node. | ||
|
|
||
| ```yaml | ||
| apiVersion: scheduling.k8s.io/v1alpha2 | ||
| kind: Workload | ||
| metadata: | ||
| name: my-job | ||
| ownerReferences: | ||
| - apiVersion: trainer.kubeflow.org/v1alpha1 | ||
| kind: TrainJob | ||
| name: my-job | ||
| spec: | ||
| controllerRef: | ||
| apiVersion: trainer.kubeflow.org/v1alpha1 | ||
| kind: TrainJob | ||
| name: my-job | ||
| podGroupTemplates: | ||
| - name: trainer | ||
| schedulingPolicy: | ||
| gang: | ||
| minCount: 50 # Equal to trainJob.spec.trainer.numNodes | ||
| ``` |
In the MPI story, the Workload PodGroup is described as covering "launcher+node" pods, but minCount is set to trainJob.spec.trainer.numNodes (50) rather than total pods (typically launcher 1 + nodes 50 = 51 when runLauncherAsNode is false); align minCount with the actual number of pods in the gang as the existing PodGroup plugins do (sum of PodSets).
We only consider use-cases where `runLauncherAsNode: true`.
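For illustration, a minimal sketch (not the actual Trainer types) of deriving the gang size by summing PodSet counts, as the existing PodGroup plugins do; with `runLauncherAsNode: true` the launcher is one of the trainer nodes, so the sum collapses to `numNodes`:

```go
package main

import "fmt"

// PodSet is a simplified stand-in for the runtime framework's pod-set info;
// the names and fields here are illustrative, not the actual Trainer types.
type PodSet struct {
	Name  string
	Count int32
}

// gangMinCount sums replica counts across the PodSets that form the gang,
// which is how the existing PodGroup plugins size their groups.
func gangMinCount(podSets []PodSet) int32 {
	var total int32
	for _, ps := range podSets {
		total += ps.Count
	}
	return total
}

func main() {
	// 50 trainer nodes; the launcher runs as one of them (runLauncherAsNode: true).
	fmt.Println(gangMinCount([]PodSet{{Name: "node", Count: 50}})) // 50
}
```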
| // Workload plugin using native Kubernetes Workload API for gang-scheduling | ||
| // Requires Kubernetes v1.35+ with GenericWorkload and GangScheduling feature gates enabled. | ||
| Workload *WorkloadPodGroupPolicySource `json:"workload,omitempty"` | ||
| } | ||
|
|
||
| // WorkloadPodGroupPolicySource configures scheduling behavior using Kubernetes Workload API. | ||
| type WorkloadPodGroupPolicySource struct {} | ||
| ``` | ||
|
|
||
| The `WorkloadPodGroupPolicySource` struct is intentionally minimal for the initial implementation. | ||
| The `minCount` for gang scheduling is automatically derived from `mlPolicy.numNodes` or | ||
| `trainJob.spec.trainer.numNodes`. | ||
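A small sketch of the derivation order implied here (illustrative helper, not actual Trainer code): the TrainJob-level value wins when set, otherwise the runtime's `mlPolicy.numNodes` is used.

```go
// resolveNumNodes shows the derivation order described above: the
// trainJob.spec.trainer.numNodes override wins when set, otherwise the
// runtime's mlPolicy.numNodes is used. Illustrative only.
func resolveNumNodes(mlPolicyNumNodes, trainerNumNodes *int32) int32 {
	if trainerNumNodes != nil {
		return *trainerNumNodes
	}
	if mlPolicyNumNodes != nil {
		return *mlPolicyNumNodes
	}
	// Assumption: fall back to a single node when neither is specified.
	return 1
}
```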
|
|
||
| ### Workload Runtime Plugin | ||
|
|
||
| Similar to the Coscheduling and Volcano plugins, we implement a Workload plugin in | ||
| `pkg/runtime/framework/plugins/workload/workload.go`. This plugin implements the following interfaces | ||
| from the Pipeline Framework: | ||
|
|
||
| #### Build Phase | ||
|
|
||
| The plugin implements the `ComponentBuilder` interface to build the Workload object: | ||
|
|
||
| ```go | ||
| func (w *Workload) Build(ctx context.Context, info *runtime.Info, trainJob *trainv1alpha1.TrainJob) (client.Object, error) { | ||
| // 1. Extract numNodes from runtime info | ||
| // 2. Build Workload object with: | ||
| // - controllerRef pointing to TrainJob | ||
| // - podGroupTemplate with gang scheduling policy | ||
| // - minCount equal to numNodes | ||
| // 3. Build PodGroup objects. | ||
| // 4. Return Workload object for deployment | ||
| } | ||
| ``` | ||
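As a rough illustration of the build steps above, here is a hedged sketch that assembles a Workload with the same shape as the YAML example earlier in this proposal, using `unstructured` so it does not depend on the not-yet-vendored `scheduling.k8s.io/v1alpha2` Go types (field names follow that example and are not final; owner-reference wiring is omitted here):

```go
package workload

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// buildWorkload is an illustrative sketch only. It mirrors the YAML example in
// this KEP: one podGroupTemplate named "trainer" whose gang minCount equals the
// number of trainer nodes (runLauncherAsNode: true is assumed).
func buildWorkload(trainJobName, namespace string, numNodes int32) (*unstructured.Unstructured, error) {
	w := &unstructured.Unstructured{}
	w.SetAPIVersion("scheduling.k8s.io/v1alpha2")
	w.SetKind("Workload")
	w.SetName(trainJobName)
	w.SetNamespace(namespace)

	// controllerRef points back at the owning TrainJob.
	if err := unstructured.SetNestedMap(w.Object, map[string]interface{}{
		"apiVersion": "trainer.kubeflow.org/v1alpha1",
		"kind":       "TrainJob",
		"name":       trainJobName,
	}, "spec", "controllerRef"); err != nil {
		return nil, err
	}

	// A single PodGroup template carrying the gang scheduling policy.
	if err := unstructured.SetNestedSlice(w.Object, []interface{}{
		map[string]interface{}{
			"name": "trainer",
			"schedulingPolicy": map[string]interface{}{
				"gang": map[string]interface{}{
					"minCount": int64(numNodes), // unstructured values must be int64
				},
			},
		},
	}, "spec", "podGroupTemplates"); err != nil {
		return nil, err
	}
	return w, nil
}
```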
|
|
||
| #### EnforcePodGroupPolicy Phase | ||
|
|
||
| The plugin implements the `EnforcePodGroupPolicy` interface to configure the `schedulingGroup` field in Pod specs: | ||
|
|
||
| ```go | ||
| func (w *Workload) EnforcePodGroupPolicy(info *runtime.Info, trainJob *trainv1alpha1.TrainJob) error { | ||
| // 1. Set schedulingGroup.podGroupName in all Pod templates | ||
| } | ||
| ``` | ||
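A hedged illustration of that injection, operating on a pod template expressed as an unstructured map because the proposed `schedulingGroup` PodSpec field does not exist in released client-go types yet (the field path follows this proposal's wording):

```go
package workload

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// setPodGroupName stamps the proposed spec.schedulingGroup.podGroupName field
// onto a pod template expressed as a plain map. Illustrative only; the real
// plugin would mutate the pod templates held in runtime.Info.
func setPodGroupName(podTemplate map[string]interface{}, podGroupName string) error {
	return unstructured.SetNestedField(podTemplate, podGroupName,
		"spec", "schedulingGroup", "podGroupName")
}
```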
|
|
||
| #### WatchExtension Phase | ||
|
|
||
| The plugin implements `WatchExtension` to watch Workload resources and trigger TrainJob reconciliation: | ||
|
|
||
| ```go | ||
| func (w *Workload) ReconcilerBuilders() []runtime.ReconcilerBuilder { | ||
| // 1. Watch Workload and PodGroup resources owned by TrainJob | ||
| // 2. Trigger reconciliation on PodGroup status changes | ||
| } | ||
| ``` | ||
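A hedged sketch of the controller-runtime wiring this implies, assuming the plugin can register an `Owns()` watch on the Workload GVK (the real plugin would return this through the framework's ReconcilerBuilder hooks; PodGroups could be watched the same way):

```go
package workload

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/builder"
)

// watchWorkloads registers a watch on Workload objects owned by TrainJob, so
// Workload/PodGroup status changes enqueue the owning TrainJob for
// reconciliation. Illustrative only; the GVK follows the YAML example in this KEP.
func watchWorkloads(b *builder.Builder) *builder.Builder {
	workload := &unstructured.Unstructured{}
	workload.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "scheduling.k8s.io",
		Version: "v1alpha2",
		Kind:    "Workload",
	})
	return b.Owns(workload)
}
```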
|
|
||
| ### Resource Lifecycle | ||
|
|
||
| 1. **Creation**: When a TrainJob is created with `podGroupPolicy.workload` configured, the Workload | ||
| plugin creates the Workload and PodGroup objects with `ownerReferences` pointing to the TrainJob. | ||
|
|
||
| 1. **Pod Association**: The plugin injects `schedulingGroup.podGroupName` into Pod specs, linking Pods to their PodGroup. | ||
|
|
||
| 1. **Scheduling**: The kube-scheduler uses the Workload Scheduling Cycle to process entire PodGroups | ||
| atomically, ensuring all Pods in a gang are scheduled together. | ||
|
|
||
| 1. **Suspension**: When the TrainJob is suspended, Workload and PodGroup resources are preserved. | ||
| In the future, we can re-create to ensure elastic TrainJob support. | ||
|
|
||
| 1. **Deletion**: When the TrainJob is deleted, Kubernetes garbage collection automatically cleans up the Workload object (and subsequently the PodGroup). | ||
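The creation and deletion steps above hinge on owner references; a minimal sketch of how the plugin could set them so Kubernetes garbage collection cleans up the Workload with the TrainJob (standard controller-runtime helper, shown here only for illustration):

```go
package workload

import (
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// setOwner marks the TrainJob as the controlling owner of the Workload object,
// so deleting the TrainJob garbage-collects the Workload (and its PodGroups).
func setOwner(trainJob, workload client.Object, scheme *runtime.Scheme) error {
	return controllerutil.SetControllerReference(trainJob, workload, scheme)
}
```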
|
|
||
| The TrainJob controller requires permissions to manage Workload resources: | ||
|
|
||
| ```go | ||
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=workloads,verbs=get;list;watch;create;update;patch | ||
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=workloads/status,verbs=get | ||
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch;list;watch | ||
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups/status,verbs=get | ||
| ``` | ||
|
|
||
| ### Feature Gate Dependencies | ||
|
|
||
| The Workload API requires the following Kubernetes feature gates to be enabled: | ||
|
|
||
| - `GenericWorkload`: Enables the Workload and PodGroup APIs | ||
|
|
||
| The same feature gate is required to be enabled in TrainJob config: |
This section says the Workload plugin requires both GenericWorkload and GangScheduling feature gates, but the later "Feature Gate Dependencies" section only lists GenericWorkload; please make the feature-gate requirements consistent across the doc.
Typo: "in-consistency" should be "inconsistency".
| We can integrate the Workload spec in the TrainJob API directly. That might introduce in-consistency | |
| We can integrate the Workload spec in the TrainJob API directly. That might introduce inconsistency |
The RBAC example for podgroups repeats list;watch twice in the verbs string; remove the duplicate verbs to avoid propagating an invalid/unclear RBAC snippet.
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch;list;watch | |
| // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch |
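On the feature-gate point above: a hedged example of enabling both gates on a local kind test cluster (the gate names are taken from this KEP and are Kubernetes gates, not Trainer ones; the alpha API group may additionally need to be enabled via the apiserver's runtime-config):

```yaml
# kind cluster configuration sketch for local testing of the Workload plugin.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  GenericWorkload: true   # enables the Workload and PodGroup APIs
  GangScheduling: true    # enables the workload scheduling cycle in kube-scheduler
```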
Pull Request Test Coverage Report for Build 22145283120
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
I'm a little caught up on this discussion: kubernetes-sigs/jobset#1068 (comment). If the Workload is intended to be integrated with JobSet, are there ever cases of creating a TrainJob without a backing JobSet? If not, should we not expose Workload (and gang scheduling) through JobSet instead of adding it separately via PodGroupPolicy? What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?
It's a hard discussion to follow. The general stance now is that the API that a user creates (sometimes known as the "true workload") should probably manage the PodGroups / Workloads. For Trainer we could maybe consider delegating this to JobSet for JobSet-based TrainingRuntimes, but that doesn't solve the problem for Flux or whatever other backend we have.
The main outcome would be that if there exists a workloadRef or PodGroupRef on the pod template, we are to assume that the WAS scheduling is "handled". But we haven't updated the KEP for JobSet. That was at least what we were discussing across all the features.
The FluxPolicy implementation is just tweaking a JobSet, so I think if support were added to that underlying abstraction it would be inherited by Flux. I want to make sure we are careful to (essentially) not implement this layer twice, but rather just once at the most appropriate level (maybe JobSet?).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Yes, @kannon92 is right – we’ve had several discussions about this over the past few months. In general, higher-level controllers should manage Workload and PodGroup objects, since they have the most complete understanding of the AI application they are orchestrating. For example:
For now, I suggest that we add validation in the TrainJob controller to ensure users do not configure the Workload API directly in JobSet or Job objects.
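A minimal sketch of that validation idea (illustrative only; the field path follows this KEP's `schedulingGroup` wording, and the pod template is treated as an unstructured map):

```go
package webhooks

import (
	"errors"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// validateNoDirectWorkloadConfig rejects pod templates that already set the
// proposed schedulingGroup field, keeping the Workload plugin as the single
// owner of gang configuration for TrainJob.
func validateNoDirectWorkloadConfig(podTemplate map[string]interface{}) error {
	if _, found, err := unstructured.NestedMap(podTemplate, "spec", "schedulingGroup"); err != nil {
		return err
	} else if found {
		return errors.New("schedulingGroup must not be set directly; it is managed by the Workload plugin")
	}
	return nil
}
```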
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Related: #3015
Ref issues: kubernetes/enhancements#4671, kubernetes/kubernetes#132192.
This is an initial KEP to support WAS in Kubeflow Trainer and TrainJob. We need to discuss whether we want to allow users to set the Workload spec in TrainJob as well.
The implementation should be started after Kubernetes v1.36.
Project to track WAS in Kubernetes: Workload-aware & Topology-aware Workstream
/assign @tenzen-y @astefanutti @akshaychitneni @robert-bell @kubeflow/kubeflow-trainer-team
/cc @kannon92 @helayoty @wojtek-t @klueska @mm4tt @vsoch @dom4ha @VassilisVassiliadis