[WIP] feat(api): Replace PodTemplateOverrides with TemplateOverrides#3199
[WIP] feat(api): Replace PodTemplateOverrides with TemplateOverrides#3199andreyvelich wants to merge 6 commits intokubeflow:masterfrom
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR introduces a breaking API change to replace PodTemplateOverrides with manager-scoped TemplateOverrides, aiming to group override ownership boundaries more clearly across controllers/users.
Changes:
- Replaces
TrainJobSpec.PodTemplateOverrideswithTrainJobSpec.TemplateOverrideskeyed bymanager. - Introduces new API types for
TemplateOverride, including job-level and pod-level override histories. - Updates the v2 proposal/KEP documentation to describe the new API shape and examples.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pkg/apis/trainer/v1alpha1/trainjob_types.go | Updates TrainJob API types to add manager-keyed TemplateOverrides and new override structs. |
| docs/proposals/2170-kubeflow-trainer-v2/README.md | Updates the proposal to document TemplateOverrides, including rationale and YAML examples. |
| // JobTemplateOverride represents a custom override that will be applied to the JobTemplateSpec | ||
| type JobTemplateOverride struct { | ||
| // Time is the timestamp of when the JobTemplateOverride entry was added. | ||
| // +required | ||
| Time metav1.Time `json:"time,omitempty"` | ||
|
|
There was a problem hiding this comment.
In the docs code snippet, Time is marked as +required and uses a non-pointer metav1.Time, which doesn’t match the actual API types in pkg/apis/trainer/v1alpha1/trainjob_types.go where Time is optional (*metav1.Time).
| // +listType=map | ||
| // +listMapKey=time | ||
| JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"` | ||
|
|
||
| // podTemplateOverrides defines overrides that applied to PodTemplateSpec | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides" | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted" | ||
| // +listType=map | ||
| // +listMapKey=time |
There was a problem hiding this comment.
JobTemplateOverrides/PodTemplateOverrides are declared as +listType=map with +listMapKey=time, but the keyed field Time is optional (*metav1.Time), which makes the map key potentially unset and breaks map-list semantics (unique/stable keys for merge/validation).
| // +listType=map | |
| // +listMapKey=time | |
| JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"` | |
| // podTemplateOverrides defines overrides that applied to PodTemplateSpec | |
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides" | |
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted" | |
| // +listType=map | |
| // +listMapKey=time | |
| JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"` | |
| // podTemplateOverrides defines overrides that applied to PodTemplateSpec | |
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides" | |
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted" |
| TargetJobs []TemplateOverrideTargetJob `json:"targetJobs,omitempty"` | ||
|
|
||
| // metadata overrides the Job template metadata or JobSet metadata. | ||
| // If targetJobs is specified, these values are merged with the specific ReplicatedJob's Job template metadata. | ||
| // If targetJobs is empty, these values are merged with the JobSet object metadata. | ||
| // +optional | ||
| Metadata *metav1.ObjectMeta `json:"metadata,omitempty"` |
There was a problem hiding this comment.
I’ve been thinking that we could start using JobTemplateOverride instead of the dedicated Labels and Annotations fields we currently expose in the TrainJob.spec API.
The idea would be:
- If targetJob is omitted, the override is applied to the JobSet
- If targetJob is set, the override is applied to the specific Job
One concern is that once we introduce JobTemplateSpecOverride, it could potentially contain fields relevant to both Job and JobSet, which may introduce ambiguity. I’m not entirely sure what the better way to handle that would be, though I also don’t see a clearly better alternative at the moment.
@tenzen-y @kaisoz @mimowo @astefanutti @kannon92 , I’d really appreciate your thoughts on this approach.
| // Time is the timestamp of when the JobTemplateOverride entry was added. If value is omitted, | ||
| // controller defaults this value to the current timestamp. | ||
| // +optional | ||
| Time *metav1.Time `json:"time,omitempty"` |
There was a problem hiding this comment.
Time will be set server-side by Trainer admission mutating webhook when TrainJob is created/updated.
| JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"` | ||
|
|
||
| // podTemplateOverrides defines overrides that applied to PodTemplateSpec | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides" | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted" | ||
| // +listType=map | ||
| // +listMapKey=time | ||
| PodTemplateOverrides []PodTemplateOverride `json:"pod,omitempty"` |
There was a problem hiding this comment.
Do you prefer pod and job or podTemplateOverrides and jobTemplateOverrides?
|
cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @akshaychitneni |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
| // templateOverrides defines template overrides that will be applied to the TrainJob's training runtime template. | ||
| // +listType=map | ||
| // +listMapKey=manager | ||
| TemplateOverrides []TemplateOverride `json:"templateOverrides,omitempty"` |
There was a problem hiding this comment.
As I mentioned previously, I still think separate override fields would be better. Because the external scheduler and the external job manager could be separated. In that case, scheduling constraints (podTemplate) will be managed by the external scheduler, and job parameters (jobTemplate) will be managed by the external job manager.
If we combine those into templateOverrides as in this proposal, there is no way to decouple those.
podTemplateOverrides:
- manager:
name: kueue
time: xyz
targetJobs:
- name: trainer
spec:
nodeSelector:
accelerator: nvidia-gpu
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
jobTemplateOverrides: // or runtimeParameterOverrides? in any case, we can revisit that in the future.
- manager:
name: abc
time: xyz
targetJobs:
- name: trainer
...There was a problem hiding this comment.
Do you see any limitations with the following API to define your example @tenzen-y ?
templateOverrides:
- manager: kueue.x-k8s.io/manager
pod:
- time: "2026-02-17T10:00:00Z"
targetJobs:
- name: trainer
spec:
nodeSelector:
accelerator: nvidia-gpu
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
- manager: abc.example.com/abc
job:
- time: "2026-02-17T10:00:00Z"
targetJobs:
- name: trainer
metadata:
labels:
custom-label: valueThere was a problem hiding this comment.
One advantage of having separate override fields would be to be backward compatible.
Even if the API is still alpha, podTemplateOverrides are already used quite a lot so it's be easier to maintain compatibility.
Also it makes it clearer what the scope of each override type is.
There was a problem hiding this comment.
@tenzen-y
We briefly discussed this today during the Trainer call.
Recording: https://youtu.be/e9_g28XdpHg?t=830
One challenge with this approach is that it prevents us from using
+listType=map +listMapKey=manager, because the list becomes atomic, as @kaisoz pointed out in previous PRs:
- manager:
name: kueue
time: xyzIf we don't want to place all overrides under TemplateOverride API, I think we have two options:
Option 1
Place overrides under an overrides slice. The fields would be immutable, but new override entries could be appended over time.
Pros: Provides a clear history of appended overrides.
Cons: YAML grow in size
podTemplateOverrides:
- manager: kueue.x-k8s.io/manager
overrides:
- time: "2026-02-17T10:00:00Z"
targetJobs:
- name: trainer
spec:
nodeSelector:
accelerator: nvidia-gpu
jobTemplateOverrides:
- manager: abc.example.com/abc
overrides:
- time: "2026-02-17T10:00:00Z"
targetJobs:
- name: trainer
metadata:
labels:
custom-label: valueOption 2
Place overrides directly under each entry and make the API mutable.
Pros: Simpler structure.
Cons: History is not preserved
podTemplateOverrides:
- manager: kueue.x-k8s.io/manager
time: "2026-02-17T10:00:00Z"
targetJobs:
- name: trainer
spec:
nodeSelector:
accelerator: nvidia-gpu
jobTemplateOverrides:
- manager: "abc.example.com/abc"
time: "2026-02-17T10:00:00Z"
targetJobs:
- name: trainer
metadata:
labels:
custom-label: valueOption 3
Keep what we have in the KEP right now.
example: #3199 (comment)
Any thoughts ?
cc @VassilisVassiliadis @kannon92 @mimowo @astefanutti @vsoch
If you can provide any feedback for the API, it would be super helpful!
There was a problem hiding this comment.
So I understand the difference between option 2 and what is there now (option 3) the general "templteOverrides" with a list of manager pod|job is being replaced with a single podTemplateOverrides and jobTemplateOverrides either with overrides or directly under it. And a list of overrides is valid in all cases, e.g.,
podTemplateOverrides:
- manager: kueue.x-k8s.io/manager
time: "2026-02-17T10:00:00Z"
targetJobs:
- name: trainer
spec:
nodeSelector:
accelerator: nvidia-gpu
- manager: kueue.x-k8s.io/another-manager
...A question. What happens if there is conflicting information? E.g., two sets of overrides, and different nodeSelector for the same managers:
podTemplateOverrides:
- manager: kueue.x-k8s.io/manager
time: "2026-02-17T10:00:00Z"
targetJobs:
- name: trainer
spec:
nodeSelector:
accelerator: nvidia-gpu
- manager: kueue.x-k8s.io/another-manager
time: "2026-02-17T10:00:00Z"
targetJobs:
- name: trainer
spec:
nodeSelector:
accelerator: nvidia-another-gpuI don't think preserving history is a strong priority, and having to consolidate "old" information (versus one source of truth) is adding a challenge that does not need to be there. I like Option 3 best, but I want to better understand why we allow a listing. If there is a duplicate manager would it not validate? And is this interface expected to be most utilized by the user (writing a YAML TrainJob with overrides) or internal controllers (e.g., FluxPolicy) or both? I'd like to see Command/Args/Environment support, and I suspect that would be in the PodTemplateOverrides?
This BREAKING CHANGE will replace
PodTemplateOverridewithTemplateOverridesAPI.We would like to group overrides by manager for clear ownership boundaries.
This PR updates KEP, APIs, and implementation.
TemplateOverrides will have:
Related: #3020