[WIP] feat(api): Replace PodTemplateOverrides with TemplateOverrides#3199

Open
andreyvelich wants to merge 6 commits intokubeflow:masterfrom
andreyvelich:template-override-api

Conversation

@andreyvelich
Member

@andreyvelich andreyvelich commented Feb 10, 2026

This BREAKING CHANGE replaces the PodTemplateOverrides API with the TemplateOverrides API.

We would like to group overrides by manager for clear ownership boundaries.

This PR updates KEP, APIs, and implementation.

TemplateOverrides will have:

  • PodTemplateOverrides - for Pod overrides
  • JobTemplateOverrides - for JobSet/Job overrides
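
As a rough illustration of the grouping described above, the following Go sketch models one `TemplateOverride` entry per manager, holding both pod-level and job-level overrides. The JSON names `manager`, `pod`, and `job` follow the snippets quoted later in this review; the inner fields (`TargetJobs` as plain strings, `NodeSelector`, `Labels`) are simplified placeholders and not the PR's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PodTemplateOverride is a simplified placeholder for the pod-level override.
type PodTemplateOverride struct {
	TargetJobs   []string          `json:"targetJobs,omitempty"`
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
}

// JobTemplateOverride is a simplified placeholder for the JobSet/Job-level override.
type JobTemplateOverride struct {
	TargetJobs []string          `json:"targetJobs,omitempty"`
	Labels     map[string]string `json:"labels,omitempty"`
}

// TemplateOverride groups every override owned by a single manager.
type TemplateOverride struct {
	Manager              string                `json:"manager"`
	PodTemplateOverrides []PodTemplateOverride `json:"pod,omitempty"`
	JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`
}

// exampleOverrides builds two entries, each owned by a different manager.
func exampleOverrides() []TemplateOverride {
	return []TemplateOverride{
		{
			Manager: "kueue.x-k8s.io/manager",
			PodTemplateOverrides: []PodTemplateOverride{{
				TargetJobs:   []string{"trainer"},
				NodeSelector: map[string]string{"accelerator": "nvidia-gpu"},
			}},
		},
		{
			Manager: "abc.example.com/abc",
			JobTemplateOverrides: []JobTemplateOverride{{
				TargetJobs: []string{"trainer"},
				Labels:     map[string]string{"custom-label": "value"},
			}},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(exampleOverrides(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```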

Related: #3020

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Copilot AI review requested due to automatic review settings February 10, 2026 13:58
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a breaking API change to replace PodTemplateOverrides with manager-scoped TemplateOverrides, aiming to group override ownership boundaries more clearly across controllers/users.

Changes:

  • Replaces TrainJobSpec.PodTemplateOverrides with TrainJobSpec.TemplateOverrides keyed by manager.
  • Introduces new API types for TemplateOverride, including job-level and pod-level override histories.
  • Updates the v2 proposal/KEP documentation to describe the new API shape and examples.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| pkg/apis/trainer/v1alpha1/trainjob_types.go | Updates TrainJob API types to add manager-keyed TemplateOverrides and new override structs. |
| docs/proposals/2170-kubeflow-trainer-v2/README.md | Updates the proposal to document TemplateOverrides, including rationale and YAML examples. |

Comment on lines +799 to +804
// JobTemplateOverride represents a custom override that will be applied to the JobTemplateSpec
type JobTemplateOverride struct {
// Time is the timestamp of when the JobTemplateOverride entry was added.
// +required
Time metav1.Time `json:"time,omitempty"`

Copilot AI Feb 10, 2026

In the docs code snippet, Time is marked as +required and uses a non-pointer metav1.Time, which doesn’t match the actual API types in pkg/apis/trainer/v1alpha1/trainjob_types.go where Time is optional (*metav1.Time).

Comment on lines +268 to +276
// +listType=map
// +listMapKey=time
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`

// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"
// +listType=map
// +listMapKey=time

Copilot AI Feb 10, 2026

JobTemplateOverrides/PodTemplateOverrides are declared as +listType=map with +listMapKey=time, but the keyed field Time is optional (*metav1.Time), which makes the map key potentially unset and breaks map-list semantics (unique/stable keys for merge/validation).

Suggested change
// +listType=map
// +listMapKey=time
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`
// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"
// +listType=map
// +listMapKey=time
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`
// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"

Comment on lines +299 to +305
TargetJobs []TemplateOverrideTargetJob `json:"targetJobs,omitempty"`

// metadata overrides the Job template metadata or JobSet metadata.
// If targetJobs is specified, these values are merged with the specific ReplicatedJob's Job template metadata.
// If targetJobs is empty, these values are merged with the JobSet object metadata.
// +optional
Metadata *metav1.ObjectMeta `json:"metadata,omitempty"`
Member Author

@andreyvelich andreyvelich Feb 10, 2026

I’ve been thinking that we could start using JobTemplateOverride instead of the dedicated Labels and Annotations fields we currently expose in the TrainJob.spec API.

The idea would be:

  • If targetJob is omitted, the override is applied to the JobSet
  • If targetJob is set, the override is applied to the specific Job

One concern is that once we introduce JobTemplateSpecOverride, it could potentially contain fields relevant to both Job and JobSet, which may introduce ambiguity. I’m not entirely sure what the better way to handle that would be, though I also don’t see a clearly better alternative at the moment.

@tenzen-y @kaisoz @mimowo @astefanutti @kannon92, I’d really appreciate your thoughts on this approach.
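
The targeting rule proposed above can be sketched as follows. This is only an illustration of the rule as stated (no target means the JobSet, a target means that specific Job); the `jobOverride` type and the `overrideTargets` helper are hypothetical, and the returned strings are labels for illustration rather than real object references.

```go
package main

import "fmt"

// jobOverride is a hypothetical, reduced form of JobTemplateOverride.
type jobOverride struct {
	TargetJobs []string
	Labels     map[string]string
}

// overrideTargets returns the objects an override would be merged into:
// the JobSet itself when no target is given, otherwise each named Job.
func overrideTargets(o jobOverride) []string {
	if len(o.TargetJobs) == 0 {
		return []string{"JobSet"}
	}
	targets := make([]string, 0, len(o.TargetJobs))
	for _, name := range o.TargetJobs {
		targets = append(targets, "ReplicatedJob/"+name)
	}
	return targets
}

func main() {
	labels := map[string]string{"custom-label": "value"}
	fmt.Println(overrideTargets(jobOverride{Labels: labels}))
	fmt.Println(overrideTargets(jobOverride{TargetJobs: []string{"trainer"}, Labels: labels}))
}
```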

Comment on lines +310 to +313
// Time is the timestamp of when the JobTemplateOverride entry was added. If value is omitted,
// controller defaults this value to the current timestamp.
// +optional
Time *metav1.Time `json:"time,omitempty"`
Member Author

Time will be set server-side by the Trainer mutating admission webhook when the TrainJob is created or updated.

Comment on lines +270 to +277
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`

// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"
// +listType=map
// +listMapKey=time
PodTemplateOverrides []PodTemplateOverride `json:"pod,omitempty"`
Member Author

Do you prefer pod and job or podTemplateOverrides and jobTemplateOverrides?
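
For readers less familiar with CEL, the two XValidation rules quoted above amount to an append-only check: every old entry must still be present unchanged, and the list may not shrink. The following Go sketch restates that logic under simplifying assumptions; the `entry` type (with string fields so that `==` compares whole entries) and the `appendOnly` helper are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical, comparable stand-in for an override entry.
type entry struct {
	Time string
	Spec string
}

// appendOnly mirrors the two rules: the list cannot shrink, and each old
// entry must reappear unchanged somewhere in the updated list.
func appendOnly(old, updated []entry) error {
	if len(updated) < len(old) {
		return errors.New("pod template override entries cannot be deleted")
	}
	for _, o := range old {
		found := false
		for _, n := range updated {
			if o == n { // equal structs imply equal Time as well
				found = true
				break
			}
		}
		if !found {
			return fmt.Errorf("existing entry at %s cannot be modified or removed", o.Time)
		}
	}
	return nil
}

func main() {
	old := []entry{{Time: "2026-02-17T10:00:00Z", Spec: "gpu"}}
	appended := append([]entry{}, old...)
	appended = append(appended, entry{Time: "2026-02-18T10:00:00Z", Spec: "cpu"})
	fmt.Println(appendOnly(old, appended))
	modified := []entry{{Time: "2026-02-17T10:00:00Z", Spec: "tpu"}}
	fmt.Println(appendOnly(old, modified))
}
```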

@andreyvelich
Member Author

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @akshaychitneni

andreyvelich and others added 5 commits February 10, 2026 15:21
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Comment on lines +315 to +318
// templateOverrides defines template overrides that will be applied to the TrainJob's training runtime template.
// +listType=map
// +listMapKey=manager
TemplateOverrides []TemplateOverride `json:"templateOverrides,omitempty"`
Member

As I mentioned previously, I still think separate override fields would be better, because the external scheduler and the external job manager could be separate components. In that case, scheduling constraints (podTemplate) would be managed by the external scheduler, and job parameters (jobTemplate) would be managed by the external job manager.

If we combine those into templateOverrides as in this proposal, there is no way to decouple those.

podTemplateOverrides:
- manager: 
    name: kueue
    time: xyz
  targetJobs:
  - name: trainer
  spec:
    nodeSelector:
      accelerator: nvidia-gpu
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
jobTemplateOverrides: # or runtimeParameterOverrides? In any case, we can revisit that in the future.
- manager: 
    name: abc
    time: xyz
  targetJobs:
  - name: trainer
  ...

Member Author

Do you see any limitations with the following API for defining your example, @tenzen-y?

templateOverrides:
  - manager: kueue.x-k8s.io/manager
    pod:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        spec:
          nodeSelector:
            accelerator: nvidia-gpu
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"

  - manager: abc.example.com/abc
    job:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        metadata:
          labels:
            custom-label: value

Contributor

One advantage of having separate override fields would be to be backward compatible.

Even if the API is still alpha, podTemplateOverrides are already used quite a lot, so it would be easier to maintain compatibility.

Also it makes it clearer what the scope of each override type is.

Member Author

@tenzen-y
We briefly discussed this today during the Trainer call.
Recording: https://youtu.be/e9_g28XdpHg?t=830

One challenge with this approach is that it prevents us from using
+listType=map +listMapKey=manager, because the list becomes atomic, as @kaisoz pointed out in previous PRs:

- manager: 
    name: kueue
    time: xyz
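
One way to read this constraint, as far as I understand the list-map markers: the field named in `+listMapKey` has to be a scalar directly on each list item, so a nested `manager` object cannot serve as the key. The sketch below contrasts the two shapes; all type names and the `indexByManager` helper are hypothetical and only illustrate the lookup a keyed list makes possible.

```go
package main

import "fmt"

// managerRef is the nested shape from the snippet above: name and time
// live inside an object, so there is no top-level scalar to key on.
type managerRef struct {
	Name string `json:"name"`
	Time string `json:"time"`
}

type nestedOverride struct {
	Manager managerRef `json:"manager"`
}

// flatOverride keeps manager as a top-level scalar, which is the shape
// that +listType=map with +listMapKey=manager relies on.
type flatOverride struct {
	Manager string `json:"manager"`
	Time    string `json:"time,omitempty"`
}

// indexByManager builds the per-manager lookup and rejects duplicate keys.
func indexByManager(list []flatOverride) (map[string]flatOverride, error) {
	idx := make(map[string]flatOverride, len(list))
	for _, o := range list {
		if _, dup := idx[o.Manager]; dup {
			return nil, fmt.Errorf("duplicate manager %q", o.Manager)
		}
		idx[o.Manager] = o
	}
	return idx, nil
}

func main() {
	nested := nestedOverride{Manager: managerRef{Name: "kueue", Time: "xyz"}}
	fmt.Printf("nested shape: %+v\n", nested)

	idx, err := indexByManager([]flatOverride{
		{Manager: "kueue.x-k8s.io/manager", Time: "2026-02-17T10:00:00Z"},
		{Manager: "abc.example.com/abc", Time: "2026-02-17T10:00:00Z"},
	})
	fmt.Println(len(idx), err)
}
```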

If we don't want to place all overrides under TemplateOverride API, I think we have two options:

Option 1

Place overrides under an overrides slice. The fields would be immutable, but new override entries could be appended over time.

Pros: Provides a clear history of appended overrides.
Cons: The YAML grows in size.

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    overrides:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        spec:
          nodeSelector:
            accelerator: nvidia-gpu

jobTemplateOverrides:
  - manager: abc.example.com/abc
    overrides:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        metadata:
          labels:
            custom-label: value

Option 2

Place overrides directly under each entry and make the API mutable.
Pros: Simpler structure.
Cons: History is not preserved

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-gpu

jobTemplateOverrides:
  - manager: "abc.example.com/abc"
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    metadata:
      labels:
        custom-label: value

Option 3

Keep what we have in the KEP right now.
example: #3199 (comment)
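
The practical difference between Option 1 and Option 2 can be sketched as append versus upsert. The types and helpers below are hypothetical and heavily reduced; they only illustrate that Option 1 accumulates entries per manager while Option 2 keeps a single, replaceable entry per manager.

```go
package main

import "fmt"

// override is a hypothetical, reduced override payload.
type override struct {
	Time         string
	NodeSelector map[string]string
}

// managerHistory models Option 1: an append-only slice per manager.
type managerHistory struct {
	Manager   string
	Overrides []override
}

// appendOverride adds a new entry and never touches earlier ones.
func appendOverride(h *managerHistory, o override) {
	h.Overrides = append(h.Overrides, o)
}

// managerOverride models Option 2: one mutable entry per manager.
type managerOverride struct {
	Manager string
	Current override
}

// upsert replaces the manager's entry if present, otherwise adds it.
func upsert(list []managerOverride, manager string, o override) []managerOverride {
	for i := range list {
		if list[i].Manager == manager {
			list[i].Current = o
			return list
		}
	}
	return append(list, managerOverride{Manager: manager, Current: o})
}

func main() {
	first := override{Time: "2026-02-17T10:00:00Z", NodeSelector: map[string]string{"accelerator": "nvidia-gpu"}}
	second := override{Time: "2026-02-18T10:00:00Z", NodeSelector: map[string]string{"accelerator": "nvidia-another-gpu"}}

	h := managerHistory{Manager: "kueue.x-k8s.io/manager"}
	appendOverride(&h, first)
	appendOverride(&h, second)
	fmt.Println("option 1 entries:", len(h.Overrides))

	var list []managerOverride
	list = upsert(list, "kueue.x-k8s.io/manager", first)
	list = upsert(list, "kueue.x-k8s.io/manager", second)
	fmt.Println("option 2 entries:", len(list), "latest:", list[0].Current.Time)
}
```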

Any thoughts?

cc @VassilisVassiliadis @kannon92 @mimowo @astefanutti @vsoch
If you can provide any feedback for the API, it would be super helpful!

Contributor

So, as I understand the difference between Option 2 and what is there now (Option 3): the general "templateOverrides" with a list of manager pod|job entries is replaced with separate podTemplateOverrides and jobTemplateOverrides, with the overrides placed either under an overrides slice or directly under each entry. And a list of overrides is valid in all cases, e.g.,

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
  - manager: kueue.x-k8s.io/another-manager
...

A question. What happens if there is conflicting information? E.g., two sets of overrides, and different nodeSelector for the same managers:

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
  - manager: kueue.x-k8s.io/another-manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-another-gpu

I don't think preserving history is a strong priority, and having to consolidate "old" information (versus one source of truth) is adding a challenge that does not need to be there. I like Option 3 best, but I want to better understand why we allow a listing. If there is a duplicate manager would it not validate? And is this interface expected to be most utilized by the user (writing a YAML TrainJob with overrides) or internal controllers (e.g., FluxPolicy) or both? I'd like to see Command/Args/Environment support, and I suspect that would be in the PodTemplateOverrides?
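
The thread does not say how conflicting overrides are resolved, so the following sketch only detects the situation from the question above rather than picking a winner. It flags cases where two entries target the same job and set the same nodeSelector key to different values. The `podOverride` and `conflict` types and the `findConflicts` helper are hypothetical.

```go
package main

import "fmt"

// podOverride is a hypothetical, reduced pod-level override entry.
type podOverride struct {
	Manager      string
	TargetJobs   []string
	NodeSelector map[string]string
}

// conflict records two managers disagreeing on one selector key for one job.
type conflict struct {
	Job, Key      string
	First, Second string
}

// findConflicts walks the entries in list order and reports every later
// entry whose nodeSelector value differs from the first one seen.
func findConflicts(overrides []podOverride) []conflict {
	type owner struct{ manager, value string }
	seen := map[string]owner{}
	var out []conflict
	for _, o := range overrides {
		for _, job := range o.TargetJobs {
			for k, v := range o.NodeSelector {
				id := job + "\x00" + k
				prev, ok := seen[id]
				if ok && prev.value != v {
					out = append(out, conflict{Job: job, Key: k, First: prev.manager, Second: o.Manager})
					continue
				}
				if !ok {
					seen[id] = owner{manager: o.Manager, value: v}
				}
			}
		}
	}
	return out
}

func main() {
	conflicts := findConflicts([]podOverride{
		{Manager: "kueue.x-k8s.io/manager", TargetJobs: []string{"trainer"},
			NodeSelector: map[string]string{"accelerator": "nvidia-gpu"}},
		{Manager: "kueue.x-k8s.io/another-manager", TargetJobs: []string{"trainer"},
			NodeSelector: map[string]string{"accelerator": "nvidia-another-gpu"}},
	})
	for _, c := range conflicts {
		fmt.Printf("job %s: %s and %s disagree on %q\n", c.Job, c.First, c.Second, c.Key)
	}
}
```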
