Conversation
Proposing framework-aware trainer classes (TorchTrainer, MPITrainer, JAXTrainer, XGBoostTrainer) with automatic runtime discovery via the trainer.kubeflow.org/framework label, and a RuntimeConfig dataclass to separate per-job environment settings from training logic. Issue: #285 Signed-off-by: Saad Zaher <szaher@redhat.com>
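For context, a minimal sketch of the two proposed abstractions. The field and method names below are illustrative guesses at what the proposal describes, not its final API:

```python
from abc import ABC
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class RuntimeConfig:
    """Per-job environment settings, kept separate from the training logic."""
    num_nodes: int = 1
    resources_per_node: Dict[str, str] = field(default_factory=dict)
    env: Dict[str, str] = field(default_factory=dict)

class BaseTrainer(ABC):
    """Common interface every specialized trainer would implement."""

    # Matched against the trainer.kubeflow.org/framework label on installed runtimes.
    framework: str

    def __init__(self, func: Callable, runtime_config: Optional[RuntimeConfig] = None):
        self.func = func
        self.runtime_config = runtime_config or RuntimeConfig()

class TorchTrainer(BaseTrainer):
    framework = "torch"

class MPITrainer(BaseTrainer):
    framework = "mpi"

class JAXTrainer(BaseTrainer):
    framework = "jax"

class XGBoostTrainer(BaseTrainer):
    framework = "xgboost"
```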
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Pull Request Test Coverage Report for Build 21919758470

💛 - Coveralls
Pull request overview
This PR introduces a design proposal for specialized trainer abstractions in the Kubeflow SDK. The proposal addresses the gap between the overly generic CustomTrainer and the overly narrow BuiltinTrainer by introducing framework-aware trainer classes with automatic runtime discovery.
Changes:
- Proposes a `BaseTrainer` abstract interface for all specialized trainers
- Defines four Tier 1 framework-specific trainers: `TorchTrainer`, `MPITrainer`, `JAXTrainer`, and `XGBoostTrainer`
- Introduces a `RuntimeConfig` dataclass to separate runtime environment configuration from training logic
- Describes runtime auto-discovery using the `trainer.kubeflow.org/framework` label (see the usage sketch after this list)
- Maintains 100% backward compatibility with existing trainer APIs
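Putting those pieces together, a usage sketch of what auto-discovery could look like from the caller's side. `TrainerClient` and `train()` exist in the SDK today; the trainer passed in is the hypothetical `TorchTrainer` from the sketch above:

```python
from kubeflow.trainer import TrainerClient

def train_fn():
    # User-provided training logic (model, data, loop) runs on every node.
    ...

client = TrainerClient()

# No explicit runtime argument: the SDK would match TorchTrainer.framework
# ("torch") against the trainer.kubeflow.org/framework label on installed runtimes.
job_name = client.train(
    trainer=TorchTrainer(  # hypothetical class from the sketch above
        func=train_fn,
        runtime_config=RuntimeConfig(num_nodes=2, resources_per_node={"nvidia.com/gpu": "4"}),
    ),
)
```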
### B. Tier 1: Framework-Specific Trainers
Each Tier 1 trainer maps 1:1 to a framework identified by the `trainer.kubeflow.org/framework` label.
Should we enforce that there can be only one training runtime per `trainer.kubeflow.org/framework` label value, or should we design a more flexible mechanism? I'm thinking about having multiple Torch runtimes with different PyTorch versions or different accelerators.
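To make the flexible variant concrete, a hypothetical sketch of label-based disambiguation. The `resolve_runtime` helper and the extra version/accelerator label keys are assumptions, not existing SDK API:

```python
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"

def resolve_runtime(runtimes, framework, version=None, accelerator=None):
    """Pick exactly one runtime for `framework`, failing loudly on ambiguity."""
    candidates = [
        r for r in runtimes
        if r.labels.get(FRAMEWORK_LABEL) == framework
        and (version is None or r.labels.get("trainer.kubeflow.org/version") == version)
        and (accelerator is None or r.labels.get("trainer.kubeflow.org/accelerator") == accelerator)
    ]
    if not candidates:
        raise LookupError(f"no runtime found for framework {framework!r}")
    if len(candidates) > 1:
        names = sorted(r.name for r in candidates)
        raise LookupError(
            f"ambiguous runtimes for {framework!r}: {names}; "
            "pass version= or accelerator= to disambiguate"
        )
    return candidates[0]
```

Under this scheme, a single runtime per framework stays the zero-config happy path, while extra labels let users pin a specific PyTorch version or accelerator without needing a new trainer class.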
## User-Facing API Examples
It would be useful to have an example of a Tier 2 trainer that is configuration-driven, e.g., comparing the current and proposed forms of the TorchTune trainer.
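For illustration, a rough before/after sketch. The `BuiltinTrainer` and `TorchTuneConfig` names exist in the SDK today, but the runtime name, config fields, and the entire `TorchTuneTrainer` class are hypothetical:

```python
from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig

client = TrainerClient()

# Current: the user picks a runtime explicitly and wraps a config object
# in the generic BuiltinTrainer (runtime name and config fields illustrative).
client.train(
    runtime=client.get_runtime("torchtune-llama3.2-1b"),
    trainer=BuiltinTrainer(config=TorchTuneConfig(num_nodes=2)),
)

# Proposed (hypothetical): a dedicated Tier 2 trainer carrying the same
# configuration, with the runtime resolved via the framework label instead.
class TorchTuneTrainer:
    """Sketch only; would subclass BaseTrainer in the proposal."""
    framework = "torchtune"

    def __init__(self, runtime_config=None, **tune_options):
        self.runtime_config = runtime_config
        self.tune_options = tune_options

# RuntimeConfig from the earlier sketch carries the per-job settings.
client.train(trainer=TorchTuneTrainer(dtype="bf16", runtime_config=RuntimeConfig(num_nodes=2)))
```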
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com> Signed-off-by: Saad Zaher <szaher@redhat.com>
/reopen |
@szaher: Failed to re-open PR: state cannot be changed. The repository that submitted this pull request has been deleted. In response to this: `/reopen`

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
For some reason this PR was deleted even before I deleted the repo, so I created a fresh clone and pushed it again in #308.