chore: Trainer: Specialized Trainers #287

Closed
szaher wants to merge 5 commits into kubeflow:main from szaher:KEP-285

@szaher (Member) commented Feb 11, 2026

Proposing framework-aware trainer classes (TorchTrainer, MPITrainer, JAXTrainer, XGBoostTrainer) with automatic runtime discovery via the trainer.kubeflow.org/framework label, and a RuntimeConfig dataclass to separate per-job environment settings from training logic.

Issue: #285

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Saad Zaher <szaher@redhat.com>
Copilot AI review requested due to automatic review settings February 11, 2026 19:25
@google-oss-prow (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@szaher changed the title from "Trainer: Specialized Trainers" to "chore: Trainer: Specialized Trainers" on Feb 11, 2026
@coveralls

Pull Request Test Coverage Report for Build 21919758470

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 67.948%

Totals
Change from base Build 21919105330: 0.0%
Covered Lines: 2828
Relevant Lines: 4162

💛 - Coveralls

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a design proposal for specialized trainer abstractions in the Kubeflow SDK. The proposal addresses the gap between the overly generic CustomTrainer and the overly narrow BuiltinTrainer by introducing framework-aware trainer classes with automatic runtime discovery.

Changes:

  • Proposes a BaseTrainer abstract interface for all specialized trainers
  • Defines four Tier 1 framework-specific trainers: TorchTrainer, MPITrainer, JAXTrainer, and XGBoostTrainer
  • Introduces RuntimeConfig dataclass to separate runtime environment configuration from training logic
  • Describes runtime auto-discovery using the trainer.kubeflow.org/framework label
  • Maintains 100% backward compatibility with existing trainer APIs
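
To make the summary above concrete, here is a minimal sketch of how a framework-aware trainer, a separate `RuntimeConfig`, and label-based discovery could fit together. Only the class names `TorchTrainer` and `RuntimeConfig` and the label key `trainer.kubeflow.org/framework` come from the proposal. The `Runtime` class, every field name, the `discover_runtime` helper, and the label values `"torch"` and `"mpi"` are assumptions made for illustration, not the API defined in the KEP.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Label key taken from the proposal; the values used with it are assumed.
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"


@dataclass
class Runtime:
    """Hypothetical stand-in for a training runtime and its labels."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class RuntimeConfig:
    """Hypothetical per-job environment settings, kept apart from training logic."""
    packages_to_install: List[str] = field(default_factory=list)
    env: Dict[str, str] = field(default_factory=dict)


@dataclass
class TorchTrainer:
    """Hypothetical framework-aware trainer; field names are illustrative only."""
    func: Callable[[], None]
    num_nodes: int = 1
    runtime_config: RuntimeConfig = field(default_factory=RuntimeConfig)
    framework: str = "torch"  # assumed label value for this trainer


def discover_runtime(trainer: TorchTrainer, runtimes: List[Runtime]) -> Runtime:
    """Return the first runtime whose framework label matches the trainer."""
    for runtime in runtimes:
        if runtime.labels.get(FRAMEWORK_LABEL) == trainer.framework:
            return runtime
    raise LookupError(f"no runtime labeled {FRAMEWORK_LABEL}={trainer.framework}")
```

In this sketch the training function and node count live on the trainer, while packages and environment variables live on `RuntimeConfig`, so the same trainer definition could in principle be reused across jobs with different environments.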


### B. Tier 1: Framework-Specific Trainers

Each Tier 1 trainer maps 1:1 to a framework identified by the
Contributor

Should we enforce that there can be only one training runtime per trainer.kubeflow.org/framework label value, or should we design a more flexible mechanism? I'm thinking of cases with multiple Torch runtimes that differ in PyTorch version or accelerator.
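
One way to picture the more flexible mechanism raised in this comment is to let the caller narrow the match with extra labels and to fail loudly when the result is still ambiguous. The sketch below is not part of the proposal: `select_runtime`, the `selector` argument, the runtime names, and the `example.com/accelerator` label are all hypothetical, and runtimes are modeled as a plain mapping from name to labels.

```python
from typing import Dict, Optional

# Label key taken from the proposal; everything else here is assumed.
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"


def select_runtime(
    runtimes: Dict[str, Dict[str, str]],
    framework: str,
    selector: Optional[Dict[str, str]] = None,
) -> str:
    """Return the name of the single runtime matching the framework and selector.

    `runtimes` maps a runtime name to its labels. Raises LookupError when
    zero or more than one runtime matches, so ambiguity is never silent.
    """
    wanted = {FRAMEWORK_LABEL: framework, **(selector or {})}
    matches = sorted(
        name
        for name, labels in runtimes.items()
        if all(labels.get(key) == value for key, value in wanted.items())
    )
    if len(matches) != 1:
        raise LookupError(f"expected exactly one runtime for {wanted}, found {matches}")
    return matches[0]
```

With two Torch runtimes that differ only in a hypothetical accelerator label, calling `select_runtime(runtimes, "torch")` would raise, while passing `selector={"example.com/accelerator": "gpu"}` would pick the GPU runtime. Requiring exactly one match is a design choice of this sketch; a default-runtime annotation or a priority order would be alternatives.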


---

## User-Facing API Examples
Contributor

It would be useful to have an example of a Tier 2 trainer that is configuration-driven, e.g., a before-and-after comparison for the TorchTune trainer.

szaher and others added 4 commits February 19, 2026 16:35
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
@szaher closed this by deleting the head repository Feb 19, 2026
@szaher (Member, Author) commented Feb 19, 2026

/reopen

@google-oss-prow (Contributor)

@szaher: Failed to re-open PR: state cannot be changed. The repository that submitted this pull request has been deleted.

Details

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@szaher (Member, Author) commented Feb 19, 2026

For some reason this PR was deleted even before I deleted the repo, so I created a fresh clone and pushed it again in #308.
