Conversation
Proposing framework-aware trainer classes (TorchTrainer, MPITrainer, JAXTrainer, XGBoostTrainer) with automatic runtime discovery via the `trainer.kubeflow.org/framework` label, and a `RuntimeConfig` dataclass to separate per-job environment settings from training logic.

Issue: kubeflow#285
Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com> Signed-off-by: Saad Zaher <szaher@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Pull request overview
This PR adds a comprehensive design proposal for specialized trainer abstractions and a RuntimeConfig dataclass to the Kubeflow SDK. The proposal addresses current limitations in the SDK's trainer subsystem by introducing framework-aware trainer classes that bridge the gap between the generic CustomTrainer and the highly specific BuiltinTrainer.
Changes:
- Adds a detailed design proposal document describing a new BaseTrainer abstract interface and specialized framework trainers (TorchTrainer, MPITrainer, JAXTrainer, XGBoostTrainer)
- Proposes a RuntimeConfig dataclass to cleanly separate runtime environment settings from training logic
- Includes comprehensive documentation covering motivation, design details, API examples, migration strategy, test plan, and alternatives considered
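To make the proposal's shape concrete, here is a minimal sketch of what a `RuntimeConfig` dataclass and a framework-aware trainer could look like. This is an illustration based only on the overview above; the field names (`num_nodes`, `resources_per_node`, `env`), the `matches` method, and its signature are assumptions, not part of the proposal's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Label key the proposal names for automatic runtime discovery.
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"

@dataclass
class RuntimeConfig:
    """Per-job environment settings, kept separate from training logic.

    Field names here are illustrative, not taken from the proposal.
    """
    num_nodes: int = 1
    resources_per_node: Dict[str, str] = field(default_factory=dict)
    env: Dict[str, str] = field(default_factory=dict)

@dataclass
class TorchTrainer:
    """Hypothetical framework-aware trainer for PyTorch runtimes."""
    train_func: Callable[[], None]
    runtime: Optional[RuntimeConfig] = None
    framework: str = "torch"  # compared against the runtime's framework label

    def matches(self, runtime_labels: Dict[str, str]) -> bool:
        # Automatic runtime discovery: select runtimes whose
        # trainer.kubeflow.org/framework label equals this trainer's framework.
        return runtime_labels.get(FRAMEWORK_LABEL) == self.framework
```

The point of the split is that `train_func` carries only training logic, while everything environment-specific lives in `RuntimeConfig` and can be swapped per job.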
Context from the proposal document:

> 3. **Deprecating `CustomTrainer` or `BuiltinTrainer`.** Both remain supported.
>    Specialized trainers are an additional option, not a replacement.
> 4. **Tier 2 trainer implementations.** This proposal defines the extension mechanism
>    and interface. Concrete Tier 2 implementations (HuggingFace, DeepSpeed, Unsloth,
The company name should be spelled "Hugging Face" (with a space) rather than "HuggingFace" throughout the document. This applies to references in text and comments, though the class name `HuggingFaceTrainer` would be correct, as Python class names don't use spaces.

Suggested change:

```diff
-and interface. Concrete Tier 2 implementations (HuggingFace, DeepSpeed, Unsloth,
+and interface. Concrete Tier 2 implementations (Hugging Face, DeepSpeed, Unsloth,
```
Context from the proposal document:

```python
# Example: future HuggingFaceTrainer (NOT part of this proposal's implementation scope)
@dataclass
class TransformersTrainer(BaseTrainer):
    """Trainer for HuggingFace Transformers training.

    Wraps HuggingFace's Trainer API and maps to a PyTorch runtime.
```

The company name should be spelled "Hugging Face" (with a space) rather than "HuggingFace" in the comment and docstring text.

Suggested change:

```diff
-# Example: future HuggingFaceTrainer (NOT part of this proposal's implementation scope)
+# Example: future Hugging Face trainer (NOT part of this proposal's implementation scope)
 @dataclass
 class TransformersTrainer(BaseTrainer):
-    """Trainer for HuggingFace Transformers training.
+    """Trainer for Hugging Face Transformers training.
-    Wraps HuggingFace's Trainer API and maps to a PyTorch runtime.
+    Wraps Hugging Face's Trainer API and maps to a PyTorch runtime.
```
Context from the proposal document:

```
      │
┌─────┴──────────┐
│                │
HuggingFace   DeepSpeed
```

The company name should be spelled "Hugging Face" (with a space) rather than "HuggingFace" in the diagram text.

Suggested change:

```diff
-HuggingFace   DeepSpeed
+Hugging Face  DeepSpeed
```
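To illustrate the label-based runtime discovery described in the PR overview, here is a hedged sketch of filtering available runtimes by the `trainer.kubeflow.org/framework` label. The runtime catalog, runtime names, and the `discover_runtimes` helper are hypothetical stand-ins, not the SDK's actual API.

```python
# Label key the proposal names for automatic runtime discovery.
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"

# Hypothetical catalog of cluster training runtimes and their labels.
runtimes = [
    {"name": "torch-distributed", "labels": {FRAMEWORK_LABEL: "torch"}},
    {"name": "deepspeed-runtime", "labels": {FRAMEWORK_LABEL: "torch"}},
    {"name": "mpi-runtime", "labels": {FRAMEWORK_LABEL: "mpi"}},
]

def discover_runtimes(framework):
    """Return names of runtimes whose framework label matches the trainer's framework."""
    return [
        r["name"]
        for r in runtimes
        if r["labels"].get(FRAMEWORK_LABEL) == framework
    ]
```

A `TorchTrainer` would select from the `"torch"`-labeled runtimes automatically, so users never hard-code a runtime name into their training logic.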