Conversation
Proposing framework-aware trainer classes (TorchTrainer, MPITrainer, JAXTrainer, XGBoostTrainer) with automatic runtime discovery via the trainer.kubeflow.org/framework label, and a RuntimeConfig dataclass to separate per-job environment settings from training logic. Issue: #285 Signed-off-by: Saad Zaher <szaher@redhat.com>
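For context, a minimal sketch of the two proposed abstractions. The field and method names below are illustrative guesses at what the proposal describes, not its final API:

```python
from abc import ABC
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class RuntimeConfig:
    """Per-job environment settings, kept separate from the training logic."""
    num_nodes: int = 1
    resources_per_node: Dict[str, str] = field(default_factory=dict)
    env: Dict[str, str] = field(default_factory=dict)

class BaseTrainer(ABC):
    """Common interface every specialized trainer would implement."""

    # Matched against the trainer.kubeflow.org/framework label on installed runtimes.
    framework: str

    def __init__(self, func: Callable, runtime_config: Optional[RuntimeConfig] = None):
        self.func = func
        self.runtime_config = runtime_config or RuntimeConfig()

class TorchTrainer(BaseTrainer):
    framework = "torch"

class MPITrainer(BaseTrainer):
    framework = "mpi"

class JAXTrainer(BaseTrainer):
    framework = "jax"

class XGBoostTrainer(BaseTrainer):
    framework = "xgboost"
```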
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Pull Request Test Coverage Report for Build 21919758470

💛 - Coveralls
Pull request overview
This PR introduces a design proposal for specialized trainer abstractions in the Kubeflow SDK. The proposal addresses the gap between the overly generic CustomTrainer and the overly narrow BuiltinTrainer by introducing framework-aware trainer classes with automatic runtime discovery.
Changes:
- Proposes a `BaseTrainer` abstract interface for all specialized trainers
- Defines four Tier 1 framework-specific trainers: `TorchTrainer`, `MPITrainer`, `JAXTrainer`, and `XGBoostTrainer`
- Introduces a `RuntimeConfig` dataclass to separate runtime environment configuration from training logic
- Describes runtime auto-discovery using the `trainer.kubeflow.org/framework` label (see the usage sketch after this list)
- Maintains 100% backward compatibility with existing trainer APIs
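Putting those pieces together, a usage sketch of what auto-discovery could look like from the caller's side. `TrainerClient` and `train()` exist in the SDK today; the trainer passed in is the hypothetical `TorchTrainer` from the sketch above:

```python
from kubeflow.trainer import TrainerClient

def train_fn():
    # User-provided training logic (model, data, loop) runs on every node.
    ...

client = TrainerClient()

# No explicit runtime argument: the SDK would match TorchTrainer.framework
# ("torch") against the trainer.kubeflow.org/framework label on installed runtimes.
job_name = client.train(
    trainer=TorchTrainer(  # hypothetical class from the sketch above
        func=train_fn,
        runtime_config=RuntimeConfig(num_nodes=2, resources_per_node={"nvidia.com/gpu": "4"}),
    ),
)
```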
### B. Tier 1: Framework-Specific Trainers
Each Tier 1 trainer maps 1:1 to a framework identified by the `trainer.kubeflow.org/framework` label.
Should we enforce that there can be only one training runtime per `trainer.kubeflow.org/framework` label value, or should we design a more flexible mechanism? I'm thinking about having multiple Torch runtimes with different PyTorch versions or different accelerators.
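To make the flexible variant concrete, a hypothetical sketch of label-based disambiguation. The `resolve_runtime` helper and the extra version/accelerator label keys are assumptions, not existing SDK API:

```python
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"

def resolve_runtime(runtimes, framework, version=None, accelerator=None):
    """Pick exactly one runtime for `framework`, failing loudly on ambiguity."""
    candidates = [
        r for r in runtimes
        if r.labels.get(FRAMEWORK_LABEL) == framework
        and (version is None or r.labels.get("trainer.kubeflow.org/version") == version)
        and (accelerator is None or r.labels.get("trainer.kubeflow.org/accelerator") == accelerator)
    ]
    if not candidates:
        raise LookupError(f"no runtime found for framework {framework!r}")
    if len(candidates) > 1:
        names = sorted(r.name for r in candidates)
        raise LookupError(
            f"ambiguous runtimes for {framework!r}: {names}; "
            "pass version= or accelerator= to disambiguate"
        )
    return candidates[0]
```

Under this scheme, a single runtime per framework stays the zero-config happy path, while extra labels let users pin a specific PyTorch version or accelerator without needing a new trainer class.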
## User-Facing API Examples
It would be useful to have an example of a Tier 2 trainer that is configuration-driven, e.g., comparing the current and proposed forms of the TorchTune trainer.
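For illustration, a rough before/after sketch. The `BuiltinTrainer` and `TorchTuneConfig` names exist in the SDK today, but the runtime name, config fields, and the entire `TorchTuneTrainer` class are hypothetical:

```python
from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig

client = TrainerClient()

# Current: the user picks a runtime explicitly and wraps a config object
# in the generic BuiltinTrainer (runtime name and config fields illustrative).
client.train(
    runtime=client.get_runtime("torchtune-llama3.2-1b"),
    trainer=BuiltinTrainer(config=TorchTuneConfig(num_nodes=2)),
)

# Proposed (hypothetical): a dedicated Tier 2 trainer carrying the same
# configuration, with the runtime resolved via the framework label instead.
class TorchTuneTrainer:
    """Sketch only; would subclass BaseTrainer in the proposal."""
    framework = "torchtune"

    def __init__(self, runtime_config=None, **tune_options):
        self.runtime_config = runtime_config
        self.tune_options = tune_options

# RuntimeConfig from the earlier sketch carries the per-job settings.
client.train(trainer=TorchTuneTrainer(dtype="bf16", runtime_config=RuntimeConfig(num_nodes=2)))
```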
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com> Signed-off-by: Saad Zaher <szaher@redhat.com>
/reopen |
@szaher: Failed to re-open PR: state cannot be changed. The repository that submitted this pull request has been deleted. In response to this: `/reopen`

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
For some reason this PR was deleted even before I deleted the repo, so I created a fresh clone and pushed it again in #308.