Introduce local trainer client by eoinfennessy · Pull Request #2610 · kubeflow/trainer

eoinfennessy · 2025-04-22T13:20:08Z

What this PR does / why we need it:

This PR introduces LocalTrainerClient to the Python SDK. This client implements the same interface as the existing TrainerClient, and enables users to run training jobs in Docker containers, without requiring a Kubernetes cluster.

Fixes kubeflow/sdk#22

Old PR: opendatahub-io#1

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2025-04-22T13:20:14Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


anishasthana · 2025-04-22T13:57:54Z

@eoinfennessy Drive-by reviewer here... but a quick note: KFP local mode was also implemented in a manner which basically has docker hardcoded into the system. Can I suggest/recommend renaming instances of docker to container, and implementing in a way that makes it possible for users to use Docker or Podman as needed?

KFP local doesn't function without docker on your system without significant finagling, which is a significant barrier to entry.


astefanutti · 2025-04-22T15:03:36Z

examples/local-trainer-client/image-classification/mnist-pytorch-ddp.ipynb

+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "# Using `LocalTrainerClient` for MNIST image classification with PyTorch DDP\n",


My understanding of the main value of the local mode is portability, so I'd rather demonstrate how to run the existing notebook example rather than duplicating it.

The main purpose of this notebook is to make readers aware of what LocalTrainerClient is and to give a high-level idea of how it works by showing the Docker resources it creates. There is some duplication with one of the other notebooks, but updating that notebook to explain LocalTrainerClient and to ask users to examine the Docker resources created might detract from its focus. WDYT?

Agreed regarding highlighting portability -- maybe we could eventually update the getting started guide on the Kubeflow website to make users aware of the LocalTrainerClient by giving them options to use it?

This would be good for a "quickstart"

astefanutti · 2025-04-22T15:05:42Z

sdk/pyproject.toml

  "Topic :: Software Development :: Libraries :: Python Modules",
 ]
-dependencies = ["kubernetes>=27.2.0", "pydantic>=2.10.0"]
+dependencies = ["kubernetes>=27.2.0", "pydantic>=2.10.0", "docker>=7.1.0"]


It should probably be an optional dependency?

Agreed. @szaher, we will need to consider how this will work with plans for extras in the unifying Kubeflow SDK.
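
For illustration, one common way to treat `docker` as an optional dependency is to import it lazily and fail with an actionable message when it is missing. This is only a sketch: the helper name `require_optional` and the extra name `local` are hypothetical, and the package name in the error message is a placeholder, since none of these are defined in this PR.

```python
import importlib
import importlib.util


def require_optional(module_name: str, extra: str):
    """Import an optional top-level dependency, or raise with an install hint.

    `extra` is a hypothetical extras name used only to build the message.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install '<package>[{extra}]'"
        )
    return importlib.import_module(module_name)


# Example: resolve the dependency only when the local client is actually used.
# docker = require_optional("docker", "local")
```

With this pattern the import cost and the hard requirement are deferred until a user actually constructs the local client.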

astefanutti · 2025-04-22T15:06:45Z

sdk/kubeflow/trainer/config/local_runtimes/torch_distributed.yaml

@@ -0,0 +1,34 @@
+apiVersion: trainer.kubeflow.org/v1alpha1


To demonstrate portability, we should rather not duplicate in-tree training runtimes.

The idea was to use the same runtime definitions for the trainer here, but we can definitely change that to better, lighter-weight definitions.

Users can also provide a path to their own runtime YAMLs when creating a LocalTrainerClient. This overrides the path to the built-in runtimes, allowing them to use runtimes other than the ones included with the SDK. The built-in runtimes are included for convenience -- data scientists can start using the local trainer without having to provide a path to runtime files.
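
The override behaviour described above could be sketched roughly as follows. The function name `list_runtime_files` and its parameters are hypothetical and are not taken from the PR; the sketch only shows a user-supplied directory taking precedence over a default one.

```python
from pathlib import Path
from typing import List, Optional


def list_runtime_files(default_dir: Path, user_dir: Optional[Path] = None) -> List[Path]:
    """Return runtime YAML files, preferring a user-supplied directory.

    When `user_dir` is given it replaces the default directory entirely,
    rather than being merged with it.
    """
    runtimes_dir = user_dir if user_dir is not None else default_dir
    if not runtimes_dir.is_dir():
        raise FileNotFoundError(f"Runtimes directory not found: {runtimes_dir}")
    return sorted(runtimes_dir.glob("*.yaml"))
```

Replacing rather than merging keeps the behaviour predictable: the set of available runtimes is exactly what is in the one directory the user pointed at.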

szaher · 2025-04-22T15:48:11Z

sdk/kubeflow/trainer/api/abstract_trainer_client.py

+from kubeflow.trainer.types import types
+
+
+class AbstractTrainerClient(ABC):


I think the name BaseTrainer would make more sense here, or AbstractTrainer?

I think this could cause some confusion because a Trainer type already exists, which is different to the TrainerClient type that we are defining an interface for here.

szaher · 2025-04-22T15:50:58Z

sdk/kubeflow/trainer/api/local_trainer_client.py

+from kubeflow.trainer.utils import utils
+
+
+class LocalTrainerClient(AbstractTrainerClient):


Would ContainerTrainerClient or LocalContainerTrainer be more meaningful here?

szaher · 2025-04-22T15:53:07Z

sdk/kubeflow/trainer/api/local_trainer_client.py

+        )
+
+        if local_runtimes_path is None:
+            self.local_runtimes_path = resources.files(constants.PACKAGE_NAME) / constants.LOCAL_RUNTIMES_PATH


We can use the build system to package runtime definitions to be included along with the code

    [tool.setuptools.package-data]
    "my_package" = ["runtimes/*.yaml"]

or

    from setuptools import setup, find_packages

    setup(
        name='my_package',
        packages=find_packages(),
        package_data={
            'my_package': ["runtimes/*.yaml"],
        },
    )
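
To show how packaged YAML files can then be located at runtime, here is a self-contained sketch using `importlib.resources.files` (Python 3.9+). It builds a throwaway package in a temporary directory so it can run anywhere; the package name `demo_runtimes_pkg` and both helper functions are hypothetical and exist only for this demonstration.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path
from typing import List


def make_demo_package(root: Path, name: str = "demo_runtimes_pkg") -> str:
    """Create a tiny package containing runtimes/torch_distributed.yaml."""
    pkg = root / name
    (pkg / "runtimes").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "runtimes" / "torch_distributed.yaml").write_text(
        "apiVersion: trainer.kubeflow.org/v1alpha1\n"
    )
    return name


def packaged_runtime_names(package: str) -> List[str]:
    """List the YAML files shipped under <package>/runtimes."""
    runtimes = resources.files(package).joinpath("runtimes")
    return sorted(p.name for p in runtimes.iterdir() if p.name.endswith(".yaml"))


with tempfile.TemporaryDirectory() as tmp:
    name = make_demo_package(Path(tmp))
    sys.path.insert(0, tmp)  # make the demo package importable
    try:
        importlib.invalidate_caches()
        found = packaged_runtime_names(name)
    finally:
        sys.path.remove(tmp)

print(found)
```

Resolving files through the package rather than through `__file__` keeps the lookup working however the package is installed, provided the YAML files are declared as package data.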

szaher · 2025-04-22T15:56:05Z

sdk/kubeflow/trainer/config/local_runtimes/torch_distributed.yaml

@@ -0,0 +1,34 @@
+apiVersion: trainer.kubeflow.org/v1alpha1


The idea was to use the same runtime definitions for the trainer here, but we can definitely change that to better, lighter-weight definitions.

szaher · 2025-04-22T15:57:58Z

sdk/kubeflow/trainer/docker_job_client/docker_job_client.py

+from kubeflow.trainer.utils import utils
+
+
+class DockerJobClient:


Let's make it more generic. Let's rename this to ContainerJobClient.

I've been thinking about this and maybe we should get away from the word "client" and instead use "runner".

My idea for this is to create an abstract JobRunner class that a DockerJobRunner and PodmanJobRunner implement.

The init function for the LocalTrainerClient would allow users to provide any type of job runner:

    def __init__(self, job_runner: Optional[JobRunner] = None, ...)

...and users could then specify the runner they want to use (if they do not want to use the default):

    client = LocalTrainerClient(
        # Allows users to specify config for the client, e.g. host address etc.
        job_runner=PodmanJobRunner(),
    )

Yeah, I prefer JobRunner too.
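
A minimal sketch of the runner abstraction discussed in this thread might look like the following. It is not the PR's implementation: the method `build_command` and the class `LocalTrainerClientSketch` are hypothetical, and the runners here only assemble a CLI argument list instead of talking to a container engine.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class JobRunner(ABC):
    """Hypothetical interface for anything that can run a training container."""

    @abstractmethod
    def build_command(self, image: str, args: List[str]) -> List[str]:
        """Return the argv that would start the container."""


class DockerJobRunner(JobRunner):
    def build_command(self, image: str, args: List[str]) -> List[str]:
        return ["docker", "run", "--rm", image, *args]


class PodmanJobRunner(JobRunner):
    def build_command(self, image: str, args: List[str]) -> List[str]:
        return ["podman", "run", "--rm", image, *args]


class LocalTrainerClientSketch:
    """Stand-in for the client; accepts any JobRunner implementation."""

    def __init__(self, job_runner: Optional[JobRunner] = None):
        # The default is created in the body, not in the signature,
        # so every client gets its own runner instance.
        self.job_runner = job_runner if job_runner is not None else DockerJobRunner()


client = LocalTrainerClientSketch(job_runner=PodmanJobRunner())
print(client.job_runner.build_command("example-image:latest", ["python", "train.py"]))
```

Because the client depends only on the abstract `JobRunner`, a further runner (for example a subprocess-based one) could be added without changing the client.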

coveralls · 2025-04-22T16:25:44Z

Pull Request Test Coverage Report for Build 14596676280

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 67.356%

Totals
Change from base Build 14581099833:	0.0%
Covered Lines:	1758
Relevant Lines:	2610

💛 - Coveralls

eoinfennessy · 2025-04-23T10:36:09Z

> @eoinfennessy Drive-by reviewer here... but a quick note: KFP local mode was also implemented in a manner which basically has docker hardcoded into the system. Can I suggest/recommend renaming instances of docker to container, and implementing in a way that makes it possible for users to use Docker or Podman as needed?
>
> KFP local doesn't function without docker on your system without significant finagling, which is a significant barrier to entry.

@anishasthana, thank you for your review and your advice. Implementing this in such a way that many different container runtimes can be used and specified by the user makes sense. We'll aim to do this very soon by adding Podman as a job runner.


franciscojavierarceo · 2025-04-25T20:11:52Z

> @eoinfennessy Drive-by reviewer here... but a quick note: KFP local mode was also implemented in a manner which basically has docker hardcoded into the system. Can I suggest/recommend renaming instances of docker to container, and implementing in a way that makes it possible for users to use Docker or Podman as needed?
>
> KFP local doesn't function without docker on your system without significant finagling, which is a significant barrier to entry.

We will also implement a subprocess runner :D

tenzen-y · 2025-04-25T20:26:33Z

@eoinfennessy, could you first open a proposal PR so that it can be stored in https://github.com/kubeflow/trainer/tree/master/docs/proposals?
I think that would be useful for evaluating your proposal. In the meantime, we can keep this PR open.

/hold


franciscojavierarceo · 2025-05-17T02:26:38Z

Hi @tenzen-y, I think we should definitely open up a KEP.

I also think it'd be useful to release this as an alpha feature (e.g., logging that it is unstable and inviting feedback) so that we can learn early on how useful users find it.

This is beneficial in the sense that (1) we begin moving at a faster pace, which has historically been challenging, (2) we explicitly make users aware of the potential instability of the new tool, and (3) we invite users to engage with us and share feedback. We could even explicitly link to our repo so users can file a GitHub issue.
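
As a rough illustration of the alpha notice suggested here, a client constructor could emit a warning that names the feature and points at a feedback URL. The names `AlphaFeatureWarning` and `warn_alpha` are hypothetical, and the issues URL is an assumption derived from the repository link earlier in this thread.

```python
import warnings


class AlphaFeatureWarning(UserWarning):
    """Hypothetical warning category for unstable, feedback-seeking features."""


def warn_alpha(feature: str, feedback_url: str) -> None:
    """Tell the user that `feature` is alpha and where to send feedback."""
    warnings.warn(
        f"{feature} is an alpha feature and may change without notice. "
        f"Please share feedback at {feedback_url}",
        AlphaFeatureWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )


# Example call, e.g. from a client's __init__:
# warn_alpha("LocalTrainerClient", "https://github.com/kubeflow/trainer/issues")
```

A dedicated warning category lets users who have already seen the notice filter it with the standard `warnings` machinery.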

eoinfennessy · 2025-06-05T09:56:53Z

Closing.

A new PR has been opened at kubeflow/sdk#13

google-oss-prow bot requested a review from Electronic-Waste April 22, 2025 13:20

google-oss-prow bot requested a review from kuizhiqing April 22, 2025 13:20

google-oss-prow bot added the size/XL label Apr 22, 2025

eoinfennessy changed the title ~~Add local trainer client~~ Introduce local trainer client Apr 22, 2025

eoinfennessy mentioned this pull request Apr 22, 2025

[RHOAIENG-22675] Introduce local trainer client opendatahub-io/trainer-sdk#1

Closed

1 task

eoinfennessy added 16 commits April 22, 2025 14:58

f1e3636 Add abstract base class for trainer clients
4b67f1c Add LocalTrainerClient class with unimplemented methods
46a8ea7 Implement 'list_runtimes' method
1243dd1 Implement 'get_runtime' method
9db65cb Introduce LocalJobClient
540190d Use trainer func for container entrypoint and command
936c9bc Enable multi-node distributed torch jobs
7654b09 Add LocalTrainerClient to trainer package
3676ebe Move default LocalJobClient instantiation out of __init__ signature
e09152e Rename LocalJobClient to DockerJobClient
a675acc Rename TrainerClientABC to AbstractTrainerClient
88ed2e5 Use importlib for referencing runtime YAML files
f586ec3 Change "master" to "head"
50adf30 Add example notebook
80cd4bb Use Optional instead of union types
ba16065 Add warning that LocalTrainerClient is an alpha feature

Signed-off-by: Eoin Fennessy <efenness@redhat.com> (all commits)

eoinfennessy force-pushed the add-local-trainer-client branch from a909ecc to ba16065 Compare April 22, 2025 13:59

astefanutti reviewed Apr 22, 2025

szaher reviewed Apr 22, 2025

eoinfennessy added 3 commits April 24, 2025 09:49

bc91f68 Fix pre-commit fails
edebbf4 Update copyright year in new files
fb7e397 Rename DockerJobClient to DockerJobRunner ...and set groundwork for adding more job runners

Signed-off-by: Eoin Fennessy <efenness@redhat.com> (all commits)

eoinfennessy force-pushed the add-local-trainer-client branch from 579fc46 to fb7e397 Compare April 24, 2025 10:49

eoinfennessy added 2 commits April 24, 2025 13:19

946866b Implement get_job methods
fcc6fec Implement list_jobs methods

Signed-off-by: Eoin Fennessy <efenness@redhat.com> (all commits)

google-oss-prow bot added the do-not-merge/hold label Apr 25, 2025

12dea43 Add docstrings

Signed-off-by: Eoin Fennessy <efenness@redhat.com>

google-oss-prow bot added size/XXL and removed size/XL labels May 1, 2025

franciscojavierarceo closed this May 17, 2025

franciscojavierarceo reopened this May 17, 2025

eoinfennessy mentioned this pull request Jun 5, 2025

feat(trainer): Introduce LocalTrainerClient kubeflow/sdk#13

Closed

1 task

eoinfennessy closed this Jun 5, 2025
