Skip to content

Comments

Introduce local trainer client#2610

Closed
eoinfennessy wants to merge 22 commits intokubeflow:masterfrom
eoinfennessy:add-local-trainer-client
Closed

Introduce local trainer client#2610
eoinfennessy wants to merge 22 commits intokubeflow:masterfrom
eoinfennessy:add-local-trainer-client

Conversation

@eoinfennessy
Copy link
Member

@eoinfennessy eoinfennessy commented Apr 22, 2025

What this PR does / why we need it:

This PR introduces LocalTrainerClient to the Python SDK. This client implements the same interface as the existing TrainerClient, and enables users to run training jobs in Docker containers, without requiring a Kubernetes cluster.

Fixes kubeflow/sdk#22

Old PR: opendatahub-io#1

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@google-oss-prow google-oss-prow bot requested a review from kuizhiqing April 22, 2025 13:20
@eoinfennessy eoinfennessy changed the title Add local trainer client Introduce local trainer client Apr 22, 2025
@anishasthana
Copy link
Contributor

@eoinfennessy Drive-by reviewer here... but a quick note: KFP local mode was also implemented in a manner which basically has docker hardcoded into the system. Can I suggest/recommend renaming instances of docker to container, and implementing in a way that makes it possible for users to use Docker or Podman as needed?

KFP local doesn't function without docker on your system without significant finagling, which is a significant barrier to entry.

Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
@eoinfennessy eoinfennessy force-pushed the add-local-trainer-client branch from a909ecc to ba16065 Compare April 22, 2025 13:59
"metadata": {},
"cell_type": "markdown",
"source": [
"# Using `LocalTrainerClient` for MNIST image classification with PyTorch DDP\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the main value of the local mode is portability, so I'd rather demonstrate how to run the existing notebook example rather than duplicating it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main purpose of this notebook is to make readers aware of what LocalTrainerClient is and give a high level idea of how it works by showing the Docker resources created. There is some duplication with one of the other notebooks, but maybe it would detract from the focus of the other notebook if we were to update it to include information about what LocalTrainerClient is and ask users to examine the Docker resources created. WDYT?

Agreed regarding highlighting portability -- maybe we could eventually update the getting started guide on the Kubeflow website to make users aware of the LocalTrainerClient by giving them options to use it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be good for a "quickstart"

"Topic :: Software Development :: Libraries :: Python Modules",
]
dependencies = ["kubernetes>=27.2.0", "pydantic>=2.10.0"]
dependencies = ["kubernetes>=27.2.0", "pydantic>=2.10.0", "docker>=7.1.0"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should probably be an optional dependency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. @szaher, we will need to consider how this will work with plans for extras in the unifying Kubeflow SDK.

@@ -0,0 +1,34 @@
apiVersion: trainer.kubeflow.org/v1alpha1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To demonstrate portability, we should rather not duplicate in-tree training runtimes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea was to use the same runtime definitions for the trainer here, but we can definitly change that to more better/light definitions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can also provide a path to their own runtime YAMLs when creating a LocalTrainerClient. This will override the path to the built-in runtimes, allowing them to use different runtimes to the ones included with the SDK. The built-in runtimes have been included for the user's convenience -- this allows data scientists to start using with the local trainer without requiring them to provide a path to runtimes files.

from kubeflow.trainer.types import types


class AbstractTrainerClient(ABC):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think BaseTrainer name would make more sense here? or AbstractTrainer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could cause some confusion because a Trainer type already exists, which is different to the TrainerClient type that we are defining an interface for here.

from kubeflow.trainer.utils import utils


class LocalTrainerClient(AbstractTrainerClient):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would ContainerTrainerClient or LocalContainerTrainer be more meaningful here?

)

if local_runtimes_path is None:
self.local_runtimes_path = resources.files(constants.PACKAGE_NAME) / constants.LOCAL_RUNTIMES_PATH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can use the build system to package runtime definitions to be included along with the code

[tool.setuptools.package-data]
"my_package" = ["runtimes/*.yaml"]

or

 from setuptools import setup, find_packages

 setup(
     name='my_package',
     packages=find_packages(),
     package_data={
         'my_package': ["runtimes/*.yaml"],
     },
 )

@@ -0,0 +1,34 @@
apiVersion: trainer.kubeflow.org/v1alpha1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea was to use the same runtime definitions for the trainer here, but we can definitly change that to more better/light definitions.

from kubeflow.trainer.utils import utils


class DockerJobClient:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it more generic. Let's rename this to ContainerJobClient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this and maybe we should get away from the word "client" and instead use "runner".

My idea for this is to create an abstract JobRunner class that a DockerJobRunner and PodmanJobRunner implement.

The init function for the LocalTrainerClient would allow users to provide any type of job runner:

def __init__(job_runner: Optional[JobRunner] = None, ...)

...and users could then specify the runner they want to use (if they do not want to use the default):

client = LocalTrainerClient(
    job_runner=PodmanJobRunner(), # Allows users to specify config for the client, e.g. host address etc.
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I am preferential to JobRunner too

@coveralls
Copy link

Pull Request Test Coverage Report for Build 14596676280

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 67.356%

Totals Coverage Status
Change from base Build 14581099833: 0.0%
Covered Lines: 1758
Relevant Lines: 2610

💛 - Coveralls

@eoinfennessy
Copy link
Member Author

@eoinfennessy Drive-by reviewer here... but a quick note: KFP local mode was also implemented in a manner which basically has docker hardcoded into the system. Can I suggest/recommend renaming instances of docker to container, and implementing in a way that makes it possible for users to use Docker or Podman as needed?

KFP local doesn't function without docker on your system without significant finagling, which is a significant barrier to entry.

@anishasthana, thank you for your review and your advice. Implementing this in such a way that many different container runtimes can be used and specified by the user makes sense. We'll aim to do this very soon by adding Podman as a job runner.

Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
...and set groundwork for adding more job runners

Signed-off-by: Eoin Fennessy <efenness@redhat.com>
@eoinfennessy eoinfennessy force-pushed the add-local-trainer-client branch from 579fc46 to fb7e397 Compare April 24, 2025 10:49
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
@franciscojavierarceo
Copy link
Contributor

@eoinfennessy Drive-by reviewer here... but a quick note: KFP local mode was also implemented in a manner which basically has docker hardcoded into the system. Can I suggest/recommend renaming instances of docker to container, and implementing in a way that makes it possible for users to use Docker or Podman as needed?

KFP local doesn't function without docker on your system without significant finagling, which is a significant barrier to entry.

We will also implement subprocess runner :D

@tenzen-y
Copy link
Member

@eoinfennessy, could you open a proposal PR, first to store it in https://github.com/kubeflow/trainer/tree/master/docs/proposals?
I think this is useful to evaluate your proposal. So, we can keep opening this.

/hold

Signed-off-by: Eoin Fennessy <efenness@redhat.com>
@google-oss-prow google-oss-prow bot added size/XXL and removed size/XL labels May 1, 2025
@franciscojavierarceo
Copy link
Contributor

Hi @tenzen-y I think we should definitely open up a KEP.

I also think it'd be useful for us to release this as an alpha feature (e.g., logging that this is an unstable product and invite users to give us feedback) so that we can get early feedback from users to assess how useful users find it.

This is beneficial in the sense that (1) we begin moving at a faster pace and historically that has been challenging, (2) we explicitly make users aware about the potential instability of the new tool, and (3) invites users to engage with us and share feedback. We could even explicitly link to our repo to file a github issue.

@eoinfennessy
Copy link
Member Author

Closing.

A new PR has been opened at kubeflow/sdk#13

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Local Execution of Training Jobs

7 participants