Introduce local trainer client #2610
Conversation
@eoinfennessy Drive-by reviewer here... but a quick note: KFP local mode was also implemented in a manner that basically has Docker hardcoded into the system. Can I suggest/recommend renaming instances of docker to container, and implementing this in a way that makes it possible for users to use Docker or Podman as needed? Without Docker on your system, KFP local doesn't function without significant finagling, which is a significant barrier to entry.
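One way to avoid hardcoding Docker is to probe the PATH for whichever supported runtime is present. This is a hypothetical sketch, not part of the SDK; `find_container_runtime` and its preference order are assumptions for illustration.

```python
import shutil
from typing import Optional, Sequence


def find_container_runtime(
    preferred: Sequence[str] = ("docker", "podman"),
) -> Optional[str]:
    """Return the first supported container CLI found on PATH, or None.

    Hypothetical helper showing how a local mode could support Docker or
    Podman interchangeably instead of assuming Docker is installed.
    """
    for name in preferred:
        if shutil.which(name):  # resolves the CLI on PATH, if any
            return name
    return None
```

A caller could then fail early with a clear message ("no container runtime found; install Docker or Podman") rather than crashing deep inside a Docker-only code path.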
Signed-off-by: Eoin Fennessy <efenness@redhat.com>
Force-pushed from a909ecc to ba16065
    "metadata": {},
    "cell_type": "markdown",
    "source": [
        "# Using `LocalTrainerClient` for MNIST image classification with PyTorch DDP\n",
My understanding of the main value of the local mode is portability, so I'd rather demonstrate how to run the existing notebook example rather than duplicating it.
The main purpose of this notebook is to make readers aware of what LocalTrainerClient is and give a high level idea of how it works by showing the Docker resources created. There is some duplication with one of the other notebooks, but maybe it would detract from the focus of the other notebook if we were to update it to include information about what LocalTrainerClient is and ask users to examine the Docker resources created. WDYT?
Agreed regarding highlighting portability -- maybe we could eventually update the getting started guide on the Kubeflow website to make users aware of the LocalTrainerClient by giving them options to use it?
This would be good for a "quickstart"
        "Topic :: Software Development :: Libraries :: Python Modules",
    ]
    -dependencies = ["kubernetes>=27.2.0", "pydantic>=2.10.0"]
    +dependencies = ["kubernetes>=27.2.0", "pydantic>=2.10.0", "docker>=7.1.0"]
It should probably be an optional dependency?
Agreed. @szaher, we will need to consider how this will work with plans for extras in the unifying Kubeflow SDK.
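A sketch of how this could look in `pyproject.toml`; the `local` extra name is an assumption for illustration, not something decided in this thread.

```toml
[project]
dependencies = ["kubernetes>=27.2.0", "pydantic>=2.10.0"]

[project.optional-dependencies]
# Hypothetical extra name: `pip install kubeflow[local]` would pull in
# container support only for users who want the local trainer.
local = ["docker>=7.1.0"]
```

The client could then raise a helpful ImportError at construction time if `docker` is missing, pointing users at the extra.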
    @@ -0,0 +1,34 @@
    +apiVersion: trainer.kubeflow.org/v1alpha1
To demonstrate portability, we should rather not duplicate in-tree training runtimes.
The idea was to use the same runtime definitions for the trainer here, but we can definitely change that to better/lighter definitions.
Users can also provide a path to their own runtime YAMLs when creating a LocalTrainerClient. This will override the path to the built-in runtimes, allowing them to use different runtimes to the ones included with the SDK. The built-in runtimes have been included for the user's convenience -- this allows data scientists to start using the local trainer without requiring them to provide a path to runtime files.
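The override described above could be sketched as a small resolver. This is illustrative only; `resolve_runtimes_dir` and `DEFAULT_RUNTIMES_DIR` are hypothetical names, and the real SDK resolves the default via `resources.files(constants.PACKAGE_NAME) / constants.LOCAL_RUNTIMES_PATH`.

```python
from pathlib import Path
from typing import Optional

# Hypothetical built-in location bundled next to this module; stands in for
# the packaged runtimes directory shipped with the SDK.
DEFAULT_RUNTIMES_DIR = Path(__file__).resolve().parent / "runtimes"


def resolve_runtimes_dir(local_runtimes_path: Optional[str] = None) -> Path:
    """Prefer a user-supplied runtimes directory over the packaged default."""
    if local_runtimes_path is not None:
        return Path(local_runtimes_path)
    return DEFAULT_RUNTIMES_DIR
```

So `LocalTrainerClient(local_runtimes_path="./my_runtimes")` would use the user's definitions, while the no-argument form falls back to the bundled ones.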
    from kubeflow.trainer.types import types


    class AbstractTrainerClient(ABC):
I think BaseTrainer name would make more sense here? or AbstractTrainer
I think this could cause some confusion because a Trainer type already exists, which is different to the TrainerClient type that we are defining an interface for here.
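To make the Trainer-vs-TrainerClient distinction concrete, here is a minimal sketch; the fields and `train` signature are illustrative assumptions, not the SDK's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Trainer:
    """Stand-in for the existing Trainer *type*: it describes a training job."""
    image: str
    num_nodes: int = 1


class AbstractTrainerClient(ABC):
    """The *client* interface that Kubernetes and local implementations share."""

    @abstractmethod
    def train(self, trainer: Trainer) -> str:
        """Submit a training job and return its name."""
```

Naming the interface `BaseTrainer` would collide conceptually with the `Trainer` configuration type above, which is the confusion the comment is pointing at.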
    from kubeflow.trainer.utils import utils


    class LocalTrainerClient(AbstractTrainerClient):
would ContainerTrainerClient or LocalContainerTrainer be more meaningful here?
    )

    if local_runtimes_path is None:
        self.local_runtimes_path = resources.files(constants.PACKAGE_NAME) / constants.LOCAL_RUNTIMES_PATH
we can use the build system to package runtime definitions to be included along with the code

    [tool.setuptools.package-data]
    "my_package" = ["runtimes/*.yaml"]

or

    from setuptools import setup, find_packages

    setup(
        name='my_package',
        packages=find_packages(),
        package_data={
            'my_package': ["runtimes/*.yaml"],
        },
    )
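Once the YAMLs are shipped as package data, they can be read back with `importlib.resources`. This is a sketch: `load_packaged_runtimes` is a hypothetical helper, and `my_package`/`runtimes` mirror the commenter's example rather than the SDK's actual `constants.PACKAGE_NAME` and `constants.LOCAL_RUNTIMES_PATH`.

```python
from importlib import resources
from typing import Dict


def load_packaged_runtimes(package: str, subdir: str = "runtimes") -> Dict[str, str]:
    """Read all YAML runtime definitions bundled as package data.

    Works whether the package is installed from a wheel or a source tree,
    which is the advantage of importlib.resources over raw file paths.
    """
    root = resources.files(package) / subdir
    return {
        entry.name: entry.read_text()
        for entry in root.iterdir()
        if entry.name.endswith(".yaml")
    }
```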
    from kubeflow.trainer.utils import utils


    class DockerJobClient:
Let's make it more generic. Let's rename this to ContainerJobClient.
I've been thinking about this and maybe we should get away from the word "client" and instead use "runner".

My idea for this is to create an abstract JobRunner class that a DockerJobRunner and PodmanJobRunner implement. The init function for the LocalTrainerClient would allow users to provide any type of job runner:

    def __init__(job_runner: Optional[JobRunner] = None, ...)

...and users could then specify the runner they want to use (if they do not want to use the default):

    client = LocalTrainerClient(
        job_runner=PodmanJobRunner(),  # Allows users to specify config for the client, e.g. host address etc.
    )

Yeah I am preferential to JobRunner too
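A runnable sketch of the design discussed in this thread, assuming the names proposed above (`JobRunner`, `DockerJobRunner`, `PodmanJobRunner`); the runner bodies are stubs, not the real Docker/Podman integration.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class JobRunner(ABC):
    """Abstract runner that LocalTrainerClient delegates container work to."""

    @abstractmethod
    def run(self, image: str, command: List[str]) -> str:
        """Start a container for the job and return an identifier."""


class DockerJobRunner(JobRunner):
    def run(self, image: str, command: List[str]) -> str:
        # Stub: a real implementation would call the Docker SDK here.
        return f"docker:{image}"


class PodmanJobRunner(JobRunner):
    def run(self, image: str, command: List[str]) -> str:
        # Stub: a real implementation would talk to the Podman service here.
        return f"podman:{image}"


class LocalTrainerClient:
    """Constructor shape proposed in the thread: any runner can be injected."""

    def __init__(self, job_runner: Optional[JobRunner] = None):
        # Default to Docker, but accept any JobRunner (Podman, subprocess, ...).
        self.job_runner = job_runner or DockerJobRunner()
```

This keeps the client runtime-agnostic: adding a new backend means implementing one small class rather than touching the client.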
@anishasthana, thank you for your review and your advice. Implementing this in such a way that many different container runtimes can be used and specified by the user makes sense. We'll aim to do this very soon by adding Podman as a job runner.
...and set groundwork for adding more job runners
Force-pushed from 579fc46 to fb7e397
We will also implement subprocess runner :D
@eoinfennessy, could you first open a proposal PR to store it in https://github.com/kubeflow/trainer/tree/master/docs/proposals?

/hold
Hi @tenzen-y, I think we should definitely open up a KEP. I also think it'd be useful for us to release this as an alpha feature (e.g., logging that this is an unstable product and inviting users to give us feedback) so that we can get early feedback and assess how useful users find it. This is beneficial in the sense that (1) we begin moving at a faster pace, which has historically been challenging, (2) we explicitly make users aware of the potential instability of the new tool, and (3) we invite users to engage with us and share feedback. We could even explicitly link to our repo to file a GitHub issue.
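The alpha notice mentioned above could be as simple as a warning emitted when the client is constructed. A sketch; `ALPHA_MESSAGE` and `warn_alpha` are hypothetical names, and the exact wording is an assumption.

```python
import warnings

ALPHA_MESSAGE = (
    "LocalTrainerClient is an alpha feature; APIs and behaviour may change "
    "without notice. Please share feedback and file issues on GitHub."
)


def warn_alpha() -> None:
    """Emit the alpha-stability notice, e.g. from LocalTrainerClient.__init__.

    stacklevel=2 makes the warning point at the caller's line rather than here.
    """
    warnings.warn(ALPHA_MESSAGE, UserWarning, stacklevel=2)
```

Using the `warnings` machinery (rather than a bare print) lets users silence or escalate the notice with standard filters.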
Closing. A new PR has been opened at kubeflow/sdk#13
What this PR does / why we need it:
This PR introduces `LocalTrainerClient` to the Python SDK. This client implements the same interface as the existing `TrainerClient`, and enables users to run training jobs in Docker containers without requiring a Kubernetes cluster.

Fixes kubeflow/sdk#22

Old PR: opendatahub-io#1
Checklist: