chore: add local development with SDK backends guide #261
Open
Raakshass wants to merge 1 commit into kubeflow:main from Raakshass:docs/local-development
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,239 @@ | ||
| Local Development with SDK Backends | ||
| ==================================== | ||
|
|
||
| This guide explains how to run Kubeflow training jobs locally using the SDK's | ||
| different backends, helping you iterate faster before deploying to a Kubernetes | ||
| cluster. | ||
|
|
||
| Overview | ||
| -------- | ||
|
|
||
| The Kubeflow Trainer SDK provides three backends for running training jobs: | ||
|
|
||
| .. list-table:: Backend Comparison | ||
| :header-rows: 1 | ||
| :widths: 20 35 45 | ||
|
|
||
| * - Backend | ||
| - Best For | ||
| - Requirements | ||
| * - **Local Process** | ||
| - Quick prototyping, single-node testing | ||
| - Python 3.9+ | ||
| * - **Container** | ||
| - Multi-node training, reproducibility | ||
| - Docker or Podman installed | ||
| * - **Kubernetes** | ||
| - Production deployments | ||
| - K8s cluster with Trainer operator | ||
|
|
||
| All backends use the same ``TrainerClient`` interface - only the configuration | ||
| changes. This means you can develop locally and deploy to production with | ||
| minimal code changes. | ||
|
|
||
| Local Process Backend | ||
| --------------------- | ||
|
|
||
| The fastest option for quick testing. It runs training directly as Python processes, with no container runtime involved. | ||
|
|
||
| **When to use:** | ||
|
|
||
| - Rapid prototyping and debugging | ||
| - Testing training logic without container overhead | ||
| - Environments without Docker/Podman | ||
|
|
||
| **Example:** | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from kubeflow.trainer import TrainerClient, LocalProcessBackendConfig | ||
| from kubeflow.trainer import CustomTrainer | ||
|
|
||
| # Configure local process backend | ||
| backend_config = LocalProcessBackendConfig() | ||
| client = TrainerClient(backend_config=backend_config) | ||
|
|
||
| # Define your training function | ||
| def train_model(): | ||
|     import torch | ||
|     print(f"Training on device: {torch.cuda.current_device() if torch.cuda.is_available() else 'cpu'}") | ||
|     # Your training logic here | ||
|
|
||
| # Create trainer and run | ||
| trainer = CustomTrainer(func=train_model) | ||
| job_name = client.train(trainer=trainer) | ||
|
|
||
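| # View logs (iterate over the returned lines, as in the View Logs example below) | ||
| for log_line in client.get_job_logs(name=job_name, follow=True): | ||
|     print(log_line) | ||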
|
|
||
| **Limitations:** | ||
|
|
||
| - Single-node only (no distributed training) | ||
| - No container isolation | ||
|
|
||
| Container Backend (Docker/Podman) | ||
| --------------------------------- | ||
|
|
||
| Run training in isolated containers with full multi-node distributed training support. | ||
|
|
||
| **When to use:** | ||
|
|
||
| - Distributed training with multiple workers | ||
| - Reproducible containerized environments | ||
| - Testing production-like setups locally | ||
|
|
||
| **Example with Docker:** | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from kubeflow.trainer import TrainerClient, ContainerBackendConfig | ||
| from kubeflow.trainer import CustomTrainer | ||
|
|
||
| # Configure Docker backend | ||
| backend_config = ContainerBackendConfig( | ||
| container_runtime="docker", # or "podman" | ||
| ) | ||
| client = TrainerClient(backend_config=backend_config) | ||
|
|
||
| # Reuses train_model from the Local Process example, now with multi-node support | ||
| trainer = CustomTrainer( | ||
| func=train_model, | ||
| num_nodes=4, # Distributed across 4 containers | ||
| ) | ||
| job_name = client.train(trainer=trainer) | ||
|
|
||
| .. _container-host-configuration: | ||
|
|
||
| Container Host Configuration | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| When using the Container backend, particularly on **macOS**, you may need to set the | ||
| ``container_host`` parameter to point to your Docker or Podman socket, because | ||
| the default socket path differs across operating systems. | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 20 80 | ||
|
|
||
| * - OS | ||
| - Default ``container_host`` | ||
| * - Linux | ||
| - ``unix:///var/run/docker.sock`` (Docker) or ``unix:///run/user/<UID>/podman/podman.sock`` (Podman) | ||
| * - macOS | ||
| - ``unix://$HOME/.docker/run/docker.sock`` (Docker Desktop) or check ``podman machine inspect`` for Podman | ||
| * - Windows | ||
| - ``npipe:////./pipe/docker_engine`` (Docker Desktop) | ||
|
|
||
| **Example for macOS:** | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| import os | ||
|
|
||
| from kubeflow.trainer import TrainerClient, ContainerBackendConfig | ||
|
|
||
| backend_config = ContainerBackendConfig( | ||
| container_runtime="docker", | ||
| # macOS Docker Desktop socket path | ||
| container_host=f"unix://{os.environ['HOME']}/.docker/run/docker.sock", | ||
| ) | ||
| client = TrainerClient(backend_config=backend_config) | ||
|
|
||
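The table above can be folded into a small lookup so the same script picks a sensible default on each platform. This is only a sketch: ``default_container_host`` is a hypothetical helper, not part of the Kubeflow SDK, and the paths are just the defaults listed in the table, which may not match a customized installation.

```python
import os
import platform
from typing import Optional


def default_container_host(
    runtime: str = "docker",
    system: Optional[str] = None,
    home: Optional[str] = None,
    uid: Optional[int] = None,
) -> Optional[str]:
    """Return the default socket URL from the table, or None if there is no fixed default."""
    system = system or platform.system()  # "Linux", "Darwin" (macOS) or "Windows"
    home = home or os.path.expanduser("~")

    if system == "Linux":
        if runtime == "docker":
            return "unix:///var/run/docker.sock"
        if runtime == "podman":
            if uid is None:
                uid = os.getuid()  # only reached on Linux, where getuid exists
            return f"unix:///run/user/{uid}/podman/podman.sock"
    elif system == "Darwin":
        if runtime == "docker":
            return f"unix://{home}/.docker/run/docker.sock"
        # Podman on macOS has no fixed path; check `podman machine inspect`.
    elif system == "Windows":
        if runtime == "docker":
            return "npipe:////./pipe/docker_engine"
    return None
```

When the helper returns a value, it can be passed as ``container_host`` to ``ContainerBackendConfig`` as in the macOS example; when it returns ``None``, look the path up manually.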
| .. note:: | ||
|
|
||
| If you encounter ``Cannot connect to Docker daemon`` errors on macOS, | ||
| verify the socket path by running ``docker context inspect`` and checking | ||
| the ``Host`` value in the output. | ||
|
|
||
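To make the check in the note concrete, the sketch below extracts the ``Host`` value from ``docker context inspect`` output. It assumes the command prints a JSON array whose entries keep the endpoint under ``Endpoints.docker.Host``; ``docker_host_from_inspect`` is a hypothetical helper and the sample JSON is illustrative, not captured from a real machine.

```python
import json
from typing import Optional


def docker_host_from_inspect(inspect_output: str) -> Optional[str]:
    """Return the Host of the first context in `docker context inspect` JSON output."""
    contexts = json.loads(inspect_output)
    if not contexts:
        return None
    # Assumed layout: [{"Endpoints": {"docker": {"Host": "..."}}}]
    return contexts[0].get("Endpoints", {}).get("docker", {}).get("Host")


# Illustrative sample only; real output contains more fields.
sample = (
    '[{"Name": "desktop-linux", "Endpoints": {"docker": '
    '{"Host": "unix:///Users/me/.docker/run/docker.sock"}}}]'
)
print(docker_host_from_inspect(sample))
```

In practice the input could come from ``subprocess.run(["docker", "context", "inspect"], capture_output=True, text=True, check=True).stdout``, and the result can be passed as ``container_host``.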
| **Choosing Docker vs Podman:** | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 30 70 | ||
|
|
||
| * - Runtime | ||
| - Recommended For | ||
| * - Docker | ||
| - General use, especially on macOS/Windows | ||
| * - Podman | ||
| - Linux servers, rootless/security-focused environments | ||
|
|
||
| Switching Between Backends | ||
| -------------------------- | ||
|
|
||
| The key benefit of the SDK is seamless backend switching. Your training code | ||
| stays the same - only the backend configuration changes: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| # Choose ONE of the three configurations below; they are alternatives, not sequential steps | ||
|
|
||
| # Development: use the local process backend for fast iteration | ||
| from kubeflow.trainer import LocalProcessBackendConfig | ||
| backend_config = LocalProcessBackendConfig() | ||
|
|
||
| # Testing: switch to Docker for distributed testing | ||
| from kubeflow.trainer import ContainerBackendConfig | ||
| backend_config = ContainerBackendConfig(container_runtime="docker") | ||
|
|
||
| # Production: deploy to Kubernetes | ||
| from kubeflow.trainer import KubernetesBackendConfig | ||
| backend_config = KubernetesBackendConfig(namespace="kubeflow") | ||
|
|
||
| # The same client and trainer code works with every backend | ||
| # (trainer is the CustomTrainer defined in the earlier examples) | ||
| from kubeflow.trainer import TrainerClient | ||
| client = TrainerClient(backend_config=backend_config) | ||
| job_name = client.train(trainer=trainer) | ||
|
|
||
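One way to keep the choice of backend out of the training script is to select it by name at startup. The sketch below is an assumption-laden illustration: ``backend_spec``, ``make_backend_config`` and the ``TRAINER_BACKEND`` environment variable are hypothetical names, while the config class names and keyword arguments are the ones used in the examples above. ``kubeflow.trainer`` is imported lazily so the mapping itself can be inspected without the SDK installed.

```python
import importlib
from typing import Any, Dict, Tuple

# Backend name -> (config class name, keyword arguments), taken from the examples above.
_BACKENDS: Dict[str, Tuple[str, Dict[str, Any]]] = {
    "local": ("LocalProcessBackendConfig", {}),
    "container": ("ContainerBackendConfig", {"container_runtime": "docker"}),
    "kubernetes": ("KubernetesBackendConfig", {"namespace": "kubeflow"}),
}


def backend_spec(name: str) -> Tuple[str, Dict[str, Any]]:
    """Look up the config class name and kwargs for a backend name (case-insensitive)."""
    try:
        class_name, kwargs = _BACKENDS[name.lower()]
    except KeyError:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {sorted(_BACKENDS)}"
        ) from None
    return class_name, dict(kwargs)  # copy so callers cannot mutate the table


def make_backend_config(name: str):
    """Instantiate the config class; requires the Kubeflow SDK to be installed."""
    class_name, kwargs = backend_spec(name)
    module = importlib.import_module("kubeflow.trainer")
    return getattr(module, class_name)(**kwargs)
```

A script could then call ``make_backend_config(os.environ.get("TRAINER_BACKEND", "local"))`` and pass the result to ``TrainerClient(backend_config=...)``.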
| Common Operations | ||
| ----------------- | ||
|
|
||
| These operations work identically across all backends: | ||
|
|
||
| **List Jobs:** | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| jobs = client.list_jobs() | ||
| for job in jobs: | ||
|     print(f"{job.name}: {job.status}") | ||
|
|
||
| **View Logs:** | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| # Follow logs in real-time | ||
| for log_line in client.get_job_logs(name=job_name, follow=True): | ||
|     print(log_line) | ||
|
|
||
| **Wait for Completion:** | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| job = client.wait_for_job_status( | ||
| name=job_name, | ||
| timeout=3600, # 1 hour timeout | ||
| ) | ||
|
|
||
| **Delete Jobs:** | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| client.delete_job(name=job_name) | ||
|
|
||
| Troubleshooting | ||
| --------------- | ||
|
|
||
| **Local Process Backend:** | ||
|
|
||
| - ``ModuleNotFoundError``: Ensure dependencies are installed in the current environment | ||
| - Training hangs: Check for infinite loops in your training function | ||
|
|
||
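A quick way to catch the ``ModuleNotFoundError`` case before starting a job is to check that the training function's imports resolve in the current environment. This is a minimal sketch using only the standard library; ``missing_modules`` is a hypothetical helper and handles top-level module names only.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# List whatever your training function imports.
print(missing_modules(["json", "torch"]))
```

An empty list means every named module can be imported by the interpreter that will run the local process backend.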
| **Container Backend:** | ||
|
|
||
| - ``Cannot connect to Docker daemon``: Start the Docker or Podman service. On macOS, | ||
| verify the socket path; see :ref:`container-host-configuration`. | ||
| - Image pull errors: Check network connectivity and image registry access | ||
| - Permission denied: For Podman, ensure rootless mode is configured | ||
|
|
||
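For ``Cannot connect to Docker daemon`` errors, a first check is whether the configured ``unix://`` socket exists at all. The helpers below are a hypothetical sketch (they are not SDK functions) and only cover ``unix://`` hosts; a socket that is present does not guarantee the daemon is accepting connections.

```python
import os
import stat
from typing import Optional


def unix_socket_path(container_host: str) -> Optional[str]:
    """Strip the unix:// scheme; return None for other schemes such as npipe://."""
    prefix = "unix://"
    if not container_host.startswith(prefix):
        return None
    return container_host[len(prefix):]


def socket_is_present(container_host: str) -> bool:
    """True if container_host names an existing Unix socket file."""
    path = unix_socket_path(container_host)
    if path is None:
        return False
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:  # missing file, permission problem, etc.
        return False


print(socket_is_present("unix:///var/run/docker.sock"))
```

If this prints ``False`` for the path you pass as ``container_host``, start the runtime or correct the path before retrying.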
| Next Steps | ||
| ---------- | ||
|
|
||
| - `Custom Training <../train/custom-training.html>`_ - Define your trainers | ||
| - `Distributed Training <../train/distributed.html>`_ - Scale across nodes | ||
| - `Kubeflow Trainer Docs <https://www.kubeflow.org/docs/components/trainer/>`_ - Full documentation |