diff --git a/docs/source/getting-started/index.rst b/docs/source/getting-started/index.rst
index b9a014112..94cc2fbd2 100644
--- a/docs/source/getting-started/index.rst
+++ b/docs/source/getting-started/index.rst
@@ -55,7 +55,7 @@ Here's how simple it is to train a model:
 Next Steps
 ----------
 
-.. grid:: 2
+.. grid:: 3
    :gutter: 3
 
    .. grid-item-card:: Installation
@@ -69,3 +69,9 @@
       :link-type: doc
 
      Train your first model step-by-step.
+
+   .. grid-item-card:: Local Development
+      :link: local-development
+      :link-type: doc
+
+      Run training jobs locally using different SDK backends.
diff --git a/docs/source/getting-started/local-development.rst b/docs/source/getting-started/local-development.rst
new file mode 100644
index 000000000..85f4b139b
--- /dev/null
+++ b/docs/source/getting-started/local-development.rst
@@ -0,0 +1,239 @@
+Local Development with SDK Backends
+===================================
+
+This guide explains how to run Kubeflow training jobs locally using the SDK's
+different backends, so you can iterate faster before deploying to a Kubernetes
+cluster.
+
+Overview
+--------
+
+The Kubeflow Trainer SDK provides three backends for running training jobs:
+
+.. list-table:: Backend Comparison
+   :header-rows: 1
+   :widths: 20 35 45
+
+   * - Backend
+     - Best For
+     - Requirements
+   * - **Local Process**
+     - Quick prototyping, single-node testing
+     - Python 3.9+
+   * - **Container**
+     - Multi-node training, reproducibility
+     - Docker or Podman installed
+   * - **Kubernetes**
+     - Production deployments
+     - K8s cluster with the Trainer operator
+
+All backends share the same ``TrainerClient`` interface - only the
+configuration changes. This means you can develop locally and deploy to
+production with minimal code changes.
+
+Local Process Backend
+---------------------
+
+The fastest option for quick testing: training runs directly as local Python
+processes.
+
+**When to use:**
+
+- Rapid prototyping and debugging
+- Testing training logic without container overhead
+- Environments without Docker/Podman
+
+**Example:**
+
+.. code-block:: python
+
+    from kubeflow.trainer import CustomTrainer, LocalProcessBackendConfig, TrainerClient
+
+    # Configure the local process backend
+    backend_config = LocalProcessBackendConfig()
+    client = TrainerClient(backend_config=backend_config)
+
+    # Define your training function
+    def train_model():
+        import torch
+        device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
+        print(f"Training on device: {device}")
+        # Your training logic here
+
+    # Create the trainer and run
+    trainer = CustomTrainer(func=train_model)
+    job_name = client.train(trainer=trainer)
+
+    # Stream logs
+    for line in client.get_job_logs(name=job_name, follow=True):
+        print(line)
+
+**Limitations:**
+
+- Single-node only (no distributed training)
+- No container isolation
+
+Container Backend (Docker/Podman)
+---------------------------------
+
+Run training in isolated containers with full multi-node distributed training
+support.
+
+**When to use:**
+
+- Distributed training with multiple workers
+- Reproducible containerized environments
+- Testing production-like setups locally
+
+**Example with Docker:**
+
+.. code-block:: python
+
+    from kubeflow.trainer import ContainerBackendConfig, CustomTrainer, TrainerClient
+
+    # Configure the Docker backend
+    backend_config = ContainerBackendConfig(
+        container_runtime="docker",  # or "podman"
+    )
+    client = TrainerClient(backend_config=backend_config)
+
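+    # A minimal distributed-aware training function. This sketch assumes the
+    # container backend launches each worker with torchrun-style environment
+    # variables (RANK, WORLD_SIZE); adjust if your runtime injects different ones.
+    def train_model():
+        import os
+        rank = int(os.environ.get("RANK", "0"))
+        world_size = int(os.environ.get("WORLD_SIZE", "1"))
+        print(f"Worker {rank} of {world_size} starting")
+        # Your distributed training logic here
+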
+    # The same trainer code works - now with multi-node support!
+    trainer = CustomTrainer(
+        func=train_model,
+        num_nodes=4,  # Distributed across 4 containers
+    )
+    job_name = client.train(trainer=trainer)
+
+.. _container-host-configuration:
+
+Container Host Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When using the Container backend on **macOS**, you may need to configure the
+``container_host`` parameter to point to your Docker or Podman socket, because
+the default socket path differs across operating systems.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - OS
+     - Default ``container_host``
+   * - Linux
+     - ``unix:///var/run/docker.sock`` (Docker) or ``unix:///run/user/<UID>/podman/podman.sock`` (Podman)
+   * - macOS
+     - ``unix://$HOME/.docker/run/docker.sock`` (Docker Desktop); for Podman, check ``podman machine inspect``
+   * - Windows
+     - ``npipe:////./pipe/docker_engine`` (Docker Desktop)
+
+**Example for macOS:**
+
+.. code-block:: python
+
+    import os
+
+    backend_config = ContainerBackendConfig(
+        container_runtime="docker",
+        # macOS Docker Desktop socket path
+        container_host=f"unix://{os.environ['HOME']}/.docker/run/docker.sock",
+    )
+    client = TrainerClient(backend_config=backend_config)
+
+.. note::
+
+   If you encounter ``Cannot connect to Docker daemon`` errors on macOS,
+   verify the socket path by running ``docker context inspect`` and checking
+   the ``Host`` value in the output.
+
+**Choosing Docker vs Podman:**
+
+.. list-table::
+   :header-rows: 1
+   :widths: 30 70
+
+   * - Runtime
+     - Recommended For
+   * - Docker
+     - General use, especially on macOS/Windows
+   * - Podman
+     - Linux servers, rootless/security-focused environments
+
+Switching Between Backends
+--------------------------
+
+The key benefit of the SDK is seamless backend switching. Your training code
+stays the same - only the backend configuration changes:
+
+.. code-block:: python
+
+    # Development: use the local process backend for fast iteration
+    from kubeflow.trainer import LocalProcessBackendConfig
+    backend_config = LocalProcessBackendConfig()
+
+    # Testing: switch to Docker for distributed testing
+    from kubeflow.trainer import ContainerBackendConfig
+    backend_config = ContainerBackendConfig(container_runtime="docker")
+
+    # Production: deploy to Kubernetes
+    from kubeflow.trainer import KubernetesBackendConfig
+    backend_config = KubernetesBackendConfig(namespace="kubeflow")
+
+    # The same client and trainer code works with all backends!
+    client = TrainerClient(backend_config=backend_config)
+    job_name = client.train(trainer=trainer)
+
+Common Operations
+-----------------
+
+These operations work identically across all backends:
+
+**List Jobs:**
+
+.. code-block:: python
+
+    jobs = client.list_jobs()
+    for job in jobs:
+        print(f"{job.name}: {job.status}")
+
+**View Logs:**
+
+.. code-block:: python
+
+    # Follow logs in real time
+    for log_line in client.get_job_logs(name=job_name, follow=True):
+        print(log_line)
+
+**Wait for Completion:**
+
+.. code-block:: python
+
+    job = client.wait_for_job_status(
+        name=job_name,
+        timeout=3600,  # 1-hour timeout
+    )
+
+**Delete Jobs:**
+
+.. code-block:: python
+
+    client.delete_job(name=job_name)
+
+Troubleshooting
+---------------
+
+**Local Process Backend:**
+
+- ``ModuleNotFoundError``: Ensure dependencies are installed in the current environment
+- Training hangs: Check for infinite loops in your training function
+
+**Container Backend:**
+
+- ``Cannot connect to Docker daemon``: Start the Docker/Podman service. On macOS,
+  verify the socket path (see :ref:`container-host-configuration`).
+- Image pull errors: Check network connectivity and image registry access
+- Permission denied: For Podman, ensure rootless mode is configured
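+
+If the daemon error persists, a quick sanity check is to probe the socket
+directly before constructing the client. The helper below is a minimal sketch
+using only the standard library; ``socket_reachable`` is a hypothetical name,
+not part of the SDK:
+
+.. code-block:: python
+
+    import os
+    import socket
+
+    def socket_reachable(container_host: str) -> bool:
+        """Return True if the UNIX socket behind container_host accepts connections."""
+        path = container_host.removeprefix("unix://")  # e.g. /var/run/docker.sock
+        if not os.path.exists(path):
+            return False
+        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        try:
+            s.connect(path)
+            return True
+        except OSError:
+            return False
+        finally:
+            s.close()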
+
+Next Steps
+----------
+
+- `Custom Training <../train/custom-training.html>`_ - Define your trainers
+- `Distributed Training <../train/distributed.html>`_ - Scale across nodes
+- `Kubeflow Trainer Docs <https://www.kubeflow.org/docs/components/trainer/>`_ - Full documentation
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 698296379..2eb59b2dc 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -136,6 +136,7 @@ Getting Involved
 
    getting-started/installation
    getting-started/quickstart
+   getting-started/local-development
 
 .. toctree::
    :maxdepth: 2