From ab33c7e7f1c61946b66d239c2fd46ce74dd0e165 Mon Sep 17 00:00:00 2001 From: vinayak sharma Date: Wed, 11 Feb 2026 09:41:09 +0530 Subject: [PATCH 1/4] docs: add humanized tutorial series for real-world scenarios Signed-off-by: vinayak sharma --- content/en/docs/tutorials/_index.md | 38 ++++++ content/en/docs/tutorials/argo-workflows.md | 123 +++++++++++++++++ content/en/docs/tutorials/gpu-scheduling.md | 101 ++++++++++++++ content/en/docs/tutorials/multi-tenancy.md | 134 ++++++++++++++++++ content/en/docs/tutorials/spark.md | 113 ++++++++++++++++ content/en/docs/tutorials/tensorflow.md | 143 ++++++++++++++++++++ 6 files changed, 652 insertions(+) create mode 100644 content/en/docs/tutorials/_index.md create mode 100644 content/en/docs/tutorials/argo-workflows.md create mode 100644 content/en/docs/tutorials/gpu-scheduling.md create mode 100644 content/en/docs/tutorials/multi-tenancy.md create mode 100644 content/en/docs/tutorials/spark.md create mode 100644 content/en/docs/tutorials/tensorflow.md diff --git a/content/en/docs/tutorials/_index.md b/content/en/docs/tutorials/_index.md new file mode 100644 index 00000000..af1adef2 --- /dev/null +++ b/content/en/docs/tutorials/_index.md @@ -0,0 +1,38 @@ ++++ +title = "Tutorial Series" +linktitle = "Tutorials" +date = 2026-02-11 +publishdate = 2026-02-11 +lastmod = 2026-02-11 +draft = false +toc = true +type = "docs" + +[menu.docs] + parent = "getting-started" + weight = 2 ++++ + +Welcome to the Volcano Tutorial Series. This section provides end-to-end guides for running production-grade batch workloads on Kubernetes using Volcano. + +## Why These Tutorials? + +While the core documentation explains *how* Volcano's individual features work, these tutorials demonstrate *why* they matter in real-world environments. 
Each guide is structured to help you move from a problem statement to a working deployment:

- **Background**: Understand the specific challenges (e.g., gang scheduling, resource starvation) addressed by the tutorial.
- **Scenario**: A practical use case you might encounter in a production cluster.
- **Step-by-Step Deployment**: Clear commands and complete, ready-to-use YAML manifests.
- **Verification**: How to confirm your job is running and being scheduled correctly.

## Tutorial Series

- **[Distributed TensorFlow](tensorflow)**: Orchestrate high-performance ML training jobs with parameter servers and workers.
- **[Apache Spark](spark)**: Prevent resource starvation in big data processing pipelines.
- **[GPU Resource Management](gpu-scheduling)**: Maximize hardware efficiency through fractional sharing (vGPU) and isolation.
- **[Multi-tenancy](multi-tenancy)**: Configure fair share scheduling and hierarchical queues for different teams.
- **[Argo Workflows](argo-workflows)**: Integrate Volcano's advanced scheduling into your CI/CD and data pipelines.

---

> [!TIP]
> Each tutorial is designed to be self-contained. However, if you are new to Volcano, we recommend starting with the [Quick Start: Deploy a Volcano Job](https://volcano.sh/en/docs/tutorials/#quick-start-deploy-a-volcano-job) guide.
diff --git a/content/en/docs/tutorials/argo-workflows.md b/content/en/docs/tutorials/argo-workflows.md
new file mode 100644
index 00000000..ffa8bbea
--- /dev/null
+++ b/content/en/docs/tutorials/argo-workflows.md
@@ -0,0 +1,123 @@
+++
title = "Integrating with Argo Workflows"
linktitle = "Argo Workflows"
date = 2026-02-11
publishdate = 2026-02-11
lastmod = 2026-02-11
draft = false
toc = true
type = "docs"

[menu.docs]
  parent = "tutorials"
  weight = 50
+++

This tutorial shows how to use Volcano as the scheduler for Argo Workflows to gain advanced batch scheduling features for your CI/CD and data processing pipelines.
+ +## Background + +Argo Workflows is a popular cloud-native workflow engine for orchestrating parallel jobs on Kubernetes. While Argo excels at managing dependencies and execution flow, it often relies on the default Kubernetes scheduler for individual steps. + +By integrating Volcano as the scheduler for Argo Workflows, you unlock advanced batch scheduling capabilities: + +- **Bin-packing**: Optimize resource utilization by packing tasks onto the fewest number of nodes. +- **Fair Sharing**: Ensure that workflow steps across different tenants or namespaces are scheduled fairly according to configured weights. +- **Gang Scheduling**: For workflows involving multiple parallel pods that must start together, Volcano ensures they are managed as a single unit (PodGroup). + +## Scenario + +A common workflow scenario involves a "Main" entrypoint that triggers one or more "Task" steps. In this tutorial, you will configure a simple Argo Workflow to use Volcano for its underlying pod scheduling. + +## Prerequisites + +Before you begin, ensure you have: +- A Kubernetes cluster with Volcano installed. +- [Argo Workflows](https://argoproj.github.io/argo-workflows/installation/) installed in your cluster. + +## Deployment Step-by-Step + +### 1. Create the Workflow Manifest + +You can configure Argo to use Volcano at the individual template level using the `schedulerName` field. Create a file named `volcano-workflow.yaml`: + +```yaml +apiVersion: argoproj.io/v1alpha1 +kind: Workflow +metadata: + generateName: volcano-workflow- +spec: + entrypoint: main + templates: + - name: main + steps: + - - name: step1 + template: whalesay + - name: whalesay + container: + image: docker/whalesay + command: [cowsay] + args: ["Hello from Argo + Volcano!"] + schedulerName: volcano # Explicitly tell Argo to use Volcano +``` + +### 2. 
Apply the Workflow

Run the following command to submit your workflow:

```bash
argo submit volcano-workflow.yaml
```

## Advanced: Deploying VolcanoJobs from Argo

For tasks that require native Volcano features like `minAvailable` or specific `plugins`, you can submit a `VolcanoJob` directly as a resource template:

```yaml
  - name: volcano-job-step
    resource:
      action: create
      successCondition: status.state.phase == Completed # Wait for the Job to finish
      manifest: |
        apiVersion: batch.volcano.sh/v1alpha1
        kind: Job
        metadata:
          generateName: argo-step-
        spec:
          minAvailable: 1
          schedulerName: volcano
          tasks:
            - replicas: 1
              name: main
              template:
                spec:
                  containers:
                    - name: main
                      image: alpine
                      command: ["echo", "running inside volcano job"]
```

## Verification

### Check Workflow Status

Monitor the progress of your workflow using the Argo CLI:

```bash
argo get @latest
```

### Verify the Scheduler

Check the details of a pod created by the workflow to ensure it was handled by Volcano:

```bash
kubectl get pod <workflow-pod-name> -o jsonpath='{.spec.schedulerName}'
```

The output should be `volcano`.

## Notes

- **Global Configuration**: You can make Volcano the default scheduler for *all* Argo Workflows by setting `spec.schedulerName: volcano` under the `workflowDefaults` entry of the `workflow-controller-configmap`.
- **ServiceAccount Permissions**: If using the `resource` template to create `VolcanoJobs`, ensure the ServiceAccount the workflow runs under has RBAC permissions to `create`, `get`, `list`, and `watch` resources in the `batch.volcano.sh` group.
- **PodGroups**: When a pod is scheduled by Volcano, a `PodGroup` is automatically created. You can inspect it with `kubectl get podgroups`.
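
The ServiceAccount requirement above can be sketched as a minimal Role and RoleBinding. This is an illustrative manifest, not taken from the Argo or Volcano docs — the names (`argo-volcano-role`, the `argo` namespace, and the `default` ServiceAccount) are assumptions you should adapt to your setup:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-volcano-role   # illustrative name
  namespace: argo           # namespace where the workflow runs
rules:
  - apiGroups: ["batch.volcano.sh"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-volcano-binding
  namespace: argo
subjects:
  - kind: ServiceAccount
    name: default           # the ServiceAccount the workflow pods run as
    namespace: argo
roleRef:
  kind: Role
  name: argo-volcano-role
  apiGroup: rbac.authorization.k8s.io
```

Note that `resource` templates are executed with the ServiceAccount of the workflow pod itself, which you can set per workflow via `spec.serviceAccountName`.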
diff --git a/content/en/docs/tutorials/gpu-scheduling.md b/content/en/docs/tutorials/gpu-scheduling.md
new file mode 100644
index 00000000..f8651e00
--- /dev/null
+++ b/content/en/docs/tutorials/gpu-scheduling.md
@@ -0,0 +1,101 @@
+++
title = "GPU Scheduling and Resource Management"
linktitle = "GPU Scheduling"
date = 2026-02-11
publishdate = 2026-02-11
lastmod = 2026-02-11
draft = false
toc = true
type = "docs"

[menu.docs]
  parent = "tutorials"
  weight = 30
+++

This tutorial covers how to efficiently manage GPU resources using Volcano, including fractional GPU sharing (vGPU) and hardware-based isolation (MIG).

## Background

GPUs are high-performance but expensive resources. In standard Kubernetes, a physical GPU is typically treated as an indivisible unit—one GPU can only be assigned to one container. This often leads to significant underutilization, especially for smaller workloads like model inference or development tasks that don't require the full compute power or memory of a modern GPU.

Volcano addresses this by providing robust **vGPU (virtual GPU) scheduling**. This allows you to:

- **Fractional Sharing**: Slice a single physical GPU into multiple virtual GPUs (vGPUs).
- **Resource Isolation**: Enforce specific compute (cores) and memory limits for each container sharing the physical hardware.
- **Multiple Modes**: Support both software-based slicing (via VCUDA) and hardware-based isolation (via NVIDIA MIG).

## Scenario

Suppose you have a cluster where multiple users need to run lightweight inference tasks. Instead of dedicating one physical GPU to each user, you can partition each GPU to support multiple users simultaneously.

In this tutorial, you will deploy a Volcano Job that requests a fractional share of a GPU: **20% of the compute power** and **2000MiB of memory**.

## Prerequisites

Before you begin, ensure you have:
- A Kubernetes cluster with nodes equipped with NVIDIA GPUs.
- The [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) installed on your nodes.
- Volcano installed and the `volcano-vgpu-device-plugin` deployed.

## Deployment Step-by-Step

### 1. Create the GPU Sharing Manifest

Create a file named `gpu-sharing-job.yaml` with the following content:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpu-sharing-tutorial
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - replicas: 1
      name: gpu-task
      template:
        spec:
          containers:
            - name: gpu-container
              image: nvidia/cuda:11.0-base
              command: ["sh", "-c", "nvidia-smi && sleep 3600"]
              resources:
                limits:
                  volcano.sh/vgpu-number: 1 # Request 1 virtual GPU
                  volcano.sh/vgpu-memory: 2000 # Limit to 2000MiB of GPU memory
                  volcano.sh/vgpu-cores: 20 # Limit to 20% of GPU compute
          restartPolicy: Never
```

### 2. Apply the Manifest

Run the following command to deploy the job:

```bash
kubectl apply -f gpu-sharing-job.yaml
```

## Verification

### Check Resource Allocation

Verify that your pod has been scheduled to a node with available vGPU resources:

```bash
kubectl get pods -l volcano.sh/job-name=gpu-sharing-tutorial
```

### Inspect the Container

Check the logs to verify that the container correctly detects the GPU environment via `nvidia-smi`. Volcano names pods `<job-name>-<task-name>-<index>`, so the pod created above is:

```bash
kubectl logs gpu-sharing-tutorial-gpu-task-0
```

Even though it is a shared physical GPU, the `volcano-vgpu-device-plugin` ensures the container only utilizes the allocated memory and compute slices.

## Notes

- **Insufficient Resources**: If pods remain `Pending` with "insufficient volcano.sh/vgpu-number", check if your nodes are correctly labeled and the `volcano-vgpu-device-plugin` is healthy.
- **Memory Limits**: If your application fails with Out of Memory (OOM) on the GPU, ensure the `vgpu-memory` limit is large enough for your specific model requirements.
- **Hardware Isolation**: For mission-critical workloads requiring strict hardware-level isolation, consider using **Dynamic MIG** mode if your hardware supports it (e.g., A100/H100).
diff --git a/content/en/docs/tutorials/multi-tenancy.md b/content/en/docs/tutorials/multi-tenancy.md
new file mode 100644
index 00000000..2790f380
--- /dev/null
+++ b/content/en/docs/tutorials/multi-tenancy.md
@@ -0,0 +1,134 @@
+++
title = "Multi-tenant Queue Configuration"
linktitle = "Multi-tenancy"
date = 2026-02-11
publishdate = 2026-02-11
lastmod = 2026-02-11
draft = false
toc = true
type = "docs"

[menu.docs]
  parent = "tutorials"
  weight = 40
+++

This tutorial demonstrates how to set up a multi-tenant environment using Volcano's hierarchical queues, enabling fair resource sharing between different teams or departments.

## Background

In multi-tenant clusters, different teams or departments share the same underlying resources. Without proper management, one team's high-volume workload can monopolize the cluster, causing "starvation" for other users.

Volcano's **Hierarchical Queues** provide a sophisticated mechanism to handle these scenarios by allowing you to:

- **Guarantees and Limits**: Set minimum resource entitlements (`deserved`) to ensure every team has access to capacity.
- **Fair Sharing**: Dynamically re-allocate unused resources between teams based on their configured weights.
- **Resource Reclamation**: Automatically reclaim resources from teams over-using their share when other teams have pending jobs.

## Scenario

Consider a "Research" department with two sub-teams: **Team A** and **Team B**. You want to ensure that even if Team A submits a massive batch of jobs, Team B is guaranteed at least 50% of the department's allocated capacity when they need it.

In this tutorial, you will build a queue hierarchy and deploy a job to verify that resources are allocated according to the teams' entitlements.
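
Before creating any resources, it helps to picture the target hierarchy. The tree below summarizes the layout built in the deployment steps (queue names and values match the manifests that follow):

```text
research-parent (capability: 10 CPU / 20Gi)
├── team-a-queue (deserved: 5 CPU / 10Gi, weight: 1)
└── team-b-queue (deserved: 5 CPU / 10Gi, weight: 1)
```

With this layout, either team can borrow up to the parent's full 10 CPU while the other is idle, but can always be reclaimed back down to its 5-CPU `deserved` share.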
+ +## Prerequisites + +Before you begin, ensure you have: +- A Kubernetes cluster with Volcano installed. +- Admin permissions to create `Queue` resources. + +## Deployment Step-by-Step + +### 1. Create the Queue Hierarchy + +Create a file named `team-queues.yaml` to define a parent-child relationship between the department and the teams: + +```yaml +# 1. Define the parent Research Queue +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: research-parent +spec: + weight: 1 + capability: + cpu: 10 + memory: 20Gi +--- +# 2. Define Team A's Queue as a child +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: team-a-queue +spec: + parent: research-parent + weight: 1 + deserved: # Guaranteed capacity for Team A + cpu: 5 + memory: 10Gi +--- +# 3. Define Team B's Queue as a child +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: team-b-queue +spec: + parent: research-parent + weight: 1 + deserved: # Guaranteed capacity for Team B + cpu: 5 + memory: 10Gi +``` + +### 2. Apply the Queues + +Run the following command to establish the hierarchy: + +```bash +kubectl apply -f team-queues.yaml +``` + +### 3. Deploy a Job to a Specific Queue + +To use a queue, specify its name in the `spec.queue` field of your Volcano Job: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: team-a-job +spec: + queue: team-a-queue # Submit specifically to Team A's pool + tasks: + - replicas: 2 + template: + spec: + containers: + - image: busybox + name: worker + command: ["sleep", "3600"] + resources: + requests: + cpu: 1 +``` + +## Verification + +### Check Queue Status + +Monitor how resources are allocated and redistributed across your hierarchy: + +```bash +kubectl get queues research-parent -o yaml +kubectl get queues team-a-queue -o yaml +``` + +### Observe Resource Sharing + +If Team B has no active jobs, Team A can "borrow" resources up to the parent's `capability` (10 CPU). 
However, the moment Team B submits a job, Volcano's scheduler will identify the overuse and trigger **resource reclamation** from Team A to satisfy Team B's `deserved` entitlement.

## Notes

- **Leaf Queues**: In Volcano, you should always submit jobs to **leaf queues** (the bottom level of the hierarchy).
- **Entitlement Sums**: Ensure the sum of `deserved` resources of all child queues does not exceed the `capability` of their parent.
- **Queue State**: By default, new queues are created in the `Open` state. If a queue is `Closed` (e.g., during maintenance), no new jobs will be accepted.
diff --git a/content/en/docs/tutorials/spark.md b/content/en/docs/tutorials/spark.md
new file mode 100644
index 00000000..84933166
--- /dev/null
+++ b/content/en/docs/tutorials/spark.md
@@ -0,0 +1,113 @@
+++
title = "Batch Processing with Apache Spark"
linktitle = "Apache Spark"
date = 2026-02-11
publishdate = 2026-02-11
lastmod = 2026-02-11
draft = false
toc = true
type = "docs"

[menu.docs]
  parent = "tutorials"
  weight = 20
+++

This tutorial explains how to orchestrate Apache Spark applications using Volcano for optimized resource utilization and scheduling.

## Background

Apache Spark is a widely used engine for large-scale data processing. When running Spark on Kubernetes, a driver pod is created to manage several executor pods. In clusters with high concurrency or limited resources, this model faces a specific challenge:

- **Resource Starvation**: Standard schedulers may allow multiple Spark drivers to start, but fail to provide enough resources for their executors. This leads to a state where many jobs are "started" but none can progress, effectively deadlocking the cluster.
- **Gang Scheduling**: To prevent starvation, Spark jobs should ideally be scheduled as a "gang", ensuring the driver only starts if the minimum required resources for executors are also available.
+ +Volcano enables gang scheduling for Spark, ensuring that resources are allocated efficiently and preventing drivers from idling while waiting for executors. + +## Scenario + +A typical Spark batch job consists of one driver and multiple executors performing parallel data processing tasks. In this tutorial, you will run a simple Spark Pi application using the **Spark Operator**. Volcano will act as the batch scheduler to ensure the driver and its executors are managed as a single unit. + +## Prerequisites + +Before you begin, ensure you have: +- A Kubernetes cluster with Volcano installed. +- The [Spark Operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) installed in your cluster. +- A ServiceAccount (e.g., `spark`) with the necessary permissions to manage pods. + +## Deployment Step-by-Step + +### 1. Create the SparkApplication Manifest + +Create a file named `spark-pi.yaml` using the `SparkApplication` Custom Resource: + +```yaml +apiVersion: "sparkoperator.k8s.io/v1beta2" +kind: SparkApplication +metadata: + name: spark-tutorial-pi + namespace: default +spec: + type: Scala + mode: cluster + image: "gcr.io/spark-operator/spark:v3.1.1" + mainClass: org.apache.spark.examples.SparkPi + mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar" + sparkVersion: "3.1.1" + batchScheduler: "volcano" # Crucial: tells the operator to use Volcano + restartPolicy: + type: OnFailure + onFailureRetries: 3 + onFailureRetryInterval: 10 + driver: + cores: 1 + memory: "512m" + labels: + version: 3.1.1 + serviceAccount: spark # Ensure this SA exists + executor: + cores: 1 + instances: 2 + memory: "512m" + labels: + version: 3.1.1 +``` + +### 2. 
Apply the Manifest

Deploy the Spark application using `kubectl`:

```bash
kubectl apply -f spark-pi.yaml
```

## Verification

### Check Application Status

You can monitor the progress of your Spark application using the standard `kubectl` command:

```bash
kubectl get sparkapplication spark-tutorial-pi
```

Expected output:
```text
NAME                STATUS      ATTEMPTS   START                  FINISH                 AGE
spark-tutorial-pi   COMPLETED   1          2026-02-11T09:40:00Z   2026-02-11T09:42:00Z   2m
```

### Monitor Pods

Watch the driver and executor pods as they are scheduled by Volcano:

```bash
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor
```

## Notes

- **Starvation Prevention**: If your driver pod remains in the `Pending` state, check if your Volcano `Queue` has enough `capability` to satisfy the combined resource requests of the driver and the minimum executors.
- **Scheduler Configuration**: Ensure that the `batchScheduler` field in your manifest is explicitly set to `"volcano"`. If omitted, the application will default to the standard Kubernetes scheduler.
- **RBAC**: If the driver fails to create executor pods, verify that the `serviceAccount` used has a `RoleBinding` allowing it to create and manage pods in the namespace.
diff --git a/content/en/docs/tutorials/tensorflow.md b/content/en/docs/tutorials/tensorflow.md
new file mode 100644
index 00000000..a8a90c49
--- /dev/null
+++ b/content/en/docs/tutorials/tensorflow.md
@@ -0,0 +1,143 @@
+++
title = "Running Distributed TensorFlow Training"
linktitle = "Distributed TensorFlow"
date = 2026-02-11
publishdate = 2026-02-11
lastmod = 2026-02-11
draft = false
toc = true
type = "docs"

[menu.docs]
  parent = "tutorials"
  weight = 10
+++

This tutorial demonstrates how to run a distributed TensorFlow training job on Volcano using the Parameter Server (PS) and Worker model.
+ +## Background + +Distributed training is essential for large scale deep learning models that cannot fit or run efficiently on a single node. However, running distributed workloads like TensorFlow on standard Kubernetes presents several challenges: + +- **Gang Scheduling**: Distributed training requires all components (Parameter Servers and Workers) to start together. If only a subset is scheduled, the job may hang indefinitely, wasting cluster resources. +- **Service Discovery**: Parameter Servers and Workers need a way to find each other to synchronize gradients and parameters. +- **Lifecycle Management**: The job should be able to recover from worker failures or preemptions without manual intervention. + +Volcano addresses these issues by providing native gang scheduling, service discovery via its `svc` plugin, and flexible lifecycle policies. + +## Scenario + +A common distributed training pattern is the **Parameter Server (PS) and Worker model**. The PS performs model-related services (storing and updating parameters), while Workers train the model, calculate gradients, and send updates to the PS. + +In this tutorial, you will deploy a distributed MNIST training job using 1 Parameter Server and 2 Workers. Volcano ensures that all 3 pods are scheduled as a single unit or not at all. + +## Prerequisites + +Before you begin, ensure you have: +- A Kubernetes cluster with Volcano installed. +- `kubectl` configured to access your cluster. + +## Deployment Step-by-Step + +### 1. 
Create the Job Manifest + +Create a file named `tensorflow-dist.yaml` with the following content: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: tf-dist-mnist +spec: + minAvailable: 3 # Total pods required: 1 PS + 2 Workers + schedulerName: volcano + plugins: + env: [] + svc: [] # Enables service discovery between PS and Workers + policies: + - event: PodEvicted + action: RestartJob # Ensures job resilience + queue: default + tasks: + - replicas: 1 + name: ps + template: + spec: + containers: + - command: + - sh + - -c + - | + PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" "," | sed 's/,$//'`; + WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" "," | sed 's/,$//'`; + export TF_CONFIG="{\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}"; + python /var/tf_dist_mnist/dist_mnist.py + image: volcanosh/dist-mnist-tf-example:0.0.1 + name: tensorflow + ports: + - containerPort: 2222 + name: tfjob-port + restartPolicy: Never + - replicas: 2 + name: worker + policies: + - event: TaskCompleted + action: CompleteJob # Marked as completed once workers finish + template: + spec: + containers: + - command: + - sh + - -c + - | + PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" "," | sed 's/,$//'`; + WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" "," | sed 's/,$//'`; + export TF_CONFIG="{\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}"; + python /var/tf_dist_mnist/dist_mnist.py + image: volcanosh/dist-mnist-tf-example:0.0.1 + name: tensorflow + ports: + - containerPort: 2222 + name: tfjob-port + restartPolicy: Never +``` + +### 2. 
Apply the Manifest + +Run the following command to deploy the job: + +```bash +kubectl apply -f tensorflow-dist.yaml +``` + +## Verification + +### Check Job and Pod Status + +You can monitor the status of your Volcano Job and its associated pods: + +```bash +kubectl get pods -l volcano.sh/job-name=tf-dist-mnist +``` + +Expected output showing all 3 pods running: +```text +NAME READY STATUS RESTARTS AGE +tf-dist-mnist-ps-0 1/1 Running 0 30s +tf-dist-mnist-worker-0 1/1 Running 0 30s +tf-dist-mnist-worker-1 1/1 Running 0 30s +``` + +### View Training Logs + +Check the logs of a worker to see the training progress: + +```bash +kubectl logs tf-dist-mnist-worker-0 +``` + +## Notes + +- **Gang Scheduling**: If pods remain in `Pending` state, it usually means the cluster lacks enough resources to start the **entire** gang (3 pods). +- **Service Discovery**: The `svc` plugin is critical. It generates hostnames in `/etc/volcano/` which the PS and Workers use to communicate. +- **Failover**: If a pod is evicted, the defined policy `RestartJob` will ensure the training resumes properly. From aac76c0b2c999a5ec7c4546b1625aed38174df65 Mon Sep 17 00:00:00 2001 From: vinayak sharma Date: Fri, 13 Feb 2026 14:22:56 +0530 Subject: [PATCH 2/4] docs: fix tutorial index frontmatter and alert syntax Signed-off-by: vinayak sharma --- content/en/docs/tutorials/_index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/en/docs/tutorials/_index.md b/content/en/docs/tutorials/_index.md index af1adef2..478b39da 100644 --- a/content/en/docs/tutorials/_index.md +++ b/content/en/docs/tutorials/_index.md @@ -11,6 +11,7 @@ type = "docs" [menu.docs] parent = "getting-started" weight = 2 + identifier = "tutorials" +++ Welcome to the Volcano Tutorial Series. This section provides end-to-end guides for running production-grade batch workloads on Kubernetes using Volcano. 
@@ -34,5 +35,4 @@ While the core documentation explains *how* Volcano's individual features work, --- -> [!TIP] -> Each tutorial is designed to be self-contained. However, if you are new to Volcano, we recommend starting with the [Quick Start: Deploy a Volcano Job](https://volcano.sh/en/docs/tutorials/#quick-start-deploy-a-volcano-job) guide. +> **Tip:** Each tutorial is designed to be self-contained. However, if you are new to Volcano, we recommend starting with the [Quick Start: Deploy a Volcano Job](https://volcano.sh/en/docs/tutorials/#quick-start-deploy-a-volcano-job) guide. From a767cb0f765b330f9226965e79d8b5665c0da316 Mon Sep 17 00:00:00 2001 From: vinayak sharma Date: Fri, 13 Feb 2026 14:34:32 +0530 Subject: [PATCH 3/4] docs: define tutorials menu in menus.toml to fix build Signed-off-by: vinayak sharma --- config/_default/menus.toml | 7 +++++++ content/en/docs/tutorials/_index.md | 4 ---- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/config/_default/menus.toml b/config/_default/menus.toml index bc509a63..871c1b42 100644 --- a/config/_default/menus.toml +++ b/config/_default/menus.toml @@ -63,6 +63,13 @@ weight = 2 identifier = "getting-started" +[[docs]] + name = "Tutorials" + weight = 10 + identifier = "tutorials" + parent = "getting-started" + url = "/en/docs/tutorials/" + [[docs]] name = "Concepts" weight = 3 diff --git a/content/en/docs/tutorials/_index.md b/content/en/docs/tutorials/_index.md index 478b39da..8246e53a 100644 --- a/content/en/docs/tutorials/_index.md +++ b/content/en/docs/tutorials/_index.md @@ -8,10 +8,6 @@ draft = false toc = true type = "docs" -[menu.docs] - parent = "getting-started" - weight = 2 - identifier = "tutorials" +++ Welcome to the Volcano Tutorial Series. This section provides end-to-end guides for running production-grade batch workloads on Kubernetes using Volcano. 
From 7b54d50151706469594e746220979620ccb3950b Mon Sep 17 00:00:00 2001 From: vinayak sharma Date: Fri, 13 Feb 2026 14:38:40 +0530 Subject: [PATCH 4/4] docs: use absolute paths for internal links in tutorial index Signed-off-by: vinayak sharma --- content/en/docs/tutorials/_index.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/en/docs/tutorials/_index.md b/content/en/docs/tutorials/_index.md index 8246e53a..0106a004 100644 --- a/content/en/docs/tutorials/_index.md +++ b/content/en/docs/tutorials/_index.md @@ -23,11 +23,11 @@ While the core documentation explains *how* Volcano's individual features work, ## Tutorial Series -- **[Distributed TensorFlow](tensorflow)**: Orchestrate high-performance ML training jobs with parameter servers and workers. -- **[Apache Spark](spark)**: Prevent resource starvation in big data processing pipelines. -- **[GPU Resource Management](gpu-scheduling)**: Maximize hardware efficiency through fractional sharing (vGPU) and isolation. -- **[Multi-tenancy](multi-tenancy)**: Configure fair share scheduling and hierarchical queues for different teams. -- **[Argo Workflows](argo-workflows)**: Integrate Volcano's advanced scheduling into your CI/CD and data pipelines. +- **[Distributed TensorFlow](/en/docs/tutorials/tensorflow/)**: Orchestrate high-performance ML training jobs with parameter servers and workers. +- **[Apache Spark](/en/docs/tutorials/spark/)**: Prevent resource starvation in big data processing pipelines. +- **[GPU Resource Management](/en/docs/tutorials/gpu-scheduling/)**: Maximize hardware efficiency through fractional sharing (vGPU) and isolation. +- **[Multi-tenancy](/en/docs/tutorials/multi-tenancy/)**: Configure fair share scheduling and hierarchical queues for different teams. +- **[Argo Workflows](/en/docs/tutorials/argo-workflows/)**: Integrate Volcano's advanced scheduling into your CI/CD and data pipelines. ---
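
The final patch rewrites relative links (e.g., `](spark)`) to site-absolute ones, since relative links resolve against the current page URL and break when the section moves. This can be spot-checked by looking for the `](/` prefix that site-absolute links share — a minimal sketch run against one sample line rather than the real file (adapt the pattern and read `content/en/docs/tutorials/_index.md` in the repo):

```shell
# One of the rewritten index lines; in the repo you would read the file instead.
line='- **[Apache Spark](/en/docs/tutorials/spark/)**: Prevent resource starvation in big data processing pipelines.'

# Site-absolute internal links start with "](/"; anything else is relative.
case "$line" in
  *'](/'*) echo "site-absolute link: OK" ;;
  *)       echo "relative link found" ;;
esac
```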