**File:** `content/en/docs/tutorials/_index.md`
+++
title = "Tutorial Series"
linktitle = "Tutorials"
date = 2026-02-11
publishdate = 2026-02-11
lastmod = 2026-02-11
draft = false
toc = true
type = "docs"

[menu.docs]
parent = "getting-started"
weight = 2
+++

Welcome to the Volcano Tutorial Series. This section provides end-to-end guides for running production-grade batch workloads on Kubernetes using Volcano.

## Why These Tutorials?

While the core documentation explains *how* Volcano's individual features work, these tutorials demonstrate *why* they matter in real-world environments. Each guide is structured to help you move from a problem statement to a working deployment:

- **Background**: Understand the specific challenges (e.g., gang scheduling, resource starvation) addressed by the tutorial.
- **Scenario**: A practical use case you might encounter in a production cluster.
- **Step-by-Step Deployment**: Clear commands and complete, ready-to-use YAML manifests.
- **Verification**: How to confirm your job is running and being scheduled correctly.

## Tutorial Series

- **[Distributed TensorFlow](tensorflow)**: Orchestrate high-performance ML training jobs with parameter servers and workers.
- **[Apache Spark](spark)**: Prevent resource starvation in big data processing pipelines.
- **[GPU Resource Management](gpu-scheduling)**: Maximize hardware efficiency through fractional sharing (vGPU) and isolation.
- **[Multi-tenancy](multi-tenancy)**: Configure fair share scheduling and hierarchical queues for different teams.
- **[Argo Workflows](argo-workflows)**: Integrate Volcano's advanced scheduling into your CI/CD and data pipelines.

---

> [!TIP]
> Each tutorial is designed to be self-contained. However, if you are new to Volcano, we recommend starting with the [Distributed TensorFlow](tensorflow) tutorial, which walks through a complete VolcanoJob deployment from manifest to verification.


**File:** `content/en/docs/tutorials/argo-workflows.md`
+++
title = "Integrating with Argo Workflows"
linktitle = "Argo Workflows"
date = 2026-02-11
publishdate = 2026-02-11
lastmod = 2026-02-11
draft = false
toc = true
type = "docs"

[menu.docs]
parent = "tutorials"
weight = 50
+++

This tutorial shows how to use Volcano as the scheduler for Argo Workflows to gain advanced batch scheduling features for your CI/CD and data processing pipelines.

## Background

Argo Workflows is a popular cloud-native workflow engine for orchestrating parallel jobs on Kubernetes. While Argo excels at managing dependencies and execution flow, it often relies on the default Kubernetes scheduler for individual steps.

By integrating Volcano as the scheduler for Argo Workflows, you unlock advanced batch scheduling capabilities:

- **Bin-packing**: Optimize resource utilization by packing tasks onto the fewest number of nodes.
- **Fair Sharing**: Ensure that workflow steps across different tenants or namespaces are scheduled fairly according to configured weights.
- **Gang Scheduling**: For workflows involving multiple parallel pods that must start together, Volcano ensures they are managed as a single unit (PodGroup).
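For the gang-scheduling case, one possible approach is to pre-create a PodGroup and have the parallel workflow pods join it. This is a sketch, assuming Volcano's `scheduling.k8s.io/group-name` pod annotation for attaching pods to an existing PodGroup; the names and `minMember` value are illustrative:

```yaml
# Hypothetical PodGroup that three parallel workflow steps join
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: workflow-gang
spec:
  minMember: 3 # all three pods must be schedulable before any starts
```

Each Argo template belonging to the gang would then set `schedulerName: volcano` and add the annotation `scheduling.k8s.io/group-name: workflow-gang` to its pod `metadata`.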

## Scenario

A common workflow scenario involves a "Main" entrypoint that triggers one or more "Task" steps. In this tutorial, you will configure a simple Argo Workflow to use Volcano for its underlying pod scheduling.

## Prerequisites

Before you begin, ensure you have:
- A Kubernetes cluster with Volcano installed.
- [Argo Workflows](https://argoproj.github.io/argo-workflows/installation/) installed in your cluster.

## Deployment Step-by-Step

### 1. Create the Workflow Manifest

You can configure Argo to use Volcano at the individual template level using the `schedulerName` field. Create a file named `volcano-workflow.yaml`:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: volcano-workflow-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: step1
            template: whalesay
    - name: whalesay
      container:
        image: docker/whalesay
        command: [cowsay]
        args: ["Hello from Argo + Volcano!"]
      schedulerName: volcano # Explicitly tell Argo to use Volcano
```

### 2. Apply the Workflow

Run the following command to submit your workflow:

```bash
argo submit volcano-workflow.yaml
```

## Advanced: Deploying VolcanoJobs from Argo

For tasks that require native Volcano features like `minAvailable` or specific `plugins`, you can submit a `VolcanoJob` directly as a resource template:

```yaml
- name: volcano-job-step
  resource:
    action: create
    # status.state is an object; compare its phase subfield to the string 'Completed'
    successCondition: status.state.phase == 'Completed' # Wait for the Job to finish
    manifest: |
      apiVersion: batch.volcano.sh/v1alpha1
      kind: Job
      metadata:
        generateName: argo-step-
      spec:
        minAvailable: 1
        schedulerName: volcano
        tasks:
          - replicas: 1
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: main
                    image: alpine
                    command: ["echo", "running inside volcano job"]
```

## Verification

### Check Workflow Status

Monitor the progress of your workflow using the Argo CLI:

```bash
argo get @latest
```

### Verify the Scheduler

Check the details of any pod created by the workflow to ensure it was handled by Volcano:

```bash
kubectl get pod <pod-name> -o jsonpath='{.spec.schedulerName}'
```

The output should be `volcano`.

## Notes

- **Global Configuration**: You can make Volcano the default scheduler for *all* Argo Workflows by setting `schedulerName` under `workflowDefaults` in the `workflow-controller-configmap`.
- **ServiceAccount Permissions**: If using the `resource` template to create `VolcanoJobs`, ensure the ServiceAccount used by the Argo controller has RBAC permissions to `create`, `get`, and `watch` resources in the `batch.volcano.sh` group.
- **PodGroups**: When a pod is scheduled by Volcano, a `PodGroup` is automatically created. You can inspect it with `kubectl get podgroups`.
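The global configuration mentioned in the notes can be sketched as follows, assuming Argo's `workflowDefaults` mechanism in the controller ConfigMap (adjust the namespace to match your installation):

```yaml
# Hypothetical example: default all Workflows to the Volcano scheduler
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  workflowDefaults: |
    spec:
      schedulerName: volcano
```

Individual templates can still override this default by setting their own `schedulerName`.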
**File:** `content/en/docs/tutorials/gpu-scheduling.md`
+++
title = "GPU Scheduling and Resource Management"
linktitle = "GPU Scheduling"
date = 2026-02-11
publishdate = 2026-02-11
lastmod = 2026-02-11
draft = false
toc = true
type = "docs"

[menu.docs]
parent = "tutorials"
weight = 30
+++

This tutorial covers how to efficiently manage GPU resources using Volcano, including fractional GPU sharing (vGPU) and hardware-based isolation (MIG).

## Background

GPUs are high-performance but expensive resources. In standard Kubernetes, a physical GPU is typically treated as an indivisible unit—one GPU can only be assigned to one container. This often leads to significant underutilization, especially for smaller workloads like model inference or development tasks that don't require the full compute power or memory of a modern GPU.

Volcano addresses this by providing robust **vGPU (virtual GPU) scheduling**. This allows you to:

- **Fractional Sharing**: Slice a single physical GPU into multiple virtual GPUs (vGPUs).
- **Resource Isolation**: Enforce specific compute (cores) and memory limits for each container sharing the physical hardware.
- **Multiple Modes**: Support both software-based slicing (via VCUDA) and hardware-based isolation (via NVIDIA MIG).

## Scenario

Suppose you have a cluster where multiple users need to run lightweight inference tasks. Instead of dedicating one physical GPU to each user, you can partition each GPU to support multiple users simultaneously.

In this tutorial, you will deploy a Volcano Job that requests a fractional share of a GPU: **20% of the compute power** and **2000MiB of memory**.

## Prerequisites

Before you begin, ensure you have:
- A Kubernetes cluster with nodes equipped with NVIDIA GPUs.
- The [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) installed on your nodes.
- Volcano installed and the `volcano-vgpu-device-plugin` deployed.

## Deployment Step-by-Step

### 1. Create the GPU Sharing Manifest

Create a file named `gpu-sharing-job.yaml` with the following content:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpu-sharing-tutorial
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - replicas: 1
      name: gpu-task
      template:
        spec:
          containers:
            - name: gpu-container
              image: nvidia/cuda:11.0-base
              command: ["sh", "-c", "nvidia-smi && sleep 3600"]
              resources:
                limits:
                  volcano.sh/vgpu-number: 1    # Request 1 virtual GPU
                  volcano.sh/vgpu-memory: 2000 # Limit to 2000MiB of GPU memory
                  volcano.sh/vgpu-cores: 20    # Limit to 20% of GPU compute
          restartPolicy: Never
```

### 2. Apply the Manifest

Run the following command to deploy the job:

```bash
kubectl apply -f gpu-sharing-job.yaml
```

## Verification

### Check Resource Allocation

Verify that your pod has been scheduled to a node with available vGPU resources:

```bash
kubectl get pods -l volcano.sh/job-name=gpu-sharing-tutorial
```

### Inspect the Container

Check the logs to verify that the container correctly detects the GPU environment via `nvidia-smi`:

```bash
kubectl logs gpu-sharing-tutorial-gpu-task-0
```

Even though it is a shared physical GPU, the `volcano-vgpu-device-plugin` ensures the container only utilizes the allocated memory and compute slices.

## Notes

- **Insufficient Resources**: If pods remain `Pending` with "insufficient volcano.sh/vgpu-number", check if your nodes are correctly labeled and the `volcano-vgpu-device-plugin` is healthy.
- **Memory Limits**: If your application fails with Out of Memory (OOM) on the GPU, ensure the `vgpu-memory` limit is large enough for your specific model requirements.
- **Hardware Isolation**: For mission-critical workloads requiring strict hardware-level isolation, consider using **Dynamic MIG** mode if your hardware supports it (e.g., A100/H100).
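As a rough sketch of how MIG mode might be requested, assuming your vGPU device plugin version supports the `volcano.sh/vgpu-mode` pod annotation (verify this against your plugin's documentation before relying on it):

```yaml
# Hypothetical pod template fragment requesting hardware-isolated (MIG) slices
metadata:
  annotations:
    volcano.sh/vgpu-mode: "mig" # assumed annotation; ask for MIG-backed vGPUs
spec:
  containers:
    - name: gpu-container
      resources:
        limits:
          volcano.sh/vgpu-number: 1
          volcano.sh/vgpu-memory: 2000
```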
**File:** `content/en/docs/tutorials/multi-tenancy.md`
+++
title = "Multi-tenant Queue Configuration"
linktitle = "Multi-tenancy"
date = 2026-02-11
publishdate = 2026-02-11
lastmod = 2026-02-11
draft = false
toc = true
type = "docs"

[menu.docs]
parent = "tutorials"
weight = 40
+++

This tutorial demonstrates how to set up a multi-tenant environment using Volcano's hierarchical queues, enabling fair resource sharing between different teams or departments.

## Background

In multi-tenant clusters, different teams or departments share the same underlying resources. Without proper management, one team's high-volume workload can monopolize the cluster, causing "starvation" for other users.

Volcano's **Hierarchical Queues** provide a sophisticated mechanism to handle these scenarios by allowing you to:

- **Guarantees and Limits**: Set minimum resource entitlements (`deserved`) to ensure every team has access to capacity.
- **Fair Sharing**: Dynamically re-allocate unused resources between teams based on their configured weights.
- **Resource Reclamation**: Automatically reclaim resources from teams over-using their share when other teams have pending jobs.

## Scenario

Consider a "Research" department with two sub-teams: **Team A** and **Team B**. You want to ensure that even if Team A submits a massive batch of jobs, Team B is guaranteed at least 50% of the department's allocated capacity when they need it.

In this tutorial, you will build a queue hierarchy and deploy a job to verify that resources are allocated according to the team's entitlements.

## Prerequisites

Before you begin, ensure you have:
- A Kubernetes cluster with Volcano installed.
- Admin permissions to create `Queue` resources.

## Deployment Step-by-Step

### 1. Create the Queue Hierarchy

Create a file named `team-queues.yaml` to define a parent-child relationship between the department and the teams:

```yaml
# 1. Define the parent Research Queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research-parent
spec:
  weight: 1
  capability:
    cpu: 10
    memory: 20Gi
---
# 2. Define Team A's Queue as a child
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a-queue
spec:
  parent: research-parent
  weight: 1
  deserved: # Guaranteed capacity for Team A
    cpu: 5
    memory: 10Gi
---
# 3. Define Team B's Queue as a child
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-b-queue
spec:
  parent: research-parent
  weight: 1
  deserved: # Guaranteed capacity for Team B
    cpu: 5
    memory: 10Gi
```

### 2. Apply the Queues

Run the following command to establish the hierarchy:

```bash
kubectl apply -f team-queues.yaml
```

### 3. Deploy a Job to a Specific Queue

To use a queue, specify its name in the `spec.queue` field of your Volcano Job:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: team-a-job
spec:
  queue: team-a-queue # Submit specifically to Team A's pool
  tasks:
    - replicas: 2
      template:
        spec:
          containers:
            - image: busybox
              name: worker
              command: ["sleep", "3600"]
              resources:
                requests:
                  cpu: 1
```

## Verification

### Check Queue Status

Monitor how resources are allocated and redistributed across your hierarchy:

```bash
kubectl get queues research-parent -o yaml
kubectl get queues team-a-queue -o yaml
```

### Observe Resource Sharing

If Team B has no active jobs, Team A can "borrow" resources up to the parent's `capability` (10 CPU). However, the moment Team B submits a job, Volcano's scheduler identifies the overuse and triggers **resource reclamation** from Team A to satisfy Team B's `deserved` entitlement.
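To observe reclamation yourself, you can submit a mirror job to Team B's queue. This sketch follows the same pattern as `team-a-job` above; the job name is illustrative:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: team-b-job # illustrative name
spec:
  queue: team-b-queue # Target Team B's guaranteed pool
  tasks:
    - replicas: 2
      template:
        spec:
          containers:
            - image: busybox
              name: worker
              command: ["sleep", "3600"]
              resources:
                requests:
                  cpu: 1
```

With both jobs pending, watching `kubectl get queues -o yaml` should show Team A's allocation shrink toward its `deserved` share as Team B's pods are admitted.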

## Notes

- **Leaf Queues**: In Volcano, you should always submit jobs to **leaf queues** (the bottom level of the hierarchy).
- **Entitlement Sums**: Ensure the sum of `deserved` resources of all child queues does not exceed the `capability` of their parent.
- **Queue State**: By default, new queues are created in the `Open` state. If a queue is `Closed` (e.g., during maintenance), no new jobs will be accepted.