From 5b336761edad70da7bd39207af3dd0cb7c14a616 Mon Sep 17 00:00:00 2001 From: Lakshya Saxena Date: Thu, 5 Feb 2026 09:36:46 +0000 Subject: [PATCH 1/6] docs: add HyperJob concept documentation for multi-cluster job splitting Introduce HyperJob documentation in both English and Chinese Concepts sections, covering multi-cluster job splitting, key features, use cases, and distinctions from standard Volcano Jobs Signed-off-by: Lakshya Saxena --- content/en/docs/hyperjob.md | 192 ++++++++++++++++++++ content/en/docs/multi_cluster_scheduling.md | 3 + content/zh/docs/hyperjob.md | 190 +++++++++++++++++++ content/zh/docs/multi_cluster_scheduling.md | 3 + 4 files changed, 388 insertions(+) create mode 100644 content/en/docs/hyperjob.md create mode 100644 content/zh/docs/hyperjob.md diff --git a/content/en/docs/hyperjob.md b/content/en/docs/hyperjob.md new file mode 100644 index 00000000..44d857dc --- /dev/null +++ b/content/en/docs/hyperjob.md @@ -0,0 +1,192 @@ +++ +title = "HyperJob" +description = "Multi-cluster job splitting and high-level scheduling with HyperJob" +date = 2026-02-05 +lastmod = 2026-02-05 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "HyperJob" +[menu.docs] + parent = "concepts" + weight = 4 ++++ + +## Overview + +HyperJob is a high-level scheduling abstraction built on top of Volcano Job for **multi-cluster AI workloads**. +Instead of submitting and managing multiple Volcano Jobs in different clusters, users submit a single HyperJob, +and the system automatically **splits**, **dispatches**, and **tracks** the underlying jobs across clusters. + +HyperJob is designed for scenarios where: + +- A single Kubernetes cluster **does not have enough resources** for a large AI/ML training job. +- You want to **combine capacity from multiple clusters** (for example, across regions or environments). 
+- You need a **unified control plane and status view** for a job that actually runs in many clusters. + +HyperJob leverages Volcano Job as the basic execution unit and works together with **Volcano Global** and +**Karmada** to make multi-cluster job orchestration as simple as running a single job in one cluster. + +## Key Features and Advantages + +- **Automatic Job Splitting** + - Splits a large logical job into multiple child jobs that can be scheduled to different clusters. + - Splitting can be based on replicas, resources, or other policies defined in HyperJob. + +- **Unified Status Management** + - HyperJob exposes **one high-level status** that aggregates the states of all underlying jobs. + - Users can start, stop, and observe the entire multi-cluster workload from a single resource. + +- **Simplified Multi-Cluster Usage** + - Users no longer need to manually create and manage jobs per cluster. + - Cluster selection and placement policies are handled by the HyperJob control plane plus Volcano Global. + +- **High-Level Scheduling** + - Acts as a **meta-scheduler** on top of Volcano Job. + - Decides how many replicas go to which cluster, and then lets Volcano Job handle in-cluster scheduling + (gang scheduling, fair sharing, queue priority, and so on). + +- **Resource Optimization Across Clusters** + - Makes it easier to **utilize fragmented or heterogeneous capacity** across multiple clusters. + - Can spread load to avoid hot spots and improve overall throughput for large AI/ML workloads. + +## HyperJob vs. 
Standard Volcano Job + +Both HyperJob and Volcano Job are part of the same ecosystem, but they focus on different scopes: + +| Aspect | Volcano Job | HyperJob | +|-------------------------------|--------------------------------------------------------|--------------------------------------------------------------------------| +| Scope | Single cluster | Multiple clusters | +| Execution unit | One `Job` in one cluster | One logical HyperJob mapped to **multiple** underlying Volcano Jobs | +| Multi-cluster awareness | Not aware | Native multi-cluster abstraction | +| Job splitting | Not provided | **Built-in automatic job splitting** | +| Status view | Per cluster, per job | **Unified status** across clusters and child jobs | +| When to use | Cluster is big enough; single-cluster scheduling only | Workloads need more capacity than one cluster or must span many clusters | + +**When to use Volcano Job** + +- You only run in a single cluster. +- The workload size fits within that cluster’s resource capacity. +- You want straightforward integration with existing controllers or pipelines that already speak Volcano Job. + +**When to use HyperJob** + +- Your AI/ML workload (for example, LLM pre-training or fine-tuning) needs **aggregate resources from multiple clusters**. +- You want a **single submission and control interface** for a distributed training or inference job. +- You need to **combine heterogeneous resources** (different GPU models, zones, or hardware generations) while + letting the system choose where each part of the job runs. + +## Typical Use Cases + +- **Multi-Cluster LLM Training** + - Train a large language model where a single cluster cannot provide enough GPUs or network bandwidth. + - HyperJob splits replicas across multiple clusters while presenting a single logical job to the user. + +- **Heterogeneous Resource Scheduling** + - Combine clusters with different GPU types (for example, A100, H100, or other accelerators). 
+ - HyperJob can assign subtasks to the most suitable cluster based on resource type and availability. + +- **Resource Overflow and Bursting** + - When a primary cluster is close to saturation, HyperJob can place additional replicas into other clusters + without changing user-facing APIs. + +- **Geographically Distributed Training** + - Distribute parts of a workload across clusters in different regions or data centers. + - Useful for latency-sensitive scenarios or to comply with data locality requirements. + +## Architecture Overview + +At a high level, HyperJob works as follows (conceptual view): + +1. **User submits a HyperJob** to a control-plane cluster. +2. The HyperJob controller: + - Interprets the desired replicas and resources. + - Applies **splitting policies** (for example, per-cluster replica counts or resource quotas). +3. For each target cluster, the controller creates one or more **underlying Volcano Jobs**. +4. **Volcano Global** and **Karmada** handle: + - Multi-cluster scheduling and `ResourceBinding` management. + - Cross-cluster queue and priority handling. + - Fair sharing and admission control across clusters. +5. HyperJob continuously watches the state of all child jobs and **aggregates status** back to the HyperJob resource. + +In this architecture: + +- HyperJob focuses on **job-level abstraction and splitting**. +- Volcano Job focuses on **in-cluster batch scheduling**. +- Volcano Global + Karmada focus on **multi-cluster coordination and placement**. + +For more details on the multi-cluster layer, see +[Multi-Cluster AI Job Scheduling](/en/docs/multi_cluster_scheduling/) and the +[Volcano Global](https://github.com/volcano-sh/volcano-global) project. + +## Example: HyperJob YAML (Conceptual) + +The exact HyperJob API is defined in the Volcano design and implementation. +The following example is a **simplified conceptual** manifest to illustrate how a HyperJob can describe +a logical job and its split across clusters. 
For the authoritative API, always refer to the +[HyperJob design document](https://github.com/volcano-sh/volcano/blob/master/docs/design/hyperjob-multi-cluster-job-splitting.md) +and the corresponding CRD definition in the Volcano repository. + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: HyperJob +metadata: + name: llm-train-hyperjob +spec: + # High-level description of the logical job + template: + apiVersion: batch.volcano.sh/v1alpha1 + kind: Job + spec: + minAvailable: 64 + schedulerName: volcano + queue: global-ai + tasks: + - name: trainer + replicas: 64 + template: + spec: + containers: + - name: trainer + image: example.com/llm-train:latest + resources: + requests: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + restartPolicy: OnFailure + + # Conceptual splitting policy (names and fields are illustrative) + splitPolicy: + strategy: ByCluster + clusters: + - name: cluster-a + replicas: 32 + - name: cluster-b + replicas: 32 +``` + +In practice, the real HyperJob spec may include: + +- More detailed **cluster selection** and **constraints**. +- Fields to describe how to **map HyperJob status** from child jobs. +- Policies for **retry, rollback, and cleanup** across clusters. + +Always check the latest Volcano documentation and code for the exact API. + +## Related Concepts and References + +- **Volcano Job**: The core batch job abstraction in a single cluster. + See [VolcanoJob](/en/docs/vcjob/) for details. +- **Queue**: Controls resource sharing and priority. + See [Queue](/en/docs/queue/) and [Queue Resource Management](/en/docs/queue_resource_management/). +- **Multi-Cluster AI Job Scheduling**: + See [Multi-Cluster AI Job Scheduling](/en/docs/multi_cluster_scheduling/) for Volcano Global architecture + and usage patterns. 
+- **HyperJob Design Document**: + See the detailed design: + `https://github.com/volcano-sh/volcano/blob/master/docs/design/hyperjob-multi-cluster-job-splitting.md` + diff --git a/content/en/docs/multi_cluster_scheduling.md b/content/en/docs/multi_cluster_scheduling.md index de8efacb..290d1932 100644 --- a/content/en/docs/multi_cluster_scheduling.md +++ b/content/en/docs/multi_cluster_scheduling.md @@ -51,3 +51,6 @@ Volcano Global consists of two components: ## Usage Guide Please refer to: [Volcano Global Deploy](https://github.com/volcano-sh/volcano-global/blob/main/docs/deploy/README.md). + +For a high-level multi-cluster abstraction that can automatically split a logical job into multiple Volcano Jobs +across clusters, see [HyperJob](/en/docs/hyperjob/). diff --git a/content/zh/docs/hyperjob.md b/content/zh/docs/hyperjob.md new file mode 100644 index 00000000..55e5485e --- /dev/null +++ b/content/zh/docs/hyperjob.md @@ -0,0 +1,190 @@ +++ +title = "HyperJob" +description = "基于 HyperJob 的多集群作业拆分与高级调度" +date = 2026-02-05 +lastmod = 2026-02-05 + +draft = false # 是否为草稿 +toc = true # 是否显示目录 +type = "docs" # 请勿修改 + +# 侧边栏菜单 +linktitle = "HyperJob" +[menu.docs] + parent = "concepts" + weight = 4 ++++ + +## 概述 + +HyperJob 是构建在 Volcano Job 之上的 **多集群高层作业抽象**。 +用户不再需要在多个集群中分别创建和管理 Volcano Job,而是只需要提交一个 HyperJob, +系统就会自动在多个集群之间 **拆分**、**下发**、并 **聚合管理** 底层作业。 + +HyperJob 主要解决以下问题: + +- 单个 Kubernetes 集群 **资源不足以承载** 大规模 AI/ML 训练任务。 +- 希望 **统一利用多个集群的算力**(例如跨地域、跨环境的 GPU 集群)。 +- 需要对实际运行在多个集群中的作业,仍然只保留 **一个统一的控制与状态视图**。 + +HyperJob 以 Volcano Job 为执行单元,并结合 **Volcano Global** 与 **Karmada**, +让多集群 AI 作业的使用体验尽量接近“在单集群中运行一个 Job”。 + +## 核心特性与优势 + +- **自动作业拆分** + - 将一个大型逻辑作业自动拆分为多个子 Job,下发到不同集群执行。 + - 拆分可以基于副本数、资源配额或其他在 HyperJob 中定义的策略。 + +- **统一状态管理** + - HyperJob 提供一个 **统一的高层状态**,对多个底层 Volcano Job 的状态进行聚合。 + - 用户可以通过一个 HyperJob 完成启动、停止和状态查看等操作。 + +- **简化多集群使用** + - 用户不需要关心每个集群中具体的 Job 清单。 + - 集群选择、实例分布和资源放置由 HyperJob 控制面和 Volcano Global 共同完成。 + +- **高层调度能力** + - HyperJob 扮演 **“Job 
级别的元调度器(meta-scheduler)”** 角色。 + - 负责确定多少副本下发到哪个集群,再由各集群内的 Volcano Job 完成批量调度 + (包括 Gang 调度、公平共享、队列优先级等)。 + +- **跨集群资源优化** + - 便于 **利用多个集群中零散或异构的资源**。 + - 可以把负载分摊到多个集群,提升整体吞吐与资源利用率。 + +## HyperJob 与普通 Volcano Job 的对比 + +HyperJob 与 Volcano Job 同属一个调度体系,但关注的粒度不同: + +| 对比项 | Volcano Job | HyperJob | +|----------------------------|--------------------------------------------------|-------------------------------------------------------------------------------| +| 调度范围 | 单集群 | 多集群 | +| 执行单元 | 单个集群中的一个 `Job` | 一个 HyperJob 对应多个底层 Volcano Job | +| 多集群感知 | 不感知多集群 | 原生面向多集群的高层抽象 | +| 作业拆分 | 不提供 | **内置自动拆分** | +| 状态视图 | 每个集群、每个 Job 各自管理 | **统一的 HyperJob 状态视图** | +| 典型使用场景 | 单集群即可满足资源与调度需求 | 需要聚合多个集群资源或必须跨集群运行的作业 | + +**适合直接使用 Volcano Job 的场景** + +- 所有工作负载只运行在 **单个集群** 中。 +- 集群本身的资源足以容纳训练或推理任务。 +- 已有控制器/流水线已经与 Volcano Job 打通,只需要单集群调度能力。 + +**适合使用 HyperJob 的场景** + +- LLM 预训练/大规模分布式训练等,需要 **多个集群的总算力** 才能满足需求。 +- 希望对跨集群的训练/推理任务,仍然只提交 **一个逻辑作业** 并进行统一管理。 +- 需要组合多种 **异构资源**(不同 GPU 型号、不同可用区或机型),让系统自动选择合适的集群。 + +## 典型使用场景 + +- **多集群 LLM 训练** + - 单个集群 GPU 规模不足时,将训练副本拆分到多个集群执行。 + - 用户只需要管理一个 HyperJob,对底层多个 Volcano Job 无感知。 + +- **异构资源调度** + - 组合多个拥有不同 GPU 型号或算力配置的集群。 + - HyperJob 根据资源类型与可用性,将不同子任务分配到最合适的集群中。 + +- **资源溢出与弹性扩展** + - 当主集群资源接近饱和时,可通过 HyperJob 自动将多余副本调度到其他集群。 + - 无需改动上层作业定义或业务逻辑。 + +- **跨地域/多机房训练** + - 将训练任务拆分到不同地域或机房的集群中执行。 + - 可结合数据本地化、合规要求、链路延迟等因素进行策略控制。 + +## 架构概览 + +HyperJob 的典型工作流程如下(概念性描述): + +1. **用户在控制平面集群中提交一个 HyperJob。** +2. HyperJob 控制器: + - 解析期望的副本数和资源需求。 + - 根据 **拆分策略** 决定各目标集群的副本与资源分配。 +3. 对每个目标集群,控制器创建一个或多个 **底层 Volcano Job**。 +4. **Volcano Global** 与 **Karmada** 负责: + - 多集群的调度与 `ResourceBinding` 管理。 + - 跨集群队列与作业优先级管理。 + - 多租户公平调度与资源准入控制。 +5. 
HyperJob 持续跟踪所有子 Job 的状态,并将其 **聚合回 HyperJob 状态** 中。 + +在该架构中: + +- HyperJob 关注 **作业级抽象与拆分逻辑**。 +- Volcano Job 关注 **单集群内的批量调度能力**。 +- Volcano Global + Karmada 关注 **多集群维度的资源协调与放置**。 + +多集群架构的详细介绍,可参考 +[多集群AI作业调度](/zh/docs/multi_cluster_scheduling/) 与 +[Volcano Global](https://github.com/volcano-sh/volcano-global) 项目。 + +## HyperJob YAML 示例(概念性) + +HyperJob 的具体 API 以 Volcano 设计与实现为准。 +下面示例为 **简化的概念示例**,用于帮助理解 HyperJob 如何描述一个逻辑作业及其跨集群拆分方式。 +权威、最新的字段定义请以 +[HyperJob 设计文档](https://github.com/volcano-sh/volcano/blob/master/docs/design/hyperjob-multi-cluster-job-splitting.md) +及 Volcano 仓库中的 CRD 定义为准。 + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: HyperJob +metadata: + name: llm-train-hyperjob +spec: + # 逻辑作业的高层模板 + template: + apiVersion: batch.volcano.sh/v1alpha1 + kind: Job + spec: + minAvailable: 64 + schedulerName: volcano + queue: global-ai + tasks: + - name: trainer + replicas: 64 + template: + spec: + containers: + - name: trainer + image: example.com/llm-train:latest + resources: + requests: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + restartPolicy: OnFailure + + # 拆分策略(字段名称仅作示意,实际以实现为准) + splitPolicy: + strategy: ByCluster + clusters: + - name: cluster-a + replicas: 32 + - name: cluster-b + replicas: 32 +``` + +在真实环境中,HyperJob 规范可能还包括: + +- 更精细的 **集群选择与约束条件**。 +- 描述如何将子 Job 状态 **聚合映射为 HyperJob 状态** 的字段。 +- 跨集群的 **重试、回滚与清理策略** 等高级能力。 + +请始终参考最新的 Volcano 文档与代码获取准确 API。 + +## 相关概念与参考链接 + +- **Volcano Job**:单集群批作业的核心抽象。 + 详情见 [VolcanoJob](/zh/docs/vcjob/)。 +- **Queue(队列)**:控制资源共享与优先级。 + 详情见 [Queue](/zh/docs/queue/) 与 [队列资源管理](/zh/docs/queue_resource_management/)。 +- **多集群AI作业调度**: + 详情见 [多集群AI作业调度](/zh/docs/multi_cluster_scheduling/),了解 Volcano Global 架构与使用方式。 +- **HyperJob 设计文档**: + 设计细节见: + `https://github.com/volcano-sh/volcano/blob/master/docs/design/hyperjob-multi-cluster-job-splitting.md` + diff --git a/content/zh/docs/multi_cluster_scheduling.md b/content/zh/docs/multi_cluster_scheduling.md index 06131f00..0a01fe4d 100644 --- 
a/content/zh/docs/multi_cluster_scheduling.md +++ b/content/zh/docs/multi_cluster_scheduling.md @@ -47,3 +47,6 @@ Volcano global主要包含两个组件: ## 使用指导 请参考: [Volcano global deploy](https://github.com/volcano-sh/volcano-global/blob/main/docs/deploy/README.md)。 + +如果希望在多集群环境下,以一个高层抽象来统一描述作业并自动拆分为多个跨集群的 Volcano Job, +可以参考 [HyperJob](/zh/docs/hyperjob/) 概念文档。 From 483dc9b285a39d8d21dd5101d2314dfde72c4599 Mon Sep 17 00:00:00 2001 From: Lakshya Saxena Date: Thu, 5 Feb 2026 15:14:47 +0530 Subject: [PATCH 2/6] Update metadata for HyperJob documentation Signed-off-by: Lakshya Saxena --- content/en/docs/hyperjob.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/en/docs/hyperjob.md b/content/en/docs/hyperjob.md index 44d857dc..799447f2 100644 --- a/content/en/docs/hyperjob.md +++ b/content/en/docs/hyperjob.md @@ -1,4 +1,4 @@ -++ ++++ title = "HyperJob" description = "Multi-cluster job splitting and high-level scheduling with HyperJob" date = 2026-02-05 From 5d0fccd1e847311c63d3b5d23a3709e22d1ac1d2 Mon Sep 17 00:00:00 2001 From: Lakshya Saxena Date: Thu, 5 Feb 2026 15:15:26 +0530 Subject: [PATCH 3/6] Update metadata in hyperjob.md Signed-off-by: Lakshya Saxena --- content/zh/docs/hyperjob.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/zh/docs/hyperjob.md b/content/zh/docs/hyperjob.md index 55e5485e..3fdda257 100644 --- a/content/zh/docs/hyperjob.md +++ b/content/zh/docs/hyperjob.md @@ -1,4 +1,4 @@ -++ ++++ title = "HyperJob" description = "基于 HyperJob 的多集群作业拆分与高级调度" date = 2026-02-05 From 3fb4e59b2c11268bb10fe07adf54b08e7dc8005f Mon Sep 17 00:00:00 2001 From: Lakshya Saxena Date: Thu, 12 Feb 2026 12:52:25 +0000 Subject: [PATCH 4/6] fix/refactor: Introduce HyperJob multi-cluster job splitting concept with comparison table, real-world use cases, and complete YAML examples from design doc. 
Removed architecture overview to keep focus on user-facing concepts Signed-off-by: Lakshya Saxena --- content/en/docs/hyperjob.md | 176 ++++++++++++++++++++---------------- content/zh/docs/hyperjob.md | 174 +++++++++++++++++++---------------- 2 files changed, 195 insertions(+), 155 deletions(-) diff --git a/content/en/docs/hyperjob.md b/content/en/docs/hyperjob.md index 799447f2..675c38a5 100644 --- a/content/en/docs/hyperjob.md +++ b/content/en/docs/hyperjob.md @@ -1,4 +1,8 @@ +<<<<<<< HEAD +++ +======= ++++ +>>>>>>> 21db252 (fix/refactor: Introduce HyperJob multi-cluster job splitting concept with comparison table, real-world use cases, and complete YAML examples from design doc. Removed architecture overview to keep focus on user-facing concepts) title = "HyperJob" description = "Multi-cluster job splitting and high-level scheduling with HyperJob" date = 2026-02-05 @@ -55,21 +59,22 @@ HyperJob leverages Volcano Job as the basic execution unit and works together wi ## HyperJob vs. Standard Volcano Job -Both HyperJob and Volcano Job are part of the same ecosystem, but they focus on different scopes: +HyperJob is built on top of Volcano Job, not as a replacement. It extends Volcano's capabilities to multi-cluster scenarios while preserving all the features of Volcano Job within each cluster. 
+ +| Aspect | Volcano Job | HyperJob | +|--------|-------------|----------| +| **Scope** | Single cluster | Multiple clusters | +| **Abstraction Level** | Cluster-level primitive (manages Pods) | Meta-level primitive (manages Volcano Jobs) | +| **Primary Use Case** | Batch workload scheduling | Large-scale training across heterogeneous clusters | +| **Job Composition** | Single job with multiple tasks | Composition of multiple Volcano Jobs | +| **Status Tracking** | Tracks pods within a single job | Aggregates status from multiple Volcano Jobs across clusters | -| Aspect | Volcano Job | HyperJob | -|-------------------------------|--------------------------------------------------------|--------------------------------------------------------------------------| -| Scope | Single cluster | Multiple clusters | -| Execution unit | One `Job` in one cluster | One logical HyperJob mapped to **multiple** underlying Volcano Jobs | -| Multi-cluster awareness | Not aware | Native multi-cluster abstraction | -| Job splitting | Not provided | **Built-in automatic job splitting** | -| Status view | Per cluster, per job | **Unified status** across clusters and child jobs | -| When to use | Cluster is big enough; single-cluster scheduling only | Workloads need more capacity than one cluster or must span many clusters | +HyperJob is designed for scenarios where training requirements exceed single cluster capacity or need to leverage heterogeneous accelerator resources across different clusters. **When to use Volcano Job** - You only run in a single cluster. -- The workload size fits within that cluster’s resource capacity. +- The workload size fits within that cluster's resource capacity. - You want straightforward integration with existing controllers or pipelines that already speak Volcano Job. 
**When to use HyperJob** @@ -97,85 +102,102 @@ Both HyperJob and Volcano Job are part of the same ecosystem, but they focus on - Distribute parts of a workload across clusters in different regions or data centers. - Useful for latency-sensitive scenarios or to comply with data locality requirements. -## Architecture Overview - -At a high level, HyperJob works as follows (conceptual view): - -1. **User submits a HyperJob** to a control-plane cluster. -2. The HyperJob controller: - - Interprets the desired replicas and resources. - - Applies **splitting policies** (for example, per-cluster replica counts or resource quotas). -3. For each target cluster, the controller creates one or more **underlying Volcano Jobs**. -4. **Volcano Global** and **Karmada** handle: - - Multi-cluster scheduling and `ResourceBinding` management. - - Cross-cluster queue and priority handling. - - Fair sharing and admission control across clusters. -5. HyperJob continuously watches the state of all child jobs and **aggregates status** back to the HyperJob resource. - -In this architecture: - -- HyperJob focuses on **job-level abstraction and splitting**. -- Volcano Job focuses on **in-cluster batch scheduling**. -- Volcano Global + Karmada focus on **multi-cluster coordination and placement**. - -For more details on the multi-cluster layer, see -[Multi-Cluster AI Job Scheduling](/en/docs/multi_cluster_scheduling/) and the -[Volcano Global](https://github.com/volcano-sh/volcano-global) project. +## Example: HyperJob YAML -## Example: HyperJob YAML (Conceptual) +### Case 1: Large-scale Training Job Splitting -The exact HyperJob API is defined in the Volcano design and implementation. -The following example is a **simplified conceptual** manifest to illustrate how a HyperJob can describe -a logical job and its split across clusters. 
For the authoritative API, always refer to the -[HyperJob design document](https://github.com/volcano-sh/volcano/blob/master/docs/design/hyperjob-multi-cluster-job-splitting.md) -and the corresponding CRD definition in the Volcano repository. +A research team wants to train a large language model that requires 256 GPUs, but their largest cluster only has 128 GPUs. Using HyperJob, they can split the training job into two sub-jobs, each with 128 GPUs, and run them across two clusters. ```yaml -apiVersion: batch.volcano.sh/v1alpha1 +apiVersion: training.volcano.sh/v1alpha1 kind: HyperJob metadata: - name: llm-train-hyperjob + name: llm-training spec: - # High-level description of the logical job - template: - apiVersion: batch.volcano.sh/v1alpha1 - kind: Job - spec: - minAvailable: 64 - schedulerName: volcano - queue: global-ai + minAvailable: 2 + maxDomains: 2 + replicatedJobs: + - name: trainer + replicas: 2 + templateSpec: tasks: - - name: trainer - replicas: 64 - template: - spec: - containers: - - name: trainer - image: example.com/llm-train:latest - resources: - requests: - cpu: "8" - memory: "64Gi" - nvidia.com/gpu: "1" - restartPolicy: OnFailure - - # Conceptual splitting policy (names and fields are illustrative) - splitPolicy: - strategy: ByCluster - clusters: - - name: cluster-a - replicas: 32 - - name: cluster-b - replicas: 32 + - name: worker + replicas: 128 + template: + spec: + containers: + - name: trainer + image: training-image:v1 + resources: + requests: + nvidia.com/gpu: 1 ``` -In practice, the real HyperJob spec may include: +### Case 2: Heterogeneous Clusters -- More detailed **cluster selection** and **constraints**. -- Fields to describe how to **map HyperJob status** from child jobs. -- Policies for **retry, rollback, and cleanup** across clusters. +An organization has multiple clusters with different generations of accelerators (e.g., Ascend NPU 910B and 910C). They need to run a training job across these heterogeneous clusters. 
-Always check the latest Volcano documentation and code for the exact API. +```yaml +apiVersion: training.volcano.sh/v1alpha1 +kind: HyperJob +metadata: + name: ascend-heterogeneous-training +spec: + minAvailable: 2 + replicatedJobs: + - name: trainer-910b + replicas: 1 + clusterNames: ["cluster-ascend-910b-1", "cluster-ascend-910b-2"] + templateSpec: + tasks: + - name: worker + replicas: 64 + template: + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: hardware-type + operator: In + values: + - Ascend910B + containers: + - name: trainer + image: training-image:v1 + resources: + requests: + ascend910c: 1 + limits: + ascend910c: 1 + - name: trainer-910c + replicas: 1 + clusterNames: ["cluster-ascend-910c-1"] + templateSpec: + tasks: + - name: worker + replicas: 64 + template: + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: hardware-type + operator: In + values: + - Ascend910C + containers: + - name: trainer + image: training-image:v1 + resources: + requests: + ascend910c: 1 + limits: + ascend910c: 1 +``` ## Related Concepts and References diff --git a/content/zh/docs/hyperjob.md b/content/zh/docs/hyperjob.md index 3fdda257..2a5af63d 100644 --- a/content/zh/docs/hyperjob.md +++ b/content/zh/docs/hyperjob.md @@ -28,7 +28,7 @@ HyperJob 主要解决以下问题: - 需要对实际运行在多个集群中的作业,仍然只保留 **一个统一的控制与状态视图**。 HyperJob 以 Volcano Job 为执行单元,并结合 **Volcano Global** 与 **Karmada**, -让多集群 AI 作业的使用体验尽量接近“在单集群中运行一个 Job”。 +让多集群 AI 作业的使用体验尽量接近"在单集群中运行一个 Job"。 ## 核心特性与优势 @@ -45,7 +45,7 @@ HyperJob 以 Volcano Job 为执行单元,并结合 **Volcano Global** 与 **Ka - 集群选择、实例分布和资源放置由 HyperJob 控制面和 Volcano Global 共同完成。 - **高层调度能力** - - HyperJob 扮演 **“Job 级别的元调度器(meta-scheduler)”** 角色。 + - HyperJob 扮演 **"Job 级别的元调度器(meta-scheduler)"** 角色。 - 负责确定多少副本下发到哪个集群,再由各集群内的 Volcano Job 完成批量调度 (包括 Gang 调度、公平共享、队列优先级等)。 @@ -55,16 +55,17 @@ HyperJob 以 Volcano Job 
为执行单元,并结合 **Volcano Global** 与 **Ka ## HyperJob 与普通 Volcano Job 的对比 -HyperJob 与 Volcano Job 同属一个调度体系,但关注的粒度不同: +HyperJob 构建在 Volcano Job 之上,并非替代品。它将 Volcano 的能力扩展到多集群场景,同时在每个集群内保留 Volcano Job 的所有特性。 -| 对比项 | Volcano Job | HyperJob | -|----------------------------|--------------------------------------------------|-------------------------------------------------------------------------------| -| 调度范围 | 单集群 | 多集群 | -| 执行单元 | 单个集群中的一个 `Job` | 一个 HyperJob 对应多个底层 Volcano Job | -| 多集群感知 | 不感知多集群 | 原生面向多集群的高层抽象 | -| 作业拆分 | 不提供 | **内置自动拆分** | -| 状态视图 | 每个集群、每个 Job 各自管理 | **统一的 HyperJob 状态视图** | -| 典型使用场景 | 单集群即可满足资源与调度需求 | 需要聚合多个集群资源或必须跨集群运行的作业 | +| 对比项 | Volcano Job | HyperJob | +|-------------------------|--------------------------------------------------|-------------------------------------------------------------------------------| +| **范围** | 单集群 | 多集群 | +| **抽象层级** | 集群级原语(管理 Pod) | 元级原语(管理 Volcano Job) | +| **主要用例** | 批量工作负载调度 | 跨异构集群的大规模训练 | +| **作业组成** | 单个作业包含多个任务 | 多个 Volcano Job 的组合 | +| **状态跟踪** | 跟踪单个作业内的 Pod | 聚合多个集群中多个 Volcano Job 的状态 | + +HyperJob 专为训练需求超出单集群容量或需要利用不同集群的异构加速器资源的场景而设计。 **适合直接使用 Volcano Job 的场景** @@ -96,85 +97,102 @@ HyperJob 与 Volcano Job 同属一个调度体系,但关注的粒度不同: - 将训练任务拆分到不同地域或机房的集群中执行。 - 可结合数据本地化、合规要求、链路延迟等因素进行策略控制。 -## 架构概览 - -HyperJob 的典型工作流程如下(概念性描述): - -1. **用户在控制平面集群中提交一个 HyperJob。** -2. HyperJob 控制器: - - 解析期望的副本数和资源需求。 - - 根据 **拆分策略** 决定各目标集群的副本与资源分配。 -3. 对每个目标集群,控制器创建一个或多个 **底层 Volcano Job**。 -4. **Volcano Global** 与 **Karmada** 负责: - - 多集群的调度与 `ResourceBinding` 管理。 - - 跨集群队列与作业优先级管理。 - - 多租户公平调度与资源准入控制。 -5. 
HyperJob 持续跟踪所有子 Job 的状态,并将其 **聚合回 HyperJob 状态** 中。 - -在该架构中: +## HyperJob YAML 示例 -- HyperJob 关注 **作业级抽象与拆分逻辑**。 -- Volcano Job 关注 **单集群内的批量调度能力**。 -- Volcano Global + Karmada 关注 **多集群维度的资源协调与放置**。 +### 场景 1:大规模训练作业拆分 -多集群架构的详细介绍,可参考 -[多集群AI作业调度](/zh/docs/multi_cluster_scheduling/) 与 -[Volcano Global](https://github.com/volcano-sh/volcano-global) 项目。 - -## HyperJob YAML 示例(概念性) - -HyperJob 的具体 API 以 Volcano 设计与实现为准。 -下面示例为 **简化的概念示例**,用于帮助理解 HyperJob 如何描述一个逻辑作业及其跨集群拆分方式。 -权威、最新的字段定义请以 -[HyperJob 设计文档](https://github.com/volcano-sh/volcano/blob/master/docs/design/hyperjob-multi-cluster-job-splitting.md) -及 Volcano 仓库中的 CRD 定义为准。 +研究团队希望训练一个需要 256 个 GPU 的大型语言模型,但他们最大的集群只有 128 个 GPU。使用 HyperJob,他们可以将训练作业拆分为两个子作业,每个子作业使用 128 个 GPU,并在两个集群上运行。 ```yaml -apiVersion: batch.volcano.sh/v1alpha1 +apiVersion: training.volcano.sh/v1alpha1 kind: HyperJob metadata: - name: llm-train-hyperjob + name: llm-training spec: - # 逻辑作业的高层模板 - template: - apiVersion: batch.volcano.sh/v1alpha1 - kind: Job - spec: - minAvailable: 64 - schedulerName: volcano - queue: global-ai + minAvailable: 2 + maxDomains: 2 + replicatedJobs: + - name: trainer + replicas: 2 + templateSpec: tasks: - - name: trainer - replicas: 64 - template: - spec: - containers: - - name: trainer - image: example.com/llm-train:latest - resources: - requests: - cpu: "8" - memory: "64Gi" - nvidia.com/gpu: "1" - restartPolicy: OnFailure - - # 拆分策略(字段名称仅作示意,实际以实现为准) - splitPolicy: - strategy: ByCluster - clusters: - - name: cluster-a - replicas: 32 - - name: cluster-b - replicas: 32 + - name: worker + replicas: 128 + template: + spec: + containers: + - name: trainer + image: training-image:v1 + resources: + requests: + nvidia.com/gpu: 1 ``` -在真实环境中,HyperJob 规范可能还包括: +### 场景 2:异构集群 -- 更精细的 **集群选择与约束条件**。 -- 描述如何将子 Job 状态 **聚合映射为 HyperJob 状态** 的字段。 -- 跨集群的 **重试、回滚与清理策略** 等高级能力。 +某组织拥有多个具有不同代次加速器的集群(例如 Ascend NPU 910B 和 910C)。他们需要在这些异构集群上运行训练作业。 -请始终参考最新的 Volcano 文档与代码获取准确 API。 +```yaml +apiVersion: training.volcano.sh/v1alpha1 
+kind: HyperJob +metadata: + name: ascend-heterogeneous-training +spec: + minAvailable: 2 + replicatedJobs: + - name: trainer-910b + replicas: 1 + clusterNames: ["cluster-ascend-910b-1", "cluster-ascend-910b-2"] + templateSpec: + tasks: + - name: worker + replicas: 64 + template: + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: hardware-type + operator: In + values: + - Ascend910B + containers: + - name: trainer + image: training-image:v1 + resources: + requests: + ascend910b: 1 + limits: + ascend910b: 1 + - name: trainer-910c + replicas: 1 + clusterNames: ["cluster-ascend-910c-1"] + templateSpec: + tasks: + - name: worker + replicas: 64 + template: + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: hardware-type + operator: In + values: + - Ascend910C + containers: + - name: trainer + image: training-image:v1 + resources: + requests: + ascend910c: 1 + limits: + ascend910c: 1 +``` ## 相关概念与参考链接 From b4b0aa2eb75f0d34972adec12ab15a4769076ecc Mon Sep 17 00:00:00 2001 From: Lakshya Saxena Date: Thu, 12 Feb 2026 13:02:30 +0000 Subject: [PATCH 5/6] revert multi_cluster_scheduling changes Signed-off-by: Lakshya Saxena --- content/en/docs/multi_cluster_scheduling.md | 3 --- content/zh/docs/multi_cluster_scheduling.md | 3 --- 2 files changed, 6 deletions(-) diff --git a/content/en/docs/multi_cluster_scheduling.md b/content/en/docs/multi_cluster_scheduling.md index 290d1932..de8efacb 100644 --- a/content/en/docs/multi_cluster_scheduling.md +++ b/content/en/docs/multi_cluster_scheduling.md @@ -51,6 +51,3 @@ Volcano Global consists of two components: ## Usage Guide Please refer to: [Volcano Global Deploy](https://github.com/volcano-sh/volcano-global/blob/main/docs/deploy/README.md). 
- -For a high-level multi-cluster abstraction that can automatically split a logical job into multiple Volcano Jobs -across clusters, see [HyperJob](/en/docs/hyperjob/). diff --git a/content/zh/docs/multi_cluster_scheduling.md b/content/zh/docs/multi_cluster_scheduling.md index 0a01fe4d..06131f00 100644 --- a/content/zh/docs/multi_cluster_scheduling.md +++ b/content/zh/docs/multi_cluster_scheduling.md @@ -47,6 +47,3 @@ Volcano global主要包含两个组件: ## 使用指导 请参考: [Volcano global deploy](https://github.com/volcano-sh/volcano-global/blob/main/docs/deploy/README.md)。 - -如果希望在多集群环境下,以一个高层抽象来统一描述作业并自动拆分为多个跨集群的 Volcano Job, -可以参考 [HyperJob](/zh/docs/hyperjob/) 概念文档。 From 126d149878cb01c41db072e9667a80ebd627b4e6 Mon Sep 17 00:00:00 2001 From: Lakshya Saxena Date: Thu, 12 Feb 2026 13:06:14 +0000 Subject: [PATCH 6/6] fix rebase conflicts Signed-off-by: Lakshya Saxena --- content/en/docs/hyperjob.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/content/en/docs/hyperjob.md b/content/en/docs/hyperjob.md index 675c38a5..861c7f31 100644 --- a/content/en/docs/hyperjob.md +++ b/content/en/docs/hyperjob.md @@ -1,8 +1,4 @@ -<<<<<<< HEAD -+++ -======= +++ ->>>>>>> 21db252 (fix/refactor: Introduce HyperJob multi-cluster job splitting concept with comparison table, real-world use cases, and complete YAML examples from design doc. Removed architecture overview to keep focus on user-facing concepts) title = "HyperJob" description = "Multi-cluster job splitting and high-level scheduling with HyperJob" date = 2026-02-05