diff --git a/content/en/docs/hyperjob.md b/content/en/docs/hyperjob.md
new file mode 100644
index 00000000..861c7f31
--- /dev/null
+++ b/content/en/docs/hyperjob.md
@@ -0,0 +1,210 @@
++++
+title = "HyperJob"
+description = "Multi-cluster job splitting and high-level scheduling with HyperJob"
+date = 2026-02-05
+lastmod = 2026-02-05
+
+draft = false # Is this a draft? true/false
+toc = true # Show table of contents? true/false
+type = "docs" # Do not modify.
+
+# Add menu entry to sidebar.
+linktitle = "HyperJob"
+[menu.docs]
+  parent = "concepts"
+  weight = 4
++++
+
+## Overview
+
+HyperJob is a high-level scheduling abstraction built on top of Volcano Job for **multi-cluster AI workloads**.
+Instead of submitting and managing multiple Volcano Jobs in different clusters, users submit a single HyperJob,
+and the system automatically **splits**, **dispatches**, and **tracks** the underlying jobs across clusters.
+
+HyperJob is designed for scenarios where:
+
+- A single Kubernetes cluster **does not have enough resources** for a large AI/ML training job.
+- You want to **combine capacity from multiple clusters** (for example, across regions or environments).
+- You need a **unified control plane and status view** for a job that actually runs in many clusters.
+
+HyperJob leverages Volcano Job as the basic execution unit and works together with **Volcano Global** and
+**Karmada** to make multi-cluster job orchestration as simple as running a single job in one cluster.
+
+## Key Features and Advantages
+
+- **Automatic Job Splitting**
+  - Splits a large logical job into multiple child jobs that can be scheduled to different clusters.
+  - Splitting can be based on replicas, resources, or other policies defined in HyperJob.
+
+- **Unified Status Management**
+  - HyperJob exposes **one high-level status** that aggregates the states of all underlying jobs.
+  - Users can start, stop, and observe the entire multi-cluster workload from a single resource.
+
+- **Simplified Multi-Cluster Usage**
+  - Users no longer need to manually create and manage jobs per cluster.
+  - Cluster selection and placement policies are handled by the HyperJob control plane plus Volcano Global.
+
+- **High-Level Scheduling**
+  - Acts as a **meta-scheduler** on top of Volcano Job.
+  - Decides how many replicas go to which cluster, and then lets Volcano Job handle in-cluster scheduling
+    (gang scheduling, fair sharing, queue priority, and so on).
+
+- **Resource Optimization Across Clusters**
+  - Makes it easier to **utilize fragmented or heterogeneous capacity** across multiple clusters.
+  - Can spread load to avoid hot spots and improve overall throughput for large AI/ML workloads.
+
+## HyperJob vs. Standard Volcano Job
+
+HyperJob is built on top of Volcano Job, not as a replacement. It extends Volcano's capabilities to multi-cluster scenarios while preserving all the features of Volcano Job within each cluster.
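+
+For reference, the single-cluster building block that HyperJob composes is a plain Volcano Job. The
+following minimal sketch shows that shape (the image name and replica counts are illustrative, not part
+of the HyperJob examples later in this page):
+
+```yaml
+apiVersion: batch.volcano.sh/v1alpha1
+kind: Job
+metadata:
+  name: single-cluster-training
+spec:
+  minAvailable: 4          # gang scheduling: all 4 workers must start together
+  schedulerName: volcano
+  tasks:
+    - name: worker
+      replicas: 4
+      template:
+        spec:
+          restartPolicy: Never
+          containers:
+            - name: trainer
+              image: training-image:v1
+              resources:
+                requests:
+                  nvidia.com/gpu: 1
+```
+
+A HyperJob wraps task templates like this into `replicatedJobs` entries and additionally decides in which
+cluster each copy runs, as the comparison below shows.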
+
+| Aspect | Volcano Job | HyperJob |
+|--------|-------------|----------|
+| **Scope** | Single cluster | Multiple clusters |
+| **Abstraction Level** | Cluster-level primitive (manages Pods) | Meta-level primitive (manages Volcano Jobs) |
+| **Primary Use Case** | Batch workload scheduling | Large-scale training across heterogeneous clusters |
+| **Job Composition** | Single job with multiple tasks | Composition of multiple Volcano Jobs |
+| **Status Tracking** | Tracks pods within a single job | Aggregates status from multiple Volcano Jobs across clusters |
+
+HyperJob is designed for scenarios where training requirements exceed the capacity of a single cluster,
+or where a job must draw on heterogeneous accelerator resources spread across different clusters.
+
+**When to use Volcano Job**
+
+- You only run in a single cluster.
+- The workload size fits within that cluster's resource capacity.
+- You want straightforward integration with existing controllers or pipelines that already speak Volcano Job.
+
+**When to use HyperJob**
+
+- Your AI/ML workload (for example, LLM pre-training or fine-tuning) needs **aggregate resources from multiple clusters**.
+- You want a **single submission and control interface** for a distributed training or inference job.
+- You need to **combine heterogeneous resources** (different GPU models, zones, or hardware generations) while
+  letting the system choose where each part of the job runs.
+
+## Typical Use Cases
+
+- **Multi-Cluster LLM Training**
+  - Train a large language model where a single cluster cannot provide enough GPUs or network bandwidth.
+  - HyperJob splits replicas across multiple clusters while presenting a single logical job to the user.
+
+- **Heterogeneous Resource Scheduling**
+  - Combine clusters with different GPU types (for example, A100, H100, or other accelerators).
+  - HyperJob can assign subtasks to the most suitable cluster based on resource type and availability.
+
+- **Resource Overflow and Bursting**
+  - When a primary cluster is close to saturation, HyperJob can place additional replicas into other clusters
+    without changing user-facing APIs.
+
+- **Geographically Distributed Training**
+  - Distribute parts of a workload across clusters in different regions or data centers.
+  - Useful for latency-sensitive scenarios or to comply with data locality requirements.
+
+## Example: HyperJob YAML
+
+### Case 1: Large-scale Training Job Splitting
+
+A research team wants to train a large language model that requires 256 GPUs, but their largest cluster
+only has 128 GPUs. Using HyperJob, they can split the training job into two sub-jobs, each with 128 GPUs,
+and run them across two clusters: `replicas: 2` creates two copies of the `trainer` job, and each copy
+runs 128 single-GPU workers.
+
+```yaml
+apiVersion: training.volcano.sh/v1alpha1
+kind: HyperJob
+metadata:
+  name: llm-training
+spec:
+  minAvailable: 2   # number of sub-jobs that must run together for the HyperJob to be available
+  maxDomains: 2     # upper bound on how many clusters the job may be split across
+  replicatedJobs:
+    - name: trainer
+      replicas: 2
+      templateSpec:
+        tasks:
+          - name: worker
+            replicas: 128
+            template:
+              spec:
+                containers:
+                  - name: trainer
+                    image: training-image:v1
+                    resources:
+                      requests:
+                        nvidia.com/gpu: 1
+```
+
+The HyperJob controller turns each replicated job into a Volcano Job in a selected cluster, so the team
+only ever manages the single `llm-training` resource.
+
+### Case 2: Heterogeneous Clusters
+
+An organization has multiple clusters with different generations of accelerators (for example, Ascend NPU
+910B and 910C). They need to run a single training job across these heterogeneous clusters.
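+
+The `clusterNames` field in the spec below pins each sub-job to specific member clusters. Those names
+must match the clusters registered with the Karmada control plane; one way to check them (the kubeconfig
+path is illustrative and depends on your Karmada installation):
+
+```shell
+# List the member clusters known to the Karmada control plane.
+# The names must match the clusterNames values used in the HyperJob spec.
+kubectl --kubeconfig /etc/karmada/karmada-apiserver.config get clusters
+```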
+
+```yaml
+apiVersion: training.volcano.sh/v1alpha1
+kind: HyperJob
+metadata:
+  name: ascend-heterogeneous-training
+spec:
+  minAvailable: 2
+  replicatedJobs:
+    - name: trainer-910b
+      replicas: 1
+      clusterNames: ["cluster-ascend-910b-1", "cluster-ascend-910b-2"]  # candidate clusters for this sub-job
+      templateSpec:
+        tasks:
+          - name: worker
+            replicas: 64
+            template:
+              spec:
+                affinity:
+                  nodeAffinity:
+                    requiredDuringSchedulingIgnoredDuringExecution:
+                      nodeSelectorTerms:
+                        - matchExpressions:
+                            - key: hardware-type
+                              operator: In
+                              values:
+                                - Ascend910B
+                containers:
+                  - name: trainer
+                    image: training-image:v1
+                    resources:
+                      requests:
+                        ascend910b: 1
+                      limits:
+                        ascend910b: 1
+    - name: trainer-910c
+      replicas: 1
+      clusterNames: ["cluster-ascend-910c-1"]
+      templateSpec:
+        tasks:
+          - name: worker
+            replicas: 64
+            template:
+              spec:
+                affinity:
+                  nodeAffinity:
+                    requiredDuringSchedulingIgnoredDuringExecution:
+                      nodeSelectorTerms:
+                        - matchExpressions:
+                            - key: hardware-type
+                              operator: In
+                              values:
+                                - Ascend910C
+                containers:
+                  - name: trainer
+                    image: training-image:v1
+                    resources:
+                      requests:
+                        ascend910c: 1
+                      limits:
+                        ascend910c: 1
+```
+
+Each sub-job requests the extended resource that matches its node affinity (`ascend910b` for the 910B
+clusters, `ascend910c` for the 910C cluster); the exact resource names depend on the device plugins
+deployed in each cluster.
+
+## Related Concepts and References
+
+- **Volcano Job**: The core batch job abstraction in a single cluster.
+  See [VolcanoJob](/en/docs/vcjob/) for details.
+- **Queue**: Controls resource sharing and priority.
+  See [Queue](/en/docs/queue/) and [Queue Resource Management](/en/docs/queue_resource_management/).
+- **Multi-Cluster AI Job Scheduling**:
+  See [Multi-Cluster AI Job Scheduling](/en/docs/multi_cluster_scheduling/) for Volcano Global architecture
+  and usage patterns.
+- **HyperJob Design Document**:
+  See the detailed design:
+  `https://github.com/volcano-sh/volcano/blob/master/docs/design/hyperjob-multi-cluster-job-splitting.md`
+
diff --git a/content/zh/docs/hyperjob.md b/content/zh/docs/hyperjob.md
new file mode 100644
index 00000000..2a5af63d
--- /dev/null
+++ b/content/zh/docs/hyperjob.md
@@ -0,0 +1,208 @@
++++
+title = "HyperJob"
+description = "基于 HyperJob 的多集群作业拆分与高级调度"
+date = 2026-02-05
+lastmod = 2026-02-05
+
+draft = false # 是否为草稿
+toc = true # 是否显示目录
+type = "docs" # 请勿修改
+
+# 侧边栏菜单
+linktitle = "HyperJob"
+[menu.docs]
+  parent = "concepts"
+  weight = 4
++++
+
+## 概述
+
+HyperJob 是构建在 Volcano Job 之上的 **多集群高层作业抽象**。
+用户不再需要在多个集群中分别创建和管理 Volcano Job,而是只需要提交一个 HyperJob,
+系统就会自动在多个集群之间 **拆分**、**下发** 并 **聚合管理** 底层作业。
+
+HyperJob 主要解决以下问题:
+
+- 单个 Kubernetes 集群 **资源不足以承载** 大规模 AI/ML 训练任务。
+- 希望 **统一利用多个集群的算力**(例如跨地域、跨环境的 GPU 集群)。
+- 需要对实际运行在多个集群中的作业,仍然只保留 **一个统一的控制与状态视图**。
+
+HyperJob 以 Volcano Job 为执行单元,并结合 **Volcano Global** 与 **Karmada**,
+让多集群 AI 作业的使用体验尽量接近"在单集群中运行一个 Job"。
+
+## 核心特性与优势
+
+- **自动作业拆分**
+  - 将一个大型逻辑作业自动拆分为多个子 Job,下发到不同集群执行。
+  - 拆分可以基于副本数、资源配额或其他在 HyperJob 中定义的策略。
+
+- **统一状态管理**
+  - HyperJob 提供一个 **统一的高层状态**,对多个底层 Volcano Job 的状态进行聚合。
+  - 用户可以通过一个 HyperJob 完成启动、停止和状态查看等操作。
+
+- **简化多集群使用**
+  - 用户不需要关心每个集群中具体的 Job 清单。
+  - 集群选择、实例分布和资源放置由 HyperJob 控制面和 Volcano Global 共同完成。
+
+- **高层调度能力**
+  - HyperJob 扮演 **"Job 级别的元调度器(meta-scheduler)"** 角色。
+  - 负责确定多少副本下发到哪个集群,再由各集群内的 Volcano Job 完成批量调度
+    (包括 Gang 调度、公平共享、队列优先级等)。
+
+- **跨集群资源优化**
+  - 便于 **利用多个集群中零散或异构的资源**。
+  - 可以把负载分摊到多个集群,提升整体吞吐与资源利用率。
+
+## HyperJob 与普通 Volcano Job 的对比
+
+HyperJob 构建在 Volcano Job 之上,并非其替代品。它将 Volcano 的能力扩展到多集群场景,同时在每个集群内保留 Volcano Job 的所有特性。
+
+| 对比项 | Volcano Job | HyperJob |
+|--------|-------------|----------|
+| **范围** | 单集群 | 多集群 |
+| **抽象层级** | 集群级原语(管理 Pod) | 元级原语(管理 Volcano Job) |
+| **主要用例** | 批量工作负载调度 | 跨异构集群的大规模训练 |
+| **作业组成** | 单个作业包含多个任务 | 多个 Volcano Job 的组合 |
+| **状态跟踪** | 跟踪单个作业内的 Pod | 聚合多个集群中多个 Volcano Job 的状态 |
+
+HyperJob 专为训练需求超出单集群容量,或需要利用不同集群异构加速器资源的场景而设计。
+
+**适合直接使用 Volcano Job 的场景**
+
+- 所有工作负载只运行在 **单个集群** 中。
+- 集群本身的资源足以容纳训练或推理任务。
+- 已有控制器/流水线已经与 Volcano Job 打通,只需要单集群调度能力。
+
+**适合使用 HyperJob 的场景**
+
+- LLM 预训练/大规模分布式训练等,需要 **多个集群的总算力** 才能满足需求。
+- 希望对跨集群的训练/推理任务,仍然只提交 **一个逻辑作业** 并进行统一管理。
+- 需要组合多种 **异构资源**(不同 GPU 型号、不同可用区或机型),让系统自动选择合适的集群。
+
+## 典型使用场景
+
+- **多集群 LLM 训练**
+  - 单个集群 GPU 规模不足时,将训练副本拆分到多个集群执行。
+  - 用户只需要管理一个 HyperJob,对底层多个 Volcano Job 无感知。
+
+- **异构资源调度**
+  - 组合多个拥有不同 GPU 型号或算力配置的集群。
+  - HyperJob 根据资源类型与可用性,将不同子任务分配到最合适的集群中。
+
+- **资源溢出与弹性扩展**
+  - 当主集群资源接近饱和时,可通过 HyperJob 自动将多余副本调度到其他集群。
+  - 无需改动上层作业定义或业务逻辑。
+
+- **跨地域/多机房训练**
+  - 将训练任务拆分到不同地域或机房的集群中执行。
+  - 可结合数据本地化、合规要求、链路延迟等因素进行策略控制。
+
+## HyperJob YAML 示例
+
+### 场景 1:大规模训练作业拆分
+
+研究团队希望训练一个需要 256 个 GPU 的大型语言模型,但他们最大的集群只有 128 个 GPU。使用 HyperJob,可以将训练作业拆分为两个子作业,每个子作业使用 128 个 GPU,并在两个集群上运行:`replicas: 2` 会创建两份 `trainer` 子作业,每份运行 128 个单 GPU worker。
+
+```yaml
+apiVersion: training.volcano.sh/v1alpha1
+kind: HyperJob
+metadata:
+  name: llm-training
+spec:
+  minAvailable: 2   # HyperJob 视为可用所需同时运行的子作业数
+  maxDomains: 2     # 作业最多可拆分到的集群数量上限
+  replicatedJobs:
+    - name: trainer
+      replicas: 2
+      templateSpec:
+        tasks:
+          - name: worker
+            replicas: 128
+            template:
+              spec:
+                containers:
+                  - name: trainer
+                    image: training-image:v1
+                    resources:
+                      requests:
+                        nvidia.com/gpu: 1
+```
+
+### 场景 2:异构集群
+
+某组织拥有多个使用不同代次加速器的集群(例如 Ascend NPU 910B 和 910C),需要在这些异构集群上运行同一个训练作业。`clusterNames` 字段用于将每个子作业固定到指定的成员集群。
+
+```yaml
+apiVersion: training.volcano.sh/v1alpha1
+kind: HyperJob
+metadata:
+  name: ascend-heterogeneous-training
+spec:
+  minAvailable: 2
+  replicatedJobs:
+    - name: trainer-910b
+      replicas: 1
+      clusterNames: ["cluster-ascend-910b-1", "cluster-ascend-910b-2"]  # 该子作业的候选集群
+      templateSpec:
+        tasks:
+          - name: worker
+            replicas: 64
+            template:
+              spec:
+                affinity:
+                  nodeAffinity:
+                    requiredDuringSchedulingIgnoredDuringExecution:
+                      nodeSelectorTerms:
+                        - matchExpressions:
+                            - key: hardware-type
+                              operator: In
+                              values:
+                                - Ascend910B
+                containers:
+                  - name: trainer
+                    image: training-image:v1
+                    resources:
+                      requests:
+                        ascend910b: 1
+                      limits:
+                        ascend910b: 1
+    - name: trainer-910c
+      replicas: 1
+      clusterNames: ["cluster-ascend-910c-1"]
+      templateSpec:
+        tasks:
+          - name: worker
+            replicas: 64
+            template:
+              spec:
+                affinity:
+                  nodeAffinity:
+                    requiredDuringSchedulingIgnoredDuringExecution:
+                      nodeSelectorTerms:
+                        - matchExpressions:
+                            - key: hardware-type
+                              operator: In
+                              values:
+                                - Ascend910C
+                containers:
+                  - name: trainer
+                    image: training-image:v1
+                    resources:
+                      requests:
+                        ascend910c: 1
+                      limits:
+                        ascend910c: 1
+```
+
+每个子作业请求与其节点亲和性匹配的扩展资源(910B 集群为 `ascend910b`,910C 集群为 `ascend910c`);实际资源名称取决于各集群部署的设备插件。
+
+## 相关概念与参考链接
+
+- **Volcano Job**:单集群批作业的核心抽象。
+  详情见 [VolcanoJob](/zh/docs/vcjob/)。
+- **Queue(队列)**:控制资源共享与优先级。
+  详情见 [Queue](/zh/docs/queue/) 与 [队列资源管理](/zh/docs/queue_resource_management/)。
+- **多集群AI作业调度**:
+  详情见 [多集群AI作业调度](/zh/docs/multi_cluster_scheduling/),了解 Volcano Global 架构与使用方式。
+- **HyperJob 设计文档**:
+  设计细节见:
+  `https://github.com/volcano-sh/volcano/blob/master/docs/design/hyperjob-multi-cluster-job-splitting.md`
+
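+作为补充,下面给出一个提交 HyperJob 并查看其聚合状态的最小操作示意(假设 HyperJob CRD 已随
+Volcano Global 控制面安装,文件名与命令仅为示例):
+
+```shell
+# 在 Volcano Global / Karmada 控制面提交 HyperJob
+kubectl apply -f llm-training.yaml
+
+# 查看 HyperJob 的聚合状态;底层子 Job 由系统在各成员集群中自动创建
+kubectl get hyperjob llm-training -o yaml
+```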