diff --git a/content/en/blog/Meet Cloud Native Batch Computing with Volcano in AI & Big Data Scenarios.md b/content/en/blog/Meet Cloud Native Batch Computing with Volcano in AI & Big Data Scenarios.md index 1dbacf7c..6d959d46 100644 --- a/content/en/blog/Meet Cloud Native Batch Computing with Volcano in AI & Big Data Scenarios.md +++ b/content/en/blog/Meet Cloud Native Batch Computing with Volcano in AI & Big Data Scenarios.md @@ -7,7 +7,7 @@ date = 2024-03-08 lastmod = 2024-03-08 datemonth = "Mar" dateyear = "2024" -dateday = 08 +dateday = "08" draft = false # Is this a draft? true/false toc = true # Show table of contents? true/false diff --git a/content/en/blog/Volcano-1.11.0-release.md b/content/en/blog/Volcano-1.11.0-release.md index 6afa798c..2a51ebc8 100644 --- a/content/en/blog/Volcano-1.11.0-release.md +++ b/content/en/blog/Volcano-1.11.0-release.md @@ -7,7 +7,7 @@ date = 2025-02-07 lastmod = 2025-02-07 datemonth = "Feb" dateyear = "2025" -dateday = 07 +dateday = "07" draft = false # Is this a draft? true/false toc = true # Show table of contents? true/false diff --git a/content/en/blog/how-volcano-boosts-distributed-training-and-inference-performance.md b/content/en/blog/how-volcano-boosts-distributed-training-and-inference-performance.md index 77699dcf..c91dc2c1 100644 --- a/content/en/blog/how-volcano-boosts-distributed-training-and-inference-performance.md +++ b/content/en/blog/how-volcano-boosts-distributed-training-and-inference-performance.md @@ -7,7 +7,7 @@ date = 2025-04-01 lastmod = 2025-04-01 datemonth = "Apr" dateyear = "2025" -dateday = 01 +dateday = "01" draft = false # Is this a draft? true/false toc = true # Show table of contents? true/false diff --git a/content/en/docs/binpack.md b/content/en/docs/binpack.md new file mode 100644 index 00000000..0fee231c --- /dev/null +++ b/content/en/docs/binpack.md @@ -0,0 +1,105 @@ ++++ +title = "Binpack" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Binpack" +[menu.docs] + parent = "plugins" + weight = 5 ++++ + +## Overview + +The goal of the Binpack scheduling algorithm is to fill existing nodes as much as possible (trying not to allocate to empty nodes). In the concrete implementation, the Binpack scheduling algorithm scores the nodes that can accommodate the task, with higher scores indicating higher resource utilization rates. The Binpack algorithm can fill up nodes as much as possible, consolidating application loads on some nodes, which is very conducive to the Kubernetes cluster's node auto-scaling functionality. + +## How It Works + +The Binpack algorithm is injected into the Volcano Scheduler process as a plugin and is applied during the node selection stage for Pods. When calculating the Binpack score, the Volcano Scheduler considers various resources requested by the Pod and averages them according to the weights configured for each resource. + +Key characteristics: + +- **Resource Weight**: Each resource type (CPU, Memory, GPU, etc.) can have a different weight in the scoring calculation, depending on the weight value configured by the administrator. +- **Plugin Weight**: Different plugins also need to be assigned different weights when calculating node scores. The scheduler also sets score weights for the Binpack plugin. 
+- **NodeOrderFn**: The plugin implements the NodeOrderFn to score nodes based on how efficiently they would be utilized after placing the task. + +## Scenario + +The Binpack algorithm is beneficial for small jobs that can fill as many nodes as possible: + +### Big Data Scenarios + +Single query jobs in big data processing benefit from Binpack by consolidating workloads and maximizing resource utilization on active nodes. + +### E-commerce High Concurrency + +Order generation in e-commerce flash sale scenarios can leverage Binpack to efficiently use available resources during peak loads. + +### AI Inference + +Single identification jobs in AI inference scenarios benefit from consolidated scheduling, reducing resource fragmentation. + +### Internet Services + +High concurrency service scenarios on the Internet benefit from Binpack by reducing fragmentation within nodes and reserving sufficient resource space on idle machines for Pods that have applied for more resource requests, maximizing the utilization of idle resources in the cluster. + +## Configuration + +The Binpack plugin is configured in the scheduler ConfigMap with optional weight parameters: + +```yaml +tiers: +- plugins: + - name: binpack + arguments: + binpack.weight: 10 + binpack.cpu: 1 + binpack.memory: 1 + binpack.resources: nvidia.com/gpu + binpack.resources.nvidia.com/gpu: 2 +``` + +### Configuration Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `binpack.weight` | Overall weight of the Binpack plugin score | 1 | +| `binpack.cpu` | Weight for CPU resource in scoring | 1 | +| `binpack.memory` | Weight for Memory resource in scoring | 1 | +| `binpack.resources` | Additional resources to consider | - | +| `binpack.resources.` | Weight for specific resource type | 1 | + +## Example + +Here's an example scheduler configuration that uses Binpack to prioritize node filling: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: volcano-scheduler-configmap + namespace: volcano-system +data: + volcano-scheduler.conf: | + actions: "enqueue, allocate, backfill" + tiers: + - plugins: + - name: priority + - name: gang + - plugins: + - name: predicates + - name: nodeorder + - name: binpack + arguments: + binpack.weight: 10 + binpack.cpu: 2 + binpack.memory: 1 +``` + +In this configuration, the Binpack plugin is given a weight of 10, and CPU is weighted twice as much as memory in the scoring calculation. diff --git a/content/en/docs/drf.md b/content/en/docs/drf.md new file mode 100644 index 00000000..13b2d803 --- /dev/null +++ b/content/en/docs/drf.md @@ -0,0 +1,119 @@ ++++ +title = "DRF" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "DRF" +[menu.docs] + parent = "plugins" + weight = 7 ++++ + +{{
}} + +## Overview + +The full name of the DRF scheduling algorithm is **Dominant Resource Fairness**, which is a scheduling algorithm based on the container group's Dominant Resource. The Dominant Resource is the largest percentage of all required resources for a container group relative to the total cluster resources. + +The DRF algorithm selects the container group with the smallest Dominant Resource share for priority scheduling. This approach can accommodate more jobs without allowing a single resource-heavy job to starve a large number of smaller jobs. The DRF scheduling algorithm ensures that in an environment where many types of resources coexist, the fair allocation principle is satisfied as much as possible. + +## How It Works + +The DRF plugin: + +1. **Observes Dominant Resource**: For each job, it identifies which resource (CPU, Memory, GPU, etc.) represents the largest share of cluster resources +2. **Calculates Share Value**: Computes each job's share value based on its dominant resource usage +3. **Prioritizes Lower Share**: Jobs with lower share values (using less of their dominant resource) get higher scheduling priority + +Key functions implemented: + +- **JobOrderFn**: Orders jobs based on their dominant resource share, giving priority to jobs with smaller shares +- **PreemptableFn**: Determines if a job can be preempted based on resource fairness calculations + +The plugin attempts to calculate the total amount of resources allocated to the preemptor and preempted tasks, triggering preemption when the preemptor task has fewer resources. + +## Scenario + +The DRF scheduling algorithm gives priority to the throughput of businesses in the cluster and is suitable for batch processing scenarios: + +### AI Training + +Single AI training jobs benefit from DRF as it ensures fair resource allocation across multiple training workloads. + +### Big Data Processing + +Single big data calculation and query jobs can share resources fairly with other workloads in the cluster. + +### Mixed Resource Workloads + +In environments with diverse resource requirements (CPU-intensive, Memory-intensive, GPU-intensive jobs), DRF ensures fair allocation across all resource dimensions. + +## Configuration + +The DRF plugin is configured in the scheduler ConfigMap: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: drf + - name: predicates + - name: proportion +``` + +## Example + +Consider a cluster with the following resources: +- 100 CPUs +- 400 GB Memory + +And two jobs: +- **Job A**: Each task requires 2 CPUs and 8 GB Memory +- **Job B**: Each task requires 1 CPU and 32 GB Memory + +For Job A: +- CPU share per task: 2/100 = 2% +- Memory share per task: 8/400 = 2% +- Dominant resource: CPU and Memory are equal (2%) + +For Job B: +- CPU share per task: 1/100 = 1% +- Memory share per task: 32/400 = 8% +- Dominant resource: Memory (8%) + +With DRF, Job A would be scheduled first because its dominant resource share (2%) is smaller than Job B's (8%). This ensures that neither job can monopolize the cluster by requesting large amounts of a single resource. 
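+
+The dominant-share comparison above can be captured in a few lines. The following is a minimal sketch of that calculation (not the scheduler's actual implementation); the resource totals and requests simply mirror the numbers in the example.
+
+```go
+package main
+
+import "fmt"
+
+// dominantShare returns the largest fraction that any requested resource
+// occupies of the corresponding cluster total.
+func dominantShare(request, cluster map[string]float64) float64 {
+    max := 0.0
+    for name, req := range request {
+        if total, ok := cluster[name]; ok && total > 0 {
+            if share := req / total; share > max {
+                max = share
+            }
+        }
+    }
+    return max
+}
+
+func main() {
+    cluster := map[string]float64{"cpu": 100, "memory": 400} // 100 CPUs, 400 GB
+    jobA := map[string]float64{"cpu": 2, "memory": 8}
+    jobB := map[string]float64{"cpu": 1, "memory": 32}
+
+    fmt.Printf("Job A dominant share: %.0f%%\n", dominantShare(jobA, cluster)*100) // 2%
+    fmt.Printf("Job B dominant share: %.0f%%\n", dominantShare(jobB, cluster)*100) // 8%
+    // DRF schedules the job with the smaller dominant share (Job A) first.
+}
+```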
+ +### VolcanoJob Example + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: drf-example-job +spec: + schedulerName: volcano + minAvailable: 2 + tasks: + - replicas: 2 + name: worker + template: + spec: + containers: + - name: worker + image: busybox + resources: + requests: + cpu: "2" + memory: "8Gi" + limits: + cpu: "2" + memory: "8Gi" +``` diff --git a/content/en/docs/gang.md b/content/en/docs/gang.md new file mode 100644 index 00000000..39c8c73e --- /dev/null +++ b/content/en/docs/gang.md @@ -0,0 +1,91 @@ ++++ +title = "Gang" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Gang" +[menu.docs] + parent = "plugins" + weight = 4 ++++ + +{{
}} + +## Overview + +The Gang scheduling strategy is one of the core scheduling algorithms of the Volcano Scheduler. It meets the scheduling requirements of "All or nothing" in the scheduling process and avoids the waste of cluster resources caused by arbitrary scheduling of Pods. The Gang scheduler algorithm observes whether the scheduled number of Pods under a Job meets the minimum number of runs. When the minimum number of runs of Job is satisfied, the scheduling action is executed for all Pods under the Job; otherwise, it is not executed. + +## How It Works + +The Gang plugin considers tasks not in the `Ready` state (including Binding, Bound, Running, Allocated, Succeed, and Pipelined) as having a higher priority. It checks whether the resources allocated to the queue can meet the resources required by the task to run `minAvailable` pods after trying to evict some pods and reclaim resources. If yes, the Gang plugin will proceed with scheduling. + +Key functions implemented by the Gang plugin: + +- **JobReadyFn**: Checks if a job has enough resources to meet its `minAvailable` requirement +- **JobPipelinedFn**: Checks if a job can be pipelined +- **JobValidFn**: Validates if a job's Gang constraint is satisfied + +## Scenario + +The Gang scheduling algorithm based on the container group concept is well suited for scenarios that require multi-process collaboration: + +### AI and Deep Learning + +AI scenes often contain complex processes including Data Ingestion, Data Analysts, Data Splitting, Trainers, Serving, and Logging. These require a group of containers to work together, making them suitable for the container-based Gang scheduling strategy. + +### MPI and HPC + +Multi-thread parallel computing communication scenarios under the MPI computing framework are also suitable for Gang scheduling because master and slave processes need to work together. Containers under the container group are highly correlated, and there may be resource contention. Overall scheduling allocation can effectively solve deadlock situations. + +### Resource Efficiency + +In the case of insufficient cluster resources, the Gang scheduling strategy can significantly improve the utilization of cluster resources by preventing partial job allocations that would waste resources waiting for other tasks. + +## Configuration + +The Gang plugin is typically enabled by default and configured in the scheduler ConfigMap: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang + - name: conformance +``` + +## Example + +Here's an example of a VolcanoJob that uses Gang scheduling: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: tensorflow-job +spec: + minAvailable: 3 # Gang constraint: at least 3 pods must be schedulable + schedulerName: volcano + tasks: + - replicas: 1 + name: ps + template: + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow:latest + - replicas: 2 + name: worker + template: + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow:latest +``` + +In this example, the job will only be scheduled if all 3 pods (1 ps + 2 workers) can be allocated resources simultaneously. diff --git a/content/en/docs/nodeorder.md b/content/en/docs/nodeorder.md new file mode 100644 index 00000000..909f13e6 --- /dev/null +++ b/content/en/docs/nodeorder.md @@ -0,0 +1,167 @@ ++++ +title = "Nodeorder" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? 
true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Nodeorder" +[menu.docs] + parent = "plugins" + weight = 11 ++++ + +## Overview + +The Nodeorder plugin is a scheduling optimization strategy that scores nodes from various dimensions through simulated assignments to find the node that is best suited for the current task. The scoring parameters are configured by the user. + +## How It Works + +The Nodeorder plugin implements the **NodeOrderFn** to score all nodes for a task using a series of scoring algorithms. The node with the highest score is considered to be the most suitable node for the task. + +Scoring dimensions include: +- **Affinity**: Node and pod affinity/anti-affinity scores +- **LeastRequestedResource**: Prefers nodes with more available resources +- **MostRequestedResource**: Prefers nodes with fewer available resources (consolidation) +- **BalancedResourceAllocation**: Prefers nodes with balanced resource usage +- **ImageLocality**: Prefers nodes that already have the container images + +## Scenario + +The Nodeorder plugin provides scoring criteria across multiple dimensions for scheduling. The combination of different dimensions enables users to flexibly configure appropriate scheduling policies according to their own needs. + +### Workload Distribution + +By adjusting weights for different scoring dimensions, you can control how workloads are distributed across the cluster: +- Use **LeastRequestedResource** to spread workloads evenly +- Use **MostRequestedResource** to consolidate workloads (similar to Binpack) + +### Image Locality Optimization + +For large container images, using **ImageLocality** scoring can reduce pod startup time by preferring nodes that already have the required images. + +### Affinity Optimization + +The **Affinity** dimension ensures that pods are placed according to their affinity and anti-affinity rules. 
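+
+The per-dimension scores are combined with their configured weights (described in the next section) into a single node score. The sketch below is a simplified illustration of that weighted combination using hypothetical sub-scores; it is not the scheduler's exact normalization or formula.
+
+```go
+package main
+
+import "fmt"
+
+// weightedNodeScore combines per-dimension scores (assumed normalized to 0-100)
+// using the configured dimension weights, then applies the overall plugin weight.
+func weightedNodeScore(scores, weights map[string]float64, pluginWeight float64) float64 {
+    sum := 0.0
+    for dim, score := range scores {
+        sum += score * weights[dim]
+    }
+    return sum * pluginWeight
+}
+
+func main() {
+    // Hypothetical sub-scores for a single candidate node.
+    scores := map[string]float64{"leastrequested": 80, "balancedresource": 60, "imagelocality": 100}
+    weights := map[string]float64{"leastrequested": 2, "balancedresource": 1, "imagelocality": 1}
+
+    fmt.Println(weightedNodeScore(scores, weights, 1)) // 80*2 + 60 + 100 = 320
+}
+```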
+ +## Configuration + +The Nodeorder plugin is enabled in the scheduler ConfigMap with configurable weights: + +```yaml +tiers: +- plugins: + - name: predicates + - name: nodeorder + arguments: + nodeorder.weight: 10 + nodeorder.leastrequested.enable: true + nodeorder.mostrequested.enable: false + nodeorder.nodeaffinity.enable: true + nodeorder.podaffinity.enable: true + nodeorder.balancedresource.enable: true + nodeorder.imagelocality.enable: true +``` + +### Configuration Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `nodeorder.weight` | Overall weight of the Nodeorder plugin | 1 | +| `nodeorder.leastrequested.enable` | Enable least requested resource scoring | true | +| `nodeorder.mostrequested.enable` | Enable most requested resource scoring | false | +| `nodeorder.nodeaffinity.enable` | Enable node affinity scoring | true | +| `nodeorder.podaffinity.enable` | Enable pod affinity scoring | true | +| `nodeorder.balancedresource.enable` | Enable balanced resource scoring | true | +| `nodeorder.imagelocality.enable` | Enable image locality scoring | true | +| `nodeorder.leastrequested.weight` | Weight for least requested scoring | 1 | +| `nodeorder.mostrequested.weight` | Weight for most requested scoring | 1 | +| `nodeorder.nodeaffinity.weight` | Weight for node affinity scoring | 1 | +| `nodeorder.podaffinity.weight` | Weight for pod affinity scoring | 1 | +| `nodeorder.balancedresource.weight` | Weight for balanced resource scoring | 1 | +| `nodeorder.imagelocality.weight` | Weight for image locality scoring | 1 | + +## Example + +### Configuration for Spreading Workloads + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: volcano-scheduler-configmap + namespace: volcano-system +data: + volcano-scheduler.conf: | + actions: "enqueue, allocate, backfill" + tiers: + - plugins: + - name: priority + - name: gang + - plugins: + - name: predicates + - name: nodeorder + arguments: + nodeorder.leastrequested.enable: true + nodeorder.leastrequested.weight: 2 + nodeorder.balancedresource.enable: true +``` + +### Configuration for Consolidating Workloads + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: volcano-scheduler-configmap + namespace: volcano-system +data: + volcano-scheduler.conf: | + actions: "enqueue, allocate, backfill" + tiers: + - plugins: + - name: priority + - name: gang + - plugins: + - name: predicates + - name: nodeorder + arguments: + nodeorder.mostrequested.enable: true + nodeorder.mostrequested.weight: 2 + nodeorder.leastrequested.enable: false +``` + +### Job with Pod Affinity + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: affinity-job +spec: + schedulerName: volcano + minAvailable: 2 + tasks: + - replicas: 2 + name: worker + template: + spec: + affinity: + podAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + podAffinityTerm: + labelSelector: + matchLabels: + app: cache + topologyKey: kubernetes.io/hostname + containers: + - name: worker + image: busybox +``` + +The Nodeorder plugin will score nodes higher if they already have pods matching the affinity rules. diff --git a/content/en/docs/numa-aware.md b/content/en/docs/numa-aware.md new file mode 100644 index 00000000..19b2350b --- /dev/null +++ b/content/en/docs/numa-aware.md @@ -0,0 +1,179 @@ ++++ +title = "Numa-aware" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. 
+ +# Add menu entry to sidebar. +linktitle = "Numa-aware" +[menu.docs] + parent = "plugins" + weight = 14 ++++ + +## Overview + +When a node runs many CPU-bound pods, the workload can move to different CPU cores depending on whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and work fine without any intervention. However, in workloads where CPU cache affinity and scheduling latency significantly affect workload performance, special CPU management policies are needed to determine placement preferences on the node. + +## The Challenge + +The CPU Manager and Topology Manager are Kubelet components that help with CPU placement. However, they have the following limitations: + +1. **Scheduler Unawareness**: The scheduler is not topology-aware. This means a pod might be scheduled on a node only to fail due to the Topology Manager. This is unacceptable for TensorFlow jobs—if any worker or parameter server fails on a node, the entire job will fail. + +2. **Node-level Only**: These managers operate at the node level, which results in an inability to match the best node for NUMA topology across the entire cluster. + +## How Numa-aware Plugin Works + +The Numa-aware plugin aims to address these limitations: + +- **CPU Resource Topology Scheduling**: Supports scheduling based on CPU topology +- **Pod-level Topology Policies**: Supports topology policies at the pod level + +The plugin: +1. Collects NUMA topology information from nodes +2. Evaluates CPU and memory placement requirements +3. Scores nodes based on NUMA affinity +4. Ensures tasks are placed on nodes that can satisfy their topology requirements + +## Scenario + +Common scenarios for NUMA-aware scheduling are computation-intensive jobs that are sensitive to CPU parameters and scheduling delays: + +### Scientific Computing + +High-performance scientific calculations benefit from NUMA-aware scheduling to ensure optimal memory access patterns. + +### Video Processing + +Video decoding workloads can achieve better performance when scheduled with NUMA awareness. + +### Animation Rendering + +Animation rendering jobs that are CPU-intensive benefit from optimized CPU and memory placement. + +### Big Data Offline Processing + +Large-scale data processing jobs can achieve better throughput with NUMA-optimized scheduling. 
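+
+At its core, a `single-numa-node` placement reduces to checking whether any one NUMA node can satisfy the request on its own. The sketch below illustrates that check with hypothetical free-resource figures; it is a conceptual simplification, not the plugin's implementation.
+
+```go
+package main
+
+import "fmt"
+
+// numaNode describes the free resources on one NUMA node.
+type numaNode struct {
+    freeCPU    int
+    freeMemGiB int
+}
+
+// fitsSingleNUMANode reports whether at least one NUMA node can satisfy the
+// request by itself, which is what a single-numa-node policy demands.
+func fitsSingleNUMANode(nodes []numaNode, cpu, memGiB int) bool {
+    for _, n := range nodes {
+        if n.freeCPU >= cpu && n.freeMemGiB >= memGiB {
+            return true
+        }
+    }
+    return false
+}
+
+func main() {
+    // Two NUMA nodes, 16 cores / 64 GiB each, with some resources already in use.
+    nodes := []numaNode{{freeCPU: 6, freeMemGiB: 40}, {freeCPU: 12, freeMemGiB: 48}}
+    fmt.Println(fitsSingleNUMANode(nodes, 8, 32)) // true: the second NUMA node can hold 8 CPUs / 32 GiB
+}
+```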
+ +## Configuration + +### Enable Topology Manager on Nodes + +First, ensure the Kubelet is configured with topology management: + +```yaml +# kubelet configuration +topologyManagerPolicy: single-numa-node +cpuManagerPolicy: static +``` + +### Scheduler Configuration + +Enable the Numa-aware plugin: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: predicates + - name: nodeorder + - name: numa-aware + arguments: + numa-aware.weight: 10 +``` + +### Configuration Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `numa-aware.weight` | Weight of NUMA-aware scoring | 1 | + +## Example + +### Node with NUMA Topology + +A typical NUMA node might have: +- 2 NUMA nodes +- Each with 16 CPU cores +- Each with 64GB memory + +``` +NUMA Node 0: CPU 0-15, 64GB Memory +NUMA Node 1: CPU 16-31, 64GB Memory +``` + +### Job Requiring NUMA Awareness + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: numa-aware-job +spec: + schedulerName: volcano + minAvailable: 1 + tasks: + - replicas: 1 + name: compute + template: + metadata: + annotations: + volcano.sh/numa-topology-policy: single-numa-node + spec: + containers: + - name: compute + image: compute-intensive-app:latest + resources: + requests: + cpu: "8" + memory: "32Gi" + limits: + cpu: "8" + memory: "32Gi" +``` + +In this example: +- The job requests 8 CPUs and 32GB memory +- The NUMA policy requires all resources from a single NUMA node +- The scheduler will find a node that can satisfy this requirement from a single NUMA node + +### Pod with Topology Policy Annotation + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: numa-sensitive-pod + annotations: + volcano.sh/numa-topology-policy: best-effort +spec: + schedulerName: volcano + containers: + - name: app + image: myapp:latest + resources: + requests: + cpu: "4" + memory: "16Gi" + limits: + cpu: "4" + memory: "16Gi" +``` + +### NUMA Topology Policies + +The plugin supports several topology policies: + +| Policy | Description | +|--------|-------------| +| `none` | No NUMA preference | +| `best-effort` | Try to place on optimal NUMA node, but don't fail if not possible | +| `restricted` | Only place on nodes that can satisfy the NUMA requirement | +| `single-numa-node` | All resources must come from a single NUMA node | diff --git a/content/en/docs/plugins-overview.md b/content/en/docs/plugins-overview.md new file mode 100644 index 00000000..6bf43719 --- /dev/null +++ b/content/en/docs/plugins-overview.md @@ -0,0 +1,57 @@ ++++ +title = "Plugins" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Plugins" +[menu.docs] + name = "Plugins" + parent = "scheduler" + weight = 3 + identifier = "plugins" ++++ + +## Overview + +Plugins are the core components of Volcano scheduler that provide specific scheduling algorithms and strategies. They work together with [Actions](/en/docs/actions/) to implement the scheduling logic. While actions define what should be done in each scheduling step, plugins provide the detailed algorithms for how it should be done. + +## How Plugins Work + +Plugins are registered with the scheduler and are called during the execution of actions. 
Each plugin can implement one or more of the following functions: + +- **JobOrderFn**: Determines the order of jobs in the scheduling queue +- **TaskOrderFn**: Determines the order of tasks within a job +- **PredicateFn**: Filters out nodes that cannot accommodate a task +- **NodeOrderFn**: Scores nodes to find the best fit for a task +- **PreemptableFn**: Identifies tasks that can be preempted +- **ReclaimableFn**: Identifies tasks that can be reclaimed +- **JobReadyFn**: Checks if a job is ready to be scheduled +- **JobPipelinedFn**: Checks if a job can be pipelined + +## Available Plugins + +Volcano provides the following plugins: + +| Plugin | Description | +|--------|-------------| +| [Gang](/en/docs/gang/) | Ensures all tasks of a job are scheduled together (All or nothing) | +| [Binpack](/en/docs/binpack/) | Packs tasks onto nodes to maximize resource utilization | +| [Priority](/en/docs/priority/) | Sorts jobs and tasks based on priority | +| [DRF](/en/docs/drf/) | Dominant Resource Fairness for fair resource allocation | +| [Proportion](/en/docs/proportion/) | Queue-based resource allocation with proportional sharing | +| [Task-topology](/en/docs/task-topology/) | Considers task affinity and anti-affinity within a job | +| [Predicates](/en/docs/predicates/) | Filters nodes based on predicates including GPU requirements | +| [Nodeorder](/en/docs/nodeorder/) | Scores nodes using various dimensions | +| [SLA](/en/docs/sla/) | Service Level Agreement constraints for jobs | +| [TDM](/en/docs/tdm/) | Time Division Multiplexing for shared node resources | +| [Numa-aware](/en/docs/numa-aware/) | NUMA topology-aware scheduling for CPU-bound workloads | + +## Plugin Configuration + +Plugins are configured in the Volcano scheduler ConfigMap. diff --git a/content/en/docs/plugins.md b/content/en/docs/plugins.md deleted file mode 100644 index cc8020ba..00000000 --- a/content/en/docs/plugins.md +++ /dev/null @@ -1,176 +0,0 @@ -+++ -title = "Plugins" - -date = 2021-05-13 -lastmod = 2021-05-13 - -draft = false # Is this a draft? true/false -toc = true # Show table of contents? true/false -type = "docs" # Do not modify. - -# Add menu entry to sidebar. -linktitle = "Plugins" -[menu.docs] - parent = "scheduler" - weight = 3 -+++ - - - -### Gang - -{{
}} - -#### Overview - -The Gang scheduling strategy is one of the core scheduling algorithms of the Volcano-Scheduler. It meets the scheduling requirements of "All or nothing" in the scheduling process and avoids the waste of cluster resources caused by arbitrary scheduling of Pod. The Gang scheduler algorithm is to observe whether the scheduled number of Pods under Job meets the minimum number of runs. When the minimum number of runs of Job is satisfied, the scheduling action is executed for all Pods under Job; otherwise, it is not executed. - -#### Scenario - -The Gang scheduling algorithm based on the container group concept is well suited for scenarios that require multi-process collaboration. AI scenes often contain complex processes. Data Ingestion, Data Analysts, Data Splitting, trainers, Serving, Logging, etc., which require a group of containers to work together, are suitable for container-based Gang scheduling strategies. Multi-thread parallel computing communication scenarios under MPI computing framework are also suitable for Gang scheduling because master and slave processes need to work together. Containers under the container group are highly correlated, and there may be resource contention. The overall scheduling allocation can effectively solve the deadlock. - -In the case of insufficient cluster resources, the scheduling strategy of Gang can significantly improve the utilization of cluster resources. - -### Binpack - -#### Overview - -The goal of the BinPack scheduling algorithm is to fill as many existing nodes as possible (try not to allocate blank nodes). In the concrete implementation, BinPack scheduling algorithm scores the nodes that can be delivered, and the higher the score, the higher the resource utilization rate of nodes. Binpack algorithm can fill up the nodes as much as possible to close the application load to some nodes, which is very conducive to the automatic expansion capacity function of K8s cluster nodes. - -The BinPack algorithm is injected into the Volcano-Scheduler process as a plug-in and will be applied during the Pod stage of node selection. When calculating the Binpack algorithm, the Volcano-Scheduler considers the various resources requested by Pod and averages them according to the weights configured for each resource. The weight of each resource in the node score calculation is different, depending on the weight value configured by the administrator for each resource. Different plug-ins also need to assign different weights when calculating node scores, and the Scheduler also sets the score weights for BinPack plugins. - -#### Scenario - -The BinPack algorithm is good for small jobs that can fill as many nodes as possible. For example, the single query job in the big data scene, the order generation in the e-commerce seckill scene, the single identification job in the AI scene, and the high concurrency service scene on the Internet, etc. This scheduling algorithm can reduce the fragmentation in the node as much as possible, and reserve enough resource space on the idle machine for Pod which has applied for more resource requests, so as to maximize the utilization of idle resources under the cluster. - - - -### Priority - -{{
}} - -#### Overview - -The Priority Plugin provides the implementation of job, Task sorting, and PreempTablefn, a function that calculates sacrifice jobs. Job sorting according to priorityClassName, the task of sorting by priorityClassName, createTime, id in turn. - -#### Scenario - -When the cluster runs multiple jobs but is low on resources, and each Job has a different number of Pods waiting to be scheduled, if you use the Kubernetes default scheduler, the Job with more Pods will eventually get more of the cluster's resources. In this case, the Volcano-Scheduler provides algorithms that enable different jobs to share cluster resources in a fair-share. - -The Priority Plugin enables users to customize their job and task priorities, and to customize scheduling policies at different levels according to their own needs. Priority is arranged according to Job's PriorityClassName at the application level. For example, there are financial scenarios, Internet of Things monitoring scenarios and other applications requiring high real-time performance in the cluster, and the Priority Plugin can ensure that they are scheduled in Priority. - - - -### DRF - -{{
}} - -#### Overview - -The full name of DRF scheduling algorithm is Dominant Resource Fairness, which is a scheduling algorithm based on the container group Dominant Resource. Dominant Resource is the largest percentage of all required resources for a container group. The DRF algorithm selects the Dominant Resource that is the smallest in a series of container groups for priority scheduling. This can meet more job, not because a fat business, starve a large number of small business. DRF scheduling algorithm can ensure that many types of resources coexist in the environment, as far as possible to meet the fair principle of allocation. - -#### Scenario - -The DRF scheduling algorithm gives priority to the throughput of the business in the cluster and is suitable for batch small business scenarios such as a single AI training, a single big data calculation and a query. - - - -### Proportion - -#### Overview - -Proportion scheduling algorithm uses the concept of queue to control the Proportion of total resources allocated in the cluster. Each queue allocates a certain proportion of cluster resources. For example, there are three teams that share A pool of resources on A cluster: Team A uses up to 40% of the total cluster, Team B uses up to 30%, and Team C uses up to 30%. If the amount of work delivered exceeds the team's maximum available resources, there is a queue. - -#### Scenario - -Proportion scheduling algorithm improves the flexibility and elasticity of cluster scheduling. The most typical scenario is that when multiple development teams in a company share a cluster, this scheduling algorithm can handle the requirements of shared resource matching and isolation between different departments very well. In multi-service mixed scenarios, such as computation-intensive AI business, network IO-intensive MPI and HPC business, and storage-intensive big data business, Proportion scheduling algorithm can allocate shared resources according to demand through matching. - - - -### Task-topology - -#### Overview - -The task-topology algorithm is an algorithm that computes the priority of tasks and nodes based on the affinity and anti-affinity configuration between tasks within a Job. By configuring the affinity and anti-affinity policies between tasks within the Job and using the Task-Topology algorithm, tasks with affinity configurations can be scheduled to the same node first, and PODs with anti-affinity configurations to different nodes. - -#### Scenario - -node affinity: - -- Task-topology is important for improving computational efficiency in deep learning computing scenarios. Using the TensorFlow calculation as an example, configure the affinity between "ps" and "worker". Task-topology algorithm enables "ps" and "worker" to be scheduled to the same node as far as possible, so as to improve the efficiency of network and data interaction between "ps" and "worker", thus improving the computing efficiency. -- Tasks in HPC and MPI scenarios are highly synchronized and need high-speed network IO. - -Anti-affinity: - -- Take the TensorFlow calculation as an example, the anti-affinity between "ps" and "ps" - -- Master and slave backup of e-commerce service scene, data disaster tolerant, to ensure that there are spare jobs to continue to provide service after a job fails. - - - -### Predicates - -#### Overview - -The Predicate Plugin calls the PredicateGPU with pod and nodeInfo as parameters to evaluate and pre-select jobs based on the results. 
- -#### Scenario - -In AI scenarios where GPU resources are required, the Predicate Plugin can quickly filter out those that require the GPU for centralized scheduling. - - -### Nodeorder - -#### Overview - -The NodeOrder Plugin is a scheduling optimization strategy that scores nodes from various dimensions through simulated assignments to find the node that is best suited for the current job. The scoring parameters are configured by the user. The parameter contains the Affinity、reqResource、LeastReqResource、MostResource、balanceReqResouce. - -#### Scenario - -NodeOrder Plugin provides scoring criteria of multiple dimensions for scheduling, and the combination of different dimensions enables users to flexibly configure appropriate scheduling policies according to their own needs. - - - -### SLA - -#### Overview - -When users apply jobs to Volcano, they may need adding some particular constraints to job, for example, longest Pending time aiming to prevent job from starving. And these constraints can be regarded as Service Level Agreement (SLA) which are agreed between volcano and user. So sla plugin is provided to receive and realize SLA settings for both individual job and whole cluster. - -#### Scenario - -Users can customize SLA related parameters in their own cluster according to business needs. For example, for clusters with high real-time service requirements, JobWaitingTime can be set as small as possible. For clusters with bulk computing jobs, JobWaitingTime can be set to larger. The parameters of a specific SLA and the optimization of the parameters need to be combined with the specific business and related performance measurement results. - -### TDM - -#### Overview - -The full name of TDM is Time Division Multiplexing. In a co-located environment, some nodes are in both Kubernetes cluster and Yarn cluster. For these nodes, Kubernetes and Yarn cluster can use these resource by time-sharing multiplexing.The TDM Plugin marks these nodes as `revocable nodes`. TDM plugin will try to dispatch `preemptable task` to `revocable node` in node revocable time and evict the `preemptable task` from `revocable node` out of revocable time.. TDM Plugin improves the time-division multiplexing ability of node resources in the scheduling process of Volcano. - -#### Scenario - -In ToB business, cloud vendors provide cloud-based resources for merchants, and different merchants adopt different container arrangement frameworks (Kubernetes/YARN, etc.). TDM Plugin improves the time-sharing efficiency of common node resources and further improves the utilization rate of resources. - - - -### Numa-aware - -#### Overview - -When the node runs many CPU-bound pods, the workload can move to different CPU cores depending on whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and thus work fine without any intervention. However, in workloads where CPU cache affinity and scheduling latency significantly affect workload performance, the kubelet allows alternative CPU management policies to determine some placement preferences on the node. - -The CPU Manager and the Topology Manager are all Kubelet components, However There is the following limitation: - -- The scheduler is not topology-aware. so it is possible to be scheduled on a node and then fail on the node due to the Topology Manager. this is unacceptable for TensorFlow job. If any worker or ps failed on node, the job will fail. 
-- The managers are node-level that results in an inability to match the best node for NUMA topology in the whole cluster. - -The Numa-Aware Plugin aims to address these limitations. - -- Support cpu resource topology scheduling. -- Support pod-level topology policies. - -#### Scenario - -Common scenarios for NUMA-Aware are computation-intensive jobs that are sensitive to CPU parameters, scheduling delays. Such as scientific calculation, video decoding, animation rendering, big data offline processing and other specific scenes. - - diff --git a/content/en/docs/predicates.md b/content/en/docs/predicates.md new file mode 100644 index 00000000..1dd16226 --- /dev/null +++ b/content/en/docs/predicates.md @@ -0,0 +1,144 @@ ++++ +title = "Predicates" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Predicates" +[menu.docs] + parent = "plugins" + weight = 10 ++++ + +## Overview + +The Predicates plugin determines whether a task can be bound to a node by using a series of evaluation algorithms. It filters out nodes that cannot accommodate the task based on various criteria including resource availability, node conditions, and special requirements like GPU resources. + +## How It Works + +The Predicates plugin calls various predicate functions with the pod and nodeInfo as parameters to evaluate and pre-select nodes for scheduling. It implements: + +- **PredicateFn**: A function that returns true if a node can accommodate a task, false otherwise + +The plugin evaluates multiple criteria: +- Node resources (CPU, Memory, etc.) +- Node conditions and taints +- Pod affinity and anti-affinity rules +- Volume constraints +- GPU and other accelerator availability + +## Scenario + +### AI Workloads with GPU Requirements + +In AI scenarios where GPU resources are required, the Predicates plugin can quickly filter out nodes that have the required GPU resources for centralized scheduling. + +### Resource Filtering + +The plugin ensures that only nodes with sufficient resources are considered for task placement, preventing scheduling failures due to resource constraints. + +### Node Condition Filtering + +Nodes with conditions that prevent scheduling (e.g., NotReady, MemoryPressure, DiskPressure) are filtered out. 
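+
+Conceptually, the plugin runs each candidate node through a chain of checks and discards the node on the first failure. The sketch below illustrates that filtering flow with hypothetical check names and resource figures; the real predicate functions in the Volcano scheduler have different signatures and cover many more criteria.
+
+```go
+package main
+
+import "fmt"
+
+type task struct{ cpu, gpu int }
+type node struct {
+    freeCPU, freeGPU int
+    ready            bool
+}
+
+// predicate returns an error describing why a node is unsuitable, or nil if it fits.
+type predicate func(task, node) error
+
+func nodeReady(t task, n node) error {
+    if !n.ready {
+        return fmt.Errorf("node is not Ready")
+    }
+    return nil
+}
+
+func enoughResources(t task, n node) error {
+    if n.freeCPU < t.cpu || n.freeGPU < t.gpu {
+        return fmt.Errorf("insufficient CPU or GPU")
+    }
+    return nil
+}
+
+// filterNodes keeps only the nodes that pass every predicate.
+func filterNodes(t task, nodes []node, preds []predicate) []node {
+    var fit []node
+    for _, n := range nodes {
+        ok := true
+        for _, p := range preds {
+            if err := p(t, n); err != nil {
+                ok = false
+                break
+            }
+        }
+        if ok {
+            fit = append(fit, n)
+        }
+    }
+    return fit
+}
+
+func main() {
+    t := task{cpu: 4, gpu: 1}
+    nodes := []node{
+        {freeCPU: 8, freeGPU: 0, ready: true},   // filtered: no GPU available
+        {freeCPU: 8, freeGPU: 2, ready: true},   // kept
+        {freeCPU: 16, freeGPU: 4, ready: false}, // filtered: node not Ready
+    }
+    fmt.Println(len(filterNodes(t, nodes, []predicate{nodeReady, enoughResources}))) // 1
+}
+```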
+ +## Configuration + +The Predicates plugin is enabled in the scheduler ConfigMap: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: predicates + - name: nodeorder + - name: binpack +``` + +### Configuration Parameters + +The Predicates plugin supports several configuration options: + +```yaml +tiers: +- plugins: + - name: predicates + arguments: + predicate.GPUSharingEnable: true + predicate.CacheEnable: true + predicate.ProportionalEnable: true + predicate.resources: nvidia.com/gpu + predicate.resources.nvidia.com/gpu.weight: 100 +``` + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `predicate.GPUSharingEnable` | Enable GPU sharing predicate | false | +| `predicate.CacheEnable` | Enable predicate caching for performance | true | +| `predicate.ProportionalEnable` | Enable proportional resource predicate | false | + +## Example + +### Job Requiring GPU Resources + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: gpu-training-job +spec: + schedulerName: volcano + minAvailable: 1 + tasks: + - replicas: 1 + name: trainer + template: + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow:latest-gpu + resources: + requests: + nvidia.com/gpu: "1" + limits: + nvidia.com/gpu: "1" +``` + +The Predicates plugin will filter nodes to only include those with available GPU resources. + +### Job with Node Affinity + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: zone-specific-job +spec: + schedulerName: volcano + minAvailable: 1 + tasks: + - replicas: 1 + name: worker + template: + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: topology.kubernetes.io/zone + operator: In + values: + - us-west-2a + containers: + - name: worker + image: busybox +``` + +The Predicates plugin will ensure the job is only scheduled to nodes in the specified zone. diff --git a/content/en/docs/priority.md b/content/en/docs/priority.md new file mode 100644 index 00000000..e83039c4 --- /dev/null +++ b/content/en/docs/priority.md @@ -0,0 +1,111 @@ ++++ +title = "Priority" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Priority" +[menu.docs] + parent = "plugins" + weight = 6 ++++ + +{{
}} + +## Overview + +The Priority Plugin provides the implementation of job and task sorting, as well as PreemptableFn—a function that calculates which jobs can be preempted. Jobs are sorted according to `priorityClassName`, and tasks are sorted in turn by `priorityClassName`, `createTime`, and `id`. + +## How It Works + +The Priority plugin implements several key functions: + +- **JobOrderFn**: Compares two jobs and determines their relative priority based on `job.spec.priorityClassName` +- **TaskOrderFn**: Compares two tasks and determines their relative priority by comparing `task.priorityClassName`, `task.createTime`, and `task.id` in order +- **PreemptableFn**: Identifies tasks that can be preempted based on priority levels + +## Scenario + +When the cluster runs multiple jobs but is low on resources, and each job has a different number of Pods waiting to be scheduled, using the Kubernetes default scheduler would result in jobs with more Pods ultimately getting more of the cluster's resources. In this case, the Volcano Scheduler provides algorithms that enable different jobs to share cluster resources in a fair-share manner. + +### Custom Priority Levels + +The Priority Plugin enables users to customize their job and task priorities, and to configure scheduling policies at different levels according to their own needs. Priority is arranged according to Job's `priorityClassName` at the application level. + +### Real-time Requirements + +For clusters with applications requiring high real-time performance, such as: + +- Financial services scenarios +- IoT monitoring scenarios +- Real-time analytics + +The Priority Plugin can ensure that these high-priority workloads are scheduled first. + +## Configuration + +The Priority plugin is typically placed in the first tier of plugins: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang + - name: conformance +``` + +## Example + +### Creating Priority Classes + +First, create the PriorityClasses in your cluster: + +```yaml +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: high-priority +value: 1000000 +globalDefault: false +description: "High priority for critical workloads" +--- +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: low-priority +value: 1000 +globalDefault: false +description: "Low priority for batch workloads" +``` + +### Using Priority in VolcanoJob + +Then reference the priority class in your jobs: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: high-priority-job +spec: + schedulerName: volcano + priorityClassName: high-priority + minAvailable: 1 + tasks: + - replicas: 1 + name: task + template: + spec: + priorityClassName: high-priority + containers: + - name: worker + image: busybox + command: ["sleep", "1000"] +``` + +In this example, the job with `high-priority` will be scheduled before jobs with `low-priority` when resources are limited. diff --git a/content/en/docs/proportion.md b/content/en/docs/proportion.md new file mode 100644 index 00000000..00b77e4d --- /dev/null +++ b/content/en/docs/proportion.md @@ -0,0 +1,168 @@ ++++ +title = "Proportion" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. 
+linktitle = "Proportion" +[menu.docs] + parent = "plugins" + weight = 8 ++++ + +## Overview + +The Proportion scheduling algorithm uses the concept of **Queue** to control the proportion of total resources allocated in the cluster. Each queue is allocated a certain proportion of cluster resources. + +For example, if there are three teams sharing a pool of resources on a cluster: +- Team A can use up to 40% of the total cluster +- Team B can use up to 30% +- Team C can use up to 30% + +If the amount of work delivered exceeds the team's maximum available resources, the jobs will be queued. + +## How It Works + +The Proportion plugin manages resource allocation based on queue configurations: + +- **Queue Weight**: Each queue has a weight that determines its share of cluster resources +- **Capability**: Maximum resources a queue can use +- **Guarantee**: Minimum resources guaranteed to a queue +- **Deserved Resources**: Resources a queue should receive based on its weight + +Key functions implemented: + +- **QueueOrderFn**: Orders queues for scheduling based on their resource utilization +- **ReclaimableFn**: Determines if resources can be reclaimed from a queue +- **OverusedFn**: Checks if a queue is using more than its deserved share + +## Scenario + +The Proportion scheduling algorithm improves the flexibility and elasticity of cluster scheduling: + +### Multi-team Resource Sharing + +The most typical scenario is when multiple development teams in a company share a cluster. This scheduling algorithm handles the requirements of shared resource allocation and isolation between different departments very well. + +### Multi-service Mixed Scenarios + +In environments with diverse workloads: +- **Computation-intensive**: AI business +- **Network IO-intensive**: MPI and HPC business +- **Storage-intensive**: Big data business + +The Proportion scheduling algorithm can allocate shared resources according to demand through matching. 
+ +## Configuration + +### Queue Definition + +First, create queues with appropriate resource allocations: + +```yaml +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: team-a-queue +spec: + weight: 4 + capability: + cpu: "40" + memory: "80Gi" +--- +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: team-b-queue +spec: + weight: 3 + capability: + cpu: "30" + memory: "60Gi" +--- +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: team-c-queue +spec: + weight: 3 + capability: + cpu: "30" + memory: "60Gi" +``` + +### Scheduler Configuration + +Enable the Proportion plugin in the scheduler: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: drf + - name: predicates + - name: proportion + - name: nodeorder +``` + +## Example + +### Using Queues in VolcanoJob + +Submit jobs to specific queues: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: team-a-job +spec: + schedulerName: volcano + queue: team-a-queue + minAvailable: 2 + tasks: + - replicas: 2 + name: worker + template: + spec: + containers: + - name: worker + image: busybox + resources: + requests: + cpu: "2" + memory: "4Gi" +``` + +### Queue with Guarantee and Capability + +Create a queue with both minimum guarantee and maximum capability: + +```yaml +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: production-queue +spec: + weight: 5 + guarantee: + resource: + cpu: "10" + memory: "20Gi" + capability: + cpu: "50" + memory: "100Gi" + reclaimable: true +``` + +In this configuration: +- The queue is guaranteed at least 10 CPUs and 20Gi memory +- It can use up to 50 CPUs and 100Gi memory when resources are available +- Resources can be reclaimed from this queue when other queues need them diff --git a/content/en/docs/sla.md b/content/en/docs/sla.md new file mode 100644 index 00000000..640dd3c0 --- /dev/null +++ b/content/en/docs/sla.md @@ -0,0 +1,131 @@ ++++ +title = "SLA" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "SLA" +[menu.docs] + parent = "plugins" + weight = 12 ++++ + +## Overview + +When users submit jobs to Volcano, they may need to add particular constraints to jobs, for example, the longest Pending time to prevent jobs from starving. These constraints can be regarded as **Service Level Agreements (SLA)** which are agreed upon between Volcano and the user. The SLA plugin is provided to receive and enforce SLA settings for both individual jobs and the entire cluster. + +## How It Works + +The SLA plugin monitors job waiting times and can take actions when SLA constraints are violated: + +- **JobWaitingTime**: Maximum time a job can wait in the pending state +- **JobEnqueuedFn**: Checks if a job meets SLA requirements before being enqueued + +When a job's waiting time exceeds the configured threshold, the scheduler can take corrective actions such as prioritizing the job or notifying administrators. + +## Scenario + +Users can customize SLA-related parameters in their own cluster according to business needs: + +### Real-time Services + +For clusters with high real-time service requirements, `JobWaitingTime` can be set as small as possible to ensure jobs are scheduled quickly or flagged for attention. 
+ +### Batch Computing + +For clusters primarily running bulk computing jobs, `JobWaitingTime` can be set larger to allow for more flexible scheduling over time. + +### Multi-tenant Environments + +In multi-tenant clusters, different queues or namespaces can have different SLA requirements based on their service tier. + +## Configuration + +Enable the SLA plugin in the scheduler ConfigMap: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang + - name: sla + arguments: + sla.JobWaitingTime: 10m +``` + +### Configuration Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `sla.JobWaitingTime` | Maximum waiting time for a job | - | + +The `JobWaitingTime` parameter can be specified using duration format (e.g., `5m`, `1h`, `30s`). + +## Example + +### Cluster-wide SLA Configuration + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: volcano-scheduler-configmap + namespace: volcano-system +data: + volcano-scheduler.conf: | + actions: "enqueue, allocate, backfill" + tiers: + - plugins: + - name: priority + - name: gang + - name: sla + arguments: + sla.JobWaitingTime: 30m + - plugins: + - name: drf + - name: predicates + - name: proportion + - name: nodeorder +``` + +### Job with SLA Annotation + +You can also specify SLA constraints at the job level: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: sla-constrained-job + annotations: + volcano.sh/sla-waiting-time: "10m" +spec: + schedulerName: volcano + minAvailable: 1 + tasks: + - replicas: 1 + name: worker + template: + spec: + containers: + - name: worker + image: busybox + command: ["sleep", "3600"] +``` + +In this example, if the job waits more than 10 minutes in the pending state, the SLA plugin will flag it for priority scheduling or administrative attention. + +### Monitoring SLA Violations + +Volcano exposes metrics that can be used to monitor SLA compliance: + +- Job waiting time metrics +- SLA violation counts +- Queue-level SLA statistics + +These metrics can be integrated with monitoring systems like Prometheus to track SLA compliance across the cluster. diff --git a/content/en/docs/task-topology.md b/content/en/docs/task-topology.md new file mode 100644 index 00000000..8fdbf257 --- /dev/null +++ b/content/en/docs/task-topology.md @@ -0,0 +1,146 @@ ++++ +title = "Task-topology" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Task-topology" +[menu.docs] + parent = "plugins" + weight = 9 ++++ + +## Overview + +The Task-topology algorithm computes the priority of tasks and nodes based on the affinity and anti-affinity configuration between tasks within a Job. By configuring the affinity and anti-affinity policies between tasks within the Job and using the Task-topology algorithm, tasks with affinity configurations can be scheduled to the same node first, while tasks with anti-affinity configurations are scheduled to different nodes. 
+ +## How It Works + +The Task-topology plugin analyzes task relationships within a job and optimizes placement: + +- **Affinity**: Tasks that benefit from being on the same node (e.g., for fast local communication) +- **Anti-affinity**: Tasks that should be on different nodes (e.g., for fault tolerance) + +Key functions implemented: + +- **TaskOrderFn**: Orders tasks based on topology preferences +- **NodeOrderFn**: Scores nodes based on how well they satisfy topology requirements + +## Scenario + +### Node Affinity + +#### Deep Learning and TensorFlow + +Task-topology is important for improving computational efficiency in deep learning computing scenarios. Using TensorFlow computation as an example, configure the affinity between "ps" (parameter server) and "worker". The Task-topology algorithm enables "ps" and "worker" to be scheduled to the same node as much as possible, improving the efficiency of network and data interaction between them, thus improving computing efficiency. + +#### HPC and MPI + +Tasks in HPC and MPI scenarios are highly synchronized and need high-speed network IO. Placing related tasks on the same node reduces network latency and improves performance. + +### Anti-affinity + +#### Parameter Server Distribution + +In TensorFlow computation, anti-affinity between "ps" instances can ensure they are distributed across different nodes for better load distribution. + +#### High Availability + +E-commerce service scenarios benefit from anti-affinity for master-slave backup and data disaster tolerance, ensuring that backup jobs continue to provide service after a primary job fails. + +## Configuration + +Enable the Task-topology plugin in the scheduler: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: predicates + - name: nodeorder + - name: task-topology +``` + +## Example + +### Job with Task Affinity + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: tensorflow-job +spec: + schedulerName: volcano + minAvailable: 3 + policies: + - event: PodEvicted + action: RestartJob + tasks: + - replicas: 1 + name: ps + policies: + - event: TaskCompleted + action: CompleteJob + template: + metadata: + labels: + role: ps + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow:latest + - replicas: 2 + name: worker + template: + metadata: + labels: + role: worker + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow:latest + plugins: + env: [] + svc: [] + topologyPolicy: + mode: affinity + tiers: + - tasks: + - ps + - worker +``` + +### Job with Task Anti-affinity + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: ha-service +spec: + schedulerName: volcano + minAvailable: 2 + tasks: + - replicas: 2 + name: master + template: + spec: + containers: + - name: master + image: my-service:latest + topologyPolicy: + mode: anti-affinity + tiers: + - tasks: + - master +``` + +In this example, the two master replicas will be scheduled to different nodes to ensure high availability. diff --git a/content/en/docs/tdm.md b/content/en/docs/tdm.md new file mode 100644 index 00000000..6b75dd86 --- /dev/null +++ b/content/en/docs/tdm.md @@ -0,0 +1,153 @@ ++++ +title = "TDM" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. 
+linktitle = "TDM"
[menu.docs]
  parent = "plugins"
  weight = 13
+++

## Overview

The full name of TDM is **Time Division Multiplexing**. In a co-located environment, some nodes may belong to both a Kubernetes cluster and a YARN cluster. For these nodes, the Kubernetes and YARN clusters can share the same resources through time-division multiplexing.

Administrators mark these nodes as `revocable nodes`. During a node's revocable time window, the TDM plugin tries to dispatch `preemptable tasks` to `revocable nodes`; outside of the revocable time window, it evicts the `preemptable tasks` from `revocable nodes`.

The TDM plugin therefore improves Volcano's ability to time-share node resources during scheduling.

## How It Works

The TDM plugin manages time-based resource sharing:

1. **Revocable Nodes**: Nodes labeled as revocable so they can be shared between orchestration systems
2. **Revocable Time Windows**: Defined time periods during which revocable nodes are available for Kubernetes workloads
3. **Preemptable Tasks**: Tasks that can be evicted when the revocable time window ends

Key functions:

- **PredicateFn**: Checks whether a task can be scheduled on a revocable node during the current time window
- **PreemptableFn**: Determines whether tasks should be evicted based on time constraints

## Scenario

### ToB Business

In ToB (Business-to-Business) scenarios, cloud vendors provide cloud-based resources for merchants, and different merchants adopt different container orchestration frameworks (Kubernetes, YARN, etc.). The TDM plugin improves the time-sharing efficiency of these shared node resources and further improves resource utilization.

### Hybrid Clusters

Organizations running both Kubernetes and Hadoop/YARN workloads can use TDM to share physical nodes between the two systems, with time-based scheduling ensuring the workloads don't interfere with each other.

### Cost Optimization

By enabling time-division multiplexing, organizations can maximize the utilization of their hardware infrastructure by sharing nodes across different workload types during different time periods.
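The How It Works section above states that the plugin only admits preemptable tasks while a node's revocable time window is open. The sketch below shows one way such a window check can be expressed for the `<start cron>:<end cron>` window format used in the configuration that follows. It is a simplified illustration using the third-party Python `croniter` package, not Volcano's actual implementation.

```python
# Illustrative sketch only: decide whether "now" falls inside a revocable window
# declared as "<start cron>:<end cron>", e.g. "0 8 * * *:0 18 * * *".
from datetime import datetime

from croniter import croniter


def in_revocable_window(window: str, now: datetime) -> bool:
    start_expr, end_expr = window.split(":")  # cron expressions contain no ":"
    last_start = croniter(start_expr, now).get_prev(datetime)
    last_end = croniter(end_expr, now).get_prev(datetime)
    # The window is open exactly when it opened more recently than it last closed.
    # This also handles overnight windows such as "0 20 * * *:0 6 * * *".
    return last_start > last_end


if __name__ == "__main__":
    zone_a = "0 8 * * *:0 18 * * *"  # 08:00-18:00 every day
    print(in_revocable_window(zone_a, datetime(2025, 4, 1, 10, 0)))  # True: window open
    print(in_revocable_window(zone_a, datetime(2025, 4, 1, 20, 0)))  # False: window closed
```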
+
## Configuration

### Node Labels

First, mark the nodes as revocable (replace `<node-name>` with the actual node name):

```bash
kubectl label node <node-name> volcano.sh/revocable-node=true
kubectl annotate node <node-name> volcano.sh/revocable-zone="zone-a"
```

### Scheduler Configuration

Enable the TDM plugin with time window configuration:

```yaml
tiers:
- plugins:
  - name: priority
  - name: gang
- plugins:
  - name: predicates
  - name: tdm
    arguments:
      tdm.revocable-zone.zone-a: "0 8 * * *:0 18 * * *"
      tdm.revocable-zone.zone-b: "0 20 * * *:0 6 * * *"
```

### Configuration Parameters

| Parameter | Description | Format |
|-----------|-------------|--------|
| `tdm.revocable-zone.<zone name>` | Time window for a revocable zone | `<start cron>:<end cron>` |

The start and end of each time window are specified using cron expressions:
- `0 8 * * *` means "at 8:00 AM every day"
- `0 18 * * *` means "at 6:00 PM every day"

## Example

### Revocable Node Configuration

```bash
# Mark a node as revocable in zone-a
kubectl label node worker-node-1 volcano.sh/revocable-node=true
kubectl annotate node worker-node-1 volcano.sh/revocable-zone=zone-a
```

### Scheduler ConfigMap with TDM

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
    - plugins:
      - name: predicates
      - name: tdm
        arguments:
          # zone-a is available for Kubernetes from 8 AM to 6 PM
          tdm.revocable-zone.zone-a: "0 8 * * *:0 18 * * *"
      - name: nodeorder
```

### Preemptable Job

Submit a job that can be scheduled on revocable nodes:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: preemptable-job
  annotations:
    volcano.sh/preemptable: "true"
spec:
  schedulerName: volcano
  minAvailable: 1
  tasks:
  - replicas: 1
    name: worker
    template:
      spec:
        containers:
        - name: worker
          image: busybox
          command: ["sleep", "3600"]
```

In this example:
- The job is marked as preemptable
- It can be scheduled on revocable nodes during the configured time window
- It will be evicted when the time window ends diff --git a/content/zh/blog/Quick-Start-Volcano.md b/content/zh/blog/Quick-Start-Volcano.md index 186e1ebd..74b667fa 100644 --- a/content/zh/blog/Quick-Start-Volcano.md +++ b/content/zh/blog/Quick-Start-Volcano.md @@ -7,7 +7,7 @@ date = 2019-03-28 lastmod = 2020-09-07 datemonth = "Sep" dateyear = "2020" -dateday = 07 +dateday = "07" draft = false # Is this a draft? true/false toc = true # Show table of contents? true/false diff --git a/content/zh/blog/Volcano-1.11.0-release.md b/content/zh/blog/Volcano-1.11.0-release.md index fd040c75..6a33c3a6 100644 --- a/content/zh/blog/Volcano-1.11.0-release.md +++ b/content/zh/blog/Volcano-1.11.0-release.md @@ -7,7 +7,7 @@ date = 2025-02-07 lastmod = 2025-02-07 datemonth = "Feb" dateyear = "2025" -dateday = 07 +dateday = "07" draft = false # Is this a draft? true/false toc = true # Show table of contents? true/false diff --git a/content/zh/blog/kube-batch-customers.md b/content/zh/blog/kube-batch-customers.md index 0d5014bd..9e06f7c2 100644 --- a/content/zh/blog/kube-batch-customers.md +++ b/content/zh/blog/kube-batch-customers.md @@ -7,7 +7,7 @@ date = 2019-01-28 lastmod = 2020-09-07 datemonth = "Sep" dateyear = "2020" -dateday = 07 +dateday = "07" draft = false # Is this a draft? true/false toc = true # Show table of contents?
true/false diff --git a/content/zh/blog/kube-batch-startup.md b/content/zh/blog/kube-batch-startup.md index 854de5c0..99a3b2f8 100644 --- a/content/zh/blog/kube-batch-startup.md +++ b/content/zh/blog/kube-batch-startup.md @@ -7,7 +7,7 @@ date = 2019-01-28 lastmod = 2020-09-07 datemonth = "Sep" dateyear = "2020" -dateday = 07 +dateday = "07" draft = false # Is this a draft? true/false toc = true # Show table of contents? true/false diff --git a/content/zh/docs/binpack.md b/content/zh/docs/binpack.md new file mode 100644 index 00000000..7e945edb --- /dev/null +++ b/content/zh/docs/binpack.md @@ -0,0 +1,105 @@ ++++ +title = "Binpack" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Binpack" +[menu.docs] + parent = "plugins" + weight = 5 ++++ + +## 简介 + +Binpack调度算法的目标是尽量把已有的节点填满(尽量不往空白节点分配)。具体实现上,binpack调度算法是给可以投递的节点打分,分数越高表示节点的资源利用率越高。binpack算法能够尽可能填满节点,将应用负载靠拢在部分节点,这非常有利于K8S集群节点的自动扩缩容功能。 + +## 工作原理 + +Binpack算法以插件的形式,注入到volcano-scheduler调度过程中,将会应用在Pod优选节点的阶段。Volcano-scheduler在计算binpack算法时,会考虑Pod请求的各种资源,并根据各种资源所配置的权重做平均。 + +关键特性: + +- **资源权重**: 每种资源类型(CPU、Memory、GPU等)在评分计算中可以有不同的权重,这取决于管理员为每种资源配置的权重值。 +- **插件权重**: 不同的插件在计算节点分数时,也需要分配不同的权重,scheduler也为binpack插件设置了分数权重。 +- **NodeOrderFn**: 插件实现NodeOrderFn,根据放置任务后节点的利用效率为节点评分。 + +## 场景 + +binpack算法对能够尽可能填满节点的小作业有利: + +### 大数据场景 + +大数据处理中的单次查询作业通过整合工作负载并最大化活动节点上的资源利用率来受益于Binpack。 + +### 电商高并发 + +电商秒杀场景订单生成可以利用Binpack在峰值负载期间高效使用可用资源。 + +### AI推理 + +AI推理场景中的单次识别作业受益于整合调度,减少资源碎片。 + +### 互联网服务 + +互联网上的高并发服务场景受益于Binpack,通过减少节点内的碎片并在空闲机器上为申请了更多资源请求的Pod预留足够的资源空间,使集群下空闲资源得到最大化的利用。 + +## 配置 + +Binpack插件在调度器ConfigMap中配置,可选权重参数: + +```yaml +tiers: +- plugins: + - name: binpack + arguments: + binpack.weight: 10 + binpack.cpu: 1 + binpack.memory: 1 + binpack.resources: nvidia.com/gpu + binpack.resources.nvidia.com/gpu: 2 +``` + +### 配置参数 + +| 参数 | 描述 | 默认值 | +|------|------|--------| +| `binpack.weight` | Binpack插件分数的整体权重 | 1 | +| `binpack.cpu` | CPU资源在评分中的权重 | 1 | +| `binpack.memory` | Memory资源在评分中的权重 | 1 | +| `binpack.resources` | 要考虑的额外资源 | - | +| `binpack.resources.` | 特定资源类型的权重 | 1 | + +## 示例 + +以下是使用Binpack优先填充节点的调度器配置示例: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: volcano-scheduler-configmap + namespace: volcano-system +data: + volcano-scheduler.conf: | + actions: "enqueue, allocate, backfill" + tiers: + - plugins: + - name: priority + - name: gang + - plugins: + - name: predicates + - name: nodeorder + - name: binpack + arguments: + binpack.weight: 10 + binpack.cpu: 2 + binpack.memory: 1 +``` + +在此配置中,Binpack插件被赋予权重10,CPU在评分计算中的权重是内存的两倍。 diff --git a/content/zh/docs/drf.md b/content/zh/docs/drf.md new file mode 100644 index 00000000..a0f13182 --- /dev/null +++ b/content/zh/docs/drf.md @@ -0,0 +1,117 @@ ++++ +title = "DRF" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "DRF" +[menu.docs] + parent = "plugins" + weight = 7 ++++ + +{{
}} + +## 简介 + +DRF调度算法的全称是**Dominant Resource Fairness**,是基于容器组Dominant Resource的调度算法。volcano-scheduler观察每个Job请求的主导资源,并将其作为对集群资源使用的一种度量,根据Job的主导资源,计算Job的share值,在调度的过程中,具有较低share值的Job将具有更高的调度优先级。这样能够满足更多的作业,不会因为一个胖业务,饿死大批小业务。DRF调度算法能够确保在多种类型资源共存的环境下,尽可能满足分配的公平原则。 + +## 工作原理 + +DRF插件: + +1. **观察主导资源**: 对于每个作业,识别哪种资源(CPU、Memory、GPU等)代表了集群资源的最大份额 +2. **计算Share值**: 根据其主导资源使用情况计算每个作业的share值 +3. **优先调度较低Share**: 具有较低share值(使用较少主导资源)的作业获得更高的调度优先级 + +实现的关键功能: + +- **JobOrderFn**: 根据主导资源份额对作业排序,优先调度份额较小的作业 +- **PreemptableFn**: 根据资源公平性计算确定作业是否可以被抢占 + +插件尝试计算分配给抢占者和被抢占任务的资源总量,当抢占者任务资源较少时触发抢占。 + +## 场景 + +DRF调度算法优先考虑集群中业务的吞吐量,适用批处理场景: + +### AI训练 + +单次AI训练作业受益于DRF,因为它确保了多个训练工作负载之间的公平资源分配。 + +### 大数据处理 + +单次大数据计算和查询作业可以与集群中的其他工作负载公平共享资源。 + +### 混合资源工作负载 + +在具有不同资源需求(CPU密集型、内存密集型、GPU密集型作业)的环境中,DRF确保所有资源维度的公平分配。 + +## 配置 + +DRF插件在调度器ConfigMap中配置: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: drf + - name: predicates + - name: proportion +``` + +## 示例 + +考虑一个具有以下资源的集群: +- 100个CPU +- 400 GB内存 + +以及两个作业: +- **作业A**: 每个任务需要2个CPU和8 GB内存 +- **作业B**: 每个任务需要1个CPU和32 GB内存 + +对于作业A: +- 每个任务的CPU份额:2/100 = 2% +- 每个任务的内存份额:8/400 = 2% +- 主导资源:CPU和内存相等(2%) + +对于作业B: +- 每个任务的CPU份额:1/100 = 1% +- 每个任务的内存份额:32/400 = 8% +- 主导资源:内存(8%) + +使用DRF,作业A将首先被调度,因为其主导资源份额(2%)小于作业B的(8%)。这确保了没有作业可以通过请求大量单一资源来垄断集群。 + +### VolcanoJob示例 + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: drf-example-job +spec: + schedulerName: volcano + minAvailable: 2 + tasks: + - replicas: 2 + name: worker + template: + spec: + containers: + - name: worker + image: busybox + resources: + requests: + cpu: "2" + memory: "8Gi" + limits: + cpu: "2" + memory: "8Gi" +``` diff --git a/content/zh/docs/gang.md b/content/zh/docs/gang.md new file mode 100644 index 00000000..4dd8b98b --- /dev/null +++ b/content/zh/docs/gang.md @@ -0,0 +1,91 @@ ++++ +title = "Gang" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Gang" +[menu.docs] + parent = "plugins" + weight = 4 ++++ + +{{
}} + +## 简介 + +Gang调度策略是volcano-scheduler的核心调度算法之一,它满足了调度过程中的"All or nothing"的调度需求,避免Pod的任意调度导致集群资源的浪费。具体算法是,观察Job下的Pod已调度数量是否满足了最小运行数量,当Job的最小运行数量得到满足时,为Job下的所有Pod执行调度动作,否则,不执行。 + +## 工作原理 + +Gang插件认为不处于`Ready`状态(包括Binding、Bound、Running、Allocated、Succeed和Pipelined)的任务具有更高的优先级。它检查分配给队列的资源是否能够满足任务运行`minAvailable`个Pod所需的资源。如果可以,Gang插件将继续进行调度。 + +Gang插件实现的关键功能: + +- **JobReadyFn**: 检查作业是否有足够的资源满足其`minAvailable`要求 +- **JobPipelinedFn**: 检查作业是否可以进行流水线处理 +- **JobValidFn**: 验证作业的Gang约束是否满足 + +## 场景 + +基于容器组概念的Gang调度算法十分适合需要多进程协作的场景: + +### AI和深度学习 + +AI场景往往包含复杂的流程,Data Ingestion、Data Analysts、Data Splitting、Trainer、Serving、Logging等,需要一组容器进行协同工作,就很适合基于容器组的Gang调度策略。 + +### MPI和HPC + +MPI计算框架下的多线程并行计算通信场景,由于需要主从进程协同工作,也非常适合使用Gang调度策略。容器组下的容器高度相关也可能存在资源争抢,整体调度分配,能够有效解决死锁。 + +### 资源效率 + +在集群资源不足的场景下,Gang的调度策略对于集群资源的利用率的提升是非常明显的,通过防止部分作业分配来避免资源浪费。 + +## 配置 + +Gang插件通常默认启用,在调度器ConfigMap中配置: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang + - name: conformance +``` + +## 示例 + +以下是使用Gang调度的VolcanoJob示例: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: tensorflow-job +spec: + minAvailable: 3 # Gang约束:至少3个pod必须可调度 + schedulerName: volcano + tasks: + - replicas: 1 + name: ps + template: + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow:latest + - replicas: 2 + name: worker + template: + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow:latest +``` + +在此示例中,只有当所有3个pod(1个ps + 2个worker)都能同时分配资源时,作业才会被调度。 diff --git a/content/zh/docs/nodeorder.md b/content/zh/docs/nodeorder.md new file mode 100644 index 00000000..2eb87b75 --- /dev/null +++ b/content/zh/docs/nodeorder.md @@ -0,0 +1,167 @@ ++++ +title = "Nodeorder" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. 
+linktitle = "Nodeorder" +[menu.docs] + parent = "plugins" + weight = 11 ++++ + +## 简介 + +Nodeorder plugin是一种调度优选策略:通过模拟分配从各个维度为node打分,找到最适合当前作业的node。打分参数由用户来配置。 + +## 工作原理 + +Nodeorder插件实现**NodeOrderFn**,使用一系列评分算法为任务的所有节点评分。得分最高的节点被认为是最适合该任务的节点。 + +评分维度包括: +- **Affinity**: 节点和pod亲和性/反亲和性分数 +- **LeastRequestedResource**: 优先选择可用资源更多的节点 +- **MostRequestedResource**: 优先选择可用资源更少的节点(整合) +- **BalancedResourceAllocation**: 优先选择资源使用均衡的节点 +- **ImageLocality**: 优先选择已有容器镜像的节点 + +## 场景 + +Nodeorder plugin给调度提供了多个维度的打分标准,不同维度的组合,能够让用户根据自身需求灵活的配置合适的调度策略。 + +### 工作负载分布 + +通过调整不同评分维度的权重,您可以控制工作负载在集群中的分布方式: +- 使用**LeastRequestedResource**均匀分散工作负载 +- 使用**MostRequestedResource**整合工作负载(类似于Binpack) + +### 镜像本地性优化 + +对于大型容器镜像,使用**ImageLocality**评分可以通过优先选择已有所需镜像的节点来减少pod启动时间。 + +### 亲和性优化 + +**Affinity**维度确保pod根据其亲和性和反亲和性规则放置。 + +## 配置 + +Nodeorder插件在调度器ConfigMap中启用,可配置权重: + +```yaml +tiers: +- plugins: + - name: predicates + - name: nodeorder + arguments: + nodeorder.weight: 10 + nodeorder.leastrequested.enable: true + nodeorder.mostrequested.enable: false + nodeorder.nodeaffinity.enable: true + nodeorder.podaffinity.enable: true + nodeorder.balancedresource.enable: true + nodeorder.imagelocality.enable: true +``` + +### 配置参数 + +| 参数 | 描述 | 默认值 | +|------|------|--------| +| `nodeorder.weight` | Nodeorder插件的整体权重 | 1 | +| `nodeorder.leastrequested.enable` | 启用最少请求资源评分 | true | +| `nodeorder.mostrequested.enable` | 启用最多请求资源评分 | false | +| `nodeorder.nodeaffinity.enable` | 启用节点亲和性评分 | true | +| `nodeorder.podaffinity.enable` | 启用pod亲和性评分 | true | +| `nodeorder.balancedresource.enable` | 启用均衡资源评分 | true | +| `nodeorder.imagelocality.enable` | 启用镜像本地性评分 | true | +| `nodeorder.leastrequested.weight` | 最少请求评分的权重 | 1 | +| `nodeorder.mostrequested.weight` | 最多请求评分的权重 | 1 | +| `nodeorder.nodeaffinity.weight` | 节点亲和性评分的权重 | 1 | +| `nodeorder.podaffinity.weight` | pod亲和性评分的权重 | 1 | +| `nodeorder.balancedresource.weight` | 均衡资源评分的权重 | 1 | +| `nodeorder.imagelocality.weight` | 镜像本地性评分的权重 | 1 | + +## 示例 + +### 分散工作负载的配置 + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: volcano-scheduler-configmap + namespace: volcano-system +data: + volcano-scheduler.conf: | + actions: "enqueue, allocate, backfill" + tiers: + - plugins: + - name: priority + - name: gang + - plugins: + - name: predicates + - name: nodeorder + arguments: + nodeorder.leastrequested.enable: true + nodeorder.leastrequested.weight: 2 + nodeorder.balancedresource.enable: true +``` + +### 整合工作负载的配置 + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: volcano-scheduler-configmap + namespace: volcano-system +data: + volcano-scheduler.conf: | + actions: "enqueue, allocate, backfill" + tiers: + - plugins: + - name: priority + - name: gang + - plugins: + - name: predicates + - name: nodeorder + arguments: + nodeorder.mostrequested.enable: true + nodeorder.mostrequested.weight: 2 + nodeorder.leastrequested.enable: false +``` + +### 具有Pod亲和性的作业 + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: affinity-job +spec: + schedulerName: volcano + minAvailable: 2 + tasks: + - replicas: 2 + name: worker + template: + spec: + affinity: + podAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + podAffinityTerm: + labelSelector: + matchLabels: + app: cache + topologyKey: kubernetes.io/hostname + containers: + - name: worker + image: busybox +``` + +如果节点已有匹配亲和性规则的pod,Nodeorder插件将为这些节点打更高的分数。 diff --git a/content/zh/docs/numa-aware.md b/content/zh/docs/numa-aware.md new file mode 100644 index 00000000..c1da4704 
--- /dev/null +++ b/content/zh/docs/numa-aware.md @@ -0,0 +1,179 @@ ++++ +title = "Numa-aware" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Numa-aware" +[menu.docs] + parent = "plugins" + weight = 14 ++++ + +## 简介 + +当节点运行多个cpu密集的pod。基于pod是否可以迁移cpu以及调度周期cpu资源状况,工作负载可以迁移到不同的cpu核心下。许多工作负载对cpu资源迁移并不敏感。然而,有一些cpu的缓存亲和度以及调度延迟显著影响性能的工作负载,kubelet允许可选的cpu编排策略(cpu management)来确定节点上cpu资源的绑定分配。 + +## 挑战 + +cpu manager以及topology manager都是kubelet的组件,帮助进行CPU放置。然而,它存在如下局限: + +1. **调度器不感知**: 基于kubelet的调度组件不支持topology-aware。所以可能由于Topology manager,导致整个node上的调度失败。这对Tensorflow job是难以接受的,因为一旦有任何worker task挂掉,整个作业都将调度失败。 + +2. **仅节点级别**: 这些manager是节点级这导致无法在整个集群中匹配numa topology的最佳节点。 + +## Numa-aware插件工作原理 + +Numa-aware plugin致力于解决如上局限: + +- **支持cpu资源的拓扑调度**: 支持基于CPU拓扑的调度 +- **支持pod级别的拓扑协议**: 支持pod级别的拓扑策略 + +插件: +1. 从节点收集NUMA拓扑信息 +2. 评估CPU和内存放置要求 +3. 根据NUMA亲和性为节点评分 +4. 确保任务被放置在能够满足其拓扑要求的节点上 + +## 场景 + +Numa-aware的常见场景是那些对cpu参数敏感、调度延迟敏感的计算密集型作业: + +### 科学计算 + +高性能科学计算受益于NUMA感知调度,以确保最佳的内存访问模式。 + +### 视频处理 + +视频解码工作负载可以通过NUMA感知调度实现更好的性能。 + +### 动漫动画渲染 + +动漫动画渲染等CPU密集型作业受益于优化的CPU和内存放置。 + +### 大数据离线处理 + +大规模数据处理作业可以通过NUMA优化的调度实现更好的吞吐量。 + +## 配置 + +### 在节点上启用Topology Manager + +首先,确保kubelet配置了拓扑管理: + +```yaml +# kubelet配置 +topologyManagerPolicy: single-numa-node +cpuManagerPolicy: static +``` + +### 调度器配置 + +启用Numa-aware插件: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: predicates + - name: nodeorder + - name: numa-aware + arguments: + numa-aware.weight: 10 +``` + +### 配置参数 + +| 参数 | 描述 | 默认值 | +|------|------|--------| +| `numa-aware.weight` | NUMA感知评分的权重 | 1 | + +## 示例 + +### 具有NUMA拓扑的节点 + +一个典型的NUMA节点可能有: +- 2个NUMA节点 +- 每个有16个CPU核心 +- 每个有64GB内存 + +``` +NUMA Node 0: CPU 0-15, 64GB Memory +NUMA Node 1: CPU 16-31, 64GB Memory +``` + +### 需要NUMA感知的作业 + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: numa-aware-job +spec: + schedulerName: volcano + minAvailable: 1 + tasks: + - replicas: 1 + name: compute + template: + metadata: + annotations: + volcano.sh/numa-topology-policy: single-numa-node + spec: + containers: + - name: compute + image: compute-intensive-app:latest + resources: + requests: + cpu: "8" + memory: "32Gi" + limits: + cpu: "8" + memory: "32Gi" +``` + +在此示例中: +- 作业请求8个CPU和32GB内存 +- NUMA策略要求所有资源来自单个NUMA节点 +- 调度器将找到一个能够从单个NUMA节点满足此要求的节点 + +### 具有拓扑策略注解的Pod + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: numa-sensitive-pod + annotations: + volcano.sh/numa-topology-policy: best-effort +spec: + schedulerName: volcano + containers: + - name: app + image: myapp:latest + resources: + requests: + cpu: "4" + memory: "16Gi" + limits: + cpu: "4" + memory: "16Gi" +``` + +### NUMA拓扑策略 + +插件支持多种拓扑策略: + +| 策略 | 描述 | +|------|------| +| `none` | 无NUMA偏好 | +| `best-effort` | 尝试放置在最佳NUMA节点,但如果不可能也不会失败 | +| `restricted` | 仅放置在能够满足NUMA要求的节点 | +| `single-numa-node` | 所有资源必须来自单个NUMA节点 | diff --git a/content/zh/docs/plugins-overview.md b/content/zh/docs/plugins-overview.md new file mode 100644 index 00000000..1e50e349 --- /dev/null +++ b/content/zh/docs/plugins-overview.md @@ -0,0 +1,84 @@ ++++ +title = "Plugins" + +date = 2021-05-13 +lastmod = 2026-01-19 + +draft = false +toc = true +type = "docs" + +linktitle = "Plugins" +[menu.docs] + name = "Plugins" + parent = "scheduler" + weight = 3 + identifier = "plugins" ++++ + +## 概述 + 
+插件是Volcano调度器的核心组件,提供特定的调度算法和策略。它们与[Actions](/zh/docs/actions/)配合工作来实现调度逻辑。Actions定义每个调度步骤应该做什么,而插件提供具体的算法来实现。 + +## 插件工作原理 + +插件在调度器中注册,并在Actions执行期间被调用。每个插件可以实现以下一个或多个功能: + +- **JobOrderFn**: 确定调度队列中作业的顺序 +- **TaskOrderFn**: 确定作业中任务的顺序 +- **PredicateFn**: 过滤无法容纳任务的节点 +- **NodeOrderFn**: 为节点评分以找到最适合任务的节点 +- **PreemptableFn**: 识别可以被抢占的任务 +- **ReclaimableFn**: 识别可以被回收的任务 +- **JobReadyFn**: 检查作业是否准备好被调度 +- **JobPipelinedFn**: 检查作业是否可以进行流水线处理 + +## 可用插件 + +Volcano提供以下插件: + +| 插件 | 描述 | +|------|------| +| [Gang](/zh/docs/gang/) | 确保作业的所有任务一起调度(全有或全无) | +| [Binpack](/zh/docs/binpack/) | 将任务打包到节点上以最大化资源利用率 | +| [Priority](/zh/docs/priority/) | 根据优先级对作业和任务进行排序 | +| [DRF](/zh/docs/drf/) | 主导资源公平性,实现公平资源分配 | +| [Proportion](/zh/docs/proportion/) | 基于队列的资源分配,按比例共享 | +| [Task-topology](/zh/docs/task-topology/) | 考虑作业内任务的亲和性和反亲和性 | +| [Predicates](/zh/docs/predicates/) | 基于预选条件(包括GPU需求)过滤节点 | +| [Nodeorder](/zh/docs/nodeorder/) | 使用多个维度为节点评分 | +| [SLA](/zh/docs/sla/) | 作业的服务级别协议约束 | +| [TDM](/zh/docs/tdm/) | 共享节点资源的时分复用 | +| [Numa-aware](/zh/docs/numa-aware/) | CPU密集型工作负载的NUMA拓扑感知调度 | + +## 插件配置 + +插件在Volcano调度器的ConfigMap中配置。以下是一个示例配置: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: volcano-scheduler-configmap + namespace: volcano-system +data: + volcano-scheduler.conf: | + actions: "enqueue, allocate, backfill" + tiers: + - plugins: + - name: priority + - name: gang + - name: conformance + - plugins: + - name: drf + - name: predicates + - name: proportion + - name: nodeorder + - name: binpack +``` + +插件被组织成层级(tiers),较高层级(首先列出)的插件会在较低层级的插件之前被评估。这允许您创建调度策略的层次结构。 + +## 自定义插件 + +Volcano的插件架构是可扩展的。您可以通过遵循插件接口并将其注册到调度器来实现自定义插件。有关开发自定义插件的更多信息,请参阅[Volcano贡献指南](/zh/docs/contribution/)。 diff --git a/content/zh/docs/plugins.md b/content/zh/docs/plugins.md deleted file mode 100644 index c3350c9a..00000000 --- a/content/zh/docs/plugins.md +++ /dev/null @@ -1,176 +0,0 @@ -+++ -title = "Plugins" - -date = 2021-05-13 -lastmod = 2021-05-13 - -draft = false # Is this a draft? true/false -toc = true # Show table of contents? true/false -type = "docs" # Do not modify. - -# Add menu entry to sidebar. -linktitle = "Plugins" -[menu.docs] - parent = "scheduler" - weight = 3 -+++ - -### Gang - -{{
}} - -#### 简介 - -Gang调度策略是volcano-scheduler的核心调度算法之一,它满足了调度过程中的“All or nothing”的调度需求,避免Pod的任意调度导致集群资源的浪费。具体算法是,观察Job下的Pod已调度数量是否满足了最小运行数量,当Job的最小运行数量得到满足时,为Job下的所有Pod执行调度动作,否则,不执行。 - -#### 场景 - -基于容器组概念的Gang调度算法十分适合需要多进程协作的场景。AI场景往往包含复杂的流程,Data Ingestion、Data Analysts、Data Splitting、Trainer、Serving、Logging等,需要一组容器进行协同工作,就很适合基于容器组的Gang调度策略。MPI计算框架下的多线程并行计算通信场景,由于需要主从进程协同工作,也非常适合使用Gang调度策略。容器组下的容器高度相关也可能存在资源争抢,整体调度分配,能够有效解决死锁。 - -在集群资源不足的场景下,gang的调度策略对于集群资源的利用率的提升是非常明显的。 - - - -### Binpack - -#### 简介 - -binpack调度算法的目标是尽量把已有的节点填满(尽量不往空白节点分配)。具体实现上,binpack调度算法是给可以投递的节点打分,分数越高表示节点的资源利用率越高。binpack算法能够尽可能填满节点,将应用负载靠拢在部分节点,这非常有利于K8S集群节点的自动扩缩容功能。 - -Binpack算法以插件的形式,注入到volcano-scheduler调度过程中,将会应用在Pod优选节点的阶段。Volcano-scheduler在计算binpack算法时,会考虑Pod请求的各种资源,并根据各种资源所配置的权重做平均。每种资源在节点分值计算过程中的权重并不一样,这取决于管理员为每种资源配置的权重值。同时不同的插件在计算节点分数时,也需要分配不同的权重,scheduler也为binpack插件设置了分数权重。 - -#### 场景 - -binpack算法对能够尽可能填满节点的小作业有利。例如大数据场景下的单次查询作业、电商秒杀场景订单生成、AI场景的单次识别作业以及互联网高并发的服务场景等。这种调度算法能够尽可能减小节点内的碎片,在空闲的机器上为申请了更大资源请求的Pod预留足够的资源空间,使集群下空闲资源得到最大化的利用。 - - - -### Priority - -{{
}} - -#### 简介 - -Priority plugin提供了job、task排序的实现,以及计算牺牲作业的函数preemptableFn。job的排序根据priorityClassName,task的排序依次根据priorityClassName、createTime、id。 - -#### 场景 - -当集群运行了多个Job,但资源不足,并且每个Job下有不等数量的Pod等待被调度的时候,如果使用Kubernetes默认调度器,那么最终,具有更多Pod数量的Job将分得更多的集群资源。在这种情况下,volcano-scheduler提供算法支持不同的Job以fair-share的形式共享集群资源。 - -Priority plugin能够让用户自定义job、task优先级,根据自己的需求在不同层次来定制调度策略。根据job的priorityClassName在应用层面进行优先级排序,例如集群中有金融场景、物联网监控场景等需要较高实时性的应用,Priority plugin能够保证其优先得到调度。 - - - -### DRF -{{
}} -#### 简介 - -DRF调度算法的全称是Dominant Resource Fairness,是基于容器组Dominant Resource的调度算法。volcano-scheduler观察每个Job请求的主导资源,并将其作为对集群资源使用的一种度量,根据Job的主导资源,计算Job的share值,在调度的过程中,具有较低share值的Job将具有更高的调度优先级。这样能够满足更多的作业,不会因为一个胖业务,饿死大批小业务。DRF调度算法能够确保在多种类型资源共存的环境下,尽可能满足分配的公平原则。 - -#### 场景 - -DRF调度算法优先考虑集群中业务的吞吐量,适用单次AI训练、单次大数据计算以及查询等批处理小业务场景。 - - - -### Proportion - -#### 简介 -Proportion调度算法是使用queue的概念,用来控制集群总资源的分配比例。每一个queue分配到的集群资源比例是一定的。举例来说,有3个团队,共享一个集群上的资源池:A团队最多使用总集群的40%,B团队最多使用30%,C团队最多使用30%。如果投递的作业量超过团队最大可用资源,就需要排队。 - -#### 场景 - -Proportion调度算法为集群的调度带来了弹性、灵活性上面的提升。最典型的场景就是在一个公司的多个开发团队,共用一个集群的时候,这种调度算法能够很好的处理不同部门之间的共享资源配比和隔离的需求。在多业务混合场景,如计算密集型的AI业务,网络IO密集型的MPI、HPC业务,存储密集型的大数据业务,Proportion调度算法通过配比,能很好的按需分配共享资源。 - - - -### Task-topology - -#### 简介 - -Task-topology算法是一种根据Job内task之间亲和性和反亲和性配置计算task优先级和Node优先级的算法。通过在Job内配置task之间的亲和性和反亲和性策略,并使用task-topology算法,可优先将具有亲和性配置的task调度到同一个节点上,将具有反亲和性配置的Pod调度到不同的节点上。 - -#### 场景 - -node affinity: - -- Task-topology对于提升深度学习计算场景下的计算效率非常重要。以TensorFlow计算为例,配置“ps”和“worker”之间的亲和性。Task-topology算法,可使“ps”和“worker”尽量调度到同一台节点上,从而提升“ps”和“worker”之间进行网络和数据交互的效率,进而提升计算效率。 -- HPC、MPI场景下task之间具有高度同步性,需要高速的网络IO。 - -Anti-affinity: - -- 以TensorFlow计算为例,“ps”与“ps”之间的反亲和性。 -- 电商服务场景的主从备份,数据容灾,保证一个作业挂掉之后有备用作业继续提供服务。 - - - -### Predicates - -#### 简介 - -Predicate plugin通过pod、nodeInfo作为参数,调用predicateGPU,根据计算结果对作业进行评估预选。 - -#### 场景 - -在AI的应用场景下,GPU资源是必需,Predicate plugin可以快速筛选出来需要GPU的进行集中调度。 - - - -### Nodeorder - -#### 简介 - -Nodeorder plugin是一种调度优选策略:通过模拟分配从各个维度为node打分,找到最适合当前作业的node。打分参数由用户来配置。参数包含了Affinity、reqResource,、LeastReqResource、MostReqResource、balanceReqResouce。 - -#### 场景 - -Nodeorder plugin给调度提供了多个维度的打分标准,不同维度的组合,能够让用户根据自身需求灵活的配置合适的调度策略。 - - - -### SLA - -#### 简介 - -SLA的全称是Service Level agreement。用户向volcano提交job的时候,可能会给job增加特殊的约束,例如最长等待时间(JobWaitingTime)。这些约束条件可以视为用户与volcano之间的服务协议。SLA plugin可以为单个作业/整个集群接收或者发送SLA参数。 - -#### 场景 - -根据业务的需要用户可以在自己的集群定制SLA相关参数。例如实时性服务要求较高的集群,JobWaitingTime可以设置的尽量小。批量计算作业为主的集群,JobWaitingTime可以设置较大。具体SLA的参数以及参数的优化需要结合具体的业务以及相关的性能测评结果。 - - - -### Tdm - -#### 简介 - -Tdm的全称是Time Division Multiplexing。在一些场景中,一些节点既属于Kubernetes集群也属于Yarn集群。Tdm plugin 需要管理员为这些节点标记为`revocable node`。Tdm plugin会在该类节点可被撤销的时间段内尝试把`preemptable task`调度给`revocable node`,并在该时间段之外清除`revocable node`上的`preemptable task`。Tdm plugin提高了volcano在调度过程中节点资源的分时复用能力。 - -#### 场景 - -适用于ToB业务中,云厂商为商家提供云化资源,不同的商家采取不同的容器编排框架(Kubernetes/Yarn等),Tdm plugin提高公共节点资源的分时使用效率,进一步提升资源的利用率。 - - - -### Numa-aware - -#### 简介 - -当节点运行多个cpu密集的pod。基于pod是否可以迁移cpu已经调度周期cpu资源状况,工作负载可以迁移到不同的cpu核心下。许多工作负载对cpu资源迁移并不敏感。然而,有一些cpu的缓存亲和度以及调度延迟显著影响性能的工作负载,kubelet允许可选的cpu编排策略(cpu management)来确定节点上cpu资源的绑定分配。 - -cpu manager以及topology manager都是kubelet的组件,它存在如下局限: - -- 基于kubelet的调度组件不支持topology-aware。所以可能由于Topology manager,导致整个node上的调度失败。这对Tensorflow job是难以接受的,因为一旦有任何worker task挂掉,整个作业都将调度失败。 -- 这些manager是节点级这导致无法在整个集群中匹配numa topology的最佳节点。 - -Numa-aware plugin致力于解决如上局限。 - -- 支持cpu资源的拓扑调度。 -- 支持pod级别的拓扑协议。 - -#### 场景 - -Numa-aware的常见场景是那些对cpu参数敏感\调度延迟敏感的计算密集型作业。如科学计算、视频解码、动漫动画渲染、大数据离线处理等具体场景。 - - - diff --git a/content/zh/docs/predicates.md b/content/zh/docs/predicates.md new file mode 100644 index 00000000..c632c000 --- /dev/null +++ b/content/zh/docs/predicates.md @@ -0,0 +1,144 @@ ++++ +title = "Predicates" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. 
+linktitle = "Predicates" +[menu.docs] + parent = "plugins" + weight = 10 ++++ + +## 简介 + +Predicate plugin通过pod、nodeInfo作为参数,调用predicateGPU,根据计算结果对作业进行评估预选。它根据各种标准过滤无法容纳任务的节点,包括资源可用性、节点条件以及GPU资源等特殊要求。 + +## 工作原理 + +Predicates插件调用各种预选函数,以pod和nodeInfo作为参数来评估和预选调度节点。它实现: + +- **PredicateFn**: 如果节点可以容纳任务返回true,否则返回false + +插件评估多个标准: +- 节点资源(CPU、Memory等) +- 节点条件和污点 +- Pod亲和性和反亲和性规则 +- 存储卷约束 +- GPU和其他加速器可用性 + +## 场景 + +### 具有GPU需求的AI工作负载 + +在AI的应用场景下,GPU资源是必需,Predicate plugin可以快速筛选出来需要GPU的进行集中调度。 + +### 资源过滤 + +插件确保只有具有足够资源的节点被考虑用于任务放置,防止由于资源约束导致的调度失败。 + +### 节点条件过滤 + +具有阻止调度条件的节点(例如NotReady、MemoryPressure、DiskPressure)被过滤掉。 + +## 配置 + +Predicates插件在调度器ConfigMap中启用: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: predicates + - name: nodeorder + - name: binpack +``` + +### 配置参数 + +Predicates插件支持多个配置选项: + +```yaml +tiers: +- plugins: + - name: predicates + arguments: + predicate.GPUSharingEnable: true + predicate.CacheEnable: true + predicate.ProportionalEnable: true + predicate.resources: nvidia.com/gpu + predicate.resources.nvidia.com/gpu.weight: 100 +``` + +| 参数 | 描述 | 默认值 | +|------|------|--------| +| `predicate.GPUSharingEnable` | 启用GPU共享预选 | false | +| `predicate.CacheEnable` | 启用预选缓存以提高性能 | true | +| `predicate.ProportionalEnable` | 启用比例资源预选 | false | + +## 示例 + +### 需要GPU资源的作业 + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: gpu-training-job +spec: + schedulerName: volcano + minAvailable: 1 + tasks: + - replicas: 1 + name: trainer + template: + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow:latest-gpu + resources: + requests: + nvidia.com/gpu: "1" + limits: + nvidia.com/gpu: "1" +``` + +Predicates插件将过滤节点,仅包括具有可用GPU资源的节点。 + +### 具有节点亲和性的作业 + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: zone-specific-job +spec: + schedulerName: volcano + minAvailable: 1 + tasks: + - replicas: 1 + name: worker + template: + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: topology.kubernetes.io/zone + operator: In + values: + - us-west-2a + containers: + - name: worker + image: busybox +``` + +Predicates插件将确保作业仅被调度到指定区域的节点。 diff --git a/content/zh/docs/priority.md b/content/zh/docs/priority.md new file mode 100644 index 00000000..846a505c --- /dev/null +++ b/content/zh/docs/priority.md @@ -0,0 +1,105 @@ ++++ +title = "Priority" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "Priority" +[menu.docs] + parent = "plugins" + weight = 6 ++++ + +{{
}} + +## 简介 + +Priority plugin提供了job、task排序的实现,以及计算牺牲作业的函数preemptableFn。job的排序根据priorityClassName,task的排序依次根据priorityClassName、createTime、id。 + +## 工作原理 + +Priority插件实现了几个关键功能: + +- **JobOrderFn**: 比较两个作业并根据`job.spec.priorityClassName`确定它们的相对优先级 +- **TaskOrderFn**: 比较两个任务并通过依次比较`task.priorityClassName`、`task.createTime`和`task.id`来确定它们的相对优先级 +- **PreemptableFn**: 根据优先级级别识别可以被抢占的任务 + +## 场景 + +当集群运行了多个Job,但资源不足,并且每个Job下有不等数量的Pod等待被调度的时候,如果使用Kubernetes默认调度器,那么最终,具有更多Pod数量的Job将分得更多的集群资源。在这种情况下,volcano-scheduler提供算法支持不同的Job以fair-share的形式共享集群资源。 + +### 自定义优先级 + +Priority plugin能够让用户自定义job、task优先级,根据自己的需求在不同层次来定制调度策略。根据job的priorityClassName在应用层面进行优先级排序。 + +### 实时性要求 + +例如集群中有金融场景、物联网监控场景等需要较高实时性的应用,Priority plugin能够保证其优先得到调度。 + +## 配置 + +Priority插件通常放在插件的第一层: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang + - name: conformance +``` + +## 示例 + +### 创建PriorityClass + +首先,在集群中创建PriorityClass: + +```yaml +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: high-priority +value: 1000000 +globalDefault: false +description: "关键工作负载的高优先级" +--- +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: low-priority +value: 1000 +globalDefault: false +description: "批处理工作负载的低优先级" +``` + +### 在VolcanoJob中使用优先级 + +然后在作业中引用优先级类: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: high-priority-job +spec: + schedulerName: volcano + priorityClassName: high-priority + minAvailable: 1 + tasks: + - replicas: 1 + name: task + template: + spec: + priorityClassName: high-priority + containers: + - name: worker + image: busybox + command: ["sleep", "1000"] +``` + +在此示例中,当资源有限时,具有`high-priority`的作业将在具有`low-priority`的作业之前被调度。 diff --git a/content/zh/docs/proportion.md b/content/zh/docs/proportion.md new file mode 100644 index 00000000..6b2845a1 --- /dev/null +++ b/content/zh/docs/proportion.md @@ -0,0 +1,168 @@ ++++ +title = "Proportion" + +date = 2021-05-13 +lastmod = 2026-01-19 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. 
+linktitle = "Proportion" +[menu.docs] + parent = "plugins" + weight = 8 ++++ + +## 简介 + +Proportion调度算法是使用**Queue**的概念,用来控制集群总资源的分配比例。每一个queue分配到的集群资源比例是一定的。 + +举例来说,有3个团队,共享一个集群上的资源池: +- A团队最多使用总集群的40% +- B团队最多使用30% +- C团队最多使用30% + +如果投递的作业量超过团队最大可用资源,就需要排队。 + +## 工作原理 + +Proportion插件根据队列配置管理资源分配: + +- **Queue Weight**: 每个队列有一个权重,决定其集群资源的份额 +- **Capability**: 队列可以使用的最大资源 +- **Guarantee**: 保证给队列的最小资源 +- **Deserved Resources**: 队列根据其权重应该获得的资源 + +实现的关键功能: + +- **QueueOrderFn**: 根据资源利用率对队列进行调度排序 +- **ReclaimableFn**: 确定是否可以从队列回收资源 +- **OverusedFn**: 检查队列是否使用了超过其应得份额的资源 + +## 场景 + +Proportion调度算法为集群的调度带来了弹性、灵活性上面的提升: + +### 多团队资源共享 + +最典型的场景就是在一个公司的多个开发团队,共用一个集群的时候,这种调度算法能够很好的处理不同部门之间的共享资源配比和隔离的需求。 + +### 多业务混合场景 + +在多业务混合场景: +- **计算密集型**: AI业务 +- **网络IO密集型**: MPI和HPC业务 +- **存储密集型**: 大数据业务 + +Proportion调度算法通过配比,能很好的按需分配共享资源。 + +## 配置 + +### 队列定义 + +首先,创建具有适当资源分配的队列: + +```yaml +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: team-a-queue +spec: + weight: 4 + capability: + cpu: "40" + memory: "80Gi" +--- +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: team-b-queue +spec: + weight: 3 + capability: + cpu: "30" + memory: "60Gi" +--- +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: team-c-queue +spec: + weight: 3 + capability: + cpu: "30" + memory: "60Gi" +``` + +### 调度器配置 + +在调度器中启用Proportion插件: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: drf + - name: predicates + - name: proportion + - name: nodeorder +``` + +## 示例 + +### 在VolcanoJob中使用队列 + +将作业提交到特定队列: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: team-a-job +spec: + schedulerName: volcano + queue: team-a-queue + minAvailable: 2 + tasks: + - replicas: 2 + name: worker + template: + spec: + containers: + - name: worker + image: busybox + resources: + requests: + cpu: "2" + memory: "4Gi" +``` + +### 具有保证和能力的队列 + +创建同时具有最小保证和最大能力的队列: + +```yaml +apiVersion: scheduling.volcano.sh/v1beta1 +kind: Queue +metadata: + name: production-queue +spec: + weight: 5 + guarantee: + resource: + cpu: "10" + memory: "20Gi" + capability: + cpu: "50" + memory: "100Gi" + reclaimable: true +``` + +在此配置中: +- 队列保证至少有10个CPU和20Gi内存 +- 当资源可用时,它可以使用最多50个CPU和100Gi内存 +- 当其他队列需要时,可以从此队列回收资源 diff --git a/content/zh/docs/sla.md b/content/zh/docs/sla.md new file mode 100644 index 00000000..0b8121d8 --- /dev/null +++ b/content/zh/docs/sla.md @@ -0,0 +1,131 @@ ++++ +title = "SLA" + +date = 2021-05-13 +lastmod = 2026-01-19 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. 
+linktitle = "SLA" +[menu.docs] + parent = "plugins" + weight = 12 ++++ + +## 简介 + +SLA的全称是**Service Level Agreement**。用户向volcano提交job的时候,可能会给job增加特殊的约束,例如最长等待时间(JobWaitingTime)。这些约束条件可以视为用户与volcano之间的服务协议。SLA plugin可以为单个作业/整个集群接收或者发送SLA参数。 + +## 工作原理 + +SLA插件监控作业等待时间,并在违反SLA约束时采取行动: + +- **JobWaitingTime**: 作业在pending状态下可以等待的最长时间 +- **JobEnqueuedFn**: 在入队之前检查作业是否满足SLA要求 + +当作业的等待时间超过配置的阈值时,调度器可以采取纠正措施,如优先调度该作业或通知管理员。 + +## 场景 + +根据业务的需要用户可以在自己的集群定制SLA相关参数: + +### 实时服务 + +例如实时性服务要求较高的集群,JobWaitingTime可以设置的尽量小,以确保作业快速被调度或标记为需要关注。 + +### 批量计算 + +批量计算作业为主的集群,JobWaitingTime可以设置较大,以允许更灵活的调度。 + +### 多租户环境 + +在多租户集群中,不同的队列或命名空间可以根据其服务层级有不同的SLA要求。 + +## 配置 + +在调度器ConfigMap中启用SLA插件: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang + - name: sla + arguments: + sla.JobWaitingTime: 10m +``` + +### 配置参数 + +| 参数 | 描述 | 默认值 | +|------|------|--------| +| `sla.JobWaitingTime` | 作业的最大等待时间 | - | + +`JobWaitingTime`参数可以使用持续时间格式指定(例如`5m`、`1h`、`30s`)。 + +## 示例 + +### 集群级SLA配置 + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: volcano-scheduler-configmap + namespace: volcano-system +data: + volcano-scheduler.conf: | + actions: "enqueue, allocate, backfill" + tiers: + - plugins: + - name: priority + - name: gang + - name: sla + arguments: + sla.JobWaitingTime: 30m + - plugins: + - name: drf + - name: predicates + - name: proportion + - name: nodeorder +``` + +### 具有SLA注解的作业 + +您还可以在作业级别指定SLA约束: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: sla-constrained-job + annotations: + volcano.sh/sla-waiting-time: "10m" +spec: + schedulerName: volcano + minAvailable: 1 + tasks: + - replicas: 1 + name: worker + template: + spec: + containers: + - name: worker + image: busybox + command: ["sleep", "3600"] +``` + +在此示例中,如果作业在pending状态下等待超过10分钟,SLA插件将标记它以进行优先调度或管理员关注。 + +### 监控SLA违规 + +Volcano公开了可用于监控SLA合规性的指标: + +- 作业等待时间指标 +- SLA违规计数 +- 队列级SLA统计 + +这些指标可以与Prometheus等监控系统集成,以跟踪集群中的SLA合规性。 diff --git a/content/zh/docs/task-topology.md b/content/zh/docs/task-topology.md new file mode 100644 index 00000000..19c31ab2 --- /dev/null +++ b/content/zh/docs/task-topology.md @@ -0,0 +1,146 @@ ++++ +title = "Task-topology" + +date = 2021-05-13 +lastmod = 2026-01-19 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. 
+linktitle = "Task-topology" +[menu.docs] + parent = "plugins" + weight = 9 ++++ + +## 简介 + +Task-topology算法是一种根据Job内task之间亲和性和反亲和性配置计算task优先级和Node优先级的算法。通过在Job内配置task之间的亲和性和反亲和性策略,并使用task-topology算法,可优先将具有亲和性配置的task调度到同一个节点上,将具有反亲和性配置的Pod调度到不同的节点上。 + +## 工作原理 + +Task-topology插件分析作业中的任务关系并优化放置: + +- **亲和性**: 受益于在同一节点上的任务(例如,用于快速本地通信) +- **反亲和性**: 应该在不同节点上的任务(例如,用于容错) + +实现的关键功能: + +- **TaskOrderFn**: 根据拓扑偏好对任务排序 +- **NodeOrderFn**: 根据节点满足拓扑要求的程度为节点评分 + +## 场景 + +### 节点亲和性 + +#### 深度学习和TensorFlow + +Task-topology对于提升深度学习计算场景下的计算效率非常重要。以TensorFlow计算为例,配置"ps"(参数服务器)和"worker"之间的亲和性。Task-topology算法,可使"ps"和"worker"尽量调度到同一台节点上,从而提升"ps"和"worker"之间进行网络和数据交互的效率,进而提升计算效率。 + +#### HPC和MPI + +HPC、MPI场景下task之间具有高度同步性,需要高速的网络IO。将相关任务放在同一节点上可以减少网络延迟并提高性能。 + +### 反亲和性 + +#### 参数服务器分布 + +以TensorFlow计算为例,"ps"与"ps"之间的反亲和性可以确保它们分布在不同的节点上以实现更好的负载分布。 + +#### 高可用性 + +电商服务场景的主从备份,数据容灾,保证一个作业挂掉之后有备用作业继续提供服务。 + +## 配置 + +在调度器中启用Task-topology插件: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: predicates + - name: nodeorder + - name: task-topology +``` + +## 示例 + +### 具有任务亲和性的作业 + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: tensorflow-job +spec: + schedulerName: volcano + minAvailable: 3 + policies: + - event: PodEvicted + action: RestartJob + tasks: + - replicas: 1 + name: ps + policies: + - event: TaskCompleted + action: CompleteJob + template: + metadata: + labels: + role: ps + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow:latest + - replicas: 2 + name: worker + template: + metadata: + labels: + role: worker + spec: + containers: + - name: tensorflow + image: tensorflow/tensorflow:latest + plugins: + env: [] + svc: [] + topologyPolicy: + mode: affinity + tiers: + - tasks: + - ps + - worker +``` + +### 具有任务反亲和性的作业 + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: ha-service +spec: + schedulerName: volcano + minAvailable: 2 + tasks: + - replicas: 2 + name: master + template: + spec: + containers: + - name: master + image: my-service:latest + topologyPolicy: + mode: anti-affinity + tiers: + - tasks: + - master +``` + +在此示例中,两个master副本将被调度到不同的节点以确保高可用性。 diff --git a/content/zh/docs/tdm.md b/content/zh/docs/tdm.md new file mode 100644 index 00000000..5092ffc2 --- /dev/null +++ b/content/zh/docs/tdm.md @@ -0,0 +1,153 @@ ++++ +title = "TDM" + +date = 2021-05-13 +lastmod = 2025-11-11 + +draft = false # Is this a draft? true/false +toc = true # Show table of contents? true/false +type = "docs" # Do not modify. + +# Add menu entry to sidebar. +linktitle = "TDM" +[menu.docs] + parent = "plugins" + weight = 13 ++++ + +## 简介 + +Tdm的全称是**Time Division Multiplexing**(时分复用)。在一些场景中,一些节点既属于Kubernetes集群也属于Yarn集群。对于这些节点,Kubernetes和YARN集群可以通过时分复用来使用这些资源。 + +Tdm plugin需要管理员为这些节点标记为`revocable node`。Tdm plugin会在该类节点可被撤销的时间段内尝试把`preemptable task`调度给`revocable node`,并在该时间段之外清除`revocable node`上的`preemptable task`。 + +Tdm plugin提高了volcano在调度过程中节点资源的分时复用能力。 + +## 工作原理 + +TDM插件管理基于时间的资源共享: + +1. **可撤销节点**: 标记为可撤销的节点,可以在编排系统之间共享 +2. **可撤销时间窗口**: 定义节点可用于Kubernetes工作负载的时间段 +3. 
**可抢占任务**: 当可撤销时间窗口结束时可以被驱逐的任务 + +关键功能: + +- **PredicateFn**: 检查在当前时间窗口期间任务是否可以被调度到可撤销节点 +- **PreemptableFn**: 根据时间约束确定任务是否应该被驱逐 + +## 场景 + +### ToB业务 + +适用于ToB业务中,云厂商为商家提供云化资源,不同的商家采取不同的容器编排框架(Kubernetes/Yarn等),Tdm plugin提高公共节点资源的分时使用效率,进一步提升资源的利用率。 + +### 混合集群 + +同时运行Kubernetes和Hadoop/YARN工作负载的组织可以使用TDM在两个系统之间共享物理节点,基于时间的调度确保工作负载不会相互干扰。 + +### 成本优化 + +通过启用时分复用,组织可以通过在不同时间段在不同工作负载类型之间共享节点来最大化其硬件基础设施的利用率。 + +## 配置 + +### 节点标签 + +首先,将节点标记为可撤销: + +```bash +kubectl label node volcano.sh/revocable-node=true +kubectl annotate node volcano.sh/revocable-zone="zone-a" +``` + +### 调度器配置 + +使用时间窗口配置启用TDM插件: + +```yaml +tiers: +- plugins: + - name: priority + - name: gang +- plugins: + - name: predicates + - name: tdm + arguments: + tdm.revocable-zone.zone-a: "0 8 * * *:0 18 * * *" + tdm.revocable-zone.zone-b: "0 20 * * *:0 6 * * *" +``` + +### 配置参数 + +| 参数 | 描述 | 格式 | +|------|------|------| +| `tdm.revocable-zone.` | 可撤销区域的时间窗口 | `:` | + +时间窗口使用cron表达式指定: +- `0 8 * * *` 表示"每天上午8:00" +- `0 18 * * *` 表示"每天下午6:00" + +## 示例 + +### 可撤销节点配置 + +```bash +# 将节点标记为zone-a中的可撤销节点 +kubectl label node worker-node-1 volcano.sh/revocable-node=true +kubectl annotate node worker-node-1 volcano.sh/revocable-zone=zone-a +``` + +### 带有TDM的调度器ConfigMap + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: volcano-scheduler-configmap + namespace: volcano-system +data: + volcano-scheduler.conf: | + actions: "enqueue, allocate, backfill" + tiers: + - plugins: + - name: priority + - name: gang + - plugins: + - name: predicates + - name: tdm + arguments: + # zone-a从上午8点到下午6点可用于Kubernetes + tdm.revocable-zone.zone-a: "0 8 * * *:0 18 * * *" + - name: nodeorder +``` + +### 可抢占作业 + +提交可以在可撤销节点上调度的作业: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: preemptable-job + annotations: + volcano.sh/preemptable: "true" +spec: + schedulerName: volcano + minAvailable: 1 + tasks: + - replicas: 1 + name: worker + template: + spec: + containers: + - name: worker + image: busybox + command: ["sleep", "3600"] +``` + +在此示例中: +- 作业被标记为可抢占 +- 它可以在配置的时间窗口期间被调度到可撤销节点 +- 当时间窗口结束时将被驱逐
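
配置完成后,可以用下面的命令确认节点是否已正确打上可撤销标记(示例命令,节点名 `worker-node-1` 与上文示例保持一致,请替换为实际节点名):

```bash
# 查看节点标签,确认 volcano.sh/revocable-node=true 已生效
kubectl get node worker-node-1 --show-labels

# 查看节点上与 revocable 相关的标签和注解
kubectl describe node worker-node-1 | grep -i revocable
```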