
Support for dynamic/flexible device reservations #27402

@chrisboulton

Description

Proposal

Nomad’s device scheduling model currently requires users to specify fixed device reservations (e.g., a specific device count with constraints or affinities). This works well for many workloads, but there is an opportunity to provide greater flexibility, specifically for workloads like GPUs or other accelerators where we want more control over placement across heterogeneous hardware: on a particular class of machine we might want to reserve 1 GPU, but if that isn't available, we might also be able to run the workload on a different node with different hardware, at the cost of reserving more than one GPU.
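For context, a fixed reservation today looks roughly like the following; the attribute and values here are illustrative:

device "nvidia/gpu" {
  # today: a single, fixed reservation per device block
  count = 2

  # optional filtering on device attributes
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "80 GiB"
  }
}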

This proposal requests a more flexible device reservation mechanism that allows a job to express alternative or prioritized device requirements within a single task group.

The scheduler should be able to evaluate multiple device reservation options for a task and select one that can be satisfied. This would allow users to define what they need in terms of capability/capacity, rather than binding the job to a single, fixed device layout.

Use-cases

For me (if it wasn't obvious from the above), the primary use case for this functionality is for flexibility in scheduling GPU workloads across heterogeneous fleets:

A workload may run on a single large GPU (e.g., a GH200 with 96 GB), but it can also run on two smaller GPUs (e.g., two H100s with 80 GB each). Without some kind of flexible scheduling, you are forced to create multiple jobs or task groups to represent each option, and then build some kind of wrapper on top of Nomad to scale these workloads up or down based on free hardware/capacity.
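To make the shape of that workaround concrete, here is a rough sketch (group, task, and attribute values are hypothetical) of the duplication it forces today: one task group per hardware option, each scaled by external tooling rather than by Nomad itself:

group "inference-gh200" {
  # scaled up/down by an external wrapper, not by Nomad
  count = 0

  task "serve" {
    driver = "docker"
    # (task config omitted)

    resources {
      device "nvidia/gpu" {
        count = 1
        constraint {
          attribute = "${device.attr.model}"
          value     = "GH200"
        }
      }
    }
  }
}

group "inference-h100" {
  # scaled up/down by an external wrapper, not by Nomad
  count = 0

  task "serve" {
    driver = "docker"
    # (task config omitted)

    resources {
      device "nvidia/gpu" {
        count = 2
        constraint {
          attribute = "${device.attr.model}"
          value     = "H100"
        }
      }
    }
  }
}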

I think it's important to note that this kind of functionality should come with some notion of preference/ordering (i.e., affinities): along with being able to reserve different devices, you can tell Nomad that your first preference is this, your second is that, and so on, and in each case the correct number of devices is reserved.

This is similar to the prioritized list support for DRA that was recently added to Kubernetes.

Attempted Solutions

I've got a first pass of my proposed solution here: #27391. Repeating myself from that pull request, first_available is added as a new block for devices: an ordered list of your preferences, where Nomad stops after the first set of constraints is matched:

device "nvidia/gpu" {
  # i would prefer this workload to land on a GH200 and if it does, it needs one GPU
  first_available {
    count = 1
    constraint {
      attribute = "${device.attr.model}"
      value = "GH200"
    }
  }
  # otherwise, i'll take a pair of H100s
  first_available {
    count = 2
    constraint {
      attribute = "${device.attr.model}"
      value = "H100"
    }
  }
}

With a job configuration like this, Nomad will first try to schedule the workload on a GH200. If that's not available, it will then try to schedule it on two H100s. If that's not available either, it will fail the job.

A more complete version of the above, which makes decisions based on memory (the actual constraint) and shows the degree of flexibility this gives us in combination with the existing constraint and affinity options, might look like:

device "nvidia/gpu" {
  # i only need to reserve a single gpu if it has at least 90 GiB of memory
  first_available {
    count = 1
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "90 GiB"
    }
  }

  # but if we can't, i'll take a pair of H100s if they have at least 80 GiB of memory
  first_available {
    count = 2
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "80 GiB"
    }
  }

  # either way we only want this workload to be scheduled on nvidia gpus..
  constraint {
    attribute = "${device.attr.vendor}"
    value     = "nvidia"
  }

  # and we'd prefer to schedule it on amd64
  affinity {
    attribute = "${node.attr.arch}"
    value     = "amd64"
    weight    = 100
  }
}

first_available is an ordered list of options where the first match wins. Inside first_available, constraint is supported, which lets you perform additional filtering.

first_available and count are mutually exclusive at the device level.

count, affinity, and constraint without first_available are supported as before.
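To illustrate the mutual-exclusivity rule, a configuration like the sketch below would be rejected under the proposal, because the top-level count and first_available both try to define the reservation:

device "nvidia/gpu" {
  # invalid under the proposal: top-level count conflicts with first_available
  count = 1

  first_available {
    count = 2
    constraint {
      attribute = "${device.attr.model}"
      value     = "H100"
    }
  }
}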

This could potentially be made simpler by doing away with the nesting of constraints and having first_available support constraint expressions directly:

device "nvidia/gpu" {
  first_available {
    count = 1
    attribute = "${device.attr.model}"
    value = "GH200"
  }
  first_available {
    count = 2
    attribute = "${device.attr.model}"
    value = "H100"
  }
}

I'm willing to work with y'all on a design that fits here, and to iterate on my existing changeset or land a new one. Would love to see this functionality land!
