
Support for dynamic/flexible device reservations #27402

@chrisboulton

Description

Proposal

Nomad’s device scheduling model currently requires users to specify fixed device reservations (e.g., a specific device count with constraints or affinities). This works well for many workloads, but there is an opportunity to provide greater flexibility, specifically for workloads like GPUs or other accelerators where we want more control over placement across heterogeneous hardware: on a particular class of machine we might want to reserve 1 GPU, but if that isn't available, we might also be able to run the workload on a different node with different hardware, at the cost of reserving more than one GPU.
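For context, a fixed reservation today looks roughly like the following; the attribute and values here are illustrative:

device "nvidia/gpu" {
  # today: a single, fixed reservation per device block
  count = 2

  # optional filtering on device attributes
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "80 GiB"
  }
}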

This proposal requests a more flexible device reservation mechanism that allows a job to express alternative or prioritized device requirements within a single task group.

The scheduler should be able to evaluate multiple device reservation options for a task and select one that can be satisfied. This would allow users to define what they need in terms of capability/capacity, rather than binding the job to a single, fixed device layout.

Use-cases

For me (if it wasn't obvious from the above), the primary use case for this functionality is for flexibility in scheduling GPU workloads across heterogeneous fleets:

A workload may run on a single large GPU (e.g., a GH200 with 96 GB), but it can also run on two smaller GPUs (e.g., two H100s with 80 GB each). Without some kind of flexible scheduling, you are forced to create multiple jobs or task groups to represent each option, and then build some kind of wrapper on top of Nomad to scale these workloads up or down based on free hardware/capacity.
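To make the shape of that workaround concrete, here is a rough sketch (group, task, and attribute values are hypothetical) of the duplication it forces today: one task group per hardware option, each scaled by external tooling rather than by Nomad itself:

group "inference-gh200" {
  # scaled up/down by an external wrapper, not by Nomad
  count = 0

  task "serve" {
    driver = "docker"
    # (task config omitted)

    resources {
      device "nvidia/gpu" {
        count = 1
        constraint {
          attribute = "${device.attr.model}"
          value     = "GH200"
        }
      }
    }
  }
}

group "inference-h100" {
  # scaled up/down by an external wrapper, not by Nomad
  count = 0

  task "serve" {
    driver = "docker"
    # (task config omitted)

    resources {
      device "nvidia/gpu" {
        count = 2
        constraint {
          attribute = "${device.attr.model}"
          value     = "H100"
        }
      }
    }
  }
}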

I think it's important to note that this kind of functionality should come with some notion of preference/ordering (i.e., affinities): along with being able to reserve different devices, you can tell Nomad that your first preference is this, your second is that, and so on, and in each case the correct number of devices is reserved.

This is similar to the prioritized list support for DRA that was recently added to Kubernetes.

Attempted Solutions

I've got a first pass of my proposed solution here: #27391. Repeating myself from that pull request, first_available is added as a new block for devices: an ordered list of your preferences, where Nomad stops after the first set of constraints is matched:

device "nvidia/gpu" {
  # i would prefer this workload to land on a GH200 and if it does, it needs one GPU
  first_available {
    count = 1
    constraint {
      attribute = "${device.attr.model}"
      value = "GH200"
    }
  }
  # otherwise, i'll take a pair of H100s
  first_available {
    count = 2
    constraint {
      attribute = "${device.attr.model}"
      value = "H100"
    }
  }
}

With a job configuration like this, Nomad will first try to schedule the workload on a GH200. If that's not available, it will then try to schedule it on two H100s. If that's not available either, it will fail the job.

A more complete version of the above, which makes decisions based on memory (the actual constraint) and shows the degree of flexibility this gives us in combination with the existing constraint and affinity options, might look like:

device "nvidia/gpu" {
  # i only need to reserve a single gpu if it has at least 90 GiB of memory
  first_available {
    count = 1
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "90 GiB"
    }
  }

  # but if we can't, i'll take a pair of H100s if they have at least 80 GiB of memory
  first_available {
    count = 2
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "80 GiB"
    }
  }

  # either way we only want this workload to be scheduled on nvidia gpus..
  constraint {
    attribute = "${device.attr.vendor}"
    value     = "nvidia"
  }

  # and we'd prefer to schedule it on amd64
  affinity {
    attribute = "${node.attr.arch}"
    value     = "amd64"
    weight    = 100
  }
}

first_available is an ordered list of options where the first match wins. Inside first_available, constraint is supported, which lets you perform additional filtering.

first_available and count are mutually exclusive at the device level.

count, affinity, and constraint without first_available are supported as before.
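To illustrate the mutual-exclusivity rule, a configuration like the sketch below would be rejected under the proposal, because the top-level count and first_available both try to define the reservation:

device "nvidia/gpu" {
  # invalid under the proposal: top-level count conflicts with first_available
  count = 1

  first_available {
    count = 2
    constraint {
      attribute = "${device.attr.model}"
      value     = "H100"
    }
  }
}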

This could potentially be made simpler by doing away with the nesting of constraints and having first_available support constraint expressions directly:

device "nvidia/gpu" {
  first_available {
    count = 1
    attribute = "${device.attr.model}"
    value = "GH200"
  }
  first_available {
    count = 2
    attribute = "${device.attr.model}"
    value = "H100"
  }
}

I'm willing to work with y'all on a design that fits here, and to iterate on my existing changeset or land a new one. Would love to see this functionality land!
