Announce Rails in ResourceSlice #173
Description
For optimal performance of AI/ML workloads there are several factors that impact the network; see this presentation for more detail: https://docs.google.com/presentation/d/1VbxvXc1aqIjdpin7-MxF_ECaApsgAoWPySIZ3tq1ZoE/edit?slide=id.p#slide=id.p
Intra-node topology and GPU/NIC alignment can be achieved via MatchAttributes. However, these workloads also require inter-node alignment, because VMs/machines are typically cabled in a specific way to optimize the network.
As a consequence, in a cluster of VMs we cannot simply require matching ANY GPU and NIC that are in the same pciRoot; we MUST match a GPU and NIC that are close in the machine topology AND that are on the same rail across machines.
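For the intra-node half, here is a minimal sketch of a ResourceClaim using a matchAttributes constraint so that the allocated GPU and NIC share the same PCIe root. This assumes the resource.k8s.io/v1beta1 API (field names differ across DRA API versions); the driver names gpu.example.com/nic.example.com and the example.com/pcieRoot attribute are hypothetical placeholders:

```yaml
# Sketch only: DeviceClass names and the pcieRoot attribute are assumptions.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-nic-aligned
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com   # hypothetical DeviceClass
    - name: nic
      deviceClassName: nic.example.com   # hypothetical DeviceClass
    constraints:
    # Both allocated devices must report the same value for this
    # attribute, which gives intra-node GPU/NIC alignment.
    - requests: ["gpu", "nic"]
      matchAttributes: ["example.com/pcieRoot"]
```

Nothing in this claim, however, prevents the scheduler from picking devices on different rails; that is the missing inter-node constraint.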
Per the conversation in the Kubernetes Slack (https://kubernetes.slack.com/archives/C0409NGC1TK/p1753615387658689), this can be achieved using DRA:
- Using a node selector in the ResourceSlice instead of a node name makes the devices available for use on all nodes matching the selector.
- The scheduler picks a device as usual and sets the ResourceClaim status so that it is marked as usable on the same nodes as the device. If there are multiple devices, the claim node selector covers the intersection of all device node selectors.
- Depending on that outcome, the ResourceClaim might be usable by multiple different pods on different nodes.
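Following that mechanism, a hedged sketch of how a driver could announce rail membership: publish one ResourceSlice per rail and use nodeSelector instead of nodeName, so the slice's devices are available on every node cabled to that rail. The example.com/rail node label, pool name, driver name, and rail attribute are illustrative assumptions, again against the resource.k8s.io/v1beta1 API:

```yaml
# Sketch only: label, driver, and attribute names are assumptions.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: gpu.example.com-rail-0
spec:
  driver: gpu.example.com
  # nodeSelector (instead of nodeName) makes these devices usable on
  # every node carrying the rail label, so the claim node selector the
  # scheduler computes ends up covering exactly the nodes of one rail.
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: example.com/rail
        operator: In
        values: ["rail-0"]
  pool:
    name: rail-0
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      attributes:
        rail:
          string: "rail-0"
```

Because the claim node selector covers the intersection of all device node selectors, allocating devices from such slices restricts the ResourceClaim to nodes of one rail, which is exactly the inter-node alignment described above.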
NVIDIA already implements something similar with IMEX channels:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-cds.html#a-deeper-dive-related-resources
https://docs.google.com/presentation/d/1Xupr8IZVAjs5bNFKJnYaK0LE7QWETnJjkz6KOfLu87E/edit?pli=1&slide=id.g28ac369118f_0_1647#slide=id.g[…]118f_0_1647