This repository was archived by the owner on Dec 9, 2025. It is now read-only.

Announce Rails in ResourceSlice #173

@aojea

Several network factors affect the performance of AI/ML workloads. This presentation covers them in more detail: https://docs.google.com/presentation/d/1VbxvXc1aqIjdpin7-MxF_ECaApsgAoWPySIZ3tq1ZoE/edit?slide=id.p#slide=id.p

Intra-node topology alignment between GPUs and NICs is achieved via MatchAttributes. However, these workloads also require inter-node alignment, because VMs/machines are usually cabled in a specific way to optimize the network.

As a result, in a cluster of VMs it is not enough to match ANY GPU and NIC that share a pciRoot. We MUST match a GPU and NIC that are close in the machine topology and that are also on the same Rail across machines.
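To make the requirement concrete, here is a minimal sketch of the two-level check: first pair GPUs and NICs that share a pciRoot on a node, then keep only a rail on which every node has such a pair. The `Device` type, its `pci_root` and `rail` fields, and the helper names are hypothetical and only illustrate the constraint. They are not part of DRA or of this project.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Device:
    node: str      # machine that hosts the device
    kind: str      # "gpu" or "nic"
    pci_root: str  # hypothetical attribute: PCI root the device hangs off
    rail: int      # hypothetical attribute: rail the device is cabled to


def aligned_pairs(devices: List[Device]) -> Dict[Tuple[str, int], Tuple[Device, Device]]:
    """Map (node, rail) to one GPU/NIC pair that shares a pciRoot on that rail."""
    pairs: Dict[Tuple[str, int], Tuple[Device, Device]] = {}
    gpus = [d for d in devices if d.kind == "gpu"]
    nics = [d for d in devices if d.kind == "nic"]
    for gpu in gpus:
        for nic in nics:
            same_node = gpu.node == nic.node
            same_root = gpu.pci_root == nic.pci_root
            same_rail = gpu.rail == nic.rail
            if same_node and same_root and same_rail:
                # Keep the first pair found for this node and rail.
                pairs.setdefault((gpu.node, gpu.rail), (gpu, nic))
    return pairs


def pick_rail(devices: List[Device], nodes: List[str]) -> Optional[int]:
    """Return the lowest rail on which every node has an aligned pair, else None."""
    pairs = aligned_pairs(devices)
    for rail in sorted({rail for (_, rail) in pairs}):
        if all((node, rail) in pairs for node in nodes):
            return rail
    return None


devices = [
    Device("vm-a", "gpu", "pci0", 0), Device("vm-a", "nic", "pci0", 0),
    Device("vm-a", "gpu", "pci1", 1), Device("vm-a", "nic", "pci1", 1),
    # On vm-b the rail-0 NIC sits under a different pciRoot than the rail-0 GPU.
    Device("vm-b", "gpu", "pci0", 0), Device("vm-b", "nic", "pci1", 0),
    Device("vm-b", "gpu", "pci1", 1), Device("vm-b", "nic", "pci1", 1),
]
```

In this made-up inventory, `pick_rail(devices, ["vm-a", "vm-b"])` skips rail 0 because vm-b has no GPU/NIC pair under one pciRoot on that rail, and settles on rail 1. Matching on pciRoot alone would have accepted each node independently without ensuring that both ends use the same rail.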

Per a conversation in the Kubernetes Slack (https://kubernetes.slack.com/archives/C0409NGC1TK/p1753615387658689), this can be achieved using DRA:

Using a node selector in the ResourceSlice instead of a node name makes the devices available for use on all nodes matching the selector.
The scheduler picks a device as usual and sets the ResourceClaim status so that it is marked as usable on the same nodes as the device. If there are multiple devices, the claim node selector covers the intersection of all device node selectors.
Depending on that outcome, the ResourceClaim might be usable by multiple different pods on different nodes.
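The intersection behaviour described above can be sketched as follows. This is a simplified model, not the scheduler's code: each selector is reduced to a dict of exact label matches, whereas a real Kubernetes node selector supports terms and match expressions. The `example.com/rail` and `example.com/zone` labels and the function names are hypothetical.

```python
from typing import Dict, List, Optional

Labels = Dict[str, str]


def intersect_selectors(selectors: List[Labels]) -> Optional[Labels]:
    """Combine exact-match label requirements; None means they conflict."""
    combined: Labels = {}
    for selector in selectors:
        for key, value in selector.items():
            if combined.setdefault(key, value) != value:
                return None  # two devices require different values for one label
    return combined


def usable_nodes(nodes: Dict[str, Labels], selectors: List[Labels]) -> List[str]:
    """Names of nodes that satisfy every device selector at once."""
    combined = intersect_selectors(selectors)
    if combined is None:
        return []
    return sorted(
        name for name, labels in nodes.items()
        if all(labels.get(k) == v for k, v in combined.items())
    )


nodes = {
    "vm-a": {"example.com/rail": "1", "example.com/zone": "z1"},
    "vm-b": {"example.com/rail": "1", "example.com/zone": "z1"},
    "vm-c": {"example.com/rail": "1", "example.com/zone": "z2"},
}
rail_device = {"example.com/rail": "1"}  # selector of a hypothetical rail device
zone_device = {"example.com/zone": "z1"}  # selector of a second hypothetical device
```

With a single rail device the claim would be usable on all three nodes, while `usable_nodes(nodes, [rail_device, zone_device])` narrows it to vm-a and vm-b. In this model, that is how one claim ends up shared by pods on different nodes of the same rail.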

NVIDIA already implements something similar with IMEX channels:

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-cds.html#a-deeper-dive-related-resources
https://docs.google.com/presentation/d/1Xupr8IZVAjs5bNFKJnYaK0LE7QWETnJjkz6KOfLu87E/edit?pli=1&slide=id.g28ac369118f_0_1647#slide=id.g[…]118f_0_1647

cc: @gauravkghildiyal @michaelasp

Labels: enhancement