@zvonkok zvonkok commented Jan 15, 2026

After enumerating the needed devices via go-nvlib, create the CDI spec before starting the device-plugin.

Depends On: #25

Copilot AI review requested due to automatic review settings January 15, 2026 22:36

Copilot AI left a comment

Pull request overview

This PR adds CDI (Container Device Interface) spec generation for NVIDIA GPUs and NVSwitches after device enumeration. The changes refactor device discovery to use go-nvlib's nvpci interface instead of manual filesystem operations, and generate CDI specifications before starting device plugins.

Changes:

  • Replaced manual PCI device discovery with go-nvlib's nvpci library for device enumeration
  • Added CDI spec generation for GPUs and NVSwitches with support for both IOMMUFD and legacy VFIO modes
  • Refactored device structures to include additional metadata (device name, IommuFD, IsNVSwitch flag)
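To illustrate the IOMMUFD versus legacy VFIO distinction in the list above, here is a minimal sketch of how a device node path and CDI kind could be chosen per device. The `Device` struct and both helper functions are hypothetical and not taken from the PR; the sketch only assumes the usual kernel layout, where IOMMUFD exposes per-device nodes under `/dev/vfio/devices/` and legacy VFIO exposes one node per IOMMU group under `/dev/vfio/`.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Device is a hypothetical, trimmed-down view of an enumerated PCI device.
// The field names mirror the metadata mentioned in the review summary,
// but the actual struct in the PR may look different.
type Device struct {
	Address    string // PCI address, e.g. "0000:1b:00.0"
	IommuGroup int    // legacy VFIO group number
	IommuFD    string // per-device node name such as "vfio8"; empty when IOMMUFD is not in use
	IsNVSwitch bool
}

// deviceNodePath returns the host device node a container would need for d.
// IOMMUFD mode uses a per-device node; legacy mode falls back to the group node.
func deviceNodePath(d Device) string {
	if d.IommuFD != "" {
		return filepath.Join("/dev/vfio/devices", d.IommuFD)
	}
	return filepath.Join("/dev/vfio", fmt.Sprint(d.IommuGroup))
}

// cdiKind picks the CDI kind for d, using the two kinds shown in this PR.
func cdiKind(d Device) string {
	if d.IsNVSwitch {
		return "nvidia.com/nvswitch"
	}
	return "nvidia.com/pgpu"
}

func main() {
	devices := []Device{
		{Address: "0000:1b:00.0", IommuFD: "vfio0"},
		{Address: "0000:05:00.0", IommuFD: "vfio8", IsNVSwitch: true},
		{Address: "0000:43:00.0", IommuGroup: 42}, // legacy VFIO, no IOMMUFD node
	}
	for _, d := range devices {
		fmt.Printf("%s %s %s\n", d.Address, cdiKind(d), deviceNodePath(d))
	}
}
```

In legacy mode a container typically also needs the shared `/dev/vfio/vfio` container node, which this sketch leaves out.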

Reviewed changes

Copilot reviewed 7 out of 3089 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pkg/device_plugin/generic_device_plugin.go | Improved context cancellation handling and removed manual VFIO device discovery logic |
| pkg/device_plugin/generic_device_plugin_test.go | Updated tests to use structured device data with complete PCI addresses and device metadata |
| pkg/device_plugin/device_plugin.go | Replaced filesystem-based device discovery with nvpci library and added device name formatting |
| pkg/device_plugin/device_plugin_test.go | Rewrote tests to use nvpci mocks instead of filesystem operations |
| pkg/device_plugin/constants.go | Added CDI-related constants and removed unused basePath variable |
| pkg/device_plugin/cdi.go | New file implementing CDI spec generation for discovered devices |
| go.mod | Updated Go version and added dependencies for nvpci and CDI libraries |


@rajatchopra rajatchopra self-assigned this Jan 16, 2026

zvonkok commented Jan 19, 2026

NVSwitches and GPUs are sorted in the generated CDI specs:

❯ sudo cat /var/run/cdi/nvidia.com-nvswitch.yaml
---
cdiVersion: 1.1.0
kind: nvidia.com/nvswitch
devices:
    - name: "0"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio8
    - name: "1"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio9
    - name: "2"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio10
    - name: "3"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio11
❯ sudo cat /var/run/cdi/nvidia.com-pgpu.yaml 
---
cdiVersion: 1.1.0
kind: nvidia.com/pgpu
devices:
    - name: "0"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio0
    - name: "1"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio1
    - name: "2"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio2
    - name: "3"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio3
    - name: "4"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio4
    - name: "5"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio5
    - name: "6"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio6
    - name: "7"
      containerEdits:
        deviceNodes:
            - path: /dev/vfio/devices/vfio7
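For reference, a minimal sketch of how a sorted spec like the ones above could be produced. It assumes the devices are ordered by PCI address, which the output does not confirm, and it renders the YAML by hand with the standard library instead of the CDI library that the PR adds to go.mod. The `node` type and `renderSpec` function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// node is a hypothetical pairing of a PCI address with its VFIO device node.
type node struct {
	Address string // fixed-width PCI address, e.g. "0000:1b:00.0"
	Path    string // e.g. "/dev/vfio/devices/vfio8"
}

// renderSpec sorts a copy of nodes by PCI address and renders a CDI spec
// whose device names are the positions in that sorted order.
// Sorting by address is an assumption; the PR only states that devices are sorted.
func renderSpec(kind string, nodes []node) string {
	sorted := append([]node(nil), nodes...) // leave the caller's slice untouched
	// Fixed-width lowercase hex addresses compare correctly as plain strings.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Address < sorted[j].Address })

	var b strings.Builder
	b.WriteString("---\ncdiVersion: 1.1.0\n")
	fmt.Fprintf(&b, "kind: %s\ndevices:\n", kind)
	for i, n := range sorted {
		fmt.Fprintf(&b, "    - name: \"%d\"\n", i)
		b.WriteString("      containerEdits:\n        deviceNodes:\n")
		fmt.Fprintf(&b, "            - path: %s\n", n.Path)
	}
	return b.String()
}

func main() {
	// Example addresses are made up for illustration.
	switches := []node{
		{Address: "0000:9b:00.0", Path: "/dev/vfio/devices/vfio9"},
		{Address: "0000:1b:00.0", Path: "/dev/vfio/devices/vfio8"},
	}
	fmt.Print(renderSpec("nvidia.com/nvswitch", switches))
}
```

Sorting before assigning names keeps the CDI device names stable across restarts as long as the PCI topology does not change.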


zvonkok commented Jan 19, 2026

❯ k describe node | grep nvidia.com                                        
                    nvidia.com/cc.capable=true
                    nvidia.com/cc.mode=ppcie
                    nvidia.com/cc.mode.state=ppcie
                    nvidia.com/gpu.deploy.cc-manager=true
                    nvidia.com/gpu.deploy.kata-manager=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.sandbox-device-plugin=true
                    nvidia.com/gpu.deploy.sandbox-validator=true
                    nvidia.com/gpu.deploy.vfio-manager=true
                    nvidia.com/gpu.family=hopper
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.workload.config=vm-passthrough
  nvidia.com/GH100_H100_NVSWITCH:             4
  nvidia.com/pgpu:                            8
  nvidia.com/GH100_H100_NVSWITCH:             4
  nvidia.com/pgpu:                            8
  nvidia.com/GH100_H100_NVSWITCH             0            0
  nvidia.com/pgpu                            0            0
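The `nvidia.com/GH100_H100_NVSWITCH` resource name above looks like a formatted PCI device name. A minimal sketch of one way such formatting could work, assuming the input is a name like `GH100 [H100 NVSwitch]`; both the input string and the `formatResourceName` function are assumptions for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonAlnum matches runs of characters that are not upper-case letters or digits.
var nonAlnum = regexp.MustCompile(`[^A-Z0-9]+`)

// formatResourceName turns a human-readable PCI device name into a
// resource name under the nvidia.com/ prefix.
// This is a hypothetical helper, not the function used in the PR.
func formatResourceName(deviceName string) string {
	upper := strings.ToUpper(deviceName)
	// Collapse spaces, brackets and other separators into single underscores.
	name := nonAlnum.ReplaceAllString(upper, "_")
	return "nvidia.com/" + strings.Trim(name, "_")
}

func main() {
	fmt.Println(formatResourceName("GH100 [H100 NVSwitch]"))
}
```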

Next step: NVSWITCH_ALIAS.


@rajatchopra rajatchopra left a comment

LGTM.
We can optimize the structures in a follow-up if needed.

@zvonkok zvonkok force-pushed the nvswitch-cdi branch 2 times, most recently from 2813e05 to 6205624 on January 23, 2026 19:19

After enumerating all the target devices, create the CDI spec before starting the device-plugin.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
@zvonkok zvonkok merged commit b14436b into NVIDIA:main Jan 23, 2026
1 of 3 checks passed