[Draft] feat: run the gpu feature discovery pod as a job on the same node #30
base: main
Conversation
Signed-off-by: Rajat Chopra <rajatc@nvidia.com>
    Command: []string{
        "/bin/bash",
        "-c",
        "GPU_COUNT=$(kubectl get nodes \"$NODE_NAME\" -o=jsonpath='{.status.allocatable.nvidia\\.com/pgpu}')\n" +
pgpu should be derived from P_GPU_ALIAS if present; otherwise use the name that the DP provides.
    @@ -0,0 +1,285 @@
    /*
     * Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Drop the year; hardcoding it in license headers is bad practice.
Pull request overview
This PR introduces automated GPU feature discovery by running a Kubernetes job on each node with GPU capabilities. The implementation creates a job that executes GPU feature discovery and labels the node accordingly.
Changes:
- Added `gfd.go` with functions to create and manage GPU feature discovery jobs
- Integrated GFD job execution into the device plugin initialization flow
- Updated Go module dependencies to support Kubernetes client libraries
Reviewed changes
Copilot reviewed 3 out of 2447 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| pkg/device_plugin/gfd.go | New file implementing GPU feature discovery job creation, runtime waiting logic, and job completion monitoring |
| pkg/device_plugin/device_plugin.go | Launches GFD job as a goroutine during device plugin initialization |
| go.mod | Added k8s.io client dependencies and updated Go version to support the new functionality |
Comments suppressed due to low confidence (1)
pkg/device_plugin/gfd.go:1
- Hard-coded string comparison for condition status should use the typed constant corev1.ConditionTrue instead of the string literal 'True' to ensure type safety.
    // 1. Get the Node Name from the environment (passed via Downward API)
    nodeName := os.Getenv("NODE_NAME")
    if nodeName == "" {
        log.Printf("NODE_NAME environment variable is required")
Copilot AI, Jan 28, 2026
These error messages lack context about what operation failed. Consider using log.Printf with descriptive messages like 'Failed to create in-cluster config: %v' or 'Failed to create Kubernetes clientset: %v' to help with debugging.
        return
    }
    namespace := os.Getenv("POD_NAMESPACE")
    if namespace == "" {
        log.Printf("POD_NAMESPACE environment variable is required")
        return
    }

    // 2. Authenticate within the cluster
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Printf(err.Error())
        return
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Printf(err.Error())
        return
    }

    err = WaitForKataRuntime(clientset, nodeName)
    if err != nil {
        log.Printf(err.Error())
Suggested change:

        log.Printf("NODE_NAME environment variable is required; failed to determine target node for GFD job")
        return
    }
    namespace := os.Getenv("POD_NAMESPACE")
    if namespace == "" {
        log.Printf("POD_NAMESPACE environment variable is required; failed to determine namespace for GFD job")
        return
    }
    // 2. Authenticate within the cluster
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Printf("Failed to create in-cluster config: %v", err)
        return
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Printf("Failed to create Kubernetes clientset: %v", err)
        return
    }
    err = WaitForKataRuntime(clientset, nodeName)
    if err != nil {
        log.Printf("Failed while waiting for kata runtime on node %s: %v", nodeName, err)
    if condition.Type == "Complete" && condition.Status == "True" {
        log.Println("GFD job completed successfully!")
        return true, nil
    }
    if condition.Type == "Failed" && condition.Status == "True" {
Copilot AI, Jan 28, 2026
Hard-coded string comparisons for Kubernetes condition types should use the typed constants from the batch/v1 API. Use batchv1.JobComplete and batchv1.JobFailed instead of string literals to ensure type safety and prevent typos.
Suggested change:

    if condition.Type == batchv1.JobComplete && condition.Status == corev1.ConditionTrue {
        log.Println("GFD job completed successfully!")
        return true, nil
    }
    if condition.Type == batchv1.JobFailed && condition.Status == corev1.ConditionTrue {
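The type-safety argument can be seen in miniature with local stand-ins for the API types (the real constants live in `k8s.io/api/batch/v1` and `k8s.io/api/core/v1`; the types below only mirror their shape):

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes API types; in real code these
// come from batchv1 and corev1, so a typo like "Complet" fails to
// compile instead of silently never matching.
type JobConditionType string
type ConditionStatus string

const (
	JobComplete   JobConditionType = "Complete"
	JobFailed     JobConditionType = "Failed"
	ConditionTrue ConditionStatus  = "True"
)

type JobCondition struct {
	Type   JobConditionType
	Status ConditionStatus
}

// jobFinished reports whether the job is done and whether it succeeded.
func jobFinished(conds []JobCondition) (done, succeeded bool) {
	for _, c := range conds {
		if c.Type == JobComplete && c.Status == ConditionTrue {
			return true, true
		}
		if c.Type == JobFailed && c.Status == ConditionTrue {
			return true, false
		}
	}
	return false, false
}

func main() {
	done, ok := jobFinished([]JobCondition{{Type: JobComplete, Status: ConditionTrue}})
	fmt.Println(done, ok)
}
```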
    InitContainers: []corev1.Container{
        {
            Name:  "gpu-feature-discovery",
            Image: "nvcr.io/nvidia/k8s-device-plugin:v0.17.0",
Copilot AI, Jan 28, 2026
The container image version is hardcoded. Consider making this configurable through an environment variable or configuration parameter to facilitate version updates without code changes.
+1 this should not be hardcoded. My recommendation is as follows:
- Include the gpu-feature-discovery binary in the sandbox-device-plugin Dockerfile.
- This init container should be using the sandbox-device-plugin image (with the same version that is currently deployed!)
pkg/device_plugin/gfd.go (Outdated)
    Name:  "post-process",
    Image: "nvcr.io/ea-cnt/nv_only/kata-deploy:47001e265",
Copilot AI, Jan 28, 2026
The container image version is hardcoded. Consider making this configurable through an environment variable or configuration parameter to facilitate version updates without code changes.
Suggested change:

    Name: "post-process",
    Image: func() string {
        if v := os.Getenv("KATA_DEPLOY_IMAGE"); v != "" {
            return v
        }
        return "nvcr.io/ea-cnt/nv_only/kata-deploy:47001e265"
    }(),
Are we really planning to use the kata-deploy image for this? This does not feel right...
Yes, replacing this with alpine/kubectl
    Command: []string{
        "/bin/bash",
        "-c",
        "GPU_COUNT=$(kubectl get nodes \"$NODE_NAME\" -o=jsonpath='{.status.allocatable.nvidia\\.com/pgpu}')\n" +
Copilot AI, Jan 28, 2026
Inline bash script uses kubectl commands which could fail. Consider adding error handling in the script such as 'set -e' at the beginning or checking exit codes to ensure failures are properly reported.
Suggested change:

    "set -euo pipefail\n" +
    "GPU_COUNT=$(kubectl get nodes \"$NODE_NAME\" -o=jsonpath='{.status.allocatable.nvidia\\.com/pgpu}')\n" +
Also, please provide a description; not everyone reviewing this knows why we're doing this.
No description provided.