Skip to content

Conversation

@rajatchopra
Copy link
Collaborator

No description provided.

Signed-off-by: Rajat Chopra <rajatc@nvidia.com>
Command: []string{
"/bin/bash",
"-c",
"GPU_COUNT=$(kubectl get nodes \"$NODE_NAME\" -o=jsonpath='{.status.allocatable.nvidia\\.com/pgpu}')\n" +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pgpu should be derived from the P_GPU_ALIAS if present otherwise use the name that the DP provides.

@@ -0,0 +1,285 @@
/*
* Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the year, bad practice.

@zvonkok zvonkok requested a review from Copilot January 28, 2026 19:14
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces automated GPU feature discovery by running a Kubernetes job on each node with GPU capabilities. The implementation creates a job that executes GPU feature discovery and labels the node accordingly.

Changes:

  • Added gfd.go with functions to create and manage GPU feature discovery jobs
  • Integrated GFD job execution into the device plugin initialization flow
  • Updated Go module dependencies to support Kubernetes client libraries

Reviewed changes

Copilot reviewed 3 out of 2447 changed files in this pull request and generated 12 comments.

File Description
pkg/device_plugin/gfd.go New file implementing GPU feature discovery job creation, runtime waiting logic, and job completion monitoring
pkg/device_plugin/device_plugin.go Launches GFD job as a goroutine during device plugin initialization
go.mod Added k8s.io client dependencies and updated Go version to support the new functionality
Comments suppressed due to low confidence (1)

pkg/device_plugin/gfd.go:1

  • Hard-coded string comparison for condition status should use the typed constant corev1.ConditionTrue instead of the string literal 'True' to ensure type safety.
/*

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// 1. Get the Node Name from the environment (passed via Downward API)
nodeName := os.Getenv("NODE_NAME")
if nodeName == "" {
log.Printf("NODE_NAME environment variable is required")
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These error messages lack context about what operation failed. Consider using log.Printf with descriptive messages like 'Failed to create in-cluster config: %v' or 'Failed to create Kubernetes clientset: %v' to help with debugging.

Copilot uses AI. Check for mistakes.
Comment on lines +51 to +74
log.Printf("NODE_NAME environment variable is required")
return
}
namespace := os.Getenv("POD_NAMESPACE")
if namespace == "" {
log.Printf("POD_NAMESPACE environment variable is required")
return
}

// 2. Authenticate within the cluster
config, err := rest.InClusterConfig()
if err != nil {
log.Printf(err.Error())
return
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Printf(err.Error())
return
}

err = WaitForKataRuntime(clientset, nodeName)
if err != nil {
log.Printf(err.Error())
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These error messages lack context about what operation failed. Consider using log.Printf with descriptive messages like 'Failed to create in-cluster config: %v' or 'Failed to create Kubernetes clientset: %v' to help with debugging.

Suggested change
log.Printf("NODE_NAME environment variable is required")
return
}
namespace := os.Getenv("POD_NAMESPACE")
if namespace == "" {
log.Printf("POD_NAMESPACE environment variable is required")
return
}
// 2. Authenticate within the cluster
config, err := rest.InClusterConfig()
if err != nil {
log.Printf(err.Error())
return
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Printf(err.Error())
return
}
err = WaitForKataRuntime(clientset, nodeName)
if err != nil {
log.Printf(err.Error())
log.Printf("NODE_NAME environment variable is required; failed to determine target node for GFD job")
return
}
namespace := os.Getenv("POD_NAMESPACE")
if namespace == "" {
log.Printf("POD_NAMESPACE environment variable is required; failed to determine namespace for GFD job")
return
}
// 2. Authenticate within the cluster
config, err := rest.InClusterConfig()
if err != nil {
log.Printf("Failed to create in-cluster config: %v", err)
return
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Printf("Failed to create Kubernetes clientset: %v", err)
return
}
err = WaitForKataRuntime(clientset, nodeName)
if err != nil {
log.Printf("Failed while waiting for kata runtime on node %s: %v", nodeName, err)

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +74
log.Printf(err.Error())
return
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Printf(err.Error())
return
}

err = WaitForKataRuntime(clientset, nodeName)
if err != nil {
log.Printf(err.Error())
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These error messages lack context about what operation failed. Consider using log.Printf with descriptive messages like 'Failed to create in-cluster config: %v' or 'Failed to create Kubernetes clientset: %v' to help with debugging.

Suggested change
log.Printf(err.Error())
return
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Printf(err.Error())
return
}
err = WaitForKataRuntime(clientset, nodeName)
if err != nil {
log.Printf(err.Error())
log.Printf("Failed to create in-cluster config: %v", err)
return
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Printf("Failed to create Kubernetes clientset: %v", err)
return
}
err = WaitForKataRuntime(clientset, nodeName)
if err != nil {
log.Printf("Failed to wait for Kata runtime on node %s: %v", nodeName, err)

Copilot uses AI. Check for mistakes.
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Printf(err.Error())
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These error messages lack context about what operation failed. Consider using log.Printf with descriptive messages like 'Failed to create in-cluster config: %v' or 'Failed to create Kubernetes clientset: %v' to help with debugging.

Copilot uses AI. Check for mistakes.

err = WaitForKataRuntime(clientset, nodeName)
if err != nil {
log.Printf(err.Error())
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These error messages lack context about what operation failed. Consider using log.Printf with descriptive messages like 'Failed to create in-cluster config: %v' or 'Failed to create Kubernetes clientset: %v' to help with debugging.

Copilot uses AI. Check for mistakes.
Comment on lines +271 to +275
if condition.Type == "Complete" && condition.Status == "True" {
log.Println("GFD job completed successfully!")
return true, nil
}
if condition.Type == "Failed" && condition.Status == "True" {
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coded string comparisons for Kubernetes condition types should use the typed constants from the batch/v1 API. Use batchv1.JobComplete and batchv1.JobFailed instead of string literals to ensure type safety and prevent typos.

Suggested change
if condition.Type == "Complete" && condition.Status == "True" {
log.Println("GFD job completed successfully!")
return true, nil
}
if condition.Type == "Failed" && condition.Status == "True" {
if condition.Type == batchv1.JobComplete && condition.Status == corev1.ConditionTrue {
log.Println("GFD job completed successfully!")
return true, nil
}
if condition.Type == batchv1.JobFailed && condition.Status == corev1.ConditionTrue {

Copilot uses AI. Check for mistakes.
Comment on lines +271 to +275
if condition.Type == "Complete" && condition.Status == "True" {
log.Println("GFD job completed successfully!")
return true, nil
}
if condition.Type == "Failed" && condition.Status == "True" {
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coded string comparison for condition status should use the typed constant corev1.ConditionTrue instead of the string literal 'True' to ensure type safety.

Suggested change
if condition.Type == "Complete" && condition.Status == "True" {
log.Println("GFD job completed successfully!")
return true, nil
}
if condition.Type == "Failed" && condition.Status == "True" {
if condition.Type == "Complete" && condition.Status == corev1.ConditionTrue {
log.Println("GFD job completed successfully!")
return true, nil
}
if condition.Type == "Failed" && condition.Status == corev1.ConditionTrue {

Copilot uses AI. Check for mistakes.
InitContainers: []corev1.Container{
{
Name: "gpu-feature-discovery",
Image: "nvcr.io/nvidia/k8s-device-plugin:v0.17.0",
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container image version is hardcoded. Consider making this configurable through an environment variable or configuration parameter to facilitate version updates without code changes.

Copilot uses AI. Check for mistakes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 this should not be hardcoded. My recommendation is as follows:

  1. Include the gpu-feature-discovery binary in the sandbox-device-plugin Dockerfile.
  2. This init container should be using the sandbox-device-plugin image (with the same version that is currently deployed!)

Comment on lines 136 to 137
Name: "post-process",
Image: "nvcr.io/ea-cnt/nv_only/kata-deploy:47001e265",
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container image version is hardcoded. Consider making this configurable through an environment variable or configuration parameter to facilitate version updates without code changes.

Suggested change
Name: "post-process",
Image: "nvcr.io/ea-cnt/nv_only/kata-deploy:47001e265",
Name: "post-process",
Image: func() string {
if v := os.Getenv("KATA_DEPLOY_IMAGE"); v != "" {
return v
}
return "nvcr.io/ea-cnt/nv_only/kata-deploy:47001e265"
}(),

Copilot uses AI. Check for mistakes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we really planning to use the kata-deploy image for this? This does not feel right...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, replacing this with alpine/kubectl

Command: []string{
"/bin/bash",
"-c",
"GPU_COUNT=$(kubectl get nodes \"$NODE_NAME\" -o=jsonpath='{.status.allocatable.nvidia\\.com/pgpu}')\n" +
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline bash script uses kubectl commands which could fail. Consider adding error handling in the script such as 'set -e' at the beginning or checking exit codes to ensure failures are properly reported.

Suggested change
"GPU_COUNT=$(kubectl get nodes \"$NODE_NAME\" -o=jsonpath='{.status.allocatable.nvidia\\.com/pgpu}')\n" +
"set -euo pipefail\n" +
"GPU_COUNT=$(kubectl get nodes \"$NODE_NAME\" -o=jsonpath='{.status.allocatable.nvidia\\.com/pgpu}')\n" +

Copilot uses AI. Check for mistakes.
@zvonkok
Copy link
Collaborator

zvonkok commented Jan 28, 2026

Also, please provide a description, not everyone reviewing this knows why we're doing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants