[Draft] feat: run the gpu feature discovery pod as a job on the same node #30
base: main
Conversation
Signed-off-by: Rajat Chopra <rajatc@nvidia.com>
    Command: []string{
        "/bin/bash",
        "-c",
        "GPU_COUNT=$(kubectl get nodes \"$NODE_NAME\" -o=jsonpath='{.status.allocatable.nvidia\\.com/pgpu}')\n" +
pgpu should be derived from P_GPU_ALIAS if present; otherwise use the name that the DP provides.
    @@ -0,0 +1,285 @@
    /*
     * Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Drop the year; hardcoding it in license headers is bad practice.
Pull request overview
This PR introduces automated GPU feature discovery by running a Kubernetes job on each node with GPU capabilities. The implementation creates a job that executes GPU feature discovery and labels the node accordingly.
Changes:
- Added `gfd.go` with functions to create and manage GPU feature discovery jobs
- Integrated GFD job execution into the device plugin initialization flow
- Updated Go module dependencies to support Kubernetes client libraries
Reviewed changes
Copilot reviewed 3 out of 2447 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| pkg/device_plugin/gfd.go | New file implementing GPU feature discovery job creation, runtime waiting logic, and job completion monitoring |
| pkg/device_plugin/device_plugin.go | Launches GFD job as a goroutine during device plugin initialization |
| go.mod | Added k8s.io client dependencies and updated Go version to support the new functionality |
Comments suppressed due to low confidence (1)
pkg/device_plugin/gfd.go:1
- Hard-coded string comparison for condition status should use the typed constant corev1.ConditionTrue instead of the string literal 'True' to ensure type safety.
    // 1. Get the Node Name from the environment (passed via Downward API)
    nodeName := os.Getenv("NODE_NAME")
    if nodeName == "" {
        log.Printf("NODE_NAME environment variable is required")
Copilot AI, Jan 28, 2026
These error messages lack context about what operation failed. Consider using log.Printf with descriptive messages like 'Failed to create in-cluster config: %v' or 'Failed to create Kubernetes clientset: %v' to help with debugging.
        return
    }
    namespace := os.Getenv("POD_NAMESPACE")
    if namespace == "" {
        log.Printf("POD_NAMESPACE environment variable is required")
        return
    }

    // 2. Authenticate within the cluster
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Printf(err.Error())
        return
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Printf(err.Error())
        return
    }

    err = WaitForKataRuntime(clientset, nodeName)
    if err != nil {
        log.Printf(err.Error())
Suggested change:

        log.Printf("NODE_NAME environment variable is required; failed to determine target node for GFD job")
        return
    }
    namespace := os.Getenv("POD_NAMESPACE")
    if namespace == "" {
        log.Printf("POD_NAMESPACE environment variable is required; failed to determine namespace for GFD job")
        return
    }
    // 2. Authenticate within the cluster
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Printf("Failed to create in-cluster config: %v", err)
        return
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Printf("Failed to create Kubernetes clientset: %v", err)
        return
    }
    err = WaitForKataRuntime(clientset, nodeName)
    if err != nil {
        log.Printf("Failed while waiting for kata runtime on node %s: %v", nodeName, err)
    if condition.Type == "Complete" && condition.Status == "True" {
        log.Println("GFD job completed successfully!")
        return true, nil
    }
    if condition.Type == "Failed" && condition.Status == "True" {
Copilot AI, Jan 28, 2026
Hard-coded string comparisons for Kubernetes condition types should use the typed constants from the batch/v1 API. Use batchv1.JobComplete and batchv1.JobFailed instead of string literals to ensure type safety and prevent typos.
Suggested change:

    if condition.Type == batchv1.JobComplete && condition.Status == corev1.ConditionTrue {
        log.Println("GFD job completed successfully!")
        return true, nil
    }
    if condition.Type == batchv1.JobFailed && condition.Status == corev1.ConditionTrue {
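The type-safety argument can be seen in miniature with local stand-ins for the API types (the real constants live in `k8s.io/api/batch/v1` and `k8s.io/api/core/v1`; the types below only mirror their shape):

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes API types; in real code these
// come from batchv1 and corev1, so a typo like "Complet" fails to
// compile instead of silently never matching.
type JobConditionType string
type ConditionStatus string

const (
	JobComplete   JobConditionType = "Complete"
	JobFailed     JobConditionType = "Failed"
	ConditionTrue ConditionStatus  = "True"
)

type JobCondition struct {
	Type   JobConditionType
	Status ConditionStatus
}

// jobFinished reports whether the job is done and whether it succeeded.
func jobFinished(conds []JobCondition) (done, succeeded bool) {
	for _, c := range conds {
		if c.Type == JobComplete && c.Status == ConditionTrue {
			return true, true
		}
		if c.Type == JobFailed && c.Status == ConditionTrue {
			return true, false
		}
	}
	return false, false
}

func main() {
	done, ok := jobFinished([]JobCondition{{Type: JobComplete, Status: ConditionTrue}})
	fmt.Println(done, ok)
}
```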
    InitContainers: []corev1.Container{
        {
            Name:  "gpu-feature-discovery",
            Image: "nvcr.io/nvidia/k8s-device-plugin:v0.17.0",
Copilot AI, Jan 28, 2026
The container image version is hardcoded. Consider making this configurable through an environment variable or configuration parameter to facilitate version updates without code changes.
+1 this should not be hardcoded. My recommendation is as follows:
- Include the gpu-feature-discovery binary in the sandbox-device-plugin Dockerfile.
- This init container should be using the sandbox-device-plugin image (with the same version that is currently deployed!)
pkg/device_plugin/gfd.go (Outdated)
    Name:  "post-process",
    Image: "nvcr.io/ea-cnt/nv_only/kata-deploy:47001e265",
Copilot AI, Jan 28, 2026
The container image version is hardcoded. Consider making this configurable through an environment variable or configuration parameter to facilitate version updates without code changes.
Suggested change:

    Name: "post-process",
    Image: func() string {
        if v := os.Getenv("KATA_DEPLOY_IMAGE"); v != "" {
            return v
        }
        return "nvcr.io/ea-cnt/nv_only/kata-deploy:47001e265"
    }(),
Are we really planning to use the kata-deploy image for this? This does not feel right...
Yes, replacing this with alpine/kubectl
    Command: []string{
        "/bin/bash",
        "-c",
        "GPU_COUNT=$(kubectl get nodes \"$NODE_NAME\" -o=jsonpath='{.status.allocatable.nvidia\\.com/pgpu}')\n" +
Copilot AI, Jan 28, 2026
Inline bash script uses kubectl commands which could fail. Consider adding error handling in the script such as 'set -e' at the beginning or checking exit codes to ensure failures are properly reported.
Suggested change:

    "set -euo pipefail\n" +
    "GPU_COUNT=$(kubectl get nodes \"$NODE_NAME\" -o=jsonpath='{.status.allocatable.nvidia\\.com/pgpu}')\n" +
Also, please provide a description; not everyone reviewing this knows why we're doing this.
No description provided.