-
Notifications
You must be signed in to change notification settings - Fork 56
Unload nvidia_drm module for changing Mig mode #301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
a0c0ac6 to
e025799
Compare
cmd/nvidia-mig-parted/util/device.go
Outdated
| if err != nil { | ||
| return "", fmt.Errorf("error getting GPU pci bus IDs: %v", err) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary newline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
cmd/nvidia-mig-parted/util/device.go
Outdated
|
|
||
| cmd := exec.Command("nvidia-smi", "-r", "-i", strings.Join(pciBusIDs, ",")) //nolint:gosec | ||
| // Unload nvidia_drm module and its dependencies before reset to release GPU references | ||
| modprobeCmd := exec.Command("sudo", "modprobe", "-r", "nvidia_drm") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this assume the user has sudo access?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I will remove this. I don't think we need to handle root permissions here
e025799 to
1fc6e07
Compare
|
@JunAr7112 Can you fix the lint failure and add testing steps? |
1fc6e07 to
12b4993
Compare
Signed-off-by: Arjun <agadiyar@nvidia.com>
12b4993 to
97806a1
Compare
|
/ok-to-test 97806a1 |
| } | ||
|
|
||
| // Unload nvidia_drm module and its dependencies before reset to release GPU references | ||
| modprobeCmd := exec.Command("modprobe", "-r", "nvidia_drm") //nolint:gosec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe this is the right place (or only place) for this. Note, nvmlResetAllGPUs() is only called inResetAllGPUs() which in turn is only called in ApplyMigMode(). Based on the bug description, we need to unload this module across all MIG reconfigurations, not just a mode change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation addresses the case where a GPU reset is performed (e.g. when changing the MIG mode). But what about when the current MIG mode is correct, but we are attempting to apply a new MIG configuration? Does the mig-parted apply still work when the nvidia_drm module is loaded, or are we required to unload it first?
|
@JunAr7112 can you add testing details to the PR description? Demonstrate that you have reproduced the original issue and that it is fixed with the changes in this PR. |
No description provided.