
Conversation

@JunAr7112
Contributor

No description provided.

@copy-pr-bot

copy-pr-bot bot commented Jan 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

JunAr7112 force-pushed the mig-parted-fix branch 3 times, most recently from a0c0ac6 to e025799 on January 27, 2026 at 00:19
if err != nil {
return "", fmt.Errorf("error getting GPU pci bus IDs: %v", err)
}

Contributor


Unnecessary newline.

Contributor Author


Removed


cmd := exec.Command("nvidia-smi", "-r", "-i", strings.Join(pciBusIDs, ",")) //nolint:gosec
// Unload nvidia_drm module and its dependencies before reset to release GPU references
modprobeCmd := exec.Command("sudo", "modprobe", "-r", "nvidia_drm")
Contributor

@rajathagasthya Jan 27, 2026


Does this assume the user has sudo access?

Contributor Author


Yes. I will remove this. I don't think we need to handle root permissions here.
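(For reference, if a privilege check were ever wanted here, it could be done directly in Go rather than by invoking sudo. A minimal sketch, with a hypothetical requireRoot helper that is not part of this PR:)

package main

import (
	"fmt"
	"os"
)

// requireRoot is a hypothetical helper: it checks that the process is
// running with root privileges before attempting to unload kernel
// modules, instead of shelling out to sudo.
func requireRoot() error {
	if os.Geteuid() != 0 { // effective UID 0 means root
		return fmt.Errorf("unloading nvidia_drm requires root privileges")
	}
	return nil
}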

@rajathagasthya
Contributor

@JunAr7112 Can you fix the lint failure and add testing steps?

Signed-off-by: Arjun <agadiyar@nvidia.com>
@JunAr7112
Contributor Author

/ok-to-test 97806a1

}

// Unload nvidia_drm module and its dependencies before reset to release GPU references
modprobeCmd := exec.Command("modprobe", "-r", "nvidia_drm") //nolint:gosec
Contributor

@cdesiniotis Jan 27, 2026


I don't believe this is the right place (or the only place) for this. Note that nvmlResetAllGPUs() is only called in ResetAllGPUs(), which in turn is only called in ApplyMigMode(). Based on the bug description, we need to unload this module across all MIG reconfigurations, not just a mode change.
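(A rough sketch of hoisting the unload into a shared step that every reconfiguration path could call first; the helper name and call sites are hypothetical, not taken from the mig-parted codebase:)

package main

import (
	"fmt"
	"os/exec"
)

// unloadNvidiaDrm is a hypothetical shared helper that could be invoked
// at the start of every MIG reconfiguration path, not only from
// ApplyMigMode(). modprobe -r removes nvidia_drm along with any of its
// now-unused dependencies, releasing GPU references before a reset.
func unloadNvidiaDrm() error {
	out, err := exec.Command("modprobe", "-r", "nvidia_drm").CombinedOutput()
	if err != nil {
		return fmt.Errorf("error unloading nvidia_drm: %v (output: %s)", err, out)
	}
	return nil
}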

Contributor


The current implementation addresses the case where a GPU reset is performed (e.g. when changing the MIG mode). But what about when the current MIG mode is correct and we are attempting to apply a new MIG configuration? Does mig-parted apply still work when the nvidia_drm module is loaded, or are we required to unload it first?
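(One way to probe this during testing is to check whether the module is currently loaded by scanning /proc/modules, where each line begins with a module name followed by a space. A sketch, not part of this PR:)

package main

import (
	"bufio"
	"os"
	"strings"
)

// isModuleLoaded reports whether the named kernel module appears in
// /proc/modules, e.g. isModuleLoaded("nvidia_drm").
func isModuleLoaded(name string) (bool, error) {
	f, err := os.Open("/proc/modules")
	if err != nil {
		return false, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), name+" ") {
			return true, nil
		}
	}
	return false, scanner.Err()
}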

@cdesiniotis
Contributor

@JunAr7112 can you add testing details to the PR description? Demonstrate that you have reproduced the original issue and that it is fixed with the changes in this PR.
