@ggoklani
Collaborator

@ggoklani ggoklani commented Feb 10, 2026

Enhancement: Revert DOCA-OFED installation; product uses RHEL inbox IB stack (gcc-inbox). DOCA-OFED pulls proprietary kernel modules and is not supported by RHEL engineering.

Reason: We have explicitly not used DOCA-OFED for the IB stack that we are supporting - we use the RHEL OS IB stack ("gcc-inbox" in IB stack terms). The two stacks are not directly compatible, and the DOCA-OFED stack is not supported by RHEL engineering because it requires proprietary, non-GPL kernel modules.

As a result, the initial product we shipped uses the RHEL OS IB stack and not the DOCA stack. The RHEL OS IB stack is supposed to support everything the DOCA stack supports, so AFAICT there is no reason to be installing the DOCA-OFED stack here. OFED should not be necessary to ensure persistent naming of the IB devices, given that the IB devices are currently running RHEL OS provided drivers.

Result:
[azureuser@gaurav-hpcrdmatest1 hpc]$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              00155dfffe340069
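
A check like the one above could be scripted for repeated test runs. The following is only a sketch: the helper names `list_rdma_devices` and `has_rdma_device` are hypothetical, and it assumes the two-line header (column titles, then dashes) shown in the output above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the role.

# Print the device names found in `ibv_devices`-style output read from stdin.
# Skips the two header lines and keeps the first column of every other row.
list_rdma_devices() {
  awk 'NR > 2 && NF >= 2 { print $1 }'
}

# Succeed only if at least one RDMA device is listed on stdin.
has_rdma_device() {
  [ -n "$(list_rdma_devices)" ]
}

# Intended use on a test VM (not run here):
#   ibv_devices | has_rdma_device || echo "no RDMA device found" >&2
```

This only confirms that a device is visible to the verbs layer; it does not by itself show which stack supplied the driver.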

Issue Tracker Tickets (Jira or BZ if any):

Summary by Sourcery

Remove DOCA-related RDMA configuration and package installation in favor of using the RHEL-provided InfiniBand stack.

Enhancements:

  • Eliminate installation of DOCA host repository RPM and minimal DOCA RDMA packages from the main Ansible role.
  • Remove DOCA-specific package lists and URLs from shared and RHEL 9 variable files to align with the standard RHEL IB stack.

@sourcery-ai

sourcery-ai bot commented Feb 10, 2026

Reviewer's Guide

This PR removes all DOCA/OFED-related configuration from the HPC role so RDMA support relies solely on the RHEL inbox InfiniBand stack while preserving the Azure persistent RDMA naming feature.

Flow diagram for RDMA configuration without DOCA/OFED

flowchart TD
  A[Start RDMA role tasks] --> B[Check if Azure persistent RDMA naming is enabled]
  B -->|Enabled| C[Configure Azure persistent RDMA naming]
  B -->|Disabled| D[Skip Azure RDMA naming configuration]

  C --> C1[Deploy rdma_rename helper path variable]
  C1 --> C2[Install and configure systemd units for RDMA rename]
  C2 --> C3[Install udev rules for RDMA device naming]
  C3 --> C4[Trigger handlers to reload udev and RDMA rules]

  D --> E[Proceed with remaining HPC RDMA-related tasks]
  C4 --> E

  E --> F[Use RHEL inbox InfiniBand stack for RDMA devices]
  F --> G[End RDMA role tasks]

File-Level Changes

Change: Remove DOCA host repository setup and DOCA RDMA package installation from the main HPC role tasks so the playbook no longer installs or manages the DOCA-OFED stack.
Details:
  • Delete the task block that conditionally installs the DOCA host repo RPM, including temp directory creation, RPM download, GPG key import, package installation with dnf, and cleanup.
  • Delete the task that installs the minimal DOCA RDMA packages and the associated retry-until-success logic.
Files: tasks/main.yml

Change: Drop DOCA-related package variables and host-RPM URLs so the role no longer exposes configuration for DOCA.
Details:
  • Remove the __hpc_doca_packages list that defined the DOCA RDMA package set from the main vars file.
  • Remove the RHEL 9 specific DOCA host RPM URL and GPG key URL variables from the RedHat_9-specific vars file.
Files: vars/main.yml, vars/RedHat_9.yml

@ggoklani ggoklani force-pushed the fix_rdma_remove_doca branch from 6d0afa1 to 60d3610 Compare February 10, 2026 05:12
@ggoklani ggoklani changed the title from "Fix rdma remove doca" to "fix: Fix rdma remove doca" Feb 10, 2026
@ggoklani ggoklani marked this pull request as ready for review February 10, 2026 05:34
@sourcery-ai sourcery-ai bot left a comment

Hey - I've left some high level feedback:

  • Since the DOCA installation block and variables have been removed, double-check and remove any remaining references to __hpc_doca_* variables or DOCA-specific logic elsewhere in the role to avoid dead configuration paths.
  • If the Clean dnf metadata handler was only used by the DOCA host RPM installation, consider removing that handler (and any now-unused related tasks) to keep the handlers list minimal and accurate.
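
One way to act on the first point is a recursive search of the role directory for the removed variable prefix. This is a minimal sketch: the function name `leftover_doca_refs` is hypothetical, and it assumes a `grep` that supports `-r` (GNU and BSD grep do).

```shell
#!/bin/sh
# Hypothetical helper, not part of the role.

# leftover_doca_refs ROLE_DIR
# Prints file:line matches for the removed `__hpc_doca_` prefix.
# Returns 0 when nothing is found, 1 when references remain.
leftover_doca_refs() {
  role_dir=${1:-.}
  # grep exits 0 on a match, so a match means the cleanup is incomplete.
  if grep -rn -e '__hpc_doca_' "$role_dir"; then
    return 1
  fi
  return 0
}
```

Running `leftover_doca_refs .` from the role root would list any remaining references. The handler question in the second point still needs a manual look, since a handler name can be referenced through `notify` without matching this prefix.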

@dgchinner
Contributor

There is a spurious whitespace-change commit in this PR that shouldn't be there.

Also, I can't tell if the entirety of the original commit was reverted because the revert is a manual change. When reverting a commit, the change should be created using the command git revert <commit-range> to capture the entire change in the revert automatically. It is also good practice to put the reason for the revert in the commit message, along with your sign-off.
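
As a rough sketch of that workflow, the function below reverts a commit or range as a single signed-off commit whose message carries the reason. The name `revert_with_reason` is hypothetical; the flags used (`git revert --no-commit`, `git commit -s -m`) are standard git options.

```shell
#!/bin/sh
# Hypothetical helper illustrating the suggested workflow.

# revert_with_reason RANGE REASON
# Reverts every commit in RANGE as one commit, with REASON in the message body
# and a Signed-off-by trailer taken from the committer identity.
revert_with_reason() {
  range=$1
  reason=$2
  # --no-commit stages the inverse of each commit without committing yet.
  git revert --no-commit "$range" || return 1
  # First -m is the subject line, second -m becomes the body; -s adds the sign-off.
  git commit -s -m "Revert $range" -m "$reason"
}
```

For example, `revert_with_reason <commit-range> "DOCA-OFED is not supported by RHEL engineering"`. Because git generates the inverse diff itself, a reviewer does not have to compare a hand-written removal against the original change.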

@ggoklani ggoklani force-pushed the fix_rdma_remove_doca branch 3 times, most recently from 0bde090 to bdbe795 Compare February 10, 2026 07:44
…ck (gcc-inbox).

DOCA-OFED pulls proprietary kernel modules and is not supported by RHEL engineering.

Signed-off-by: Gaurav Goklani <ggoklani@ggoklani-thinkpadt14gen4.punetw6.csb>
@ggoklani ggoklani force-pushed the fix_rdma_remove_doca branch from bdbe795 to 0bb222c Compare February 10, 2026 07:45
@ggoklani
Collaborator Author

@dgchinner Changes made; testing works well. Please review.
