[help]: nvidia-fabricmanager run error #29

@btdan

Hello, I am learning how to deploy cube-studio on Ubuntu 24.04 with two NVIDIA 4090 cards.
I installed the GPU device driver and fabricmanager by following the article install/kubernetes/ranche/install_gpu.md.
Both installed successfully.
But when I run the command service nvidia-fabricmanager start,
I get the following output:
----------------- content begin ----------------
Sep 18 03:19:01 ubuntu2404 nvidia-fabricmanager-start.sh[4753]: Detected Pre-NVL5 system
Sep 18 03:19:01 ubuntu2404 nvidia-fabricmanager-start.sh[4756]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
Sep 18 03:19:01 ubuntu2404 nvidia-fabricmanager-start.sh[4753]: "/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg" failed! Exit code: 1
Sep 18 03:19:01 ubuntu2404 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
--------------- content end ---------------------------

I searched many articles trying to solve this problem, but without success.
I found this statement in NVIDIA/gpu-operator#610:
"In the Fabric-Manager User Guide, NVSwitches are supported starting with DGX-2, and only V100, A100, and H100 GPUs support them."
Is there any way to solve this problem?
Or can cube-studio run successfully without nvidia-fabricmanager?
Thanks very much.

By the way, the two NVIDIA 4090 cards are connected with NVLink.
The output of the command nvidia-smi topo -m is as follows:

            GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
    GPU0     X      NV4     0-31            0               N/A
    GPU1    NV4      X      0-31            0               N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
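
For anyone who wants to check the matrix above from a script, here is a minimal Python sketch that reads the text printed by nvidia-smi topo -m and lists the GPU pairs joined by NVLink (cells of the form NV#). The function name nvlink_pairs and the sample string are only illustrative; they are not part of nvidia-smi or cube-studio, and the parser assumes the GPU columns come first in the header, as in the output above.

```python
import re


def nvlink_pairs(topo_text):
    """Return {(gpu_a, gpu_b): link_count} for GPU pairs joined by NVLink."""
    # Drop terminal colour/underline codes in case the text was captured from a TTY.
    clean = re.sub(r"\x1b\[[0-9;]*m", "", topo_text)
    lines = [ln for ln in clean.splitlines() if ln.strip()]
    if not lines:
        return {}
    # Header tokens that name GPUs (GPU0, GPU1, ...); other columns are ignored.
    gpu_cols = [tok for tok in lines[0].split() if re.fullmatch(r"GPU\d+", tok)]
    pairs = {}
    for ln in lines[1:]:
        cells = ln.split()
        if cells[0] not in gpu_cols:
            continue  # legend lines and non-GPU rows
        row_idx = gpu_cols.index(cells[0])
        for col_idx, (col, cell) in enumerate(zip(gpu_cols, cells[1:])):
            match = re.fullmatch(r"NV(\d+)", cell)
            # The matrix is symmetric, so only the upper triangle is recorded.
            if match and row_idx < col_idx:
                pairs[(cells[0], col)] = int(match.group(1))
    return pairs


# Sample copied from the matrix in this issue.
sample = """\
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     0-31            0               N/A
GPU1    NV4      X      0-31            0               N/A
"""
print(nvlink_pairs(sample))  # {('GPU0', 'GPU1'): 4}
```

On a real machine the text could be obtained with subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout. An empty result would mean the GPUs only see each other over PCIe.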
