Hello, I am learning how to deploy cube-studio on Ubuntu 24.04 with two NVIDIA 4090 cards.
I installed the GPU device driver and fabricmanager by following the article install/kubernetes/ranche/install_gpu.md, and both installed successfully.
However, when I run the command service nvidia-fabricmanager start, I get the following output:
----------------- content begin ----------------
Sep 18 03:19:01 ubuntu2404 nvidia-fabricmanager-start.sh[4753]: Detected Pre-NVL5 system
Sep 18 03:19:01 ubuntu2404 nvidia-fabricmanager-start.sh[4756]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
Sep 18 03:19:01 ubuntu2404 nvidia-fabricmanager-start.sh[4753]: "/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg" failed! Exit code: 1
Sep 18 03:19:01 ubuntu2404 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
--------------- content end ---------------------------
I have searched many articles for a solution to this problem, without success.
I found the issue NVIDIA/gpu-operator#610, which says:
"In the Fabric-Manager User Guide, NVSwitches are supported starting with DGX-2, and only V100, A100, and H100 GPUs support them."
Is there any way to solve this problem?
Or can cube-studio run successfully without nvidia-fabricmanager?
Thank you very much.
By the way, the two NVIDIA 4090 cards are connected with NVLink.
The output of the command nvidia-smi topo -m is as follows:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     0-31            0               N/A
GPU1    NV4      X      0-31            0               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
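For anyone who wants to check the same thing on their own machine, here is a minimal sketch that reads the matrix printed by nvidia-smi topo -m and lists which GPU pairs report an NV# link. The function name nvlink_pairs is hypothetical, and the parsing assumes the layout shown above (a header row of GPU names followed by one row per GPU); other driver versions may print extra columns, so treat it as a starting point only.

```python
import re

GPU_NAME = re.compile(r"GPU\d+")
NVLINK_CELL = re.compile(r"NV(\d+)")


def nvlink_pairs(topo_text):
    """Return {(gpu_a, gpu_b): link_count} for GPU pairs that report NV#.

    Hypothetical helper: assumes the text layout of `nvidia-smi topo -m`
    shown in this issue (header row first, then one row per GPU).
    """
    columns = None
    pairs = {}
    for line in topo_text.splitlines():
        fields = line.split()
        if not fields or not GPU_NAME.fullmatch(fields[0]):
            continue  # skips the legend, blank lines, and non-GPU rows
        if columns is None:
            # The first GPU line is the header; its leading GPU names
            # label the matrix columns.
            columns = []
            for field in fields:
                if not GPU_NAME.fullmatch(field):
                    break
                columns.append(field)
            continue
        row_gpu = fields[0]
        for col_gpu, cell in zip(columns, fields[1:]):
            match = NVLINK_CELL.fullmatch(cell)
            if match:
                # The matrix is symmetric, so sort the pair to store it once.
                key = tuple(sorted((row_gpu, col_gpu)))
                pairs[key] = int(match.group(1))
    return pairs


if __name__ == "__main__":
    import subprocess

    # Requires the NVIDIA driver; `nvidia-smi topo -m` is the command used above.
    output = subprocess.run(
        ["nvidia-smi", "topo", "-m"], capture_output=True, text=True, check=True
    ).stdout
    for (gpu_a, gpu_b), links in nvlink_pairs(output).items():
        print(f"{gpu_a} <-> {gpu_b}: {links} bonded NVLinks")
```

Fed the matrix from this issue, the helper reports a single pair, GPU0 and GPU1, joined by 4 bonded NVLinks.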