
multinode run fails with [CUDA_ERROR_NOT_PERMITTED] error in cuMemCreate() call #46

@robertwenquan

Description

Below is the error log of the failed run. It is 100% reproducible with an MPIJob run in Kubernetes on RTX 4090 GPUs.

Is it possible to run nvbandwidth in multi-node mode on nodes without NVLink/IB interconnects?

$ kubectl logs gputest-benchamrk-launcher-xpf9x
Warning: Permanently added 'gputest-benchamrk-worker-0.gputest-benchamrk.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'gputest-benchamrk-worker-1.gputest-benchamrk.default.svc' (ED25519) to the list of known hosts.
[gputest-benchamrk-worker-1:00050] MCW rank 1 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[gputest-benchamrk-worker-0:00050] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
nvbandwidth Version: v0.8
Built from Git version: 

MPI version: Open MPI v4.1.7rc1, package: Open MPI root@buildkitsandbox Distribution, ident: 4.1.7rc1, repo rev: v4.1.5-176-g6d9519e4c3, Unreleased developer copy
CUDA Runtime Version: 12090
CUDA Driver Version: 12090
Driver Version: 575.57.08

Process 0 (gputest-benchamrk-worker-0): device 0: NVIDIA GeForce RTX 4090 (00000000:81:00)
Process 1 (gputest-benchamrk-worker-1): device 0: NVIDIA GeForce RTX 4090 (00000000:24:00)

Running multinode_device_to_device_memcpy_write_ce.
[CUDA_ERROR_NOT_PERMITTED] operation not permitted in expression cuMemCreate(&handle, roundedUpAllocationSize, &prop, 0 ) on gputest-benchamrk-worker-0, rank = 0 in MultinodeMemoryAllocationUnicast::MultinodeMemoryAllocationUnicast(size_t, int)() : /tmp/multinode_memcpy.cpp:57

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
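
For reference, the failing cuMemCreate() call is part of the CUDA virtual memory management driver API. As a quick sanity check independent of MPI and Kubernetes, a minimal probe along the lines of the sketch below can show whether the driver on these RTX 4090 nodes permits this kind of allocation at all. It assumes the multinode test requests a fabric-shareable allocation (CU_MEM_HANDLE_TYPE_FABRIC) and a recent CUDA 12 toolkit; the file name and the handle-type assumption are mine, not taken from the nvbandwidth source.

// probe_fabric_alloc.cpp (hypothetical name), build with e.g.:
//   nvcc probe_fabric_alloc.cpp -o probe_fabric_alloc -lcuda
#include <cuda.h>
#include <cstdio>

// Print the result of each driver call instead of aborting, so all probes run.
static void check(CUresult r, const char *what) {
    const char *name = nullptr;
    cuGetErrorName(r, &name);
    std::printf("%-45s -> %s\n", what, name ? name : "UNKNOWN_ERROR");
}

int main() {
    check(cuInit(0), "cuInit");

    CUdevice dev;
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");

    // Does the device/driver report fabric handle support at all?
    int fabric = 0;
    check(cuDeviceGetAttribute(&fabric,
              CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev),
          "query HANDLE_TYPE_FABRIC_SUPPORTED");
    std::printf("fabric handle supported = %d\n", fabric);

    CUcontext ctx;
    check(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");

    // Same style of allocation the failing call appears to make (assumption):
    // pinned device memory shareable via a fabric handle.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    size_t gran = 0;
    check(cuMemGetAllocationGranularity(&gran, &prop,
              CU_MEM_ALLOC_GRANULARITY_MINIMUM),
          "cuMemGetAllocationGranularity");

    if (gran > 0) {
        CUmemGenericAllocationHandle handle;
        // CUDA_ERROR_NOT_PERMITTED here would mirror the nvbandwidth failure.
        check(cuMemCreate(&handle, gran, &prop, 0), "cuMemCreate (fabric handle)");
    }
    return 0;
}

If this probe already reports CUDA_ERROR_NOT_PERMITTED (or CUDA_ERROR_NOT_SUPPORTED) on a single node, the multinode tests would be expected to fail the same way regardless of the NVLink/IB question, which would point at GPU/driver fabric support rather than at the MPI or k8s setup.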

The run was launched with the MPIJob spec below:

---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: gputest-benchamrk
  namespace: default
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: <self-built-nvbandwidth-image>
              name: gputest-benchamrk
              env:
                - name: OMPI_ALLOW_RUN_AS_ROOT
                  value: "1"
                - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                  value: "1"
              command:
                - mpirun
                - -np
                - "2"
                - -bind-to
                - none
                - -map-by
                - slot
                - -x
                - NCCL_DEBUG=INFO
                - -x
                - NCCL_SOCKET_IFNAME=eth0
                - -x
                - NCCL_P2P_DISABLE=1
                - -x
                - UCX_NET_DEVICES=eth0
                - -x
                - UCX_TLS=tcp
                - -x
                - OMPI_MCA_coll_hcoll_enable=0
                - -x
                - OMPI_MCA_coll_ucc_enable=0
                - -x
                - OMPI_MCA_btl=self,tcp
                - nvbandwidth
                - -p
                - multinode
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - image: <self-built-nvbandwidth-image>
              name: gputest-benchamrk
              env:
                - name: NVIDIA_VISIBLE_DEVICES
                  value: all
                - name: NVIDIA_DRIVER_CAPABILITIES
                  value: compute,utility
              securityContext:
                privileged: true
                runAsUser: 0
                runAsGroup: 0
                capabilities:
                  add: ["SYS_ADMIN", "NET_ADMIN"]
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1
