
multinode run fails with [CUDA_ERROR_NOT_PERMITTED] error in cuMemCreate() call #46

@robertwenquan

Description

Below is the error log of the failed run. It is 100% reproducible with an MPIJob run in Kubernetes on RTX 4090 GPUs.

Is it possible to run nvbandwidth in multi-node mode on nodes without NVLink/IB interconnects?

$ kubectl logs gputest-benchamrk-launcher-xpf9x
Warning: Permanently added 'gputest-benchamrk-worker-0.gputest-benchamrk.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'gputest-benchamrk-worker-1.gputest-benchamrk.default.svc' (ED25519) to the list of known hosts.
[gputest-benchamrk-worker-1:00050] MCW rank 1 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[gputest-benchamrk-worker-0:00050] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
nvbandwidth Version: v0.8
Built from Git version: 

MPI version: Open MPI v4.1.7rc1, package: Open MPI root@buildkitsandbox Distribution, ident: 4.1.7rc1, repo rev: v4.1.5-176-g6d9519e4c3, Unreleased developer copy
CUDA Runtime Version: 12090
CUDA Driver Version: 12090
Driver Version: 575.57.08

Process 0 (gputest-benchamrk-worker-0): device 0: NVIDIA GeForce RTX 4090 (00000000:81:00)
Process 1 (gputest-benchamrk-worker-1): device 0: NVIDIA GeForce RTX 4090 (00000000:24:00)

Running multinode_device_to_device_memcpy_write_ce.
[CUDA_ERROR_NOT_PERMITTED] operation not permitted in expression cuMemCreate(&handle, roundedUpAllocationSize, &prop, 0 ) on gputest-benchamrk-worker-0, rank = 0 in MultinodeMemoryAllocationUnicast::MultinodeMemoryAllocationUnicast(size_t, int)() : /tmp/multinode_memcpy.cpp:57

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
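
For reference, the failing cuMemCreate() call is part of the CUDA virtual memory management driver API. As a quick sanity check independent of MPI and Kubernetes, a minimal probe along the lines of the sketch below can show whether the driver on these RTX 4090 nodes permits this kind of allocation at all. It assumes the multinode test requests a fabric-shareable allocation (CU_MEM_HANDLE_TYPE_FABRIC) and a recent CUDA 12 toolkit; the file name and the handle-type assumption are mine, not taken from the nvbandwidth source.

// probe_fabric_alloc.cpp (hypothetical name), build with e.g.:
//   nvcc probe_fabric_alloc.cpp -o probe_fabric_alloc -lcuda
#include <cuda.h>
#include <cstdio>

// Print the result of each driver call instead of aborting, so all probes run.
static void check(CUresult r, const char *what) {
    const char *name = nullptr;
    cuGetErrorName(r, &name);
    std::printf("%-45s -> %s\n", what, name ? name : "UNKNOWN_ERROR");
}

int main() {
    check(cuInit(0), "cuInit");

    CUdevice dev;
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");

    // Does the device/driver report fabric handle support at all?
    int fabric = 0;
    check(cuDeviceGetAttribute(&fabric,
              CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev),
          "query HANDLE_TYPE_FABRIC_SUPPORTED");
    std::printf("fabric handle supported = %d\n", fabric);

    CUcontext ctx;
    check(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");

    // Same style of allocation the failing call appears to make (assumption):
    // pinned device memory shareable via a fabric handle.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    size_t gran = 0;
    check(cuMemGetAllocationGranularity(&gran, &prop,
              CU_MEM_ALLOC_GRANULARITY_MINIMUM),
          "cuMemGetAllocationGranularity");

    if (gran > 0) {
        CUmemGenericAllocationHandle handle;
        // CUDA_ERROR_NOT_PERMITTED here would mirror the nvbandwidth failure.
        check(cuMemCreate(&handle, gran, &prop, 0), "cuMemCreate (fabric handle)");
    }
    return 0;
}

If this probe already reports CUDA_ERROR_NOT_PERMITTED (or CUDA_ERROR_NOT_SUPPORTED) on a single node, the multinode tests would be expected to fail the same way regardless of the NVLink/IB question, which would point at GPU/driver fabric support rather than at the MPI or k8s setup.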

The run was launched with the MPIJob spec below:

---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: gputest-benchamrk
  namespace: default
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: <self-built-nvbandwidth-image>
              name: gputest-benchamrk
              env:
                - name: OMPI_ALLOW_RUN_AS_ROOT
                  value: "1"
                - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                  value: "1"
              command:
                - mpirun
                - -np
                - "2"
                - -bind-to
                - none
                - -map-by
                - slot
                - -x
                - NCCL_DEBUG=INFO
                - -x
                - NCCL_SOCKET_IFNAME=eth0
                - -x
                - NCCL_P2P_DISABLE=1
                - -x
                - UCX_NET_DEVICES=eth0
                - -x
                - UCX_TLS=tcp
                - -x
                - OMPI_MCA_coll_hcoll_enable=0
                - -x
                - OMPI_MCA_coll_ucc_enable=0
                - -x
                - OMPI_MCA_btl=self,tcp
                - nvbandwidth
                - -p
                - multinode
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - image: <self-built-nvbandwidth-image>
              name: gputest-benchamrk
              env:
                - name: NVIDIA_VISIBLE_DEVICES
                  value: all
                - name: NVIDIA_DRIVER_CAPABILITIES
                  value: compute,utility
              securityContext:
                privileged: true
                runAsUser: 0
                runAsGroup: 0
                capabilities:
                  add: ["SYS_ADMIN", "NET_ADMIN"]
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1
