The multinode benchmarks run successfully on a single node, but fail with the following error when run across multiple nodes:
mpirun --map-by ppr:8:node --bind-to core -np 16 --report-bindings --hostfile hostfile /usr/local/bin/nvbandwidth -p multinode
...
Running multinode_device_to_device_memcpy_read_ce.
[CUDA_ERROR_INVALID_VALUE] invalid argument in expression cuMemSetAccess((CUdeviceptr) buffer, roundedUpAllocationSize, &desc, 1 ) on gpub200-nom6ae0518, rank = 8 in MultinodeMemoryAllocationUnicast::MultinodeMemoryAllocationUnicast(size_t, int)() : /home/code/external/nvbandwidth/multinode_memcpy.cpp:76
The binary was built with CUDA 12.8.