NPKit with nccl-tests: Generated trace file is empty #37

@DevHSA


I am trying to profile an nccl-tests allReduce call: all_reduce_perf -b 16M -e 16M -n 10 -w 0 -g 1 -c 1
NCCL version: 2.17.1
NCCL-tests version: 2.13.3

TLDR:

  1. The generated trace file is empty for -g 1. I do not see any errors, and I have tried enabling all the NPKIT_FLAGS as well. The NCCL version in use is the one given in the NPKit specification. Is there a particular nccl-tests version that needs to be used? Or is this a bug in NPKit's NCCL profiling?

  2. Values other than -g 1 throw errors (e.g., -g 2). How can I observe the collective communication patterns between multiple GPUs/ranks if I cannot specify them in my tests? Why does this limitation exist?

Edit: Before the NPKit patch is applied, the tests run for all values of -g.

---------------------------Details----------------------------

First, I wanted to confirm that it works with -g 1. I have assigned the correct run parameters to the nccl_test() function in npkit_runner.sh.

npkit_runner.sh

function nccl_test() {
  $1 -b $3 -e $3 -n $7 -w $6 -g 1 -c 1 | tee $9/log.txt
}

The test runs successfully:

root@c11efb988616:/home/NPKit/nccl_samples# bash npkit_launcher.sh 
+ export NCCL_SRC_DIR=/home/ncclold
+ NCCL_SRC_DIR=/home/ncclold
+ export NPKIT_SRC_DIR=/home/NPKit
+ NPKIT_SRC_DIR=/home/NPKit
+ export NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ export NPKIT_RUN_DIR=/home/NPKit/results
+ NPKIT_RUN_DIR=/home/NPKit/results
+ export NCCL_MSG_SIZE=16M
+ NCCL_MSG_SIZE=16M
+ export NCCL_ALGO=Ring
+ NCCL_ALGO=Ring
+ export NCCL_PROTO=Simple
+ NCCL_PROTO=Simple
+ export NCCL_NUM_WARMUPS=0
+ NCCL_NUM_WARMUPS=0
+ export NCCL_NUM_ITERS=10
+ NCCL_NUM_ITERS=10
+ NPKIT_FLAGS_CPU_PREFIX=-DENABLE_NPKIT
+ NPKIT_FLAGS_GPU_PREFIX='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ bash npkit_runner.sh
++ basename /home/nccl-testsold/build/all_reduce_perf
+ npkit_run_tag=all_reduce_perf//Ring/Simple
+ npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ npkit_trace_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ npkit_result_dir=/home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ rm -rf /home/NPKit/results
+ mkdir -p /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ nccl_test /home/nccl-testsold/build/all_reduce_perf /home/ncclold 16M Ring Simple 0 10 /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ /home/nccl-testsold/build/all_reduce_perf -b 16M -e 16M -n 10 -w 0 -g 1 -c 1
+ tee /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple/log.txt
# nThread 1 nGpus 1 minBytes 16777216 maxBytes 16777216 step: 1048576(bytes) warmup iters: 0 iters: 10 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 1062079 on c11efb988616 device  0 [0x17] NVIDIA A100 80GB PCIe
NCCL version 2.17.1+cuda12.0
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    16777216       4194304     float     sum      -1    266.8   62.88    0.00      0     0.42  40031.53    0.00      0

c11efb988616:1062079:1062079 [0] init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

+ cd /home/NPKit/nccl_samples
+ python3 npkit_trace_generator.py --npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple --npkit_event_header_path=/home/ncclold/src/include/npkit/npkit_event.h --output_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ cd /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ tar cvzf npkit_result.tar.gz npkit_event_trace.json
npkit_event_trace.json
+ mv npkit_result.tar.gz /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple

However, npkit_event_trace.json is empty. Does this have something to do with the "init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty" warning shown above?

{"traceEvents": [], "displayTimeUnit": "ns"}
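One thing that may be worth checking: the warning text suggests the patched NCCL looks for a dump location named NPKIT_DUMP_DIR, and my nccl_test() never passes the dump directory ($8 in the trace above) to the test binary. The sketch below assumes NPKIT_DUMP_DIR is read as an environment variable, which I have not confirmed from the source. It is the same call as before with that variable set for the test process only.

```shell
# Sketch only. Assumption: the patched NCCL reads the dump location from
# the NPKIT_DUMP_DIR environment variable (inferred from the warning text).
# Positional arguments follow the order visible in the trace above:
#   $1 test binary, $3 message size, $6 warmups, $7 iterations,
#   $8 dump directory, $9 result directory.
nccl_test() {
  # Set the variable for this one command instead of exporting it globally.
  NPKIT_DUMP_DIR="$8" "$1" -b "$3" -e "$3" -n "$7" -w "$6" -g 1 -c 1 | tee "$9/log.txt"
}
```

If the warning disappears but the trace is still empty, the cause is probably somewhere else (for example, no events being recorded for a single-rank run).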

When I increase the number of GPUs to, say, -g 4:

npkit_runner.sh

function nccl_test() {
  $1 -b $3 -e $3 -n $7 -w $6 -g 4 -c 1 | tee $9/log.txt
}

The test fails:

root@c11efb988616:/home/NPKit/nccl_samples# bash npkit_launcher.sh 
+ export NCCL_SRC_DIR=/home/ncclold
+ NCCL_SRC_DIR=/home/ncclold
+ export NPKIT_SRC_DIR=/home/NPKit
+ NPKIT_SRC_DIR=/home/NPKit
+ export NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ export NPKIT_RUN_DIR=/home/NPKit/results
+ NPKIT_RUN_DIR=/home/NPKit/results
+ export NCCL_MSG_SIZE=16M
+ NCCL_MSG_SIZE=16M
+ export NCCL_ALGO=Ring
+ NCCL_ALGO=Ring
+ export NCCL_PROTO=Simple
+ NCCL_PROTO=Simple
+ export NCCL_NUM_WARMUPS=0
+ NCCL_NUM_WARMUPS=0
+ export NCCL_NUM_ITERS=10
+ NCCL_NUM_ITERS=10
+ NPKIT_FLAGS_CPU_PREFIX=-DENABLE_NPKIT
+ NPKIT_FLAGS_GPU_PREFIX='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_ENTRY -DENABLE_NPKIT_EVENT_RECV_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_ENTRY -DENABLE_NPKIT_EVENT_RECV_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_ENTRY -DENABLE_NPKIT_EVENT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_ENTRY -DENABLE_NPKIT_EVENT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT'
+ bash npkit_runner.sh
++ basename /home/nccl-testsold/build/all_reduce_perf
+ npkit_run_tag=all_reduce_perf//Ring/Simple
+ npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ npkit_trace_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ npkit_result_dir=/home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ rm -rf /home/NPKit/results
+ mkdir -p /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ nccl_test /home/nccl-testsold/build/all_reduce_perf /home/ncclold 16M Ring Simple 0 10 /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ /home/nccl-testsold/build/all_reduce_perf -b 16M -e 16M -n 10 -w 0 -g 4 -c 1
+ tee /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple/log.txt
# nThread 1 nGpus 4 minBytes 16777216 maxBytes 16777216 step: 1048576(bytes) warmup iters: 0 iters: 10 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 1346195 on c11efb988616 device  0 [0x17] NVIDIA A100 80GB PCIe
#  Rank  1 Group  0 Pid 1346195 on c11efb988616 device  1 [0x65] NVIDIA A100 80GB PCIe
#  Rank  2 Group  0 Pid 1346195 on c11efb988616 device  2 [0xca] NVIDIA A100 80GB PCIe
#  Rank  3 Group  0 Pid 1346195 on c11efb988616 device  3 [0xe3] NVIDIA A100 80GB PCIe
NCCL version 2.17.1+cuda12.0

c11efb988616:1346195:1346234 [2] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346234 [2] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    16777216       4194304     float     sum      -1   5738.2    2.92    4.39      0   9352.9    1.79    2.69      0

c11efb988616:1346195:1346195 [0] init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty

c11efb988616:1346195:1346195 [1] init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
+ cd /home/NPKit/nccl_samples
+ python3 npkit_trace_generator.py --npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple --npkit_event_header_path=/home/ncclold/src/include/npkit/npkit_event.h --output_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ cd /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ tar cvzf npkit_result.tar.gz npkit_event_trace.json
npkit_event_trace.json
+ mv npkit_result.tar.gz /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
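A possible alternative to -g 4 that I have not tried yet is to keep -g 1 and launch one process per GPU through MPI. The sketch below only prints the command so it can be inspected first. It assumes nccl-tests was built with MPI support, that the launcher is Open MPI's mpirun (the -x option for forwarding an environment variable is specific to Open MPI), and that NPKIT_DUMP_DIR is read as an environment variable. The helper name build_mpi_cmd is made up for illustration, and I do not know whether this avoids the "Failed to CUDA calloc" warnings.

```shell
# Hypothetical helper: prints a one-process-per-GPU launch command
# instead of running it. Assumes an MPI-enabled nccl-tests build and
# Open MPI's mpirun; NPKIT_DUMP_DIR as an environment variable is an
# assumption based on the warning text in the log.
build_mpi_cmd() {
  nranks="$1"    # number of ranks, one GPU each
  bin="$2"       # path to all_reduce_perf
  size="$3"      # message size, e.g. 16M
  dump_dir="$4"  # directory for NPKit dump files
  echo "mpirun -np $nranks -x NPKIT_DUMP_DIR=$dump_dir $bin -b $size -e $size -n 10 -w 0 -g 1 -c 1"
}

# Example with the paths from the log above.
build_mpi_cmd 4 /home/nccl-testsold/build/all_reduce_perf 16M /home/NPKit/results/npkit_dump
```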

Any pointers would be helpful.
