Description
I am trying to profile an nccl-tests allReduce call: all_reduce_perf -b 16M -e 16M -n 10 -w 0 -g 1 -c 1
NCCL version: 2.17.1
NCCL-tests version: 2.13.3
TLDR:
- The trace file generated is empty for -g 1. I do not see any errors and have tried enabling all the NPKIT_FLAGS as well. The NCCL version I am using is the one NPKit specifies. Is there a particular nccl-tests version that needs to be used? Or is this a bug in NPKit's NCCL profiling?
- Values other than -g 1 throw errors (e.g., -g 2). How can I observe the collective communication patterns between multiple GPUs/ranks if I cannot specify them in my tests? Why does this limitation exist?
Edit: Before applying the NPKit patch, the tests run for all values of -g.
---------------------------Details----------------------------
First, I wanted to confirm that it works with -g 1. I have passed the correct run parameters to the nccl_test() function in npkit_runner.sh:
npkit_runner.sh
function nccl_test() {
$1 -b $3 -e $3 -n $7 -w $6 -g 1 -c 1 | tee $9/log.txt
}
The test runs successfully:
root@c11efb988616:/home/NPKit/nccl_samples# bash npkit_launcher.sh
+ export NCCL_SRC_DIR=/home/ncclold
+ NCCL_SRC_DIR=/home/ncclold
+ export NPKIT_SRC_DIR=/home/NPKit
+ NPKIT_SRC_DIR=/home/NPKit
+ export NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ export NPKIT_RUN_DIR=/home/NPKit/results
+ NPKIT_RUN_DIR=/home/NPKit/results
+ export NCCL_MSG_SIZE=16M
+ NCCL_MSG_SIZE=16M
+ export NCCL_ALGO=Ring
+ NCCL_ALGO=Ring
+ export NCCL_PROTO=Simple
+ NCCL_PROTO=Simple
+ export NCCL_NUM_WARMUPS=0
+ NCCL_NUM_WARMUPS=0
+ export NCCL_NUM_ITERS=10
+ NCCL_NUM_ITERS=10
+ NPKIT_FLAGS_CPU_PREFIX=-DENABLE_NPKIT
+ NPKIT_FLAGS_GPU_PREFIX='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ bash npkit_runner.sh
++ basename /home/nccl-testsold/build/all_reduce_perf
+ npkit_run_tag=all_reduce_perf//Ring/Simple
+ npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ npkit_trace_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ npkit_result_dir=/home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ rm -rf /home/NPKit/results
+ mkdir -p /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ nccl_test /home/nccl-testsold/build/all_reduce_perf /home/ncclold 16M Ring Simple 0 10 /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ /home/nccl-testsold/build/all_reduce_perf -b 16M -e 16M -n 10 -w 0 -g 1 -c 1
+ tee /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple/log.txt
# nThread 1 nGpus 1 minBytes 16777216 maxBytes 16777216 step: 1048576(bytes) warmup iters: 0 iters: 10 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1062079 on c11efb988616 device 0 [0x17] NVIDIA A100 80GB PCIe
NCCL version 2.17.1+cuda12.0
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
16777216 4194304 float sum -1 266.8 62.88 0.00 0 0.42 40031.53 0.00 0
c11efb988616:1062079:1062079 [0] init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
#
+ cd /home/NPKit/nccl_samples
+ python3 npkit_trace_generator.py --npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple --npkit_event_header_path=/home/ncclold/src/include/npkit/npkit_event.h --output_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ cd /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ tar cvzf npkit_result.tar.gz npkit_event_trace.json
npkit_event_trace.json
+ mv npkit_result.tar.gz /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
However, the npkit_event_trace.json is empty. Does it have something to do with the init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty warning shown above?
{"traceEvents": [], "displayTimeUnit": "ns"}
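The nccl_test call in the log passes the dump directory as the eighth argument, yet the modified function never uses $8. A minimal sketch of passing it through, assuming the patched NCCL reads the dump location from an environment variable named NPKIT_DUMP_DIR (inferred only from the warning text, not confirmed against the NPKit patch):

```shell
# Sketch only: NPKIT_DUMP_DIR as an environment variable is an assumption
# inferred from the "NPKIT_DUMP_DIR is empty" warning.
# Positional arguments follow the nccl_test call shown in the log:
#   $1 test binary, $3 message size, $6 warmup iterations, $7 iterations,
#   $8 dump directory, $9 result directory
nccl_test() {
  # Set the variable only for this one command, then log the output as before.
  NPKIT_DUMP_DIR="$8" "$1" -b "$3" -e "$3" -n "$7" -w "$6" -g 1 -c 1 | tee "$9/log.txt"
}
```

If that variable is what the warning refers to, the warning should go away with this change; if it persists, the cause lies elsewhere.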
When I increase the number of GPUs to, say, -g 4:
npkit_runner.sh
function nccl_test() {
$1 -b $3 -e $3 -n $7 -w $6 -g 4 -c 1 | tee $9/log.txt
}
The test fails:
root@c11efb988616:/home/NPKit/nccl_samples# bash npkit_launcher.sh
+ export NCCL_SRC_DIR=/home/ncclold
+ NCCL_SRC_DIR=/home/ncclold
+ export NPKIT_SRC_DIR=/home/NPKit
+ NPKIT_SRC_DIR=/home/NPKit
+ export NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ export NPKIT_RUN_DIR=/home/NPKit/results
+ NPKIT_RUN_DIR=/home/NPKit/results
+ export NCCL_MSG_SIZE=16M
+ NCCL_MSG_SIZE=16M
+ export NCCL_ALGO=Ring
+ NCCL_ALGO=Ring
+ export NCCL_PROTO=Simple
+ NCCL_PROTO=Simple
+ export NCCL_NUM_WARMUPS=0
+ NCCL_NUM_WARMUPS=0
+ export NCCL_NUM_ITERS=10
+ NCCL_NUM_ITERS=10
+ NPKIT_FLAGS_CPU_PREFIX=-DENABLE_NPKIT
+ NPKIT_FLAGS_GPU_PREFIX='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_ENTRY -DENABLE_NPKIT_EVENT_RECV_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_ENTRY -DENABLE_NPKIT_EVENT_RECV_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_ENTRY -DENABLE_NPKIT_EVENT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_ENTRY -DENABLE_NPKIT_EVENT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT'
+ bash npkit_runner.sh
++ basename /home/nccl-testsold/build/all_reduce_perf
+ npkit_run_tag=all_reduce_perf//Ring/Simple
+ npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ npkit_trace_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ npkit_result_dir=/home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ rm -rf /home/NPKit/results
+ mkdir -p /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ nccl_test /home/nccl-testsold/build/all_reduce_perf /home/ncclold 16M Ring Simple 0 10 /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ /home/nccl-testsold/build/all_reduce_perf -b 16M -e 16M -n 10 -w 0 -g 4 -c 1
+ tee /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple/log.txt
# nThread 1 nGpus 4 minBytes 16777216 maxBytes 16777216 step: 1048576(bytes) warmup iters: 0 iters: 10 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1346195 on c11efb988616 device 0 [0x17] NVIDIA A100 80GB PCIe
# Rank 1 Group 0 Pid 1346195 on c11efb988616 device 1 [0x65] NVIDIA A100 80GB PCIe
# Rank 2 Group 0 Pid 1346195 on c11efb988616 device 2 [0xca] NVIDIA A100 80GB PCIe
# Rank 3 Group 0 Pid 1346195 on c11efb988616 device 3 [0xe3] NVIDIA A100 80GB PCIe
NCCL version 2.17.1+cuda12.0
c11efb988616:1346195:1346234 [2] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346234 [2] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
16777216 4194304 float sum -1 5738.2 2.92 4.39 0 9352.9 1.79 2.69 0
c11efb988616:1346195:1346195 [0] init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty
c11efb988616:1346195:1346195 [1] init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
+ cd /home/NPKit/nccl_samples
+ python3 npkit_trace_generator.py --npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple --npkit_event_header_path=/home/ncclold/src/include/npkit/npkit_event.h --output_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ cd /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ tar cvzf npkit_result.tar.gz npkit_event_trace.json
npkit_event_trace.json
+ mv npkit_result.tar.gz /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
Any pointers would be helpful.