Issue
When I use npkit_trace_generator.py to convert the trace files generated by NPKit into a JSON file, I get the following error:
Traceback (most recent call last):
  File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 232, in <module>
    convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
  File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 211, in convert_npkit_dump_to_trace
    gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 95, in parse_gpu_event_file
    'ts': curr_cpu_base_time + parsed_gpu_event['timestamp'] / gpu_clock_scale - curr_gpu_base_time,
          ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'
Specifically, I used msccl-tools/examples/mscclang/allgather_recursive_doubling.py to generate the XML file and then ran the collective on the cluster. The same error occurs when testing reduce scatter, but not with allreduce or alltoall. Can you help me with this error? Looking forward to your reply.
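For anyone triaging this: the traceback shows that curr_cpu_base_time was still None when the expression on line 95 was evaluated. One plausible cause, which is an assumption and not confirmed against the script source, is that a regular GPU event was parsed before any event had set the CPU/GPU base times. The sketch below only mirrors the arithmetic from the failing line; the helper name to_trace_timestamp and the choice to skip such events are hypothetical and not part of npkit_trace_generator.py.

```python
from typing import Optional


def to_trace_timestamp(cpu_base: Optional[float],
                       gpu_base: Optional[float],
                       gpu_timestamp: int,
                       gpu_clock_scale: float) -> Optional[float]:
    """Hypothetical helper: same arithmetic as the failing line,
    but returns None when no time base has been recorded yet."""
    if cpu_base is None or gpu_base is None:
        # No base time known yet, so the event cannot be placed on the timeline.
        return None
    return cpu_base + gpu_timestamp / gpu_clock_scale - gpu_base


# Reproduce the reported failure mode in isolation: None + float.
try:
    None + 1000 / 1.5
    message = ""
except TypeError as exc:
    message = str(exc)

print(message)
print(to_trace_timestamp(None, None, 1000, 2.0))
print(to_trace_timestamp(10.0, 100.0, 1000, 2.0))
```

Skipping events that have no base time would hide the symptom without explaining why allgather and reduce scatter hit it while allreduce and alltoall do not, so treat this as a diagnostic aid only.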
Details
Generate xml file:
python /home/zhangshizhuo/msccl-tools/examples/mscclang/allgather_recursive_doubling.py 4 1 --protocol='Simple' > /home/zhangshizhuo/xml2/Allgather_test.xml
mpirun test:
mpirun --prefix /usr/local/openmpi \
    -np 4 \
    -H gpu1:4 \
    -map-by slot \
    -mca btl_tcp_if_include 10.1.1.0/24 \
    -x NCCL_SOCKET_IFNAME=ens16f0,enp75s0f0np0,ens6f0 \
    -x LD_LIBRARY_PATH=/home/zhangshizhuo/msccl/build/lib/:$LD_LIBRARY_PATH \
    -x NCCL_NET_SHARED_BUFFERS=0 \
    -x NCCL_IGNORE_DISABLED_P2P=1 \
    -x NCCL_SHM_Disable=1 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_ALGO=MSCCL,RING \
    -x MSCCL_XML_FILES=/home/zhangshizhuo/xml2/Allgather_test.xml \
    -x NPKIT_DUMP_DIR=/home/zhangshizhuo/trace/trace_allgather/ \
    -x CUDA_VISIBLE_DEVICES=0,1,2,3 \
    bash -c ' cd /home/zhangshizhuo/nccl-tests/build/; \
    ./all_gather_perf -b 32M -e 32M -f 2 -g 1 -n 5 -w 3 -c 0 -z 1 '