Errors when running allgather with npkit_trace_generator.py #33

@zhuo121

Issue

When I use npkit_trace_generator.py to convert the trace files dumped by NPKit into a JSON file, the script fails with the following traceback:

Traceback (most recent call last):
  File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 232, in <module>
    convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
  File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 211, in convert_npkit_dump_to_trace
    gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 95, in parse_gpu_event_file
    'ts': curr_cpu_base_time + parsed_gpu_event['timestamp'] / gpu_clock_scale - curr_gpu_base_time,
          ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'

Specifically, I generated the XML file with msccl-tools/examples/mscclang/allgather_recursive_doubling.py and ran the collective on the cluster. The same error occurs when testing reduce-scatter, but not with allreduce or alltoall. Can you help me with this error? Looking forward to your reply.
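
For reference, the TypeError above can be reproduced in isolation. The sketch below assumes that `curr_cpu_base_time` is still `None` when line 95 runs, for example because no CPU time-sync record came before the first GPU event in the dump. That is only a guess about the script's internals. `to_trace_ts` and `reproduce_type_error` are hypothetical helpers written for illustration and are not part of npkit_trace_generator.py.

```python
def reproduce_type_error():
    """Trigger the same failure as line 95 when the CPU base time is None."""
    curr_cpu_base_time = None      # assumption: never set before the first GPU event
    gpu_timestamp = 1000           # made-up raw GPU timestamp
    gpu_clock_scale = 1e3          # made-up clock scale
    curr_gpu_base_time = 0.5       # made-up GPU base time
    try:
        curr_cpu_base_time + gpu_timestamp / gpu_clock_scale - curr_gpu_base_time
    except TypeError as exc:
        return str(exc)
    return None


def to_trace_ts(cpu_base_time, gpu_timestamp, gpu_clock_scale, gpu_base_time):
    """Hypothetical guarded version: return None instead of raising
    when either base time has not been established yet."""
    if cpu_base_time is None or gpu_base_time is None:
        return None
    return cpu_base_time + gpu_timestamp / gpu_clock_scale - gpu_base_time


if __name__ == "__main__":
    print(reproduce_type_error())
    print(to_trace_ts(None, 1000, 1e3, 0.5))
    print(to_trace_ts(10.0, 1000, 1e3, 0.5))
```

A guard like this would only hide the symptom. If the assumption holds, the open question is why the allgather and reduce-scatter dumps reach a GPU event with no CPU base time while the allreduce and alltoall dumps do not.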

Details

Generate the XML file:

python /home/zhangshizhuo/msccl-tools/examples/mscclang/allgather_recursive_doubling.py 4 1 --protocol='Simple'> /home/zhangshizhuo/xml2/Allgather_test.xml

mpirun test:

 mpirun --prefix /usr/local/openmpi \
        -np 4 \
        -H gpu1:4 \
        -map-by slot \
        -mca btl_tcp_if_include 10.1.1.0/24 \
        -x NCCL_SOCKET_IFNAME=ens16f0,enp75s0f0np0,ens6f0 \
        -x LD_LIBRARY_PATH=/home/zhangshizhuo/msccl/build/lib/:$LD_LIBRARY_PATH \
        -x NCCL_NET_SHARED_BUFFERS=0 \
        -x NCCL_IGNORE_DISABLED_P2P=1 \
        -x NCCL_SHM_Disable=1 \
        -x NCCL_DEBUG=INFO \
        -x NCCL_ALGO=MSCCL,RING  \
        -x MSCCL_XML_FILES=/home/zhangshizhuo/xml2/Allgather_test.xml \
        -x NPKIT_DUMP_DIR=/home/zhangshizhuo/trace/trace_allgather/ \
        -x CUDA_VISIBLE_DEVICES=0,1,2,3 \
        bash -c ' cd /home/zhangshizhuo/nccl-tests/build/; \
        ./all_gather_perf -b 32M -e 32M -f 2 -g 1 -n 5 -w 3 -c 0 -z 1 '
