
gluon-mxnet-bert: slow multi-node training #48

@Flowingsun007

Introduction

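Horovod is a library for multi-node distributed training with PyTorch, TensorFlow, and MXNet. Its inter-machine communication is built on NCCL or MPI, so NCCL and Open MPI usually need to be installed first, together with at least one deep learning framework, for example MXNet: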

python3 -m pip install gluonnlp==0.10.0 mxnet-cu102mkl==1.6.0.post0 -i https://mirror.baidu.com/pypi/simple
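
Before moving on to the Horovod build, it can help to confirm that the command-line prerequisites are actually on PATH. The following is a minimal sketch; `have_cmd` and `check_prereqs` are hypothetical helper names, not part of Horovod or Open MPI. It only checks executables such as `mpirun` and `python3`; NCCL is a library and has to be verified separately.

```shell
# Return 0 if the named command is on PATH, non-zero otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print one status line per command; return 1 if any command is missing.
check_prereqs() {
    missing=0
    for cmd in "$@"; do
        if have_cmd "$cmd"; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be `check_prereqs mpirun python3`, run on every node that takes part in training.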

Solution

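Once the dependencies are in place, Horovod itself can be installed. The NCCL-related variables must be set at install time; otherwise Horovod may not use NCCL for communication at runtime, and training will be very slow. Detailed installation guide: https://github.com/horovod/horovod/blob/master/docs/gpus.rst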

  • When installing Horovod, set the NCCL-related variables:
HOROVOD_WITH_MXNET=1  HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL

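If Horovod is installed without these variables, horovodrun still works, but training is very slow because the underlying communication goes over MPI instead of NCCL.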
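
As a rough sketch of the full install step, the variables above can be passed as a prefix to the `pip` command, and the result can be checked with `horovodrun --check-build`, which lists the tensor operations compiled into the build. The function names `install_horovod_nccl` and `nccl_enabled` are hypothetical, and the check assumes that `--check-build` marks enabled components with `[X]`; adjust the pattern if your Horovod version prints something different.

```shell
# Build Horovod with the NCCL variables from this issue.
# --no-cache-dir forces a rebuild instead of reusing a cached wheel
# that may have been compiled without NCCL.
install_horovod_nccl() {
    HOROVOD_WITH_MXNET=1 HOROVOD_GPU_OPERATIONS=NCCL \
    HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL \
        python3 -m pip install --no-cache-dir horovod
}

# Read `horovodrun --check-build` output on stdin and succeed only if
# NCCL is marked as available (assumed marker: "[X] NCCL").
nccl_enabled() {
    grep -qF '[X] NCCL'
}
```

After installing, `horovodrun --check-build | nccl_enabled && echo "NCCL enabled"` gives a quick yes/no answer. If the check fails, uninstalling Horovod and reinstalling with the variables set is the likely fix.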

Other notes

  • Add the --log-level option with INFO or DEBUG at runtime to see detailed output:
horovodrun -np ${gpu_num} -H ${node_ip}   -p ${PORT} \
--start-timeout 600 --log-level INFO \
python3 ${WORKSPACE}/run_pretraining.py ${CMD} 2>&1 | tee ${log_file}
  • When launching through mpirun, add -x NCCL_DEBUG=INFO to see the NCCL output:
mpirun -oversubscribe -np ${gpu_num} -H ${node_ip} \
    -bind-to none -map-by numa \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    -mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" \
    -mca btl_tcp_if_include ib0 \
python3 ${WORKSPACE}/run_pretraining.py ${CMD} 2>&1 | tee ${log_file}
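
With NCCL_DEBUG=INFO set, NCCL writes lines containing `NCCL INFO` to the job output, so the log saved by `tee` can be searched to see whether NCCL was used at all. This is a minimal sketch; `count_nccl_lines` is a hypothetical helper name, and a zero count is only a hint that the collectives went through MPI, not proof.

```shell
# Print the number of lines in the given log file that contain "NCCL INFO".
# A count of 0 with NCCL_DEBUG=INFO set suggests NCCL was not used.
count_nccl_lines() {
    grep -c 'NCCL INFO' "$1"
}
```

For example, `count_nccl_lines "${log_file}"` can be run right after the training job finishes.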
