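Introduction
Horovod is a library for multi-node distributed training with PyTorch, TensorFlow, and MXNet. Its inter-machine communication relies on NCCL or MPI, so NCCL and Open MPI usually need to be installed first, along with at least one deep learning framework, for example MXNet: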
python3 -m pip install gluonnlp==0.10.0 mxnet-cu102mkl==1.6.0.post0 -i https://mirror.baidu.com/pypi/simple
Troubleshooting
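Once the dependencies are installed, Horovod itself can be installed. The NCCL-related variables must be set at install time; otherwise Horovod may not use NCCL for communication at run time, which makes training very slow. Detailed installation instructions: https://github.com/horovod/horovod/blob/master/docs/gpus.rst
- When installing Horovod, set the NCCL-related variables:
HOROVOD_WITH_MXNET=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL
If Horovod is installed without these variables, horovodrun still works, but training is much slower because the communication goes over MPI instead of NCCL.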
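The install step above can be scripted so that the variables are never forgotten. The following is a minimal sketch, not the issue author's exact procedure: the `pip install --no-cache-dir horovod` command is an assumption (the original text lists only the variables, not the install command or version), and `build_install_env` and `install_command` are hypothetical helper names.

```python
import os
import shlex

# Build-time variables quoted in the note above.
NCCL_BUILD_VARS = {
    "HOROVOD_WITH_MXNET": "1",
    "HOROVOD_GPU_OPERATIONS": "NCCL",
    "HOROVOD_GPU_ALLREDUCE": "NCCL",
    "HOROVOD_GPU_BROADCAST": "NCCL",
}


def build_install_env(base_env=None):
    """Return a copy of the environment with the NCCL build variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(NCCL_BUILD_VARS)
    return env


def install_command():
    """Return the pip command as an argument list.

    Assumption: a plain install of the `horovod` package; the source does
    not show the exact command or a pinned version.
    """
    return ["python3", "-m", "pip", "install", "--no-cache-dir", "horovod"]


if __name__ == "__main__":
    # Print the equivalent shell one-liner instead of running it.
    prefix = " ".join(f"{key}={value}" for key, value in NCCL_BUILD_VARS.items())
    print(prefix, shlex.join(install_command()))
    # To actually install, something like:
    #   subprocess.run(install_command(), env=build_install_env(), check=True)
```

Passing the variables through `env=` rather than exporting them keeps them scoped to the build and leaves the calling shell unchanged.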
Other
- Add the --log-level option at run time, set to INFO or DEBUG, to see detailed output:
horovodrun -np ${gpu_num} -H ${node_ip} -p ${PORT} \
--start-timeout 600 --log-level INFO \
python3 ${WORKSPACE}/run_pretraining.py ${CMD} 2>&1 | tee ${log_file}
- When launching through mpirun, add -x NCCL_DEBUG=INFO to see NCCL's output:
mpirun -oversubscribe -np ${gpu_num} -H ${node_ip} \
-bind-to none -map-by numa \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
-mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" \
-mca btl_tcp_if_include ib0 \
python3 ${WORKSPACE}/run_pretraining.py ${CMD} 2>&1 | tee ${log_file}
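If the launch is driven from Python rather than a shell script, the two commands above can be assembled as argument lists, which makes the log level and NCCL_DEBUG setting easy to switch. This is a rough sketch that only mirrors the flags shown above; `horovodrun_cmd` and `mpirun_cmd` are hypothetical helper names, and the host string and port in the usage example are placeholder values.

```python
import shlex


def horovodrun_cmd(gpu_num, hosts, port, script, script_args=(), log_level="INFO"):
    """Argument list for the horovodrun launch, with a configurable log level."""
    return [
        "horovodrun", "-np", str(gpu_num), "-H", hosts, "-p", str(port),
        "--start-timeout", "600", "--log-level", log_level,
        "python3", script, *script_args,
    ]


def mpirun_cmd(gpu_num, hosts, script, script_args=(), nccl_debug="INFO"):
    """Argument list for the mpirun launch, forwarding NCCL_DEBUG to the workers."""
    return [
        "mpirun", "-oversubscribe", "-np", str(gpu_num), "-H", hosts,
        "-bind-to", "none", "-map-by", "numa",
        "-x", f"NCCL_DEBUG={nccl_debug}", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "-mca", "plm_rsh_args", "-p 22 -q -o StrictHostKeyChecking=no",
        "-mca", "btl_tcp_if_include", "ib0",
        "python3", script, *script_args,
    ]


if __name__ == "__main__":
    # Placeholder hosts and port, for illustration only.
    hosts = "10.0.0.1:8,10.0.0.2:8"
    print(shlex.join(horovodrun_cmd(16, hosts, 22, "run_pretraining.py", log_level="DEBUG")))
    print(shlex.join(mpirun_cmd(16, hosts, "run_pretraining.py")))
```

Either list can be passed to `subprocess.run`; building a list instead of a string avoids quoting problems with the `plm_rsh_args` value, which contains spaces.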