
tf1.x + NGC: question about two-node training #119

@zzt941006

Description


On two servers I started one container each, and installed OpenMPI and the other communication tools inside them.
A quick test with the horovodrun command suggests the link works:

horovodrun -np 8  -H localhost:8  -p 10000  echo "233"
2021-01-30 03:50:03.454606: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233
horovodrun -np 8  -H node2:8  -p 10000  echo "233"
2021-01-30 03:51:07.190350: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233
horovodrun -np 8  -H node1:8,node2:8  -p 10000 --start-timeout 100 echo "233"
2021-01-30 03:53:35.059288: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233

Next I verified horovodrun across 16 GPUs, which also seems fine:

horovodrun -np 16  -H node1:8,node2:8  -p 10000 --start-timeout 100 echo "233"
2021-01-30 06:26:32.968745: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,8]<stdout>:233
[1,9]<stdout>:233
[1,10]<stdout>:233
[1,11]<stdout>:233
[1,12]<stdout>:233
[1,13]<stdout>:233
[1,14]<stdout>:233
[1,15]<stdout>:233
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233

Then I configured the relevant paths and ran:
bash run_two_node.sh 64 fp16 true
(In multi xxx.sh I added --start-timeout 300; otherwise it seems to fail with a timeout error.)
After that the job runs, but the performance is clearly wrong:

[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:35.188454 - Iteration: 1  throughput_train : 487.252 seq/s mlm_loss : 10.4442  nsp_loss : 0.6882  total_loss : 11.1323  avg_loss_step : 11.1323  learning_rate : 0.0  loss_scaler : 67108864
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:37.309305 - Iteration: 1  throughput_train : 482.925 seq/s mlm_loss : 10.4311  nsp_loss : 0.7286  total_loss : 11.1597  avg_loss_step : 11.1597  learning_rate : 0.0  loss_scaler : 33554432
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:39.419278 - Iteration: 1  throughput_train : 485.466 seq/s mlm_loss : 10.4503  nsp_loss : 0.7116  total_loss : 11.1619  avg_loss_step : 11.1619  learning_rate : 0.0  loss_scaler : 16777216
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:41.516706 - Iteration: 1  throughput_train : 488.409 seq/s mlm_loss : 10.4479  nsp_loss : 0.7024  total_loss : 11.1503  avg_loss_step : 11.1503  learning_rate : 0.0  loss_scaler : 8388608
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:43.757067 - Iteration: 1  throughput_train : 457.224 seq/s mlm_loss : 10.4415  nsp_loss : 0.7040  total_loss : 11.1455  avg_loss_step : 11.1455  learning_rate : 0.0  loss_scaler : 4194304
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:45.888906 - Iteration: 1  throughput_train : 480.491 seq/s mlm_loss : 10.4445  nsp_loss : 0.7108  total_loss : 11.1553  avg_loss_step : 11.1553  learning_rate : 0.0  loss_scaler : 2097152
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:47.978569 - Iteration: 1  throughput_train : 490.213 seq/s mlm_loss : 10.4558  nsp_loss : 0.7030  total_loss : 11.1589  avg_loss_step : 11.1589  learning_rate : 0.0  loss_scaler : 1048576
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:50.481595 - Iteration: 1  throughput_train : 409.234 seq/s mlm_loss : 10.4543  nsp_loss : 0.7094  total_loss : 11.1637  avg_loss_step : 11.1637  learning_rate : 0.0  loss_scaler : 524288
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:53.738915 - Iteration: 1  throughput_train : 314.426 seq/s mlm_loss : 10.4676  nsp_loss : 0.7103  total_loss : 11.1779  avg_loss_step : 11.1779  learning_rate : 0.0  loss_scaler : 262144
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:57.473934 - Iteration: 1  throughput_train : 274.211 seq/s mlm_loss : 10.4273  nsp_loss : 0.7190  total_loss : 11.1463  avg_loss_step : 11.1463  learning_rate : 0.0  loss_scaler : 131072
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:38:01.415495 - Iteration: 1  throughput_train : 259.842 seq/s mlm_loss : 10.4480  nsp_loss : 0.7205  total_loss : 11.1685  avg_loss_step : 11.1685  learning_rate : 0.0  loss_scaler : 65536
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:38:03.598933 - Iteration: 1  throughput_train : 469.080 seq/s mlm_loss : 10.4538  nsp_loss : 0.7007  total_loss : 11.1545  avg_loss_step : 11.1545  learning_rate : 0.0  loss_scaler : 32768
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: node1
  Local PID:  197398
  Peer host:  node2
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 217958 on node node2 exited on signal 9 (Killed).
--------------------------------------------------------------------------
link ok
Writting log to ../logs//ngc/tensorflow/bert/bz64/2n8g/bert_b64_fp16_2.log

These messages are also printed at the end of every round. What is going wrong here? It looks like the nodes do connect and produce results, so does that rule out a communication deadlock?
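Two things stand out in the output above. First, throughput decays from ~487 seq/s down to ~260 seq/s before the crash. Second, rank 9 exited on signal 9 (SIGKILL), which usually means something outside the process killed it; on Linux that is most often the kernel OOM killer, and checking `dmesg` on node2 would confirm or rule that out. As a self-contained sketch, the throughput figures can be pulled out of the DLL-formatted log lines like this (the two sample lines are copied from the output above; in practice grep the saved log under ../logs/ instead):

```shell
# Extract throughput_train values from DLL-formatted log lines.
# Sample lines copied from the run above; grep the real log file instead.
printf '%s\n' \
  'DLL 2021-01-30 03:37:35.188454 - Iteration: 1  throughput_train : 487.252 seq/s' \
  'DLL 2021-01-30 03:38:01.415495 - Iteration: 1  throughput_train : 259.842 seq/s' |
  grep -o 'throughput_train : [0-9.]*' |
  awk '{print $3}'     # prints 487.252 then 259.842
```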

I then changed the run parameters to mixed precision with batch size 32: bash run_two_node.sh 32 fp16 true.
This time it errored out immediately:

[1,7]<stderr>:   [[node HorovodBroadcast_bert_encoder_layer_9_output_dense_bias_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[1,7]<stderr>:
[1,7]<stderr>:Original stack trace for 'HorovodBroadcast_bert_encoder_layer_9_output_dense_bias_0':
[1,7]<stderr>:  File "/workspace/bert/run_pretraining.py", line 713, in <module>
[1,7]<stderr>:    tf.compat.v1.app.run()
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
[1,7]<stderr>:    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
[1,7]<stderr>:    _run_main(main, args)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
[1,7]<stderr>:    sys.exit(main(argv))
[1,7]<stderr>:  File "/workspace/bert/run_pretraining.py", line 633, in main
[1,7]<stderr>:    estimator.train(input_fn=train_input_fn, hooks=training_hooks, max_steps=FLAGS.num_train_steps)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
[1,7]<stderr>:    loss = self._train_model(input_fn, hooks, saving_listeners)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
[1,7]<stderr>:    return self._train_model_default(input_fn, hooks, saving_listeners)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
[1,7]<stderr>:    saving_listeners)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
[1,7]<stderr>:    log_step_count_steps=log_step_count_steps) as mon_sess:
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
[1,7]<stderr>:    stop_grace_period_secs=stop_grace_period_secs)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
[1,7]<stderr>:    stop_grace_period_secs=stop_grace_period_secs)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
[1,7]<stderr>:    h.begin()
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 240, in begin
[1,7]<stderr>:    self.bcast_op = broadcast_global_variables(self.root_rank)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 191, in broadcast_global_variables
[1,7]<stderr>:    return broadcast_variables(_global_variables(), root_rank)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 56, in broadcast_variables
[1,7]<stderr>:    return broadcast_group(variables, root_rank)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[1,7]<stderr>:    for var in variables])
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[1,7]<stderr>:    for var in variables])
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 198, in broadcast
[1,7]<stderr>:    ignore_name_scope=ignore_name_scope)
[1,7]<stderr>:  File "<string>", line 320, in horovod_broadcast
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
[1,7]<stderr>:    op_def=op_def)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
[1,7]<stderr>:    return func(*args, **kwargs)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
[1,7]<stderr>:    attrs, op_def, compute_device)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
[1,7]<stderr>:    op_def=op_def)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
[1,7]<stderr>:    self._traceback = tf_stack.extract_stack()
[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
[1,6]<stderr>:    return fn(*args)
[1,6]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
[1,6]<stderr>:    target_list, run_metadata)
[1,6]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
[1,6]<stderr>:    run_metadata)

Very strange. Am I doing something wrong?
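One hedged guess: a HorovodBroadcast that fails only under some configurations is often a transport problem (for example, NCCL binding to the wrong network interface between the two containers, or memory pressure) rather than a model-code bug. The standard NCCL environment variables below make the collective setup visible; none of them come from the original scripts, and `eth0` is a placeholder for whichever NIC actually connects node1 and node2:

```shell
# Assumed debugging configuration, not part of the original run scripts.
export NCCL_DEBUG=INFO          # log NCCL's topology/transport decisions per rank
export NCCL_SOCKET_IFNAME=eth0  # placeholder: pin NCCL sockets to the inter-node NIC
export NCCL_IB_DISABLE=1        # only if there is no usable InfiniBand path
```

With `NCCL_DEBUG=INFO` set on both nodes, the per-rank logs show which interface and transport each ring uses, which usually narrows down cross-node failures quickly.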
