On two servers I started one container each and installed OpenMPI and the other communication tools inside them.
A quick test with the horovodrun command suggests the connection works:
horovodrun -np 8 -H localhost:8 -p 10000 echo "233"
2021-01-30 03:50:03.454606: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233
horovodrun -np 8 -H node2:8 -p 10000 echo "233"
2021-01-30 03:51:07.190350: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233
horovodrun -np 8 -H node1:8,node2:8 -p 10000 --start-timeout 100 echo "233"
2021-01-30 03:53:35.059288: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233
Next I verified horovodrun with 16 GPUs across both nodes, which also looked fine:
horovodrun -np 16 -H node1:8,node2:8 -p 10000 --start-timeout 100 echo "233"
2021-01-30 06:26:32.968745: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,8]<stdout>:233
[1,9]<stdout>:233
[1,10]<stdout>:233
[1,11]<stdout>:233
[1,12]<stdout>:233
[1,13]<stdout>:233
[1,14]<stdout>:233
[1,15]<stdout>:233
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233
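For what it's worth, echo only confirms that processes can be launched on both nodes; it does not exercise the Horovod allreduce path itself. A minimal check along these lines should go through the actual MPI/NCCL path (this is just a sketch; hvd_check.py is a placeholder name, written against the TF1-style API used in this container):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU, as Horovod recommends
config = tf.compat.v1.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.compat.v1.Session(config=config) as sess:
    # Allreduce a constant tensor; unlike echo, this actually
    # exercises the cross-node communication path
    total = hvd.allreduce(tf.constant([1.0]), average=False)
    print("rank", hvd.rank(), "of", hvd.size(), "allreduce:", sess.run(total))

Run with horovodrun -np 16 -H node1:8,node2:8 -p 10000 python hvd_check.py; every rank should print [16.] if the cross-node path is healthy.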
Then I configured the relevant paths and ran:
bash run_two_node.sh 64 fp16 true
Note that in the multi-node script (multi xxx.sh) I added --start-timeout 300, otherwise it seemed to fail with a start-timeout error.
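Concretely (a sketch only, since the actual launch line in the script may look different), the change amounts to appending the flag to the horovodrun invocation, e.g.:
horovodrun -np 16 -H node1:8,node2:8 -p 10000 --start-timeout 300 python /workspace/bert/run_pretraining.py ...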
With that change the job runs, but the throughput is clearly wrong:
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:35.188454 - Iteration: 1 throughput_train : 487.252 seq/s mlm_loss : 10.4442 nsp_loss : 0.6882 total_loss : 11.1323 avg_loss_step : 11.1323 learning_rate : 0.0 loss_scaler : 67108864
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:37.309305 - Iteration: 1 throughput_train : 482.925 seq/s mlm_loss : 10.4311 nsp_loss : 0.7286 total_loss : 11.1597 avg_loss_step : 11.1597 learning_rate : 0.0 loss_scaler : 33554432
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:39.419278 - Iteration: 1 throughput_train : 485.466 seq/s mlm_loss : 10.4503 nsp_loss : 0.7116 total_loss : 11.1619 avg_loss_step : 11.1619 learning_rate : 0.0 loss_scaler : 16777216
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:41.516706 - Iteration: 1 throughput_train : 488.409 seq/s mlm_loss : 10.4479 nsp_loss : 0.7024 total_loss : 11.1503 avg_loss_step : 11.1503 learning_rate : 0.0 loss_scaler : 8388608
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:43.757067 - Iteration: 1 throughput_train : 457.224 seq/s mlm_loss : 10.4415 nsp_loss : 0.7040 total_loss : 11.1455 avg_loss_step : 11.1455 learning_rate : 0.0 loss_scaler : 4194304
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:45.888906 - Iteration: 1 throughput_train : 480.491 seq/s mlm_loss : 10.4445 nsp_loss : 0.7108 total_loss : 11.1553 avg_loss_step : 11.1553 learning_rate : 0.0 loss_scaler : 2097152
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:47.978569 - Iteration: 1 throughput_train : 490.213 seq/s mlm_loss : 10.4558 nsp_loss : 0.7030 total_loss : 11.1589 avg_loss_step : 11.1589 learning_rate : 0.0 loss_scaler : 1048576
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:50.481595 - Iteration: 1 throughput_train : 409.234 seq/s mlm_loss : 10.4543 nsp_loss : 0.7094 total_loss : 11.1637 avg_loss_step : 11.1637 learning_rate : 0.0 loss_scaler : 524288
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:53.738915 - Iteration: 1 throughput_train : 314.426 seq/s mlm_loss : 10.4676 nsp_loss : 0.7103 total_loss : 11.1779 avg_loss_step : 11.1779 learning_rate : 0.0 loss_scaler : 262144
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:57.473934 - Iteration: 1 throughput_train : 274.211 seq/s mlm_loss : 10.4273 nsp_loss : 0.7190 total_loss : 11.1463 avg_loss_step : 11.1463 learning_rate : 0.0 loss_scaler : 131072
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:38:01.415495 - Iteration: 1 throughput_train : 259.842 seq/s mlm_loss : 10.4480 nsp_loss : 0.7205 total_loss : 11.1685 avg_loss_step : 11.1685 learning_rate : 0.0 loss_scaler : 65536
[1,0]<stdout>:Skipping time record for 0 due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:38:03.598933 - Iteration: 1 throughput_train : 469.080 seq/s mlm_loss : 10.4538 nsp_loss : 0.7007 total_loss : 11.1545 avg_loss_step : 11.1545 learning_rate : 0.0 loss_scaler : 32768
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: node1
Local PID: 197398
Peer host: node2
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 217958 on node node2 exited on signal 9 (Killed).
--------------------------------------------------------------------------
link ok
Writting log to ../logs//ngc/tensorflow/bert/bz64/2n8g/bert_b64_fp16_2.log
These messages are printed at the end of every run. What is going wrong here? Since the nodes did connect and produced results, does that at least mean the communication is not deadlocking?
I then changed the run parameters to mixed precision with batch size 32: bash run_two_node.sh 32 fp16 true.
This time it failed outright with an error:
[1,7]<stderr>: [[node HorovodBroadcast_bert_encoder_layer_9_output_dense_bias_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[1,7]<stderr>:
[1,7]<stderr>:Original stack trace for 'HorovodBroadcast_bert_encoder_layer_9_output_dense_bias_0':
[1,7]<stderr>: File "/workspace/bert/run_pretraining.py", line 713, in <module>
[1,7]<stderr>: tf.compat.v1.app.run()
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
[1,7]<stderr>: _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
[1,7]<stderr>: _run_main(main, args)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
[1,7]<stderr>: sys.exit(main(argv))
[1,7]<stderr>: File "/workspace/bert/run_pretraining.py", line 633, in main
[1,7]<stderr>: estimator.train(input_fn=train_input_fn, hooks=training_hooks, max_steps=FLAGS.num_train_steps)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
[1,7]<stderr>: loss = self._train_model(input_fn, hooks, saving_listeners)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
[1,7]<stderr>: return self._train_model_default(input_fn, hooks, saving_listeners)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
[1,7]<stderr>: saving_listeners)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
[1,7]<stderr>: log_step_count_steps=log_step_count_steps) as mon_sess:
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
[1,7]<stderr>: stop_grace_period_secs=stop_grace_period_secs)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
[1,7]<stderr>: stop_grace_period_secs=stop_grace_period_secs)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
[1,7]<stderr>: h.begin()
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 240, in begin
[1,7]<stderr>: self.bcast_op = broadcast_global_variables(self.root_rank)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 191, in broadcast_global_variables
[1,7]<stderr>: return broadcast_variables(_global_variables(), root_rank)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 56, in broadcast_variables
[1,7]<stderr>: return broadcast_group(variables, root_rank)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[1,7]<stderr>: for var in variables])
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[1,7]<stderr>: for var in variables])
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 198, in broadcast
[1,7]<stderr>: ignore_name_scope=ignore_name_scope)
[1,7]<stderr>: File "<string>", line 320, in horovod_broadcast
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
[1,7]<stderr>: op_def=op_def)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
[1,7]<stderr>: return func(*args, **kwargs)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
[1,7]<stderr>: attrs, op_def, compute_device)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py",[1,7]<stderr>: line 3426, in _create_op_internal
[1,7]<stderr>: op_def=op_def)
[1,7]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
[1,7]<stderr>: self._traceback = tf_stack.extract_stack()
[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
[1,6]<stderr>: return fn(*args)
[1,6]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
[1,6]<stderr>: target_list, run_metadata)
[1,6]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
[1,6]<stderr>: run_metadata)
Very strange. Am I doing something wrong?