Hello, I deployed your project on a server with three 5880 GPUs. My environment matches yours except for the driver version. Single-card training is too slow, so I wanted to train on all three cards at once and ran this command:
CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --master_port=26355 --nproc_per_node=3 main.py --batch_size 1 --seq_num 28 --use_checkpoint
However, it failed with the following error:
Traceback (most recent call last):
File "main.py", line 1048, in <module>
main(exp_root, timestamp_save_path, local_rank, rank)
File "main.py", line 1010, in main
device = init_device(args.device, args.batch_size, local_rank)
File "main.py", line 81, in init_device
assert (
AssertionError: Insufficient CUDA devices for DDP command.
Traceback (most recent call last):
File "main.py", line 1048, in <module>
main(exp_root, timestamp_save_path, local_rank, rank)
File "main.py", line 1010, in main
device = init_device(args.device, args.batch_size, local_rank)
File "main.py", line 81, in init_device
assert (
AssertionError: Insufficient CUDA devices for DDP command.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 799721 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 799722) of binary: /home/user/miniconda3/envs/urnn_cu118/bin/python
Traceback (most recent call last):
File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-10-09_14:52:47
host : ynaushuili
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 799723)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-10-09_14:52:47
host : ynaushuili
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 799722)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
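For reference, a quick way to confirm how many devices this PyTorch build actually sees under the same environment (a one-liner sanity check; with three working cards I would expect it to print 3):

CUDA_VISIBLE_DEVICES=0,1,2 python -c "import torch; print(torch.cuda.device_count())"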
I found that the first failure occurs at this code of yours:
assert (
torch.cuda.device_count() > local_rank
), "Insufficient CUDA devices for DDP command."
I have tried a number of workarounds but am out of ideas. May I ask how you run multi-GPU training?
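For context, my understanding of how the launcher hands each worker its rank, in case the mismatch is there: torch.distributed.launch injects --local_rank=<n> into each worker's argv (or sets the LOCAL_RANK environment variable when --use_env is given), while the newer torchrun only sets LOCAL_RANK. A parsing sketch that covers both (the argument name is an assumption about main.py):

import os
import argparse

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank on the command line;
# torchrun sets LOCAL_RANK in the environment instead.
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()
print(f"worker local_rank = {args.local_rank}")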