Multi-GPU training problem #5

@awei3878

Description

Hello, I deployed your project on a server with three 5880 GPUs. I used the same environment as yours, except for the driver version. Single-GPU training is too slow, so I wanted to train on all three cards at once and ran this command:

CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --master_port=26355 --nproc_per_node=3 main.py --batch_size 1 --seq_num 28 --use_checkpoint

But it failed with the following error:

Traceback (most recent call last):
  File "main.py", line 1048, in <module>
    main(exp_root, timestamp_save_path, local_rank, rank)
  File "main.py", line 1010, in main
    device = init_device(args.device, args.batch_size, local_rank)
  File "main.py", line 81, in init_device
    assert (
AssertionError: Insufficient CUDA devices for DDP command.
Traceback (most recent call last):
  File "main.py", line 1048, in <module>
    main(exp_root, timestamp_save_path, local_rank, rank)
  File "main.py", line 1010, in main
    device = init_device(args.device, args.batch_size, local_rank)
  File "main.py", line 81, in init_device
    assert (
AssertionError: Insufficient CUDA devices for DDP command.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 799721 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 799722) of binary: /home/user/miniconda3/envs/urnn_cu118/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/urnn_cu118/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-10-09_14:52:47
  host      : ynaushuili
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 799723)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-10-09_14:52:47
  host      : ynaushuili
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 799722)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I found that the first error is raised by this code in your project:

assert (
    torch.cuda.device_count() > local_rank
), "Insufficient CUDA devices for DDP command."

I have tried several approaches but I am out of ideas. Could you tell me how you run multi-GPU training?
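
For reference, this is my understanding of what the assertion requires, written as a standalone sketch (not the repo's init_device; LOCAL_RANK here is the torchrun-style environment variable, whereas torch.distributed.launch passes --local_rank on the command line unless --use_env is given):

import os
import torch

# Each launched process gets its own local rank (0, 1, 2 for three cards).
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Every process must see at least local_rank + 1 CUDA devices, otherwise
# "Insufficient CUDA devices for DDP command." is raised. So the failure
# suggests torch.cuda.device_count() is returning fewer than 3 inside the
# spawned processes, or local_rank is not being picked up correctly.
assert (
    torch.cuda.device_count() > local_rank
), "Insufficient CUDA devices for DDP command."
torch.cuda.set_device(local_rank)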
