Error with distributed mp #3

@jyu-theartofml

Description

Hi, I tried running my code like your example, and I got this error:

  File "artGAN512_impre_v8.py", line 286, in main
    mp.spawn(train, nprocs=args.gpus, args=(args,))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ubuntu/dcgan/artGAN512_impre_v8.py", line 167, in train
    world_size=args.world_size, rank=rank)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Connection timed out

Under my train function, I have:

rank = args.nr * args.gpus + gpu
dist.init_process_group(backend='nccl', init_method='env://',
                        world_size=args.world_size, rank=rank)
torch.manual_seed(0)
torch.cuda.set_device(gpu)

I think it has something to do with os.environ['MASTER_ADDR']. Can you explain how you chose the value for that parameter? I'm using an AWS instance.
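Since init_method='env://' makes init_process_group read MASTER_ADDR and MASTER_PORT from the environment, my guess is I need to set them in the parent process before calling mp.spawn. Something like this is what I'm assuming for a single instance (localhost and port 8888 are just my guesses, not values from your example):

```python
import os

# Assumed single-node setup: with env:// rendezvous, every process
# connects to the TCPStore at MASTER_ADDR:MASTER_PORT. On one AWS
# instance, all GPUs are on the same machine, so localhost should
# resolve; for multiple instances it would be the master node's
# private IP instead.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '8888'  # any free port

# mp.spawn(train, nprocs=args.gpus, args=(args,)) would go here,
# after the environment variables are set.
```

Is that right, or did you use the instance's private IP even for a single node?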

Thanks.
