Error with distributed mp #3

@jyu-theartofml

Description

Hi, I tried running my code like your example, and I got this error:

  File "artGAN512_impre_v8.py", line 286, in main
    mp.spawn(train, nprocs=args.gpus, args=(args,))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ubuntu/dcgan/artGAN512_impre_v8.py", line 167, in train
    world_size=args.world_size, rank=rank)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Connection timed out

Under my train function, I have:

rank = args.nr * args.gpus + gpu
dist.init_process_group(backend='nccl', init_method='env://',
                        world_size=args.world_size, rank=rank)
torch.manual_seed(0)
torch.cuda.set_device(gpu)

I think it has something to do with os.environ['MASTER_ADDR']. Can you explain how you chose the value for that parameter? I'm using an AWS instance.
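Since init_method='env://' makes init_process_group read MASTER_ADDR and MASTER_PORT from the environment, my guess is I need to set them in the parent process before calling mp.spawn. Something like this is what I'm assuming for a single instance (localhost and port 8888 are just my guesses, not values from your example):

```python
import os

# Assumed single-node setup: with env:// rendezvous, every process
# connects to the TCPStore at MASTER_ADDR:MASTER_PORT. On one AWS
# instance, all GPUs are on the same machine, so localhost should
# resolve; for multiple instances it would be the master node's
# private IP instead.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '8888'  # any free port

# mp.spawn(train, nprocs=args.gpus, args=(args,)) would go here,
# after the environment variables are set.
```

Is that right, or did you use the instance's private IP even for a single node?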

Thanks.
