Fix distributed deadlock when experiment log path exists #9
Open
prachitbhike wants to merge 1 commit into YalaLab:main from
Conversation
When a log path already exists and `--resume` is not `"latest"`, only rank 0 returned early while all other ranks hung at the next NCCL collective. Broadcast the exit decision to all ranks before returning so every process exits cleanly. Uses the existing `broadcast_object` utility, matching the pattern already used for `resume_from` at line ~687. Fixes YalaLab#4
Summary
- Only rank 0 returned early when the experiment log path already exists (and `--resume` is not `latest`), leaving all other ranks hung at the next NCCL collective
- Broadcast the exit decision to all ranks with the existing `broadcast_object` utility before returning, so every process exits cleanly

Details
In `src/trainer/main.py`, when `args.log_path` already exists and `--resume` is not `"latest"`, the master rank (rank 0) called `return -1` while all other ranks, already in the NCCL process group via `init_distributed_device()`, hung forever at the next collective operation.

The fix introduces a `should_exit` flag that is broadcast from rank 0 to all ranks before any early return. This matches the existing pattern used for `resume_from` at line ~687.

Non-distributed case: the `if args.distributed` guard skips the broadcast; single-process training returns `-1` exactly as before.

Fixes #4
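
A minimal sketch of the pattern described above (not the exact diff). The `broadcast_object` body, the `args.rank` field, and the function name `maybe_exit_if_log_path_exists` are assumptions for illustration; the real helper and call site live in `src/trainer/main.py` and may differ.

```python
import os

import torch.distributed as dist


def broadcast_object(args, obj, src=0):
    """Assumed stand-in for the existing utility the PR refers to: rank `src`
    sends `obj` to every rank in the default process group."""
    holder = [obj if args.rank == src else None]
    dist.broadcast_object_list(holder, src=src)
    return holder[0]


def maybe_exit_if_log_path_exists(args):
    # Rank 0 decides whether the run should stop: the log path already exists
    # and we are not resuming from the "latest" checkpoint.
    should_exit = False
    if args.rank == 0:
        should_exit = os.path.exists(args.log_path) and args.resume != "latest"

    # Broadcast the decision so every rank takes the same branch, instead of
    # rank 0 returning early while the other ranks block at the next collective.
    if args.distributed:
        should_exit = broadcast_object(args, should_exit)

    if should_exit:
        return -1  # every process returns here, not just rank 0
    return None
```

In the single-process case the `args.distributed` guard skips the broadcast entirely, so the early-return behavior is unchanged.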
Test plan
- Verified a distributed run (`torchrun --nproc_per_node=2`) exits cleanly on all ranks when the experiment log path already exists

🤖 Generated with Claude Code