
Fix distributed deadlock when experiment log path exists #9

Open
prachitbhike wants to merge 1 commit into YalaLab:main from prachitbhike:fix/distributed-deadlock-experiment-exists

Conversation

@prachitbhike

Summary

  • Fixes a distributed training deadlock where only rank 0 exits early when a log path already exists (without --resume latest), leaving all other ranks hung at the next NCCL collective
  • Broadcasts the exit decision to all ranks using the existing broadcast_object utility before returning, so every process exits cleanly

Details

In src/trainer/main.py, when args.log_path already exists and --resume is not "latest", the master rank (rank 0) returned -1, while all other ranks, which had already joined the NCCL process group via init_distributed_device(), hung forever at the next collective operation.
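
For context, a minimal sketch of the failure mode. The exact code in src/trainer/main.py differs; is_master and the printed message are placeholders, while init_distributed_device is the call named above:

```python
import os

def main(args):
    # By this point every rank has joined the NCCL process group.
    device = init_distributed_device(args)

    if os.path.exists(args.log_path) and args.resume != "latest":
        if is_master(args):
            print("Experiment already exists. Use --resume latest to resume.")
            return -1  # only rank 0 leaves the function here
        # Non-master ranks fall through and later block forever at the
        # next collective, because rank 0 never participates in it.
```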

The fix introduces a should_exit flag that is broadcast from rank 0 to all ranks before any early return. This matches the existing pattern used for resume_from at line ~687.

Non-distributed case: The if args.distributed guard skips the broadcast; single-process training returns -1 exactly as before.
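
A sketch of the fixed flow described above, assuming a broadcast_object(args, obj) helper that returns rank 0's value on every rank (the PR names the utility; the exact signature and surrounding code are assumptions):

```python
import os

def main(args):
    device = init_distributed_device(args)

    # Rank 0 decides whether to exit, then the decision is shared with
    # every rank so all processes return together instead of deadlocking.
    should_exit = False
    if is_master(args):
        if os.path.exists(args.log_path) and args.resume != "latest":
            print("Experiment already exists. Use --resume latest to resume.")
            should_exit = True

    if args.distributed:
        # Same broadcast pattern already used for resume_from (~line 687).
        should_exit = broadcast_object(args, should_exit)

    if should_exit:
        return -1  # every rank exits cleanly; single-process runs behave as before
```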

Fixes #4

Test plan

  • Verify single-GPU training still exits cleanly when experiment exists
  • Verify multi-GPU (torchrun --nproc_per_node=2) exits cleanly on all ranks when experiment exists
  • Verify normal training (no pre-existing log) is unaffected

🤖 Generated with Claude Code

When a log path already exists and --resume is not "latest", only
rank 0 returned early while all other ranks hung at the next NCCL
collective. Broadcast the exit decision to all ranks before returning
so every process exits cleanly.

Uses the existing broadcast_object utility, matching the pattern
already used for resume_from at line ~687.

Fixes YalaLab#4
@prachitbhike force-pushed the fix/distributed-deadlock-experiment-exists branch from f224242 to f05e86f on February 6, 2026 at 22:50


Development

Successfully merging this pull request may close these issues.

Deadlock when experiment name already exists - main.py
