Fix hot spare exit handling #266
Conversation
Greptile Summary: This PR enhances hot spare and standby node exit handling to prevent jobs from hanging after successful training completion. The changes introduce three key mechanisms: (1) hot spares waiting at rendezvous now raise RendezvousGracefulExitError when they detect a permanent close, so they exit with code 0; (2) standby nodes check whether the active nodes have already completed training and exit gracefully if so; (3) the rendezvous shutdown sequence runs before the final 3-second TCPStore wait stage.
Issues found:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant HS as Hot Spare
    participant SB as Standby Node
    participant AN as Active Node
    participant RDZ as Rendezvous State
    participant Store as TCPStore
    Note over HS,Store: Hot Spare Exit on Permanent Close
    HS->>RDZ: _wait_for_rendezvous_open()
    RDZ->>Store: check(permanent_close_key)
    Store-->>RDZ: True
    RDZ-->>HS: raise RendezvousGracefulExitError
    HS->>HS: Exit with code 0
    Note over SB,Store: Standby Exit on Training Success
    loop Monitor Loop (HEALTHY state)
        SB->>SB: _maybe_exit_standby_on_success()
        SB->>Store: check(_EXIT_BARRIER_LAST_MEMBER_KEY)
        alt Exit barrier complete
            Store-->>SB: True
            SB->>SB: _on_cycle_end()
            SB-->>SB: raise RendezvousGracefulExitError
            SB->>SB: Exit with code 0
        else Training in progress
            Store-->>SB: False
            SB->>SB: Continue monitoring
        end
    end
    Note over AN,Store: Active Node Success Path
    AN->>AN: Training completes
    AN->>Store: _exit_barrier() sets last_member key
    AN->>RDZ: shutdown()
    RDZ->>Store: set(permanent_close_key)
    AN->>Store: Wait 3s grace period
    AN->>AN: Exit with code 0
```
Last reviewed commit: 24e46e7
```python
# (which requires the store host to have called shutdown(); the store host may be a standby).
_EXIT_BARRIER_LAST_MEMBER_KEY = "torchelastic/agent/terminal_state/last_member"
```
Coupling to PyTorch internal key
This constant duplicates a string from PyTorch's internal _exit_barrier() implementation. If a future PyTorch version changes this key name, standby detection would silently stop working (the store.check would always return False, and standbys would never exit gracefully). Consider adding a comment noting which PyTorch version this was extracted from, or adding a startup-time assertion that validates the key exists after a successful _exit_barrier() call in integration tests.
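The suggested startup-time assertion could be sketched as follows. This is illustrative only: `FakeStore` is a dict-backed stand-in for `torch.distributed.TCPStore`, and a real integration test would run an actual `_exit_barrier()` call instead of the simulated write shown here.

```python
# Duplicated constant under test (copied from PyTorch's internal _exit_barrier()).
_EXIT_BARRIER_LAST_MEMBER_KEY = "torchelastic/agent/terminal_state/last_member"


class FakeStore:
    """Dict-backed stand-in for torch.distributed.TCPStore set()/check()."""

    def __init__(self):
        self._kv = {}

    def set(self, key, value):
        self._kv[key] = value

    def check(self, keys):
        return all(k in self._kv for k in keys)


def exit_barrier_key_is_visible(store):
    """True when the duplicated constant matches a key the barrier wrote."""
    return store.check([_EXIT_BARRIER_LAST_MEMBER_KEY])


# In an integration test, a real _exit_barrier() would perform this write;
# here it is simulated so the check itself can be exercised in isolation.
store = FakeStore()
store.set(_EXIT_BARRIER_LAST_MEMBER_KEY, b"1")
```

If a future PyTorch release renamed the key, the simulated write would no longer match and the assertion would fail loudly at test time instead of standbys silently never exiting.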
src/nvidia_resiliency_ext/fault_tolerance/ft_rendezvous_barrier.py
bb24772 to c5a31cb
1. A hot spare that is waiting for the rendezvous to open should raise a RendezvousGracefulExitError when it detects that the rendezvous has been closed. This ensures the hot spare exits cleanly with a zero (success) exit code.
2. When a hot spare transitions to standby, it should verify whether the active nodes have already completed training successfully. If so, it should exit gracefully instead of remaining alive and forcing the job to be cancelled by SLURM.
3. The rendezvous shutdown sequence should occur before entering the final 3-second TCPStore wait stage.
c5a31cb to ef67113
@greptile
…n check

1. The participant info key arrived_<count> is intentionally not cleared between cycles for performance reasons. To prevent stale data from a previous cycle being misinterpreted, add cycle_id to the participant info tuple so the rendezvous host can distinguish old entries from the current cycle.
2. Simplify the step2 rendezvous completion logic by periodically performing a full re-read of all participant keys. The multi-get overhead for ~10K keys is approximately 28 ms, which is acceptable. If either the arrival count or withdrawn count changes during the read, retry to ensure consistency.
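The cycle_id tagging in point 1 can be sketched as below. The field names match the PR's participant info, but `pack_participant_info` and `current_cycle_participants` are hypothetical helper names used only for illustration.

```python
import json


def pack_participant_info(addr, pid, local_id, infra_rank, domain_id, cycle_id):
    # cycle_id lets the rendezvous host discard slots left over from an
    # earlier cycle, since arrived_<count> keys are not cleared between cycles.
    return json.dumps({
        "addr": addr, "pid": pid, "local_id": local_id,
        "infra_rank": infra_rank, "domain_id": domain_id, "cycle_id": cycle_id,
    }).encode()


def current_cycle_participants(raw_entries, cycle_id):
    """Keep only slots written during the current cycle; stale ones are skipped."""
    kept = []
    for data in raw_entries:
        info = json.loads(data)
        if info.get("cycle_id") == cycle_id:
            kept.append(info)
    return kept
```

A slot from cycle 6 that was never cleared is then simply filtered out when the host gathers participants for cycle 7, rather than being miscounted as a current arrival.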
```diff
 try:
     info = json.loads(data)
     node_desc = _NodeDesc(addr=info["addr"], pid=info["pid"], local_id=info["local_id"])
-    # Support old format without infra_rank field
-    infra_rank = info.get("infra_rank", -1)
-    # Support old format without domain_id field - default to "none"
-    domain_id = info.get("domain_id", "none")
-    return node_desc, infra_rank, domain_id
+    infra_rank = info["infra_rank"]
+    domain_id = info["domain_id"]
+    cycle_id = info["cycle_id"]
+    return node_desc, infra_rank, domain_id, cycle_id
 except (json.JSONDecodeError, KeyError) as e:
     raise ValueError(f"Invalid participant info data: {e}")
```
Removing backward compatibility breaks rolling upgrades and mixed-version scenarios. The old unpack used .get() with defaults for infra_rank, domain_id, and implicitly supported missing cycle_id. Now all fields require strict presence via direct dict access info["cycle_id"], causing KeyError when reading participant data written by older code versions.
During a rolling upgrade where some nodes run old code (without cycle_id) and some run new code, any new-version node calling get_all_participants will crash with ValueError: Invalid participant info data: KeyError('cycle_id') when unpacking old-format slots.
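The failure mode is easy to reproduce in isolation. `unpack` here condenses the strict-access version from the diff (it omits the `_NodeDesc` construction); the payload shape follows the participant info fields shown above.

```python
import json


def unpack(data):
    # Condensed strict-access unpack, as in the new code path.
    try:
        info = json.loads(data)
        return (info["addr"], info["infra_rank"], info["domain_id"], info["cycle_id"])
    except (json.JSONDecodeError, KeyError) as e:
        raise ValueError(f"Invalid participant info data: {e}")


# Payload written by an old-version node: no cycle_id field.
old_payload = json.dumps({"addr": "node0", "pid": 1, "local_id": 0,
                          "infra_rank": 3, "domain_id": "d0"})

try:
    unpack(old_payload)
    crashed = False
except ValueError:
    crashed = True  # a new-version reader rejects the old-format slot
```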
Suggested change (restore `.get()` defaults so old-format slots remain readable):

```python
try:
    info = json.loads(data)
    node_desc = _NodeDesc(addr=info["addr"], pid=info["pid"], local_id=info["local_id"])
    infra_rank = info.get("infra_rank", -1)
    domain_id = info.get("domain_id", "none")
    cycle_id = info.get("cycle_id", 0)
    return node_desc, infra_rank, domain_id, cycle_id
except (json.JSONDecodeError, KeyError) as e:
    raise ValueError(f"Invalid participant info data: {e}")
```
@greptile. No backward compatibility issue.