Fix hot spare exit handling#266

Merged
hexinw-nvidia merged 3 commits into NVIDIA:main from hexinw-nvidia:standby_exit
Feb 19, 2026
Conversation

@hexinw-nvidia
Contributor

  1. A hot spare that is waiting for the rendezvous to open should raise a RendezvousGracefulExitError when it detects that the rendezvous has been closed. This ensures the hot spare exits cleanly with a zero (success) exit code.

  2. When a hot spare transitions to standby, it should verify whether the active nodes have already completed training successfully. If so, it should exit gracefully instead of remaining alive and forcing the job to be cancelled by SLURM.

  3. The rendezvous shutdown sequence should occur before entering the final 3-second TCPStore wait stage.
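The first point can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: the key names (PERMANENT_CLOSE_KEY, OPEN_KEY) are assumptions, and a plain dict stands in for the TCPStore-backed rendezvous state.

```python
class RendezvousGracefulExitError(Exception):
    """Raised so the launcher exits with a zero (success) exit code."""

# Assumed key names for illustration; the real store keys live in
# ft_rendezvous_barrier.py and may differ.
PERMANENT_CLOSE_KEY = "rdzv/permanent_close"
OPEN_KEY = "rdzv/open"

def wait_for_rendezvous_open(store):
    """Poll until the rendezvous opens; raise on permanent closure.

    `store` is any dict-like mapping standing in for the real store.
    """
    while True:
        if store.get(PERMANENT_CLOSE_KEY):
            # Point 1: a hot spare that sees the permanent-close marker
            # exits cleanly instead of waiting forever.
            raise RendezvousGracefulExitError("rendezvous permanently closed")
        if store.get(OPEN_KEY):
            return
```

The launcher would catch RendezvousGracefulExitError at the top level and translate it into a zero exit code.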

@hexinw-nvidia hexinw-nvidia added the ci-approved Approved to run CI label Feb 18, 2026
@greptile-apps
Contributor

greptile-apps bot commented Feb 18, 2026

Greptile Summary

This PR enhances hot spare and standby node exit handling to prevent jobs from hanging after successful training completion. The changes introduce three key mechanisms: (1) hot spares waiting at rendezvous now raise RendezvousGracefulExitError when detecting permanent closure, ensuring clean exit with success code; (2) standby nodes monitor PyTorch's internal exit barrier key to detect when all active nodes complete training successfully, allowing them to exit instead of forcing SLURM cancellation; (3) the rendezvous shutdown sequence now occurs before the TCPStore grace period to ensure state is properly propagated.

Major changes:

  • Added cycle_id field to participant info for stale data detection across rendezvous rounds
  • Renamed is_closed() to is_permanently_closed() to clarify semantics (permanent vs per-round closure)
  • Refactored Step 2 completion logic to always fetch full participant snapshots with cycle validation
  • Introduced _maybe_exit_standby_on_success() to check PyTorch's _EXIT_BARRIER_LAST_MEMBER_KEY
  • Moved shutdown() call before TCPStore grace period in cleanup sequence

Issues found:

  • Critical backward compatibility break in RendezvousParticipantInfo.unpack() - strict schema validation will crash during rolling upgrades when new nodes encounter old-format participant data

Confidence Score: 3/5

  • This PR has a critical backward compatibility issue that will break rolling upgrades
  • Score reflects one critical logic bug (backward compatibility break in unpack) that will cause crashes during rolling upgrades or mixed-version scenarios. The core functionality improvements for hot spare exit handling are well-designed and thoroughly tested, but the breaking schema change prevents safe deployment without coordinated full restarts
  • Pay close attention to ft_rendezvous_barrier.py - the RendezvousParticipantInfo.unpack() method needs .get() with defaults restored

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/fault_tolerance/ft_rendezvous_barrier.py Adds cycle_id to participant info for stale detection, renames is_closed to is_permanently_closed, improves Step 2 completion logic, changes hot spare wait to raise RendezvousGracefulExitError
src/nvidia_resiliency_ext/fault_tolerance/launcher.py Adds standby exit detection via PyTorch exit barrier key, moves shutdown sequence before TCPStore grace period, handles RendezvousGracefulExitError in run loop

Sequence Diagram

sequenceDiagram
    participant HS as Hot Spare
    participant SB as Standby Node
    participant AN as Active Node
    participant RDZ as Rendezvous State
    participant Store as TCPStore

    Note over HS,Store: Hot Spare Exit on Permanent Close
    HS->>RDZ: _wait_for_rendezvous_open()
    RDZ->>Store: check(permanent_close_key)
    Store-->>RDZ: True
    RDZ-->>HS: raise RendezvousGracefulExitError
    HS->>HS: Exit with code 0

    Note over SB,Store: Standby Exit on Training Success
    loop Monitor Loop (HEALTHY state)
        SB->>SB: _maybe_exit_standby_on_success()
        SB->>Store: check(_EXIT_BARRIER_LAST_MEMBER_KEY)
        alt Exit barrier complete
            Store-->>SB: True
            SB->>SB: _on_cycle_end()
            SB-->>SB: raise RendezvousGracefulExitError
            SB->>SB: Exit with code 0
        else Training in progress
            Store-->>SB: False
            SB->>SB: Continue monitoring
        end
    end

    Note over AN,Store: Active Node Success Path
    AN->>AN: Training completes
    AN->>Store: _exit_barrier() sets last_member key
    AN->>RDZ: shutdown()
    RDZ->>Store: set(permanent_close_key)
    AN->>Store: Wait 3s grace period
    AN->>AN: Exit with code 0

Last reviewed commit: 24e46e7


@greptile-apps greptile-apps bot left a comment


4 files reviewed, 2 comments


Comment on lines 108 to 109
# (which requires the store host to have called shutdown(); the store host may be a standby).
_EXIT_BARRIER_LAST_MEMBER_KEY = "torchelastic/agent/terminal_state/last_member"


Coupling to PyTorch internal key

This constant duplicates a string from PyTorch's internal _exit_barrier() implementation. If a future PyTorch version changes this key name, standby detection would silently stop working (the store.check would always return False, and standbys would never exit gracefully). Consider adding a comment noting which PyTorch version this was extracted from, or adding a startup-time assertion that validates the key exists after a successful _exit_barrier() call in integration tests.
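One lightweight way to act on this suggestion, sketched under the assumption that the key is verified by hand against each PyTorch release: record the verified versions and warn at startup when running an unverified one, so a silent upstream rename at least surfaces in logs. The version set and helper name below are hypothetical.

```python
import warnings

_EXIT_BARRIER_LAST_MEMBER_KEY = "torchelastic/agent/terminal_state/last_member"

# Hypothetical: torch releases where the key name was checked by hand
# against the internal _exit_barrier() implementation.
_KEY_VERIFIED_FOR = {"2.4", "2.5", "2.6"}

def warn_if_key_unverified(torch_version: str) -> bool:
    """Return True if the duplicated key was verified for this torch release."""
    major_minor = ".".join(torch_version.split(".")[:2])
    if major_minor in _KEY_VERIFIED_FOR:
        return True
    warnings.warn(
        f"_EXIT_BARRIER_LAST_MEMBER_KEY not verified against torch {torch_version}; "
        "standby exit detection may silently stop working"
    )
    return False
```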


@hexinw-nvidia
Contributor Author

@greptile


@greptile-apps greptile-apps bot left a comment


4 files reviewed, 1 comment


…n check

1. The participant info key arrived_<count> is intentionally not cleared
   between cycles for performance reasons. To prevent stale data from a
   previous cycle being misinterpreted, add cycle_id to the participant info
   tuple so the rendezvous host can distinguish old entries from the current cycle.

2. Simplify the step2 rendezvous completion logic by periodically performing
   a full re-read of all participant keys. The multi-get overhead for ~10K keys
   is approximately 28 ms, which is acceptable. If either the arrival count or
   withdrawn count changes during the read, retry to ensure consistency.

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, no comments



@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment


Comment on lines 306 to 314
    try:
        info = json.loads(data)
        node_desc = _NodeDesc(addr=info["addr"], pid=info["pid"], local_id=info["local_id"])
-       # Support old format without infra_rank field
-       infra_rank = info.get("infra_rank", -1)
-       # Support old format without domain_id field - default to "none"
-       domain_id = info.get("domain_id", "none")
-       return node_desc, infra_rank, domain_id
+       infra_rank = info["infra_rank"]
+       domain_id = info["domain_id"]
+       cycle_id = info["cycle_id"]
+       return node_desc, infra_rank, domain_id, cycle_id
    except (json.JSONDecodeError, KeyError) as e:
        raise ValueError(f"Invalid participant info data: {e}")


Removing backward compatibility breaks rolling upgrades and mixed-version scenarios. The old unpack used .get() with defaults for infra_rank, domain_id, and implicitly supported missing cycle_id. Now all fields require strict presence via direct dict access info["cycle_id"], causing KeyError when reading participant data written by older code versions.

During a rolling upgrade where some nodes run old code (without cycle_id) and some run new code, any new-version node calling get_all_participants will crash with ValueError: Invalid participant info data: KeyError('cycle_id') when unpacking old-format slots.

Suggested change

    try:
        info = json.loads(data)
        node_desc = _NodeDesc(addr=info["addr"], pid=info["pid"], local_id=info["local_id"])
        # Restore .get() defaults so old-format participant data still unpacks
        infra_rank = info.get("infra_rank", -1)
        domain_id = info.get("domain_id", "none")
        cycle_id = info.get("cycle_id", 0)
        return node_desc, infra_rank, domain_id, cycle_id
    except (json.JSONDecodeError, KeyError) as e:
        raise ValueError(f"Invalid participant info data: {e}")

Contributor Author


@greptile. No backward compatibility issue.

@hexinw-nvidia hexinw-nvidia merged commit 1f24208 into NVIDIA:main Feb 19, 2026
9 checks passed