Fix hot spare exit handling#266

Merged
hexinw-nvidia merged 3 commits into NVIDIA:main from hexinw-nvidia:standby_exit
Feb 19, 2026
Conversation

@hexinw-nvidia
Contributor

  1. A hot spare that is waiting for the rendezvous to open should raise a RendezvousGracefulExitError when it detects that the rendezvous has been closed. This ensures the hot spare exits cleanly with a zero (success) exit code.

  2. When a hot spare transitions to standby, it should verify whether the active nodes have already completed training successfully. If so, it should exit gracefully instead of remaining alive and forcing the job to be cancelled by SLURM.

  3. The rendezvous shutdown sequence should occur before entering the final 3-second TCPStore wait stage.
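The first point can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: the key names (PERMANENT_CLOSE_KEY, OPEN_KEY) are assumptions, and a plain dict stands in for the TCPStore-backed rendezvous state.

```python
class RendezvousGracefulExitError(Exception):
    """Raised so the launcher exits with a zero (success) exit code."""

# Assumed key names for illustration; the real store keys live in
# ft_rendezvous_barrier.py and may differ.
PERMANENT_CLOSE_KEY = "rdzv/permanent_close"
OPEN_KEY = "rdzv/open"

def wait_for_rendezvous_open(store):
    """Poll until the rendezvous opens; raise on permanent closure.

    `store` is any dict-like mapping standing in for the real store.
    """
    while True:
        if store.get(PERMANENT_CLOSE_KEY):
            # Point 1: a hot spare that sees the permanent-close marker
            # exits cleanly instead of waiting forever.
            raise RendezvousGracefulExitError("rendezvous permanently closed")
        if store.get(OPEN_KEY):
            return
```

The launcher would catch RendezvousGracefulExitError at the top level and translate it into a zero exit code.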

@hexinw-nvidia hexinw-nvidia added the ci-approved Approved to run CI label Feb 18, 2026
@greptile-apps
Contributor

greptile-apps bot commented Feb 18, 2026

Greptile Summary

This PR enhances hot spare and standby node exit handling to prevent jobs from hanging after successful training completion. The changes introduce three key mechanisms: (1) hot spares waiting at rendezvous now raise RendezvousGracefulExitError when detecting permanent closure, ensuring clean exit with success code; (2) standby nodes monitor PyTorch's internal exit barrier key to detect when all active nodes complete training successfully, allowing them to exit instead of forcing SLURM cancellation; (3) the rendezvous shutdown sequence now occurs before the TCPStore grace period to ensure state is properly propagated.

Major changes:

  • Added cycle_id field to participant info for stale data detection across rendezvous rounds
  • Renamed is_closed() to is_permanently_closed() to clarify semantics (permanent vs per-round closure)
  • Refactored Step 2 completion logic to always fetch full participant snapshots with cycle validation
  • Introduced _maybe_exit_standby_on_success() to check PyTorch's _EXIT_BARRIER_LAST_MEMBER_KEY
  • Moved shutdown() call before TCPStore grace period in cleanup sequence

Issues found:

  • Critical backward compatibility break in RendezvousParticipantInfo.unpack() - strict schema validation will crash during rolling upgrades when new nodes encounter old-format participant data

Confidence Score: 3/5

  • This PR has a critical backward compatibility issue that will break rolling upgrades
  • Score reflects one critical logic bug (backward compatibility break in unpack) that will cause crashes during rolling upgrades or mixed-version scenarios. The core functionality improvements for hot spare exit handling are well-designed and thoroughly tested, but the breaking schema change prevents safe deployment without coordinated full restarts
  • Pay close attention to ft_rendezvous_barrier.py - the RendezvousParticipantInfo.unpack() method needs .get() with defaults restored

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/fault_tolerance/ft_rendezvous_barrier.py Adds cycle_id to participant info for stale detection, renames is_closed to is_permanently_closed, improves Step 2 completion logic, changes hot spare wait to raise RendezvousGracefulExitError
src/nvidia_resiliency_ext/fault_tolerance/launcher.py Adds standby exit detection via PyTorch exit barrier key, moves shutdown sequence before TCPStore grace period, handles RendezvousGracefulExitError in run loop

Sequence Diagram

sequenceDiagram
    participant HS as Hot Spare
    participant SB as Standby Node
    participant AN as Active Node
    participant RDZ as Rendezvous State
    participant Store as TCPStore

    Note over HS,Store: Hot Spare Exit on Permanent Close
    HS->>RDZ: _wait_for_rendezvous_open()
    RDZ->>Store: check(permanent_close_key)
    Store-->>RDZ: True
    RDZ-->>HS: raise RendezvousGracefulExitError
    HS->>HS: Exit with code 0

    Note over SB,Store: Standby Exit on Training Success
    loop Monitor Loop (HEALTHY state)
        SB->>SB: _maybe_exit_standby_on_success()
        SB->>Store: check(_EXIT_BARRIER_LAST_MEMBER_KEY)
        alt Exit barrier complete
            Store-->>SB: True
            SB->>SB: _on_cycle_end()
            SB-->>SB: raise RendezvousGracefulExitError
            SB->>SB: Exit with code 0
        else Training in progress
            Store-->>SB: False
            SB->>SB: Continue monitoring
        end
    end

    Note over AN,Store: Active Node Success Path
    AN->>AN: Training completes
    AN->>Store: _exit_barrier() sets last_member key
    AN->>RDZ: shutdown()
    RDZ->>Store: set(permanent_close_key)
    AN->>Store: Wait 3s grace period
    AN->>AN: Exit with code 0

Last reviewed commit: 24e46e7


@greptile-apps greptile-apps bot left a comment


4 files reviewed, 2 comments


Comment on lines 108 to 109
# (which requires the store host to have called shutdown(); the store host may be a standby).
_EXIT_BARRIER_LAST_MEMBER_KEY = "torchelastic/agent/terminal_state/last_member"


Coupling to PyTorch internal key

This constant duplicates a string from PyTorch's internal _exit_barrier() implementation. If a future PyTorch version changes this key name, standby detection would silently stop working (the store.check would always return False, and standbys would never exit gracefully). Consider adding a comment noting which PyTorch version this was extracted from, or adding a startup-time assertion that validates the key exists after a successful _exit_barrier() call in integration tests.
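One lightweight way to act on this suggestion, sketched under the assumption that the key is verified by hand against each PyTorch release: record the verified versions and warn at startup when running an unverified one, so a silent upstream rename at least surfaces in logs. The version set and helper name below are hypothetical.

```python
import warnings

_EXIT_BARRIER_LAST_MEMBER_KEY = "torchelastic/agent/terminal_state/last_member"

# Hypothetical: torch releases where the key name was checked by hand
# against the internal _exit_barrier() implementation.
_KEY_VERIFIED_FOR = {"2.4", "2.5", "2.6"}

def warn_if_key_unverified(torch_version: str) -> bool:
    """Return True if the duplicated key was verified for this torch release."""
    major_minor = ".".join(torch_version.split(".")[:2])
    if major_minor in _KEY_VERIFIED_FOR:
        return True
    warnings.warn(
        f"_EXIT_BARRIER_LAST_MEMBER_KEY not verified against torch {torch_version}; "
        "standby exit detection may silently stop working"
    )
    return False
```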


@hexinw-nvidia
Contributor Author

@greptile


@greptile-apps greptile-apps bot left a comment


4 files reviewed, 1 comment


…n check

1. The participant info key arrived_<count> is intentionally not cleared
   between cycles for performance reasons. To prevent stale data from a
   previous cycle being misinterpreted, add cycle_id to the participant info
   tuple so the rendezvous host can distinguish old entries from the current cycle.

2. Simplify the step2 rendezvous completion logic by periodically performing
   a full re-read of all participant keys. The multi-get overhead for ~10K keys
   is approximately 28 ms, which is acceptable. If either the arrival count or
   withdrawn count changes during the read, retry to ensure consistency.

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, no comments



@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment


Comment on lines 306 to 314
    try:
        info = json.loads(data)
        node_desc = _NodeDesc(addr=info["addr"], pid=info["pid"], local_id=info["local_id"])
-       # Support old format without infra_rank field
-       infra_rank = info.get("infra_rank", -1)
-       # Support old format without domain_id field - default to "none"
-       domain_id = info.get("domain_id", "none")
-       return node_desc, infra_rank, domain_id
+       infra_rank = info["infra_rank"]
+       domain_id = info["domain_id"]
+       cycle_id = info["cycle_id"]
+       return node_desc, infra_rank, domain_id, cycle_id
    except (json.JSONDecodeError, KeyError) as e:
        raise ValueError(f"Invalid participant info data: {e}")


Removing backward compatibility breaks rolling upgrades and mixed-version scenarios. The old unpack used .get() with defaults for infra_rank, domain_id, and implicitly supported missing cycle_id. Now all fields require strict presence via direct dict access info["cycle_id"], causing KeyError when reading participant data written by older code versions.

During a rolling upgrade where some nodes run old code (without cycle_id) and some run new code, any new-version node calling get_all_participants will crash with ValueError: Invalid participant info data: KeyError('cycle_id') when unpacking old-format slots.

Suggested change

    try:
        info = json.loads(data)
        node_desc = _NodeDesc(addr=info["addr"], pid=info["pid"], local_id=info["local_id"])
        # Restore .get() defaults so old-format participant data still unpacks
        infra_rank = info.get("infra_rank", -1)
        domain_id = info.get("domain_id", "none")
        cycle_id = info.get("cycle_id", 0)
        return node_desc, infra_rank, domain_id, cycle_id
    except (json.JSONDecodeError, KeyError) as e:
        raise ValueError(f"Invalid participant info data: {e}")

Contributor Author


@greptile. No backward compatibility issue.

@hexinw-nvidia hexinw-nvidia merged commit 1f24208 into NVIDIA:main Feb 19, 2026
9 checks passed