Add cycle tracking and REST API for failure attribution by hexinw-nvidia · Pull Request #217 · NVIDIA/nvidia-resiliency-ext

hexinw-nvidia · 2025-11-04T00:26:48Z

Implement a cycle management system that tracks fault tolerance cycles and lets external failure attribution modules report failure reasons through a REST API. The launcher uses the reported reasons to exit early when a failure is non-recoverable (for example, a hardware fault).

Key changes:

Built-in HTTP Server:
- Add built-in HTTP server (default port 2025)
- Enabled by default; disable with --no-http-server
- Implemented in launcher_server.py using werkzeug
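
As a rough illustration of the opt-out behavior described above, the following sketch parses `--no-http-server` with `argparse`. Only the flag name and the default port come from this PR; `build_parser`, `http_server_port`, and `DEFAULT_HTTP_PORT` are hypothetical names, and the real launcher's argument handling may look quite different.

```python
import argparse
from typing import List, Optional

# Default port stated in the PR description.
DEFAULT_HTTP_PORT = 2025


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: only --no-http-server is taken from the PR."""
    parser = argparse.ArgumentParser(prog="launcher-sketch")
    parser.add_argument(
        "--no-http-server",
        dest="http_server",
        action="store_false",  # the server stays enabled unless the flag is given
        help="disable the built-in HTTP server",
    )
    return parser


def http_server_port(argv: List[str]) -> Optional[int]:
    """Return the port to listen on, or None when the server is disabled."""
    args = build_parser().parse_args(argv)
    return DEFAULT_HTTP_PORT if args.http_server else None
```

With no arguments the sketch returns the default port; passing `--no-http-server` returns `None`, which a caller could treat as "do not start the server".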

Cycle Management System:
- Add CycleManager singleton to maintain LRU cache of recent cycles
- Add Cycle class to store profiling events, failure reasons, and metadata
- Store all profiling events (WORKER_START_STARTED, WORKER_START_COMPLETED, FAILURE_DETECTED, etc.) within cycle objects with timestamps
- Support negative indexing for cycle queries (e.g., -1 for last cycle)
- Implement check_recent_cycles_for_exit() to detect non-recoverable failures (GPU_HW_FAILURE, MEMORY_HW_FAILURE, etc.)
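
The cycle bookkeeping above could look roughly like the sketch below. It is not the code in this PR: the singleton behavior is omitted, a bounded oldest-first cache stands in for the LRU cache, and `max_cycles`, `start_cycle`, `get_cycle`, and `lookback` are assumed names. The set of non-recoverable reasons lists only the two the description names explicitly.

```python
import time
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Only these two reasons are named in the PR; the real list may be longer.
NON_RECOVERABLE = {"GPU_HW_FAILURE", "MEMORY_HW_FAILURE"}


@dataclass
class Cycle:
    cycle_number: int
    events: List[Dict[str, Any]] = field(default_factory=list)
    failure_reason: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def add_event(self, name: str, timestamp: Optional[float] = None) -> None:
        """Store a profiling event together with its timestamp."""
        stamp = time.time() if timestamp is None else timestamp
        self.events.append({"event": name, "timestamp": stamp})


class CycleManager:
    """Bounded cache of recent cycles (oldest entry evicted first)."""

    def __init__(self, max_cycles: int = 100) -> None:
        self._max_cycles = max_cycles
        self._cycles: "OrderedDict[int, Cycle]" = OrderedDict()

    def start_cycle(self, cycle_number: int) -> Cycle:
        cycle = Cycle(cycle_number)
        self._cycles[cycle_number] = cycle
        self._cycles.move_to_end(cycle_number)
        while len(self._cycles) > self._max_cycles:
            self._cycles.popitem(last=False)  # drop the oldest cycle
        return cycle

    def get_cycle(self, cycle_number: int) -> Optional[Cycle]:
        """Look up a cycle; negative numbers index from the most recent."""
        if cycle_number < 0:
            keys = list(self._cycles)
            if -cycle_number > len(keys):
                return None
            return self._cycles[keys[cycle_number]]
        return self._cycles.get(cycle_number)

    def check_recent_cycles_for_exit(self, lookback: int = 1) -> bool:
        """True if any of the last `lookback` cycles has a non-recoverable reason."""
        recent = list(self._cycles.values())[-max(1, lookback):]
        return any(c.failure_reason in NON_RECOVERABLE for c in recent)
```

In this sketch `get_cycle(-1)` returns the most recently started cycle, and evicted cycles simply return `None`.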
REST API Endpoints:
- Add GET / endpoint returning launcher_start_time
- Add GET /cycles endpoint to query all or specific cycles with events
  - Support query parameter: /cycles?cycle_number=3
  - Support negative indexing: /cycles?cycle_number=-1
- Add POST /cycles endpoint for external modules to update cycle failure_reason and metadata (requires existing cycle)
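
To make the endpoint semantics above concrete, here is a framework-agnostic sketch of the routing logic. It deliberately avoids werkzeug so it runs on the standard library alone. The paths, the `cycle_number` parameter, and the `failure_reason` and `metadata` fields come from the description; the response shapes, status codes, and the names `resolve_cycle` and `handle_request` are assumptions.

```python
import json
from typing import Any, Dict, Optional, Tuple


def resolve_cycle(cycles: Dict[int, dict], cycle_number: int) -> Optional[dict]:
    """Find a cycle by number; negative values count back from the latest."""
    if cycle_number < 0:
        ordered = sorted(cycles)
        if -cycle_number > len(ordered):
            return None
        return cycles[ordered[cycle_number]]
    return cycles.get(cycle_number)


def handle_request(
    method: str,
    path: str,
    query: Dict[str, str],
    body: Optional[str],
    cycles: Dict[int, dict],
    launcher_start_time: float,
) -> Tuple[int, Any]:
    """Return (status, payload) for the three endpoints in the description."""
    if method == "GET" and path == "/":
        return 200, {"launcher_start_time": launcher_start_time}
    if path != "/cycles" or method not in ("GET", "POST"):
        return 404, {"error": "not found"}
    if method == "GET":
        if "cycle_number" not in query:
            return 200, [cycles[n] for n in sorted(cycles)]  # all cycles
        raw = query["cycle_number"]
    else:
        payload = json.loads(body or "{}")
        raw = payload.get("cycle_number")
    try:
        number = int(raw)
    except (TypeError, ValueError):
        return 400, {"error": "cycle_number must be an integer"}
    cycle = resolve_cycle(cycles, number)
    if cycle is None:
        # POST only updates a cycle that already exists (assumed 404 here).
        return 404, {"error": f"cycle {number} does not exist"}
    if method == "POST":
        if "failure_reason" in payload:
            cycle["failure_reason"] = payload["failure_reason"]
        cycle.setdefault("metadata", {}).update(payload.get("metadata", {}))
    return 200, cycle
```

An external attribution module would, under these assumptions, POST a JSON body such as `{"cycle_number": -1, "failure_reason": "GPU_HW_FAILURE"}` to `/cycles` to annotate the latest cycle.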
Profiler Refactoring:
- Remove global profiler singleton and record_profiling_event() function
- Make FaultToleranceProfiler an instance variable of LocalElasticAgent
- Pass profiler to rendezvous handlers via set_profiler() method
- Store all profiling events in cycle objects via cycle.add_event()
- Remove explicit cycle_start_time tracking (derived from events)
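
The shape of this refactoring can be sketched with simplified stand-ins. The classes below are not the real `FaultToleranceProfiler`, `LocalElasticAgent`, or rendezvous handlers; every `Sketch*` name and the `record_event` and `on_event` methods are hypothetical. The sketch shows only the three ideas from the list: the agent owns its profiler, handlers receive it through `set_profiler()`, and the cycle start time is derived from the earliest stored event.

```python
from typing import Any, Dict, List, Optional


class SketchCycle:
    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def add_event(self, name: str, timestamp: float) -> None:
        self.events.append({"event": name, "timestamp": timestamp})

    @property
    def start_time(self) -> Optional[float]:
        # Derived from events rather than tracked in a separate field.
        return min((e["timestamp"] for e in self.events), default=None)


class SketchProfiler:
    """Writes every event into the current cycle object."""

    def __init__(self, cycle: SketchCycle) -> None:
        self._cycle = cycle

    def record_event(self, name: str, timestamp: float) -> None:
        self._cycle.add_event(name, timestamp)


class SketchRendezvousHandler:
    def __init__(self) -> None:
        self._profiler: Optional[SketchProfiler] = None

    def set_profiler(self, profiler: SketchProfiler) -> None:
        self._profiler = profiler

    def on_event(self, name: str, timestamp: float) -> None:
        if self._profiler is not None:  # no global fallback
            self._profiler.record_event(name, timestamp)


class SketchAgent:
    """Owns the profiler as an instance attribute and hands it to handlers."""

    def __init__(self, cycle: SketchCycle, handler: SketchRendezvousHandler) -> None:
        self.profiler = SketchProfiler(cycle)
        handler.set_profiler(self.profiler)
```

Because nothing is global here, two agents in the same process would record into separate cycles, which is one plausible motivation for removing the singleton.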
Launcher Integration:
- Integrate cycle check in _monitor_workers() with 5-second throttling
- Set _remaining_restarts=0 when a non-recoverable failure is detected
- Prevent restart attempts for hardware failures that won't recover
- Exit the job early when an external module reports a non-recoverable failure
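
The throttled check described above might be structured as in the sketch below. `RestartGate`, `poll`, and the injectable `clock` are hypothetical names used to keep the example testable; only the 5-second interval and the idea of zeroing the remaining restarts come from this PR, and the real `_monitor_workers()` logic is not shown here.

```python
import time
from typing import Callable, Optional


class RestartGate:
    """Runs an exit check at most once per interval and cancels restarts on a hit."""

    def __init__(
        self,
        check_fn: Callable[[], bool],
        remaining_restarts: int,
        interval_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check_fn = check_fn
        self.remaining_restarts = remaining_restarts
        self._interval_s = interval_s
        self._clock = clock
        self._last_check: Optional[float] = None

    def poll(self) -> bool:
        """Return True when a non-recoverable failure was just detected."""
        now = self._clock()
        if self._last_check is not None and now - self._last_check < self._interval_s:
            return False  # throttled: too soon since the previous check
        self._last_check = now
        if self._check_fn():
            self.remaining_restarts = 0  # no further restart attempts
            return True
        return False
```

A monitoring loop could call `poll()` on every iteration; the throttle keeps the cost of the cycle check bounded even when the loop spins quickly.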

hexinw-nvidia wants to merge 1 commit into NVIDIA:main from hexinw-nvidia:grpc
