Add cycle tracking and REST API for failure attribution #217
Draft
hexinw-nvidia wants to merge 1 commit into NVIDIA:main from
Implement a comprehensive cycle management system to track fault tolerance
cycles and enable external failure attribution modules to report failure
reasons via REST API. The launcher uses this information to make early exit
decisions for non-recoverable hardware failures.
Key changes:
* Add built-in HTTP server (default port 2025).
  - Enabled by default, disable with --no-http-server
  - Implemented in launcher_server.py using werkzeug
* Cycle Management System:
  - Add CycleManager singleton to maintain LRU cache of recent cycles
  - Add Cycle class to store profiling events, failure reasons, and metadata
  - Store all profiling events (WORKER_START_STARTED, WORKER_START_COMPLETED,
    FAILURE_DETECTED, etc.) within cycle objects with timestamps
  - Support negative indexing for cycle queries (e.g., -1 for last cycle)
  - Implement check_recent_cycles_for_exit() to detect non-recoverable
    failures (GPU_HW_FAILURE, MEMORY_HW_FAILURE, etc.)
* REST API Endpoints:
  - Add GET / endpoint returning launcher_start_time
  - Add GET /cycles endpoint to query all or specific cycles with events
  - Support query parameter: /cycles?cycle_number=3
  - Support negative indexing: /cycles?cycle_number=-1
  - Add POST /cycles endpoint for external modules to update cycle
    failure_reason and metadata (requires existing cycle)
* Profiler Refactoring:
  - Remove global profiler singleton and record_profiling_event() function
  - Make FaultToleranceProfiler an instance variable of LocalElasticAgent
  - Pass profiler to rendezvous handlers via set_profiler() method
  - All profiling events now stored in cycle objects via cycle.add_event()
  - Remove explicit cycle_start_time tracking (derived from events)
* Launcher Integration:
  - Integrate cycle check in _monitor_workers() with 5-second throttling
  - Set _remaining_restarts=0 when non-recoverable failure detected
  - Prevent restart attempts for hardware failures that won't recover
  - Early exit job when non-recoverable failure reported by external module
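
For reviewers, the Cycle Management System items above could look roughly like the sketch below. It is not the code in this PR: only the names `Cycle`, `CycleManager`, `add_event()` and `check_recent_cycles_for_exit()` come from the description, while `start_cycle()`, `get_cycle()`, `max_cycles`, `lookback` and the field layout are assumptions made for illustration. A bounded, insertion-ordered cache stands in for the LRU cache, the singleton wiring is omitted, and the non-recoverable set lists only the two reasons named above.

```python
import time
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Optional

# Only the two reasons named in the description; the real set is longer ("etc.").
NON_RECOVERABLE = {"GPU_HW_FAILURE", "MEMORY_HW_FAILURE"}


@dataclass
class Cycle:
    cycle_number: int
    events: list = field(default_factory=list)   # (event_name, timestamp) pairs
    failure_reason: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    def add_event(self, name: str, timestamp: Optional[float] = None) -> None:
        """Record a profiling event such as WORKER_START_STARTED."""
        ts = time.time() if timestamp is None else timestamp
        self.events.append((name, ts))

    @property
    def start_time(self) -> Optional[float]:
        """Derived from events instead of being tracked explicitly."""
        return min(ts for _, ts in self.events) if self.events else None


class CycleManager:
    """Keeps the most recent cycles; the oldest one is evicted first."""

    def __init__(self, max_cycles: int = 8):  # hypothetical capacity
        self._cycles = OrderedDict()
        self._max_cycles = max_cycles

    def start_cycle(self, cycle_number: int) -> Cycle:  # hypothetical helper
        cycle = Cycle(cycle_number)
        self._cycles[cycle_number] = cycle
        while len(self._cycles) > self._max_cycles:
            self._cycles.popitem(last=False)  # drop the oldest cycle
        return cycle

    def get_cycle(self, cycle_number: int) -> Optional[Cycle]:
        """Non-negative numbers are cycle ids; -1 is the last cycle, and so on."""
        if cycle_number < 0:
            ordered = list(self._cycles.values())
            if -cycle_number > len(ordered):
                return None
            return ordered[cycle_number]
        return self._cycles.get(cycle_number)

    def check_recent_cycles_for_exit(self, lookback: int = 1) -> bool:
        """True if any of the last `lookback` cycles reports a non-recoverable failure."""
        recent = list(self._cycles.values())[-lookback:]
        return any(c.failure_reason in NON_RECOVERABLE for c in recent)
```

Keeping the cache insertion-ordered makes negative indexing unambiguous: -1 always means the most recently started cycle, regardless of which cycle was read last.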
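
The REST API endpoints above can be illustrated with a framework-free sketch of the request handling. The PR implements the server with werkzeug in launcher_server.py; the function below deliberately avoids werkzeug and only models the routing rules from the description. The `handle()` and `resolve_cycle()` names, the JSON field layout, the response shapes and the status codes are all assumptions, not the PR's actual contract.

```python
import json
from urllib.parse import parse_qs, urlsplit


def resolve_cycle(cycles, cycle_number):
    """Map a cycle id (or a negative index such as -1) to a stored cycle, else None."""
    if cycle_number < 0:
        keys = sorted(cycles)
        if -cycle_number > len(keys):
            return None
        return cycles[keys[cycle_number]]
    return cycles.get(cycle_number)


def handle(method, url, cycles, launcher_start_time, body=None):
    """Return (status, payload) for one request; `cycles` maps cycle number to a dict."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)

    if method == "GET" and parts.path == "/":
        return 200, {"launcher_start_time": launcher_start_time}
    if parts.path != "/cycles":
        return 404, {"error": "not found"}

    if method == "GET":
        if "cycle_number" not in query:
            # No filter: return every cached cycle in order.
            return 200, {"cycles": [cycles[k] for k in sorted(cycles)]}
        try:
            number = int(query["cycle_number"][0])
        except ValueError:
            return 400, {"error": "cycle_number must be an integer"}
        cycle = resolve_cycle(cycles, number)
        if cycle is None:
            return 404, {"error": "unknown cycle"}
        return 200, {"cycles": [cycle]}

    if method == "POST":
        payload = json.loads(body or "{}")
        if "cycle_number" not in payload:
            return 400, {"error": "cycle_number is required"}
        # Updates only apply to a cycle that already exists.
        cycle = resolve_cycle(cycles, int(payload["cycle_number"]))
        if cycle is None:
            return 404, {"error": "unknown cycle"}
        if "failure_reason" in payload:
            cycle["failure_reason"] = payload["failure_reason"]
        cycle["metadata"].update(payload.get("metadata", {}))
        return 200, cycle

    return 405, {"error": "method not allowed"}
```

Under these assumptions, an external attribution module would report a hardware fault by POSTing something like `{"cycle_number": -1, "failure_reason": "GPU_HW_FAILURE"}` to `/cycles` on the launcher's HTTP port.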
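
The Launcher Integration items above amount to a throttled check inside the monitor loop. A minimal sketch, assuming a hypothetical `RestartGate` helper in place of the real `_monitor_workers()` logic: `should_exit` stands for a call such as `check_recent_cycles_for_exit()`, and the injectable `clock` exists only to make the throttling testable.

```python
import time


class RestartGate:
    """Runs the cycle check at most once per interval and zeroes the restart budget."""

    def __init__(self, should_exit, max_restarts, interval_s=5.0, clock=time.monotonic):
        self._should_exit = should_exit      # callable returning True on non-recoverable failure
        self._interval_s = interval_s        # 5-second throttle from the description
        self._clock = clock
        self._last_check = None
        self.remaining_restarts = max_restarts

    def poll(self):
        """Call from every monitor iteration; returns True when the check actually ran."""
        now = self._clock()
        if self._last_check is not None and now - self._last_check < self._interval_s:
            return False                     # throttled, skip the check this time
        self._last_check = now
        if self._should_exit():
            # No point restarting workers on hardware that will not recover.
            self.remaining_restarts = 0
        return True
```

With the restart budget at zero, the next worker failure ends the job instead of triggering another restart attempt, which is the early-exit behaviour the description refers to.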