feat: add non-retryable exception pattern matching #212

Draft
hexinw-nvidia wants to merge 1 commit into NVIDIA:main from hexinw-nvidia:stop_retry

@hexinw-nvidia
Contributor

Add --ft-non-retryable-exception-file to mark a node unhealthy when its workers fail with an exception matching a configured pattern (e.g., config errors). This avoids restarting on errors that a retry cannot fix.

Implementation:

  • Workers write full tracebacks via sys.excepthook to error files
  • Launcher checks error files against configured patterns on worker failure
  • Nodes with matching exceptions increment unhealthy_count and exit
  • Rendezvous uses unhealthy_count to decide whether the job can continue
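
The flow above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names `install_error_file_hook`, `find_non_retryable`, and `job_can_continue` are hypothetical, plain substring matching is an assumption, and the `min_nodes` comparison is a guess at how rendezvous might use unhealthy_count.

```python
import sys
import traceback
from pathlib import Path
from typing import Optional, Sequence


def install_error_file_hook(error_file: Path) -> None:
    """Worker side (hypothetical helper): save uncaught tracebacks to error_file."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc_value, exc_tb):
        # Write the full traceback so the launcher can inspect it later.
        text = "".join(traceback.format_exception(exc_type, exc_value, exc_tb))
        error_file.write_text(text)
        # Keep the default behaviour (printing to stderr) intact.
        previous_hook(exc_type, exc_value, exc_tb)

    sys.excepthook = hook


def find_non_retryable(error_file: Path, patterns: Sequence[str]) -> Optional[str]:
    """Launcher side (hypothetical helper): return the first pattern found, else None."""
    if not error_file.exists():
        return None
    text = error_file.read_text()
    for pattern in patterns:
        # Assumption: patterns are matched as plain substrings.
        if pattern in text:
            return pattern
    return None


def job_can_continue(total_nodes: int, unhealthy_count: int, min_nodes: int) -> bool:
    """Rendezvous side (assumed rule): enough healthy nodes must remain."""
    return total_nodes - unhealthy_count >= min_nodes
```

In this sketch the launcher would call `find_non_retryable` after a worker exits with a failure, and on a match would increment unhealthy_count and exit instead of restarting the worker.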

Example: Configure a pattern such as "insufficient shared memory (shm)" to stop retrying on a configuration error.
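
How a pattern like this is matched matters, because it contains parentheses. The sketch below is an illustration under assumptions, not the PR's behaviour: the one-pattern-per-line file format with `#` comments and the helper names `parse_patterns`, `matches_literal`, and `matches_regex` are hypothetical. It shows that the example pattern matches as a literal substring, but would need escaping if patterns were interpreted as regular expressions.

```python
import re
from typing import List


def parse_patterns(file_text: str) -> List[str]:
    """Assumed file format: one pattern per line, blank lines and '#' comments skipped."""
    patterns = []
    for line in file_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def matches_literal(pattern: str, traceback_text: str) -> bool:
    # Plain substring test: parentheses have no special meaning.
    return pattern in traceback_text


def matches_regex(pattern: str, traceback_text: str) -> bool:
    # Regex test: "(shm)" is a capture group, so it matches "shm" without parentheses.
    return re.search(pattern, traceback_text) is not None


pattern_file = """
# configuration errors that a retry cannot fix
insufficient shared memory (shm)
"""
patterns = parse_patterns(pattern_file)
error_text = "RuntimeError: insufficient shared memory (shm)"

literal_hit = matches_literal(patterns[0], error_text)
regex_hit = matches_regex(patterns[0], error_text)
escaped_hit = matches_regex(re.escape(patterns[0]), error_text)
```

Here `literal_hit` and `escaped_hit` are true while `regex_hit` is false, which is why the matching mode is worth stating explicitly in the option's documentation.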

@hexinw-nvidia hexinw-nvidia marked this pull request as draft February 11, 2026 18:20