Deprecate ft-restart-policy (min-healthy) #259

Draft
hexinw-nvidia wants to merge 1 commit into NVIDIA:main from hexinw-nvidia:cleanup_minhealthy

@hexinw-nvidia
Contributor

Remove inprocess+injob example

The previous ft-restart-policy design exposed two separate restart levels: InJob (launcher) and InProcess. That confused users, who expect a single restart per training cycle: either some processes stay up across the restart, or they are restarted for the next cycle. The current logic does not match that model and needs a proper re-design.

  • Deprecate --ft-restart-policy: only any-failed is supported; mark option deprecated and remove min-healthy implementation.
  • Remove min-healthy code path (_invoke_run_with_min_healthy_policy) and always use any-failed behavior; set upscaling_enabled=True.
  • Remove in_job_and_in_process example (script, Python example, and doc) and references; document that injob+inprocess integration is under re-evaluation.
  • Update docs (usage_guide, inprocess integration, examples toctree) and inprocess usage_guide to drop min-healthy and example references.

Removing this code simplifies the codebase and makes the intended restart model easier to reason about before a future re-design.
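
For reviewers, the deprecation described above could be sketched roughly as follows. This is not the launcher code from this PR: `build_parser` and `resolve_restart_policy` are hypothetical names, and rejecting values other than any-failed (rather than silently ignoring them) is an assumption made only for illustration.

```python
import argparse
import warnings

# The only policy that remains supported, per the PR description.
SUPPORTED_POLICY = "any-failed"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real launcher defines many more options."""
    parser = argparse.ArgumentParser(prog="launcher-sketch")
    parser.add_argument(
        "--ft-restart-policy",
        dest="ft_restart_policy",
        default=None,  # None means the user did not pass the option
        help="[DEPRECATED] Only 'any-failed' is supported.",
    )
    return parser


def resolve_restart_policy(argv: list[str]) -> str:
    """Return the effective restart policy, warning if the option is used."""
    args = build_parser().parse_args(argv)
    value = args.ft_restart_policy
    if value is not None:
        warnings.warn(
            "--ft-restart-policy is deprecated; only 'any-failed' is supported.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: unsupported values are rejected instead of ignored.
        if value != SUPPORTED_POLICY:
            raise ValueError(f"Unsupported restart policy: {value!r}")
    return SUPPORTED_POLICY
```

With this shape, omitting the option and passing `--ft-restart-policy any-failed` both resolve to the same behavior; the only difference is the deprecation warning.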

@hexinw-nvidia hexinw-nvidia marked this pull request as draft February 10, 2026 22:36