Conversation

@relh relh commented Dec 28, 2025

Summary

This PR reduces training startup latency by overlapping slow preflight work (AWS/SSO checkpoint routing + optional stats client init) with environment/policy construction. It also makes mission lookup lazy/cached, and reduces the default active curriculum task count to avoid unnecessary work on startup.

The only “save/load interface” change in this PR is the introduction of a PolicyStorageDecision object used to decide where checkpoints are saved and how they’re resolved during launch. The on‑disk checkpoint format is unchanged.

Motivation

Startup was spending time in a few places that don’t need to be on the critical path:

  • AWS/SSO checks for remote checkpoint routing
  • eager loading of mission registries
  • very large default curriculum task counts

These changes either overlap those costs with other startup work or defer them entirely, so training can begin faster without changing functional behavior.

What Changed (High Level)

  • Preflight executor in TrainTool to overlap storage decision + stats client init with env/policy init
  • PolicyStorageDecision passed into CheckpointManager to avoid recomputing AWS/SSO routing
  • Lazy/cached mission registries and cleaner mission resolution (core + evals)
  • Default curriculum active tasks lowered (10000 → 64)
  • Lazy imports in heavy recipe entrypoints

“Save/Load Interface” (Checkpoint Routing)

PolicyStorageDecision (new data object)

metta/tools/utils/auto_config.py

  • Captures where policies should be saved:
    • base_prefix: base remote prefix (e.g., s3://...) or None
    • remote_prefix: run‑specific prefix (base/run) or None
    • reason: why remote is or isn’t used
  • using_remote indicates when remote storage is active
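The fields above suggest a small frozen dataclass. A minimal sketch (illustrative only; the real definition in metta/tools/utils/auto_config.py may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PolicyStorageDecision:
    """Where policies should be saved, and why (sketch)."""
    base_prefix: Optional[str]    # base remote prefix, e.g. "s3://..." or None
    remote_prefix: Optional[str]  # run-specific prefix ("base/run") or None
    reason: str                   # why remote storage is or isn't used

    @property
    def using_remote(self) -> bool:
        # Remote storage is active whenever a run-specific prefix was resolved.
        return self.remote_prefix is not None

decision = PolicyStorageDecision(
    base_prefix="s3://bucket/policies",
    remote_prefix="s3://bucket/policies/my_run",
    reason="AWS SSO session active",
)
```

Making it frozen keeps the decision immutable once computed, which matters since it is produced on a background thread and consumed later.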

CheckpointManager now accepts a decision

metta/rl/checkpoint_manager.py

  • CheckpointManager(..., storage_decision=...) uses the precomputed decision to set _remote_prefix
  • Avoids recomputing AWS/SSO checks during startup
  • Save/load behavior remains the same; only the decision is centralized
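In outline, the constructor prefers the precomputed decision and only falls back to the slow path when none is provided (a sketch; the actual signature in metta/rl/checkpoint_manager.py may differ):

```python
from types import SimpleNamespace

class CheckpointManager:
    """Sketch of a manager that accepts a precomputed routing decision."""

    def __init__(self, run_name, storage_decision=None):
        self.run_name = run_name
        if storage_decision is not None:
            # Reuse the preflight result; no AWS/SSO round-trips here.
            self._remote_prefix = storage_decision.remote_prefix
        else:
            self._remote_prefix = self._compute_remote_prefix()

    def _compute_remote_prefix(self):
        # Stand-in for the slow AWS/SSO routing path this PR avoids recomputing.
        return None

# Any object exposing .remote_prefix works for illustration.
decision = SimpleNamespace(remote_prefix="s3://bucket/policies/my_run")
manager = CheckpointManager("my_run", storage_decision=decision)
```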

TrainTool uses it in preflight

metta/tools/train.py

  • auto_policy_storage_decision(run_name) starts in a background thread
  • The result is passed to CheckpointManager once available
  • Preserves fail‑fast behavior on decision errors
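The overlap described above follows the standard future pattern: submit the slow check, do unrelated setup, then block on the result. A minimal sketch (the stand-in function below is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def auto_policy_storage_decision(run_name):
    # Stand-in for the slow AWS/SSO routing check done in preflight.
    return f"s3://bucket/policies/{run_name}"

executor = ThreadPoolExecutor(max_workers=2)
future = executor.submit(auto_policy_storage_decision, "my_run")

# ... vectorized environment + policy construction happens here,
# overlapped with the preflight work running on the executor ...

decision = future.result()  # re-raises any preflight exception (fail-fast)
executor.shutdown()
```

Calling `future.result()` preserves fail-fast behavior: if the background check raised, the exception surfaces at the point the decision is needed rather than being swallowed.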

Net effect: the “save/load interface” is still the same file/URI scheme, but the routing decision is now explicit, cached, and overlapped with startup.


Launch Sequence (TrainTool)

  1. Determine run_name
  2. Start preflight executor (if needed):
    • storage decision
    • optional stats client
  3. Build vectorized environment + policy
  4. Apply storage decision to CheckpointManager
  5. Initialize logging + trainer
  6. Register components and start training

This keeps behavior identical while shortening the critical path.


Mission Lookup (Lazy + Cached)

Lazy registries

  • Core missions load immediately; eval missions load only when requested
  • Registries are cached (lru_cache) so repeated lookups are cheap

Cleaner resolution

  • find_mission(..., include_evals=True) provides a single path for “core + eval” lookup
  • If no site matches and mission_name is None, we treat the input as a mission name (fixes easy_hearts in recipe tests)
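The lazy/cached registry pattern can be sketched as follows (function and registry names are illustrative, not the actual API):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def _core_missions():
    # Cheap to build; loaded on first use and cached thereafter.
    return {"navigation": "core"}

@lru_cache(maxsize=1)
def _eval_missions():
    # Imagine heavy eval-module imports here; they only run when
    # an eval lookup is actually requested.
    return {"easy_hearts": "eval"}

def find_mission(name, include_evals=False):
    registry = dict(_core_missions())
    if include_evals:
        registry.update(_eval_missions())
    return registry.get(name)
```

With `lru_cache(maxsize=1)` each registry is built at most once per process, so repeated lookups pay only a dict access.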

Curriculum Defaults

metta/cogworks/curriculum/curriculum.py

  • num_active_tasks default reduced from 10000 → 64
  • Tests updated accordingly

Recipe + Import Cleanup

recipes/experiment/machina_1.py

  • Heavy imports moved inside function bodies to reduce CLI startup time
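The pattern is the usual deferred-import idiom: move the import into the function body so the module loads fast and the cost is paid only when the entrypoint runs. A sketch (here `json` stands in for a heavy dependency such as torch):

```python
def train_entrypoint():
    # Heavy dependencies are imported only when the command actually runs,
    # keeping CLI startup and `--help` fast.
    import json  # stand-in for a heavy import
    return json.dumps({"status": "ok"})
```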

Testing

  • CI: Lint, Python, Recipes, C++/Nim
  • Updated curriculum default test
  • Recipe tests now resolve the easy_hearts mission via find_mission(..., include_evals=True)

[Asana Task](https://app.asana.com/1/1209016784099267/project/1210348820405981/task/1212600746277979)

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@datadog-official

datadog-official bot commented Dec 28, 2025

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: afea62c

@relh relh changed the title improve launch/startup time of training improve launch/startup time of training and ditch easy_hearts Dec 28, 2025
@relh relh changed the title improve launch/startup time of training and ditch easy_hearts improve launch/startup time of training Dec 28, 2025
@relh relh enabled auto-merge December 28, 2025 17:20
@relh relh changed the title improve launch/startup time of training improve launch/startup time of training with lazy imports Dec 28, 2025
relh commented Dec 28, 2025

Startup timing (machina_1 via ./devops/run.sh, sandbox=true, total_timesteps=1): origin/main 22.56s vs richard-launchfixes 2.79s. Both runs used W&B + SSO; main run was a single cold start.

@relh relh added the review wanted: stamp This PR needs a review from any available team member label Dec 28, 2025
@relh relh changed the title improve launch/startup time of training with lazy imports improve launch/startup time of training with lazy imports, default tasks 10k -> 64 Dec 29, 2025
@relh relh changed the title improve launch/startup time of training with lazy imports, default tasks 10k -> 64 drop ./devops/run.sh train launch time from 20s -> 2s with web-based callbacks, lazy imports, and default tasks 10k -> 64 Dec 29, 2025
@relh relh added this pull request to the merge queue Dec 29, 2025
@relh relh removed this pull request from the merge queue due to a manual request Dec 29, 2025
github-merge-queue bot pushed a commit that referenced this pull request Dec 29, 2025
…d callbacks, lazy imports, and default tasks 10k -> 64 (#4545)
