Conversation

@relh relh commented Dec 28, 2025

Summary

This PR reduces training startup latency by overlapping slow preflight work (AWS/SSO checkpoint routing + optional stats client init) with environment/policy construction. It also makes mission lookup lazy/cached, and reduces the default active curriculum task count to avoid unnecessary work on startup.

The only “save/load interface” change in this PR is the introduction of a PolicyStorageDecision object used to decide where checkpoints are saved and how they’re resolved during launch. The on‑disk checkpoint format is unchanged.

Motivation

Startup was spending time in a few places that don’t need to be on the critical path:

  • AWS/SSO checks for remote checkpoint routing
  • eager loading of mission registries
  • very large default curriculum task counts

These changes either overlap those costs with other startup work or defer them entirely, so training can begin faster without changing functional behavior.

What Changed (High Level)

  • Preflight executor in TrainTool to overlap storage decision + stats client init with env/policy init
  • PolicyStorageDecision passed into CheckpointManager to avoid recomputing AWS/SSO routing
  • Lazy/cached mission registries and cleaner mission resolution (core + evals)
  • Default curriculum active tasks lowered (10000 → 64)
  • Lazy imports in heavy recipe entrypoints

“Save/Load Interface” (Checkpoint Routing)

PolicyStorageDecision (new data object)

metta/tools/utils/auto_config.py

  • Captures where policies should be saved:
    • base_prefix: base remote prefix (e.g., s3://...) or None
    • remote_prefix: run‑specific prefix (base/run) or None
    • reason: why remote is or isn’t used
  • using_remote indicates when remote storage is active
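The fields above suggest a small frozen dataclass. A minimal sketch (illustrative only; the real definition in metta/tools/utils/auto_config.py may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PolicyStorageDecision:
    """Where policies should be saved, and why (sketch)."""
    base_prefix: Optional[str]    # base remote prefix, e.g. "s3://..." or None
    remote_prefix: Optional[str]  # run-specific prefix ("base/run") or None
    reason: str                   # why remote storage is or isn't used

    @property
    def using_remote(self) -> bool:
        # Remote storage is active whenever a run-specific prefix was resolved.
        return self.remote_prefix is not None

decision = PolicyStorageDecision(
    base_prefix="s3://bucket/policies",
    remote_prefix="s3://bucket/policies/my_run",
    reason="AWS SSO session active",
)
```

Making it frozen keeps the decision immutable once computed, which matters since it is produced on a background thread and consumed later.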

CheckpointManager now accepts a decision

metta/rl/checkpoint_manager.py

  • CheckpointManager(..., storage_decision=...) uses the precomputed decision to set _remote_prefix
  • Avoids recomputing AWS/SSO checks during startup
  • Save/load behavior remains the same; only the decision is centralized
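In outline, the constructor prefers the precomputed decision and only falls back to the slow path when none is provided (a sketch; the actual signature in metta/rl/checkpoint_manager.py may differ):

```python
from types import SimpleNamespace

class CheckpointManager:
    """Sketch of a manager that accepts a precomputed routing decision."""

    def __init__(self, run_name, storage_decision=None):
        self.run_name = run_name
        if storage_decision is not None:
            # Reuse the preflight result; no AWS/SSO round-trips here.
            self._remote_prefix = storage_decision.remote_prefix
        else:
            self._remote_prefix = self._compute_remote_prefix()

    def _compute_remote_prefix(self):
        # Stand-in for the slow AWS/SSO routing path this PR avoids recomputing.
        return None

# Any object exposing .remote_prefix works for illustration.
decision = SimpleNamespace(remote_prefix="s3://bucket/policies/my_run")
manager = CheckpointManager("my_run", storage_decision=decision)
```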

TrainTool uses it in preflight

metta/tools/train.py

  • auto_policy_storage_decision(run_name) starts in a background thread
  • The result is passed to CheckpointManager once available
  • Preserves fail‑fast behavior on decision errors
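The overlap described above follows the standard future pattern: submit the slow check, do unrelated setup, then block on the result. A minimal sketch (the stand-in function below is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def auto_policy_storage_decision(run_name):
    # Stand-in for the slow AWS/SSO routing check done in preflight.
    return f"s3://bucket/policies/{run_name}"

executor = ThreadPoolExecutor(max_workers=2)
future = executor.submit(auto_policy_storage_decision, "my_run")

# ... vectorized environment + policy construction happens here,
# overlapped with the preflight work running on the executor ...

decision = future.result()  # re-raises any preflight exception (fail-fast)
executor.shutdown()
```

Calling `future.result()` preserves fail-fast behavior: if the background check raised, the exception surfaces at the point the decision is needed rather than being swallowed.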

Net effect: the “save/load interface” is still the same file/URI scheme, but the routing decision is now explicit, cached, and overlapped with startup.


Launch Sequence (TrainTool)

  1. Determine run_name
  2. Start preflight executor (if needed):
    • storage decision
    • optional stats client
  3. Build vectorized environment + policy
  4. Apply storage decision to CheckpointManager
  5. Initialize logging + trainer
  6. Register components and start training

This keeps behavior identical while shortening the critical path.


Mission Lookup (Lazy + Cached)

Lazy registries

  • Core missions load immediately; eval missions load only when requested
  • Registries are cached (lru_cache) so repeated lookups are cheap

Cleaner resolution

  • find_mission(..., include_evals=True) provides a single path for “core + eval” lookup
  • If no site matches and mission_name is None, we treat the input as a mission name (fixes easy_hearts in recipe tests)
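The lazy/cached registry pattern can be sketched as follows (function and registry names are illustrative, not the actual API):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def _core_missions():
    # Cheap to build; loaded on first use and cached thereafter.
    return {"navigation": "core"}

@lru_cache(maxsize=1)
def _eval_missions():
    # Imagine heavy eval-module imports here; they only run when
    # an eval lookup is actually requested.
    return {"easy_hearts": "eval"}

def find_mission(name, include_evals=False):
    registry = dict(_core_missions())
    if include_evals:
        registry.update(_eval_missions())
    return registry.get(name)
```

With `lru_cache(maxsize=1)` each registry is built at most once per process, so repeated lookups pay only a dict access.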

Curriculum Defaults

metta/cogworks/curriculum/curriculum.py

  • num_active_tasks default reduced from 10000 → 64
  • Tests updated accordingly

Recipe + Import Cleanup

recipes/experiment/machina_1.py

  • Heavy imports moved inside function bodies to reduce CLI startup time
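The pattern is the usual deferred-import idiom: move the import into the function body so the module loads fast and the cost is paid only when the entrypoint runs. A sketch (here `json` stands in for a heavy dependency such as torch):

```python
def train_entrypoint():
    # Heavy dependencies are imported only when the command actually runs,
    # keeping CLI startup and `--help` fast.
    import json  # stand-in for a heavy import
    return json.dumps({"status": "ok"})
```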

Testing

  • CI: Lint, Python, Recipes, C++/Nim
  • Updated curriculum default test
  • Recipe tests now resolve the easy_hearts mission via find_mission(..., include_evals=True)

[Asana Task](https://app.asana.com/1/1209016784099267/project/1210348820405981/task/1212600746277979)

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@datadog-official

datadog-official bot commented Dec 28, 2025

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: afea62c

@relh relh changed the title improve launch/startup time of training improve launch/startup time of training and ditch easy_hearts Dec 28, 2025
@relh relh changed the title improve launch/startup time of training and ditch easy_hearts improve launch/startup time of training Dec 28, 2025
@relh relh enabled auto-merge December 28, 2025 17:20
@relh relh changed the title improve launch/startup time of training improve launch/startup time of training with lazy imports Dec 28, 2025
relh commented Dec 28, 2025

Startup timing (machina_1 via ./devops/run.sh, sandbox=true, total_timesteps=1): origin/main 22.56s vs richard-launchfixes 2.79s. Both runs used W&B + SSO; main run was a single cold start.

@relh relh added the review wanted: stamp This PR needs a review from any available team member label Dec 28, 2025
@relh relh changed the title improve launch/startup time of training with lazy imports improve launch/startup time of training with lazy imports, default tasks 10k -> 64 Dec 29, 2025
@relh relh changed the title improve launch/startup time of training with lazy imports, default tasks 10k -> 64 drop ./devops/run.sh train launch time from 20s -> 2s with web-based callbacks, lazy imports, and default tasks 10k -> 64 Dec 29, 2025
@relh relh added this pull request to the merge queue Dec 29, 2025
@relh relh removed this pull request from the merge queue due to a manual request Dec 29, 2025
github-merge-queue bot pushed a commit that referenced this pull request Dec 29, 2025
…d callbacks, lazy imports, and default tasks 10k -> 64 (#4545)
