-
Notifications
You must be signed in to change notification settings - Fork 48
drop ./devops/run.sh train launch time from 20s -> 2s with web-based callbacks, lazy imports, and default tasks 10k -> 64
#4545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
This reverts commit 325a22c.
|
Startup timing (machina_1 via ./devops/run.sh, sandbox=true, total_timesteps=1): origin/main 22.56s vs richard-launchfixes 2.79s. Both runs used W&B + SSO; main run was a single cold start. |
This stack of pull requests is managed by Graphite. Learn more about stacking. |
./devops/run.sh train launch time from 20s -> 2s with web-based callbacks, lazy imports, and default tasks 10k -> 64
…d callbacks, lazy imports, and default tasks 10k -> 64 (#4545) ## Summary This PR reduces training startup latency by overlapping slow preflight work (AWS/SSO checkpoint routing + optional stats client init) with environment/policy construction. It also makes mission lookup lazy/cached, and reduces the default active curriculum task count to avoid unnecessary work on startup. The only “save/load interface” change in this PR is the introduction of a **PolicyStorageDecision** object used to decide *where* checkpoints are saved and how they’re resolved during launch. The on‑disk checkpoint format is unchanged. ## Motivation Startup was spending time in a few places that don’t need to be on the critical path: - AWS/SSO checks for remote checkpoint routing - eager loading of mission registries - very large default curriculum task counts These changes pull those costs earlier or defer them so that training can begin faster without changing functional behavior. ## What Changed (High Level) - **Preflight executor in TrainTool** to overlap storage decision + stats client init with env/policy init - **PolicyStorageDecision** passed into CheckpointManager to avoid recomputing AWS/SSO routing - **Lazy/cached mission registries** and cleaner mission resolution (core + evals) - **Default curriculum active tasks lowered** (10000 → 64) - **Lazy imports** in heavy recipe entrypoints --- ## “Save/Load Interface” (Checkpoint Routing) ### PolicyStorageDecision (new data object) `metta/tools/utils/auto_config.py` - Captures *where* policies should be saved: - `base_prefix`: base remote prefix (e.g., `s3://...`) or `None` - `remote_prefix`: run‑specific prefix (`base/run`) or `None` - `reason`: why remote is or isn’t used - `using_remote` indicates when remote storage is active ### CheckpointManager now accepts a decision `metta/rl/checkpoint_manager.py` - `CheckpointManager(..., storage_decision=...)` uses the precomputed decision to set `_remote_prefix` - Avoids recomputing AWS/SSO checks during startup - Save/load behavior remains the same; only the *decision* is centralized ### TrainTool uses it in preflight `metta/tools/train.py` - `auto_policy_storage_decision(run_name)` starts in a background thread - The result is passed to `CheckpointManager` once available - Preserves fail‑fast behavior on decision errors **Net effect:** the “save/load interface” is still the same file/URI scheme, but the **routing decision is now explicit, cached, and overlapped with startup**. --- ## Launch Sequence (TrainTool) 1. Determine `run_name` 2. Start preflight executor (if needed): - storage decision - optional stats client 3. Build vectorized environment + policy 4. Apply storage decision to CheckpointManager 5. Initialize logging + trainer 6. Register components and start training This keeps behavior identical while shortening the critical path. --- ## Mission Lookup (Lazy + Cached) ### Lazy registries - Core missions load immediately; eval missions load only when requested - Registries are cached (`lru_cache`) so repeated lookups are cheap ### Cleaner resolution - `find_mission(..., include_evals=True)` provides a single path for “core + eval” lookup - If no site matches and `mission_name` is None, we treat the input as a mission name (fixes `easy_hearts` in recipe tests) --- ## Curriculum Defaults `metta/cogworks/curriculum/curriculum.py` - `num_active_tasks` default reduced from **10000 → 64** - Tests updated accordingly --- ## Recipe + Import Cleanup `recipes/experiment/machina_1.py` - Heavy imports moved inside function bodies to reduce CLI startup time --- ## Testing - CI: Lint, Python, Recipes, C++/Nim - Updated curriculum default test - Recipe tests now pass `easy_hearts` mission lookup via `find_mission(..., include_evals=True)` [Asana Task](https://app.asana.com/1/1209016784099267/project/1210348820405981/task/1212600746277979)

Summary
This PR reduces training startup latency by overlapping slow preflight work (AWS/SSO checkpoint routing + optional stats client init) with environment/policy construction. It also makes mission lookup lazy/cached, and reduces the default active curriculum task count to avoid unnecessary work on startup.
The only “save/load interface” change in this PR is the introduction of a PolicyStorageDecision object used to decide where checkpoints are saved and how they’re resolved during launch. The on‑disk checkpoint format is unchanged.
Motivation
Startup was spending time in a few places that don’t need to be on the critical path:
These changes pull those costs earlier or defer them so that training can begin faster without changing functional behavior.
What Changed (High Level)
“Save/Load Interface” (Checkpoint Routing)
PolicyStorageDecision (new data object)
metta/tools/utils/auto_config.pybase_prefix: base remote prefix (e.g.,s3://...) orNoneremote_prefix: run‑specific prefix (base/run) orNonereason: why remote is or isn’t usedusing_remoteindicates when remote storage is activeCheckpointManager now accepts a decision
metta/rl/checkpoint_manager.pyCheckpointManager(..., storage_decision=...)uses the precomputed decision to set_remote_prefixTrainTool uses it in preflight
metta/tools/train.pyauto_policy_storage_decision(run_name)starts in a background threadCheckpointManageronce availableNet effect: the “save/load interface” is still the same file/URI scheme, but the routing decision is now explicit, cached, and overlapped with startup.
Launch Sequence (TrainTool)
run_nameThis keeps behavior identical while shortening the critical path.
Mission Lookup (Lazy + Cached)
Lazy registries
lru_cache) so repeated lookups are cheapCleaner resolution
find_mission(..., include_evals=True)provides a single path for “core + eval” lookupmission_nameis None, we treat the input as a mission name (fixeseasy_heartsin recipe tests)Curriculum Defaults
metta/cogworks/curriculum/curriculum.pynum_active_tasksdefault reduced from 10000 → 64Recipe + Import Cleanup
recipes/experiment/machina_1.pyTesting
easy_heartsmission lookup viafind_mission(..., include_evals=True)Asana Task