@relh relh commented Dec 12, 2025

Save/load works on both single-GPU and multi-GPU machines.


Unifies checkpoint save/load around policy_spec bundles (directory with policy_spec.json + weights.safetensors) while keeping a minimal .mpt compatibility path via MptPolicy in metta. Adds bundle utilities and a CheckpointPolicy to standardize serialization, resolution, and S3 sync, and fixes submission zip creation to avoid leaking local paths.
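
For readers unfamiliar with the layout, here is a minimal sketch of what "a directory with policy_spec.json + weights.safetensors" means in practice. It is not the code from this PR: the helper names `is_checkpoint_bundle` and `read_policy_spec` are hypothetical, and the sketch only checks that both files exist and parses the JSON, without assuming anything about the spec's fields.

```python
import json
from pathlib import Path

# File names taken from the PR description.
SPEC_NAME = "policy_spec.json"
WEIGHTS_NAME = "weights.safetensors"


def is_checkpoint_bundle(path: Path) -> bool:
    """Return True if `path` is a directory containing both bundle files."""
    return (
        path.is_dir()
        and (path / SPEC_NAME).is_file()
        and (path / WEIGHTS_NAME).is_file()
    )


def read_policy_spec(bundle_dir: Path) -> dict:
    """Parse policy_spec.json from a bundle (hypothetical helper)."""
    if not is_checkpoint_bundle(bundle_dir):
        raise FileNotFoundError(f"{bundle_dir} is not a policy_spec bundle")
    return json.loads((bundle_dir / SPEC_NAME).read_text())
```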

Key Changes

  • New checkpoint bundle tooling (checkpoint_dir) and CheckpointPolicy for safetensors‑based loading; training/checkpointer now load via policy_spec.
  • URI resolution now targets checkpoint directories (and :latest) and handles S3 checkpoint dirs; checkpoint filenames drop the .mpt suffix.
  • Evaluator submission zip upload now derives from checkpoint bundle paths and rewrites absolute data paths to portable names.
  • CLI/docs/recipes updated to reference checkpoint dirs or policy_spec.json.
  • Minimal .mpt compatibility preserved by moving MptPolicy + mpt artifact helpers into metta/rl; .mpt loads only via policy_spec bundles that reference MptPolicy.
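
Two of the changes above (`:latest` resolution and the portable submission zip) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the implementation in this PR: the directory naming pattern (a trailing `v<epoch>`), the `data_path` field in policy_spec.json, and the function names `resolve_latest` and `write_submission_zip` are all hypothetical.

```python
import json
import re
import zipfile
from pathlib import Path

# Hypothetical naming: checkpoint directories end in "v<epoch>", e.g. "run_v12".
_EPOCH_RE = re.compile(r"v(\d+)$")


def resolve_latest(checkpoints_root: Path) -> Path:
    """Pick the checkpoint directory with the highest epoch number."""
    candidates = []
    for child in checkpoints_root.iterdir():
        match = _EPOCH_RE.search(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        raise FileNotFoundError(f"no checkpoint directories in {checkpoints_root}")
    return max(candidates, key=lambda item: item[0])[1]


def write_submission_zip(bundle_dir: Path, zip_path: Path) -> None:
    """Zip a bundle, replacing an absolute data path with its file name."""
    spec = json.loads((bundle_dir / "policy_spec.json").read_text())
    data_path = spec.get("data_path")  # hypothetical field name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        if data_path:
            source = Path(data_path)
            if not source.is_absolute():
                source = bundle_dir / source
            # Store only the file name so no local directory leaks into the zip.
            spec["data_path"] = source.name
            archive.write(source, arcname=source.name)
        archive.writestr("policy_spec.json", json.dumps(spec, indent=2))
```

Comparing epochs as integers rather than strings matters here: a lexicographic sort would rank `v3` above `v12`.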

Breaking Changes / Migration

  • New checkpoints are saved as policy_spec bundles (no .mpt filenames).

datadog-official bot commented Dec 12, 2025

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🔗 Commit SHA: 20de671

@relh relh assigned nishu-builder and unassigned relh Dec 20, 2025
@@ -0,0 +1,131 @@
from __future__ import annotations

Similarly, can we put all the custom saving/loading into CheckpointPolicy.load_policy_data and CheckpointPolicy.save_policy_data? They would dump (and load) more than just the weights file.
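
A rough sketch of the shape this suggestion might take, assuming a policy class that owns every file in its bundle. Only the method names `save_policy_data` and `load_policy_data` come from the comment above; the class name `CheckpointPolicySketch`, the `init_kwargs` and `extra_state` fields, the `extra_state.json` file, and the `data_path` spec field are hypothetical. Weights are treated as opaque bytes here to keep the example dependency-free; real code would serialize tensors with safetensors.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class CheckpointPolicySketch:
    """Hypothetical stand-in for a policy that saves and loads its own bundle."""

    weights: bytes  # opaque stand-in for serialized tensors
    init_kwargs: dict = field(default_factory=dict)
    extra_state: dict = field(default_factory=dict)

    def save_policy_data(self, bundle_dir: Path) -> None:
        """Write the weights, extra state, and spec into one directory."""
        bundle_dir.mkdir(parents=True, exist_ok=True)
        (bundle_dir / "weights.safetensors").write_bytes(self.weights)
        (bundle_dir / "extra_state.json").write_text(json.dumps(self.extra_state))
        # The spec stores a relative path so the bundle stays portable.
        spec = {"init_kwargs": self.init_kwargs, "data_path": "weights.safetensors"}
        (bundle_dir / "policy_spec.json").write_text(json.dumps(spec, indent=2))

    @classmethod
    def load_policy_data(cls, bundle_dir: Path) -> "CheckpointPolicySketch":
        """Rebuild the policy from every file written by save_policy_data."""
        spec = json.loads((bundle_dir / "policy_spec.json").read_text())
        weights = (bundle_dir / spec["data_path"]).read_bytes()
        extra_state = json.loads((bundle_dir / "extra_state.json").read_text())
        return cls(
            weights=weights,
            init_kwargs=spec["init_kwargs"],
            extra_state=extra_state,
        )
```

Keeping both directions on the same class means the set of files in a bundle is defined in one place, so the checkpointer and evaluator would not need to know which files exist beyond the weights.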

relh commented Dec 22, 2025

Closing in favor of #4502.

@relh relh closed this Dec 22, 2025
auto-merge was automatically disabled December 22, 2025 22:34

Pull request was closed

3 participants