@relh relh commented Dec 23, 2025

Eight things worth investigating in the codebase, ordered by likely impact:

  1. Scalar syncs in hot loops (.item(), float(tensor), tensor.cpu().numpy()), likely in logging/metrics and loss reporting; check metta/rl/training/progress_logger.py, metta/rl/training/wandb_logger.py, metta/rl/training/stats_reporter.py, and metta/rl/loss/*.
  2. Scatter/index-heavy paths (index_put_, index_copy_, index_select) in experience/loss assembly; check metta/rl/training/experience.py, metta/rl/advantage.py, and metta/rl/loss/*.
  3. Allocation churn (empty, fill_, empty_like, resize_) inside per-minibatch loops; check the training phase in metta/rl/training/core.py and any TensorDict construction.
  4. Redundant device transfers (aten::to, _to_copy) in training/loss paths; look for repeated .to(device=...) in minibatch loops or loss code.
  5. Explicit syncs (torch.cuda.synchronize) or implicit syncs from logging; search for synchronize( and similar in metta/rl/training/* and metta/rl/loss/*.
  6. DDP gradient sync cadence (all-reduce shows up in kernels); check whether grad sync is happening more often than needed (e.g., per minibatch vs per update) in metta/rl/training/distributed_helper.py and the optimizer step logic.
  7. Repeated arange/index tensor creation; consider caching per-epoch/per-minibatch indices beyond the existing cache in metta/rl/training/core.py.
  8. Any per-step CPU-side action post-processing that could be vectorized (e.g., action masking or log-prob padding) in agent/src/metta/agent/components/actor.py and agent/src/metta/agent/util/distribution_utils.py.
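
A first pass over items 1, 4, and 5 could be a static scan rather than a profiler run. The sketch below is not code from this repository: `find_suspect_calls`, `scan_paths`, and the `SUSPECT_METHODS` set are hypothetical names, and the method list is an assumption drawn from the items above. It uses only the standard-library `ast` module to flag method calls such as `.item()`, `.cpu()`, or `.synchronize()` and to note whether each one sits lexically inside a `for`/`while` loop. It does not catch `float(tensor)`, since that is a plain builtin call, and `.to(...)` hits will include false positives that need manual review.

```python
import ast
from pathlib import Path

# Method names that commonly force a host/device sync or a device transfer.
# Assumption: this set is illustrative, taken from the list above, not exhaustive.
SUSPECT_METHODS = {"item", "cpu", "numpy", "tolist", "synchronize", "to"}


def find_suspect_calls(source: str, filename: str = "<string>"):
    """Return (filename, lineno, method, in_loop) for each suspect method call."""
    tree = ast.parse(source, filename=filename)
    hits = []

    def visit(node, in_loop):
        # Record calls of the form <expr>.<method>(...) whose method is suspect.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in SUSPECT_METHODS:
                hits.append((filename, node.lineno, node.func.attr, in_loop))
        # Everything nested under a for/while statement counts as "in a loop".
        loop_here = in_loop or isinstance(node, (ast.For, ast.While))
        for child in ast.iter_child_nodes(node):
            visit(child, loop_here)

    visit(tree, False)
    return sorted(hits, key=lambda h: h[1])


def scan_paths(root: str, pattern: str = "**/*.py"):
    """Scan every Python file under root; skip files that fail to parse."""
    hits = []
    for path in sorted(Path(root).glob(pattern)):
        try:
            hits.extend(find_suspect_calls(path.read_text(), str(path)))
        except (SyntaxError, UnicodeDecodeError):
            continue
    return hits


if __name__ == "__main__":
    # "metta/rl" is the directory named in the list above; adjust as needed.
    for filename, lineno, method, in_loop in scan_paths("metta/rl"):
        tag = "LOOP" if in_loop else "    "
        print(f"{tag} {filename}:{lineno} .{method}()")
```

Hits tagged `LOOP` are the natural starting point, since a sync inside a per-minibatch loop is paid on every iteration. The scan is lexical only: a helper that calls `.item()` and is itself invoked from a loop will not be tagged, so profiler traces remain the ground truth.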

Want me to start with a targeted .item()/sync audit and propose concrete code changes, or do the index/scatter path audit first?

@relh relh changed the title machina1 perf improvements audit training pipeline in search of perf limitations Dec 23, 2025
@relh relh changed the title audit training pipeline in search of perf limitations audit training pipeline for performance issues and fix Dec 23, 2025
@relh relh changed the title audit training pipeline for performance issues and fix audit training pipeline for performance issues Dec 23, 2025
relh commented Dec 24, 2025

This stack of pull requests is managed by Graphite.
