audit training pipeline for performance issues #4512

relh · 2025-12-23T20:58:00Z

Top 5–10 things worth investigating in the codebase (ordered by likely impact):

1. Scalar syncs in hot loops (.item(), float(tensor), tensor.cpu().numpy()), likely in logging/metrics and loss reporting; check metta/rl/training/progress_logger.py, metta/rl/training/wandb_logger.py, metta/rl/training/stats_reporter.py, and metta/rl/loss/*.
2. Scatter/index heavy paths (index_put_, index_copy_, index_select) in experience/loss assembly; check metta/rl/training/experience.py, metta/rl/advantage.py, and metta/rl/loss/*.
3. Allocation churn (empty, fill_, empty_like, resize_) inside per‑minibatch loops; check metta/rl/training/core.py training phase and any TensorDict construction.
4. Redundant device transfers (aten::to, _to_copy) in training/loss paths; look for repeated .to(device=...) in minibatch loops or loss code.
5. Explicit syncs (torch.cuda.synchronize) or implicit sync from logging; search for synchronize( and similar in metta/rl/training/* and metta/rl/loss/*.
6. DDP gradient sync cadence (all‑reduce shows up in kernels); check whether grad sync is happening more often than needed (e.g., per minibatch vs per update) in metta/rl/training/distributed_helper.py and optimizer step logic.
7. Repeated arange/index tensor creation; consider caching per‑epoch/per‑minibatch indices beyond the existing cache in metta/rl/training/core.py.
8. Any per‑step CPU‑side action post‑processing that could be vectorized (e.g., action masking or log‑prob padding) in agent/src/metta/agent/components/actor.py and agent/src/metta/agent/util/distribution_utils.py.
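
For the scalar-sync item, one common remedy is to accumulate metrics on the device and read them back once per epoch instead of once per minibatch. A minimal sketch, assuming PyTorch; `mean_loss_eager`, `mean_loss_deferred`, and the toy loss values are hypothetical and not taken from the metta codebase:

```python
import torch


def mean_loss_eager(losses):
    """Baseline: calls .item() per minibatch, forcing a host sync each time on GPU."""
    total = 0.0
    for loss in losses:
        total += loss.item()  # blocks until the device has produced this value
    return total / len(losses)


def mean_loss_deferred(losses):
    """Accumulate on the tensor's own device and sync to the host once at the end."""
    running = torch.zeros((), device=losses[0].device)
    for loss in losses:
        # detach() drops the autograd graph; no host read happens in the loop
        running += loss.detach()
    return (running / len(losses)).item()  # the only host sync


# Toy stand-ins for per-minibatch losses (CPU tensors so the sketch runs anywhere).
losses = [torch.tensor(1.0), torch.tensor(2.0), torch.tensor(3.0)]
eager = mean_loss_eager(losses)
deferred = mean_loss_deferred(losses)
```

Both functions return the same mean; on a CUDA device the deferred version issues one blocking read instead of one per minibatch. Whether that matters here depends on what the profiler shows for the logging paths listed above.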

Want me to start with a targeted .item()/sync audit and propose concrete code changes, or do the index/scatter path audit first?
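
A first pass at the targeted .item()/sync audit could be a plain text scan for the call patterns listed above. A rough sketch using only the standard library; `find_sync_candidates` and `scan_tree` are hypothetical helper names, and a regex scan produces false positives (for example float() applied to plain Python numbers), so hits are candidates to inspect, not confirmed syncs:

```python
import re
from pathlib import Path

# Patterns taken from the list above; labels are just for reporting.
SYNC_PATTERNS = {
    ".item()": re.compile(r"\.item\(\)"),
    ".cpu()": re.compile(r"\.cpu\(\)"),
    "float(": re.compile(r"\bfloat\("),
    "synchronize(": re.compile(r"\bsynchronize\("),
}


def find_sync_candidates(source):
    """Return (line_number, label) pairs for lines matching a sync pattern."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Crude comment strip; this also cuts at a '#' inside a string literal.
        code = line.split("#", 1)[0]
        for label, pattern in SYNC_PATTERNS.items():
            if pattern.search(code):
                hits.append((lineno, label))
    return hits


def scan_tree(root):
    """Scan every .py file under root; map file path -> list of hits."""
    results = {}
    for path in sorted(Path(root).rglob("*.py")):
        hits = find_sync_candidates(path.read_text(encoding="utf-8"))
        if hits:
            results[str(path)] = hits
    return results


# Small inline sample so the sketch can be checked without the repo.
sample = """\
for batch in loader:
    loss = compute(batch)
    log(loss.item())  # forces a host sync
    arr = loss.cpu().numpy()
torch.cuda.synchronize()
"""
hits = find_sync_candidates(sample)
```

Run from the repo root, something like `scan_tree("metta/rl/training")` or `scan_tree("metta/rl/loss")` would cover the directories named in items 1 and 5. The scan cannot tell whether a hit sits inside a hot loop, so the output still needs to be cross-checked against the profiler trace.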

relh · 2025-12-24T00:09:55Z

audit training pipeline for performance issues #4512 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

relh added 5 commits December 23, 2025 13:46

pinned buffers

7a3b882

revert

7aa790d

profiling code

152e397

cleaner

e0309fa

Fix torch profiler arming across epochs

7a3ddfe

github-actions bot assigned relh Dec 23, 2025

relh added 2 commits December 23, 2025 16:16

Default machina_1 variants to heart_chorus

223b28d

perf docs

8cfc36a

relh changed the title ~~machina1 perf improvements~~ audit training pipeline in search of perf limitations Dec 23, 2025

relh changed the title ~~audit training pipeline in search of perf limitations~~ audit training pipeline for performance issues and fix Dec 23, 2025

relh changed the title ~~audit training pipeline for performance issues and fix~~ audit training pipeline for performance issues Dec 23, 2025

perf docs epoch3

4999028

relh added 4 commits December 23, 2025 22:14

Merge remote-tracking branch 'origin/main' into richard-perf1223

d742511

Merge remote-tracking branch 'origin/main' into richard-perf1223

248a086

Merge remote-tracking branch 'origin/main' into richard-perf1223

4fe422f

Merge remote-tracking branch 'origin/main' into richard-perf1223

79f6fed
