Description
Bug report
Hey MaxText Team,
We are running a training and checkpointing benchmark and noticed that the logs for node 0 are incomplete. Several steps are missing. For example, steps 9, 10, and 11 do not appear in the node 0 logs: https://cloudlogging.app.goo.gl/K4YF5Y8wrDtbGmLU7
These steps are present in the logs for other nodes: https://cloudlogging.app.goo.gl/QLmonNgvjfoUdfCv6
Since we rely specifically on node 0 logs for our metrics processing, it is important that they are complete. Could you please investigate why these logs are being dropped and help us resolve the issue? Thanks for your help!
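For anyone trying to confirm the gap on an exported log, here is a minimal sketch of how missing steps could be detected. It assumes each step is reported on a line containing `completed step: <N>`, which this report does not confirm; the function name `find_missing_steps` and the sample lines are hypothetical and only illustrate the idea.

```python
import re

# Assumed line format (not confirmed by this report): "completed step: 9, ..."
STEP_PATTERN = re.compile(r"completed step: (\d+)")


def find_missing_steps(log_lines):
    """Return step numbers absent between the lowest and highest logged step."""
    seen = set()
    for line in log_lines:
        match = STEP_PATTERN.search(line)
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [step for step in range(min(seen), max(seen) + 1) if step not in seen]


# Made-up sample lines standing in for an exported node 0 log.
node0_lines = [
    "completed step: 8, seconds: 1.0",
    "unrelated checkpoint message",
    "completed step: 12, seconds: 1.0",
]
print(find_missing_steps(node0_lines))
```

Running the same check on the logs of another node would show whether the gap is specific to node 0.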
Logs/Output
Full logs: https://cloudlogging.app.goo.gl/agLPRhaEx7WmNA6d8
Additional Context
You can find the workload manifests here: https://paste.googleplex.com/4857812201111552#l=248