Node 0 training step logs missing #2834

@lepan-google

Bug report

Hey MaxText Team,

We are running a training and checkpointing benchmark and noticed that the logs for node 0 are incomplete. Several training steps are missing; for example, steps 9, 10, and 11 do not appear in the node 0 logs: https://cloudlogging.app.goo.gl/K4YF5Y8wrDtbGmLU7

These steps are present in the logs for other nodes: https://cloudlogging.app.goo.gl/QLmonNgvjfoUdfCv6

Since we rely specifically on node 0 logs for our metrics processing, it is important that they are complete. Could you please investigate why these logs are being dropped and help us resolve the issue? Thanks for your help!
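
For anyone trying to reproduce this, here is a minimal sketch of how the gap could be confirmed from exported log lines. It assumes each training step emits a line containing `completed step: <N>`; that pattern, the function name `find_missing_steps`, and the sample lines are illustrative assumptions, not copied from the linked logs.

```python
import re

# Assumed shape of a per-step log line; adjust the pattern to the real output.
STEP_PATTERN = re.compile(r"completed step: (\d+)")


def find_missing_steps(log_lines):
    """Return step numbers absent between the first and last step seen."""
    seen = set()
    for line in log_lines:
        match = STEP_PATTERN.search(line)
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]


# Hypothetical sample lines standing in for an export of the node 0 logs.
node0_lines = [
    "completed step: 8, seconds: 1.0",
    "completed step: 12, seconds: 1.0",
]
print(find_missing_steps(node0_lines))  # [9, 10, 11]
```

Running the same check on an export from another node would show whether the gap is specific to node 0. Note that it only detects gaps between the first and last logged step, so steps dropped at the very start or end of a run would not be reported.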

Logs/Output

Full logs: https://cloudlogging.app.goo.gl/agLPRhaEx7WmNA6d8

Additional Context

You can find the workload manifests here: https://paste.googleplex.com/4857812201111552#l=248


Labels

bug (Something isn't working)
