
Conversation

@henry-berger

  • Line 53: If training is resumed from a 25k-epoch checkpoint with a total budget of 100k epochs, the current code displays a progress bar that runs from 0 to 100k. That gives an inaccurate estimate of the remaining time, because only 75k epochs are actually left to run. This edit makes the progress bar run from 0 to `epochs - start_epoch`, which in this example is 0 to 75k.

  • Line 56: When resuming from a checkpoint other than 0, the starting epoch will almost always satisfy both conditions `not epoch % epochs_til_checkpoint` and `epoch`, so the current code saves a checkpoint before doing any training. Among other things, this overwrites the training-losses file stored with the starting checkpoint, replacing it with an empty file. Replacing the condition `epoch` with `epoch != start_epoch` fixes that problem. (Both fixes are shown in the sketch below.)

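The snippet below is a minimal, self-contained sketch of the two changes, not the actual training script: the names `epochs`, `start_epoch`, and `epochs_til_checkpoint` come from the notes above, while the `train` function, the tqdm progress bar, and the `save_checkpoint` stub are assumptions standing in for the surrounding code.

```python
# Minimal sketch of the resumed-training loop discussed above.
from tqdm import tqdm


def save_checkpoint(epoch):
    # Stand-in for the real checkpoint/loss-file writer (hypothetical helper).
    print(f"saving checkpoint at epoch {epoch}")


def train(epochs, start_epoch=0, epochs_til_checkpoint=1000):
    # Line 53 fix: size the progress bar by the epochs that remain rather
    # than the total budget, so the ETA is accurate when resuming
    # (e.g. 75k remaining epochs instead of 100k).
    with tqdm(total=epochs - start_epoch) as pbar:
        for epoch in range(start_epoch, epochs):
            # Line 56 fix: `epoch != start_epoch` replaces the old truthiness
            # test `and epoch`, which only suppressed the save at epoch 0.
            # Without it, resuming at a multiple of epochs_til_checkpoint
            # would immediately re-save and wipe the stored losses file.
            if not epoch % epochs_til_checkpoint and epoch != start_epoch:
                save_checkpoint(epoch)

            # ... forward/backward pass, optimizer step, loss logging ...
            pbar.update(1)


if __name__ == "__main__":
    train(epochs=10, start_epoch=4, epochs_til_checkpoint=2)
```

In the example run, the bar counts only the 6 remaining epochs (4 through 9), and checkpoints are written at epochs 6 and 8 but not at the resumed epoch 4.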