How to do mnist-distributed with checkpointing?

I saw the tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#save-and-load-checkpoints):
```
def demo_checkpoint(rank, world_size):
    print(f"Running DDP checkpoint example on rank {rank}.")
    setup(rank, world_size)

    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    CHECKPOINT_PATH = tempfile.gettempdir() + "/model.checkpoint"
    if rank == 0:
        # All processes should see same parameters as they all start from same
        # random parameters and gradients are synchronized in backward passes.
        # Therefore, saving it in one process is sufficient.
        torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

    # Use a barrier() to make sure that process 1 loads the model after process
    # 0 saves it.
    dist.barrier()
    # configure map_location properly
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    ddp_model.load_state_dict(
        torch.load(CHECKPOINT_PATH, map_location=map_location))

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn = nn.MSELoss()
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # Not necessary to use a dist.barrier() to guard the file deletion below
    # as the AllReduce ops in the backward pass of DDP already served as
    # a synchronization.

    if rank == 0:
        os.remove(CHECKPOINT_PATH)

    cleanup()
```
However, as you said, that tutorial is not very well written and seems to be missing pieces. Could you extend your tutorial to cover checkpointing?

I am personally only interested in processing each batch faster by using multiprocessing. What confuses me is why the code above does not simply save the model once training is done, and instead saves it on rank 0 before training starts. As you said, it's confusing. It would be fantastic if you could extend your MNIST example so that the model is saved after all the MNIST data has been processed, or every X epochs, since that is the common case.
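For concreteness, here is a rough sketch of the kind of thing I am asking about. It is not taken from the tutorial or from your example: the helper names (`is_main_process`, `save_checkpoint`, `load_checkpoint`, `train`) and the checkpoint layout are my own assumptions. The idea is that only rank 0 writes the file, every `save_every` epochs and once more after the last epoch, followed by a barrier as in the tutorial so no other rank reads a half-written file.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def is_main_process() -> bool:
    """True on rank 0, or when no process group is initialized (single process)."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True


def save_checkpoint(model, optimizer, epoch, path):
    """Hypothetical helper: write model and optimizer state from rank 0 only."""
    if is_main_process():
        # Unwrap DDP so the saved keys do not carry the "module." prefix.
        inner = model.module if isinstance(model, DDP) else model
        torch.save(
            {
                "epoch": epoch,
                "model": inner.state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            path,
        )
    # Keep the other ranks from racing ahead of the write.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()


def load_checkpoint(model, optimizer, path, map_location="cpu"):
    """Hypothetical helper: restore state and return the epoch to resume from."""
    ckpt = torch.load(path, map_location=map_location)
    inner = model.module if isinstance(model, DDP) else model
    inner.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"] + 1


def train(model, optimizer, loss_fn, batches, num_epochs, save_every, path):
    """Toy loop: checkpoint every `save_every` epochs and after the final epoch."""
    for epoch in range(num_epochs):
        for x, y in batches:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if (epoch + 1) % save_every == 0 or epoch + 1 == num_epochs:
            save_checkpoint(model, optimizer, epoch, path)
```

In the multi-GPU case I would presumably pass something like the tutorial's `map_location = {'cuda:0': 'cuda:%d' % rank}` to `load_checkpoint` so each process loads the tensors onto its own device. Is this roughly the right approach?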

Btw, thanks for your example, it is fantastic!
