
Good practices of training #73

@DarioMarzella

Description


A list of good practices for preventing issues that have already occurred:

  1. Check that the batch size is not too large. An oversized batch can overload the GPU memory and cause a failure to allocate memory (see the batch-size sketch after this list).
    Error:
    RuntimeError: CUDA out of memory. Tried to allocate 494.00 MiB (GPU 0; 39.44 GiB total capacity; 13.79 MiB already allocated; 168.62 MiB free; 22.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
    RuntimeError: CUDA error: out of memory
  2. Check that your batch size is smaller than the smallest cluster of test data you have; otherwise the loader will try to fetch more cases than are available and crash.

  3. Did you get a "ValueError: No avaiable training data after filtering" error? You may simply have entered a wrong data path, so no HDF5 file is found and there is no data to train on (see the path-check sketch after this list).

  4. Use the num_workers argument of torch.utils.data.DataLoader to have multiple workers pre-load the data during training, especially if you train on GPU. In deeprank, you have to specify it in model.train(). Do not assign more num_workers than the number of CPU cores you have available (on Snellius, 18 CPU cores per GPU card). See the DataLoader sketch after this list.

  5. Do you want to set a shape (input or output) in your model as a fraction of a variable (e.g. input_shape/2)? You might encounter the following error:
    TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
     * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
     * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)

    To prevent it, write the shape using integer division ('//') and int(), as in the following example (see also the fuller sketch after this list):
    nn.Conv3d(input_shape[0], int(input_shape[0]//2), kernel_size=1)

  6. You can check your GPU memory consumption in real time by submitting a job, connecting through ssh to the node running that job (nodename@surf.nl), and running nvidia-smi.
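
A minimal sketch for points 1 and 2, using plain PyTorch with a placeholder in-memory dataset (the deeprank HDF5 dataset and its batch-size keyword are not shown here); the sizes are illustrative only:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder data: 512 cases of small 3D grids; replace with your own dataset.
    features = torch.randn(512, 1, 8, 8, 8)
    targets = torch.randint(0, 2, (512,))
    dataset = TensorDataset(features, targets)

    # Keep the batch size small enough for the GPU memory (point 1) and no larger
    # than the smallest cluster of cases you evaluate on (point 2).
    batch_size = 8
    assert batch_size <= len(dataset), "batch size exceeds the number of available cases"

    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)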
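
A quick path sanity check for point 3; the directory and the *.hdf5 pattern are assumptions, so adapt them to how your data are actually stored:

    import glob
    import os

    data_path = "/path/to/your/data"  # placeholder; point this at your HDF5 directory

    # Fail early with a clear message instead of the "No avaiable training data" error.
    hdf5_files = glob.glob(os.path.join(data_path, "*.hdf5"))
    if not hdf5_files:
        raise FileNotFoundError(f"No .hdf5 files found in {data_path!r}; check the data path")
    print(f"Found {len(hdf5_files)} HDF5 file(s)")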
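
A sketch for point 4 with the plain torch.utils.data.DataLoader; the exact keyword accepted by deeprank's model.train() may differ, so check its signature. Note that os.sched_getaffinity is Linux-only (fine on Snellius):

    import os
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(128, 4), torch.randint(0, 2, (128,)))  # placeholder data

    # Never request more workers than the CPU cores allocated to the job
    # (on Snellius, 18 cores per GPU card).
    num_workers = min(18, len(os.sched_getaffinity(0)))

    loader = DataLoader(
        dataset,
        batch_size=8,
        shuffle=True,
        num_workers=num_workers,  # worker processes pre-load the next batches
        pin_memory=True,          # faster host-to-GPU copies when training on GPU
    )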
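
A self-contained version of the fix in point 5; the input_shape value is made up for illustration:

    import torch.nn as nn

    input_shape = (8, 10, 10, 10)  # (channels, x, y, z); illustrative value

    # Wrong: input_shape[0]/2 is a float, which makes torch raise the TypeError above.
    # conv = nn.Conv3d(input_shape[0], input_shape[0]/2, kernel_size=1)

    # Right: integer division plus an explicit int() keeps the channel count an integer.
    conv = nn.Conv3d(input_shape[0], int(input_shape[0] // 2), kernel_size=1)
    print(conv)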
