List of good practices to prevent issues that have already come up:
- Check that the batch size is not too big. Too large a batch overloads GPU memory and fails with an out-of-memory allocation error.
Error:
```
RuntimeError: CUDA out of memory. Tried to allocate 494.00 MiB (GPU 0; 39.44 GiB total capacity; 13.79 MiB already allocated; 168.62 MiB free; 22.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
RuntimeError: CUDA error: out of memory
```
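A minimal sketch of the usual remedy, assuming a plain PyTorch `DataLoader` (the dataset below is a placeholder standing in for a DeepRank HDF5 dataset): lower the batch size until the batches, model, and activations fit in GPU memory; the `max_split_size_mb` setting mentioned in the error is an optional second knob.

```python
import os

# Optional: the error message itself suggests limiting the allocator's split
# size to reduce fragmentation; it must be set before CUDA is initialised.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 256 grids of shape (8, 30, 30, 30), standing in for a
# DeepRank HDF5 dataset.
dataset = TensorDataset(torch.randn(256, 8, 30, 30, 30),
                        torch.randint(0, 2, (256,)))

# Main remedy: shrink the batch size (e.g. 8 instead of 64) so each batch fits
# in GPU memory alongside the model and its activations.
loader = DataLoader(dataset, batch_size=8, shuffle=True)
```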
- Check that your batch size is smaller than the smallest cluster of test data you have; otherwise the loader will try to fetch more cases than are available and crash.
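A quick way to verify this, assuming (as in DeepRank data files) that each top-level HDF5 group is one case and that the test files live under a hypothetical `test_data/` folder:

```python
import glob

import h5py

batch_size = 8  # the value you plan to use for testing

# Count the cases in each test HDF5 file; the batch size must not exceed the
# smallest of these counts.
for path in glob.glob("test_data/*.hdf5"):
    with h5py.File(path, "r") as f:
        n_cases = len(f.keys())
    print(f"{path}: {n_cases} cases")
    assert batch_size <= n_cases, f"batch_size={batch_size} exceeds {n_cases} cases in {path}"
```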
- Did you get a `ValueError: No avaiable training data after filtering` error? You might simply have entered the wrong data path, so no HDF5 file is found and there is no data to load.
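A minimal sanity check, assuming a hypothetical `database` path like the one you pass to DeepRank:

```python
import glob
import os

database = "/path/to/train/"  # hypothetical: the data path you pass to DeepRank

# If this prints 0, the error above most likely just means the path is wrong
# or the HDF5 files are stored elsewhere.
hdf5_files = glob.glob(os.path.join(database, "*.hdf5"))
print(f"Found {len(hdf5_files)} HDF5 file(s) in {database}")
```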
- Use the `num_workers` argument of the PyTorch `DataLoader` to have multiple workers pre-loading the data during training, especially if you train on GPU. In DeepRank, you specify it in `model.train()`. Do not assign more workers than the number of CPU cores you have available (on Snellius, 18 CPU cores per GPU card).
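A sketch in plain PyTorch (the dataset is a placeholder); the commented DeepRank line reflects this note, but the exact keyword name accepted by `train()` should be checked against your DeepRank version:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 8, 30, 30, 30),
                        torch.randint(0, 2, (256,)))

# Four worker processes pre-load and collate batches while the GPU trains,
# keeping it from idling between batches. Stay below the number of CPU cores
# reserved for your job (on Snellius, at most 18 per GPU card).
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)

# In DeepRank, the equivalent option is passed when calling train(); the
# keyword below is assumed from this note, e.g.:
# model.train(nepoch=50, train_batch_size=8, num_workers=4)
```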
- Do you want to set a shape (input or output) in your model as a fraction of a variable (e.g. `input_shape/2`)? You might encounter the following issue:
```
TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
```
To prevent it, write your shape with integer division (`//`) and `int()`, as in the following example:
```python
nn.Conv3d(input_shape[0], int(input_shape[0]//2), kernel_size=1)
```
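A self-contained sketch of the same fix (the grid shape below is hypothetical): `input_shape[0] / 2` would produce a float and trigger the `TypeError` above, while `int(input_shape[0] // 2)` keeps the channel count an integer.

```python
import torch
import torch.nn as nn

input_shape = (8, 30, 30, 30)  # hypothetical (channels, x, y, z) grid shape

# int(... // 2) yields an integer channel count; plain division would not.
conv = nn.Conv3d(input_shape[0], int(input_shape[0] // 2), kernel_size=1)

x = torch.randn(2, *input_shape)  # a batch of 2 grids
print(conv(x).shape)              # torch.Size([2, 4, 30, 30, 30])
```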
- You can check your GPU memory consumption in real time by submitting a job, connecting through ssh to the node running that job (nodename@surf.nl) and running:
```
nvidia-smi
```
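As a complementary check from inside the training script itself (assuming PyTorch), you can print how much GPU memory the process has allocated and reserved, for example once per epoch:

```python
import torch

if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"GPU memory: {allocated:.0f} MiB allocated, {reserved:.0f} MiB reserved")
```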