This repository was archived by the owner on Aug 29, 2023. It is now read-only.

Unable to train the model #69

@kagawa588

Hi,

Thanks for your great work! I recently tried to train the model myself, but moving the model from CPU to GPU takes a very long time (about an hour), and then training fails. Could you please give me some suggestions? Did I do something wrong?

Thanks in advance!

My environment is below:

sys.platform linux
Python 3.7.0 (default, Oct 9 2018, 10:31:47) [GCC 7.3.0]
numpy 1.21.5
detectron2 0.6 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/detectron2
Compiler GCC 7.3
CUDA compiler CUDA 10.2
detectron2 arch flags 3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.8.2 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/torch
PyTorch debug build False
GPU available Yes
GPU 0 NVIDIA GeForce RTX 3080 Laptop GPU (arch=8.6)
Driver version 510.60.02
CUDA_HOME /usr/local/cuda
Pillow 9.2.0
torchvision 0.9.2 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5
fvcore 0.1.5.post20220512
iopath 0.1.9
cv2 4.6.0
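
One thing that stands out in the listing is that the GPU reports arch=8.6 while the highest compiled arch flag is 7.5. Below is a minimal sketch of how that comparison could be checked. It assumes that precompiled GPU binaries only run natively on a device with the same major compute capability and an equal or higher minor revision, and it ignores any PTX fallback. The helper names `parse_cc` and `has_native_kernels` are hypothetical and not part of detectron2 or PyTorch. This is only a possible diagnostic, not a confirmed cause of the failure.

```python
def parse_cc(text):
    """Turn a compute-capability string such as '8.6' into a (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def has_native_kernels(device_cc, arch_flags):
    """Return True if some compiled arch has the device's major version
    and a minor revision that is not newer than the device's.

    Assumption: binary compatibility holds only within one major version.
    """
    dev_major, dev_minor = parse_cc(device_cc)
    for flag in arch_flags:
        major, minor = parse_cc(flag)
        if major == dev_major and minor <= dev_minor:
            return True
    return False


# Values copied by hand from the environment listing above.
gpu_arch = "8.6"
detectron2_flags = "3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5".split(",")
torchvision_flags = "3.5, 5.0, 6.0, 7.0, 7.5".split(",")

print("detectron2 native kernels:", has_native_kernels(gpu_arch, detectron2_flags))
print("torchvision native kernels:", has_native_kernels(gpu_arch, torchvision_flags))
```

With the values from this report both checks print False, since no compiled arch has major version 8. On a live machine the same inputs could be read from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` instead of being typed in.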


The error is below:

res4.9.conv3.norm.num_batches_tracked
res5.0.conv1.norm.num_batches_tracked
res5.0.conv2.norm.num_batches_tracked
res5.0.conv3.norm.num_batches_tracked
res5.0.shortcut.norm.num_batches_tracked
res5.1.conv1.norm.num_batches_tracked
res5.1.conv2.norm.num_batches_tracked
res5.1.conv3.norm.num_batches_tracked
res5.2.conv1.norm.num_batches_tracked
res5.2.conv2.norm.num_batches_tracked
res5.2.conv3.norm.num_batches_tracked
stem.conv1.norm.num_batches_tracked
stem.conv2.norm.num_batches_tracked
stem.conv3.norm.num_batches_tracked
stem.fc.{bias, weight}
[08/21 20:18:39 d2.engine.train_loop]: Starting training from iteration 0
ERROR [08/21 20:20:24 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
[08/21 20:20:24 d2.engine.hooks]: Total training time: 0:01:45 (0:00:00 on hooks)
[08/21 20:20:24 d2.utils.events]: iter: 0 lr: N/A max_mem: 5604M
Traceback (most recent call last):
  File "train_net.py", line 270, in <module>
    args=(args,),
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "train_net.py", line 258, in main
    return trainer.train()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
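
Since the failure happens in `losses.backward()` on a convolution, it may help to see whether a tiny convolution works at all outside of detectron2. The following is a rough sketch of such a smoke test, assuming only that PyTorch is installed. The function name `conv_smoke_test` and the tensor sizes are hypothetical choices for illustration and are not taken from the training script.

```python
def conv_smoke_test():
    """Run one small conv forward/backward pass on the GPU.

    Returns True if it succeeds, False if PyTorch raises a RuntimeError,
    and None if PyTorch or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    try:
        # A deliberately tiny layer and input, so memory should not be the limit.
        conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
        x = torch.randn(2, 3, 64, 64, device="cuda", requires_grad=True)
        conv(x).sum().backward()
        # Wait for the GPU so that asynchronous errors surface here.
        torch.cuda.synchronize()
        return True
    except RuntimeError as exc:
        print(f"convolution failed: {exc}")
        return False


print("conv smoke test:", conv_smoke_test())
```

If this small case also fails, the problem is probably in the PyTorch/CUDA installation rather than in the model or config. If it passes, the input size or batch size used for training would be the next thing to look at.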
