CUDA out of memory error after training for a few epochs #30

Hello,
I have run into a problem while training on the DOTA satellite image dataset and would appreciate your help.

### Environment
* OS: Linux gdp 5.15.0-107-generic #117~20.04.1-Ubuntu SMP Tue Apr 30 10:35:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
* GPU: NVIDIA GeForce RTX 4090
* Setup: the docker container suggested by [yolov7](https://github.com/WongKinYiu/yolov7)
  ```bash
  docker run --gpus all --name yolov7 -it -v my_danamicdet_path/:/DynamicDet --shm-size=64g nvcr.io/nvidia/pytorch:21.08-py3
  ```
  * Python version: 3.8.10

### Problem description
1. I first trained on a small subset of DOTA whose training split contains only 500 images, and that run completed successfully. The training command was
```bash
python train_step1.py --workers 8 --device 0 --batch-size 4 --epochs 300 --img 1024 --cfg cfg/dy-yolov7-step1.yaml --weight '' --data data/DOTA_1024_sample.yaml --hyp hyp/hyp.scratch.p5.yaml --name dy-yolov7-1024-sample-step1
```
The first few epochs recorded in the resulting `result.txt` file are shown below
```
     0/299     17.1G    0.1601    0.1461    0.0883    0.3945       202      1024         0         0         0         0    0.1026     0.131   0.04253
     1/299     17.4G    0.1549    0.1501    0.0829    0.3878       592      1024  0.001754 0.0002915 3.143e-05 9.429e-06    0.1035    0.1278   0.04169
     2/299     17.7G    0.1538    0.1459   0.07953    0.3792       170      1024  0.003952  0.004189 0.0004965 6.652e-05    0.1072    0.1166    0.0443
     3/299     17.4G    0.1521    0.1451   0.07527    0.3725        24      1024         0         0         0         0    0.1035     0.127   0.04011
     4/299     17.1G    0.1505    0.1401   0.06871    0.3593       138      1024  0.005832   0.01104 0.0006081 0.0001068    0.1004    0.1214   0.03939
     5/299     17.4G    0.1495    0.1234   0.06305     0.336       288      1024    0.7021   0.00276 0.0005829 0.0001146    0.1026    0.1027   0.03801
     6/299     17.4G    0.1507    0.1259    0.0612    0.3378       366      1024  0.005736   0.01074  0.001191 0.0002127    0.1075    0.1325   0.03855
```
2. When I then tried to train on the full dataset, however, the run failed. The training command was
```bash
python train_step1.py --workers 8 --device 0 --batch-size 4 --epochs 300 --img 1024 --cfg cfg/dy-yolov7-step1.yaml --weight '' --data data/DOTA_1024.yaml --hyp hyp/hyp.scratch.p5.yaml --name dy-yolov7-1024-step1
```
Training completed the first epoch, but a CUDA out of memory error was raised during the second epoch
![8e72cc5e4b127607a7165be4389f65a](https://github.com/VDIGPKU/DynamicDet/assets/102898272/79decadf-2449-4e3e-bc76-979225d59aa5)
The data recorded in the `result.txt` file for this run is shown below
```
     0/299     3.99G    0.1447   0.09345   0.07174    0.3098        29      1024  0.006743  0.009978  0.001146 0.0002416   0.09399   0.07602   0.03865
```
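To compare the two runs side by side, the GPU-memory column can be pulled out of the logged rows. This is a minimal sketch that assumes the second whitespace-separated field of each row is the reported GPU memory with a `G` suffix, as it appears in the rows above. `gpu_mem_gb` is a hypothetical helper written for this illustration, not part of the training code.

```python
# Rows copied verbatim from the two `result.txt` excerpts above.
subset_rows = [
    "     0/299     17.1G    0.1601    0.1461    0.0883    0.3945       202      1024         0         0         0         0    0.1026     0.131   0.04253",
    "     2/299     17.7G    0.1538    0.1459   0.07953    0.3792       170      1024  0.003952  0.004189 0.0004965 6.652e-05    0.1072    0.1166    0.0443",
]
full_rows = [
    "     0/299     3.99G    0.1447   0.09345   0.07174    0.3098        29      1024  0.006743  0.009978  0.001146 0.0002416   0.09399   0.07602   0.03865",
]


def gpu_mem_gb(row):
    """Return the GPU-memory field of one log row in GB.

    Assumption: it is the second field and ends with 'G' (e.g. '17.1G').
    """
    return float(row.split()[1].rstrip("G"))


# Largest reported value per run, over the rows listed here.
subset_peak = max(gpu_mem_gb(r) for r in subset_rows)
full_peak = max(gpu_mem_gb(r) for r in full_rows)
print(f"subset run: {subset_peak} GB, full run: {full_peak} GB")
```

For these rows the subset run reports roughly 17 GB, while the only logged epoch of the full run reports 3.99 GB, so the value logged for the first epoch does not by itself show the full run approaching the memory limit.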

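### Questions
1. As I understand it, GPU memory usage during training should depend only on the batch size. Both runs used the same batch size, and every image is 1024×1024, so why did the second run fail because of GPU memory?
2. Likewise, why did the first epoch of the second run complete while the second epoch failed?

I look forward to your reply. Thank you very much!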
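One factor that could be checked while waiting for an answer is whether the full dataset contains images with far more annotated objects than the 500-image subset. The sketch below counts objects per image, assuming YOLO-style labels (one `.txt` file per image, one object per non-empty line). The helper names and the label directory passed in are hypothetical, and whether object count actually drives memory use in this codebase is an assumption to verify, not a confirmed cause.

```python
from pathlib import Path


def objects_per_image(label_dir):
    """Map each label file name to its number of non-empty lines.

    Assumption: YOLO-style labels, one object per line.
    """
    counts = {}
    for path in sorted(Path(label_dir).glob("*.txt")):
        with path.open() as f:
            counts[path.name] = sum(1 for line in f if line.strip())
    return counts


def most_crowded(counts, top=5):
    """Return the `top` (file name, object count) pairs with the most objects."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Calling `most_crowded(objects_per_image(...))` once on the subset's label directory and once on the full dataset's label directory would show whether the full set has much denser images. If it does, that would be a lead worth reporting alongside the error.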
