Settings:
- benchmark: BERT Base
- one 32GB V100 GPU
- TensorFlow 1.15
- CUDA 10.0, cuDNN 7.6.5
The measured time is the average of 10 iterations.
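
As a rough sketch of how an average over 10 iterations can be taken, the helper below times a callable with a wall-clock timer. The names `average_iteration_ms` and `run_step`, and the optional `warmup` argument, are hypothetical and not from the benchmark code. The source does not say whether warm-up iterations were excluded, so `warmup` defaults to 0.

```python
import time


def average_iteration_ms(run_step, iterations=10, warmup=0):
    """Return the mean wall-clock time of run_step() in milliseconds.

    run_step is a hypothetical zero-argument callable that performs one
    training iteration and blocks until it has finished.
    """
    # Optional untimed iterations (an assumption, not stated in the source).
    for _ in range(warmup):
        run_step()

    start = time.perf_counter()
    for _ in range(iterations):
        run_step()
    elapsed_s = time.perf_counter() - start

    # Convert seconds to milliseconds and average over the timed iterations.
    return elapsed_s * 1000.0 / iterations


if __name__ == "__main__":
    # Stand-in workload; a real run would execute one training step here.
    print(average_iteration_ms(lambda: sum(range(100000))))
```
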

| method | iteration time (ms) | memory (GB) |
| --- | --- | --- |
| w/o optimization | 557.11 | 16.42 |
| recomputation (speed mode) | 1457.91 | 13.32 |
| recomputation (memory mode) | 704.9 | 7.43 |

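
To make the trade-off easier to read, the figures above can be expressed relative to the unoptimized run. The snippet below is only an arithmetic sketch over the reported numbers. The helper name `relative_to_baseline` is hypothetical and not part of the benchmark code.

```python
# Figures copied from the results above: (iteration time in ms, memory in GB).
results = {
    "w/o optimization": (557.11, 16.42),
    "recomputation (speed mode)": (1457.91, 13.32),
    "recomputation (memory mode)": (704.9, 7.43),
}


def relative_to_baseline(results, baseline="w/o optimization"):
    """Map each method to (time ratio vs. baseline, fraction of memory saved)."""
    base_ms, base_gb = results[baseline]
    summary = {}
    for method, (ms, gb) in results.items():
        summary[method] = (ms / base_ms, 1.0 - gb / base_gb)
    return summary


if __name__ == "__main__":
    for method, (slowdown, saved) in relative_to_baseline(results).items():
        print(f"{method}: {slowdown:.2f}x time, {saved:.1%} memory saved")
```

With these numbers, memory mode costs roughly 1.27x the baseline iteration time for about 55% less memory, while speed mode costs roughly 2.62x for about 19% less memory.
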
The code comes from google-research/bert, with a small modification to adopt gradient checkpointing.