Issue summary
I am training GoogLeNet (v1) on the ImageNet dataset with hipCaffe. For single-GPU (one MI25) training I use a batch size of 128. When I switch to multi-GPU training on four MI25s, the total GPU memory is 4x larger (16 GB x 4), so a batch size of 512 images/batch (128 images/batch/card) should fit. However, in my tests the batch size cannot be increased at all: even 192 (a multiple of 64) fails with "error: 'hipErrorMemoryAllocation' (1002)".
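For reference, the batch size is set in the data layer of the training prototxt. This is a minimal sketch of the relevant layer, assuming the stock models/bvlc_googlenet/train_val.prototxt layout (transform_param and LMDB paths abridged; they may differ from my exact files):

```
# models/bvlc_googlenet/train_val.prototxt -- TRAIN phase data layer (abridged)
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"  # path to my ImageNet LMDB
    backend: LMDB
    batch_size: 128  # works on one MI25; larger values on 4x MI25 hit hipErrorMemoryAllocation
  }
}
```

As far as I know, in upstream BVLC Caffe the prototxt batch_size is per GPU, so running with `-gpu 0,1,2,3` should give an effective batch of 4x this value; I assume hipCaffe inherits that behavior.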
Since the batch size is stuck at 128, by rough math the training time on the four MI25 cards will be about 3 to 3.5x longer than the training time on a 4x P100 system (batch_size = 512).
Are there any environment parameters I should set before training that would help enlarge the batch size for multi-GPU training?
I cross-checked on one of my NVIDIA 4x P100 servers, where the batch size can be increased as more cards are used. The batch sizes mentioned above are based on my experience training the same network on the same dataset with NVIDIA P100 (16 GB) and V100 (16 GB) GPUs.
Steps to reproduce
Use the bvlc_googlenet training network shipped under the hipCaffe installation path, with the ImageNet dataset downloaded from the official ImageNet website.
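Roughly the commands I run (a sketch; paths follow the standard Caffe layout under the hipCaffe install and may need adjusting):

```
# Single MI25: trains fine with batch_size 128 in train_val.prototxt
./build/tools/caffe train --solver=models/bvlc_googlenet/solver.prototxt --gpu=0

# 4x MI25: same solver/network; raising batch_size (even to 192) fails with
#   error: 'hipErrorMemoryAllocation' (1002)
./build/tools/caffe train --solver=models/bvlc_googlenet/solver.prototxt --gpu=0,1,2,3
```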
Your system configuration
Operating system: Ubuntu 16.04.3
Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
CUDA version (if applicable):
CUDNN version (if applicable):
BLAS: USE_ROCBLAS := 1
Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7.12
Other:
miopen-hip 1.1.4
miopengemm 1.1.5
rocm-libs 1.6.180
Server: Inventec P47
GPU: AMD MI25 x4
CPU: AMD EPYC 7601 x2
Memory: 512GB