This repository was archived by the owner on Apr 1, 2021. It is now read-only.

No significant change in iters/sec while comparing cpu vs gpu performance #138

@hemantranvir

Description

I have installed torch_tvm with CUDA/OpenCL support by enabling the following options in TVM's config.cmake:
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L32
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L129
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L132

I am trying to compare CPU vs. GPU performance by running the following benchmark: https://github.com/pytorch/tvm/blob/master/test/benchmarks.py
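
For context, the benchmark follows the usual trace/warm-up/measure pattern visible in the logs below. Here is a minimal sketch of that pattern; the model and input names are illustrative stand-ins, not the script's actual code. Only the torch_tvm.enable call and the 10-run warm-up/measure counts come from this report:

import time
import torch
import torch_tvm

# Illustrative stand-ins; the real benchmark traces its own model.
# The shapes mirror the dense workload (1, 512) x (125, 512) in the logs.
model = torch.nn.Linear(512, 125).eval()
inputs = (torch.rand(1, 512),)

torch_tvm.enable(opt_level=3)  # default CPU path; GPU variant shown below
traced = torch.jit.trace(model, inputs)

for _ in range(10):            # warm-up; triggers TVM compilation
    traced(*inputs)

runs = 10
start = time.time()
for _ in range(runs):
    traced(*inputs)
print("TVM: %f iter/s" % (runs / (time.time() - start)))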

  • CPU version:
$ CUDA_VISIBLE_DEVICES='' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 

Execution Log:

root@ccf26f0f9541:/opt/work/tvm/test# CUDA_VISIBLE_DEVICES='' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
Tracing model with JIT
Warming JIT up with 10 runs
Running JIT 10 times
Done benchmarking JIT
Tracing model with TVM
WARNING: reshape with -1 as the first value has known incompatibility with PyTorch semantics.
Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense', (1, 512, 'float32'), (125, 512, 8, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
[08:58:08] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:08] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:08] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/112))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (2 - (ax0.ax1.outer.fused.ax2.outer.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.outer.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.outer.fused/4))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((1 - (7 - ((ax0.ax1.outer.fused.ax2.outer.fused % 4)*2))) + 1) >= 0), when generating the post doubt loop
/usr/local/lib/python3.5/dist-packages/torch/jit/__init__.py:1030: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 255] (-0.710386335849762 vs. -0.7103500366210938) and 5 other locations (0.00%)
  check_tolerance, _force_outplace, True, _module_class)
Warming TVM up with 10 iters
Running TVM 10 times
Done benchmarking TVM, which compiled 100.00% of compute
JIT: 39.134256974191366 iter/s
TVM: 62.80919757107452 iter/s
root@ccf26f0f9541:/opt/work/tvm/test# 
  • GPU version:
    Edit line 39 of benchmarks.py to torch_tvm.enable(opt_level=3, device_type='cuda'), as shown in the snippet after the command below
$ CUDA_VISIBLE_DEVICES='0' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
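
For reference, the one-line change for the GPU run is just the enable call; device_type='cuda' is the only difference from the CPU run:

import torch_tvm

# Request the CUDA backend instead of the default CPU lowering.
torch_tvm.enable(opt_level=3, device_type='cuda')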

Execution Log:

root@ccf26f0f9541:/opt/work/tvm/test# CUDA_VISIBLE_DEVICES='0' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
Tracing model with JIT
Warming JIT up with 10 runs
Running JIT 10 times
Done benchmarking JIT
Tracing model with TVM
WARNING: reshape with -1 as the first value has known incompatibility with PyTorch semantics.
Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense', (1, 512, 'float32'), (125, 512, 8, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
[08:58:43] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:43] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:43] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:43] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/112))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (2 - (ax0.ax1.outer.fused.ax2.outer.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.outer.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.outer.fused/4))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((1 - (7 - ((ax0.ax1.outer.fused.ax2.outer.fused % 4)*2))) + 1) >= 0), when generating the post doubt loop
/usr/local/lib/python3.5/dist-packages/torch/jit/__init__.py:1030: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 255] (-0.710386335849762 vs. -0.7103500366210938) and 5 other locations (0.00%)
  check_tolerance, _force_outplace, True, _module_class)
Warming TVM up with 10 iters
Running TVM 10 times
Done benchmarking TVM, which compiled 100.00% of compute
JIT: 39.478923510188096 iter/s
TVM: 64.52328684937197 iter/s
root@ccf26f0f9541:/opt/work/tvm/test# 

As seen above, there is no significant difference in iter/s between the two runs:
CPU version: 62.80919757107452 iter/s
GPU version: 64.52328684937197 iter/s

If I check GPU memory usage with the nvidia-smi command, the GPU is in fact idle, which is consistent with the unchanged numbers.
Is there any other configuration necessary to enable GPU backend?

(Apart from setting set(USE_CUDA ON), set(USE_CUDNN ON), and set(USE_CUBLAS ON) in https://github.com/dmlc/tvm/blob/master/cmake/config.cmake,
and setting torch_tvm.enable(opt_level=3, device_type='cuda') in https://github.com/pytorch/tvm/blob/master/test/benchmarks.py.)
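
One sanity check that might narrow this down, sketched below under the assumption that the tvm Python package from the same build is importable (tvm.module.enabled and tvm.gpu(0).exist are the runtime-query APIs in TVM of this vintage):

import tvm

# Was the runtime actually compiled with CUDA? If this prints False,
# the set(USE_CUDA ON) edit did not make it into the library that
# torch_tvm loads.
print("cuda compiled in:", tvm.module.enabled("cuda"))

# Can the runtime see a CUDA device?
print("cuda device present:", tvm.gpu(0).exist)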
