This repository was archived by the owner on Apr 1, 2021. It is now read-only.

No significant change in iters/sec while comparing cpu vs gpu performance #138

@hemantranvir

Description

I have installed torch_tvm with CUDA/OpenCL support by enabling the following options in TVM's config.cmake:
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L32
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L129
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L132

I am trying to compare CPU vs. GPU performance by running the following benchmark: https://github.com/pytorch/tvm/blob/master/test/benchmarks.py
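
For context, the benchmark follows the usual trace/warm-up/measure pattern visible in the logs below. Here is a minimal sketch of that pattern; the model and input names are illustrative stand-ins, not the script's actual code. Only the torch_tvm.enable call and the 10-run warm-up/measure counts come from this report:

import time
import torch
import torch_tvm

# Illustrative stand-ins; the real benchmark traces its own model.
# The shapes mirror the dense workload (1, 512) x (125, 512) in the logs.
model = torch.nn.Linear(512, 125).eval()
inputs = (torch.rand(1, 512),)

torch_tvm.enable(opt_level=3)  # default CPU path; GPU variant shown below
traced = torch.jit.trace(model, inputs)

for _ in range(10):            # warm-up; triggers TVM compilation
    traced(*inputs)

runs = 10
start = time.time()
for _ in range(runs):
    traced(*inputs)
print("TVM: %f iter/s" % (runs / (time.time() - start)))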

  • CPU version:
$ CUDA_VISIBLE_DEVICES='' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 

Execution Log:

root@ccf26f0f9541:/opt/work/tvm/test# CUDA_VISIBLE_DEVICES='' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
Tracing model with JIT
Warming JIT up with 10 runs
Running JIT 10 times
Done benchmarking JIT
Tracing model with TVM
WARNING: reshape with -1 as the first value has known incompatibility with PyTorch semantics.
Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense', (1, 512, 'float32'), (125, 512, 8, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
[08:58:08] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:08] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:08] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/112))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (2 - (ax0.ax1.outer.fused.ax2.outer.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.outer.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.outer.fused/4))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((1 - (7 - ((ax0.ax1.outer.fused.ax2.outer.fused % 4)*2))) + 1) >= 0), when generating the post doubt loop
/usr/local/lib/python3.5/dist-packages/torch/jit/__init__.py:1030: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 255] (-0.710386335849762 vs. -0.7103500366210938) and 5 other locations (0.00%)
  check_tolerance, _force_outplace, True, _module_class)
Warming TVM up with 10 iters
Running TVM 10 times
Done benchmarking TVM, which compiled 100.00% of compute
JIT: 39.134256974191366 iter/s
TVM: 62.80919757107452 iter/s
root@ccf26f0f9541:/opt/work/tvm/test# 
  • GPU version:
    Edit line 39 of benchmarks.py to torch_tvm.enable(opt_level=3, device_type='cuda'), as shown in the snippet after the command below
$ CUDA_VISIBLE_DEVICES='0' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
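
For reference, the one-line change for the GPU run is just the enable call; device_type='cuda' is the only difference from the CPU run:

import torch_tvm

# Request the CUDA backend instead of the default CPU lowering.
torch_tvm.enable(opt_level=3, device_type='cuda')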

Execution Log:

root@ccf26f0f9541:/opt/work/tvm/test# CUDA_VISIBLE_DEVICES='0' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
Tracing model with JIT
Warming JIT up with 10 runs
Running JIT 10 times
Done benchmarking JIT
Tracing model with TVM
WARNING: reshape with -1 as the first value has known incompatibility with PyTorch semantics.
Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense', (1, 512, 'float32'), (125, 512, 8, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
[08:58:43] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:43] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:43] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:43] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/112))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (2 - (ax0.ax1.outer.fused.ax2.outer.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.outer.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.outer.fused/4))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((1 - (7 - ((ax0.ax1.outer.fused.ax2.outer.fused % 4)*2))) + 1) >= 0), when generating the post doubt loop
/usr/local/lib/python3.5/dist-packages/torch/jit/__init__.py:1030: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 255] (-0.710386335849762 vs. -0.7103500366210938) and 5 other locations (0.00%)
  check_tolerance, _force_outplace, True, _module_class)
Warming TVM up with 10 iters
Running TVM 10 times
Done benchmarking TVM, which compiled 100.00% of compute
JIT: 39.478923510188096 iter/s
TVM: 64.52328684937197 iter/s
root@ccf26f0f9541:/opt/work/tvm/test# 

As seen above, there is no significant difference in iter/s between the two runs:
CPU version: 62.80919757107452 iter/s
GPU version: 64.52328684937197 iter/s

If I check GPU memory usage with the nvidia-smi command, the GPU is in fact idle, which is consistent with the unchanged numbers.
Is there any other configuration necessary to enable GPU backend?

(Apart from setting set(USE_CUDA ON), set(USE_CUDNN ON), and set(USE_CUBLAS ON) in https://github.com/dmlc/tvm/blob/master/cmake/config.cmake,
and setting torch_tvm.enable(opt_level=3, device_type='cuda') in https://github.com/pytorch/tvm/blob/master/test/benchmarks.py.)
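
One sanity check that might narrow this down, sketched below under the assumption that the tvm Python package from the same build is importable (tvm.module.enabled and tvm.gpu(0).exist are the runtime-query APIs in TVM of this vintage):

import tvm

# Was the runtime actually compiled with CUDA? If this prints False,
# the set(USE_CUDA ON) edit did not make it into the library that
# torch_tvm loads.
print("cuda compiled in:", tvm.module.enabled("cuda"))

# Can the runtime see a CUDA device?
print("cuda device present:", tvm.gpu(0).exist)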
