@nicolasvasilache commented Jan 14, 2021

…ng MLIR

Prerequisites:
==============

First, `export MLIR_SOURCE_DIR=...`

```
(mkdir -p ${MLIR_SOURCE_DIR}/../build && \
 cd ${MLIR_SOURCE_DIR}/../build && \
 cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS="mlir" \
   -DBUILD_SHARED_LIBS=ON -DLLVM_BUILD_LLVM_DYLIB=1 -DMLIR_LINK_MLIR_DYLIB=1 \
   -DLLVM_BUILD_EXAMPLES=OFF -DLLVM_TARGETS_TO_BUILD="X86" -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ && \
 cmake --build . --target MLIR check-mlir)
```
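
As an optional sanity check (the `bin/` and `lib/` paths below are simply the standard LLVM build-tree layout, not something this setup defines), the build tree should now contain `mlir-opt` and the MLIR libraries:

```
# Optional sanity check of the prerequisite build; paths follow the standard LLVM build layout.
ls ${MLIR_SOURCE_DIR}/../build/bin/mlir-opt
ls ${MLIR_SOURCE_DIR}/../build/lib | head
```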

Codegen:
========

```
MLIR_DIR=${MLIR_SOURCE_DIR}/../build cmake -GNinja \
  -DCMAKE_CXX_COMPILER=clang++-11 -DCMAKE_C_COMPILER=clang-11 \
  -DMLIR_SOURCE=${MLIR_SOURCE_DIR} -DUSE_MKL=OFF -DMLIR_BUILD=${MLIR_SOURCE_DIR}/../build/lib \
  -B build ./Codegen/matmul && \
cmake --build build
```

Benchmark:
==========

```
rm -f build/matmul_* && cmake --build build --target matmul-compile; \
for f in $(find build/ -maxdepth 1 -executable -type f | sort --version-sort); do $f; done; \
ls *out | sort --version-sort | xargs tail -n 1
```
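
Each run writes a `*_mlir_perf.out` file in the current directory, so a single configuration can be checked on its own with the same `tail` invocation (the file name below is one of those listed in the results):

```
# Inspect one configuration instead of the whole sweep; file name as produced by the run above.
tail -n 1 matmul_1024x1024x1024_mlir_perf.out
```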

Results (on my machine, peak ~96 GFLOPS double precision):
==========================================================

==> matmul_18x32x96_mlir_perf.out <==
32.44 GFLOPS

==> matmul_24x64x96_mlir_perf.out <==
33.86 GFLOPS

==> matmul_24x64x512_mlir_perf.out <==
40.66 GFLOPS

==> matmul_48x64x128_mlir_perf.out <==
42.69 GFLOPS

==> matmul_192x64x128_mlir_perf.out <==
41.60 GFLOPS

==> matmul_192x128x128_mlir_perf.out <==
36.87 GFLOPS

==> matmul_192x256x256_mlir_perf.out <==
34.32 GFLOPS

==> matmul_384x256x256_mlir_perf.out <==
35.13 GFLOPS

==> matmul_480x512x256_mlir_perf.out <==
30.80 GFLOPS

==> matmul_1020x1152x1152_mlir_perf.out <==
12.49 GFLOPS

==> matmul_1024x1024x1024_mlir_perf.out <==
35.26 GFLOPS

==> matmul_2304x2304x2560_mlir_perf.out <==
24.42 GFLOPS
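
As a rough back-of-the-envelope check on these numbers (assuming, since it is not spelled out above, the usual convention of 2*M*N*K flops per matmul), the reported rate for the 1024x1024x1024 case implies a runtime of about 61 ms:

```
# Back-of-envelope sketch (assumption: GFLOPS = 2*M*N*K / runtime).
# Implied runtime in milliseconds for the 1024x1024x1024 case at 35.26 GFLOPS.
echo "2 * 1024^3 * 1000 / (35.26 * 10^9)" | bc -l   # ~60.9
```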

Notes:
======

1. The ODM numbers were with F32; good register/tile sizes still need to be explored for F64.
2. Fixed some issues that were preventing AVX512 from being used; a few more compiler-flag tweaks may still be needed.
3. There seem to be some core MLIR regressions: manually trying different tile sizes can produce code that segfaults.
4. MLIR OSS lacks the hoistings that were used internally; linalg on tensors is a better abstraction for this but is still WIP.
5. MLIR OSS lacks the full/partial splitting + outlining strategies that were used internally.
