SparseFlow is a next-generation MLIR-based compiler that detects and exploits generalized structured sparsity (N:M) in AI workloads.
Unlike traditional sparse libraries (limited to 2:4 or fully unstructured), SparseFlow supports any N:M block pattern and achieves up to 20× CPU speedups by combining compile-time analysis with custom sparse kernels.
Supports the following patterns out of the box:
- 1:4
- 2:4
- 2:8
- 4:16
- 8:32
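Here N:M means exactly N non-zero values in every block of M consecutive weights (detailed further below). As a quick illustration, a check that a weight row follows a given pattern might look like the sketch below; the function name `is_nm_sparse` is illustrative only and not part of the SparseFlow API:

```cpp
#include <cstddef>

// Returns true when every complete block of m consecutive weights
// in `row` contains exactly n non-zero values (e.g. n=2, m=8 for 2:8).
bool is_nm_sparse(const float* row, std::size_t len, int n, int m) {
    for (std::size_t block = 0; block + m <= len; block += m) {
        int nonzeros = 0;
        for (int j = 0; j < m; ++j)
            if (row[block + j] != 0.0f) ++nonzeros;
        if (nonzeros != n) return false;
    }
    return true;
}
```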
The compiler provides:
- SPA Pass – Static sparsity analysis
- Rewrite Pass – Converts dense matmuls → sparse kernels
- Export Pass – Dumps metadata
- Pluggable runtime lowering

The runtime provides:
- 5 hand-tuned OpenMP kernels
- Contiguous block loads
- Branch-free inner loops
- High cache locality
- Designed for future SIMD + GPU backends
SparseFlow achieves 9×–20× speedups on CPU for realistic matrix sizes, significantly outperforming typical sparse CPU libraries.
Full dense vs N:M sparse CPU benchmark tables (all patterns and sizes) are in:
Benchmarks compare dense vs SparseFlow sparse kernels on CPU.
| Matrix Size | Typical Speedup | Peak Speedup |
|---|---|---|
| 256×256 | 3×–8× | 8× |
| 512×512 | 8×–12× | 12× |
| 1024×1024 | 9×–20× | 20× |
Stable patterns frequently hit:
- 1:4 → ~18×
- 2:8 → ~18×
- 4:16 → ~20×
These numbers are based on multiple runs and exclude outlier spikes.
Matrix Size: 1024×1024

| Pattern | Dense (ms) | Sparse (ms) | Speedup | Density |
|---|---|---|---|---|
| 1:4 | 12618.09 | 670.56 | 18.82× | 25% |
| 2:4 | 14662.58 | 1626.62 | 9.01× | 50% |
| 2:8 | 13843.85 | 769.59 | 17.99× | 25% |
| 4:16 | 10886.07 | 544.07 | 20.01× | 25% |
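Density here is simply N/M: 1:4, 2:8, and 4:16 all keep 25% of the weights (1/4 = 2/8 = 4/16), while 2:4 keeps 50%.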
SparseFlow transforms dense MLIR into sparse-optimized executable code:
PyTorch / ONNX → MLIR → SPA Pass → Rewrite Pass → LLVM → Sparse Runtime
The SPA Pass identifies sparse regions and marks tensors with {n, m} metadata.
The Rewrite Pass replaces `linalg.matmul` with:

    func.call @sparse_matmul_N_M(...)

selecting the correct sparse kernel for the tensor's N:M pattern.
Backed by optimized C++/OpenMP kernels:

    sparse_matmul_1_4
    sparse_matmul_2_4
    sparse_matmul_2_8
    sparse_matmul_4_16
    sparse_matmul_8_32

A pattern N:M means:
- For every M consecutive weights
- Exactly N are non-zero
- Zeros are static at compile time
- Blocks are memory contiguous
This allows:
- Predictable skipping
- SIMD-friendly loads
- Low branch divergence
- Great cache efficiency
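To make the contiguous block loads and branch-free inner loops concrete, here is a minimal C++/OpenMP sketch of a 2:8 kernel. The compressed layout (two values plus two in-block indices per 8-wide block), the function name, and the signature are assumptions for illustration only; they are not the actual SparseFlow runtime ABI:

```cpp
#include <cstddef>
#include <vector>

// Sketch of a 2:8 sparse-times-dense matmul: C (MxN) += A_sparse (MxK) * B (KxN).
// `values` holds the two surviving weights of each 8-wide block of A, row-major;
// `idx` holds their positions (0-7) inside the block. C must be zero-initialized.
// Layout and naming are illustrative, not the SparseFlow runtime ABI.
void sparse_matmul_2_8_sketch(const std::vector<float>& values,
                              const std::vector<int>& idx,
                              const std::vector<float>& B,
                              std::vector<float>& C,
                              std::size_t M, std::size_t K, std::size_t N) {
    const std::size_t blocks_per_row = K / 8;  // K assumed divisible by 8
    #pragma omp parallel for
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t b = 0; b < blocks_per_row; ++b) {
            // Two contiguous loads per block; no per-element zero test.
            const std::size_t p = (i * blocks_per_row + b) * 2;
            const float a0 = values[p], a1 = values[p + 1];
            const std::size_t k0 = b * 8 + idx[p], k1 = b * 8 + idx[p + 1];
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a0 * B[k0 * N + j] + a1 * B[k1 * N + j];
        }
    }
}
```

Because the zero positions are fixed before execution, each block contributes exactly two multiply-adds and two contiguous loads, so the inner loop has no data-dependent branches.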
Before:

    %A = tensor<16x16xf32> {n = 2 : i32, m = 8 : i32}
    %B = tensor<16x16xf32>
    %C = tensor<16x16xf32>
    %0 = linalg.matmul ins(%A, %B)

After:

    func.call @sparse_matmul_2_8(%A, %B, %C, %m, %k, %n)

Build the compiler:

    git clone https://github.com/MapleSilicon/SparseFlow
    cd SparseFlow/compiler
    mkdir build && cd build
    cmake -DCMAKE_PREFIX_PATH=/usr/lib/llvm-19 ..
    make -j8

Run the CPU benchmark:

    cd ../../runtime/build
    ./benchmark_nm_runtime

Roadmap:
- CUDA kernels
- Tensor Core support
- 30×–60× expected speedup
- Python bindings
- `torch.compile` backend
- Model zoo support
- Cloud provider pilots
- Enterprise safety and tooling
Email: maplesilicon1@gmail.com
GitHub: https://github.com/MapleSilicon/SparseFlow
Author: Gourav Kumar
Generalized Sparse Compute for AI.
Simple. Fast. Open.