
🌲 SparseFlow v0.2.0

Generalized N:M Sparse Compiler for AI Inference (MLIR + CPU Runtime)

SparseFlow is a next-generation MLIR-based compiler that detects and exploits generalized structured sparsity (N:M) in AI workloads.

Unlike traditional sparse libraries (typically limited to 2:4 or fully unstructured sparsity), SparseFlow supports any N:M block pattern and reaches up to 20× CPU speedups (see the benchmarks below) through compile-time analysis plus custom sparse kernels.


🚀 Key Features (v0.2.0)

✅ Generalized N:M Sparsity

Supports the following patterns out of the box:

  • 1:4
  • 2:4
  • 2:8
  • 4:16
  • 8:32
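
Since an N:M pattern keeps exactly N of every M consecutive weights, these correspond to densities of 25% (1:4), 50% (2:4), and 25% (2:8, 4:16, 8:32).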

✅ MLIR Compiler Integration

  • SPA Pass – Static sparsity analysis
  • Rewrite Pass – Converts dense matmuls → sparse kernels
  • Export Pass – Dumps metadata
  • Pluggable runtime lowering

✅ Optimized CPU Runtime

  • 5 hand-tuned OpenMP kernels
  • Contiguous block loads
  • Branch-free inner loops
  • High cache locality
  • Designed for future SIMD + GPU backend

✅ Real Performance

SparseFlow achieves 9×–20× speedups on CPU for realistic matrix sizes, significantly outperforming typical sparse CPU libraries.

📊 Detailed Benchmarks

Full dense vs N:M sparse CPU benchmark tables (all patterns and sizes) are in the benchmark section below.


📊 Benchmark Results (REAL HARDWARE)

Benchmarks compare dense vs SparseFlow sparse kernels on CPU.

| Matrix Size | Typical Speedup | Peak Speedup |
|-------------|-----------------|--------------|
| 256×256     | 3×–8×           | 8×           |
| 512×512     | 8×–12×          | 12×          |
| 1024×1024   | 9×–20×          | 20×          |

Stable patterns frequently hit:

  • 1:4 → ~18×
  • 2:8 → ~18×
  • 4:16 → ~20×

These numbers are based on multiple runs and exclude outlier spikes.


🧪 Example Benchmark Output

Matrix Size: 1024×1024
┌─────────┬────────────┬────────────┬──────────┬───────────┐
│ Pattern │ Dense (ms) │ Sparse (ms)│ Speedup  │ Density   │
├─────────┼────────────┼────────────┼──────────┼───────────┤
│ 1:4     │ 12618.09   │ 670.56     │ 18.82×   │ 25%       │
│ 2:4     │ 14662.58   │ 1626.62    │ 9.01×    │ 50%       │
│ 2:8     │ 13843.85   │ 769.59     │ 17.99×   │ 25%       │
│ 4:16    │ 10886.07   │ 544.07     │ 20.01×   │ 25%       │
└─────────┴────────────┴────────────┴──────────┴───────────┘
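
Speedup here is dense time divided by sparse time: for the 1:4 row, 12618.09 ms / 670.56 ms ≈ 18.82×. Density is N/M (e.g., 1/4 = 25%).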

πŸ— Compiler Pipeline

SparseFlow transforms dense MLIR into sparse-optimized executable code:

PyTorch / ONNX → MLIR → SPA Pass → Rewrite Pass → LLVM → Sparse Runtime

1. SPA Pass

Identifies sparse regions and marks tensors with {n, m} metadata.

2. Rewrite Pass

Replaces linalg.matmul with:

func.call @sparse_matmul_N_M(...)

The pass selects the kernel matching each tensor's annotated N:M pattern at compile time.

3. Runtime

Backed by optimized C++/OpenMP kernels:

sparse_matmul_1_4
sparse_matmul_2_4
sparse_matmul_2_8
sparse_matmul_4_16
sparse_matmul_8_32
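
As a rough sketch of what one of these kernels could look like (the compressed layout, names, and signature below are illustrative assumptions, not SparseFlow's actual ABI):

```cpp
#include <cstdint>

// Hypothetical N:M sparse x dense matmul. A is stored compressed: each row
// holds (K/M) blocks of exactly N nonzeros, as parallel arrays of values
// and dense column indices. C must be zero-initialized by the caller.
void sparse_matmul_n_m(const float* a_vals,    // rows * (K/M) * N values
                       const int32_t* a_cols,  // matching column indices in [0, K)
                       const float* B,         // dense K x cols, row-major
                       float* C,               // rows x cols, row-major
                       int rows, int K, int cols, int N, int M) {
    const int nnz_per_row = (K / M) * N;  // fixed by the pattern
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        const float*   vals = a_vals + (int64_t)r * nnz_per_row;
        const int32_t* idx  = a_cols + (int64_t)r * nnz_per_row;
        float*        c_row = C + (int64_t)r * cols;
        for (int z = 0; z < nnz_per_row; ++z) {
            const float  v     = vals[z];
            const float* b_row = B + (int64_t)idx[z] * cols;
            for (int c = 0; c < cols; ++c)  // contiguous, SIMD-friendly traversal
                c_row[c] += v * b_row[c];
        }
    }
}
```

Because the nonzero count per row is fixed by the pattern, the inner loops contain no data-dependent branches, which is what makes the "branch-free inner loops" property above possible.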

🧩 Supported Sparsity Patterns

A pattern N:M means:

  • For every M consecutive weights
  • Exactly N are non-zero
  • Zeros are static at compile time
  • Blocks are memory contiguous

This allows:

  • Predictable skipping
  • SIMD-friendly loads
  • Low branch divergence
  • Great cache efficiency
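
To make the definition concrete, here is a toy magnitude-based pruner that enforces an N:M pattern (illustrative only; this is not SparseFlow's pruning code): it keeps the N largest-magnitude weights in each block of M and zeroes the rest, so density lands at exactly N/M.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Enforce an N:M pattern on a dense weight row: in every block of M
// consecutive weights, keep the N with the largest magnitude, zero the rest.
void prune_to_n_m(std::vector<float>& w, int N, int M) {
    for (std::size_t b = 0; b + M <= w.size(); b += M) {
        std::vector<int> order(M);
        std::iota(order.begin(), order.end(), 0);
        // Rank positions within the block by descending magnitude.
        std::sort(order.begin(), order.end(), [&](int i, int j) {
            return std::fabs(w[b + i]) > std::fabs(w[b + j]);
        });
        // Zero everything but the N largest.
        for (int k = N; k < M; ++k) w[b + order[k]] = 0.0f;
    }
}
```

For example, with N = 2, M = 8 the row {0.9, 0.1, 0.0, 0.7, 0.2, 0.05, 0.3, 0.1} keeps only 0.9 and 0.7, giving 25% density.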

🔬 Example MLIR Input

%A = tensor<16x16xf32> {n = 2 : i32, m = 8 : i32}
%B = tensor<16x16xf32>
%C = tensor<16x16xf32>

%0 = linalg.matmul ins(%A, %B : tensor<16x16xf32>, tensor<16x16xf32>)
                   outs(%C : tensor<16x16xf32>) -> tensor<16x16xf32>

After Rewrite Pass:

func.call @sparse_matmul_2_8(%A, %B, %C, %m, %k, %n)
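
The trailing %m, %k, %n arguments presumably carry the matmul dimensions, so a single runtime kernel per pattern can serve all shapes.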

📦 Build Instructions

git clone https://github.com/MapleSilicon/SparseFlow
cd SparseFlow/compiler
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=/usr/lib/llvm-19 ..
make -j8
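
The -DCMAKE_PREFIX_PATH value assumes LLVM/MLIR 19 installed under /usr/lib/llvm-19 (e.g., from distribution packages); point it at your own LLVM build if it lives elsewhere.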

Run benchmarks

cd ../../runtime/build
./benchmark_nm_runtime

🗺 Roadmap

v0.3 (Q1 2026) – GPU Acceleration

  • CUDA kernels
  • Tensor Core support
  • 30–60× expected speedup

v0.4 (Q2 2026) – PyTorch Integration

  • Python bindings
  • torch.compile backend
  • Model zoo support

v0.5 (Q3 2026) – Production Deployment

  • Cloud provider pilots
  • Enterprise safety and tooling

🤝 Contact

Email: maplesilicon1@gmail.com
GitHub: https://github.com/MapleSilicon/SparseFlow
Author: Gourav Kumar


🌲 SparseFlow

Generalized Sparse Compute for AI.
Simple. Fast. Open.

