minfer-python


Overview:

Introduction:

Benchmarking inference on decoder-only LLMs. Kernels are written with CUDA as the underlying backend.

The arithmetic intensity of certain operations common to deep learning, matrix multiplication in particular, scales more rapidly with problem size than the amount of data moved. Since inference is typically a memory-bound task, this justifies a careful approach to optimizing their kernel implementations, and the techniques used to do so are specific to the GPU device at hand. I therefore optimized these kernels specifically for the NVIDIA L40S GPU (the Hopper GPUs were in frequent use).
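As a back-of-the-envelope illustration of that scaling: multiplying an M x K matrix by a K x N matrix performs 2*M*N*K FLOPs while moving roughly 2*(M*K + K*N + M*N) bytes at half precision. For square matrices of dimension n, that is 2n^3 FLOPs against 6n^2 bytes, so arithmetic intensity grows like n/3 FLOPs per byte; small or skinny shapes (e.g. the matrix-vector products of batch-1 decoding) remain memory-bound.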

NVIDIA GPUs differ in supported instructions and hardware capabilities, which makes it difficult to judge whether you will observe the same performance on a different device, and hand-optimizing kernels for other NVIDIA GPUs is currently beyond the scope of this project. That said, these kernels will likely also perform well on the Ampere architecture GPUs, which have e.g. similar compute capabilities and opt-in shared memory limits. Regardless, if you are using a device with a different compute capability, change setup.py to reflect this; the snippet below can help you check what your device reports.
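A minimal standalone CUDA program (not part of this repo) that queries the compute capability and opt-in shared memory limit of your device; compile it with nvcc and run it on the target node:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query device 0; an L40S should report compute capability 8.9.
    cudaGetDeviceProperties(&prop, 0);
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    std::printf("Opt-in shared memory per block: %zu bytes\n",
                prop.sharedMemPerBlockOptin);
    return 0;
}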

Notes, especially if you are trying this out or forking the project for local development on a cluster:

  • The environment variable TORCH_EXTENSIONS_DIR determines where the compiled kernels are stored. You will want to point this at a local disk (see the example after this list).
  • In general, avoid building the project on compute nodes that are isolated from internet access, e.g. on some SLURM cluster setups. You would have to build without build isolation, which means many slow transactions over the network filesystem if the cluster uses Lustre (large file block sizes). Moreover, since the environment would be too slow to copy over to a local disk (environment managers make many small file transactions), dependencies would also have to be fetched through the login node at runtime, not just at compile time. If your cluster is set up this way, consider moving to an NFS/NFSv4 cluster with fast scratch storage and smaller block sizes, and additionally ensure that the compute nodes are connected to the internet. The project will then be fast to compile and fast at runtime.
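For example, you might point the extension cache at node-local or fast scratch storage (the path here is illustrative):

export TORCH_EXTENSIONS_DIR=/scratch/$USER/torch-extensions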

For more information regarding installation, refer to setup.py and setup.sh. The setup script will have to be adjusted to match your system configuration.

Requirements:

  • Python >= 3.12
  • uv for env management
  • At least 1 NVIDIA GPU with driver supporting CUDA 12.4+ (see intro)

Quick Start:

To install the package in editable mode:

[DEBUG=1] uv sync --group dev # use testing suite

To verify that the setup works on your system, run the unit tests at test/kernels.py with pytest:

pytest test/kernels.py

To benchmark the kernels against the corresponding PyTorch implementations:

python apps/benchmark.py

To profile kernels with Nsight Compute (NCU):

ncu --set full python apps/profile.py

Results

The timing benchmarks measure performance relative to the PyTorch implementations, where we are essentially competing against highly optimized vendor libraries such as cuBLAS and/or CUTLASS where applicable. For these benchmarks, I do not have sudo privileges on the cluster to lock the GPU or memory clocks [1]; in my case that is only possible through Nsight Compute, so that is what we use to profile the kernels.
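For reference, on a machine where you do have root access, clocks can be pinned before timing with nvidia-smi (the clock values are placeholders; query the supported frequencies first):

nvidia-smi -q -d SUPPORTED_CLOCKS        # list supported clock frequencies
sudo nvidia-smi -i 0 --lock-gpu-clocks=<min,max>
sudo nvidia-smi -i 0 --lock-memory-clocks=<min,max>
sudo nvidia-smi -i 0 --reset-gpu-clocks  # undo after benchmarking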

Matrix Multiplication (HGEMM)

All measurements reported are medians over repeated trials.

Kernel                      | Throughput (TFLOPS) | PyTorch HGEMM Throughput (TFLOPS) | % of PyTorch HGEMM Throughput | Speedup vs. Baseline
Baseline (Basic Tiling)     | 19.97               | 165.05                            | 12.1%                         | 1x
Unroll/vectorize shmem load | 120.77              | 162.97                            | 74.1%                         | 6x

Remarks

The most substantial progress came from referencing NVIDIA's cuda-samples repo [2] for their WMMA HGEMM implementation. Wanting to understand the functionality underlying the WMMA API, I eventually searched out an HGEMM optimization blog post that makes use of the lower-level MMA API [3], and a related blog diving into SGEMM optimization [1].
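For context, the WMMA usage pattern looks roughly like the following simplified sketch, a generic warp-per-tile kernel in the spirit of the cuda-samples example, not the kernel in this repo. Each warp accumulates one 16x16 output tile from 16x16x16 half-precision fragments:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Each warp computes one 16x16 tile of C = A * B (row-major, fp16 in, fp32 out).
// Assumes M, N, K are multiples of 16, blockDim.x is a multiple of 32, and the
// launch supplies one warp per output tile.
__global__ void wmma_hgemm_naive(const half* A, const half* B, float* C,
                                 int M, int N, int K) {
    int warpM = blockIdx.y;                                         // tile row
    int warpN = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize; // tile col
    if (warpM * 16 >= M || warpN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Cooperative per-warp fragment loads, then one tensor-core MMA.
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}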

We start with the baseline implementation from the former blog post [3]. Some of the subsequent improvements differ due to e.g. the introduction of an explicitly supported asynchronous memcpy from global to shared memory (starting with the Ampere architecture), differences in opt-in shared memory and supported matrix fragment sizes, register capacity on each streaming multiprocessor, etc. Of course, optimizations such as avoiding shared memory bank conflicts [4] via swizzling [3], or vectorized memory transactions, look quite similar in approach; a sketch of the asynchronous copy follows below.
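To illustrate the Ampere-and-later asynchronous global-to-shared copy, here is a minimal self-contained sketch using the CUDA pipeline primitives (kernel and variable names are illustrative, not taken from this repo):

#include <cuda_pipeline.h>

// Stage one float4 (16 bytes) per thread into shared memory with cp.async,
// bypassing the register file, then read it back out (sm_80 and later).
__global__ void async_tile_copy(const float4* __restrict__ src,
                                float4* __restrict__ dst, int n) {
    __shared__ float4 tile[256];  // assumes blockDim.x == 256
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        __pipeline_memcpy_async(&tile[threadIdx.x], &src[idx], sizeof(float4));
    }
    __pipeline_commit();       // close the current async copy group
    __pipeline_wait_prior(0);  // wait until that group has landed in shared memory
    __syncthreads();
    if (idx < n) dst[idx] = tile[threadIdx.x];
}

In a real GEMM pipeline one would commit several such copy groups and overlap them with tensor-core work (double or multi-buffering) rather than waiting immediately.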

Acknowledgments

All of the work here was developed on a university SLURM cluster. Thanks to MIT ORCD for access to substantial compute resources.

Thanks to PyTorch for the C++/CUDA extension mechanism [5], a tool without which it would have been difficult to fold kernel optimization work into an inference engine short of writing an entirely separate tensor backend from scratch.

References

  1. Advanced Matrix Multiplication Optimization on NVIDIA GPUs

  2. NVIDIA cuda-samples

  3. How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores

  4. Notes About Nvidia GPU Shared Memory Banks

  5. C++/CUDA Extensions in PyTorch
