This repository is designed to simplify your introduction to CUDA kernel development by providing a ready-to-use VSCode setup. With it, you can both profile your kernels and debug them directly from the VSCode editor, so you can dive into online tutorials immediately without wrestling with your toolchain first.
- Build system: CMake (tested with version 3.28.3)
- Tested with CUDA 13.0 and Python 3.12.3
This repository contains several CUDA implementations of an SGEMM kernel, inspired by the tutorials by siboehm and leimao. The triton folder contains examples of how to tune a GEMM kernel with Triton.
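For orientation, the sketch below shows what a naive SGEMM kernel can look like in CUDA: one thread per output element, computing C = alpha * A * B + beta * C in row-major layout. This is only an illustrative baseline with assumed names, not one of the kernels shipped in this repository.

```cuda
// Illustrative naive SGEMM: C = alpha * A * B + beta * C (row-major).
// One thread computes one element of C; no tiling or shared memory.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
    const int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    const int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
```

The optimized variants covered by those tutorials improve on such a baseline with techniques like shared-memory tiling, register blocking, and warptiling to reduce redundant global-memory traffic.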
The pybind folder contains an example of how to invoke a CUDA kernel written in C++ from Python.
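As a rough, hypothetical sketch of that pattern (module, function, and kernel names below are placeholders and not the ones used in the pybind folder), a pybind11 binding compiled with nvcc can look like this:

```cuda
// Hypothetical pybind11 binding around a CUDA kernel (compile with nvcc).
// Names are placeholders; see the pybind folder for the actual example.
#include <cstdint>
#include <cuda_runtime.h>
#include <pybind11/pybind11.h>

__global__ void scale_kernel(float *x, float a, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;  // scale one element per thread
}

// Host launcher callable from Python; takes a raw device pointer as an integer.
void scale(std::uintptr_t x_device_ptr, float a, int n) {
    float *x = reinterpret_cast<float *>(x_device_ptr);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(x, a, n);
    cudaDeviceSynchronize();
}

PYBIND11_MODULE(cuda_ext, m) {
    m.def("scale", &scale, "Scale a float array on the GPU in place");
}
```

From Python, such a compiled module is imported like any other extension (e.g., `import cuda_ext`), with the device pointer typically obtained from a framework tensor or a CUDA array library.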
- Make sure you have all necessary VSCode extensions:
  - C/C++ & C/C++ Extension Pack
  - CMake & CMake Tools
  - (Kernel profiling) Nsight Visual Studio Code Edition
  - (For Python examples only) Python & Python Extension Pack
  - Clang-format (by X. Hellauer) for C++ formatting
  - Yapf for Python formatting
- Adapt the paths in the settings.json file.
- Select the build variant (Release or Debug): (F1) -> (CMake: Select Variant)
- Configure + Build + Install the executable: (F1) -> (CMake: Install)
- You should now be able to see the binary called `sgemm` in the build or release folder, depending on the variant.
The run and debug configurations can be found in the launch.json file.
Adapt the paths in the launch.json file.
To just run the kernel (release version), select:
- (F1) -> (Debug: Select and Start Debugging) -> (Run kernel)
To set breakpoints in VSCode to debug the host code and/or the GPU code, select:
- (F1) -> (Debug: Select and Start Debugging) -> (Debug kernel)
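For reference, these two configurations typically follow the pattern sketched below: a plain C++ launch for running the release binary, and a CUDA-aware launch (debugger type cuda-gdb, provided by the Nsight Visual Studio Code Edition extension) for stepping through device code. This is an assumed sketch with placeholder program paths; the launch.json shipped in this repository is authoritative.

```jsonc
{
  "version": "0.2.0",
  "configurations": [
    {
      // Assumed sketch: run the release binary under the regular C++ debugger.
      "name": "Run kernel",
      "type": "cppdbg",
      "request": "launch",
      "program": "${workspaceFolder}/build/release/bin/sgemm",
      "cwd": "${workspaceFolder}"
    },
    {
      // Assumed sketch: debug host and device code with cuda-gdb (Nsight extension).
      "name": "Debug kernel",
      "type": "cuda-gdb",
      "request": "launch",
      "program": "${workspaceFolder}/build/debug/bin/sgemm"
    }
  ]
}
```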
To collect meaningful performance metrics, you should always profile the release version of your kernel.
By default, NVIDIA’s profiler (ncu) requires elevated (root) privileges to access GPU performance counters.
To allow all users to run ncu without invoking sudo, NVIDIA describes a permanent, non-root workaround here.
- Follow the steps on the website if you wish to continue without sudo.
- (F1) -> (Tasks: Run task) -> (Profile SGEMM with Nsight {sudo/ no sudo})
- Enter the kernel name, e.g., `sgemm_simple`.
- Select the section you want to profile.
- Enter the sudo password in the terminal in VSCode.
```bash
mkdir -p build/{debug/release}/build && cd build/{debug/release}/build
cmake \
  -DCMAKE_BUILD_TYPE={Debug/Release} \
  -DCMAKE_INSTALL_PREFIX=../ \
  -DCMAKE_CUDA_TOOLKIT_ROOT_DIR=<CUDA_PATH, e.g.: /usr/local/cuda-13> \
  ../../../
make
make install
```
```bash
# Execute in main directory
./build/{debug/release}/bin/sgemm
```
```bash
# Show instructions
${CUDA_PATH}/bin/cuobjdump --dump-ptx build/{debug/release}/bin/sgemm
```

Install requirements:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Reload VSCode ((F1) -> Developer: Reload Window) to ensure that the venv is activated when you open a new terminal.
If not: (F1) -> Python: Create Environment... -> Venv -> Use Existing and reload the window again.
This repository also contains a Triton implementation of a GEMM kernel. You can find it in this folder. To run the file (without debugging):
- Open the Python file you want to execute.
- Press (F1) -> Python: Run Python File in Terminal.
To debug the Python file, use the corresponding configuration.
To debug the Triton kernel, TRITON_INTERPRET needs to be set to 1.
This activates the interpreter mode instead of executing the compiled kernel.
More information can be found here.
- Open the Python file you want to debug.
- Press (F1) -> Debug: Select and Start Debugging
- Choose: Debug Python File
The following sgemm implementations are included in this repository:
- Enable the collection of tracing information in the settings.json.
- Trace the kernel <my_kernel>, e.g., `sgemm_warptiling`:

  ```bash
  ${CUDA_PATH}/bin/ncu \
    --set full -f \
    --kernel-name <my_kernel> \
    --export sgemm.ncu-rep \
    ./build/release/bin/sgemm
  ```
- Open the file with Nsight:

  ```bash
  ${CUDA_PATH}/bin/ncu-ui sgemm.ncu-rep
  ```

- Profile additional metrics:

  ```bash
  # Show all metrics
  ${CUDA_PATH}/bin/ncu --query-metrics

  # Profile more metrics (m1, m2, and m3)
  ${CUDA_PATH}/bin/ncu [...] --metrics m1,m2,m3 [...]
  ```
  To print the results, use `--page raw`.