rpelke/cuda-vscode-setup

Writing CUDA Kernels - A VSCode Setup & Tutorial

This repository is designed to simplify your introduction to CUDA kernel development by providing a ready-to-use VSCode setup. With it, you can both profile your kernels and debug them directly from the VSCode editor, so you can dive into online tutorials immediately without wrestling with your toolchain first.

This repository contains different CUDA implementations of an SGEMM (single-precision general matrix multiply) kernel, inspired by the tutorials from siboehm and leimao. The triton folder contains examples of how to tune a GEMM kernel with Triton. The pybind folder contains an example of how to invoke a CUDA kernel written in C++ from Python.
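SGEMM is the single-precision general matrix multiply from BLAS: C = alpha * A * B + beta * C. As a reference for what every kernel variant in this repository computes, here is a minimal CPU version in pure Python (the function name and shapes are illustrative, not code from the repo):

```python
def sgemm_ref(alpha, A, B, beta, C):
    """Reference SGEMM: C = alpha * (A @ B) + beta * C.

    A is M x K, B is K x N, C is M x N (lists of lists of floats).
    Returns a new M x N matrix; each CUDA kernel computes the same result.
    """
    M, K, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):  # dot product of row i of A and column j of B
                acc += A[i][k] * B[k][j]
            out[i][j] = alpha * acc + beta * C[i][j]
    return out
```

The CUDA variants below all optimize this triple loop; only the memory-access pattern changes between them.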

Build sgemm kernels in VSCode

  1. Make sure you have all necessary VSCode extensions:

    • C/C++ & C/C++ Extension Pack
    • CMake & CMake Tools
    • Nsight Visual Studio Code Edition (for kernel profiling)
    • Python & Python Extension Pack (for Python examples only)
    • Clang-format (by X. Hellauer) for C++ formatting
    • Yapf for Python formatting
  2. Adapt the paths in the settings.json file.

  3. Select the build variant (Release or Debug): (F1) -> (CMake: Select Variant)

  4. Configure + Build + Install the executable: (F1) -> (CMake: Install)

  5. You should now find a binary called sgemm in the debug or release folder inside build, depending on the variant.

Run and debug kernels in VSCode

The run and debug configurations can be found in the launch.json file.

Adapt the paths in the launch.json file.

To run the kernel without debugging (release version), select:

  • (F1) -> (Debug: Select and Start Debugging) -> (Run kernel)

To set breakpoints in VSCode to debug the host code and/or the GPU code, select:

  • (F1) -> (Debug: Select and Start Debugging) -> (Debug kernel)

Profile the kernels in VSCode

To collect meaningful performance metrics, you should always profile the release version of your kernel.

By default, NVIDIA’s profiler (ncu) requires elevated (root) privileges to access GPU performance counters.

To allow all users to run ncu without invoking sudo, NVIDIA describes a permanent, non-root workaround here.

  1. Follow the steps on the website if you wish to continue without sudo.
  2. (F1) -> (Tasks: Run Task) -> (Profile SGEMM with Nsight {sudo / no sudo})
  3. Enter the kernel name, e.g., sgemm_simple.
  4. Select the section you want to profile.
  5. (sudo variant only) Enter the sudo password in the VSCode terminal.

Build sgemm kernels in terminal

mkdir -p build/debug/build && cd build/debug/build  # use build/release/build for a Release build
cmake \
    -DCMAKE_BUILD_TYPE={Debug/Release} \
    -DCMAKE_INSTALL_PREFIX=../ \
    -DCMAKE_CUDA_TOOLKIT_ROOT_DIR=<CUDA_PATH, e.g.: `/usr/local/cuda-13`> \
    ../../../
make
make install

# Execute in main directory
./build/{debug/release}/bin/sgemm

# Show instructions
${CUDA_PATH}/bin/cuobjdump --dump-ptx build/{debug/release}/bin/sgemm

Run/Debug Python file

Install requirements:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Reload VSCode ((F1) -> Developer: Reload Window) to ensure that the venv is activated when you open a new terminal. If it is not: (F1) -> Python: Create Environment... -> Venv -> Use Existing, then reload the window again.

This repository also contains a Triton implementation of a GEMM kernel in the triton folder. To run the file (without debugging):

  1. Open the Python file you want to execute.
  2. Press (F1) -> Python: Run Python File in Terminal.

To debug the Python file, use the corresponding configuration. To debug the Triton kernel, TRITON_INTERPRET needs to be set to 1. This activates interpreter mode instead of executing the compiled kernel. More information can be found here.

  1. Open the Python file you want to debug.
  2. Press (F1) -> Debug: Select and Start Debugging
  3. Choose: Debug Python File
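In most Triton versions the TRITON_INTERPRET variable is read when kernels are built, so the safest approach is to set it before triton is imported: either in the shell (TRITON_INTERPRET=1 python3 my_script.py) or at the top of the script. A minimal sketch (the script layout is illustrative):

```python
import os

# TRITON_INTERPRET must be set before `import triton`, otherwise the
# compiled (non-debuggable) code path may already be selected.
os.environ["TRITON_INTERPRET"] = "1"

# import triton  # import only after the variable is set
```

With interpreter mode active, regular Python breakpoints inside the kernel body work in the VSCode debugger.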

Implementations included

The following sgemm implementations are included in this repository:

  • Simple sgemm
  • Coalesced sgemm
  • Tiled sgemm
  • 2D-Tiled sgemm & 2D-Tiled sgemm (vectorized v2)
  • 2D-Tiled sgemm (vectorized v1)
  • 2D Warptiling
  • Tensorcores
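The idea behind the tiled variants can be illustrated on the CPU: the output matrix is processed in small tiles so that each tile of A and B is loaded once and reused many times (on the GPU, from shared memory instead of global memory). A pure-Python sketch of this blocking scheme, not code from the repo:

```python
def matmul_tiled(A, B, block=2):
    """Blocked matmul: compute C in block x block tiles.

    On the GPU, each tile of A and B would be staged in shared memory and
    reused by all threads of a thread block, cutting global-memory traffic.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, block):          # tile row of C
        for j0 in range(0, N, block):      # tile column of C
            for k0 in range(0, K, block):  # reduction tile
                for i in range(i0, min(i0 + block, M)):
                    for j in range(j0, min(j0 + block, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + block, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

The result is identical to the naive triple loop; only the iteration order changes, which is what the tiled and warptiled CUDA kernels exploit.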

Tracing

  1. Enable the collection of tracing information in the settings.json.

  2. Trace the kernel <my_kernel>, e.g. sgemm_warptiling:

    ${CUDA_PATH}/bin/ncu \
      --set full -f \
      --kernel-name <my_kernel> \
      --export sgemm.ncu-rep \
      ./build/release/bin/sgemm
  3. Open the report with Nsight Compute:

    ${CUDA_PATH}/bin/ncu-ui sgemm.ncu-rep
  4. Profile additional metrics:

    # Show all metrics
    ${CUDA_PATH}/bin/ncu --query-metrics
    
    # Profile more metrics (m1, m2, and m3)
    ${CUDA_PATH}/bin/ncu [...] --metrics m1,m2,m3 [...]

    To print the results to the terminal, use --page raw.
