CUDA microbenchmarks for measuring bandwidth between global memory and shared memory, plus a Python plotting utility for generating PNG/PDF figures.
- `global2shared.cu`: benchmarks global -> shared copies (`float`, `float4`).
- `shared2global.cu`: benchmarks shared -> global copies (`float`, `float4`).
- `plot.py`: merges CSV outputs and generates plots.
- GPU result folders (`3090/`, `4090/`, `5090/`, `Titan/`, `a100/`) with sample outputs.

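As a rough illustration of what "bandwidth" means for these copies, the sketch below converts a byte count and an elapsed time into GB/s. The function name, the element count, and the timing value are hypothetical and not taken from the benchmark sources; the binaries may compute or report the figure differently.

```python
def bandwidth_gb_per_s(num_bytes: int, elapsed_ms: float) -> float:
    """Effective bandwidth in GB/s, using 1 GB = 1e9 bytes."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    seconds = elapsed_ms * 1e-3
    return num_bytes / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical run: 64 Mi float4 elements (16 bytes each) copied in 2.0 ms.
    n_elems = 64 * 1024 * 1024
    bytes_moved = n_elems * 16
    print(f"{bandwidth_gb_per_s(bytes_moved, 2.0):.1f} GB/s")
```

The same byte count moved as `float` instead of `float4` needs four times as many load/store instructions, which is presumably why both element types are benchmarked.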
- NVIDIA GPU with CUDA support
- CUDA toolkit + `nvcc`
- CMake >= 3.18
- Python >= 3.9
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

The binaries write fixed filenames:

- `global_to_shared_async_constexpr.csv`
- `shared_to_global.csv`

Run them from a result directory to keep outputs organized:
mkdir -p results
cd results
../build/global2shared
../build/shared2global
cd ..

Install Python dependencies (choose one):

uv sync

or

python -m pip install -e .

Then render plots from benchmark CSVs:

python plot.py --input-dir results --output-dir results

This creates:

- `results/png/` (PNG plots)
- `results/pdf/` (PDF plots)
- `results/csv/plot_data.csv` (combined data)

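For readers who want the combined CSV without running the full plotting step, the merge can be approximated with the standard library alone. This is a minimal sketch, not the code in `plot.py`: the `merge_csvs` helper and the added `source` column are assumptions, and the column layout of the benchmark CSVs is not documented here, so the sketch simply takes the union of whatever headers it finds.

```python
import csv
from pathlib import Path

# File names come from this README; everything else here is illustrative.
INPUTS = ("global_to_shared_async_constexpr.csv", "shared_to_global.csv")


def merge_csvs(input_dir, output_path, names=INPUTS):
    """Concatenate the benchmark CSVs, tagging each row with its source file."""
    rows, fields = [], ["source"]  # "source" is a hypothetical extra column
    for name in names:
        path = Path(input_dir) / name
        if not path.is_file():
            continue  # skip benchmarks that were not run
        with path.open(newline="") as fh:
            reader = csv.DictReader(fh)
            for col in reader.fieldnames or []:
                if col not in fields:
                    fields.append(col)
            for row in reader:
                row["source"] = path.stem
                rows.append(row)

    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=fields, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `merge_csvs("results", "results/csv/plot_data.csv")` would write one file in the same location the plotting step uses, so run it against a scratch directory if the real output should be left alone.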
| GPU | Plot |
|---|---|
| RTX 3090 | ![]() |
| RTX 4090 | ![]() |
| RTX 5090 | ![]() |
| Titan | ![]() |
| A100 | ![]() |
- `plot.py` defaults to reading `global_to_shared_async_constexpr.csv` and `shared_to_global.csv` from `--input-dir`.
- CUDA architecture selection is handled in `CMakeLists.txt` (`native` when supported by CMake).
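Because the input filenames are fixed, a quick pre-flight check can catch a missing benchmark run before plotting. The helper below is a hypothetical convenience, not part of `plot.py`; only the two filenames are taken from the notes above.

```python
from pathlib import Path

# Default input names as listed in this README.
EXPECTED = ("global_to_shared_async_constexpr.csv", "shared_to_global.csv")


def missing_outputs(input_dir, expected=EXPECTED):
    """Return the expected CSV names that are absent from input_dir."""
    base = Path(input_dir)
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("results")
    if missing:
        print("Missing benchmark outputs:", ", ".join(missing))
    else:
        print("All benchmark outputs found.")
```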




