Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -20,6 +20,7 @@ convolution2D
vectAdd_errors
gpu_info
parallel_histogram
stencil

# Profiler and logs
*.nvvp
12 changes: 7 additions & 5 deletions README.md
@@ -59,14 +59,16 @@ If you do not have `nvprof`, install the CUDA toolkit, or run the GitHub Actions

Click the folders below for the example README files and more details:

- [`Vector Addition`](vector_addition/) — vector add example
- [`Vector Addition`](vector_addition/) — vector add example with multiple modes
- [`Error Handling`](error_handling/) — examples showing CUDA error handling
- [`Device Specification`](device_specification/) — device query and capability examples
- [`Image Manipulation`](image_manip/) — image processing examples (blur, grayscale); includes `stb` helper headers
- [`Image Manipulation`](image_manip/) — image processing examples (blur, grayscale) with libpng
- [`Matrix-Vector Multiplication`](matrix_vector_multiplication/) — matrix-vector multiplication example
- [`Matrix Multiplication`](matrix_multiplication/) — matrix multiplication example
- [`Convolution`](convolution/) — convolution examples (1D & 2D)
- [`Profiling Tools`](profiling_tools/) — automated GPU profiling suite with roofline analysis, timing histograms, and occupancy visualization
- [`Matrix Multiplication`](matrix_multiplication/) — matrix multiplication with naive, tiled, and coarsened kernels
- [`Convolution`](convolution/) — 1D and 2D convolution with constant memory and tiling
- [`Parallel Histogram`](parallel_histogram/) — parallel histogram with privatization, aggregation, and coarsening
- [`3D Stencil`](stencil/) — 3D seven-point stencil with shared memory, coarsening, and register tiling
- [`Profiling Tools`](profiling_tools/) — automated GPU profiling suite with roofline analysis

Each folder includes a `README.md` with per-example instructions.

9 changes: 5 additions & 4 deletions convolution/README.md
@@ -111,7 +111,8 @@ Both implementations use a compile-time filter radius defined as:

## Notes

- Constant memory provides broadcast capability for filter coefficients accessed by all threads
- Tiling reduces global memory bandwidth by reusing data in shared memory
- The halo region in tiled implementations handles boundary conditions
- Use `NVCCFLAGS` in the Makefile to tune compilation flags for your hardware
- Constant memory provides broadcast capability for filter coefficients accessed by all threads.
- Tiling reduces global memory bandwidth by reusing data in shared memory.
- The halo region in tiled implementations handles boundary conditions.
- Use `NVCCFLAGS` in the Makefile to tune compilation flags for your hardware.
- Use profiling tools: `../profiling_tools/profile_cuda.sh -d .`
33 changes: 24 additions & 9 deletions device_specification/README.md
@@ -1,27 +1,42 @@
# Device Specification

Purpose: utility to enumerate CUDA devices and print hardware limits and properties useful for tuning kernels and understanding the platform.
Utility to enumerate CUDA devices and print hardware limits and properties useful for tuning kernels and understanding the platform.

Build:
## Build

```bash
cd device_specification
make
```

Programs and usage:
## Usage

- `deviceSpec [--device device_index]` : print properties for all devices or for the supplied device index.
```bash
./deviceSpec [--device DEVICE_INDEX]
```

**Options:**

| Flag | Description | Default |
|------|-------------|---------|
| `--device` | Device index to query | All devices |

Run examples (local runner):
## Run

```bash
./run.sh --device=0 # print device 0 only
# No args prints all devices
# Print all devices
./run.sh

# Print device 0 only
./run.sh --device=0
```

Notes:
## Profiling

This is a query utility; profiling is not typically needed.

## Notes

- Uses `cudaGetDeviceProperties` to collect a broad set of fields: memory, SMPs, registers, warp size, clock rates, compute capability, PCI IDs, ECC and concurrency flags. Useful baseline for kernel tuning.
- Uses `cudaGetDeviceProperties` to collect a broad set of fields: memory, SMs (streaming multiprocessors), registers, warp size, clock rates, compute capability, PCI IDs, ECC and concurrency flags.
- Useful baseline for kernel tuning.
- No external libraries required beyond the CUDA toolkit.
62 changes: 37 additions & 25 deletions error_handling/README.md
@@ -1,48 +1,60 @@
# Error-handling demos for vector addition
# Error Handling Demos

This directory contains small programs that intentionally trigger common CUDA
runtime and kernel errors so you can observe runtime messages and test
profiling/debugging tools such as `nvprof`.
This directory contains small programs that intentionally trigger common CUDA runtime and kernel errors so you can observe runtime messages and test profiling/debugging tools.

Build:
## Build

```bash
cd error_handling
make
```

Programs and usage:
## Usage

- `vectAdd_errors [--mode idx]` : vector-add demo with intentional error modes
- Mode 0 (or no mode): safe run
- Mode 1 : excessive block size (invalid kernel launch configuration)
- Mode 2 : invalid host pointer passed to `cudaMemcpy`
- Mode 3 : excessive allocation request (forced `cudaMalloc` failure)
- Mode 4 : referencing invalid device pointer in kernel (NULL/invalid)
- Mode 5 : out-of-bounds global memory write in kernel
### vectAdd_errors

- `errorCudaMemcpy` : separate demo that demonstrates common cudaMemcpy /
memory-management mistakes, including incorrect sizes, nullptr copies and
misuse of `cudaMemcpyDeviceToDevice`. This file includes its own checking
macros and intentionally triggers runtime/runtime-sticky errors for testing.
Vector-add demo with intentional error modes:

Run examples (local runner):
```bash
./vectAdd_errors [--mode MODE] [--n N]
```

**Modes:**

| Mode | Description |
|------|-------------|
| 0 | Safe run (no errors) |
| 1 | Excessive block size (invalid launch configuration) |
| 2 | Invalid host pointer passed to `cudaMemcpy` |
| 3 | Excessive allocation request (forced `cudaMalloc` failure) |
| 4 | Referencing invalid device pointer in kernel |
| 5 | Out-of-bounds global memory write in kernel |

### errorCudaMemcpy

Demonstrates common `cudaMemcpy` and memory-management mistakes:
- Incorrect sizes
- nullptr copies
- Misuse of `cudaMemcpyDeviceToDevice`

## Run

```bash
# Run vectAdd_errors in a specific mode:
# Run vectAdd_errors in a specific mode
./run.sh vectAdd_errors --mode 1 --n 1024
# Run the errorCudaMemcpy demo:

# Run the errorCudaMemcpy demo
./run.sh errorCudaMemcpy --n 1024
```

Profile with nvprof:
## Profiling

```bash
./profile_nvprof.sh errorCudaMemcpy
```

Notes:
## Notes

- The examples are intentionally invalid — run them in a controlled environment
for learning and debugging. The programs print CUDA error strings produced by
the runtime. Use `nvprof` output to inspect kernel activity and memory events.
- These examples are **intentionally invalid** — run them in a controlled environment for learning and debugging.
- The programs print CUDA error strings produced by the runtime.
- Use `nvprof` or `compute-sanitizer` to inspect kernel activity and memory events.
38 changes: 27 additions & 11 deletions image_manip/README.md
@@ -1,33 +1,49 @@
# Image Manipulation (libpng)
# Image Manipulation

Simple CUDA examples that load PNG images with `libpng`, run GPU kernels (blur and grayscale), and write PNG outputs.
CUDA examples that load PNG images with `libpng`, run GPU kernels (blur and grayscale), and write PNG outputs.

Build:
## Build

```bash
cd image_manip
make
```

Programs and usage:
Requires `libpng-dev` (or equivalent) installed on the system.

- `imageBlur [--infile IN.png] [--outfile OUT.png]` : apply a small box blur (GPU)
- `imageToGrayscale [--infile IN.png] [--outfile OUT.png]` : convert to grayscale on GPU
## Usage

Run examples (local runner):
### imageBlur

Apply a box blur filter on GPU:

```bash
./imageBlur [--infile IN.png] [--outfile OUT.png]
```

### imageToGrayscale

Convert to grayscale on GPU:

```bash
./imageToGrayscale [--infile IN.png] [--outfile OUT.png]
```

## Run

```bash
./run.sh imageBlur --infile=input.png --outfile=output.png
./run.sh imageToGrayscale --infile=input.png --outfile=gray.png
```

Profile with nvprof:
## Profiling

```bash
./profile_nvprof.sh imageBlur --infile=input.png --outfile=output.png
```

Notes:
## Notes

- These examples use `libpng` from the system. Ensure `libpng-dev` (or equivalent) is installed and visible to the compiler.
- The binaries link with `-lpng -lz`. If your system puts headers/libraries in non-standard locations, update `Makefile` accordingly.
- Binaries link with `-lpng -lz`. If your system puts headers/libraries in non-standard locations, update `Makefile` accordingly.
- Outputs keep the same number of channels as the input (RGB/RGBA).
- Use `NVCCFLAGS` in the `Makefile` to tune compilation flags.
43 changes: 30 additions & 13 deletions matrix_multiplication/README.md
@@ -1,34 +1,51 @@
# matrix_multiplication
# Matrix Multiplication

Examples and microbenchmarks for matrix-matrix multiplication. This subproject includes
multiple kernel variants (naive, tiled/shared-memory, coarsened and per-row/col variants),
convenience runner and profiling helpers.
Multiple kernel implementations for matrix-matrix multiplication demonstrating different optimization strategies: naive, tiled (shared memory), coarsened, and per-row/per-column variants.

**Build**
## Build

```bash
cd matrix_multiplication
make
```

**Programs and usage**
## Usage

- `matrixMul [--mode MODE] [--M M] [--K K] [--N N] [--threads THREADS] [--tile TILE] [--coarse COARSE]` : Run a single mode.
- Modes (supported): `naive`, `tiled`, `coarsened`, `perrows`, `percols`.
- `coarsened` accepts additional `COARSE` parameter (1..8) as last argument.
```bash
./matrixMul [--mode MODE] [--M M] [--K K] [--N N] [--threads T] [--tile TILE] [--coarse C]
```

**Options:**

**Run (local runner)**
| Flag | Description | Default |
|------|-------------|---------|
| `--mode` | Kernel: `naive`, `tiled`, `coarsened`, `perrows`, `percols`, `all` | `all` |
| `--M` | Matrix A rows | 1024 |
| `--K` | Matrix A cols / B rows | 1024 |
| `--N` | Matrix B cols | 1024 |
| `--threads` | Threads per block | 256 |
| `--tile` | Tile dimension for shared memory | 16 |
| `--coarse` | Coarsening factor (1-8) | 2 |

## Run

```bash
./run.sh --mode=tiled --M=1024 --K=1024 --N=1024 --threads=256 --tile=16
```

**Profile with nvprof**
## Profiling

```bash
# Profile with nvprof
./profile_nvprof.sh --M 1024 --K 1024 --N 1024 --threads 256

# Use profiling tools
../profiling_tools/profile_cuda.sh -d .
```

Notes
## Notes

- The CUDA kernels intentionally demonstrate multiple implementation strategies for microbenchmarking; they are not heavily optimized for every GPU. Use the profiling scripts and gnuplot files to collect timings and generate a Roofline / bar chart.
- The CUDA kernels demonstrate multiple implementation strategies for microbenchmarking.
- Use the profiling scripts and gnuplot to collect timings and generate Roofline / bar charts.
- Tiled kernel uses shared memory to reduce global memory bandwidth requirements.
- Coarsened kernel computes multiple output elements per thread.
31 changes: 23 additions & 8 deletions matrix_vector_multiplication/README.md
@@ -1,31 +1,46 @@
# Matrix-Vector Multiplication

Purpose: simple example that multiplies a matrix A (height x width) by a vector B (width) producing vector C (height) using a straightforward GPU kernel.
Simple example that multiplies a matrix A (height × width) by a vector B (width) producing vector C (height) using a straightforward GPU kernel.

Build:
## Build

```bash
cd matrix_vector_multiplication
make
```

Programs and usage:
## Usage

- `matrixVectMul [--width W] [--height H] [--threads T]` : run the example (defaults to 1024 x 1024 with 256 threads if no args provided).
```bash
./matrixVectMul [--width W] [--height H] [--threads T]
```

**Options:**

Run examples (local runner):
| Flag | Description | Default |
|------|-------------|---------|
| `--width` | Matrix width / vector length | 1024 |
| `--height` | Matrix height / output length | 1024 |
| `--threads` | Threads per block | 256 |

## Run

```bash
./run.sh --width=2048 --height=1024 --threads=128
```

Profile with nvprof:
## Profiling

```bash
# Profile with nvprof
./profile_nvprof.sh --width=2048 --height=1024

# Use profiling tools
../profiling_tools/profile_cuda.sh -d .
```

Notes:
## Notes

- This implementation is intentionally simple. It demonstrates a per-row parallelization where each thread computes one output element. It does not attempt shared-memory tiling or other optimizations.
- This implementation demonstrates per-row parallelization where each thread computes one output element.
- Intentionally simple; does not use shared-memory tiling or other optimizations.
- Use `NVCCFLAGS` in the `Makefile` to tune compile flags.
15 changes: 13 additions & 2 deletions parallel_histogram/README.md
@@ -101,14 +101,25 @@ Thread i processes: input[i*C], input[i*C+1], ..., input[i*C+(C-1)]
./parallelHistogram --bins 64 --n 1000000
```

## Run Script
## Run

```bash
./run.sh [OPTIONS]
```

## Profile with nvprof
## Profiling

```bash
# Profile with nvprof
./profile_nvprof.sh --mode all --n 10000000

# Use profiling tools
../profiling_tools/profile_cuda.sh -d .
```

## Notes

- Maximum bins limited to 4096 due to shared memory constraints.
- Privatized kernel provides best performance for typical use cases.
- Coarsened kernel benefits from better memory coalescing.
- Host-side verification ensures correctness.