Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -20,6 +20,7 @@ convolution2D
vectAdd_errors
gpu_info
parallel_histogram
stencil

# Profiler and logs
*.nvvp
12 changes: 7 additions & 5 deletions README.md
@@ -59,14 +59,16 @@ If you do not have `nvprof`, install the CUDA toolkit, or run the GitHub Actions

Click the folders below for the example README files and more details:

- [`Vector Addition`](vector_addition/) — vector add example
- [`Vector Addition`](vector_addition/) — vector add example with multiple modes
- [`Error Handling`](error_handling/) — examples showing CUDA error handling
- [`Device Specification`](device_specification/) — device query and capability examples
- [`Image Manipulation`](image_manip/) — image processing examples (blur, grayscale); includes `stb` helper headers
- [`Image Manipulation`](image_manip/) — image processing examples (blur, grayscale) with libpng
- [`Matrix-Vector Multiplication`](matrix_vector_multiplication/) — matrix-vector multiplication example
- [`Matrix Multiplication`](matrix_multiplication/) — matrix multiplication example
- [`Convolution`](convolution/) — convolution examples (1D & 2D)
- [`Profiling Tools`](profiling_tools/) — automated GPU profiling suite with roofline analysis, timing histograms, and occupancy visualization
- [`Matrix Multiplication`](matrix_multiplication/) — matrix multiplication with naive, tiled, and coarsened kernels
- [`Convolution`](convolution/) — 1D and 2D convolution with constant memory and tiling
- [`Parallel Histogram`](parallel_histogram/) — parallel histogram with privatization, aggregation, and coarsening
- [`3D Stencil`](stencil/) — 3D seven-point stencil with shared memory, coarsening, and register tiling
- [`Profiling Tools`](profiling_tools/) — automated GPU profiling suite with roofline analysis

Each folder includes a `README.md` with per-example instructions.

9 changes: 5 additions & 4 deletions convolution/README.md
@@ -111,7 +111,8 @@ Both implementations use a compile-time filter radius defined as:

## Notes

- Constant memory provides broadcast capability for filter coefficients accessed by all threads
- Tiling reduces global memory bandwidth by reusing data in shared memory
- The halo region in tiled implementations handles boundary conditions
- Use `NVCCFLAGS` in the Makefile to tune compilation flags for your hardware
- Constant memory provides broadcast capability for filter coefficients accessed by all threads.
- Tiling reduces global memory bandwidth by reusing data in shared memory.
- The halo region in tiled implementations handles boundary conditions.
- Use `NVCCFLAGS` in the Makefile to tune compilation flags for your hardware.
- Use profiling tools: `../profiling_tools/profile_cuda.sh -d .`
33 changes: 24 additions & 9 deletions device_specification/README.md
@@ -1,27 +1,42 @@
# Device Specification

Purpose: utility to enumerate CUDA devices and print hardware limits and properties useful for tuning kernels and understanding the platform.
Utility to enumerate CUDA devices and print hardware limits and properties useful for tuning kernels and understanding the platform.

Build:
## Build

```bash
cd device_specification
make
```

Programs and usage:
## Usage

- `deviceSpec [--device device_index]` : print properties for all devices or for the supplied device index.
```bash
./deviceSpec [--device DEVICE_INDEX]
```

**Options:**

| Flag | Description | Default |
|------|-------------|---------|
| `--device` | Device index to query | All devices |

Run examples (local runner):
## Run

```bash
./run.sh --device=0 # print device 0 only
# No args prints all devices
# Print all devices
./run.sh

# Print device 0 only
./run.sh --device=0
```

Notes:
## Profiling

This is a query utility; profiling is not typically needed.

## Notes

- Uses `cudaGetDeviceProperties` to collect a broad set of fields: memory, SMPs, registers, warp size, clock rates, compute capability, PCI IDs, ECC and concurrency flags. Useful baseline for kernel tuning.
- Uses `cudaGetDeviceProperties` to collect a broad set of fields: memory, SMs (streaming multiprocessors), registers, warp size, clock rates, compute capability, PCI IDs, ECC and concurrency flags.
- Useful baseline for kernel tuning.
- No external libraries required beyond the CUDA toolkit.
62 changes: 37 additions & 25 deletions error_handling/README.md
@@ -1,48 +1,60 @@
# Error-handling demos for vector addition
# Error Handling Demos

This directory contains small programs that intentionally trigger common CUDA
runtime and kernel errors so you can observe runtime messages and test
profiling/debugging tools such as `nvprof`.
This directory contains small programs that intentionally trigger common CUDA runtime and kernel errors so you can observe runtime messages and test profiling/debugging tools.

Build:
## Build

```bash
cd error_handling
make
```

Programs and usage:
## Usage

- `vectAdd_errors [--mode idx]` : vector-add demo with intentional error modes
- Mode 0 (or no mode): safe run
- Mode 1 : excessive block size (invalid kernel launch configuration)
- Mode 2 : invalid host pointer passed to `cudaMemcpy`
- Mode 3 : excessive allocation request (forced `cudaMalloc` failure)
- Mode 4 : referencing invalid device pointer in kernel (NULL/invalid)
- Mode 5 : out-of-bounds global memory write in kernel
### vectAdd_errors

- `errorCudaMemcpy` : separate demo that demonstrates common cudaMemcpy /
memory-management mistakes, including incorrect sizes, nullptr copies and
misuse of `cudaMemcpyDeviceToDevice`. This file includes its own checking
macros and intentionally triggers runtime/runtime-sticky errors for testing.
Vector-add demo with intentional error modes:

Run examples (local runner):
```bash
./vectAdd_errors [--mode MODE] [--n N]
```

**Modes:**

| Mode | Description |
|------|-------------|
| 0 | Safe run (no errors) |
| 1 | Excessive block size (invalid launch configuration) |
| 2 | Invalid host pointer passed to `cudaMemcpy` |
| 3 | Excessive allocation request (forced `cudaMalloc` failure) |
| 4 | Referencing invalid device pointer in kernel |
| 5 | Out-of-bounds global memory write in kernel |

### errorCudaMemcpy

Demonstrates common `cudaMemcpy` and memory-management mistakes:
- Incorrect sizes
- nullptr copies
- Misuse of `cudaMemcpyDeviceToDevice`

## Run

```bash
# Run vectAdd_errors in a specific mode:
# Run vectAdd_errors in a specific mode
./run.sh vectAdd_errors --mode 1 --n 1024
# Run the errorCudaMemcpy demo:

# Run the errorCudaMemcpy demo
./run.sh errorCudaMemcpy --n 1024
```

Profile with nvprof:
## Profiling

```bash
./profile_nvprof.sh errorCudaMemcpy
```

Notes:
## Notes

- The examples are intentionally invalid — run them in a controlled environment
for learning and debugging. The programs print CUDA error strings produced by
the runtime. Use `nvprof` output to inspect kernel activity and memory events.
- These examples are **intentionally invalid** — run them in a controlled environment for learning and debugging.
- The programs print CUDA error strings produced by the runtime.
- Use `nvprof` or `compute-sanitizer` to inspect kernel activity and memory events.
38 changes: 27 additions & 11 deletions image_manip/README.md
@@ -1,33 +1,49 @@
# Image Manipulation (libpng)
# Image Manipulation

Simple CUDA examples that load PNG images with `libpng`, run GPU kernels (blur and grayscale), and write PNG outputs.
CUDA examples that load PNG images with `libpng`, run GPU kernels (blur and grayscale), and write PNG outputs.

Build:
## Build

```bash
cd image_manip
make
```

Programs and usage:
Requires `libpng-dev` (or equivalent) installed on the system.

- `imageBlur [--infile IN.png] [--outfile OUT.png]` : apply a small box blur (GPU)
- `imageToGrayscale [--infile IN.png] [--outfile OUT.png]` : convert to grayscale on GPU
## Usage

Run examples (local runner):
### imageBlur

Apply a box blur filter on GPU:

```bash
./imageBlur [--infile IN.png] [--outfile OUT.png]
```

### imageToGrayscale

Convert to grayscale on GPU:

```bash
./imageToGrayscale [--infile IN.png] [--outfile OUT.png]
```

## Run

```bash
./run.sh imageBlur --infile=input.png --outfile=output.png
./run.sh imageToGrayscale --infile=input.png --outfile=gray.png
```

Profile with nvprof:
## Profiling

```bash
./profile_nvprof.sh imageBlur --infile=input.png --outfile=output.png
```

Notes:
## Notes

- These examples use `libpng` from the system. Ensure `libpng-dev` (or equivalent) is installed and visible to the compiler.
- The binaries link with `-lpng -lz`. If your system puts headers/libraries in non-standard locations, update `Makefile` accordingly.
- Binaries link with `-lpng -lz`. If your system puts headers/libraries in non-standard locations, update `Makefile` accordingly.
- Outputs keep the same number of channels as the input (RGB/RGBA).
- Use `NVCCFLAGS` in the `Makefile` to tune compilation flags.
43 changes: 30 additions & 13 deletions matrix_multiplication/README.md
@@ -1,34 +1,51 @@
# matrix_multiplication
# Matrix Multiplication

Examples and microbenchmarks for matrix-matrix multiplication. This subproject includes
multiple kernel variants (naive, tiled/shared-memory, coarsened and per-row/col variants),
convenience runner and profiling helpers.
Multiple kernel implementations for matrix-matrix multiplication demonstrating different optimization strategies: naive, tiled (shared memory), coarsened, and per-row/per-column variants.

**Build**
## Build

```bash
cd matrix_multiplication
make
```

**Programs and usage**
## Usage

- `matrixMul [--mode MODE] [--M M] [--K K] [--N N] [--threads THREADS] [--tile TILE] [--coarse COARSE]` : Run a single mode.
- Modes (supported): `naive`, `tiled`, `coarsened`, `perrows`, `percols`.
- `coarsened` accepts additional `COARSE` parameter (1..8) as last argument.
```bash
./matrixMul [--mode MODE] [--M M] [--K K] [--N N] [--threads T] [--tile TILE] [--coarse C]
```

**Options:**

**Run (local runner)**
| Flag | Description | Default |
|------|-------------|---------|
| `--mode` | Kernel: `naive`, `tiled`, `coarsened`, `perrows`, `percols`, `all` | `all` |
| `--M` | Matrix A rows | 1024 |
| `--K` | Matrix A cols / B rows | 1024 |
| `--N` | Matrix B cols | 1024 |
| `--threads` | Threads per block | 256 |
| `--tile` | Tile dimension for shared memory | 16 |
| `--coarse` | Coarsening factor (1-8) | 2 |

## Run

```bash
./run.sh --mode=tiled --M=1024 --K=1024 --N=1024 --threads=256 --tile=16
```

**Profile with nvprof**
## Profiling

```bash
# Profile with nvprof
./profile_nvprof.sh --M 1024 --K 1024 --N 1024 --threads 256

# Use profiling tools
../profiling_tools/profile_cuda.sh -d .
```

Notes
## Notes

- The CUDA kernels intentionally demonstrate multiple implementation strategies for microbenchmarking; they are not heavily optimized for every GPU. Use the profiling scripts and gnuplot files to collect timings and generate a Roofline / bar chart.
- The CUDA kernels demonstrate multiple implementation strategies for microbenchmarking.
- Use the profiling scripts and gnuplot to collect timings and generate Roofline / bar charts.
- Tiled kernel uses shared memory to reduce global memory bandwidth requirements.
- Coarsened kernel computes multiple output elements per thread.
31 changes: 23 additions & 8 deletions matrix_vector_multiplication/README.md
@@ -1,31 +1,46 @@
# Matrix-Vector Multiplication

Purpose: simple example that multiplies a matrix A (height x width) by a vector B (width) producing vector C (height) using a straightforward GPU kernel.
Simple example that multiplies a matrix A (height × width) by a vector B (width) producing vector C (height) using a straightforward GPU kernel.

Build:
## Build

```bash
cd matrix_vector_multiplication
make
```

Programs and usage:
## Usage

- `matrixVectMul [--width W] [--height H] [--threads T]` : run the example (defaults to 1024 x 1024 with 256 threads if no args provided).
```bash
./matrixVectMul [--width W] [--height H] [--threads T]
```

**Options:**

Run examples (local runner):
| Flag | Description | Default |
|------|-------------|---------|
| `--width` | Matrix width / vector length | 1024 |
| `--height` | Matrix height / output length | 1024 |
| `--threads` | Threads per block | 256 |

## Run

```bash
./run.sh --width=2048 --height=1024 --threads=128
```

Profile with nvprof:
## Profiling

```bash
# Profile with nvprof
./profile_nvprof.sh --width=2048 --height=1024

# Use profiling tools
../profiling_tools/profile_cuda.sh -d .
```

Notes:
## Notes

- This implementation is intentionally simple. It demonstrates a per-row parallelization where each thread computes one output element. It does not attempt shared-memory tiling or other optimizations.
- This implementation demonstrates per-row parallelization where each thread computes one output element.
- Intentionally simple; does not use shared-memory tiling or other optimizations.
- Use `NVCCFLAGS` in the `Makefile` to tune compile flags.
15 changes: 13 additions & 2 deletions parallel_histogram/README.md
@@ -101,14 +101,25 @@ Thread i processes: input[i*C], input[i*C+1], ..., input[i*C+(C-1)]
./parallelHistogram --bins 64 --n 1000000
```

## Run Script
## Run

```bash
./run.sh [OPTIONS]
```

## Profile with nvprof
## Profiling

```bash
# Profile with nvprof
./profile_nvprof.sh --mode all --n 10000000

# Use profiling tools
../profiling_tools/profile_cuda.sh -d .
```

## Notes

- Maximum bins limited to 4096 due to shared memory constraints.
- Privatized kernel provides best performance for typical use cases.
- Coarsened kernel benefits from better memory coalescing.
- Host-side verification ensures correctness.