Add CUDA parallel histogram example and profiling workflow #12
Description
This PR adds a new CUDA parallel histogram module, showcasing a classic parallel pattern with a strong emphasis on performance analysis and profiling.
Key changes
- Added a new `parallel_histogram/` module including:
  - `Makefile` to handle compilation.
  - `run.sh` for execution.
  - `profile_nvprof.sh` for GPU performance profiling.
  - `README` describing the algorithm, usage, and profiling steps.
- Updated `.gitignore` to include generated artifacts related to the parallel histogram module.
Impact
The parallel histogram example introduces a workload characterized by contention and memory access challenges, making it an excellent case study for analyzing synchronization, atomics, and memory behavior on GPUs. It further enriches the repository as a hands-on CUDA performance playground.
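For readers unfamiliar with the pattern, the contention described above can be illustrated with a minimal sketch. This is not the code added in this PR: the names `histogram_kernel`, `histogram_gpu`, and `NUM_BINS`, the 256-bin byte histogram, and the launch configuration are all assumptions chosen for illustration. It shows one common way to reduce contention on global atomics: each block accumulates into a private shared-memory histogram and merges it into global memory once at the end.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Assumption for this sketch: one bin per possible byte value.
constexpr int NUM_BINS = 256;

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Each block builds a private histogram in shared memory, then merges it
// into the global bins, so most atomic traffic stays on-chip.
__global__ void histogram_kernel(const unsigned char* data, size_t n,
                                 unsigned int* bins) {
    __shared__ unsigned int local[NUM_BINS];

    // Zero the block-private bins.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) local[i] = 0;
    __syncthreads();

    // Grid-stride loop: threads in a block contend only on shared-memory atomics.
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&bins[i], local[i]);
}

// Hypothetical host wrapper: copies the input to the device, runs the kernel,
// and returns the NUM_BINS counts.
std::vector<unsigned int> histogram_gpu(const std::vector<unsigned char>& input) {
    std::vector<unsigned int> result(NUM_BINS, 0);
    if (input.empty()) return result;

    unsigned char* d_data = nullptr;
    unsigned int* d_bins = nullptr;
    check(cudaMalloc(&d_data, input.size()), "cudaMalloc(data)");
    check(cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int)), "cudaMalloc(bins)");
    check(cudaMemcpy(d_data, input.data(), input.size(), cudaMemcpyHostToDevice),
          "cudaMemcpy(H2D)");
    check(cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int)), "cudaMemset");

    // Illustrative launch configuration, not a tuned one.
    const int threads = 256;
    const int blocks = (int)std::min<size_t>((input.size() + threads - 1) / threads, 256);
    histogram_kernel<<<blocks, threads>>>(d_data, input.size(), d_bins);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    check(cudaMemcpy(result.data(), d_bins, NUM_BINS * sizeof(unsigned int),
                     cudaMemcpyDeviceToHost), "cudaMemcpy(D2H)");
    check(cudaFree(d_data), "cudaFree(data)");
    check(cudaFree(d_bins), "cudaFree(bins)");
    return result;
}
```

Comparing this variant in a profiler against a naive kernel that issues `atomicAdd` directly on the global bins is one way to observe the contention effects described above, particularly on skewed inputs where many threads hit the same bin.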