Add nvbench-based benchmarks for the fltflt data types #1124
Merged
cliffburdick merged 2 commits into main on Feb 2, 2026
Conversation
Add several benchmarks that are built via -DMATX_BUILD_BENCHMARKS=ON. Also add bench/scripts/run_fltflt_benchmarks.py, which runs the fltflt benchmarks and summarizes the results. Example results when running on an RTX PRO 6000 Blackwell Server Edition are as follows.

Performance relative to single precision (float = 1.0x baseline); higher values indicate slower performance:

Benchmark    float    double    fltflt    fltflt vs dbl
-------------------------------------------------------
add          1.00x    71.10x    28.84x     2.47x
sub          1.00x    71.11x    28.85x     2.46x
mul          1.00x    71.17x    10.15x     7.01x
div          1.00x    52.63x     5.85x     8.99x
sqrt         1.00x    52.40x     3.89x    13.48x
abs          1.00x     2.17x     2.15x     1.01x
fma          1.00x    71.13x    25.36x     2.81x
madd         1.00x    71.14x    38.78x     1.83x
-------------------------------------------------------

Note that addition and subtraction are only ~2.5x faster using fltflt than fp64, while multiplication, division, and square root are significantly faster. Future updates may improve addition performance, but potentially at an accuracy cost, so those changes will likely be opt-in.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
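For context on why the gains are uneven across operations: in the standard textbook float-float ("double-float") formulation, each value is an unevaluated sum of a high and a low float. Addition must recover the rounding error through a chain of dependent fp32 additions, while multiplication recovers its error with a handful of FMAs, which maps far better onto fp32 FMA throughput. The sketch below shows only that textbook formulation; it is an illustration of the asymmetry, not MatX's actual fltflt implementation, and the ff_* names and float2 layout are assumptions.

```cuda
// Textbook float-float arithmetic, shown only to illustrate why add/sub benefit
// less than mul in the table above.  NOT MatX's fltflt implementation; the
// float2 representation and ff_* names are placeholders for this sketch.
#include <cuda_runtime.h>

// Addition: a dependent chain of fp32 adds (two_sum plus renormalization).
__device__ inline float2 ff_add(float2 x, float2 y)
{
  float s = x.x + y.x;                    // high-part sum
  float v = s - x.x;
  float e = (x.x - (s - v)) + (y.x - v);  // exact rounding error of the sum
  e += x.y + y.y;                         // fold in the low parts
  float hi = s + e;                       // renormalize into (hi, lo)
  float lo = e - (hi - s);
  return make_float2(hi, lo);
}

// Multiplication: the error term falls out of a few FMAs.
__device__ inline float2 ff_mul(float2 x, float2 y)
{
  float p = x.x * y.x;                    // high-part product
  float e = fmaf(x.x, y.x, -p);           // exact rounding error via FMA
  e = fmaf(x.x, y.y, e);                  // cross terms
  e = fmaf(x.y, y.x, e);
  float hi = p + e;                       // renormalize into (hi, lo)
  float lo = e - (hi - p);
  return make_float2(hi, lo);
}
```

Faster addition variants exist that skip part of the error recovery, which is the kind of accuracy-versus-performance trade-off the opt-in changes mentioned above would involve.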
Contributor
Greptile Overview

Greptile Summary: This PR adds comprehensive nvbench-based benchmarks for the fltflt (float-float) data type, comparing performance against single- and double-precision floating-point operations.

Implementation Quality: Confidence Score 5/5

Important Files Changed
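As a rough illustration of what an nvbench-based benchmark of this shape looks like, here is a minimal sketch. The PR's actual benchmark sources are not shown in this excerpt, so the kernel, function, and type names below are placeholders; only the nvbench macros and the 2^24-element / 256-threads-per-block figures from the summary are taken as given.

```cuda
// Minimal nvbench skeleton for an element-wise op across several types.
// Names are hypothetical; matx::fltflt is commented out because its exact
// spelling in MatX is not shown here.
#include <nvbench/nvbench.cuh>
#include <thrust/device_vector.h>

template <typename T>
__global__ void add_kernel(const T* a, const T* b, T* out, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = a[i] + b[i];
  }
}

template <typename T>
void bench_add(nvbench::state& state, nvbench::type_list<T>)
{
  const int n = 1 << 24;                           // tensor size per the summary
  thrust::device_vector<T> a(n, T(1)), b(n, T(2)), out(n);

  state.add_element_count(n);
  state.exec([&](nvbench::launch& launch) {
    const int block = 256;                         // threads/block per the summary
    const int grid  = (n + block - 1) / block;
    add_kernel<T><<<grid, block, 0, launch.get_stream()>>>(
        thrust::raw_pointer_cast(a.data()),
        thrust::raw_pointer_cast(b.data()),
        thrust::raw_pointer_cast(out.data()), n);
  });
}

// float and double are standard types; the fltflt type would come from MatX.
using bench_types = nvbench::type_list<float, double /*, matx::fltflt */>;
NVBENCH_BENCH_TYPES(bench_add, NVBENCH_TYPE_AXES(bench_types));
```

In a standalone executable this would also need NVBENCH_MAIN; within the matx_bench target, nvbench's driver supplies the main entry point and the --benchmark flag used by the script.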
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Script as run_fltflt_benchmarks.py
    participant Bench as matx_bench executable
    participant Kernel as CUDA Kernel
    participant GPU
    User->>Script: Execute script
    Script->>Script: Find build directory
    Script->>Script: Locate matx_bench executable
    loop For each benchmark (add, sub, mul, div, sqrt, abs, fma, madd)
        Script->>Bench: Run benchmark with --benchmark flag
        Bench->>Bench: Initialize nvbench state
        Bench->>Bench: Create tensor with size 2^24
        loop For each type (float, double, fltflt)
            Bench->>Kernel: Launch kernel with 256 threads/block
            Kernel->>GPU: Execute iterations (250x) with ILP_FACTOR=8
            GPU-->>Kernel: Complete computation
            Kernel-->>Bench: Write results to memory
            Bench->>Bench: Measure GPU time
        end
        Bench-->>Script: Return benchmark output with timing data
        Script->>Script: Parse nvbench table format
        Script->>Script: Extract GPU time for each precision
    end
    Script->>Script: Calculate relative performance (float baseline)
    Script->>Script: Generate summary tables
    Script->>User: Display performance comparison
```
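The kernel structure the diagram describes (256 threads per block, the operation repeated 250 times with ILP_FACTOR=8) is a common throughput-benchmark pattern: each thread holds several independent values in registers and re-applies the operation so that arithmetic, rather than memory traffic, dominates the timing. A sketch of that pattern follows; the kernel name and indexing are assumptions, and only the constants come from the diagram, not from the PR's sources.

```cuda
// Hypothetical shape of the per-operation benchmark kernel.
template <typename T, int ILP_FACTOR = 8, int ITERS = 250>
__global__ void op_throughput_kernel(const T* __restrict__ in, T* __restrict__ out, int n)
{
  const int tid    = blockIdx.x * blockDim.x + threadIdx.x;
  const int stride = gridDim.x * blockDim.x;

  // Each thread keeps ILP_FACTOR independent values so the compiler can
  // overlap the dependent chains of the measured operation.
  T v[ILP_FACTOR];
  T w = (n > 0) ? in[0] : T(1);
  #pragma unroll
  for (int j = 0; j < ILP_FACTOR; ++j) {
    const int i = tid + j * stride;
    v[j] = (i < n) ? in[i] : T(0);
  }

  // Repeat the operation many times so arithmetic throughput, not memory
  // bandwidth, dominates the timing.  'add' is shown; mul, div, sqrt, fma,
  // etc. would substitute their own expression here.
  for (int it = 0; it < ITERS; ++it) {
    #pragma unroll
    for (int j = 0; j < ILP_FACTOR; ++j) {
      v[j] = v[j] + w;
    }
  }

  // Write the results back so the compiler cannot eliminate the work.
  #pragma unroll
  for (int j = 0; j < ILP_FACTOR; ++j) {
    const int i = tid + j * stride;
    if (i < n) out[i] = v[j];
  }
}
```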
Collaborator
/build
cliffburdick approved these changes on Feb 2, 2026