
Add nvbench-based benchmarks for the fltflt data types #1124

Merged
cliffburdick merged 2 commits into main from perf/add-fltflt-benchmarks
Feb 2, 2026

Conversation

@tbensonatl
Collaborator

Add several benchmarks that are built via -DMATX_BUILD_BENCHMARKS=ON. Also add bench/scripts/run_fltflt_benchmarks.py, which runs the fltflt benchmarks and summarizes the results. Example results when running on an RTX PRO 6000 Blackwell Server Edition are as follows:

Performance relative to single precision (float = 1.0x baseline). Higher values indicate slower performance. The last column is the double time divided by the fltflt time, so in that column higher values mean fltflt is faster.

| Benchmark | float | double | fltflt | fltflt vs dbl |
|-----------|-------|--------|--------|---------------|
| add       | 1.00x | 71.10x | 28.84x | 2.47x         |
| sub       | 1.00x | 71.11x | 28.85x | 2.46x         |
| mul       | 1.00x | 71.17x | 10.15x | 7.01x         |
| div       | 1.00x | 52.63x | 5.85x  | 8.99x         |
| sqrt      | 1.00x | 52.40x | 3.89x  | 13.48x        |
| abs       | 1.00x | 2.17x  | 2.15x  | 1.01x         |
| fma       | 1.00x | 71.13x | 25.36x | 2.81x         |
| madd      | 1.00x | 71.14x | 38.78x | 1.83x         |

Note that addition and subtraction are only ~2.5x faster using fltflt than fp64. Multiplication, division, and square root see larger speedups (roughly 7x, 9x, and 13.5x, respectively). Future updates may improve addition performance, but potentially at an accuracy cost, so the changes will likely be opt-in.
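To make the relationship between the columns concrete, here is a small Python sketch that recomputes the fltflt-versus-double column from the float-relative columns in the table above. The dictionary layout and the function name `speedup_vs_double` are illustrative choices, not taken from run_fltflt_benchmarks.py. Because the inputs are already rounded to two decimals, the recomputed values can differ from the table in the last digit.

```python
# Float-relative slowdowns copied from the results table (float = 1.00x).
relative = {
    "add":  {"double": 71.10, "fltflt": 28.84},
    "sub":  {"double": 71.11, "fltflt": 28.85},
    "mul":  {"double": 71.17, "fltflt": 10.15},
    "div":  {"double": 52.63, "fltflt": 5.85},
    "sqrt": {"double": 52.40, "fltflt": 3.89},
    "abs":  {"double": 2.17,  "fltflt": 2.15},
    "fma":  {"double": 71.13, "fltflt": 25.36},
    "madd": {"double": 71.14, "fltflt": 38.78},
}


def speedup_vs_double(row):
    """Return how many times faster fltflt is than double for one benchmark."""
    return row["double"] / row["fltflt"]


for name, row in relative.items():
    print(f"{name:<6}{speedup_vs_double(row):6.2f}x")
```

The same division applied to raw GPU times would presumably avoid the rounding differences.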

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl tbensonatl self-assigned this Jan 27, 2026
@copy-pr-bot
copy-pr-bot bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Jan 27, 2026

Greptile Overview

Greptile Summary

This PR adds comprehensive nvbench-based benchmarks for the fltflt (float-float) data type, comparing performance against single and double precision floating-point operations.
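For readers unfamiliar with the term, float-float arithmetic usually means representing one value as an unevaluated sum of two single-precision floats (hi + lo). The following is a minimal Python sketch of that general idea, using the textbook TwoSum error-free transformation. It assumes this is what fltflt denotes; it is not code from include/matx/kernels/fltflt.h, and the helper names (`f32`, `two_sum`, `ff_add`) are made up for illustration.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """TwoSum: return (s, e) where s is the float32 sum and e its rounding error."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def ff_add(x, y):
    """Add two float-float values given as (hi, lo) pairs (a simple variant)."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    # Renormalize so that hi holds the rounded sum and lo the remainder.
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo


tiny = 2.0 ** -30
print(f32(1.0 + tiny) == 1.0)            # a single float drops the small term
print(ff_add((1.0, 0.0), (tiny, 0.0)))   # the pair keeps it in the low word
```

Even this simple variant expands one addition into roughly ten dependent float operations, which may help explain why add and sub show the smallest speedup over double in the results posted in the PR description.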

Key Changes:

  • Added bench/00_misc/fltflt_arithmetic.cu with 8 benchmark functions covering fundamental arithmetic operations (add, sub, mul, div, sqrt, abs, fma, madd)
  • Implemented custom CUDA kernels with instruction-level parallelism (ILP_FACTOR=8) and loop unrolling to increase arithmetic intensity for accurate performance measurement
  • Added bench/scripts/run_fltflt_benchmarks.py Python script to automate benchmark execution and generate performance comparison tables
  • Added abs() overload in include/matx/kernels/fltflt.h to support unary operator dispatch in benchmark code
  • Updated bench/CMakeLists.txt to include the new benchmark file

Implementation Quality:
The benchmarks are well-designed with proper optimization techniques to measure arithmetic performance rather than memory bandwidth. The Python parsing script includes appropriate error handling and timeouts.
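As an illustration of the kind of safeguard described here, the sketch below wraps `subprocess.run` with a timeout and treats a timeout, a launch failure, or a non-zero exit code as a failed run. This is a generic pattern, not the code in run_fltflt_benchmarks.py; the function name `run_command` and the default timeout are assumptions.

```python
import subprocess
import sys


def run_command(cmd, timeout_s=60.0):
    """Run cmd and return its stdout, or None on timeout, launch failure, or non-zero exit."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # the child is killed by subprocess.run before this is raised
    except OSError:
        return None  # for example, the executable does not exist
    if proc.returncode != 0:
        return None
    return proc.stdout


# Stand-in command; a benchmark driver would launch the benchmark executable instead.
print(run_command([sys.executable, "-c", "print('ok')"]))
```

Returning None rather than raising lets a driver loop skip one failed benchmark and still report the others.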

Confidence Score: 5/5

  • This PR is safe to merge with no issues found
  • The PR adds well-structured benchmark code with proper error handling. All functions used in benchmarks exist in the codebase. The Python script has appropriate safeguards including timeouts and error handling. No logic errors or security concerns identified.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|----------|----------|
| bench/00_misc/fltflt_arithmetic.cu | Added comprehensive nvbench-based benchmarks for fltflt arithmetic operations (add, sub, mul, div, sqrt, abs, fma, madd) with proper ILP optimization |
| bench/CMakeLists.txt | Added fltflt_arithmetic.cu to benchmark sources list |
| bench/scripts/run_fltflt_benchmarks.py | Added Python script to run and summarize fltflt benchmarks with performance comparison tables |
| include/matx/kernels/fltflt.h | Added abs() overload for fltflt to enable unary operator dispatch |

Sequence Diagram

sequenceDiagram
    participant User
    participant Script as run_fltflt_benchmarks.py
    participant Bench as matx_bench executable
    participant Kernel as CUDA Kernel
    participant GPU

    User->>Script: Execute script
    Script->>Script: Find build directory
    Script->>Script: Locate matx_bench executable
    
    loop For each benchmark (add, sub, mul, div, sqrt, abs, fma, madd)
        Script->>Bench: Run benchmark with --benchmark flag
        Bench->>Bench: Initialize nvbench state
        Bench->>Bench: Create tensor with size 2^24
        
        loop For each type (float, double, fltflt)
            Bench->>Kernel: Launch kernel with 256 threads/block
            Kernel->>GPU: Execute iterations (250x) with ILP_FACTOR=8
            GPU-->>Kernel: Complete computation
            Kernel-->>Bench: Write results to memory
            Bench->>Bench: Measure GPU time
        end
        
        Bench-->>Script: Return benchmark output with timing data
        Script->>Script: Parse nvbench table format
        Script->>Script: Extract GPU time for each precision
    end
    
    Script->>Script: Calculate relative performance (float baseline)
    Script->>Script: Generate summary tables
    Script->>User: Display performance comparison
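The parsing steps in the diagram could look roughly like the Python sketch below. It assumes the benchmark output contains a pipe-delimited table with a "GPU Time" column whose cells look like "12.3 us", and that the type axis appears in a column named "T". Both column names, the function names, and the sample text are assumptions for illustration; the actual script may parse the output differently.

```python
def parse_time_to_us(text):
    """Convert a string such as '1.5 ms' to microseconds."""
    value, unit = text.split()
    scale = {"ns": 1e-3, "us": 1.0, "ms": 1e3, "s": 1e6}
    return float(value) * scale[unit]


def parse_gpu_times(table, key_col="T", time_col="GPU Time"):
    """Map each row's key column to its GPU time in microseconds."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[1:]
    k, t = header.index(key_col), header.index(time_col)
    times = {}
    for row in body:
        if set(row[k]) <= set("-: "):  # skip the |---|---| separator row
            continue
        times[row[k]] = parse_time_to_us(row[t])
    return times


# Invented sample text, not real benchmark output.
sample = """
| T      | Samples | GPU Time |
|--------|---------|----------|
| float  | 100x    | 10.0 us  |
| double | 100x    | 0.5 ms   |
"""
times = parse_gpu_times(sample)
print({name: t / times["float"] for name, t in times.items()})
```

Normalizing every time to microseconds before dividing keeps the float-baseline ratio correct even when rows are printed in different units.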

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@cliffburdick
Collaborator

/build

@cliffburdick cliffburdick merged commit f7c1a6d into main Feb 2, 2026
1 check passed
@cliffburdick cliffburdick deleted the perf/add-fltflt-benchmarks branch February 2, 2026 19:21