Add nvbench-based benchmarks for the fltflt data types by tbensonatl · Pull Request #1124 · NVIDIA/MatX

tbensonatl · 2026-01-27T21:58:58Z

Add several benchmarks that are built via -DMATX_BUILD_BENCHMARKS=ON. Also add bench/scripts/run_fltflt_benchmarks.py, which runs the fltflt benchmarks and summarizes the results. Example results when running on RTX PRO 6000 Blackwell Server Edition are as follows:

Performance relative to single-precision (float = 1.0x baseline).
Higher values indicate slower performance.

Benchmark    float    double    fltflt    fltflt vs dbl
------------------------------------------------------------------
add          1.00x    71.10x    28.84x     2.47x
sub          1.00x    71.11x    28.85x     2.46x
mul          1.00x    71.17x    10.15x     7.01x
div          1.00x    52.63x     5.85x     8.99x
sqrt         1.00x    52.40x     3.89x    13.48x
abs          1.00x     2.17x     2.15x     1.01x
fma          1.00x    71.13x    25.36x     2.81x
madd         1.00x    71.14x    38.78x     1.83x
------------------------------------------------------------------

Note that addition and subtraction are only ~2.5x faster using fltflt than fp64. Multiplication, division, and square root are significantly faster (roughly 7x, 9x, and 13.5x, respectively). Future updates may improve addition performance, but potentially at an accuracy cost, so the changes will likely be opt-in.
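For readers who want to reproduce the ratios in the table from raw timings, here is a minimal sketch. It is not the code in run_fltflt_benchmarks.py: the helper name `relative_performance` is hypothetical, the timing values are made up to mirror the add row, and the "fltflt vs dbl" column is assumed to be the double time divided by the fltflt time.

```python
def relative_performance(times):
    """Return slowdown factors relative to float, plus the fltflt-vs-double ratio.

    `times` maps a precision name ("float", "double", "fltflt") to a GPU time
    in any consistent unit. Hypothetical helper, not part of MatX.
    """
    base = times["float"]
    if base <= 0 or times["fltflt"] <= 0:
        raise ValueError("timings must be positive")
    # Slowdown of each precision relative to the float baseline.
    rel = {name: t / base for name, t in times.items()}
    # Assumption: the last table column is double time / fltflt time.
    rel["fltflt vs dbl"] = times["double"] / times["fltflt"]
    return rel


# Made-up timings (ms), chosen only so the ratios mirror the "add" row.
add_row = relative_performance({"float": 1.0, "double": 71.10, "fltflt": 28.84})
print("  ".join(f"{name}={value:.2f}x" for name, value in add_row.items()))
```

Under this reading, a value above 1.0x in the last column means fltflt finished faster than double for that operation.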

copy-pr-bot · 2026-01-27T21:59:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-01-27T22:01:34Z

Greptile Overview

Greptile Summary

This PR adds comprehensive nvbench-based benchmarks for the fltflt (float-float) data type, comparing performance against single and double precision floating-point operations.
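As background on what a float-float type typically looks like, the sketch below emulates Knuth's error-free addition (two-sum) in float32. It assumes fltflt follows the common scheme of storing a value as an unevaluated sum of a high and a low float. The actual implementation in include/matx/kernels/fltflt.h is not shown on this page and may differ. The helper names `f32` and `two_sum` are illustrative only.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Knuth's error-free addition, emulated in float32.

    Each step is computed in binary64 and then rounded to binary32, which
    gives the same result as a native float32 add or subtract.
    Returns (hi, lo) where hi is the rounded sum and lo is the rounding error.
    """
    hi = f32(a + b)
    bb = f32(hi - a)
    lo = f32(f32(a - f32(hi - bb)) + f32(b - bb))
    return hi, lo


a = 1.0
b = 2.0 ** -30  # too small to survive a plain float32 addition to 1.0
hi, lo = two_sum(a, b)
print(hi, lo)
```

The pair (hi, lo) keeps the bits that a single float32 would drop, which is why each float-float operation costs several float operations.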

Key Changes:

Added bench/00_misc/fltflt_arithmetic.cu with 8 benchmark functions covering fundamental arithmetic operations (add, sub, mul, div, sqrt, abs, fma, madd)
Implemented custom CUDA kernels with instruction-level parallelism (ILP_FACTOR=8) and loop unrolling to increase arithmetic intensity for accurate performance measurement
Added bench/scripts/run_fltflt_benchmarks.py Python script to automate benchmark execution and generate performance comparison tables
Added abs() overload in include/matx/kernels/fltflt.h to support unary operator dispatch in benchmark code
Updated bench/CMakeLists.txt to include the new benchmark file

Implementation Quality:
The benchmarks are well-designed with proper optimization techniques to measure arithmetic performance rather than memory bandwidth. The Python parsing script includes appropriate error handling and timeouts.

Confidence Score: 5/5

This PR is safe to merge with no issues found
The PR adds well-structured benchmark code with proper error handling. All functions used in benchmarks exist in the codebase. The Python script has appropriate safeguards including timeouts and error handling. No logic errors or security concerns identified.
No files require special attention

Important Files Changed

Filename	Overview
bench/00_misc/fltflt_arithmetic.cu	Added comprehensive nvbench-based benchmarks for fltflt arithmetic operations (add, sub, mul, div, sqrt, abs, fma, madd) with proper ILP optimization
bench/CMakeLists.txt	Added fltflt_arithmetic.cu to benchmark sources list
bench/scripts/run_fltflt_benchmarks.py	Added Python script to run and summarize fltflt benchmarks with performance comparison tables
include/matx/kernels/fltflt.h	Added abs() overload for fltflt to enable unary operator dispatch

Sequence Diagram

sequenceDiagram
    participant User
    participant Script as run_fltflt_benchmarks.py
    participant Bench as matx_bench executable
    participant Kernel as CUDA Kernel
    participant GPU

    User->>Script: Execute script
    Script->>Script: Find build directory
    Script->>Script: Locate matx_bench executable
    
    loop For each benchmark (add, sub, mul, div, sqrt, abs, fma, madd)
        Script->>Bench: Run benchmark with --benchmark flag
        Bench->>Bench: Initialize nvbench state
        Bench->>Bench: Create tensor with size 2^24
        
        loop For each type (float, double, fltflt)
            Bench->>Kernel: Launch kernel with 256 threads/block
            Kernel->>GPU: Execute iterations (250x) with ILP_FACTOR=8
            GPU-->>Kernel: Complete computation
            Kernel-->>Bench: Write results to memory
            Bench->>Bench: Measure GPU time
        end
        
        Bench-->>Script: Return benchmark output with timing data
        Script->>Script: Parse nvbench table format
        Script->>Script: Extract GPU time for each precision
    end
    
    Script->>Script: Calculate relative performance (float baseline)
    Script->>Script: Generate summary tables
    Script->>User: Display performance comparison
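The parsing step in the diagram (read a results table, extract the GPU time per precision) might look roughly like the sketch below. This is an illustration only, not the code in run_fltflt_benchmarks.py: the function name `parse_gpu_times`, the column names "T" and "GPU Time", and the sample table are all assumptions rather than captured nvbench output. The guards show one way to skip malformed rows instead of raising.

```python
import re

# Multipliers to convert a time string such as "12.5 us" into seconds.
_UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}
_TIME_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(ns|us|ms|s)")


def parse_gpu_times(output, type_col="T", time_col="GPU Time"):
    """Extract {type_name: seconds} from a pipe-delimited results table.

    Column names are assumptions for illustration. Ragged rows, separator
    rows, and rows with an unparsable time are skipped.
    """
    header = None
    results = {}
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            # The first row that names both columns is treated as the header.
            if type_col in cells and time_col in cells:
                header = cells
            continue
        if len(cells) != len(header):
            continue  # guard: ragged row
        match = _TIME_RE.fullmatch(cells[header.index(time_col)])
        if match is None:
            continue  # guard: separator row or unparsable value
        seconds = float(match.group(1)) * _UNITS[match.group(2)]
        results[cells[header.index(type_col)]] = seconds
    return results


# Made-up sample table for illustration; not actual benchmark output.
SAMPLE = """
|   T    | Samples | GPU Time |
|--------|---------|----------|
| float  |    100x | 1.000 ms |
| double |    100x | 71.10 ms |
| fltflt |    100x | 28.84 ms |
| broken |
"""

times = parse_gpu_times(SAMPLE)
print(times)
```

Dividing each parsed time by the float entry then gives the relative-performance figures that the script reports.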

greptile-apps

1 file reviewed, 1 comment

bench/scripts/run_fltflt_benchmarks.py

Signed-off-by: Thomas Benson <tbenson@nvidia.com>

greptile-apps

4 files reviewed, no comments

cliffburdick · 2026-01-30T15:20:54Z

/build

tbensonatl requested a review from cliffburdick January 27, 2026 21:58

tbensonatl self-assigned this Jan 27, 2026

greptile-apps bot reviewed Jan 27, 2026

bench/scripts/run_fltflt_benchmarks.py (outdated, resolved)

Add adding guards in run_fltflt_benchmarks.py parsing

d743d27

Signed-off-by: Thomas Benson <tbenson@nvidia.com>

greptile-apps bot reviewed Jan 30, 2026

cliffburdick approved these changes Feb 2, 2026

cliffburdick merged commit f7c1a6d into main Feb 2, 2026
1 check passed

cliffburdick deleted the perf/add-fltflt-benchmarks branch February 2, 2026 19:21

cliffburdick merged 2 commits into main from perf/add-fltflt-benchmarks
