Conversation

WenqingLan1 (Contributor) commented Dec 19, 2025

Refinements:

  • Use 128-bit aligned vector types (double2/float4, 16 bytes each) to improve achieved memory bandwidth.
  • Add support for single-precision (float) execution.
  • Add a --data_type <float|double> CLI option to select the data type at runtime.
  • Move template kernel implementations to a header file (required for instantiating CUDA templates across compilation units).
  • Rename the entry-point file from gpu_stream_test.cpp to gpu_stream_main.cpp.
  • Replace the hard-coded iteration over GPUs with a single run per process, so the benchmark can be launched on each GPU by SuperBench's distributed execution configured in config.yaml.
  • Replace the hard-coded numa_alloc_onnode with numa_alloc_local, so memory is allocated on the NUMA node local to the calling CPU.
  • Update the micro-benchmark documentation to reflect the new metric names, which no longer include gpu_id.
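Both double2 (2 × 8 bytes) and float4 (4 × 4 bytes) are 16 bytes wide, so the number of scalars each vector carries follows directly from the --data_type choice. The sketch below is a hypothetical Python illustration of that arithmetic, not the benchmark's CUDA code; lanes and vector_count are illustrative names, and the requirement that the element count divide evenly by the lane count is an assumption.

```python
import ctypes

# Target vector width: 128 bits = 16 bytes per load/store.
VECTOR_BYTES = 16

# Scalar size for each --data_type value.
ELEMENT_BYTES = {
    "float": ctypes.sizeof(ctypes.c_float),    # 4 bytes
    "double": ctypes.sizeof(ctypes.c_double),  # 8 bytes
}

# CUDA built-in vector type that packs 16 bytes of each scalar type.
VECTOR_TYPE = {"float": "float4", "double": "double2"}


def lanes(data_type: str) -> int:
    """Number of scalar elements carried by one 128-bit vector."""
    return VECTOR_BYTES // ELEMENT_BYTES[data_type]


def vector_count(num_elements: int, data_type: str) -> int:
    """Number of 128-bit vectors covering num_elements scalars.

    Hypothetical helper: assumes the element count is an exact multiple of the
    lane count, so no scalar remainder loop is modeled.
    """
    n = lanes(data_type)
    if num_elements % n:
        raise ValueError(f"{num_elements} is not a multiple of {n} ({VECTOR_TYPE[data_type]})")
    return num_elements // n


for dt in ("float", "double"):
    # Prints "float float4 4" and then "double double2 2".
    print(dt, VECTOR_TYPE[dt], lanes(dt))
```

If size in the config below counts elements (an assumption), vector_count(2617245696, "float") and vector_count(1308622848, "double") both return 654311424, so the fp32 and fp64 runs would move the same number of 16-byte vectors.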

New config:

    gpu-stream:fp64:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 1308622848
        data_type: double
    gpu-stream:fp64-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: double
        check_data: true
    gpu-stream:fp32:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 2617245696
        data_type: float
    gpu-stream:fp32-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: float
        check_data: true
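The fp32 size (2617245696) is exactly twice the fp64 size (1308622848). Reading size as a per-buffer element count, which is an assumption rather than something stated here, the sketch below computes the resulting buffer footprint for each of the four entries; buffer_bytes and CONFIGS are illustrative names.

```python
# Hypothetical check of the config sizes, assuming `size` is the number of
# elements per buffer (not bytes).
ELEMENT_BYTES = {"float": 4, "double": 8}
MIB = 1024 ** 2

CONFIGS = {
    "gpu-stream:fp64": (1308622848, "double"),
    "gpu-stream:fp64-correctness": (1048576, "double"),
    "gpu-stream:fp32": (2617245696, "float"),
    "gpu-stream:fp32-correctness": (1048576, "float"),
}


def buffer_bytes(size: int, data_type: str) -> int:
    """Bytes occupied by one buffer holding `size` elements of `data_type`."""
    return size * ELEMENT_BYTES[data_type]


for name, (size, dt) in CONFIGS.items():
    nbytes = buffer_bytes(size, dt)
    print(f"{name}: {nbytes} bytes = {nbytes // MIB} MiB per buffer")
```

Under that reading, both performance entries use 10468982784 bytes (9984 MiB, i.e. 9.75 GiB) per buffer, so fp32 and fp64 bandwidth would be measured over the same memory footprint, while the correctness entries stay small (8 MiB for double, 4 MiB for float).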

New rule:

    gpu-stream:
      statistics:
        - mean
      categories: GPU-STREAM
      aggregate: True
      metrics:
        - gpu-stream:fp(?:32|64)/STREAM_.*_(?:bw|ratio):(\d+)
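To see which metric names the rule's regular expression selects, the sketch below applies the pattern with Python's re module and returns the rank captured by (\d+). Requiring the whole name to match (fullmatch) is an assumption about how the rule is applied, and all names except the first are hypothetical examples.

```python
import re
from typing import Optional

# Metric pattern copied verbatim from the rule above.
RULE_PATTERN = re.compile(r"gpu-stream:fp(?:32|64)/STREAM_.*_(?:bw|ratio):(\d+)")


def selected_rank(metric: str) -> Optional[int]:
    """Return the captured rank if the rule selects the metric, else None.

    Assumption: the whole metric name must match (fullmatch).
    """
    m = RULE_PATTERN.fullmatch(metric)
    return int(m.group(1)) if m else None


examples = [
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:0",
    # Hypothetical names below, for illustration only.
    "gpu-stream:fp64/STREAM_TRIAD_double_buffer_1308622848_block_256_ratio:3",
    "gpu-stream:fp32-correctness/STREAM_COPY_float_buffer_1048576_block_256_bw:0",
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw",
]
for name in examples:
    print(selected_rank(name), name)
```

If the prefix of each metric name is the config key (an assumption), the / required right after fp32 or fp64 means the -correctness entries are not picked up by this rule, and names without a :<rank> suffix are skipped as well.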

Example results:

    {
      "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:0": 1234,
      "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:1": 1234,
      "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:2": 1234,
      "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:3": 1234
    }

Processed by rules:

| Metric | Statistic | Value |
|---|---|---|
| gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw | mean | 1234 |
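The step from the per-rank example results to the single summary row can be sketched as grouping metrics whose names differ only in the trailing :<rank> suffix and taking the mean of each group. This is a hypothetical illustration of what aggregate: True with statistics: mean appears to produce here, not SuperBench's result-summary code; aggregate_mean is an illustrative name, and 1234 is the example's placeholder value.

```python
import re
from collections import defaultdict
from statistics import mean

# Per-rank results copied from the example above.
results = {
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:0": 1234,
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:1": 1234,
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:2": 1234,
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:3": 1234,
}

# Trailing ":<rank>" suffix to strip before grouping.
RANK_SUFFIX = re.compile(r":\d+$")


def aggregate_mean(per_rank: dict) -> dict:
    """Group metrics that differ only in their rank suffix and average each group."""
    groups = defaultdict(list)
    for name, value in per_rank.items():
        groups[RANK_SUFFIX.sub("", name)].append(value)
    return {name: mean(values) for name, values in groups.items()}


for name, value in aggregate_mean(results).items():
    print(f"| {name} | mean | {value} |")
```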

WenqingLan1 requested a review from a team as a code owner on December 19, 2025, 20:05
WenqingLan1 added the micro-benchmarks label (Micro Benchmark Test for SuperBench Benchmarks) on Dec 19, 2025

codecov bot commented Dec 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.70%. Comparing base (c99380b) to head (3c359a3).

Additional details and impacted files
    @@           Coverage Diff           @@
    ##             main     #769   +/-   ##
    =======================================
      Coverage   85.69%   85.70%           
    =======================================
      Files         102      102           
      Lines        7699     7700    +1     
    =======================================
    + Hits         6598     6599    +1     
      Misses       1101     1101           
| Flag | Coverage Δ |
|---|---|
| cpu-python3.10-unit-test | 70.97% <50.00%> (+<0.01%) ⬆️ |
| cpu-python3.7-unit-test | 70.42% <50.00%> (+<0.01%) ⬆️ |
| cuda-unit-test | 83.61% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

guoshzhao self-assigned this on Dec 19, 2025