Performance drop in host_to_device_memcpy_sm with large buffers #42

@yuyuankang

I am testing an NVIDIA A100 GPU on Ubuntu 22.04 with the nvbandwidth tool (v0.8). As the buffer size increases, the reported throughput decreases:

Test case: host_to_device_memcpy_sm

  • 512 MiB: 25.13 GB/s
  • 1 GiB: 25.13 GB/s
  • 10 GiB: 18.05 GB/s
  • 20 GiB: 16.50 GB/s
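
To put the numbers above in relative terms, here is a minimal Python sketch that computes how far each result falls below the 512 MiB baseline. The dictionary and the `relative_drop` helper are illustrative names, not part of nvbandwidth; the values are copied from the runs in this issue, keyed by the `-b` argument used for each run.

```python
# Reported throughput (GB/s) from the runs in this issue,
# keyed by the buffer size passed to -b (512 corresponds to 512 MiB).
reported = {512: 25.13, 1024: 25.13, 10240: 18.05, 20480: 16.50}


def relative_drop(results, baseline_key=512):
    """Return the percentage drop of each result relative to the baseline entry."""
    base = results[baseline_key]
    return {
        size: round((base - bw) / base * 100, 1)
        for size, bw in results.items()
    }


drops = relative_drop(reported)
for size_mib, pct in drops.items():
    print(f"{size_mib:>6} MiB: {reported[size_mib]:.2f} GB/s ({pct:.1f}% below baseline)")
```

With these inputs the 10 GiB run comes out about 28% below the baseline and the 20 GiB run about 34% below it.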

Below is the output:

$ ./nvbandwidth -t host_to_device_memcpy_sm -b 512
nvbandwidth Version: v0.8
Built from Git version: v0.8

CUDA Runtime Version: 12060
CUDA Driver Version: 12060
Driver Version: 560.35.05

Device 0: NVIDIA A100-SXM4-40GB (00000000:99:00)

Running host_to_device_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
           0
 0     25.13

SUM host_to_device_memcpy_sm 25.13

NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
$ ./nvbandwidth -t host_to_device_memcpy_sm -b 1024
nvbandwidth Version: v0.8
Built from Git version: v0.8

CUDA Runtime Version: 12060
CUDA Driver Version: 12060
Driver Version: 560.35.05

Device 0: NVIDIA A100-SXM4-40GB (00000000:99:00)

Running host_to_device_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
           0
 0     25.13

SUM host_to_device_memcpy_sm 25.13

NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
$ ./nvbandwidth -t host_to_device_memcpy_sm -b 10240
nvbandwidth Version: v0.8
Built from Git version: v0.8

CUDA Runtime Version: 12060
CUDA Driver Version: 12060
Driver Version: 560.35.05

Device 0: NVIDIA A100-SXM4-40GB (00000000:99:00)

Running host_to_device_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
           0
 0     18.05

SUM host_to_device_memcpy_sm 18.05

NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
$ ./nvbandwidth -t host_to_device_memcpy_sm -b 20480
nvbandwidth Version: v0.8
Built from Git version: v0.8

CUDA Runtime Version: 12060
CUDA Driver Version: 12060
Driver Version: 560.35.05

Device 0: NVIDIA A100-SXM4-40GB (00000000:99:00)

Running host_to_device_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
           0
 0     16.50

SUM host_to_device_memcpy_sm 16.50

NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
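
For anyone who wants to repeat this sweep over more buffer sizes, the following is a rough Python sketch that pulls the bandwidth figure out of the `SUM` line in output like the above. The function name `parse_sum_bandwidth` and the regular expression are assumptions based only on the output format shown in this issue, not on any documented nvbandwidth interface.

```python
import re


def parse_sum_bandwidth(output: str) -> float:
    """Extract the GB/s value from a line of the form 'SUM <test_name> <value>'."""
    match = re.search(
        r"^SUM\s+(\S+)\s+([0-9]+(?:\.[0-9]+)?)\s*$", output, re.MULTILINE
    )
    if match is None:
        raise ValueError("no SUM line found in output")
    return float(match.group(2))


# Sample text copied from the -b 20480 run above.
sample = """Running host_to_device_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
           0
 0     16.50

SUM host_to_device_memcpy_sm 16.50
"""

print(parse_sum_bandwidth(sample))
```

The captured text of each `./nvbandwidth -t host_to_device_memcpy_sm -b <size>` run could be passed to this function to collect one value per buffer size.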
