
SM memcpy performance concerns on GH200 #41

@rajesh-s

I’m using nvbandwidth to understand memcpy performance across NVLink-C2C on a GH200 (900 GB/s bidirectional).

Command example: ./nvbandwidth --testcase 0 1 2 3 16 17 32 --testSamples 1 --useMean -b 20480

I do not understand the following performance characteristics. What could be the reasons for this?

  1. As I understand from [GH200] Unexpected Low Host-to-Device Bandwidth #23, the SM version should outperform the Copy Engines because of the parallel threads it has available. In this case, however, I am seeing a drastic drop in performance for buffer sizes of 20 GB and above. Is copy kernel tuning necessary?
  2. Prior work suggests that the Copy Engines should operate in parallel. In the bidirectional case, however, I see that this does not hold: the performance is half of the peak.
  3. The DtoH and HtoD asymmetry is larger than previously reported estimates.
Image
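
To narrow down where the drop in item 1 begins, one option is to repeat the run over several buffer sizes. The following is only a sketch: it builds the command lines rather than running them, and it reuses only the flags from the command above. The helper name `sweep_commands` and the list of sizes are hypothetical, and treating the `-b` value as MiB is an assumption based on `-b 20480` corresponding to the 20 GB case described here.

```python
import shlex


def sweep_commands(sizes_mib, testcases=(0, 1, 2, 3, 16, 17, 32),
                   binary="./nvbandwidth"):
    """Build one nvbandwidth command line per buffer size.

    sizes_mib: values passed to -b (assumed to be MiB, see the note above).
    testcases: test case indices, copied from the command in this issue.
    """
    commands = []
    for size in sizes_mib:
        argv = [binary, "--testcase", *map(str, testcases),
                "--testSamples", "1", "--useMean", "-b", str(size)]
        # shlex.join yields a string that is safe to paste into a shell.
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Hypothetical sweep; the last size matches the run reported above.
    for cmd in sweep_commands([1024, 4096, 16384, 20480]):
        print(cmd)
```

Printing the commands instead of executing them keeps the sketch independent of the machine. Each line can then be run by hand or passed to `subprocess.run` on the GH200 node.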
