Open
Description
I’m using nvbandwidth to understand memcpy performance across NVLink-C2C on a GH200 (900GB/s bi-directional).
Command example: ./nvbandwidth --testcase 0 1 2 3 16 17 32 --testSamples 1 --useMean -b 20480
I do not understand the following performance characteristics. What could be the reasons for this?
- As I understand from "[GH200] Unexpected Low Host-to-Device Bandwidth" (#23), the SM copy version should outperform the Copy Engine (CE) version because of the available parallel threads. In this case, however, I see a drastic drop in performance for buffer sizes of 20GB and above. Is copy kernel tuning necessary?
- Prior work suggests the Copy Engines should operate in parallel. In the bi-directional case, however, I see that this does not hold: the performance is half of the peak.
- The DtoH and HtoD asymmetry is larger than previously reported estimates.
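
As a rough sanity check on the three points above, here is a minimal Python sketch of the arithmetic for comparing a measured copy against the link peak. It rests on two assumptions that the issue does not state: that `-b` is given in MiB (so `-b 20480` is a 20 GiB buffer), and that the 900GB/s bi-directional figure splits evenly into 450GB/s per direction. The helper names are hypothetical and are not part of nvbandwidth.

```python
# Sketch only: helper names are hypothetical, not part of nvbandwidth.
# Assumption: -b is in MiB, so -b 20480 is a 20 GiB buffer.
# Assumption: 900 GB/s bi-directional splits evenly, 450 GB/s per direction.

MIB = 1024 ** 2   # bytes per MiB
GB = 1e9          # bytes per GB (decimal, as used in GB/s figures)

LINK_PEAK_BIDIR_GBPS = 900.0
LINK_PEAK_UNIDIR_GBPS = LINK_PEAK_BIDIR_GBPS / 2


def buffer_bytes(buffer_mib: int) -> int:
    """Convert a buffer size in MiB to bytes."""
    return buffer_mib * MIB


def bandwidth_gbps(num_bytes: int, seconds: float) -> float:
    """Bandwidth in GB/s for num_bytes moved in the given time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / seconds / GB


def fraction_of_peak(measured_gbps: float, bidirectional: bool = False) -> float:
    """Measured bandwidth as a fraction of the assumed link peak."""
    peak = LINK_PEAK_BIDIR_GBPS if bidirectional else LINK_PEAK_UNIDIR_GBPS
    return measured_gbps / peak


def min_copy_time_s(buffer_mib: int, bidirectional: bool = False) -> float:
    """Lower bound on copy time if the link ran at its assumed peak.

    For the bi-directional case, buffer_mib is the size copied in EACH
    direction, so twice that many bytes cross the link in total.
    """
    peak = LINK_PEAK_BIDIR_GBPS if bidirectional else LINK_PEAK_UNIDIR_GBPS
    total = buffer_bytes(buffer_mib) * (2 if bidirectional else 1)
    return total / (peak * GB)


if __name__ == "__main__":
    size = buffer_bytes(20480)
    print(f"buffer: {size} bytes ({size / GB:.2f} GB)")
    print(f"best-case unidirectional copy time: {min_copy_time_s(20480):.4f} s")
    print(f"best-case bidirectional copy time:  {min_copy_time_s(20480, True):.4f} s")
```

Passing each measured value through `fraction_of_peak` puts the SM, CE, and bi-directional results on one scale. A bi-directional result near 0.5 of the 900GB/s figure would be consistent with the two directions not overlapping.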
