Environment
Hardware:
- GPU: 8x NVIDIA A100-SXM4-80GB
- CPU: Dual socket (NUMA 0: GPU0-3, NUMA 1: GPU4-7)
- Memory: DDR4 8-channel per CPU socket
- Topology: PCIe 4.0 (see below)
Software:
- nvbandwidth Version: v0.8
- CUDA Runtime: 12.6.0
- CUDA Driver: 12.6.0
- Driver Version: 560.35.05
- OS: Linux
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0
GPU1 PXB X NODE NODE SYS SYS SYS SYS 0-31,64-95 0
GPU2 NODE NODE X PXB SYS SYS SYS SYS 0-31,64-95 0
GPU3 NODE NODE PXB X SYS SYS SYS SYS 0-31,64-95 0
GPU4 SYS SYS SYS SYS X PXB NODE NODE 32-63,96-127 1
GPU5 SYS SYS SYS SYS PXB X NODE NODE 32-63,96-127 1
GPU6 SYS SYS SYS SYS NODE NODE X PXB 32-63,96-127 1
GPU7 SYS SYS SYS SYS NODE NODE PXB X 32-63,96-127 1
Description
When running bidirectional Host-Device memory copy tests, I observe significant asymmetry in bandwidth:
- Unidirectional transfers (expected behavior):
  - H2D: ~25 GB/s ✅
  - D2H: ~25 GB/s ✅
- Bidirectional transfers (unexpected behavior):
  - D2H: ~24 GB/s ✅ (maintains performance)
  - H2D: ~11 GB/s ❌ (drops by ~56%)
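For reference, the ~56% figure is simply the bidirectional H2D rate compared against the unidirectional baseline:

```shell
# Relative H2D drop: (baseline - bidirectional) / baseline
awk 'BEGIN { printf "%.0f%%\n", (25 - 11) / 25 * 100 }'
```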
Test Results
1. Unidirectional Host-to-Device (Baseline)
$ ./nvbandwidth -t host_to_device_memcpy_ce
Result: ~24.8 GB/s per GPU (SUM: 198.72 GB/s)
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 24.81 24.80 24.84 24.83 24.88 24.83 24.84 24.88
2. Unidirectional Device-to-Host (Baseline)
$ ./nvbandwidth -t device_to_host_memcpy_ce
Result: ~25.9 GB/s per GPU (SUM: 207.13 GB/s)
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 25.89 25.88 25.89 25.89 25.89 25.89 25.89 25.90
3. Bidirectional Device-to-Host
$ ./nvbandwidth -t device_to_host_bidirectional_memcpy_ce
Result: ~24.2 GB/s per GPU (SUM: 194.07 GB/s) ✅ Maintains performance
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 24.25 24.26 24.24 24.26 24.27 24.26 24.27 24.27
4. Bidirectional Host-to-Device ⚠️ ISSUE HERE
$ ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce
Result: ~11 GB/s per GPU ❌ (56% performance drop)
(Note: the matrix output was truncated in my capture, but the pattern shows H2D bandwidth dropping significantly in bidirectional mode.)
Expected Behavior
Since PCIe is full-duplex, I expect:
- Bidirectional H2D bandwidth: ~24-25 GB/s (similar to unidirectional)
- Bidirectional D2H bandwidth: ~24-25 GB/s (similar to unidirectional)
Both directions should maintain near-peak performance simultaneously.
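For context, the per-direction ceiling follows from the PCIe 4.0 link parameters (16 GT/s per lane, 128b/130b encoding, x16 link). This is plain arithmetic, not a measurement:

```shell
# PCIe 4.0 x16 effective per-direction bandwidth:
# 16 GT/s per lane * (128/130 encoding efficiency) / 8 bits-per-byte * 16 lanes
awk 'BEGIN { printf "%.1f GB/s per direction\n", 16 * (128 / 130) / 8 * 16 }'
```

The ~25 GB/s I measure unidirectionally is consistent with this ~31.5 GB/s ceiling once TLP/packet overhead is accounted for, which is why I expect both directions to hold near that level concurrently.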
Observations
- D2H maintains performance in bidirectional mode (~24 GB/s)
- H2D drops significantly in bidirectional mode (~11 GB/s, ~56% reduction)
- Memory bandwidth is not saturated (verified with pcm-memory; DDR4 8-channel is not fully utilized)
- PCIe 4.0 x16 theoretical bandwidth: ~32 GB/s per direction (full-duplex)
Questions & Request for help
Could you please help clarify:
- Whether this asymmetry is expected on this hardware configuration
- What might be causing the H2D throttling in bidirectional mode
- Any recommendations for investigation or mitigation
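One experiment I plan to try is pinning nvbandwidth's CPU threads and host allocations to each GPU group's local NUMA node, to rule out cross-socket traffic as a factor (NUMA 0 is local to GPU0-3, NUMA 1 to GPU4-7 per the topology above). The loop below just prints the invocations; numactl availability and the binary path are assumptions for my system:

```shell
# Print NUMA-pinned nvbandwidth invocations; run them manually to compare
# against the unpinned bidirectional H2D results.
for node in 0 1; do
  echo numactl --cpunodebind="$node" --membind="$node" \
    ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce
done
```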
Thank you!