Asymmetric bandwidth in bidirectional Host-Device transfers: H2D drops to ~11 GB/s while D2H maintains ~25 GB/s #53

Description

@chanpapa

Environment

Hardware:

  • GPU: 8x NVIDIA A100-SXM4-80GB
  • CPU: Dual socket (NUMA 0: GPU0-3, NUMA 1: GPU4-7)
  • Memory: DDR4 8-channel per CPU socket
  • Topology: PCIe 4.0 (see below)

Software:

  • nvbandwidth Version: v0.8
  • CUDA Runtime: 12.6.0
  • CUDA Driver: 12.6.0
  • Driver Version: 560.35.05
  • OS: Linux

GPU Topology:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      PXB     NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU1    PXB      X      NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU2    NODE    NODE     X      PXB     SYS     SYS     SYS     SYS     0-31,64-95      0
GPU3    NODE    NODE    PXB      X      SYS     SYS     SYS     SYS     0-31,64-95      0
GPU4    SYS     SYS     SYS     SYS      X      PXB     NODE    NODE    32-63,96-127    1
GPU5    SYS     SYS     SYS     SYS     PXB      X      NODE    NODE    32-63,96-127    1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      PXB     32-63,96-127    1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    PXB      X      32-63,96-127    1
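
For reference, the matrix above follows the format printed by the standard NVIDIA/Linux tools below (these are system utilities, not nvbandwidth options), which can be used to reproduce the topology and NUMA layout on the same machine:

$ nvidia-smi topo -m       # GPU/CPU affinity matrix as shown above
$ numactl --hardware       # NUMA node, CPU, and memory layout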

Description

When running bidirectional Host-Device memory copy tests, I observe significant asymmetry in bandwidth:

  • Unidirectional transfers (expected behavior):

    • H2D: ~25 GB/s ✅
    • D2H: ~25 GB/s ✅
  • Bidirectional transfers (unexpected behavior):

    • D2H: ~24 GB/s ✅ (maintains performance)
    • H2D: ~11 GB/s ❌ (drops by ~56%)

Test Results

1. Unidirectional Host-to-Device (Baseline)

$ ./nvbandwidth -t host_to_device_memcpy_ce

Result: ~24.8 GB/s per GPU (SUM: 198.72 GB/s)

memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0     24.81     24.80     24.84     24.83     24.88     24.83     24.84     24.88

2. Unidirectional Device-to-Host (Baseline)

$ ./nvbandwidth -t device_to_host_memcpy_ce

Result: ~25.9 GB/s per GPU (SUM: 207.13 GB/s)

memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0     25.89     25.88     25.89     25.89     25.89     25.89     25.89     25.90

3. Bidirectional Device-to-Host

$ ./nvbandwidth -t device_to_host_bidirectional_memcpy_ce

Result: ~24.2 GB/s per GPU (SUM: 194.07 GB/s) ✅ Maintains performance

memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0     24.25     24.26     24.24     24.26     24.27     24.26     24.27     24.27

4. Bidirectional Host-to-Device ⚠️ ISSUE HERE

$ ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce

Result: ~11 GB/s per GPU (~56% performance drop)

(Note: the bandwidth matrix for this test was truncated in my captured output, but based on the observed pattern, H2D bandwidth drops significantly in this mode.)


Expected Behavior

Since PCIe is full-duplex, I expect:

  • Bidirectional H2D bandwidth: ~24-25 GB/s (similar to unidirectional)
  • Bidirectional D2H bandwidth: ~24-25 GB/s (similar to unidirectional)

Both directions should maintain near-peak performance simultaneously.
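
For completeness, below is a minimal standalone check I would use to see whether the asymmetry reproduces outside nvbandwidth. This is only a sketch and not how nvbandwidth itself measures; the buffer size, iteration count, and per-stream event timing are my own choices. It issues H2D and D2H copies concurrently on two streams using pinned host memory and reports each direction's bandwidth separately.

// bidi_check.cu -- sketch only; not the nvbandwidth implementation
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call) do {                                            \
        cudaError_t e = (call);                                     \
        if (e != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(e));                         \
            return 1;                                               \
        }                                                           \
    } while (0)

int main() {
    const size_t bytes = 512ull << 20;   // 512 MiB per copy (arbitrary choice)
    const int    iters = 20;

    void *hSrc, *hDst, *dSrc, *dDst;
    CHECK(cudaHostAlloc(&hSrc, bytes, cudaHostAllocDefault));  // pinned host memory
    CHECK(cudaHostAlloc(&hDst, bytes, cudaHostAllocDefault));
    CHECK(cudaMalloc(&dSrc, bytes));
    CHECK(cudaMalloc(&dDst, bytes));

    cudaStream_t sH2D, sD2H;
    CHECK(cudaStreamCreate(&sH2D));
    CHECK(cudaStreamCreate(&sD2H));

    cudaEvent_t h2dStart, h2dStop, d2hStart, d2hStop;
    CHECK(cudaEventCreate(&h2dStart)); CHECK(cudaEventCreate(&h2dStop));
    CHECK(cudaEventCreate(&d2hStart)); CHECK(cudaEventCreate(&d2hStop));

    // Warm-up so both copy engines are active before timing starts.
    CHECK(cudaMemcpyAsync(dDst, hSrc, bytes, cudaMemcpyHostToDevice, sH2D));
    CHECK(cudaMemcpyAsync(hDst, dSrc, bytes, cudaMemcpyDeviceToHost, sD2H));
    CHECK(cudaDeviceSynchronize());

    // Time each direction with events recorded in its own stream while
    // both directions run concurrently.
    CHECK(cudaEventRecord(h2dStart, sH2D));
    CHECK(cudaEventRecord(d2hStart, sD2H));
    for (int i = 0; i < iters; ++i) {
        CHECK(cudaMemcpyAsync(dDst, hSrc, bytes, cudaMemcpyHostToDevice, sH2D));
        CHECK(cudaMemcpyAsync(hDst, dSrc, bytes, cudaMemcpyDeviceToHost, sD2H));
    }
    CHECK(cudaEventRecord(h2dStop, sH2D));
    CHECK(cudaEventRecord(d2hStop, sD2H));
    CHECK(cudaDeviceSynchronize());

    float h2dMs = 0, d2hMs = 0;
    CHECK(cudaEventElapsedTime(&h2dMs, h2dStart, h2dStop));
    CHECK(cudaEventElapsedTime(&d2hMs, d2hStart, d2hStop));
    const double gb = (double)bytes * iters / 1e9;
    printf("H2D (concurrent): %.2f GB/s\n", gb / (h2dMs / 1e3));
    printf("D2H (concurrent): %.2f GB/s\n", gb / (d2hMs / 1e3));
    return 0;
}

Compiled with nvcc bidi_check.cu -o bidi_check and run per GPU (e.g. via CUDA_VISIBLE_DEVICES), this should show whether H2D alone degrades when D2H traffic is active, independent of the benchmark harness.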


Observations

  1. D2H maintains performance in bidirectional mode (~24 GB/s)
  2. H2D drops significantly in bidirectional mode (~11 GB/s, ~56% reduction)
  3. Memory bandwidth is not saturated (verified with pcm-memory - DDR4 8-channel not fully utilized)
  4. PCIe 4.0 x16 theoretical bandwidth: ~31.5 GB/s per direction (16 GT/s × 16 lanes × 128b/130b encoding), full-duplex
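
One experiment I would like to try (a guess at a mitigation, not a confirmed fix) is pinning the benchmark process and its host allocations to the NUMA node local to the GPUs under test, following the affinity table above, for example:

$ # GPU0-3 are local to NUMA node 0 on this system
$ numactl --cpunodebind=0 --membind=0 ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce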

Questions & Request for help

Could you please help clarify:

  1. Whether this asymmetry is expected on this hardware configuration
  2. What might be causing the H2D throttling in bidirectional mode
  3. Any recommendations for investigation or mitigation

Thank you!
