Environment
Hardware:
- GPU: 8x NVIDIA A100-SXM4-80GB
- CPU: Dual socket (NUMA 0: GPU0-3, NUMA 1: GPU4-7)
- Memory: DDR4 8-channel per CPU socket
- Topology: PCIe 4.0 (see below)
Software:
- nvbandwidth Version: v0.8
- CUDA Runtime: 12.6.0
- CUDA Driver: 12.6.0
- Driver Version: 560.35.05
- OS: Linux
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0
GPU1 PXB X NODE NODE SYS SYS SYS SYS 0-31,64-95 0
GPU2 NODE NODE X PXB SYS SYS SYS SYS 0-31,64-95 0
GPU3 NODE NODE PXB X SYS SYS SYS SYS 0-31,64-95 0
GPU4 SYS SYS SYS SYS X PXB NODE NODE 32-63,96-127 1
GPU5 SYS SYS SYS SYS PXB X NODE NODE 32-63,96-127 1
GPU6 SYS SYS SYS SYS NODE NODE X PXB 32-63,96-127 1
GPU7 SYS SYS SYS SYS NODE NODE PXB X 32-63,96-127 1
Description
When running bidirectional Host-Device memory copy tests, I observe significant asymmetry in bandwidth:
- Unidirectional transfers (expected behavior):
  - H2D: ~25 GB/s ✅
  - D2H: ~25 GB/s ✅
- Bidirectional transfers (unexpected behavior):
  - D2H: ~24 GB/s ✅ (maintains performance)
  - H2D: ~11 GB/s ❌ (drops by ~56%)
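For reference, the ~56% figure is simply the bidirectional H2D rate compared against the unidirectional baseline:

```shell
# Relative H2D drop: (baseline - bidirectional) / baseline
awk 'BEGIN { printf "%.0f%%\n", (25 - 11) / 25 * 100 }'
```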
Test Results
1. Unidirectional Host-to-Device (Baseline)
$ ./nvbandwidth -t host_to_device_memcpy_ce
Result: ~24.8 GB/s per GPU (SUM: 198.72 GB/s)
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 24.81 24.80 24.84 24.83 24.88 24.83 24.84 24.88
2. Unidirectional Device-to-Host (Baseline)
$ ./nvbandwidth -t device_to_host_memcpy_ce
Result: ~25.9 GB/s per GPU (SUM: 207.13 GB/s)
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 25.89 25.88 25.89 25.89 25.89 25.89 25.89 25.90
3. Bidirectional Device-to-Host
$ ./nvbandwidth -t device_to_host_bidirectional_memcpy_ce
Result: ~24.2 GB/s per GPU (SUM: 194.07 GB/s) ✅ Maintains performance
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 24.25 24.26 24.24 24.26 24.27 24.26 24.27 24.27
4. Bidirectional Host-to-Device ⚠️ ISSUE HERE
$ ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce
Result: ~11 GB/s per GPU ❌ (56% performance drop)
(Note: the matrix output was truncated in my capture, but the pattern shows H2D bandwidth dropping significantly in bidirectional mode.)
Expected Behavior
Since PCIe is full-duplex, I expect:
- Bidirectional H2D bandwidth: ~24-25 GB/s (similar to unidirectional)
- Bidirectional D2H bandwidth: ~24-25 GB/s (similar to unidirectional)
Both directions should maintain near-peak performance simultaneously.
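For context, the per-direction ceiling follows from the PCIe 4.0 link parameters (16 GT/s per lane, 128b/130b encoding, x16 link). This is plain arithmetic, not a measurement:

```shell
# PCIe 4.0 x16 effective per-direction bandwidth:
# 16 GT/s per lane * (128/130 encoding efficiency) / 8 bits-per-byte * 16 lanes
awk 'BEGIN { printf "%.1f GB/s per direction\n", 16 * (128 / 130) / 8 * 16 }'
```

The ~25 GB/s I measure unidirectionally is consistent with this ~31.5 GB/s ceiling once TLP/packet overhead is accounted for, which is why I expect both directions to hold near that level concurrently.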
Observations
- D2H maintains performance in bidirectional mode (~24 GB/s)
- H2D drops significantly in bidirectional mode (~11 GB/s, ~56% reduction)
- Memory bandwidth is not saturated (verified with pcm-memory; DDR4 8-channel is not fully utilized)
- PCIe 4.0 x16 theoretical bandwidth: ~32 GB/s per direction (full-duplex)
Questions & Request for help
Could you please help clarify:
- Whether this asymmetry is expected on this hardware configuration
- What might be causing the H2D throttling in bidirectional mode
- Any recommendations for investigation or mitigation
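One experiment I plan to try is pinning nvbandwidth's CPU threads and host allocations to each GPU group's local NUMA node, to rule out cross-socket traffic as a factor (NUMA 0 is local to GPU0-3, NUMA 1 to GPU4-7 per the topology above). The loop below just prints the invocations; numactl availability and the binary path are assumptions for my system:

```shell
# Print NUMA-pinned nvbandwidth invocations; run them manually to compare
# against the unpinned bidirectional H2D results.
for node in 0 1; do
  echo numactl --cpunodebind="$node" --membind="$node" \
    ./nvbandwidth -t host_to_device_bidirectional_memcpy_ce
done
```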
Thank you!