This repository contains C++ and CUDA scripts designed to benchmark the NVIDIA T4 GPU on AWS EC2 g4dn instances. The suite includes tests for memory bandwidth, compute throughput, stream overlap, precision scaling, and multi-node network latency (AWS EFA vs Soft-RoCE vs TCP).
| File | Description |
|---|---|
| `setup_cuda.sh` | Verifies the NVIDIA driver, installs the nvidia-cuda-toolkit (`nvcc`), and installs Python dependencies for plotting. |
| `cuda_benchmark.sh` | The main interface script. It compiles the standard kernels (vector, matrix) and executes the benchmark loops. |
| `bench_vector_add.cu` | CUDA kernel for Vector Addition (element-wise `C = A + B`). |
| `bench_matrix_mul.cu` | CUDA kernel for Matrix Multiplication. Results are validated against a naive CPU implementation. |
| `run_advanced.sh` | Compiles and executes the single-node micro-benchmarks (bandwidth, overlap, FP16). |
| `soft-roce/` | Directory: scripts for benchmarking software-emulated RDMA on standard g4dn.xlarge instances. |
| `soft-roce/setup_roce.sh` | Installs RDMA drivers (ibverbs, rdma-core) and configures the RXE network interface. |
| `soft-roce/bench_net_gpu.cu` | Network benchmark comparing TCP sockets against RDMA Verbs for transferring data to GPU memory. |
| `efa_benchmark/` | Directory: scripts for benchmarking hardware EFA on g4dn.8xlarge instances using NCCL. |
| `efa_benchmark/setup_node.sh` | Installs EFA drivers and the AWS OFI NCCL plugin, and compiles nccl-tests. |
| `efa_benchmark/run_nccl.sh` | Executes the `all_reduce_perf` benchmark over EFA. |
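For orientation, the vector-addition benchmark boils down to a kernel like the following. This is an illustrative sketch, not the repository's actual code; names and the use of managed memory are assumptions (the real benchmark would use explicit `cudaMalloc`/`cudaMemcpy` so transfers can be timed separately):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Element-wise C = A + B; each thread handles one element.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```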
Result plots (generated by the benchmark scripts): Vector Addition Performance, Matrix Multiplication Performance, and Amazon EFA Networking Performance.
Sample output from run_advanced.sh on a g4dn.xlarge instance.
Comparing standard malloc (Pageable) against cudaMallocHost (Pinned/DMA) transfers.
| Transfer Size | Pageable Speed | Pinned Speed |
|---|---|---|
| 10 MB | 5.68 GB/s | 6.25 GB/s |
| 100 MB | 6.15 GB/s | 6.27 GB/s |
| 500 MB | 6.20 GB/s | 6.27 GB/s |
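The numbers above come from timing host-to-device copies out of the two allocation types. A minimal sketch of that measurement (illustrative, assuming CUDA events for timing; not the repository's exact code):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time one host-to-device copy of `bytes` from `src` and return GB/s.
static float h2dBandwidth(const void* src, void* dst, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (float)((bytes / 1e9) / (ms / 1e3));
}

int main() {
    const size_t bytes = 100ull << 20;  // 100 MB
    void* dev;
    cudaMalloc(&dev, bytes);

    void* pageable = malloc(bytes);  // ordinary pageable host memory
    void* pinned;
    cudaMallocHost(&pinned, bytes);  // page-locked, DMA-capable

    printf("pageable: %.2f GB/s\n", h2dBandwidth(pageable, dev, bytes));
    printf("pinned:   %.2f GB/s\n", h2dBandwidth(pinned, dev, bytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}
```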
Measuring the throughput advantage of Tensor Cores using `__half` intrinsics.
| Precision | Execution Time | Speedup |
|---|---|---|
| FP32 (Float) | 0.0328 s | 1.0x |
| FP16 (Half) | 0.0180 s | 1.82x |
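A kernel-level sketch of the comparison, assuming the benchmark packs values as `__half2` (two FP16 lanes per 32-bit word). Kernel names are illustrative; full Tensor Core matrix math would go through the `wmma` API instead:

```cuda
#include <cuda_fp16.h>

// FP32 baseline: one add per thread.
__global__ void addFloat(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// FP16 version: __hadd2 adds two packed half values per instruction,
// which is where the ~1.8x speedup in the table comes from.
__global__ void addHalf2(const __half2* a, const __half2* b, __half2* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = __hadd2(a[i], b[i]);
}
```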
Speed comparison of transferring 100 MB buffers between g4dn instances.
| Configuration | Instance Type | Transport | Throughput | Notes |
|---|---|---|---|---|
| Standard TCP | g4dn.xlarge | TCP/IP | ~0.56 GB/s | Limited by standard Ethernet overhead. |
| Soft-RoCE | g4dn.xlarge | RXE (Software) | ~0.30 GB/s | Slower due to CPU emulation of InfiniBand frames. |
| AWS EFA | g4dn.8xlarge | SRD (Hardware) | ~2.56 GB/s | ~5x speedup. Limited by T4 GPU-to-CPU copy (no GPUDirect). |
**Note on EFA Speed:** The g4dn.8xlarge has a 50 Gbps (6.25 GB/s) network card. However, the NVIDIA T4 GPU does not support GPUDirect RDMA, so data must traverse GPU -> CPU -> Network, creating a CPU bottleneck around ~2.6 GB/s. This is a hardware limit of the instance, not the network.
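The staging path described above looks roughly like this on the sending side. This is a sketch only: `send_over_network` is a hypothetical stand-in for whatever transport call (NCCL/libfabric) actually moves the bytes:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical transport hook -- the real benchmark goes through NCCL/libfabric.
void send_over_network(const void* buf, size_t bytes);

// Without GPUDirect RDMA the NIC cannot read GPU memory directly, so every
// send pays for an extra device-to-host copy through a pinned bounce buffer.
void sendFromGpu(const void* devBuf, size_t bytes) {
    static void* bounce = nullptr;
    if (!bounce) cudaMallocHost(&bounce, bytes);                // pinned staging buffer
    cudaMemcpy(bounce, devBuf, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
    send_over_network(bounce, bytes);                           // CPU -> NIC
}
```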
- Instance: AWS EC2 `g4dn.xlarge` (or larger).
- AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch (Ubuntu 22.04 or 24.04).
- Drivers: Pre-installed on the AMI (NVIDIA 535+).
- Network: For RoCE tests, Security Groups must allow all traffic (or the specific TCP/UDP ports) between instances.
Run the setup script to install the compiler and plotting libraries.
```bash
chmod +x setup_cuda.sh
./setup_cuda.sh
```

Run the Vector Addition or Matrix Multiplication suites. These scripts automatically compile the binaries and generate visualization images.
```bash
# Syntax: ./cuda_benchmark.sh <type> <limit> <iterations>

# Run Vector Addition
./cuda_benchmark.sh vector 10000000 50

# Run Matrix Multiplication
./cuda_benchmark.sh matrix 2048 20
```

Run the hardware-specific diagnostic tests (Bandwidth, Overlap, FP16).
```bash
chmod +x run_advanced.sh
./run_advanced.sh
```

Works on any instance size (e.g., g4dn.xlarge). Good for testing code logic.
- Setup (Both Nodes): `cd soft-roce && ./setup_roce.sh && ./compile_net.sh`
- Run Server: `./bench_net server`
- Run Client: `./bench_net client <SERVER_IP>`
Requirements:
- Instance: Must use `g4dn.8xlarge` or larger (smaller sizes do not support EFA).
- Launch Config: You MUST check "Enable Elastic Fabric Adapter" in the Network Settings during launch.
- Placement Group: Instances should be in a "Cluster" placement group for low latency.
Step 1: Setup & Reboot (Both Nodes). The EFA driver updates the kernel, so you must reboot.

```bash
cd efa_benchmark
./setup_node.sh
sudo reboot
```

Step 2: SSH Configuration (Both Nodes). After reboot, log in and configure passwordless SSH.
```bash
cd efa_benchmark
./configure_ssh.sh
# COPY output key from Node A -> ~/.ssh/authorized_keys on Node B
# COPY output key from Node B -> ~/.ssh/authorized_keys on Node A
# VERIFY by running: ssh <OTHER_IP> (it must not ask for a password)
```

Step 3: Run Benchmark (Node 1)
```bash
./run_nccl.sh <NODE_1_IP> <NODE_2_IP>
```

If the benchmark hangs or fails:
- Check Firewall: EFA uses the SRD protocol. Ensure your Security Group's inbound and outbound rules allow traffic from the security group to itself.
- Disable Ptrace: Ubuntu re-enables ptrace protection on reboot. Run `sudo sysctl -w kernel.yama.ptrace_scope=0`.
- Test Hardware: Run `/opt/amazon/efa/bin/fi_pingpong -p efa -e rdm`. If this hangs but `ping <other instance>` does not, check your inbound/outbound rules.


