This repository contains C++ and CUDA scripts designed to benchmark the NVIDIA T4 GPU on AWS EC2 g4dn instances. The suite includes tests for memory bandwidth, compute throughput, stream overlap, precision scaling, and multi-node network latency (AWS EFA vs Soft-RoCE vs TCP).
| File | Description |
|---|---|
| `setup_cuda.sh` | Verifies the NVIDIA driver, installs the nvidia-cuda-toolkit (`nvcc`), and installs Python dependencies for plotting. |
| `cuda_benchmark.sh` | The main interface script. It compiles the standard kernels (vector, matrix) and executes the benchmark loops. |
| `bench_vector_add.cu` | CUDA kernel for Vector Addition (element-wise `C = A + B`). |
| `bench_matrix_mul.cu` | CUDA kernel for Matrix Multiplication. Results are validated against a naive CPU implementation. |
| `run_advanced.sh` | Compiles and executes the single-node micro-benchmarks (bandwidth, overlap, FP16). |
| `soft-roce/` | Directory: scripts for benchmarking software-emulated RDMA on standard g4dn.xlarge instances. |
| `soft-roce/setup_roce.sh` | Installs RDMA drivers (ibverbs, rdma-core) and configures the RXE network interface. |
| `soft-roce/bench_net_gpu.cu` | Network benchmark comparing TCP sockets against RDMA Verbs for transferring data to GPU memory. |
| `efa_benchmark/` | Directory: scripts for benchmarking hardware EFA on g4dn.8xlarge instances using NCCL. |
| `efa_benchmark/setup_node.sh` | Installs EFA drivers and the AWS OFI NCCL plugin, and compiles nccl-tests. |
| `efa_benchmark/run_nccl.sh` | Executes the `all_reduce_perf` benchmark over EFA. |
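For orientation, the vector-addition benchmark boils down to a kernel like the following. This is an illustrative sketch, not the repository's actual code; names and the use of managed memory are assumptions (the real benchmark would use explicit `cudaMalloc`/`cudaMemcpy` so transfers can be timed separately):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Element-wise C = A + B; each thread handles one element.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```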
Result plots (generated by the benchmark scripts): Vector Addition Performance, Matrix Multiplication Performance, and Amazon EFA Networking Performance.
Sample output from run_advanced.sh on a g4dn.xlarge instance.
Comparing standard malloc (Pageable) against cudaMallocHost (Pinned/DMA) transfers.
| Transfer Size | Pageable Speed | Pinned Speed |
|---|---|---|
| 10 MB | 5.68 GB/s | 6.25 GB/s |
| 100 MB | 6.15 GB/s | 6.27 GB/s |
| 500 MB | 6.20 GB/s | 6.27 GB/s |
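The numbers above come from timing host-to-device copies out of the two allocation types. A minimal sketch of that measurement (illustrative, assuming CUDA events for timing; not the repository's exact code):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time one host-to-device copy of `bytes` from `src` and return GB/s.
static float h2dBandwidth(const void* src, void* dst, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (float)((bytes / 1e9) / (ms / 1e3));
}

int main() {
    const size_t bytes = 100ull << 20;  // 100 MB
    void* dev;
    cudaMalloc(&dev, bytes);

    void* pageable = malloc(bytes);  // ordinary pageable host memory
    void* pinned;
    cudaMallocHost(&pinned, bytes);  // page-locked, DMA-capable

    printf("pageable: %.2f GB/s\n", h2dBandwidth(pageable, dev, bytes));
    printf("pinned:   %.2f GB/s\n", h2dBandwidth(pinned, dev, bytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}
```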
Measuring the throughput advantage of Tensor Cores using `__half` intrinsics.
| Precision | Execution Time | Speedup |
|---|---|---|
| FP32 (Float) | 0.0328 s | 1.0x |
| FP16 (Half) | 0.0180 s | 1.82x |
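A kernel-level sketch of the comparison, assuming the benchmark packs values as `__half2` (two FP16 lanes per 32-bit word). Kernel names are illustrative; full Tensor Core matrix math would go through the `wmma` API instead:

```cuda
#include <cuda_fp16.h>

// FP32 baseline: one add per thread.
__global__ void addFloat(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// FP16 version: __hadd2 adds two packed half values per instruction,
// which is where the ~1.8x speedup in the table comes from.
__global__ void addHalf2(const __half2* a, const __half2* b, __half2* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = __hadd2(a[i], b[i]);
}
```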
Speed comparison of transferring 100 MB buffers between g4dn instances.
| Configuration | Instance Type | Transport | Throughput | Notes |
|---|---|---|---|---|
| Standard TCP | g4dn.xlarge | TCP/IP | ~0.56 GB/s | Limited by standard Ethernet overhead. |
| Soft-RoCE | g4dn.xlarge | RXE (Software) | ~0.30 GB/s | Slower due to CPU emulation of InfiniBand frames. |
| AWS EFA | g4dn.8xlarge | SRD (Hardware) | ~2.56 GB/s | ~5x speedup. Limited by T4 GPU-to-CPU copy (no GPUDirect). |
**Note on EFA Speed:** The g4dn.8xlarge has a 50 Gbps (6.25 GB/s) network card. However, the NVIDIA T4 GPU does not support GPUDirect RDMA, so data must traverse GPU -> CPU -> Network, creating a CPU bottleneck around ~2.6 GB/s. This is a hardware limit of the instance, not the network.
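The staging path described above looks roughly like this on the sending side. This is a sketch only: `send_over_network` is a hypothetical stand-in for whatever transport call (NCCL/libfabric) actually moves the bytes:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical transport hook -- the real benchmark goes through NCCL/libfabric.
void send_over_network(const void* buf, size_t bytes);

// Without GPUDirect RDMA the NIC cannot read GPU memory directly, so every
// send pays for an extra device-to-host copy through a pinned bounce buffer.
void sendFromGpu(const void* devBuf, size_t bytes) {
    static void* bounce = nullptr;
    if (!bounce) cudaMallocHost(&bounce, bytes);                // pinned staging buffer
    cudaMemcpy(bounce, devBuf, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
    send_over_network(bounce, bytes);                           // CPU -> NIC
}
```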
- Instance: AWS EC2 `g4dn.xlarge` (or larger).
- AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch (Ubuntu 22.04 or 24.04).
- Drivers: Pre-installed on the AMI (NVIDIA 535+).
- Network: For RoCE tests, Security Groups must allow all traffic (or the specific TCP/UDP ports) between instances.
Run the setup script to install the compiler and plotting libraries.
```bash
chmod +x setup_cuda.sh
./setup_cuda.sh
```

Run the Vector Addition or Matrix Multiplication suites. These scripts automatically compile the binaries and generate visualization images.
```bash
# Syntax: ./cuda_benchmark.sh <type> <limit> <iterations>

# Run Vector Addition
./cuda_benchmark.sh vector 10000000 50

# Run Matrix Multiplication
./cuda_benchmark.sh matrix 2048 20
```

Run the hardware-specific diagnostic tests (Bandwidth, Overlap, FP16).
```bash
chmod +x run_advanced.sh
./run_advanced.sh
```

Works on any instance size (e.g., g4dn.xlarge). Good for testing code logic.
- Setup (Both Nodes): `cd soft-roce && ./setup_roce.sh && ./compile_net.sh`
- Run Server: `./bench_net server`
- Run Client: `./bench_net client <SERVER_IP>`
Requirements:
- Instance: Must use `g4dn.8xlarge` or larger (smaller sizes do not support EFA).
- Launch Config: You MUST check "Enable Elastic Fabric Adapter" in the Network Settings during launch.
- Placement Group: Instances should be in a "Cluster" placement group for low latency.
Step 1: Setup & Reboot (Both Nodes). The EFA driver updates the kernel, so you must reboot.

```bash
cd efa_benchmark
./setup_node.sh
sudo reboot
```

Step 2: SSH Configuration (Both Nodes). After reboot, log in and configure passwordless SSH.
```bash
cd efa_benchmark
./configure_ssh.sh
# COPY output key from Node A -> ~/.ssh/authorized_keys on Node B
# COPY output key from Node B -> ~/.ssh/authorized_keys on Node A
# VERIFY by running: ssh <OTHER_IP> (it must not ask for a password)
```

Step 3: Run Benchmark (Node 1)
```bash
./run_nccl.sh <NODE_1_IP> <NODE_2_IP>
```

If the benchmark hangs or fails:
- Check Firewall: EFA uses the SRD protocol. Ensure your Security Group's inbound and outbound rules allow traffic from the security group to itself.
- Disable Ptrace: Ubuntu re-enables ptrace protection on reboot. Run `sudo sysctl -w kernel.yama.ptrace_scope=0`.
- Test Hardware: Run `/opt/amazon/efa/bin/fi_pingpong -p efa -e rdm`. If this hangs but `ping <other instance>` does not, check your inbound/outbound rules.


