diff --git a/notebooks/benchmarking.ipynb b/notebooks/benchmarking.ipynb new file mode 100644 index 00000000..a9f3ba15 --- /dev/null +++ b/notebooks/benchmarking.ipynb @@ -0,0 +1,1561 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "view-in-github" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_PCU0gUyzX2c" + }, + "source": [ + "# A Practical Guide to GPU Benchmarking\n", + "\n", + "> **Note on outputs:** The outputs in this notebook were generated on an **NVIDIA H200 GPU** (90MB L2 cache, 4.8 TB/s memory bandwidth). Your results may vary depending on your hardware. The H200's large cache means cache effects are less dramatic than on older GPUs like A100 (40MB L2) or consumer cards.\n", + "\n", + "## TL;DR — How to Benchmark Correctly\n", + "\n", + "Benchmarking on GPUs requires a strict protocol to avoid measuring Python overhead or caching artifacts. To get reliable numbers, you must:\n", + "\n", + "1. **Warmup:** Run the kernel ~10-50 times first to settle compilation and memory allocators.\n", + "2. **Sample Extensively:** Don't trust one run. Collect 100+ samples to build a statistical distribution.\n", + "3. **Flush the L2 Cache:** Between *every* sample, flush the cache to force a cold cache state (simulating real-world inference).\n", + "4. **Use Device Timers:** Use `torch.cuda.Event` instead of `time.time()` to measure execution on the GPU, not the CPU driver.\n", + "5. **Aggregate Robustly:** Aggregate over many samples to filter out jitter/outliers.\n", + "6. **Wait for sidestreams to finish:** Ensure no side-streams are running or wait for all of them to finish before reporting a time.\n", + "\n", + "*Pro-Tip:* **KernelBench's timing module** (`src/timing.py`) implements all these best practices. 
Use `get_timing_function(\"cuda_event\")` for trusted code or `get_timing_function(\"host_time\")` for evaluating untrusted/agent-generated code.\n", + "\n", + "-----\n", + "\n", + "If you are using an LLM agent to write GPU kernels (and evaluating against something like [KernelBench](https://github.com/ScalingIntelligence/KernelBench)), or just trying to optimize a custom GPU kernel, you are eventually going to ask: **\"How fast is this thing?\"**\n", + "\n", + "This notebook is heavily inspired by [this great guide](https://www.youtube.com/watch?v=1i7dxoAfKOU) from the **GPU MODE** community and by the practical \"footguns\" (traps) encountered while building benchmarking harnesses for LLM-generated code. Our goal here is simplicity and keeping things Pythonic; for more advanced techniques, see the footnotes.\n", + "\n", + "We won't just list best practices. Instead, we are going to build a benchmarking harness from scratch, make every common mistake, debug why the numbers are wrong, and iterate our way to a robust solution. Let's start with the most naive approach: `time.time()`!" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2025-12-17T21:24:36.427802Z", + "iopub.status.busy": "2025-12-17T21:24:36.427684Z", + "iopub.status.idle": "2025-12-17T21:24:40.995279Z", + "shell.execute_reply": "2025-12-17T21:24:40.994328Z" + }, + "id": "PKWz_W7uzX2f", + "outputId": "8751fd78-569b-4080-f21a-60bbb5ee8caf" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/simon/palic/KernelBench/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using GPU: NVIDIA H200\n" + ] + } + ], + "source": [ + "# @title Environment Setup\n", + "# Ensure we have the necessary libraries and a GPU available\n", + "# !pip install -q triton matplotlib numpy torch\n", + "# !pip install -e .. # Install KernelBench locally for timing utilities\n", + "\n", + "import sys\n", + "import os\n", + "sys.path.insert(0, '..') # Add parent directory to path for imports\n", + "\n", + "# For multi-GPU systems, set CUDA_VISIBLE_DEVICES=X before running to select a specific GPU\n", + "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"5\"\n", + "\n", + "import torch\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import triton\n", + "\n", + "# Import KernelBench's timing module\n", + "from src import timing\n", + "from src.timing import clear_l2_cache, get_timing_stats, get_timing_function\n", + "\n", + "if not torch.cuda.is_available():\n", + " raise RuntimeError(\"This notebook requires a GPU. Please enable GPU in your runtime settings.\")\n", + "\n", + "# Device configuration\n", + "# The selected GPU will appear as cuda:0\n", + "\n", + "DEVICE = f\"cuda:0\"\n", + "print(f\"Using GPU: {torch.cuda.get_device_name(DEVICE)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kjWByrwvzX2f" + }, + "source": [ + "## The Journey: Benchmarking a Matrix Multiplication\n", + "\n", + "Let's define a simple workload to test. We want to measure the performance of a standard Matrix Multiplication." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2025-12-17T21:24:40.997969Z", + "iopub.status.busy": "2025-12-17T21:24:40.997722Z", + "iopub.status.idle": "2025-12-17T21:24:41.252072Z", + "shell.execute_reply": "2025-12-17T21:24:41.250967Z" + }, + "id": "gxtKes5lzX2g", + "outputId": "5890bae4-5b9a-4366-8947-367146593158" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Output shape: torch.Size([8192, 8192])\n", + "Op ran successfully\n" + ] + } + ], + "source": [ + "# A standard size for testing\n", + "N = 8192\n", + "\n", + "def get_data(n=N, device=DEVICE):\n", + " \"\"\"Generate random float32 matrices for benchmarking.\"\"\"\n", + " return torch.randn(n, n, device=device), torch.randn(n, n, device=device)\n", + "\n", + "def simple_mm(a, b):\n", + " \"\"\"Our kernel under test: standard matrix multiplication.\"\"\"\n", + " return torch.matmul(a, b)\n", + "\n", + "# Let's verify it runs\n", + "a, b = get_data()\n", + "res = simple_mm(a, b)\n", + "print(f\"Output shape: {res.shape}\")\n", + "print(\"Op ran successfully\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GWlsBEVyzX2g" + }, + "source": [ + "### Attempt 1: The Naive Timer\n", + "\n", + "The most intuitive way to time code in Python is using `time.time()`. Let's try that first." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2025-12-17T21:24:41.254622Z", + "iopub.status.busy": "2025-12-17T21:24:41.254499Z", + "iopub.status.idle": "2025-12-17T21:24:41.258106Z", + "shell.execute_reply": "2025-12-17T21:24:41.257414Z" + }, + "id": "LynIxLaRzX2g", + "outputId": "72548dad-6570-4eaa-ce07-923556fe70b4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Naive time: 0.6356 ms\n" + ] + } + ], + "source": [ + "def benchmark_naive(func, *args):\n", + "    \"\"\"WRONG: Measures kernel launch time, not execution time.\"\"\"\n", + "    start = time.time()\n", + "    func(*args)  # Returns as soon as the kernel is queued, not when it finishes\n", + "    end = time.time()\n", + "    return (end - start) * 1000  # Convert seconds to ms\n", + "\n", + "t = benchmark_naive(simple_mm, a, b)\n", + "print(f\"Naive time: {t:.4f} ms\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gw4NGYRmzX2h" + }, + "source": [ + "**The Problem:**\n", + "Wait, less than 1ms? That seems impossibly fast for an 8192×8192 matrix multiplication involving over 1 trillion floating-point operations.[¹](#footnote-1)\n", + "\n", + "**What happened?**\n", + "GPUs are **asynchronous**. When you call `torch.matmul`, the CPU doesn't actually do the math. It simply queues a \"launch kernel\" command to the GPU and moves on immediately. Our timer didn't measure the matrix multiplication; it measured how long it took Python to place an order in the queue.\n", + "\n", + "To fix this, we need to:\n", + "1. **Synchronize** - Force the CPU to wait for the GPU with `torch.cuda.synchronize()`\n", + "2. 
**Use CUDA Events** - Record timestamps directly on the GPU to avoid CPU overhead\n", + "\n", + "Let's compare these approaches to see the difference.\n", + "\n", + "---\n", + "\n", + "\n", + "¹ **Why impossible?** The [H200](https://www.nvidia.com/en-us/data-center/h200/) peaks at 989 TFLOPS for TF32 Tensor Cores. 
At that rate: 1.1 TFLOP ÷ 989 TFLOP/s = **1.11ms minimum**. Anything under 1ms is physically impossible." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Attempt 2: Synchronizing the Device\n", + "\n", + "To fix this, we need to force the CPU to wait until the GPU has finished its work before we stop the clock. We do this with `torch.cuda.synchronize()`." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "execution": { + "iopub.execute_input": "2025-12-17T21:24:41.260592Z", + "iopub.status.busy": "2025-12-17T21:24:41.260479Z", + "iopub.status.idle": "2025-12-17T21:24:41.460207Z", + "shell.execute_reply": "2025-12-17T21:24:41.459267Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Sync time: 21.5368 ms\n" + ] + } + ], + "source": [ + "def benchmark_sync(func, *args):\n", + " \"\"\"Better: Actually waits for GPU to finish.\"\"\"\n", + " torch.cuda.synchronize() # Wait for previous work to finish\n", + " start = time.time()\n", + " func(*args)\n", + " torch.cuda.synchronize() # Wait for THIS work to finish\n", + " end = time.time()\n", + " return (end - start) * 1000\n", + "\n", + "t = benchmark_sync(simple_mm, a, b)\n", + "print(f\"Sync time: {t:.4f} ms\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dV8AmQi-zX2i" + }, + "source": [ + "### Attempt 3: Removing CPU Overhead (CUDA Events)\n", + "\n", + "To get a precise measurement, we need to bypass the CPU clock entirely. We can ask the GPU driver to record timestamps directly on the device using `torch.cuda.Event`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2025-12-17T21:24:41.463315Z", + "iopub.status.busy": "2025-12-17T21:24:41.463177Z", + "iopub.status.idle": "2025-12-17T21:24:41.532922Z", + "shell.execute_reply": "2025-12-17T21:24:41.531966Z" + }, + "id": "i6PfSdkTzX2i", + "outputId": "8b3e29d1-1789-4bfb-9a44-599016516dd7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Run 0: 21.7344 ms\n", + "Run 1: 21.4754 ms\n", + "Run 2: 21.4987 ms\n" + ] + } + ], + "source": [ + "def benchmark_events(func, *args):\n", + "    \"\"\"Better: Uses GPU timestamps, avoiding CPU overhead.\"\"\"\n", + "    start_event = torch.cuda.Event(enable_timing=True)\n", + "    end_event = torch.cuda.Event(enable_timing=True)\n", + "\n", + "    torch.cuda.synchronize(device=DEVICE)  # Drain previously queued work\n", + "    start_event.record()\n", + "    func(*args)\n", + "    end_event.record()\n", + "    torch.cuda.synchronize(device=DEVICE)  # Ensure both events have completed\n", + "\n", + "    return start_event.elapsed_time(end_event)  # Returns ms directly\n", + "\n", + "# Run it a few times\n", + "for i in range(3):\n", + "    print(f\"Run {i}: {benchmark_events(simple_mm, a, b):.4f} ms\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BkfaaDawzX2i" + }, + "source": [ + "### Attempt 4: Handling the \"Cold Start\"\n", + "\n", + "Notice that Run 0 is noticeably slower than the rest. The first time you run a PyTorch function (or launch a CUDA kernel), the framework does a lot of heavy lifting, which can include allocating memory, initializing cuBLAS/cuDNN workspaces, lazy kernel loading, and compiling kernels (especially if using `torch.compile` or Triton). 
This \"Cold Start\" penalty is a one-time cost that shouldn't be included in your performance metrics.\n", + "\n", + "**The Fix:**\n", + "We need to perform **Warmup Runs**—running the kernel a few times to settle the system state before we start measuring." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2025-12-17T21:24:41.535509Z", + "iopub.status.busy": "2025-12-17T21:24:41.535387Z", + "iopub.status.idle": "2025-12-17T21:24:42.246476Z", + "shell.execute_reply": "2025-12-17T21:24:42.245382Z" + }, + "id": "j_PsAuJkzX2i", + "outputId": "d6983401-72d3-468c-ffd8-c0833e8d5556" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Run 0: 21.6478 ms\n", + "Run 1: 21.4837 ms\n", + "Run 2: 21.4852 ms\n" + ] + } + ], + "source": [ + "def benchmark_warmup(func, *args, warmup_iters=30, benchmark_iters=3):\n", + " \"\"\"Better: Includes warmup to avoid cold-start penalty.\"\"\"\n", + " # Warmup phase\n", + " for _ in range(warmup_iters):\n", + " func(*args)\n", + " torch.cuda.synchronize(device=DEVICE)\n", + "\n", + " # Measurement phase\n", + " measurements = []\n", + " for _ in range(benchmark_iters):\n", + " measurements.append(benchmark_events(func, *args))\n", + " torch.cuda.synchronize(device=DEVICE)\n", + " return measurements\n", + "\n", + "# print(f\"Warmed up time: {benchmark_warmup(simple_mm, a, b):.4f} ms\")\n", + "\n", + "for i, measurement in enumerate(benchmark_warmup(simple_mm, a, b)):\n", + " print(f\"Run {i}: {measurement:.4f} ms\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OR3uOh7kzX2i" + }, + "source": [ + "### Attempt 5: The Single Sample Fallacy (Variance)\n", + "\n", + "Relying on a single sample after warmup is bad science. Operating systems are noisy; background processes interrupt the CPU, and GPU clocks fluctuate thermally. 
A single measurement is anecdotal, not statistical.\n", + "\n", + "#### Visualizing the Jitter\n", + "\n", + "Let's run the benchmark 100 times and plot every single run. You will clearly see the \"Cold Start\" spike and the noise floor of the OS." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 653 + }, + "execution": { + "iopub.execute_input": "2025-12-17T21:24:42.248937Z", + "iopub.status.busy": "2025-12-17T21:24:42.248818Z", + "iopub.status.idle": "2025-12-17T21:24:44.484746Z", + "shell.execute_reply": "2025-12-17T21:24:44.483759Z" + }, + "id": "T-7QH4cHzX2i", + "outputId": "c758effd-a810-422d-d3ca-66c128cdb716" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mean: 21.5096 ms\n", + "Median: 21.4942 ms\n", + "Std: 0.0625 ms\n", + "Min: 21.4420 ms\n", + "Max: 21.7808 ms\n" + ] + } + ], + "source": [ + "# Collect 100 samples\n", + "timings = []\n", + "for i in range(100):\n", + " timings.append(benchmark_events(simple_mm, a, b))\n", + "\n", + "plt.figure(figsize=(10, 6))\n", + "plt.scatter(range(100), timings, alpha=0.6)\n", + "plt.axhline(y=np.median(timings), color='r', linestyle='--', label=f'Median: {np.median(timings):.4f} ms')\n", + "plt.axhline(y=np.mean(timings), color='g', linestyle=':', label=f'Mean: {np.mean(timings):.4f} ms')\n", + "plt.title(\"Benchmarking Jitter & Cold Start\")\n", + "plt.ylabel(\"Time (ms)\")\n", + "plt.xlabel(\"Run Index\")\n", + "plt.legend()\n", + "plt.grid(True, alpha=0.3)\n", + "plt.show()\n", + "\n", + "print(f\"Mean: {np.mean(timings):.4f} ms\")\n", + "print(f\"Median: {np.median(timings):.4f} ms\")\n", + "print(f\"Std: {np.std(timings):.4f} ms\")\n", + "print(f\"Min: {np.min(timings):.4f} ms\")\n", + "print(f\"Max: {np.max(timings):.4f} ms\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hX_-OpftzX2i" + }, + "source": [ + "You will see a massive dot at $x=0$ (the cold start), followed by a cloud of dots hovering around the \"true\" time. This visualizes why we need **Warmup** (to skip $x=0$) and **Statistics** (to handle the cloud).\n", + "\n", + "Notice how the **Mean** is pulled upward by the outliers, while the **Median** represents the typical case more accurately. When possible, we should use the **Median** as our final metric.\n", + "\n", + "### Attempt 6: The \"Robust\" Harness (Flushing Cache)\n", + "\n", + "Modern GPUs have large L2 caches (40MB-192MB depending on architecture). If your data fits in the cache, subsequent iterations in your loop will skip the slow VRAM access, artificially inflating your speed. 
In production, data usually streams in from VRAM, so this \"hot cache\" benchmark is misleading.\n", + "\n", + "**The Fix:**\n", + "We must **flush the L2 cache** between *every single sample*. We do this by writing to a tensor large enough to completely evict the cache contents. KernelBench uses a ~256MB tensor to safely cover all GPU architectures, including the largest caches (e.g., Blackwell at ~192MB)." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "execution": { + "iopub.execute_input": "2025-12-17T21:24:44.487140Z", + "iopub.status.busy": "2025-12-17T21:24:44.487016Z", + "iopub.status.idle": "2025-12-17T21:24:44.489937Z", + "shell.execute_reply": "2025-12-17T21:24:44.489208Z" + }, + "id": "Kj5azcpxzX2j" + }, + "outputs": [], + "source": [ + "# KernelBench provides utilities to flush the L2 cache\n", + "# This is important for cold cache measurements that simulate real-world inference\n", + "\n", + "def clear_l2_cache(device=DEVICE):\n", + " \"\"\"Flush L2 cache by writing to a large tensor.\n", + " \n", + " L2 cache sizes vary by GPU, so we use 256MB to cover all cases.\n", + " \"\"\"\n", + " dummy = torch.empty((32, 1024, 1024), dtype=torch.int64, device=device) # 256MB\n", + " dummy.fill_(1901) # Force write to thrash cache\n", + " del dummy\n", + "\n", + "# KernelBench also provides clear_l2_cache_triton() for cross-platform support\n", + "# (works on both NVIDIA and AMD GPUs via Triton's device abstraction)\n", + "from src.timing import clear_l2_cache_triton" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Why does flushing the cache matter?\n", + "\n", + "Let's see the cache effect in action. We'll benchmark the same operation twice:\n", + "1. **Without** cache flushing between runs (data stays in L2 cache)\n", + "2. 
**With** cache flushing between runs (data must be fetched from VRAM each time)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "execution": { + "iopub.execute_input": "2025-12-17T21:24:44.492322Z", + "iopub.status.busy": "2025-12-17T21:24:44.492090Z", + "iopub.status.idle": "2025-12-17T21:24:44.507209Z", + "shell.execute_reply": "2025-12-17T21:24:44.506066Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Without cache flushing (warm cache):\n", + "\n", + "With cache flushing (cold cache):\n", + "\n", + "Warm cache median: 0.0280 ms\n", + "Cold cache median: 0.0318 ms\n", + "Difference: 0.0038 ms (13.7% slower with cold cache)\n", + "\n", + "Without cache flushing, you measure artificially fast times!\n" + ] + } + ], + "source": [ + "# Demonstrate why L2 cache flushing matters\n", + "# Use a smaller matrix so the effect is visible (data fits in cache)\n", + "N_SMALL = 512\n", + "a_small, b_small = get_data(N_SMALL)\n", + "\n", + "NUM_SAMPLES = 20\n", + "\n", + "# do warmup runs\n", + "for _ in range(NUM_SAMPLES):\n", + " clear_l2_cache(device=DEVICE)\n", + " benchmark_events(simple_mm, a_small, b_small)\n", + " torch.cuda.synchronize(device=DEVICE)\n", + "\n", + "# Benchmark WITHOUT cache flushing (warm cache - unrealistic)\n", + "print(\"Without cache flushing (warm cache):\")\n", + "times_warm = []\n", + "for i in range(NUM_SAMPLES):\n", + " t = benchmark_events(simple_mm, a_small, b_small)\n", + " times_warm.append(t)\n", + "\n", + "# Benchmark WITH cache flushing (cold cache - realistic)\n", + "print(\"\\nWith cache flushing (cold cache):\")\n", + "times_cold = []\n", + "for i in range(NUM_SAMPLES):\n", + " clear_l2_cache(device=DEVICE) # Flush cache before each measurement\n", + " t = benchmark_events(simple_mm, a_small, b_small)\n", + " times_cold.append(t)\n", + "\n", + "print(f\"\\nWarm cache median: {np.median(times_warm):.4f} ms\")\n", + "print(f\"Cold cache median: 
{np.median(times_cold):.4f} ms\")\n", + "print(f\"Difference: {np.median(times_cold) - np.median(times_warm):.4f} ms ({(np.median(times_cold)/np.median(times_warm) - 1)*100:.1f}% slower with cold cache)\")\n", + "print(\"\\nWithout cache flushing, you measure artificially fast times!\")" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "execution": { + "iopub.execute_input": "2025-12-17T21:24:44.509465Z", + "iopub.status.busy": "2025-12-17T21:24:44.509344Z", + "iopub.status.idle": "2025-12-17T21:24:44.597419Z", + "shell.execute_reply": "2025-12-17T21:24:44.596500Z" + } + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Visualize the cache effect\n", + "# Discard first few samples to remove initialization noise\n", + "discard = 3\n", + "times_warm_clean = times_warm[discard:]\n", + "times_cold_clean = times_cold[discard:]\n", + "\n", + "plt.figure(figsize=(10, 5))\n", + "plt.scatter(range(len(times_warm_clean)), times_warm_clean, alpha=0.7, label=f'Warm Cache (mean={np.mean(times_warm_clean):.4f}ms)', color='orange', s=60)\n", + "plt.scatter(range(len(times_cold_clean)), times_cold_clean, alpha=0.7, label=f'Cold Cache (mean={np.mean(times_cold_clean):.4f}ms)', color='blue', s=60)\n", + "plt.axhline(y=np.mean(times_warm_clean), color='orange', linestyle='--', alpha=0.5)\n", + "plt.axhline(y=np.mean(times_cold_clean), color='blue', linestyle='--', alpha=0.5)\n", + "plt.xlabel('Run Index')\n", + "plt.ylabel('Time (ms)')\n", + "plt.title(f'Cache Effect on {N_SMALL}x{N_SMALL} Matrix Multiplication\\n(first {discard} samples discarded)')\n", + "plt.legend()\n", + "plt.grid(True, alpha=0.3)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FAaH1cdBzX2j" + }, + "source": [ + "### Putting it all together\n", + "\n", + "We have now discovered that a robust benchmark requires:\n", + "\n", + "1. Device Synchronization\n", + "2. CUDA Events (to avoid CPU overhead)\n", + "3. Warmup Runs (to avoid initialization costs)\n", + "4. Multiple Samples (to handle variance)\n", + "5. Cache Flushing (to simulate VRAM access)\n", + "6. Median/Mean Aggregation (to ignore jitter)\n", + "\n", + "Writing this boilerplate every time is painful. We've packaged all these lessons into **KernelBench's timing module**, which provides multiple timing methods for different use cases. 
There are also other robust implementations available, such as Triton's `do_bench` [function](https://triton-lang.org/main/python-api/generated/triton.testing.do_bench.html).\n", + "\n", + "The default `cuda_event` method in KernelBench implements all of the above automatically, plus an additional insight: **`discard_first`** - discarding the first few trials after warmup, which often still have some initialization overhead." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2025-12-17T21:24:44.600123Z", + "iopub.status.busy": "2025-12-17T21:24:44.600004Z", + "iopub.status.idle": "2025-12-17T21:24:47.005899Z", + "shell.execute_reply": "2025-12-17T21:24:47.004654Z" + }, + "id": "3aVFtWt_zX2j", + "outputId": "6cf1e493-86ca-419e-a8e1-e19486814e09" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[Profiling] Using timing method: cuda_event\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 10, trials 100\n", + "KernelBench cuda_event time: 21.5000 ms\n" + ] + } + ], + "source": [ + "# Get the timing function - cuda_event is the default for trusted code\n", + "timing_fn = get_timing_function(\"cuda_event\")\n", + "\n", + "def final_benchmark(func, *args, num_trials=100):\n", + " \"\"\"Production-ready benchmarking using KernelBench's timing module.\"\"\"\n", + " elapsed_times = timing_fn(\n", + " kernel_fn=func,\n", + " args=list(args),\n", + " num_warmup=10,\n", + " num_trials=num_trials,\n", + " discard_first=1, # Discard first trial for consistency\n", + " verbose=False,\n", + " device=DEVICE\n", + " )\n", + " stats = get_timing_stats(elapsed_times, device=DEVICE)\n", + " return stats[\"mean\"]\n", + "\n", + "t = final_benchmark(simple_mm, a, b)\n", + "print(f\"KernelBench cuda_event time: {t:.4f} ms\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MsZrCYQRzX2j" + }, + 
"source": [ + "*Note: KernelBench also wraps Triton's `do_bench` if you prefer adaptive trial counts. See the timing methods comparison below for details.*\n", + "\n", + "---\n", + "\n", + "## KernelBench's Timing Methods Explained\n", + "\n", + "Now that we've built up a robust benchmarking harness from first principles, let's explore KernelBench's timing module in depth. We'll examine:\n", + "- **All 4 timing methods** and when to use each\n", + "- **The `discard_first` parameter** and why it improves measurement consistency\n", + "- **How `host_time` detects side-stream exploits** in untrusted code\n", + "\n", + "KernelBench's timing module provides **4 timing methods**, each designed for different use cases:\n", + "\n", + "| Method | Use Case | Catches Side-Streams | Cold Cache | Trial Control |\n", + "|--------|----------|---------------------|------------|---------------|\n", + "| `cuda_event` | Default, trusted code | No | Yes | Explicit |\n", + "| `host_time` | Untrusted code, agent evals | **Yes** | Yes | Explicit |\n", + "| `do_bench` | Triton-style / robust adaptive | No | Yes | Adaptive (time-budget) |\n", + "| `do_bench_impl` | do_bench implementation for inference and trial control | No | Yes | Explicit |\n", + "\n", + "### Method Details\n", + "\n", + "**`cuda_event`** (Default)\n", + "- Uses `torch.cuda.Event` for GPU-side timing\n", + "- Most accurate for pure kernel time measurement\n", + "- Clears L2 cache before each trial for cold-cache performance\n", + "- Use for trusted code where you control the kernel implementation\n", + "\n", + "**`host_time`** (For Untrusted Code)\n", + "- Uses **both** `time.perf_counter()` (host) and `torch.cuda.Event` (device) timing\n", + "- Compares the two: if they differ significantly, the CUDA event time is likely invalid (e.g., side-stream exploit)\n", + "- Falls back to host time when discrepancy detected, ensuring correctness\n", + "- Waits for ALL streams via `torch.cuda.synchronize()`\n", + "- **Essential 
for evaluating untrusted/agent-generated code**\n", + "\n", + "**`do_bench`** (Triton's Adaptive Benchmarking)\n", + "- Wraps Triton's `triton.testing.do_bench`\n", + "- Uses fixed time budgets: 25ms warmup, 100ms for repetitions\n", + "- Trial count is automatic based on kernel runtime\n", + "- **Note:** the `num_warmup`, `num_trials`, and `discard_first` parameters are ignored\n", + "\n", + "**`do_bench_impl`** (Transparent Implementation)\n", + "- Custom implementation mirroring Triton's `do_bench`\n", + "- Gives you explicit control over `num_warmup` and `num_trials`\n", + "- Useful when you need `do_bench`'s approach but with specific trial counts\n", + "\n", + "### Key Parameters\n", + "\n", + "All timing functions share a common interface:\n", + "\n", + "```python\n", + "timing_fn(\n", + " kernel_fn, # Function to time\n", + " args, # List of arguments to pass\n", + " num_warmup=3, # Warmup iterations before timing\n", + " num_trials=10, # Number of timing samples to collect\n", + " discard_first=1, # Drop first N trials after warmup\n", + " device=\"cuda:0\", # Explicit GPU device selection\n", + " verbose=True # Print per-trial timing info\n", + ") -> list[float] # Returns list of elapsed times in ms\n", + "```\n", + "\n", + "### Why `discard_first`?\n", + "\n", + "Even after warmup, the first few timing trials can be affected by:\n", + "- PyTorch's lazy tensor allocation finalizing\n", + "- cuDNN autotuning (still settling optimal algorithms)\n", + "- Driver state initialization\n", + "- First access to data structures\n", + "\n", + "Setting `discard_first=1` (the default) improves measurement consistency. 
We visualize this effect in the `discard_first` section, after comparing the timing methods." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Comparing All 4 Timing Methods\n", + "\n", + "Let's see how the different timing methods compare on the same kernel. Each method has trade-offs between precision, features, and overhead." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "execution": { + "iopub.execute_input": "2025-12-17T21:24:47.008751Z", + "iopub.status.busy": "2025-12-17T21:24:47.008456Z", + "iopub.status.idle": "2025-12-17T21:24:50.238519Z", + "shell.execute_reply": "2025-12-17T21:24:50.237366Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Comparing all KernelBench timing methods on 4096x4096 matmul:\n", + "======================================================================\n", + "\n", + "Testing cuda_event...\n", + "[Profiling] Using timing method: cuda_event\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 10, trials 50\n", + " cuda_event: 21.5000 ms (std=0.0572)\n", + "\n", + "Testing host_time...\n", + "[Profiling] Using timing method: host_time\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 10, trials 50\n", + " host_time: 21.5000 ms (std=0.0427)\n", + "\n", + "Testing do_bench...\n", + "[Profiling] Using timing method: do_bench\n", + " do_bench: 21.4000 ms (std=0.0030)\n", + "\n", + "Testing do_bench_impl...\n", + "[Profiling] Using timing method: do_bench_impl\n", + " do_bench_impl: 21.4000 ms (std=0.0315)\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/tmp/ipykernel_710987/1660294801.py:64: UserWarning: Attempting to set identical low and high ylims makes transformation singular; automatically expanding.\n", + " axes[1].set_ylim(min_val - margin, max_val + margin)\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Key insight: host_time is slightly slower due to CPU overhead,\n", + "but it catches ALL work on ALL streams - essential for untrusted code!\n" + ] + } + ], + "source": [ + "# Compare all KernelBench timing methods on 4096x4096 matmul\n", + "print(\"Comparing all KernelBench timing methods on 4096x4096 matmul:\")\n", + "print(\"=\" * 70)\n", + "\n", + "methods = [\"cuda_event\", \"host_time\", \"do_bench\", \"do_bench_impl\"]\n", + "results = {}\n", + "\n", + "for method in methods:\n", + " print(f\"\\nTesting {method}...\")\n", + " try:\n", + " method_fn = get_timing_function(method)\n", + " times = method_fn(\n", + " simple_mm, \n", + " [a, b], \n", + " num_warmup=10, \n", + " num_trials=50, \n", + " verbose=False,\n", + " device=DEVICE\n", + " )\n", + " results[method] = get_timing_stats(times, device=DEVICE)\n", + " print(f\" {method}: {results[method]['mean']:.4f} ms (std={results[method]['std']:.4f})\")\n", + " except Exception as e:\n", + " print(f\" {method}: Skipped due to {type(e).__name__} (Triton version compatibility)\")\n", + " # Remove from list if it failed\n", + " methods = [m for m in methods if m in results]\n", + "\n", + "# Only plot if we have results\n", + "if results:\n", + " # Visualize the comparison\n", + " fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", + "\n", + " # Bar chart of mean times\n", + " available_methods = [m for m in methods if m in results]\n", + " means = [results[m]['mean'] for m in available_methods]\n", + " stds = [results[m]['std'] for m in available_methods]\n", + " colors = ['#2ecc71', '#e74c3c', '#3498db', '#9b59b6'][:len(available_methods)]\n", + "\n", + " axes[0].bar(available_methods, means, yerr=stds, capsize=5, color=colors, alpha=0.8)\n", + " axes[0].set_ylabel('Time (ms)')\n", + " axes[0].set_title('Mean Execution Time by Method\\n(graph truncated for 
readability)')\n", + "    axes[0].tick_params(axis='x', rotation=45)\n", + "\n", + "    # Truncate y-axis to make differences easier to see\n", + "    min_mean = min(means)\n", + "    max_mean = max(means)\n", + "    margin = max((max_mean - min_mean) * 2, 1e-3)  # Floor avoids identical y-limits when all means are equal\n", + "    axes[0].set_ylim(min_mean - margin, max_mean + margin)\n", + "\n", + "    # Highlight cuda_event vs host_time with truncated y-axis for readability\n", + "    if 'cuda_event' in results and 'host_time' in results:\n", + "        cuda_mean = results['cuda_event']['mean']\n", + "        host_mean = results['host_time']['mean']\n", + "        \n", + "        axes[1].bar(['cuda_event', 'host_time'], \n", + "                    [cuda_mean, host_mean], \n", + "                    color=['#2ecc71', '#e74c3c'], alpha=0.8)\n", + "        axes[1].set_ylabel('Time (ms)')\n", + "        axes[1].set_title('cuda_event vs host_time\\n(host_time catches side-streams)\\n(graph truncated for readability)')\n", + "        \n", + "        # Truncate y-axis to make the difference easier to see\n", + "        min_val = min(cuda_mean, host_mean)\n", + "        max_val = max(cuda_mean, host_mean)\n", + "        # Add margin around the data; the floor avoids identical y-limits when both means are equal\n", + "        margin = max((max_val - min_val) * 2, 1e-3)\n", + "        axes[1].set_ylim(min_val - margin, max_val + margin)\n", + "    else:\n", + "        axes[1].text(0.5, 0.5, 'Comparison unavailable', ha='center', va='center')\n", + "        axes[1].set_axis_off()\n", + "\n", + "    plt.tight_layout()\n", + "    plt.show()\n", + "\n", + "print(\"\\nKey insight: host_time is slightly slower due to CPU overhead,\")\n", + "print(\"but it catches ALL work on ALL streams - essential for untrusted code!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The `discard_first` Effect\n", + "\n", + "Even after warmup, the first timing trial can be affected by lazy initialization. Let's see this in action."
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "execution": { + "iopub.execute_input": "2025-12-17T21:24:50.241069Z", + "iopub.status.busy": "2025-12-17T21:24:50.240945Z", + "iopub.status.idle": "2025-12-17T21:24:50.348421Z", + "shell.execute_reply": "2025-12-17T21:24:50.347364Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Demonstrating the discard_first effect:\n", + "============================================================\n", + "[Profiling] Using timing method: cuda_event\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 3, trials 15\n", + "\n", + "First trial: 0.3660 ms\n", + "Mean of all trials: 0.3455 ms\n", + "Mean without first: 0.3440 ms\n", + "First trial overhead: 6.4%\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "The first trial often shows initialization overhead even after warmup.\n", + "Using discard_first=1 (default) gives more consistent measurements.\n" + ] + } + ], + "source": [ + "# Demonstrate the discard_first effect\n", + "# Even after warmup, the first timing trial can have higher overhead\n", + "\n", + "print(\"Demonstrating the discard_first effect:\")\n", + "print(\"=\" * 60)\n", + "\n", + "# Create fresh data and clear caches to make initialization overhead more visible\n", + "torch.cuda.empty_cache()\n", + "a_fresh, b_fresh = get_data(2048)\n", + "\n", + "# Collect trials with discard_first=0 to see ALL trials including the first one\n", + "timing_fn = get_timing_function(\"cuda_event\")\n", + "times_all = timing_fn(\n", + " simple_mm, [a_fresh, b_fresh],\n", + " num_warmup=3,\n", + " num_trials=15,\n", + " discard_first=0, # Keep ALL trials including first\n", + " verbose=False,\n", + " device=DEVICE\n", + ")\n", + "\n", + "# Calculate statistics\n", + "first_trial = times_all[0]\n", + "remaining_trials = times_all[1:]\n", + "mean_all = np.mean(times_all)\n", + "mean_remaining = np.mean(remaining_trials)\n", + "\n", + "print(f\"\\nFirst trial: {first_trial:.4f} ms\")\n", + "print(f\"Mean of all trials: {mean_all:.4f} ms\")\n", + "print(f\"Mean without first: {mean_remaining:.4f} ms\")\n", + "print(f\"First trial overhead: {((first_trial / mean_remaining) - 1) * 100:.1f}%\")\n", + "\n", + "# Visualize the effect with a scatter plot\n", + "plt.figure(figsize=(10, 5))\n", + "plt.scatter(range(len(times_all)), times_all, alpha=0.7, color='blue', s=60)\n", + "plt.scatter([0], [first_trial], color='red', s=100, zorder=5, label=f'First trial: {first_trial:.3f}ms')\n", + "plt.axhline(y=mean_remaining, color='green', linestyle='--', alpha=0.7, \n", + " label=f'Mean (without first): {mean_remaining:.3f}ms')\n", + 
"plt.axhline(y=mean_all, color='orange', linestyle=':', alpha=0.7,\n", + " label=f'Mean (all): {mean_all:.3f}ms')\n", + "plt.xlabel('Trial Index')\n", + "plt.ylabel('Time (ms)')\n", + "plt.title('First Trial Overhead Effect (after warmup)')\n", + "plt.legend(loc='upper right')\n", + "plt.grid(True, alpha=0.3)\n", + "plt.show()\n", + "\n", + "print(\"\\nThe first trial often shows initialization overhead even after warmup.\")\n", + "print(\"Using discard_first=1 (default) gives more consistent measurements.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HwsjlhAazX2j" + }, + "source": [ + "## The \"Agent\" Trap: Reward Hacking via Hidden Streams\n", + "\n", + "When evaluating LLM-generated kernels (like with [Kernel Bench](https://github.com/ScalingIntelligence/KernelBench)), you're not just fighting measurement noise—you're fighting an optimizer that may inadvertently discover exploits in your harness.\n", + "\n", + "One such exploit: launching work on a **side stream** to make the kernel appear instantaneous." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2025-12-17T21:24:50.351000Z", + "iopub.status.busy": "2025-12-17T21:24:50.350876Z", + "iopub.status.idle": "2025-12-17T21:24:52.778917Z", + "shell.execute_reply": "2025-12-17T21:24:52.777685Z" + }, + "id": "UuwtML39zX2j", + "outputId": "95ebbb26-e415-491a-da30-6a78ce387906" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 10, trials 100\n", + "Standard benchmark on tricky kernel: 0.1880 ms\n" + ] + } + ], + "source": [ + "def tricky_agent_kernel(a, b):\n", + " \"\"\"A 'clever' kernel that games the benchmarking harness.\"\"\"\n", + " # The agent creates a new stream to \"optimize\"\n", + " s = torch.cuda.Stream()\n", + " with torch.cuda.stream(s):\n", + " # This work happens on a side channel!\n", + " result = torch.matmul(a, b)\n", + " return result\n", + "\n", + "print(f\"Standard benchmark on tricky kernel: {final_benchmark(tricky_agent_kernel, a, b):.4f} ms\")\n", + "# Likely reports ~0.00ms or very close to it!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3HXns_XizX2j" + }, + "source": [ + "**The Issue:**\n", + "Standard benchmarking tools (including `do_bench`) record events on the *current default stream*.\n", + "\n", + "1. Benchmark starts timer on Stream A (the default stream).\n", + "2. Agent launches work on Stream B and returns immediately.\n", + "3. Benchmark stops timer on Stream A.\n", + "\n", + "Since Stream A had no work, the timer reports `~0.00ms`, while Stream B is still churning away in the background.\n", + "\n", + "**Why this matters for evals:**\n", + "If your reward signal is \"lower time = better score,\" an agent that discovers this pattern will be rewarded for producing *broken* code. 
The kernel \"runs\" instantly because you never measured it at all.\n", + "\n", + "**Mitigations:**\n", + "- **Wall-clock + full device sync:** Trade precision for correctness (catches all streams, but includes CPU overhead)\n", + "- **Static analysis:** Reject submissions that create `torch.cuda.Stream()` objects\n", + "- **Manual inspection:** For high-stakes evals, benchmark kernels in isolation outside the automated harness\n", + "\n", + "### How KernelBench Addresses This\n", + "\n", + "KernelBench's timing module provides the **`host_time`** method, designed specifically for evaluating untrusted code:\n", + "\n", + "**Use `torch.cuda.synchronize()`** before AND after timing. This waits for ALL streams on the device, not just the default stream.\n", + "\n", + "```python\n", + "# For trusted code (faster, but can be fooled)\n", + "timing_fn = get_timing_function(\"cuda_event\")\n", + "\n", + "# For untrusted/agent code (catches side-streams)\n", + "timing_fn = get_timing_function(\"host_time\")\n", + "```\n", + "\n", + "The trade-off: `host_time` includes some CPU overhead in the measurement. For a well-behaved kernel, however, `host_time` should be close to sync_time. If the two agree to within a few percent, you can be reasonably confident that the kernel is not hiding work on another stream, and you can score using sync_time."
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2025-12-17T21:24:52.782281Z", + "iopub.status.busy": "2025-12-17T21:24:52.782061Z", + "iopub.status.idle": "2025-12-17T21:24:52.830292Z", + "shell.execute_reply": "2025-12-17T21:24:52.829161Z" + }, + "id": "KbAFqiyizX2j", + "outputId": "6bc91db4-935c-4af9-bdc9-be3c50e890c4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Robust benchmark on tricky kernel: 21.6298 ms\n", + "Robust benchmark on normal kernel: 21.5420 ms\n" + ] + } + ], + "source": [ + "def benchmark_untrusted(func, *args):\n", + " \"\"\"Benchmark untrusted code by using wall-clock time with full device sync.\n", + "\n", + " This trades some precision (includes CPU overhead) for correctness\n", + " (catches work on any stream).\n", + " \"\"\"\n", + " torch.cuda.synchronize() # Clear any pending work\n", + " start = time.perf_counter()\n", + " func(*args)\n", + " torch.cuda.synchronize() # Wait for ALL streams\n", + " end = time.perf_counter()\n", + " return (end - start) * 1000\n", + "\n", + "print(f\"Robust benchmark on tricky kernel: {benchmark_untrusted(tricky_agent_kernel, a, b):.4f} ms\")\n", + "print(f\"Robust benchmark on normal kernel: {benchmark_untrusted(simple_mm, a, b):.4f} ms\")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "execution": { + "iopub.execute_input": "2025-12-17T21:24:52.832854Z", + "iopub.status.busy": "2025-12-17T21:24:52.832734Z", + "iopub.status.idle": "2025-12-17T21:24:53.846639Z", + "shell.execute_reply": "2025-12-17T21:24:53.845578Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Side-Stream Detection Experiment:\n", + "============================================================\n", + "[Profiling] Using timing method: cuda_event\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, 
warm up 3, trials 10\n", + "[Profiling] Using timing method: host_time\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 3, trials 10\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 3, trials 10\n", + "\n", + "Tricky kernel with cuda_event: 0.4400 ms (FOOLED!)\n", + "Tricky kernel with host_time: 21.8000 ms (CORRECT)\n", + "Normal kernel with host_time: 21.5000 ms (reference)\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Key insight: host_time correctly measures the tricky kernel!\n", + "Use host_time for evaluating untrusted/agent-generated code.\n" + ] + } + ], + "source": [ + "# Side-Stream Detection with KernelBench's host_time\n", + "# Let's demonstrate how host_time catches the tricky kernel\n", + "\n", + "print(\"Side-Stream Detection Experiment:\")\n", + "print(\"=\" * 60)\n", + "\n", + "# cuda_event (can be fooled by side-streams)\n", + "cuda_timing = get_timing_function(\"cuda_event\")\n", + "cuda_times = cuda_timing(tricky_agent_kernel, [a, b], num_trials=10, verbose=False, device=DEVICE)\n", + "cuda_stats = get_timing_stats(cuda_times, device=DEVICE)\n", + "\n", + "# host_time (catches all streams)\n", + "host_timing = get_timing_function(\"host_time\")\n", + "host_times = host_timing(tricky_agent_kernel, [a, b], num_trials=10, verbose=False, device=DEVICE)\n", + "host_stats = get_timing_stats(host_times, device=DEVICE)\n", + "\n", + "# Normal kernel for reference\n", + "normal_times = host_timing(simple_mm, [a, b], num_trials=10, verbose=False, device=DEVICE)\n", + "normal_stats = get_timing_stats(normal_times, device=DEVICE)\n", + "\n", + "print(f\"\\nTricky kernel with cuda_event: {cuda_stats['mean']:.4f} ms (FOOLED!)\")\n", + "print(f\"Tricky kernel with host_time: {host_stats['mean']:.4f} ms (CORRECT)\")\n", + "print(f\"Normal kernel with host_time: {normal_stats['mean']:.4f} ms (reference)\")\n", + "\n", + "# Visualize the dramatic difference\n", + "plt.figure(figsize=(10, 5))\n", + "methods = ['cuda_event\\n(fooled)', 'host_time\\n(correct)', 'Normal kernel\\n(reference)']\n", + "times = [cuda_stats['mean'], host_stats['mean'], normal_stats['mean']]\n", + "colors = ['red', 'green', 'blue']\n", + "\n", + "plt.bar(methods, times, color=colors, alpha=0.8)\n", + "plt.ylabel('Time (ms)')\n", + "plt.title('Side-Stream Detection: 
cuda_event vs host_time')\n", + "plt.grid(True, alpha=0.3, axis='y')\n", + "\n", + "# Add annotation\n", + "plt.annotate('Agent trick detected!', xy=(1, host_stats['mean']), \n", + " xytext=(1.3, host_stats['mean'] * 0.7),\n", + " arrowprops=dict(arrowstyle='->', color='green'),\n", + " fontsize=10, color='green')\n", + "plt.tight_layout()\n", + "plt.show()\n", + "\n", + "print(\"\\nKey insight: host_time correctly measures the tricky kernel!\")\n", + "print(\"Use host_time for evaluating untrusted/agent-generated code.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Uq4qvl8FzX2j" + }, + "source": [ + "## Correctness Before Speed\n", + "\n", + "A kernel that runs in 0.1ms but produces garbage is worthless. Before you start optimizing, **always verify correctness** against a reference implementation." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2025-12-17T21:24:53.849299Z", + "iopub.status.busy": "2025-12-17T21:24:53.849171Z", + "iopub.status.idle": "2025-12-17T21:24:53.929702Z", + "shell.execute_reply": "2025-12-17T21:24:53.928700Z" + }, + "id": "J9W63Q5czX2k", + "outputId": "312076ee-f089-4276-8b8f-bb7a23575d3d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✓ Correctness verified!\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 10, trials 100\n", + "Kernel time: 0.0648 ms\n" + ] + } + ], + "source": [ + "def my_experimental_kernel(a, b):\n", + " \"\"\"Pretend this is our custom optimized kernel.\"\"\"\n", + " return torch.matmul(a, b) # In reality, this would be your Triton/CUDA code\n", + "\n", + "def verify_correctness(kernel_fn, ref_fn, *args, atol=1e-2, rtol=1e-2):\n", + " \"\"\"Verify kernel produces correct output before benchmarking.\"\"\"\n", + " ref_output = ref_fn(*args)\n", + " kernel_output = kernel_fn(*args)\n", + "\n", + " if not 
torch.allclose(ref_output, kernel_output, atol=atol, rtol=rtol):\n", + "        max_diff = (ref_output - kernel_output).abs().max().item()\n", + "        raise AssertionError(\n", + "            f\"Kernel output doesn't match reference! \"\n", + "            f\"Max difference: {max_diff:.6f}\"\n", + "        )\n", + "    print(\"✓ Correctness verified!\")\n", + "    return True\n", + "\n", + "# Always verify before benchmarking\n", + "a_test, b_test = get_data(1024)\n", + "verify_correctness(my_experimental_kernel, simple_mm, a_test, b_test)\n", + "\n", + "# Only benchmark if correct\n", + "time_ms = final_benchmark(my_experimental_kernel, a_test, b_test)\n", + "print(f\"Kernel time: {time_ms:.4f} ms\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Computing TFLOPS: Are We Hitting the Speed of Light?\n", + "\n", + "It is also important to sanity-check the results, both to see how well we are doing and to confirm that the numbers are physically possible.\n", + "\n", + "To do this, we convert our millisecond timings into **TFLOPS** (Tera Floating-Point Operations Per Second) and compare against the hardware's theoretical maximum, often called the **\"speed of light\"** or **roofline**.\n", + "\n", + "### Understanding Roofline Analysis\n", + "\n", + "The Roofline Model helps you understand whether your kernel is:\n", + "- **Compute-bound**: Limited by the GPU's arithmetic throughput (FLOPS)\n", + "- **Memory-bound**: Limited by memory bandwidth (GB/s)\n", + "\n", + "**Key formulas:**\n", + "- **Arithmetic Intensity** = FLOPs / Bytes accessed\n", + "- **Theoretical Peak FLOPS** = Clock speed × Cores × FLOPs/cycle\n", + "- **Theoretical Peak Bandwidth** = Memory clock × Bus width × 2 (for DDR)\n", + "\n", + "For matrix multiplication of two $N \\times N$ matrices:\n", + "- **FLOPs** = $2N^3$ (each of the $N^2$ output elements needs $N$ multiplies and $N$ adds)\n", + "- **Bytes** = $3N^2 \\times \\text{sizeof(dtype)}$ (read A, read B, write C)\n", + "- **Arithmetic Intensity** = 
$\\frac{2N^3}{3N^2 \\times 4} = \\frac{N}{6}$ for float32\n", + "\n", + "Large matrix multiplications are highly compute-bound (high arithmetic intensity), so we expect to approach the compute roofline. For a deeper dive into roofline analysis and speed-of-light calculations, see the excellent [JAX Scaling Book chapter on Roofline](https://jax-ml.github.io/scaling-book/roofline/)." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "execution": { + "iopub.execute_input": "2025-12-17T21:24:53.932346Z", + "iopub.status.busy": "2025-12-17T21:24:53.932227Z", + "iopub.status.idle": "2025-12-17T21:24:56.741833Z", + "shell.execute_reply": "2025-12-17T21:24:56.740706Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Matrix Multiplication Performance\n", + "=================================================================\n", + "Size Time (ms) TFLOPS % of TF32 Peak \n", + "-----------------------------------------------------------------\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 10, trials 100\n", + "1024 0.0648 33.14 3.4 %\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 10, trials 100\n", + "2048 0.3440 49.94 5.0 %\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 10, trials 100\n", + "4096 2.6800 51.28 5.2 %\n", + "[Profiling] Using device: cuda:0 NVIDIA H200, warm up 10, trials 100\n", + "8192 21.4000 51.38 5.2 %\n", + "\n", + "Note: PyTorch uses TF32 tensor cores by default on Ampere+ GPUs.\n", + "H200 TF32 theoretical peak: 989.0 TFLOPS\n", + "\n", + "For roofline analysis details, see: https://jax-ml.github.io/scaling-book/roofline/\n" + ] + } + ], + "source": [ + "def get_tflops(n, time_ms):\n", + " \"\"\"Calculate achieved TFLOPS for matrix multiplication.\"\"\"\n", + " flops = 2 * n ** 3 # Multiply-add for each of N^2 output elements\n", + " tflops = flops / (time_ms * 1e-3) / 1e12\n", + " return tflops\n", + "\n", + "# Theoretical peaks vary by GPU and 
precision\n", + "# Float32 matmuls run on TF32 tensor cores on Ampere+ GPUs only when TF32 is enabled\n", + "# (torch.backends.cuda.matmul.allow_tf32 defaults to False since PyTorch 1.12)\n", + "# Note: the H100/H200 tf32 and fp16 entries match NVIDIA's datasheet figures with\n", + "# structured sparsity; dense peaks are roughly half. The A100 entries are dense peaks.\n", + "GPU_PEAK_TFLOPS = {\n", + "    'A100': {'fp32': 19.5, 'tf32': 156.0, 'fp16': 312.0},\n", + "    'H100': {'fp32': 67.0, 'tf32': 989.0, 'fp16': 1979.0},\n", + "    'H200': {'fp32': 67.0, 'tf32': 989.0, 'fp16': 1979.0},\n", + "}\n", + "\n", + "# Compare against the TF32 peak (assumes TF32 is enabled for float32 matmul)\n", + "PEAK_TFLOPS = 989.0 # H200 TF32 peak\n", + "\n", + "# Benchmark at different sizes\n", + "print(\"Matrix Multiplication Performance\")\n", + "print(\"=\" * 65)\n", + "print(f\"{'Size':<8} {'Time (ms)':<12} {'TFLOPS':<12} {'% of TF32 Peak':<15}\")\n", + "print(\"-\" * 65)\n", + "\n", + "for size in [1024, 2048, 4096, 8192]:\n", + "    a_test, b_test = get_data(size)\n", + "    time_ms = final_benchmark(simple_mm, a_test, b_test)\n", + "    tflops = get_tflops(size, time_ms)\n", + "    efficiency = (tflops / PEAK_TFLOPS) * 100\n", + "    print(f\"{size:<8} {time_ms:<12.4f} {tflops:<12.2f} {efficiency:<15.1f}%\")\n", + "\n", + "print(f\"\\nNote: this comparison assumes TF32 tensor cores are enabled for float32 matmul.\")\n", + "print(f\"H200 TF32 theoretical peak: {PEAK_TFLOPS} TFLOPS\")\n", + "print(f\"\\nFor roofline analysis details, see: https://jax-ml.github.io/scaling-book/roofline/\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zcYVXCkUzX2k" + }, + "source": [ + "## Conclusion\n", + "\n", + "Benchmarking on GPUs is fundamentally different from CPUs. The asynchronous nature of kernel launches, the hidden state of the L2 cache, and the noise of the OS scheduler all conspire to give you the wrong numbers.\n", + "\n", + "### What We Learned\n", + "\n", + "Through our journey, we discovered that robust GPU benchmarking requires:\n", + "1. **Device Synchronization** - Wait for GPU work to complete\n", + "2. **CUDA Events** - Use GPU-side timestamps, not CPU clocks\n", + "3. **Warmup Runs** - Settle compilation and memory allocators\n", + "4. 
**Multiple Samples** - Build statistical distributions\n", + "5. **L2 Cache Flushing** - Measure cold cache (realistic) performance\n", + "6. **Median Aggregation** - Filter out OS jitter and outliers\n", + "7. **Side-Stream Detection** - Catch work on non-default streams\n", + "\n", + "### What KernelBench Provides\n", + "\n", + "We've implemented all these best practices in **KernelBench's timing module** (`src/timing.py`):\n", + "\n", + "| Function | Purpose |\n", + "|----------|---------|\n", + "| `get_timing_function(method)` | Factory returning timing function by name |\n", + "| `clear_l2_cache(device)` | L2 cache flushing utility |\n", + "| `get_timing_stats(times)` | Statistical aggregation (mean, std, min, max) |\n", + "\n", + "**Four timing methods for different use cases:**\n", + "- **`cuda_event`** - Default for trusted code (fastest, GPU-side timing)\n", + "- **`host_time`** - For untrusted/agent code (catches all streams)\n", + "- **`do_bench`** - Triton-style adaptive trial counts\n", + "- **`do_bench_impl`** - Transparent do_bench with explicit control\n", + "\n", + "**Key parameters:**\n", + "- `num_warmup`, `num_trials`, `discard_first`, `device`, `verbose`\n", + "\n", + "### Recommended Usage\n", + "\n", + "```python\n", + "from src.timing import get_timing_function, get_timing_stats\n", + "\n", + "# For trusted code\n", + "timing_fn = get_timing_function(\"cuda_event\")\n", + "\n", + "# For agent evaluations (catches side-streams)\n", + "timing_fn = get_timing_function(\"host_time\")\n", + "\n", + "# Run benchmark\n", + "times = timing_fn(kernel, args, num_warmup=10, num_trials=100, device=\"cuda:0\")\n", + "stats = get_timing_stats(times, device=\"cuda:0\")\n", + "print(f\"Mean: {stats['mean']:.4f}ms, Std: {stats['std']:.4f}ms\")\n", + "```\n", + "\n", + "Happy optimizing!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7Ah151CHzX2k" + }, + "source": [ + "---\n", + "\n", + "### Footnotes\n", + "\n", + "**On GPU Clock States:** For highly reproducible benchmarks (e.g., publishing papers), consider locking GPU clocks with `nvidia-smi -lgc `. GPUs dynamically adjust clock speeds based on thermals and power, which can introduce variance between runs. For most development work, median-based benchmarking handles this adequately.\n", + "\n", + "**On Warmup Iterations:** We use fixed warmup counts (10-50 iterations) for simplicity, but this can be insufficient or wasteful depending on the kernel. In extremely sensitive environments, you can implement an adaptive stopping criterion: run warmup iterations until the variance of recent samples falls below a threshold, indicating the system has stabilized. This is covered in more detail in the [GPU MODE lecture](https://www.youtube.com/watch?v=1i7dxoAfKOU).\n", + "\n", + "**On Bare Metal vs. Virtualized Environments:** Cloud VMs and containers add layers of abstraction that can introduce variance and overhead. GPU passthrough in virtualized environments adds latency, and shared cloud instances suffer from \"noisy neighbor\" effects where other tenants' workloads impact your measurements. For publishable results or when chasing small performance deltas, prefer bare metal. For day-to-day development, cloud instances are fine as long as you're aware your numbers may not match others exactly." 
+ ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "A100", + "include_colab_link": true, + "provenance": [] + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.9" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}