Hi, I’m Turner Jabbour. I’ve been a software engineer for ~6 years, primarily working in Node and React. Around September 2025, I became deeply interested in GPU programming, ROCm, and the broader world of low-level performance engineering.
This repository is my space to learn in public as I delve into GPU kernel engineering and inference systems work.
There are three important directories:
- kernels - I explore different kernels and include a writeup of what I learned and how it relates to inference.
- papers - I summarize and discuss different papers.
- topics - I dive deep into a specific topic.
My long-term goal is to build strong competency in HIP, Triton, and AMD’s GPU software stack, with a focus on high-performance inference.
I'm currently revisiting my previous kernels, profiling them, and creating profiling writeups; right now I'm working on my block-level reduction.
I just finished the writeup for my halving reduction.
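Since these reductions come up repeatedly here, a minimal sketch of the halving (tree-reduction) pattern in HIP may help. This is illustrative of the technique, not the exact kernel from this repo; the kernel name, the 256-thread block size, and the float element type are assumptions for the example.

```cpp
#include <hip/hip_runtime.h>

// Halving (tree) reduction within one block: on each iteration, the lower
// half of the active threads adds the upper half's partials in LDS, so the
// active count shrinks 256 -> 128 -> ... -> 1.
// Assumes the kernel is launched with 256 threads per block.
__global__ void block_reduce_sum(const float* in, float* out, int n) {
    __shared__ float lds[256];               // one partial per thread
    const int tid = threadIdx.x;
    const int gid = blockIdx.x * blockDim.x + tid;

    lds[tid] = (gid < n) ? in[gid] : 0.0f;   // pad the tail with zeros
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            lds[tid] += lds[tid + stride];
        __syncthreads();                     // every thread hits the barrier
    }

    if (tid == 0)
        out[blockIdx.x] = lds[0];            // one partial sum per block
}
```

Each launch leaves one partial per block in `out`, so a full sum needs a second pass (or atomics) over those partials.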
Here is everything I've worked on so far, in order:
- RCCL - AMD’s collectives library for multi-GPU communication (AllReduce, AllGather, ReduceScatter, etc.), used heavily in distributed inference (see the AllReduce sketch after this list).
- Triton - a higher-level DSL for writing high-performance kernels, increasingly used in modern inference work (FlashAttention, fused ops, reductions).
- GPU architecture - wavefronts, SIMDs, LDS, VGPRs, vectorized memory access, latency hiding, and the AMDGCN compiler toolchain (a vectorized-load sketch also follows the list).
- Inference systems - PagedAttention, KV-cache management, continuous batching, speculative decoding, and multi-GPU parallelism.
- Profiling - rocprof, occupancy analysis, register pressure.
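For the RCCL entry above, here's a hedged sketch of a single-process AllReduce across two GPUs. RCCL keeps NCCL's API names, but the header path, device count, and buffer size here are assumptions, and error checking is omitted for brevity.

```cpp
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>   // header path may be <rccl.h> depending on install

int main() {
    const int nDev = 2;              // assumes two visible GPUs
    const size_t count = 1 << 20;    // elements per device (arbitrary)
    int devs[nDev] = {0, 1};

    ncclComm_t comms[nDev];
    float* sendbuff[nDev];
    float* recvbuff[nDev];
    hipStream_t streams[nDev];

    // One communicator per device, all owned by this process.
    ncclCommInitAll(comms, nDev, devs);

    for (int i = 0; i < nDev; ++i) {
        hipSetDevice(i);
        hipMalloc(&sendbuff[i], count * sizeof(float));  // left uninitialized;
        hipMalloc(&recvbuff[i], count * sizeof(float));  // a real program fills sendbuff
        hipStreamCreate(&streams[i]);
    }

    // Group the calls so RCCL can launch them together without deadlocking.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuff[i], recvbuff[i], count,
                      ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        hipSetDevice(i);
        hipStreamSynchronize(streams[i]);  // wait for the reduction
        hipFree(sendbuff[i]);
        hipFree(recvbuff[i]);
        hipStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

And for the vectorized-memory-access point under GPU architecture, a tiny sketch of a `float4` copy; the kernel name is made up, and it assumes the element count is a multiple of four.

```cpp
#include <hip/hip_runtime.h>

// Each thread moves 16 bytes at a time via float4, which the AMDGCN
// backend can lower to single dwordx4 loads/stores.
// Assumes n is a multiple of 4 (n4 = n / 4); a real kernel handles the tail.
__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, int n4) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        out[i] = in[i];
}
```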