This repository contains a curated set of concurrent programming problems designed to help you master concurrent patterns across Python, C++, and Go, as well as GPU programming with Triton/CUDA.
Alongside the problems, it ships comprehensive guides and tutorials on concurrent and GPU programming, with the documentation available in multiple formats.
The best viewing experience is via the built HTML documentation:
```bash
cd docs
make html
# Then open build/html/index.html in your browser
```

This provides full navigation and proper formatting for tables, code blocks, and diagrams.
Learn fundamental concurrent programming concepts with Python threading, asyncio, and multiprocessing.
| Topic | Focus | Level |
|---|---|---|
| Key Concepts | Concurrency vs parallelism, GIL, event loops, coroutines | Beginner |
| Hardware Parallelism | Multi-core architecture, NUMA, cache hierarchy | Intermediate |
| Threading Basics | Thread creation, management, lifecycle | Beginner |
| Asyncio Event Loop | Event-driven programming, async I/O patterns | Intermediate |
| Asyncio Coroutines | Coroutine definition, async/await syntax | Intermediate |
| Asyncio and Futures | Future-based concurrency, callbacks | Intermediate |
| Asyncio Task Manipulation | Task creation, management, cancellation | Advanced |
| Concurrent Futures | ThreadPoolExecutor, ProcessPoolExecutor, futures | Intermediate |
| Queue Explained | Queue data structure, thread-safe operations | Intermediate |
| Queue Internals | Queue implementation, condition variables | Advanced |
| Task Done Queue | Task synchronization, done signaling | Intermediate |
| RLock Explained | Reentrant locks, recursive synchronization | Intermediate |
| Semaphore Explained | Semaphore patterns, resource counting | Intermediate |
| Patterns & Problems | Common patterns, deadlock, race conditions | Advanced |
Master GPU architecture, optimization techniques, and Triton programming.
| Topic | Focus | Level |
|---|---|---|
| GPU Fundamentals | GPU architecture, threads, warps, blocks | Beginner |
| Memory Hierarchy | Global, shared, local memory; cache; bandwidth | Intermediate |
| Execution Model | Thread execution, synchronization, atomics | Advanced |
| Performance Optimization | Memory access patterns, occupancy, roofline | Advanced |
| Triton Concepts | Kernel launching, grid specification, constexpr | Intermediate |
Progressive tutorials from basic to advanced GPU kernel programming.
| # | Topic | Concepts | Level |
|---|---|---|---|
| 1 | Vector Addition | Basic kernel, grid launch, masking | Beginner |
| 2 | Fused Softmax | Row-wise reductions, numerical stability | Intermediate |
| 3 | Matrix Multiplication | Block tiling, memory hierarchy, cache | Intermediate |
| 4 | Low Memory Dropout | Memory efficiency, masking patterns | Intermediate |
| 5 | Layer Normalization | Normalization kernels, fusion | Intermediate |
| 6 | Fused Attention | Complex kernel fusion, online softmax | Advanced |
| 7 | Extern Functions | CUDA integration, custom ops | Advanced |
| 8 | Grouped GEMM | Batched operations, dynamic shapes | Advanced |
| 9 | Persistent MatMul | Persistent kernel pattern, work distribution | Advanced |
| 10 | Block Scaled MatMul | Block-wise scaling, quantization-aware | Advanced |
Understanding Triton's compilation pipeline and optimization.
| Topic | Focus | Level |
|---|---|---|
| Compiler Overview | Pipeline stages, IR representations | Intermediate |
| JIT Decorator | Specialization, caching, kernel variants | Advanced |
| Compilation Pipeline | Python→Triton IR→MLIR→LLVM→PTX→Cubin | Advanced |
| CUDA Comparison | Triton vs CUDA: code, compilation, optimization | Intermediate |
| MLIR Concepts | MLIR dialects, transformations, lowering | Advanced |
Structured learning paths and reference materials.
| Resource | Content |
|---|---|
| Learning Paths | Beginner → Intermediate → Advanced progressions |
| Troubleshooting | Common issues, debugging, performance problems |
| References | External links, official documentation, papers |
```bash
pip install sphinx sphinx-rtd-theme
```

```bash
cd docs
make html

open build/html/index.html      # macOS
# or
xdg-open build/html/index.html  # Linux
# or
start build/html/index.html     # Windows
```

```bash
# PDF (requires LaTeX)
make latexpdf
# Man pages
make man
# EPUB
make epub
```

- Mutual Exclusion (Locks): Prevent concurrent access to shared resources (see the sketch after this list)
- Condition Variables: Wait/notify synchronization between threads
- Producer-Consumer: Data flow between threads with bounded queues
- Thread Pools: Reusable worker threads for task execution
- Async/Await: Efficient I/O-bound concurrency without explicit threading
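As a taste of the first pattern above, here is a minimal mutual-exclusion sketch in Python. It mirrors the thread-safe counter problem but is only an illustration, not the repository's reference solution:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        with lock:          # only one thread mutates the counter at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; typically less without it
```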
- SPMD Model: Single Program, Multiple Data execution
- Thread Organization: Threads → Warps → Blocks → Grids
- Memory Hierarchy: Global → Shared → Local memory with different speeds
- Synchronization: `__syncthreads()` for block-level, atomic ops for device-level
- Occupancy: Ratio of active warps to maximum possible warps (worked example below)
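As a quick worked example of the occupancy ratio just defined (the numbers below are illustrative, not taken from any particular GPU in the docs):

```python
# Occupancy = active warps per SM / maximum resident warps per SM.
# Illustrative numbers only: suppose register pressure limits an SM that
# could host 64 resident warps to 48 active warps.
max_warps_per_sm = 64
active_warps = 48
occupancy = active_warps / max_warps_per_sm
print(f"occupancy = {occupancy:.0%}")  # 75%
```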
- @triton.jit: JIT decorator for GPU kernel compilation
- Grid Specification: `lambda meta: (cdiv(N, BLOCK_SIZE),)` for 1D grids
- tl.constexpr: Compile-time constants enabling kernel specialization
- Block Tiling: Processing data in blocks for cache efficiency
- Masking: Safe handling of boundary conditions with the `mask` parameter (all of these appear in the kernel sketch below)
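To see how these pieces fit together, here is a minimal vector-addition kernel in the style of the first GPU tutorial; treat it as an illustrative sketch rather than the tutorial's exact code:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                            # guard the ragged tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)  # 1D grid specification
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```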
- Read: Key Concepts
- Study: Threading Basics
- Practice: See the `/python` directory for practice problems
- Read: GPU Fundamentals
- Study: Vector Addition Tutorial
- Practice: Work through tutorials 1-5, then tackle 6+
- Read: Triton Concepts
- Study: Compiler Overview
- Dive Deep: MLIR Concepts
All source documentation is in docs/source/:
```
docs/source/
├── cpu-concurrency/     (14 files on threading, asyncio, synchronization)
├── gpu-concepts/        (5 files on GPU architecture)
├── gpu-tutorials/       (10 step-by-step tutorials)
├── triton-compiler/     (5 files on Triton compilation)
├── learning-paths.rst
├── troubleshooting.rst
├── references.rst
└── index.rst            (Main documentation index)
```
Markdown versions for GitHub viewing are in docs/markdown/:

```
docs/markdown/
└── gpu-concepts/        (5 Markdown files for GitHub web viewing)
```
To improve documentation:
- Edit RST files in `docs/source/`
- Build locally: `cd docs && make html`
- Verify formatting in browser
- Commit changes
For new documentation:
- Create a `.rst` file in the appropriate folder
- Add an entry to the `docs/source/index.rst` toctree
- Follow existing formatting and structure
- Create a corresponding `.md` file for GitHub viewing
- Sphinx Documentation
- reStructuredText Primer
- Triton Documentation
- NVIDIA CUDA Programming Guide
- PyTorch Documentation
Python:
- `threading.Lock` / `threading.RLock` - Mutual exclusion
- `threading.Condition` - Wait/notify synchronization
- `threading.Event` - Simple signaling
- `queue.Queue` - Thread-safe queue (see the sketch after this list)
- `asyncio` - Asynchronous programming
- `threading.Semaphore` - Resource counting
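For instance, a compact producer-consumer sketch using `queue.Queue` and its task-tracking API might look like this (illustrative only, not the repository's solution):

```python
import queue
import threading

q = queue.Queue(maxsize=8)                        # bounded, so producers block when full

def producer() -> None:
    for item in range(20):
        q.put(item)                               # blocks if the queue is full

def consumer() -> None:
    while True:
        item = q.get()                            # blocks until an item is available
        print(f"consumed {item}")
        q.task_done()                             # signal that this item is processed

threading.Thread(target=producer).start()
threading.Thread(target=consumer, daemon=True).start()
q.join()                                          # wait until every put() has a matching task_done()
```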
C++:
- `std::mutex` - Mutual exclusion
- `std::condition_variable` - Wait/notify
- `std::lock_guard` / `std::unique_lock` - RAII wrappers
- `std::atomic` - Atomic operations
- `std::future` / `std::promise` - Async results
- Memory ordering: `acquire`, `release`, `relaxed`, `seq_cst`
Go:
- `sync.Mutex` - Mutual exclusion
- `sync.RWMutex` - Reader-writer lock
- `sync.WaitGroup` - Coordination
- Channels - Communication and synchronization
- `select` statement - Channel multiplexing
- Goroutines - Lightweight concurrency
Introduction to concurrent programming concepts using Python's threading and asyncio modules.
- `01_thread_safe_counter.py` - Thread-safe counter with mutex
  - Learn: Locks, race conditions, synchronization primitives
  - Difficulty: Beginner
  - Key concepts: `threading.Lock`, mutual exclusion
- `02_producer_consumer_queue.py` - Thread-safe bounded queue
  - Learn: Condition variables, blocking operations, signaling
  - Difficulty: Intermediate
  - Key concepts: `threading.Condition`, wait/notify patterns
- `03_thread_pool.py` - Thread pool executor
  - Learn: Thread lifecycle, work distribution, futures
  - Difficulty: Intermediate
  - Key concepts: Worker threads, task queues, graceful shutdown
- `04_dining_philosophers.py` - Classic synchronization problem
  - Learn: Deadlock avoidance, fair scheduling, resource allocation
  - Difficulty: Intermediate
  - Key concepts: Deadlock prevention, fairness
- `05_async_web_scraper.py` - Asynchronous concurrent fetching (pattern sketched after this list)
  - Learn: Async/await, coroutines, rate limiting
  - Difficulty: Intermediate
  - Key concepts: `asyncio`, event loops, semaphores
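The core shape of that async-fetching pattern, with a semaphore capping concurrency, looks roughly like this (a hedged sketch with made-up URLs, not the problem's reference solution):

```python
import asyncio

async def fetch(url: str, limiter: asyncio.Semaphore) -> str:
    async with limiter:                    # at most N requests in flight at once
        await asyncio.sleep(0.1)           # stand-in for a real HTTP request
        return f"body of {url}"

async def main() -> None:
    limiter = asyncio.Semaphore(5)
    urls = [f"https://example.com/page/{i}" for i in range(20)]  # hypothetical URLs
    results = await asyncio.gather(*(fetch(u, limiter) for u in urls))
    print(len(results), "pages fetched")

asyncio.run(main())
```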
Low-level concurrent programming with fine-grained control over synchronization.
- `01_thread_safe_counter.cpp` - Atomic counter with synchronization
  - Learn: `std::mutex`, `std::atomic`, RAII patterns
  - Difficulty: Beginner
  - Key concepts: Lock guards, atomic operations
- `02_producer_consumer_queue.cpp` - Template-based queue with condition variables
  - Learn: `std::condition_variable`, template metaprogramming
  - Difficulty: Intermediate
  - Key concepts: Wait/notify, timeout handling
- `03_thread_pool.cpp` - Thread pool with futures
  - Learn: `std::future`, `std::promise`, thread management
  - Difficulty: Intermediate
  - Key concepts: Async task execution, result handling
- `04_read_write_lock.cpp` - Reader-writer lock implementation
  - Learn: Multiple synchronized access patterns
  - Difficulty: Advanced
  - Key concepts: Shared/exclusive locks, writer priority
- `05_lock_free_queue.cpp` - Lock-free queue using atomics
  - Learn: CAS operations, memory ordering, lock-free data structures
  - Difficulty: Advanced
  - Key concepts: `std::atomic`, compare-and-swap, ABA problem
Idiomatic concurrent programming using goroutines and channels.
- `01_goroutine_counter.go` - Counter with mutex
  - Learn: `sync.Mutex`, goroutines, basic synchronization
  - Difficulty: Beginner
  - Key concepts: Goroutines, mutex patterns
- `02_pipeline_pattern.go` - Multi-stage data pipeline
  - Learn: Channel communication, goroutine pipelines
  - Difficulty: Intermediate
  - Key concepts: Channels as pipes, concurrent stages
- `03_worker_pool.go` - Worker pool pattern
  - Learn: Work distribution, goroutine management
  - Difficulty: Intermediate
  - Key concepts: Worker goroutines, task queues
- `04_fan_out_fan_in.go` - Distributing and aggregating work
  - Learn: Fan-out/fan-in patterns, channel multiplexing
  - Difficulty: Intermediate
  - Key concepts: Work distribution, result aggregation
- `05_rate_limiter.go` - Token bucket rate limiting
  - Learn: Time-based coordination, token patterns
  - Difficulty: Intermediate
  - Key concepts: Rate control, channels with time
Implementation of core deep learning components from scratch in Python/NumPy.
- `01_linear_layer.py` - Fully-connected layer
  - Learn: Forward/backward passes, weight initialization, gradient computation
  - Difficulty: Beginner
  - Key concepts: Matrix operations, backpropagation, numerical gradients
- `02_conv2d_layer.py` - 2D convolution layer
  - Learn: Convolution operation, padding/stride, gradient computation
  - Difficulty: Intermediate
  - Key concepts: Tensor operations, spatial convolutions, im2col
- `03_attention_mechanism.py` - Scaled dot-product attention (see the NumPy sketch after this list)
  - Learn: Softmax, multi-head patterns, causal masking
  - Difficulty: Intermediate
  - Key concepts: Attention weights, temperature scaling, numerical stability
- `04_batch_normalization.py` - Batch normalization
  - Learn: Training vs eval modes, running statistics, layer optimization
  - Difficulty: Intermediate
  - Key concepts: Internal covariate shift, momentum, affine transformations
- `05_lstm_cell.py` - LSTM recurrent cell
  - Learn: Gating mechanisms, hidden state tracking, sequence processing
  - Difficulty: Intermediate
  - Key concepts: Cell state, gradient flow, vanishing gradients
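For orientation, scaled dot-product attention reduces to a few NumPy lines. The sketch below assumes single-head, unmasked attention and is not the problem file's solution:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_k). Single head, no mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Tiny usage example with random data
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```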
Systems-level problems for building LLM inference engines and agent systems.
- `01_dynamic_batching.py` - Request batching for inference
  - Learn: Queue management, batch formation strategies, latency optimization
  - Difficulty: Intermediate
  - Key concepts: FCFS/SJF scheduling, padding efficiency, throughput optimization
- `02_kv_cache_manager.py` - KV cache management (a tiny LRU-eviction sketch follows this list)
  - Learn: Memory allocation, eviction policies, sequence management
  - Difficulty: Intermediate
  - Key concepts: LRU/LFU eviction, cache fragmentation, memory limits
- `03_agent_system.py` - Agent framework with tools
  - Learn: ReAct pattern, tool calling, conversation memory
  - Difficulty: Intermediate
  - Key concepts: Multi-step reasoning, state management, error handling
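As one concrete piece of the cache-management problem, LRU eviction can be sketched with an `OrderedDict`; the class name and capacity below are hypothetical, not the repository's API:

```python
from collections import OrderedDict

class LRUKVCache:
    """Toy LRU cache: evicts the least recently used sequence when full."""
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: OrderedDict = OrderedDict()

    def get(self, seq_id: str):
        if seq_id not in self._entries:
            return None
        self._entries.move_to_end(seq_id)         # mark as most recently used
        return self._entries[seq_id]

    def put(self, seq_id: str, kv_blocks) -> None:
        self._entries[seq_id] = kv_blocks
        self._entries.move_to_end(seq_id)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)      # evict least recently used

cache = LRUKVCache(capacity=2)
cache.put("seq-a", ["block0"]); cache.put("seq-b", ["block1"])
cache.get("seq-a")                                 # touch seq-a so seq-b becomes LRU
cache.put("seq-c", ["block2"])                     # evicts seq-b
print(list(cache._entries))                        # ['seq-a', 'seq-c']
```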
GPU kernel programming using Triton and CUDA concepts.
- `01_matrix_multiply.py` - Optimized matrix multiplication
  - Learn: Block tiling, memory hierarchy, GPU optimization
  - Difficulty: Intermediate
  - Key concepts: Blocking strategy, cache locality, coalesced memory access
- `02_softmax_kernel.py` - Softmax kernel implementation
  - Learn: Row-wise reductions, numerical stability, online computation
  - Difficulty: Intermediate
  - Key concepts: Block operations, final normalization, attention kernels
- `03_flash_attention.py` - IO-aware attention implementation (the online-softmax idea is sketched after this list)
  - Learn: IO-optimal algorithms, tiling strategy, online softmax
  - Difficulty: Advanced
  - Key concepts: Block-wise computation, running statistics, memory efficiency
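The running-statistics trick behind online softmax (and flash attention's block-wise computation) can be illustrated in plain NumPy; this is a conceptual sketch, not the kernel you will write:

```python
import numpy as np

def online_softmax(row: np.ndarray, block: int = 4) -> np.ndarray:
    """Compute softmax of one row in blocks, keeping only a running max and sum."""
    m = -np.inf          # running maximum seen so far
    s = 0.0              # running sum of exp(x - m)
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        m_new = max(m, chunk.max())
        s = s * np.exp(m - m_new) + np.exp(chunk - m_new).sum()  # rescale old sum to the new max
        m = m_new
    return np.exp(row - m) / s   # final pass uses the global max and sum

x = np.array([1.0, 3.0, -2.0, 0.5, 4.0, 2.0, -1.0, 0.0])
assert np.allclose(online_softmax(x), np.exp(x - x.max()) / np.exp(x - x.max()).sum())
```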
- Start with Python basics:
  - `python/01_thread_safe_counter.py`
  - `python/02_producer_consumer_queue.py`
- Move to DL Modeling basics:
  - `dl_modeling/01_linear_layer.py`
- Try Go basics:
  - `go/01_goroutine_counter.go`
  - `go/02_pipeline_pattern.go`
- Expand concurrent programming:
  - `python/03_thread_pool.py` & `cpp/03_thread_pool.cpp`
  - `go/03_worker_pool.go` & `go/04_fan_out_fan_in.go`
- Expand DL Modeling:
  - `dl_modeling/02_conv2d_layer.py`
  - `dl_modeling/03_attention_mechanism.py`
  - `dl_modeling/04_batch_normalization.py`
- Learn GPU programming:
  - `triton_cuda/01_matrix_multiply.py`
  - `triton_cuda/02_softmax_kernel.py`
- Advanced concurrent:
  - `cpp/04_read_write_lock.cpp`
  - `cpp/05_lock_free_queue.cpp`
  - `python/04_dining_philosophers.py`
- Advanced DL systems:
  - `dl_modeling/05_lstm_cell.py`
  - `dl_llm_systems/01_dynamic_batching.py`
  - `dl_llm_systems/02_kv_cache_manager.py`
  - `dl_llm_systems/03_agent_system.py`
- Advanced GPU:
  - `triton_cuda/03_flash_attention.py`
Each problem file contains:
- Problem Description - Detailed requirements and learning objectives
- Boilerplate Code - Skeleton implementation with TODO comments
- Test Cases - Comprehensive tests to verify correctness
- Performance Notes - Expected performance characteristics
- Read the problem description at the top of the file
- Understand the requirements and test cases
- Implement the TODOs (marked with `// TODO:` or `# TODO:`)
- Run the tests to verify correctness
- Benchmark your implementation
Python (Concurrent):

```bash
python python/01_thread_safe_counter.py
```

Python (DL Modeling):

```bash
python dl_modeling/01_linear_layer.py
```

Python (DL Systems):

```bash
python dl_llm_systems/01_dynamic_batching.py
```

Python (GPU):

```bash
python triton_cuda/01_matrix_multiply.py
```

C++:

```bash
cd cpp
g++ -std=c++17 -pthread -o counter 01_thread_safe_counter.cpp
./counter
```

Go:

```bash
go run go/01_goroutine_counter.go
```

| Problem | Language | Expected Throughput |
|---|---|---|
| Thread-safe Counter | Python | 10K-50K ops/sec |
| Thread-safe Counter | C++ | 1M-10M ops/sec |
| Producer-Consumer | Python | 1K-5K items/sec |
| Producer-Consumer | C++ | 100K-500K items/sec |
| Thread Pool | Python | 100-1K tasks/sec |
| Thread Pool | C++ | 10K-100K tasks/sec |
| Rate Limiter | Go | Configurable (accurate) |
- Start Simple - Get basic tests passing before optimizing
- Understand the Problem - Read comments carefully
- Test Thoroughly - Run all tests, including edge cases
- Measure Performance - Use benchmarks to verify optimization
- Avoid Common Pitfalls:
- Deadlocks (circular wait)
- Race conditions (unsynchronized access)
- Goroutine leaks (unreachable goroutines)
- Memory leaks (unclosed channels/resources)
- Deadlock: Caused by circular wait for resources. Prevention: consistent lock ordering, timeout-based acquisition (see the lock-ordering sketch after this list).
- Race conditions: Unsynchronized access to shared data. Solution: proper synchronization primitives.
- Starvation: One thread continuously blocked while others proceed. Solution: fair scheduling, writer/reader priority.
- Lock contention: Contended locks reduce throughput. Solution: shrink critical sections, use lock-free structures.
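To make the deadlock-prevention advice concrete, here is a minimal sketch of consistent lock ordering in Python (the `transfer` helper and account objects are illustrative, not taken from the problem files):

```python
import threading

class Account:
    def __init__(self, name: str, balance: int) -> None:
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src: Account, dst: Account, amount: int) -> None:
    # Always acquire locks in a globally consistent order (here: by account name)
    # so two concurrent transfers can never wait on each other in a cycle.
    first, second = sorted([src, dst], key=lambda a: a.name)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account("a", 100), Account("b", 100)
t1 = threading.Thread(target=transfer, args=(a, b, 10))
t2 = threading.Thread(target=transfer, args=(b, a, 5))
t1.start(); t2.start(); t1.join(); t2.join()
print(a.balance, b.balance)  # 95 105
```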
Good luck with your concurrent programming journey! Remember, mastering concurrency takes time and practice. Start with simple problems and gradually tackle more complex ones.