This repository contains a curated set of concurrent programming problems designed to help you master concurrent patterns across Python, C++, and Go, as well as GPU programming with Triton/CUDA.
Alongside the problems, it ships comprehensive guides and tutorials on concurrent and GPU programming, with the documentation available in multiple formats.
The best viewing experience is via the built HTML documentation:
```bash
cd docs
make html
# Then open build/html/index.html in your browser
```

This provides full navigation and proper formatting for tables, code blocks, and diagrams.
Learn fundamental concurrent programming concepts with Python threading, asyncio, and multiprocessing.
| Topic | Focus | Level |
|---|---|---|
| Key Concepts | Concurrency vs parallelism, GIL, event loops, coroutines | Beginner |
| Hardware Parallelism | Multi-core architecture, NUMA, cache hierarchy | Intermediate |
| Threading Basics | Thread creation, management, lifecycle | Beginner |
| Asyncio Event Loop | Event-driven programming, async I/O patterns | Intermediate |
| Asyncio Coroutines | Coroutine definition, async/await syntax | Intermediate |
| Asyncio and Futures | Future-based concurrency, callbacks | Intermediate |
| Asyncio Task Manipulation | Task creation, management, cancellation | Advanced |
| Concurrent Futures | ThreadPoolExecutor, ProcessPoolExecutor, futures | Intermediate |
| Queue Explained | Queue data structure, thread-safe operations | Intermediate |
| Queue Internals | Queue implementation, condition variables | Advanced |
| Task Done Queue | Task synchronization, done signaling | Intermediate |
| RLock Explained | Reentrant locks, recursive synchronization | Intermediate |
| Semaphore Explained | Semaphore patterns, resource counting | Intermediate |
| Patterns & Problems | Common patterns, deadlock, race conditions | Advanced |
Master GPU architecture, optimization techniques, and Triton programming.
| Topic | Focus | Level |
|---|---|---|
| GPU Fundamentals | GPU architecture, threads, warps, blocks | Beginner |
| Memory Hierarchy | Global, shared, local memory; cache; bandwidth | Intermediate |
| Execution Model | Thread execution, synchronization, atomics | Advanced |
| Performance Optimization | Memory access patterns, occupancy, roofline | Advanced |
| Triton Concepts | Kernel launching, grid specification, constexpr | Intermediate |
Progressive tutorials from basic to advanced GPU kernel programming.
| # | Topic | Concepts | Level |
|---|---|---|---|
| 1 | Vector Addition | Basic kernel, grid launch, masking | Beginner |
| 2 | Fused Softmax | Row-wise reductions, numerical stability | Intermediate |
| 3 | Matrix Multiplication | Block tiling, memory hierarchy, cache | Intermediate |
| 4 | Low Memory Dropout | Memory efficiency, masking patterns | Intermediate |
| 5 | Layer Normalization | Normalization kernels, fusion | Intermediate |
| 6 | Fused Attention | Complex kernel fusion, online softmax | Advanced |
| 7 | Extern Functions | CUDA integration, custom ops | Advanced |
| 8 | Grouped GEMM | Batched operations, dynamic shapes | Advanced |
| 9 | Persistent MatMul | Persistent kernel pattern, work distribution | Advanced |
| 10 | Block Scaled MatMul | Block-wise scaling, quantization-aware | Advanced |
Understanding Triton's compilation pipeline and optimization.
| Topic | Focus | Level |
|---|---|---|
| Compiler Overview | Pipeline stages, IR representations | Intermediate |
| JIT Decorator | Specialization, caching, kernel variants | Advanced |
| Compilation Pipeline | Python→Triton IR→MLIR→LLVM→PTX→Cubin | Advanced |
| CUDA Comparison | Triton vs CUDA: code, compilation, optimization | Intermediate |
| MLIR Concepts | MLIR dialects, transformations, lowering | Advanced |
Structured learning paths and reference materials.
| Resource | Content |
|---|---|
| Learning Paths | Beginner → Intermediate → Advanced progressions |
| Troubleshooting | Common issues, debugging, performance problems |
| References | External links, official documentation, papers |
```bash
pip install sphinx sphinx-rtd-theme
```

```bash
cd docs
make html

open build/html/index.html      # macOS
# or
xdg-open build/html/index.html  # Linux
# or
start build/html/index.html     # Windows
```

```bash
# PDF (requires LaTeX)
make latexpdf
# Man pages
make man
# EPUB
make epub
```

- Mutual Exclusion (Locks): Prevent concurrent access to shared resources (see the sketch after this list)
- Condition Variables: Wait/notify synchronization between threads
- Producer-Consumer: Data flow between threads with bounded queues
- Thread Pools: Reusable worker threads for task execution
- Async/Await: Efficient I/O-bound concurrency without explicit threading
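As a taste of the first pattern above, here is a minimal mutual-exclusion sketch in Python. It mirrors the thread-safe counter problem but is only an illustration, not the repository's reference solution:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        with lock:          # only one thread mutates the counter at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; typically less without it
```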
- SPMD Model: Single Program, Multiple Data execution
- Thread Organization: Threads → Warps → Blocks → Grids
- Memory Hierarchy: Global → Shared → Local memory with different speeds
- Synchronization: `__syncthreads()` for block-level, atomic ops for device-level
- Occupancy: Ratio of active warps to maximum possible warps (worked example below)
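As a quick worked example of the occupancy ratio just defined (the numbers below are illustrative, not taken from any particular GPU in the docs):

```python
# Occupancy = active warps per SM / maximum resident warps per SM.
# Illustrative numbers only: suppose register pressure limits an SM that
# could host 64 resident warps to 48 active warps.
max_warps_per_sm = 64
active_warps = 48
occupancy = active_warps / max_warps_per_sm
print(f"occupancy = {occupancy:.0%}")  # 75%
```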
- @triton.jit: JIT decorator for GPU kernel compilation
- Grid Specification: `lambda meta: (cdiv(N, BLOCK_SIZE),)` for 1D grids
- tl.constexpr: Compile-time constants enabling kernel specialization
- Block Tiling: Processing data in blocks for cache efficiency
- Masking: Safe handling of boundary conditions with the `mask` parameter (all of these appear in the kernel sketch below)
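To see how these pieces fit together, here is a minimal vector-addition kernel in the style of the first GPU tutorial; treat it as an illustrative sketch rather than the tutorial's exact code:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                            # guard the ragged tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)  # 1D grid specification
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```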
- Read: Key Concepts
- Study: Threading Basics
- Practice: See the `/python` directory for practice problems
- Read: GPU Fundamentals
- Study: Vector Addition Tutorial
- Practice: Work through tutorials 1-5, then tackle 6+
- Read: Triton Concepts
- Study: Compiler Overview
- Dive Deep: MLIR Concepts
All source documentation is in docs/source/:
```
docs/source/
├── cpu-concurrency/     (14 files on threading, asyncio, synchronization)
├── gpu-concepts/        (5 files on GPU architecture)
├── gpu-tutorials/       (10 step-by-step tutorials)
├── triton-compiler/     (5 files on Triton compilation)
├── learning-paths.rst
├── troubleshooting.rst
├── references.rst
└── index.rst            (Main documentation index)
```
Markdown versions for GitHub viewing are in docs/markdown/:

```
docs/markdown/
└── gpu-concepts/        (5 Markdown files for GitHub web viewing)
```
To improve documentation:
- Edit RST files in `docs/source/`
- Build locally: `cd docs && make html`
- Verify formatting in browser
- Commit changes
For new documentation:
- Create a `.rst` file in the appropriate folder
- Add an entry to the `docs/source/index.rst` toctree
- Follow existing formatting and structure
- Create a corresponding `.md` file for GitHub viewing
- Sphinx Documentation
- reStructuredText Primer
- Triton Documentation
- NVIDIA CUDA Programming Guide
- PyTorch Documentation
Python:
- `threading.Lock` / `threading.RLock` - Mutual exclusion
- `threading.Condition` - Wait/notify synchronization
- `threading.Event` - Simple signaling
- `queue.Queue` - Thread-safe queue (see the sketch after this list)
- `asyncio` - Asynchronous programming
- `threading.Semaphore` - Resource counting
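For instance, a compact producer-consumer sketch using `queue.Queue` and its task-tracking API might look like this (illustrative only, not the repository's solution):

```python
import queue
import threading

q = queue.Queue(maxsize=8)                        # bounded, so producers block when full

def producer() -> None:
    for item in range(20):
        q.put(item)                               # blocks if the queue is full

def consumer() -> None:
    while True:
        item = q.get()                            # blocks until an item is available
        print(f"consumed {item}")
        q.task_done()                             # signal that this item is processed

threading.Thread(target=producer).start()
threading.Thread(target=consumer, daemon=True).start()
q.join()                                          # wait until every put() has a matching task_done()
```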
C++:
- `std::mutex` - Mutual exclusion
- `std::condition_variable` - Wait/notify
- `std::lock_guard` / `std::unique_lock` - RAII wrappers
- `std::atomic` - Atomic operations
- `std::future` / `std::promise` - Async results
- Memory ordering: `acquire`, `release`, `relaxed`, `seq_cst`
Go:
- `sync.Mutex` - Mutual exclusion
- `sync.RWMutex` - Reader-writer lock
- `sync.WaitGroup` - Coordination
- Channels - Communication and synchronization
- `select` statement - Channel multiplexing
- Goroutines - Lightweight concurrency
Introduction to concurrent programming concepts using Python's threading and asyncio modules.
- `01_thread_safe_counter.py` - Thread-safe counter with mutex
  - Learn: Locks, race conditions, synchronization primitives
  - Difficulty: Beginner
  - Key concepts: `threading.Lock`, mutual exclusion
- `02_producer_consumer_queue.py` - Thread-safe bounded queue
  - Learn: Condition variables, blocking operations, signaling
  - Difficulty: Intermediate
  - Key concepts: `threading.Condition`, wait/notify patterns
- `03_thread_pool.py` - Thread pool executor
  - Learn: Thread lifecycle, work distribution, futures
  - Difficulty: Intermediate
  - Key concepts: Worker threads, task queues, graceful shutdown
- `04_dining_philosophers.py` - Classic synchronization problem
  - Learn: Deadlock avoidance, fair scheduling, resource allocation
  - Difficulty: Intermediate
  - Key concepts: Deadlock prevention, fairness
- `05_async_web_scraper.py` - Asynchronous concurrent fetching (pattern sketched after this list)
  - Learn: Async/await, coroutines, rate limiting
  - Difficulty: Intermediate
  - Key concepts: `asyncio`, event loops, semaphores
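The core shape of that async-fetching pattern, with a semaphore capping concurrency, looks roughly like this (a hedged sketch with made-up URLs, not the problem's reference solution):

```python
import asyncio

async def fetch(url: str, limiter: asyncio.Semaphore) -> str:
    async with limiter:                    # at most N requests in flight at once
        await asyncio.sleep(0.1)           # stand-in for a real HTTP request
        return f"body of {url}"

async def main() -> None:
    limiter = asyncio.Semaphore(5)
    urls = [f"https://example.com/page/{i}" for i in range(20)]  # hypothetical URLs
    results = await asyncio.gather(*(fetch(u, limiter) for u in urls))
    print(len(results), "pages fetched")

asyncio.run(main())
```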
Low-level concurrent programming with fine-grained control over synchronization.
- `01_thread_safe_counter.cpp` - Atomic counter with synchronization
  - Learn: `std::mutex`, `std::atomic`, RAII patterns
  - Difficulty: Beginner
  - Key concepts: Lock guards, atomic operations
- `02_producer_consumer_queue.cpp` - Template-based queue with condition variables
  - Learn: `std::condition_variable`, template metaprogramming
  - Difficulty: Intermediate
  - Key concepts: Wait/notify, timeout handling
- `03_thread_pool.cpp` - Thread pool with futures
  - Learn: `std::future`, `std::promise`, thread management
  - Difficulty: Intermediate
  - Key concepts: Async task execution, result handling
- `04_read_write_lock.cpp` - Reader-writer lock implementation
  - Learn: Multiple synchronized access patterns
  - Difficulty: Advanced
  - Key concepts: Shared/exclusive locks, writer priority
- `05_lock_free_queue.cpp` - Lock-free queue using atomics
  - Learn: CAS operations, memory ordering, lock-free data structures
  - Difficulty: Advanced
  - Key concepts: `std::atomic`, compare-and-swap, ABA problem
Idiomatic concurrent programming using goroutines and channels.
- `01_goroutine_counter.go` - Counter with mutex
  - Learn: `sync.Mutex`, goroutines, basic synchronization
  - Difficulty: Beginner
  - Key concepts: Goroutines, mutex patterns
- `02_pipeline_pattern.go` - Multi-stage data pipeline
  - Learn: Channel communication, goroutine pipelines
  - Difficulty: Intermediate
  - Key concepts: Channels as pipes, concurrent stages
- `03_worker_pool.go` - Worker pool pattern
  - Learn: Work distribution, goroutine management
  - Difficulty: Intermediate
  - Key concepts: Worker goroutines, task queues
- `04_fan_out_fan_in.go` - Distributing and aggregating work
  - Learn: Fan-out/fan-in patterns, channel multiplexing
  - Difficulty: Intermediate
  - Key concepts: Work distribution, result aggregation
- `05_rate_limiter.go` - Token bucket rate limiting
  - Learn: Time-based coordination, token patterns
  - Difficulty: Intermediate
  - Key concepts: Rate control, channels with time
Implementation of core deep learning components from scratch in Python/NumPy.
- `01_linear_layer.py` - Fully-connected layer
  - Learn: Forward/backward passes, weight initialization, gradient computation
  - Difficulty: Beginner
  - Key concepts: Matrix operations, backpropagation, numerical gradients
- `02_conv2d_layer.py` - 2D convolution layer
  - Learn: Convolution operation, padding/stride, gradient computation
  - Difficulty: Intermediate
  - Key concepts: Tensor operations, spatial convolutions, im2col
- `03_attention_mechanism.py` - Scaled dot-product attention (see the NumPy sketch after this list)
  - Learn: Softmax, multi-head patterns, causal masking
  - Difficulty: Intermediate
  - Key concepts: Attention weights, temperature scaling, numerical stability
- `04_batch_normalization.py` - Batch normalization
  - Learn: Training vs eval modes, running statistics, layer optimization
  - Difficulty: Intermediate
  - Key concepts: Internal covariate shift, momentum, affine transformations
- `05_lstm_cell.py` - LSTM recurrent cell
  - Learn: Gating mechanisms, hidden state tracking, sequence processing
  - Difficulty: Intermediate
  - Key concepts: Cell state, gradient flow, vanishing gradients
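For orientation, scaled dot-product attention reduces to a few NumPy lines. The sketch below assumes single-head, unmasked attention and is not the problem file's solution:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_k). Single head, no mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Tiny usage example with random data
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```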
Systems-level problems for building LLM inference engines and agent systems.
- `01_dynamic_batching.py` - Request batching for inference
  - Learn: Queue management, batch formation strategies, latency optimization
  - Difficulty: Intermediate
  - Key concepts: FCFS/SJF scheduling, padding efficiency, throughput optimization
- `02_kv_cache_manager.py` - KV cache management (a tiny LRU-eviction sketch follows this list)
  - Learn: Memory allocation, eviction policies, sequence management
  - Difficulty: Intermediate
  - Key concepts: LRU/LFU eviction, cache fragmentation, memory limits
- `03_agent_system.py` - Agent framework with tools
  - Learn: ReAct pattern, tool calling, conversation memory
  - Difficulty: Intermediate
  - Key concepts: Multi-step reasoning, state management, error handling
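As one concrete piece of the cache-management problem, LRU eviction can be sketched with an `OrderedDict`; the class name and capacity below are hypothetical, not the repository's API:

```python
from collections import OrderedDict

class LRUKVCache:
    """Toy LRU cache: evicts the least recently used sequence when full."""
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: OrderedDict = OrderedDict()

    def get(self, seq_id: str):
        if seq_id not in self._entries:
            return None
        self._entries.move_to_end(seq_id)         # mark as most recently used
        return self._entries[seq_id]

    def put(self, seq_id: str, kv_blocks) -> None:
        self._entries[seq_id] = kv_blocks
        self._entries.move_to_end(seq_id)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)      # evict least recently used

cache = LRUKVCache(capacity=2)
cache.put("seq-a", ["block0"]); cache.put("seq-b", ["block1"])
cache.get("seq-a")                                 # touch seq-a so seq-b becomes LRU
cache.put("seq-c", ["block2"])                     # evicts seq-b
print(list(cache._entries))                        # ['seq-a', 'seq-c']
```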
GPU kernel programming using Triton and CUDA concepts.
- `01_matrix_multiply.py` - Optimized matrix multiplication
  - Learn: Block tiling, memory hierarchy, GPU optimization
  - Difficulty: Intermediate
  - Key concepts: Blocking strategy, cache locality, coalesced memory access
- `02_softmax_kernel.py` - Softmax kernel implementation
  - Learn: Row-wise reductions, numerical stability, online computation
  - Difficulty: Intermediate
  - Key concepts: Block operations, final normalization, attention kernels
- `03_flash_attention.py` - IO-aware attention implementation (the online-softmax idea is sketched after this list)
  - Learn: IO-optimal algorithms, tiling strategy, online softmax
  - Difficulty: Advanced
  - Key concepts: Block-wise computation, running statistics, memory efficiency
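The running-statistics trick behind online softmax (and flash attention's block-wise computation) can be illustrated in plain NumPy; this is a conceptual sketch, not the kernel you will write:

```python
import numpy as np

def online_softmax(row: np.ndarray, block: int = 4) -> np.ndarray:
    """Compute softmax of one row in blocks, keeping only a running max and sum."""
    m = -np.inf          # running maximum seen so far
    s = 0.0              # running sum of exp(x - m)
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        m_new = max(m, chunk.max())
        s = s * np.exp(m - m_new) + np.exp(chunk - m_new).sum()  # rescale old sum to the new max
        m = m_new
    return np.exp(row - m) / s   # final pass uses the global max and sum

x = np.array([1.0, 3.0, -2.0, 0.5, 4.0, 2.0, -1.0, 0.0])
assert np.allclose(online_softmax(x), np.exp(x - x.max()) / np.exp(x - x.max()).sum())
```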
- Start with Python basics:
  - `python/01_thread_safe_counter.py`
  - `python/02_producer_consumer_queue.py`
- Move to DL Modeling basics:
  - `dl_modeling/01_linear_layer.py`
- Try Go basics:
  - `go/01_goroutine_counter.go`
  - `go/02_pipeline_pattern.go`
- Expand concurrent programming:
  - `python/03_thread_pool.py` & `cpp/03_thread_pool.cpp`
  - `go/03_worker_pool.go` & `go/04_fan_out_fan_in.go`
- Expand DL Modeling:
  - `dl_modeling/02_conv2d_layer.py`
  - `dl_modeling/03_attention_mechanism.py`
  - `dl_modeling/04_batch_normalization.py`
- Learn GPU programming:
  - `triton_cuda/01_matrix_multiply.py`
  - `triton_cuda/02_softmax_kernel.py`
- Advanced concurrent:
  - `cpp/04_read_write_lock.cpp`
  - `cpp/05_lock_free_queue.cpp`
  - `python/04_dining_philosophers.py`
- Advanced DL systems:
  - `dl_modeling/05_lstm_cell.py`
  - `dl_llm_systems/01_dynamic_batching.py`
  - `dl_llm_systems/02_kv_cache_manager.py`
  - `dl_llm_systems/03_agent_system.py`
- Advanced GPU:
  - `triton_cuda/03_flash_attention.py`
Each problem file contains:
- Problem Description - Detailed requirements and learning objectives
- Boilerplate Code - Skeleton implementation with TODO comments
- Test Cases - Comprehensive tests to verify correctness
- Performance Notes - Expected performance characteristics
- Read the problem description at the top of the file
- Understand the requirements and test cases
- Implement the TODOs (marked with `// TODO:` or `# TODO:`)
- Run the tests to verify correctness
- Benchmark your implementation
Python (Concurrent):

```bash
python python/01_thread_safe_counter.py
```

Python (DL Modeling):

```bash
python dl_modeling/01_linear_layer.py
```

Python (DL Systems):

```bash
python dl_llm_systems/01_dynamic_batching.py
```

Python (GPU):

```bash
python triton_cuda/01_matrix_multiply.py
```

C++:

```bash
cd cpp
g++ -std=c++17 -pthread -o counter 01_thread_safe_counter.cpp
./counter
```

Go:

```bash
go run go/01_goroutine_counter.go
```

| Problem | Language | Expected Throughput |
|---|---|---|
| Thread-safe Counter | Python | 10K-50K ops/sec |
| Thread-safe Counter | C++ | 1M-10M ops/sec |
| Producer-Consumer | Python | 1K-5K items/sec |
| Producer-Consumer | C++ | 100K-500K items/sec |
| Thread Pool | Python | 100-1K tasks/sec |
| Thread Pool | C++ | 10K-100K tasks/sec |
| Rate Limiter | Go | Configurable (accurate) |
- Start Simple - Get basic tests passing before optimizing
- Understand the Problem - Read comments carefully
- Test Thoroughly - Run all tests, including edge cases
- Measure Performance - Use benchmarks to verify optimization
- Avoid Common Pitfalls:
- Deadlocks (circular wait)
- Race conditions (unsynchronized access)
- Goroutine leaks (unreachable goroutines)
- Memory leaks (unclosed channels/resources)
- Deadlock: Caused by circular wait for resources. Prevention: consistent lock ordering, timeout-based acquisition (see the lock-ordering sketch after this list).
- Race conditions: Unsynchronized access to shared data. Solution: proper synchronization primitives.
- Starvation: One thread continuously blocked while others proceed. Solution: fair scheduling, writer/reader priority.
- Lock contention: Contended locks reduce throughput. Solution: shrink critical sections, use lock-free structures.
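To make the deadlock-prevention advice concrete, here is a minimal sketch of consistent lock ordering in Python (the `transfer` helper and account objects are illustrative, not taken from the problem files):

```python
import threading

class Account:
    def __init__(self, name: str, balance: int) -> None:
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src: Account, dst: Account, amount: int) -> None:
    # Always acquire locks in a globally consistent order (here: by account name)
    # so two concurrent transfers can never wait on each other in a cycle.
    first, second = sorted([src, dst], key=lambda a: a.name)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account("a", 100), Account("b", 100)
t1 = threading.Thread(target=transfer, args=(a, b, 10))
t2 = threading.Thread(target=transfer, args=(b, a, 5))
t1.start(); t2.start(); t1.join(); t2.join()
print(a.balance, b.balance)  # 95 105
```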
Good luck with your concurrent programming journey! Remember, mastering concurrency takes time and practice. Start with simple problems and gradually tackle more complex ones.