Goal: Comprehensive Pipeline Parallelism Implementation
Implement a complete pipeline parallelism system covering all major scheduling strategies from 2016-2025, production-grade partitioning, and activation checkpointing. This aligns with PyTorch's torch.distributed.pipelining, Megatron-LM, and DeepSpeed pipeline APIs.
Pipeline Schedule Taxonomy
Single-Stage Schedules (1 model chunk per rank)
| Schedule | Paper/Year | Bubble Fraction | Peak Memory | Status |
|---|---|---|---|---|
| GPipe | Huang et al., 2019 | (P-1)/(P-1+M) ~50% | O(M) activations | ✅ Done |
| 1F1B | PipeDream, Narayanan 2019 | (P-1)/(2M+P-1) ~12-15% | O(P) activations | ✅ Done |
| ZB-H1 | Qi et al., ICLR 2024 | ~1/3 of 1F1B | Same as 1F1B | ✅ Done |
| ZB-H2 | Qi et al., ICLR 2024 | ~0% (zero bubble) | Higher than 1F1B | ✅ Done |
Multi-Stage Schedules (multiple virtual chunks per rank)
| Schedule | Paper/Year | Bubble Fraction | Peak Memory | Status |
|---|---|---|---|---|
| Interleaved 1F1B | Narayanan et al., 2021 (Megatron-LM v2) | (P-1)/(M*V+P-1) | O(P) | ✅ Done |
| Looped BFS | Lamy-Poirier, 2022 | Similar to Interleaved 1F1B | O(P) | ✅ Done |
| ZB-V | Qi et al., ICLR 2024 | ~0% (zero bubble) | Same as 1F1B | ✅ Done |
Where P = pipeline stages, M = micro-batches, V = virtual stages per rank.
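As a sanity check on these formulas, a quick calculation (the values are illustrative; the ~50% and ~12-15% figures in the tables depend on the ratio of M to P):

```csharp
using System;

// Bubble-fraction formulas quoted in the tables above, evaluated for sample values.
double P = 8, M = 24, V = 2;

Console.WriteLine($"GPipe:            {(P - 1) / (P - 1 + M):P1}");     // (P-1)/(P-1+M)
Console.WriteLine($"1F1B:             {(P - 1) / (2 * M + P - 1):P1}"); // (P-1)/(2M+P-1)
Console.WriteLine($"Interleaved 1F1B: {(P - 1) / (M * V + P - 1):P1}"); // (P-1)/(M*V+P-1)
```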
Key Innovation: Backward Pass Decomposition
The Zero Bubble family (ZB-H1, ZB-H2, ZB-V) splits the backward pass into two independent computations:
- B (BackwardInput): Computes gradient w.r.t. layer input activations (needed by upstream stage)
- W (BackwardWeight): Computes gradient w.r.t. layer weights (can be deferred)
Since W is off the critical path (no other stage depends on its result), it can be scheduled to fill pipeline bubbles. This is the fundamental insight that enables zero-bubble schedules.
Traditional: F -> B (combined backward, blocks pipeline)
Zero Bubble: F -> B (activation grad, critical path) + W (weight grad, fills bubbles)
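To make the split concrete, here is a minimal sketch for a single linear layer Y = X * W, using toy double[,] helpers rather than the repo's tensor types: the B result is what the upstream stage is waiting on, while the W result only feeds the optimizer and can be deferred into a bubble slot.

```csharp
// Minimal sketch of the B/W split for a linear layer Y = X * W (illustrative only).
static class BackwardSplitSketch
{
    // B (BackwardInput): dX = dY * W^T. The upstream stage blocks on this result,
    // so it stays on the critical path.
    public static double[,] BackwardInput(double[,] dY, double[,] W)
        => MatMul(dY, Transpose(W));

    // W (BackwardWeight): dW = X^T * dY. It only needs the stashed activation X and dY,
    // and nothing downstream waits on it, so the schedule can defer it into a bubble.
    public static double[,] BackwardWeight(double[,] stashedX, double[,] dY)
        => MatMul(Transpose(stashedX), dY);

    static double[,] MatMul(double[,] a, double[,] b)
    {
        int n = a.GetLength(0), k = a.GetLength(1), m = b.GetLength(1);
        var c = new double[n, m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int t = 0; t < k; t++)
                    c[i, j] += a[i, t] * b[t, j];
        return c;
    }

    static double[,] Transpose(double[,] a)
    {
        int n = a.GetLength(0), m = a.GetLength(1);
        var t = new double[m, n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                t[j, i] = a[i, j];
        return t;
    }
}
```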
Schedule Details
1. GPipe (Done)
All forward passes, then all backward passes. Simple but ~50% bubble.
Reference: Huang et al., "GPipe", NeurIPS 2019
2. 1F1B (Done)
Interleaves forward and backward in steady state. ~12-15% bubble, O(P) memory.
Reference: Narayanan et al., "PipeDream", SOSP 2019
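For orientation, a minimal sketch of the per-stage 1F1B ordering (strings stand in for real operations; the P - 1 - stageIndex warmup count is the usual convention and an assumption here, not something this issue specifies):

```csharp
using System;
using System.Collections.Generic;

static class OneF1BSketch
{
    // Warmup forwards, then a steady state that alternates one forward with one backward,
    // then a cooldown that drains the remaining backwards.
    public static IEnumerable<string> Operations(int stageIndex, int numStages, int numMicroBatches)
    {
        int warmup = Math.Min(numStages - 1 - stageIndex, numMicroBatches);
        int f = 0, b = 0;

        for (int i = 0; i < warmup; i++) yield return $"F{f++}";   // warmup: forwards only

        while (f < numMicroBatches)                                // steady state: 1F then 1B
        {
            yield return $"F{f++}";
            yield return $"B{b++}";
        }

        while (b < numMicroBatches) yield return $"B{b++}";        // cooldown: drain backwards
    }
}
```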
3. ZB-H1 (Needed)
Like 1F1B but with B/W split. Defers W to fill bubbles while maintaining same peak memory as 1F1B. Reduces bubble to ~1/3 of 1F1B.
Key: adjusts W scheduling based on the warmup micro-batch count so that the number of in-flight micro-batches stays the same as in 1F1B.
Reference: Qi et al., "Zero Bubble Pipeline Parallelism", ICLR 2024 Spotlight
4. ZB-H2 (Needed)
Allows more in-flight microbatches than 1F1B to achieve true zero bubble. Higher peak memory but maximizes throughput.
Reference: Qi et al., 2024
5. Interleaved 1F1B (Needed)
Assigns V virtual stages (non-contiguous model chunks) to each rank. Rank i holds chunks {i, i+P, i+2P, ...}. Reduces the bubble by a factor of V compared to 1F1B.
Used in production by Megatron-LM v2 and NVIDIA NeMo.
Reference: Narayanan et al., "Efficient Large-Scale Language Model Training on GPU Clusters", SC 2021
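A small illustrative helper (not the repo's API) makes the chunk assignment explicit:

```csharp
using System;

// Interleaved assignment described above: rank i holds chunks {i, i+P, i+2P, ...},
// i.e. chunk c = rank + v * P for v in [0, V).
static int[] ChunksForRank(int rank, int pipelineSize, int virtualStagesPerRank)
{
    var chunks = new int[virtualStagesPerRank];
    for (int v = 0; v < virtualStagesPerRank; v++)
        chunks[v] = rank + v * pipelineSize;
    return chunks;
}

// P = 4, V = 2  ->  rank 1 owns chunks {1, 5}.
Console.WriteLine(string.Join(", ", ChunksForRank(rank: 1, pipelineSize: 4, virtualStagesPerRank: 2)));
```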
6. Looped BFS (Needed)
Like Interleaved 1F1B, but prioritizes the earlier stage over the earlier micro-batch: when multiple local stages have work ready, it runs all micro-batches for the earlier stage before moving on.
Reference: Lamy-Poirier, "Breadth-First Pipeline Parallelism", 2022
7. ZB-V (Needed)
Assigns exactly 2 virtual stages per rank in a V-shaped pattern. Achieves zero bubble with same memory as 1F1B. Requires M >= 2P microbatches.
The V-shape: rank i holds chunks i (ascending) and 2P-1-i (descending), forming a V pattern in the dependency graph.
Reference: Qi et al., 2024
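The same idea for the V-shaped assignment, again as an illustrative helper rather than the repo's API:

```csharp
using System;

// ZB-V assignment described above: rank i holds chunk i on the ascending pass
// and chunk 2P - 1 - i on the descending pass.
static (int Ascending, int Descending) ZbvChunksForRank(int rank, int pipelineSize)
    => (rank, 2 * pipelineSize - 1 - rank);

// P = 4  ->  rank 0 owns (0, 7); rank 3 owns (3, 4), the tip of the V.
Console.WriteLine(ZbvChunksForRank(rank: 3, pipelineSize: 4));
```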
Partitioning Strategies
| Strategy | Description | Status |
|---|---|---|
| Uniform | Even parameter split | ✅ Done |
| Load-Balanced DP | Min-max cost via dynamic programming (see the sketch below) | ✅ Done |
| Profiling-Based | Per-layer timing/memory profiling | 🚧 Future |
| Adaptive (AdaPipe) | Joint partition + recomputation optimization | 🚧 Future |
References:
- AdaPipe, ASPLOS 2024
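The Load-Balanced DP strategy above is the classic min-max contiguous partition problem. A sketch of the dynamic program, assuming per-layer costs are already available (names are illustrative, not the repo's API):

```csharp
using System;

static class MinMaxPartitionSketch
{
    // dp[p, i] = best achievable maximum stage cost when the first i layers are split
    // into p contiguous stages. Split points can be recovered by backtracking.
    public static double MinMaxStageCost(double[] layerCost, int stages)
    {
        int n = layerCost.Length;
        var prefix = new double[n + 1];
        for (int i = 0; i < n; i++) prefix[i + 1] = prefix[i] + layerCost[i];

        var dp = new double[stages + 1, n + 1];
        for (int p = 0; p <= stages; p++)
            for (int i = 0; i <= n; i++)
                dp[p, i] = double.PositiveInfinity;
        dp[0, 0] = 0;

        for (int p = 1; p <= stages; p++)
            for (int i = p; i <= n; i++)          // at least one layer per stage
                for (int j = p - 1; j < i; j++)   // last stage takes layers j .. i-1
                    dp[p, i] = Math.Min(dp[p, i], Math.Max(dp[p - 1, j], prefix[i] - prefix[j]));

        return dp[stages, n];
    }
}
```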
Activation Checkpointing
| Technique | Description | Status |
|---|---|---|
| Basic | Store every Nth activation (see the sketch below) | ✅ Done |
| Selective | Per-layer cost-based checkpoint decisions | ✅ Config exists |
| Mario | Near zero-cost recomputation via PP overlap | 🚧 Future |
| Joint (AdaPipe) | Co-optimize checkpoint + partition | 🚧 Future |
References:
- Chen et al., "Training Deep Nets with Sublinear Memory Cost", 2016
- Mario, PPoPP 2025
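A sketch of how the "store every Nth activation" policy could surface in ActivationCheckpointConfig; Enabled matches the API example further down, while the remaining members are assumptions:

```csharp
// Illustrative sketch only; Enabled is the one member shown elsewhere in this issue.
public enum RecomputeStrategy
{
    None,           // store every activation, recompute nothing
    EveryNthLayer,  // basic: checkpoint every Nth layer boundary
    Selective       // per-layer cost-based decisions
}

public sealed class ActivationCheckpointConfig
{
    public bool Enabled { get; set; }
    public RecomputeStrategy Strategy { get; set; } = RecomputeStrategy.EveryNthLayer;
    public int CheckpointEveryNLayers { get; set; } = 2;

    // Activations at checkpoint boundaries are kept; everything in between is
    // recomputed from the previous checkpoint during the backward pass.
    public bool ShouldStoreActivation(int layerIndex)
        => !Enabled || layerIndex % CheckpointEveryNLayers == 0;
}
```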
Implementation Checklist
Infrastructure
- PipelineOperationType enum: Forward, Backward
- Expand to: Forward, BackwardInput, BackwardWeight (for Zero Bubble)
- PipelineOperation class with Type, MicroBatchIndex, IsWarmup, IsCooldown
- Add VirtualStageIndex field (for Interleaved/Multi-stage schedules)
- IPipelineSchedule interface
- Add VirtualStagesPerRank property (see the sketch after this list)
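A sketch of how the expanded infrastructure types above could fit together; the type and member names come from the checklist, while the method shape is an assumption:

```csharp
using System.Collections.Generic;

public enum PipelineOperationType { Forward, BackwardInput, BackwardWeight }

public sealed class PipelineOperation
{
    public PipelineOperationType Type { get; init; }
    public int MicroBatchIndex { get; init; }
    public int VirtualStageIndex { get; init; }   // 0 for single-stage schedules
    public bool IsWarmup { get; init; }
    public bool IsCooldown { get; init; }
}

public interface IPipelineSchedule
{
    int VirtualStagesPerRank { get; }

    // Hypothetical shape: produce this rank's operation sequence for one training step.
    IReadOnlyList<PipelineOperation> GetOperations(int stageIndex, int numStages, int numMicroBatches);
}
```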
Single-Stage Schedules
- GPipeSchedule
- OneForwardOneBackwardSchedule (1F1B)
- ZeroBubbleH1Schedule (B/W split, 1/3 bubble)
- ZeroBubbleH2Schedule (B/W split, zero bubble)
Multi-Stage Schedules
- Interleaved1F1BSchedule (V virtual stages)
- LoopedBFSSchedule (stage-priority ordering)
- ZeroBubbleVSchedule (V-shaped, zero bubble)
Partitioning
- UniformPartitionStrategy
- LoadBalancedPartitionStrategy (DP-based)
Checkpointing
- ActivationCheckpointConfig
- RecomputeStrategy enum
Facade Integration
- AiModelBuilder.ConfigureDistributedTraining() accepts pipeline options
- Update for new schedule types
API Examples
// Simple: GPipe (default, backward compatible)
builder.ConfigureDistributedTraining(strategy: DistributedStrategy.PipelineParallel);

// Efficient: 1F1B with load balancing
builder.ConfigureDistributedTraining(
    strategy: DistributedStrategy.PipelineParallel,
    pipelineSchedule: new OneForwardOneBackwardSchedule(),
    pipelinePartitionStrategy: new LoadBalancedPartitionStrategy<double>(estimatedLayerSize: 1024),
    pipelineMicroBatchSize: 8);

// State-of-the-art: Zero Bubble with activation checkpointing
builder.ConfigureDistributedTraining(
    strategy: DistributedStrategy.PipelineParallel,
    pipelineSchedule: new ZeroBubbleH1Schedule(),
    pipelineCheckpointConfig: new ActivationCheckpointConfig { Enabled = true },
    pipelineMicroBatchSize: 16);

// Maximum throughput: ZB-V (zero bubble, same memory as 1F1B)
builder.ConfigureDistributedTraining(
    strategy: DistributedStrategy.PipelineParallel,
    pipelineSchedule: new ZeroBubbleVSchedule(),
    pipelineMicroBatchSize: 32);
References
Foundational
- GPipe: https://arxiv.org/abs/1811.06965 (Huang et al., 2019)
- PipeDream 1F1B: https://arxiv.org/abs/1806.03377 (Narayanan et al., 2019)
- Activation Checkpointing: https://arxiv.org/abs/1604.06174 (Chen et al., 2016)
Modern (2021-2025)
- Interleaved 1F1B / Megatron-LM v2: https://arxiv.org/abs/2104.04473 (Narayanan et al., 2021)
- Breadth-First PP (Looped BFS): https://arxiv.org/abs/2211.05953 (Lamy-Poirier, 2022)
- Zero Bubble PP (ZB-H1, ZB-H2, ZB-V): https://arxiv.org/abs/2401.10241 (Qi et al., ICLR 2024 Spotlight)
- Hanayo Wave-like PP: https://arxiv.org/abs/2308.15762 (SC 2023)
- AdaPipe: https://dl.acm.org/doi/10.1145/3620666.3651359 (ASPLOS 2024)
- DawnPiper: https://arxiv.org/abs/2505.05856 (2025)
- Mario Near Zero-cost Checkpointing: PPoPP 2025
- Seq1F1B: https://aclanthology.org/2025.naacl-long.454.pdf (NAACL 2025)
- Pipeline PP Survey: https://jcst.ict.ac.cn/fileup/1000-9000/PDF/JCST-2024-3-4-3872-567.pdf (JCST 2024)
- PyTorch Pipelining: https://docs.pytorch.org/docs/stable/distributed.pipelining.html