[Enhancement] Pipeline Parallelism: Production Optimizations for GPipe #463


Goal: Comprehensive Pipeline Parallelism Implementation

Implement a complete pipeline parallelism system covering all major scheduling strategies from 2016-2025, production-grade partitioning, and activation checkpointing. This aligns with PyTorch's torch.distributed.pipelining, Megatron-LM, and DeepSpeed pipeline APIs.

Related: PR #393, PR #845


Pipeline Schedule Taxonomy

Single-Stage Schedules (1 model chunk per rank)

| Schedule | Paper / Year | Bubble Fraction | Peak Memory | Status |
|----------|--------------|-----------------|-------------|--------|
| GPipe | Huang et al., 2019 | (P-1)/(P-1+M), ~50% | O(M) activations | ✅ Done |
| 1F1B | PipeDream, Narayanan et al., 2019 | (P-1)/(2M+P-1), ~12-15% | O(P) activations | ✅ Done |
| ZB-H1 | Qi et al., ICLR 2024 | ~1/3 of 1F1B | Same as 1F1B | ✅ Done |
| ZB-H2 | Qi et al., ICLR 2024 | ~0% (zero bubble) | Higher than 1F1B | ✅ Done |

Multi-Stage Schedules (multiple virtual chunks per rank)

| Schedule | Paper / Year | Bubble Fraction | Peak Memory | Status |
|----------|--------------|-----------------|-------------|--------|
| Interleaved 1F1B | Narayanan et al., 2021 (Megatron-LM v2) | (P-1)/(M*V+P-1) | O(P) | ✅ Done |
| Looped BFS | Lamy-Poirier, 2022 | Similar to Interleaved 1F1B | O(P) | ✅ Done |
| ZB-V | Qi et al., ICLR 2024 | ~0% (zero bubble) | Same as 1F1B | ✅ Done |

Where P = pipeline stages, M = micro-batches, V = virtual stages per rank.
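
As a quick numeric check, here is a small C# sketch (a hypothetical helper, not part of the proposed API) that evaluates the three bubble-fraction formulas from the tables above for P = 4 and M = 16:

```csharp
using System;

// Hypothetical helper that evaluates the bubble-fraction formulas above.
static class BubbleFractions
{
    public static double GPipe(int p, int m) => (p - 1) / (double)(p - 1 + m);
    public static double OneFOneB(int p, int m) => (p - 1) / (double)(2 * m + p - 1);
    public static double Interleaved(int p, int m, int v) => (p - 1) / (double)(m * v + p - 1);

    public static void Main()
    {
        // P = 4 stages, M = 16 micro-batches, V = 2 virtual stages per rank.
        Console.WriteLine(GPipe(4, 16));          // 3/19 ~ 0.158
        Console.WriteLine(OneFOneB(4, 16));       // 3/35 ~ 0.086
        Console.WriteLine(Interleaved(4, 16, 2)); // 3/35 ~ 0.086
    }
}
```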


Key Innovation: Backward Pass Decomposition

The Zero Bubble family (ZB-H1, ZB-H2, ZB-V) splits the backward pass into two independent computations:

- B (BackwardInput): computes the gradient w.r.t. the layer's input activations (needed by the upstream stage)
- W (BackwardWeight): computes the gradient w.r.t. the layer's weights (can be deferred)

Since W doesn't block the critical path (no stage depends on it), it can be scheduled to fill pipeline bubbles. This is the fundamental insight that enables zero-bubble schedules.

```
Traditional:  F -> B (combined backward, blocks pipeline)
Zero Bubble:  F -> B (activation grad, critical path) + W (weight grad, fills bubbles)
```
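
To make the split concrete, here is a minimal sketch of what a stage contract could look like once B and W are separated (all names here, including Tensor<T>, are illustrative placeholders rather than the repository's API):

```csharp
// Illustrative placeholder type, not the repository's tensor class.
public sealed class Tensor<T> { }

// Hypothetical stage contract showing the B/W split.
public interface IDecomposedStage<T>
{
    // F: forward pass for one micro-batch; activations are stashed locally.
    Tensor<T> Forward(Tensor<T> input, int microBatchIndex);

    // B (BackwardInput): gradient w.r.t. this stage's input activations.
    // On the critical path: the upstream stage blocks until it receives this.
    Tensor<T> BackwardInput(Tensor<T> outputGrad, int microBatchIndex);

    // W (BackwardWeight): gradient w.r.t. this stage's weights, computed from
    // values stashed during Forward/BackwardInput. No other stage depends on
    // it, so the scheduler is free to defer it into pipeline bubbles.
    void BackwardWeight(int microBatchIndex);
}
```

A scheduler can then emit BackwardWeight operations wherever a bubble would otherwise appear.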

Schedule Details

1. GPipe (Done)

All forward passes, then all backward passes. Simple but ~50% bubble.

Reference: Huang et al., "GPipe", NeurIPS 2019

2. 1F1B (Done)

Interleaves forward and backward in steady state. ~12-15% bubble, O(P) memory.

Reference: Narayanan et al., "PipeDream", SOSP 2019

3. ZB-H1 (Done)

Like 1F1B but with B/W split. Defers W to fill bubbles while maintaining same peak memory as 1F1B. Reduces bubble to ~1/3 of 1F1B.

Key: adjusts W scheduling based on the warmup microbatch count to keep the same number of in-flight microbatches as 1F1B.

Reference: Qi et al., "Zero Bubble Pipeline Parallelism", ICLR 2024 Spotlight

4. ZB-H2 (Done)

Allows more in-flight microbatches than 1F1B to achieve true zero bubble. Higher peak memory but maximizes throughput.

Reference: Qi et al., 2024

5. Interleaved 1F1B (Done)

Assigns V virtual stages (non-contiguous model chunks) to each rank. Rank i holds chunks {i, i+P, i+2P, ...}. Reduces bubble by factor of V compared to 1F1B.

Used in production by Megatron-LM v2 and NVIDIA NeMo.

Reference: Narayanan et al., "Efficient Large-Scale Language Model Training on GPU Clusters", SC 2021
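
A minimal sketch of this assignment rule (a hypothetical helper, not the library's partitioning code):

```csharp
// With P ranks and V virtual stages per rank there are P*V model chunks,
// and rank i owns {i, i+P, i+2P, ...}.
static class InterleavedAssignment
{
    public static int[] ChunksForRank(int rank, int p, int v)
    {
        var chunks = new int[v];
        for (int j = 0; j < v; j++)
            chunks[j] = rank + j * p;
        return chunks;
    }
}
// Example: P = 4, V = 2 => rank 1 owns chunks {1, 5}.
```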

6. Looped BFS (Done)

Like Interleaved 1F1B, but it prioritizes the earlier stage over the earlier microbatch: when multiple local stages have work ready, it runs all microbatches for the earlier stage first.

Reference: Lamy-Poirier, "Breadth-First Pipeline Parallelism", 2022
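
A sketch of that ordering under assumed semantics (not the repository's scheduler):

```csharp
using System.Collections.Generic;
using System.Linq;

// Looped BFS priority: when several local chunks have work ready, drain
// every micro-batch of the earlier chunk before touching the next one.
static class LoopedBfsOrdering
{
    public static IEnumerable<(int Chunk, int MicroBatch)> ForwardOrder(
        IEnumerable<int> localChunks, int microBatches)
    {
        foreach (int chunk in localChunks.OrderBy(c => c)) // earlier stage first
            for (int m = 0; m < microBatches; m++)         // then all of its
                yield return (chunk, m);                   // micro-batches
    }
}
// Rank 0 with chunks {0, 4} and M = 4 yields:
// (0,0) (0,1) (0,2) (0,3) (4,0) (4,1) (4,2) (4,3)
```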

7. ZB-V (Done)

Assigns exactly 2 virtual stages per rank in a V-shaped pattern. Achieves zero bubble with same memory as 1F1B. Requires M >= 2P microbatches.

The V-shape: rank i holds chunks i (ascending) and 2P-1-i (descending), forming a V pattern in the dependency graph.

Reference: Qi et al., 2024
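
The assignment rule, as a hypothetical helper:

```csharp
// ZB-V assigns exactly two chunks per rank: chunk i on the ascending leg
// and chunk 2P-1-i on the descending leg.
static class ZbvAssignment
{
    public static (int Ascending, int Descending) ChunksForRank(int rank, int p)
        => (rank, 2 * p - 1 - rank);
}
// Example: P = 4 => rank 0 owns (0, 7), rank 3 owns (3, 4), the tip of the V.
```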


Partitioning Strategies

| Strategy | Description | Status |
|----------|-------------|--------|
| Uniform | Even parameter split | ✅ Done |
| Load-Balanced DP | Min-max cost via dynamic programming (sketched below) | ✅ Done |
| Profiling-Based | Per-layer timing/memory profiling | 🚧 Future |
| Adaptive (AdaPipe) | Joint partition + recomputation optimization | 🚧 Future |
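
For the Load-Balanced DP row, here is a minimal sketch of the min-max formulation it describes (an assumed formulation, not the internals of LoadBalancedPartitionStrategy): split per-layer costs into contiguous groups so the costliest group is as cheap as possible.

```csharp
using System;

static class MinMaxPartition
{
    public static double Solve(double[] layerCost, int stages)
    {
        int n = layerCost.Length;
        var prefix = new double[n + 1];
        for (int i = 0; i < n; i++) prefix[i + 1] = prefix[i] + layerCost[i];

        // dp[k, i] = best achievable max group cost when the first i layers
        // are split into k contiguous groups.
        var dp = new double[stages + 1, n + 1];
        for (int k = 0; k <= stages; k++)
            for (int i = 0; i <= n; i++)
                dp[k, i] = double.PositiveInfinity;
        dp[0, 0] = 0;

        for (int k = 1; k <= stages; k++)
            for (int i = k; i <= n; i++)
                for (int j = k - 1; j < i; j++)  // layers j..i-1 form group k
                    dp[k, i] = Math.Min(dp[k, i],
                        Math.Max(dp[k - 1, j], prefix[i] - prefix[j]));

        return dp[stages, n];
    }
}
```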


Activation Checkpointing

| Technique | Description | Status |
|-----------|-------------|--------|
| Basic | Store every Nth activation (sketched below) | ✅ Done |
| Selective | Per-layer cost-based checkpoint decisions | ✅ Config exists |
| Mario | Near zero-cost recomputation via PP overlap | 🚧 Future |
| Joint (AdaPipe) | Co-optimize checkpoint + partition | 🚧 Future |
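
For the Basic row, a minimal sketch of the every-Nth policy (illustrative names only):

```csharp
// Keep every Nth layer's activation and recompute the rest during backward,
// trading extra compute for O(L/N) stored activations instead of O(L).
static class BasicCheckpointing
{
    public static bool KeepActivation(int layerIndex, int everyNth)
        => layerIndex % everyNth == 0;
}
```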


Implementation Checklist

Infrastructure

- PipelineOperationType enum: Forward, Backward
  - Expand to: Forward, BackwardInput, BackwardWeight (for Zero Bubble)
- PipelineOperation class with Type, MicroBatchIndex, IsWarmup, IsCooldown
  - Add VirtualStageIndex field (for Interleaved/multi-stage schedules)
- IPipelineSchedule interface (see the sketch after this list)
  - Add VirtualStagesPerRank property
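
A sketch of how these pieces could fit together. Only the names called out above come from this issue; member shapes and the GetOperations signature are assumptions:

```csharp
using System.Collections.Generic;

public enum PipelineOperationType { Forward, BackwardInput, BackwardWeight }

public sealed class PipelineOperation
{
    public PipelineOperationType Type { get; init; }
    public int MicroBatchIndex { get; init; }
    public int VirtualStageIndex { get; init; } // new: multi-stage schedules
    public bool IsWarmup { get; init; }
    public bool IsCooldown { get; init; }
}

public interface IPipelineSchedule
{
    int VirtualStagesPerRank { get; }           // new property
    IEnumerable<PipelineOperation> GetOperations(
        int rank, int totalStages, int microBatchCount); // assumed signature
}
```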

Single-Stage Schedules

- GPipeSchedule
- OneForwardOneBackwardSchedule (1F1B)
- ZeroBubbleH1Schedule (B/W split, 1/3 bubble)
- ZeroBubbleH2Schedule (B/W split, zero bubble)

Multi-Stage Schedules

- Interleaved1F1BSchedule (V virtual stages)
- LoopedBFSSchedule (stage-priority ordering)
- ZeroBubbleVSchedule (V-shaped, zero bubble)

Partitioning

- UniformPartitionStrategy
- LoadBalancedPartitionStrategy (DP-based)

Checkpointing

- ActivationCheckpointConfig
- RecomputeStrategy enum

Facade Integration

- AiModelBuilder.ConfigureDistributedTraining() accepts pipeline options
- Update for new schedule types

API Examples

```csharp
// Simple: GPipe (default, backward compatible)
builder.ConfigureDistributedTraining(strategy: DistributedStrategy.PipelineParallel);

// Efficient: 1F1B with load balancing
builder.ConfigureDistributedTraining(
    strategy: DistributedStrategy.PipelineParallel,
    pipelineSchedule: new OneForwardOneBackwardSchedule(),
    pipelinePartitionStrategy: new LoadBalancedPartitionStrategy<double>(estimatedLayerSize: 1024),
    pipelineMicroBatchSize: 8);

// State-of-the-art: Zero Bubble with activation checkpointing
builder.ConfigureDistributedTraining(
    strategy: DistributedStrategy.PipelineParallel,
    pipelineSchedule: new ZeroBubbleH1Schedule(),
    pipelineCheckpointConfig: new ActivationCheckpointConfig { Enabled = true },
    pipelineMicroBatchSize: 16);

// Maximum throughput: ZB-V (zero bubble, same memory as 1F1B)
builder.ConfigureDistributedTraining(
    strategy: DistributedStrategy.PipelineParallel,
    pipelineSchedule: new ZeroBubbleVSchedule(),
    pipelineMicroBatchSize: 32);
```

References

Foundational

- Huang et al., "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism", NeurIPS 2019
- Narayanan et al., "PipeDream: Generalized Pipeline Parallelism for DNN Training", SOSP 2019

Modern (2021-2025)

- Narayanan et al., "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM", SC 2021
- Lamy-Poirier, "Breadth-First Pipeline Parallelism", 2022
- Qi et al., "Zero Bubble Pipeline Parallelism", ICLR 2024
