Goal: Comprehensive Pipeline Parallelism Implementation
Implement a complete pipeline parallelism system covering all major scheduling strategies from 2016-2025, production-grade partitioning, and activation checkpointing. This aligns with PyTorch's torch.distributed.pipelining, Megatron-LM, and DeepSpeed pipeline APIs.
Pipeline Schedule Taxonomy
Single-Stage Schedules (1 model chunk per rank)
| Schedule | Paper/Year | Bubble Fraction | Peak Memory | Status |
|---|---|---|---|---|
| GPipe | Huang et al., 2019 | (P-1)/(P-1+M) ~50% | O(M) activations | ✅ Done |
| 1F1B | PipeDream, Narayanan 2019 | (P-1)/(2M+P-1) ~12-15% | O(P) activations | ✅ Done |
| ZB-H1 | Qi et al., ICLR 2024 | ~1/3 of 1F1B | Same as 1F1B | ✅ Done |
| ZB-H2 | Qi et al., ICLR 2024 | ~0% (zero bubble) | Higher than 1F1B | ✅ Done |
Multi-Stage Schedules (multiple virtual chunks per rank)
| Schedule | Paper/Year | Bubble Fraction | Peak Memory | Status |
|---|---|---|---|---|
| Interleaved 1F1B | Narayanan et al., 2021 (Megatron-LM v2) | (P-1)/(M*V+P-1) | O(P) | ✅ Done |
| Looped BFS | Lamy-Poirier, 2022 | Similar to Interleaved 1F1B | O(P) | ✅ Done |
| ZB-V | Qi et al., ICLR 2024 | ~0% (zero bubble) | Same as 1F1B | ✅ Done |
Where P = pipeline stages, M = micro-batches, V = virtual stages per rank.
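As a sanity check on these formulas, a quick calculation (the values are illustrative; the ~50% and ~12-15% figures in the tables depend on the ratio of M to P):

```csharp
using System;

// Bubble-fraction formulas quoted in the tables above, evaluated for sample values.
double P = 8, M = 24, V = 2;

Console.WriteLine($"GPipe:            {(P - 1) / (P - 1 + M):P1}");     // (P-1)/(P-1+M)
Console.WriteLine($"1F1B:             {(P - 1) / (2 * M + P - 1):P1}"); // (P-1)/(2M+P-1)
Console.WriteLine($"Interleaved 1F1B: {(P - 1) / (M * V + P - 1):P1}"); // (P-1)/(M*V+P-1)
```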
Key Innovation: Backward Pass Decomposition
The Zero Bubble family (ZB-H1, ZB-H2, ZB-V) splits the backward pass into two independent computations:
- B (BackwardInput): Computes gradient w.r.t. layer input activations (needed by upstream stage)
- W (BackwardWeight): Computes gradient w.r.t. layer weights (can be deferred)
Since W is off the critical path (no other stage depends on its result), it can be scheduled to fill pipeline bubbles. This is the fundamental insight that enables zero-bubble schedules.
Traditional: F -> B (combined backward, blocks pipeline)
Zero Bubble: F -> B (activation grad, critical path) + W (weight grad, fills bubbles)
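To make the split concrete, here is a minimal sketch for a single linear layer Y = X * W, using toy double[,] helpers rather than the repo's tensor types: the B result is what the upstream stage is waiting on, while the W result only feeds the optimizer and can be deferred into a bubble slot.

```csharp
// Minimal sketch of the B/W split for a linear layer Y = X * W (illustrative only).
static class BackwardSplitSketch
{
    // B (BackwardInput): dX = dY * W^T. The upstream stage blocks on this result,
    // so it stays on the critical path.
    public static double[,] BackwardInput(double[,] dY, double[,] W)
        => MatMul(dY, Transpose(W));

    // W (BackwardWeight): dW = X^T * dY. It only needs the stashed activation X and dY,
    // and nothing downstream waits on it, so the schedule can defer it into a bubble.
    public static double[,] BackwardWeight(double[,] stashedX, double[,] dY)
        => MatMul(Transpose(stashedX), dY);

    static double[,] MatMul(double[,] a, double[,] b)
    {
        int n = a.GetLength(0), k = a.GetLength(1), m = b.GetLength(1);
        var c = new double[n, m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int t = 0; t < k; t++)
                    c[i, j] += a[i, t] * b[t, j];
        return c;
    }

    static double[,] Transpose(double[,] a)
    {
        int n = a.GetLength(0), m = a.GetLength(1);
        var t = new double[m, n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                t[j, i] = a[i, j];
        return t;
    }
}
```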
Schedule Details
1. GPipe (Done)
All forward passes, then all backward passes. Simple but ~50% bubble.
Reference: Huang et al., "GPipe", NeurIPS 2019
2. 1F1B (Done)
Interleaves forward and backward in steady state. ~12-15% bubble, O(P) memory.
Reference: Narayanan et al., "PipeDream", SOSP 2019
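For orientation, a minimal sketch of the per-stage 1F1B ordering (strings stand in for real operations; the P - 1 - stageIndex warmup count is the usual convention and an assumption here, not something this issue specifies):

```csharp
using System;
using System.Collections.Generic;

static class OneF1BSketch
{
    // Warmup forwards, then a steady state that alternates one forward with one backward,
    // then a cooldown that drains the remaining backwards.
    public static IEnumerable<string> Operations(int stageIndex, int numStages, int numMicroBatches)
    {
        int warmup = Math.Min(numStages - 1 - stageIndex, numMicroBatches);
        int f = 0, b = 0;

        for (int i = 0; i < warmup; i++) yield return $"F{f++}";   // warmup: forwards only

        while (f < numMicroBatches)                                // steady state: 1F then 1B
        {
            yield return $"F{f++}";
            yield return $"B{b++}";
        }

        while (b < numMicroBatches) yield return $"B{b++}";        // cooldown: drain backwards
    }
}
```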
3. ZB-H1 (Needed)
Like 1F1B but with B/W split. Defers W to fill bubbles while maintaining same peak memory as 1F1B. Reduces bubble to ~1/3 of 1F1B.
Key: adjusts W scheduling based on the warmup micro-batch count so that the number of in-flight micro-batches stays the same as in 1F1B.
Reference: Qi et al., "Zero Bubble Pipeline Parallelism", ICLR 2024 Spotlight
4. ZB-H2 (Needed)
Allows more in-flight microbatches than 1F1B to achieve true zero bubble. Higher peak memory but maximizes throughput.
Reference: Qi et al., 2024
5. Interleaved 1F1B (Needed)
Assigns V virtual stages (non-contiguous model chunks) to each rank. Rank i holds chunks {i, i+P, i+2P, ...}. Reduces the bubble by a factor of V compared to 1F1B.
Used in production by Megatron-LM v2 and NVIDIA NeMo.
Reference: Narayanan et al., "Efficient Large-Scale Language Model Training on GPU Clusters", SC 2021
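A small illustrative helper (not the repo's API) makes the chunk assignment explicit:

```csharp
using System;

// Interleaved assignment described above: rank i holds chunks {i, i+P, i+2P, ...},
// i.e. chunk c = rank + v * P for v in [0, V).
static int[] ChunksForRank(int rank, int pipelineSize, int virtualStagesPerRank)
{
    var chunks = new int[virtualStagesPerRank];
    for (int v = 0; v < virtualStagesPerRank; v++)
        chunks[v] = rank + v * pipelineSize;
    return chunks;
}

// P = 4, V = 2  ->  rank 1 owns chunks {1, 5}.
Console.WriteLine(string.Join(", ", ChunksForRank(rank: 1, pipelineSize: 4, virtualStagesPerRank: 2)));
```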
6. Looped BFS (Needed)
Like Interleaved 1F1B, but prioritizes the earlier stage over the earlier micro-batch: when multiple local stages have work ready, it runs all micro-batches for the earlier stage before moving on.
Reference: Lamy-Poirier, "Breadth-First Pipeline Parallelism", 2022
7. ZB-V (Needed)
Assigns exactly 2 virtual stages per rank in a V-shaped pattern. Achieves zero bubble with same memory as 1F1B. Requires M >= 2P microbatches.
The V-shape: rank i holds chunks i (ascending) and 2P-1-i (descending), forming a V pattern in the dependency graph.
Reference: Qi et al., 2024
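The same idea for the V-shaped assignment, again as an illustrative helper rather than the repo's API:

```csharp
using System;

// ZB-V assignment described above: rank i holds chunk i on the ascending pass
// and chunk 2P - 1 - i on the descending pass.
static (int Ascending, int Descending) ZbvChunksForRank(int rank, int pipelineSize)
    => (rank, 2 * pipelineSize - 1 - rank);

// P = 4  ->  rank 0 owns (0, 7); rank 3 owns (3, 4), the tip of the V.
Console.WriteLine(ZbvChunksForRank(rank: 3, pipelineSize: 4));
```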
Partitioning Strategies
| Strategy | Description | Status |
|---|---|---|
| Uniform | Even parameter split | ✅ Done |
| Load-Balanced DP | Min-max cost via dynamic programming (see the sketch below) | ✅ Done |
| Profiling-Based | Per-layer timing/memory profiling | 🚧 Future |
| Adaptive (AdaPipe) | Joint partition + recomputation optimization | 🚧 Future |
References:
- AdaPipe, ASPLOS 2024
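The Load-Balanced DP strategy above is the classic min-max contiguous partition problem. A sketch of the dynamic program, assuming per-layer costs are already available (names are illustrative, not the repo's API):

```csharp
using System;

static class MinMaxPartitionSketch
{
    // dp[p, i] = best achievable maximum stage cost when the first i layers are split
    // into p contiguous stages. Split points can be recovered by backtracking.
    public static double MinMaxStageCost(double[] layerCost, int stages)
    {
        int n = layerCost.Length;
        var prefix = new double[n + 1];
        for (int i = 0; i < n; i++) prefix[i + 1] = prefix[i] + layerCost[i];

        var dp = new double[stages + 1, n + 1];
        for (int p = 0; p <= stages; p++)
            for (int i = 0; i <= n; i++)
                dp[p, i] = double.PositiveInfinity;
        dp[0, 0] = 0;

        for (int p = 1; p <= stages; p++)
            for (int i = p; i <= n; i++)          // at least one layer per stage
                for (int j = p - 1; j < i; j++)   // last stage takes layers j .. i-1
                    dp[p, i] = Math.Min(dp[p, i], Math.Max(dp[p - 1, j], prefix[i] - prefix[j]));

        return dp[stages, n];
    }
}
```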
Activation Checkpointing
| Technique | Description | Status |
|---|---|---|
| Basic | Store every Nth activation (see the sketch below) | ✅ Done |
| Selective | Per-layer cost-based checkpoint decisions | ✅ Config exists |
| Mario | Near zero-cost recomputation via PP overlap | 🚧 Future |
| Joint (AdaPipe) | Co-optimize checkpoint + partition | 🚧 Future |
References:
- Chen et al., "Training Deep Nets with Sublinear Memory Cost", 2016
- Mario, PPoPP 2025
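A sketch of how the "store every Nth activation" policy could surface in ActivationCheckpointConfig; Enabled matches the API example further down, while the remaining members are assumptions:

```csharp
// Illustrative sketch only; Enabled is the one member shown elsewhere in this issue.
public enum RecomputeStrategy
{
    None,           // store every activation, recompute nothing
    EveryNthLayer,  // basic: checkpoint every Nth layer boundary
    Selective       // per-layer cost-based decisions
}

public sealed class ActivationCheckpointConfig
{
    public bool Enabled { get; set; }
    public RecomputeStrategy Strategy { get; set; } = RecomputeStrategy.EveryNthLayer;
    public int CheckpointEveryNLayers { get; set; } = 2;

    // Activations at checkpoint boundaries are kept; everything in between is
    // recomputed from the previous checkpoint during the backward pass.
    public bool ShouldStoreActivation(int layerIndex)
        => !Enabled || layerIndex % CheckpointEveryNLayers == 0;
}
```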
Implementation Checklist
Infrastructure
- PipelineOperationType enum: Forward, Backward
- Expand to: Forward, BackwardInput, BackwardWeight (for Zero Bubble)
- PipelineOperation class with Type, MicroBatchIndex, IsWarmup, IsCooldown
- Add VirtualStageIndex field (for Interleaved/Multi-stage schedules)
- IPipelineSchedule interface
- Add VirtualStagesPerRank property (see the sketch after this list)
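A sketch of how the expanded infrastructure types above could fit together; the type and member names come from the checklist, while the method shape is an assumption:

```csharp
using System.Collections.Generic;

public enum PipelineOperationType { Forward, BackwardInput, BackwardWeight }

public sealed class PipelineOperation
{
    public PipelineOperationType Type { get; init; }
    public int MicroBatchIndex { get; init; }
    public int VirtualStageIndex { get; init; }   // 0 for single-stage schedules
    public bool IsWarmup { get; init; }
    public bool IsCooldown { get; init; }
}

public interface IPipelineSchedule
{
    int VirtualStagesPerRank { get; }

    // Hypothetical shape: produce this rank's operation sequence for one training step.
    IReadOnlyList<PipelineOperation> GetOperations(int stageIndex, int numStages, int numMicroBatches);
}
```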
Single-Stage Schedules
- GPipeSchedule
- OneForwardOneBackwardSchedule (1F1B)
- ZeroBubbleH1Schedule (B/W split, 1/3 bubble)
- ZeroBubbleH2Schedule (B/W split, zero bubble)
Multi-Stage Schedules
- Interleaved1F1BSchedule (V virtual stages)
- LoopedBFSSchedule (stage-priority ordering)
- ZeroBubbleVSchedule (V-shaped, zero bubble)
Partitioning
- UniformPartitionStrategy
- LoadBalancedPartitionStrategy (DP-based)
Checkpointing
- ActivationCheckpointConfig
- RecomputeStrategy enum
Facade Integration
- AiModelBuilder.ConfigureDistributedTraining() accepts pipeline options
- Update for new schedule types
API Examples
// Simple: GPipe (default, backward compatible)
builder.ConfigureDistributedTraining(strategy: DistributedStrategy.PipelineParallel);

// Efficient: 1F1B with load balancing
builder.ConfigureDistributedTraining(
    strategy: DistributedStrategy.PipelineParallel,
    pipelineSchedule: new OneForwardOneBackwardSchedule(),
    pipelinePartitionStrategy: new LoadBalancedPartitionStrategy<double>(estimatedLayerSize: 1024),
    pipelineMicroBatchSize: 8);

// State-of-the-art: Zero Bubble with activation checkpointing
builder.ConfigureDistributedTraining(
    strategy: DistributedStrategy.PipelineParallel,
    pipelineSchedule: new ZeroBubbleH1Schedule(),
    pipelineCheckpointConfig: new ActivationCheckpointConfig { Enabled = true },
    pipelineMicroBatchSize: 16);

// Maximum throughput: ZB-V (zero bubble, same memory as 1F1B)
builder.ConfigureDistributedTraining(
    strategy: DistributedStrategy.PipelineParallel,
    pipelineSchedule: new ZeroBubbleVSchedule(),
    pipelineMicroBatchSize: 32);
References
Foundational
- GPipe: https://arxiv.org/abs/1811.06965 (Huang et al., 2019)
- PipeDream 1F1B: https://arxiv.org/abs/1806.03377 (Narayanan et al., 2019)
- Activation Checkpointing: https://arxiv.org/abs/1604.06174 (Chen et al., 2016)
Modern (2021-2025)
- Interleaved 1F1B / Megatron-LM v2: https://arxiv.org/abs/2104.04473 (Narayanan et al., 2021)
- Breadth-First PP (Looped BFS): https://arxiv.org/abs/2211.05953 (Lamy-Poirier, 2022)
- Zero Bubble PP (ZB-H1, ZB-H2, ZB-V): https://arxiv.org/abs/2401.10241 (Qi et al., ICLR 2024 Spotlight)
- Hanayo Wave-like PP: https://arxiv.org/abs/2308.15762 (SC 2023)
- AdaPipe: https://dl.acm.org/doi/10.1145/3620666.3651359 (ASPLOS 2024)
- DawnPiper: https://arxiv.org/abs/2505.05856 (2025)
- Mario Near Zero-cost Checkpointing: PPoPP 2025
- Seq1F1B: https://aclanthology.org/2025.naacl-long.454.pdf (NAACL 2025)
- Pipeline PP Survey: https://jcst.ict.ac.cn/fileup/1000-9000/PDF/JCST-2024-3-4-3872-567.pdf (JCST 2024)
- PyTorch Pipelining: https://docs.pytorch.org/docs/stable/distributed.pipelining.html