
Conversation

@JacoCheung (Collaborator) commented Jan 20, 2026

Description

Addresses #207.

This PR adds a batch shuffler that balances workloads across the data parallel group. Users can set TrainerArgs.enable_balanced_shuffler = True to enable this feature. The workload metric is purely GEMM compute (FLOPs).
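For reference, opting in might look like the following minimal sketch; TrainerArgs here is a stand-in dataclass showing only the flag discussed in this PR, not the repo's actual definition:

    from dataclasses import dataclass

    # Stand-in for the repo's TrainerArgs; only the flag relevant here is shown.
    @dataclass
    class TrainerArgs:
        enable_balanced_shuffler: bool = False  # workload balancing is opt-in

    args = TrainerArgs(enable_balanced_shuffler=True)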

Breaking changes:

  1. RetrievalBatch and RankingBatch are consolidated into one; both subclass BaseBatch.
    The BaseBatch definition is shown below. In the future, a model-specific batch should subclass this class.
@dataclass
class BaseBatch(Pipelineable):
    """
    All tensors must share the same batch size.
    """

    features: KeyedJaggedTensor
    batch_size: int  # local batch size
    feature_to_max_seqlen: Dict[str, int]

    contextual_feature_names: List[str] = field(default_factory=list)
    # when labels is a tensor, it means the labels can be reshaped to [actual_batch_size, ...] and selected along the batch dimension.
    labels: Union[KeyedJaggedTensor, torch.Tensor] = None
    actual_batch_size: Optional[int] = None  # in case of padding.
  2. The labels in the HSTU batch are changed to KeyedJaggedTensor for easier shuffling (see the sketch after this list).

  3. A batch shuffler is added to the train pipeline, immediately following the H2D transfer.

  4. The HSTU dataset arg max_sequence_length is renamed to max_history_seqlen; the full sequence length is max_history_seqlen * 2 + max_num_candidates * 2 + num_contextual_features.

  5. Datasets are moved under examples/commons/datasets.
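As an illustration of breaking change 2, labels carried as a KeyedJaggedTensor could be built like this. This is a minimal sketch with made-up values; only the torchrec API is real, the key names and shapes are not taken from the PR:

    import torch
    from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

    # Labels as a KeyedJaggedTensor: one key per label type, a variable number of
    # label values per sample, so shuffling can index-select labels with the same
    # jagged machinery used for the input features. Values below are illustrative.
    labels = KeyedJaggedTensor.from_lengths_sync(
        keys=["label"],
        values=torch.tensor([1.0, 0.0, 1.0]),
        lengths=torch.tensor([2, 1]),  # sample 0 carries 2 label values, sample 1 carries 1
    )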

CI

CI

@JacoCheung JacoCheung changed the title Junzhang/workload balancer [Draft] workload balancer Jan 21, 2026
@JacoCheung JacoCheung mentioned this pull request Jan 21, 2026
@JacoCheung JacoCheung changed the title [Draft] workload balancer workload balancer Jan 22, 2026
@JacoCheung JacoCheung requested a review from shijieliu January 22, 2026 10:12
@JacoCheung JacoCheung changed the title workload balancer [Draft]workload balancer Jan 22, 2026
greptile-apps bot commented Jan 27, 2026

Greptile Summary

This PR implements workload balancing for data parallel training and refactors the datasets folder structure. The main changes include:

Core Features:

  • Added batch shuffler infrastructure using the Karmarkar-Karp algorithm to balance computational workloads (GEMM FLOPs) across data parallel ranks
  • Implemented factory patterns for both BatchShufflerFactory and TrainPipelineFactory to support extensible registration
  • Consolidated RetrievalBatch and RankingBatch into unified BaseBatch with HSTUBatch subclass, changing labels from Tensor to KeyedJaggedTensor for easier shuffling
  • Moved datasets from examples/hstu/datasets and examples/sid_gr/datasets to examples/commons/datasets

Breaking Changes:

  • Dataset argument max_sequence_length renamed to max_history_seqlen for HSTU datasets
  • Batch classes consolidated - old RetrievalBatch/RankingBatch replaced by BaseBatch/HSTUBatch
  • Labels in HSTU batches are now KeyedJaggedTensor instead of torch.Tensor

Key Implementation Details:

  • Batch shuffling happens after H2D transfer in the memcpy stream
  • Workload calculation is based on attention complexity: HSTU uses 4 projections (QKVU), SID-GR uses standard self-attention (3 projections); see the sketch after this list
  • Factory pattern allows models to register custom batch shufflers and pipelines
  • Opt-in via TrainerArgs.enable_balanced_shuffler = True
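To make the workload bullet above concrete, here is a hedged sketch of a per-sample GEMM FLOP estimate. It is not the PR's perf model; the function name, constants, and example values are illustrative only:

    def estimate_sample_flops(seqlen: int, hidden: int, num_proj: int) -> float:
        """Very rough per-sample GEMM FLOP estimate for one attention block.

        num_proj is 4 for HSTU (Q/K/V/U projections) and 3 for standard
        self-attention; a [m, k] x [k, n] matmul costs roughly 2*m*k*n FLOPs.
        """
        proj_flops = num_proj * 2 * seqlen * hidden * hidden   # input projections
        attn_flops = 2 * (2 * seqlen * seqlen * hidden)        # Q @ K^T and scores @ V
        out_flops = 2 * seqlen * hidden * hidden               # output projection
        return float(proj_flops + attn_flops + out_flops)

    # Example: a long sample dominates the per-rank workload
    print(estimate_sample_flops(seqlen=512, hidden=1024, num_proj=4))
    print(estimate_sample_flops(seqlen=128, hidden=1024, num_proj=4))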

Issues Found:

  • Potential logic issue in BaseBatch.index_select (line 137) where actual_batch_size calculation after shuffling may not correctly handle padding indices from the global batch context
  • Equal-size partitioning constraint may fail on incomplete batches where global batch size is not evenly divisible by number of ranks
  • Workload gathering for incomplete batches needs verification that padding samples have zero/minimal workloads

Confidence Score: 4/5

  • This PR is mostly safe to merge with minor concerns around incomplete batch handling
  • The implementation is well-structured with comprehensive tests and follows good design patterns. However, there's a logic issue in BaseBatch.index_select around actual_batch_size calculation after shuffling, and potential edge cases with incomplete batches in the partitioning algorithm that should be verified before production use
  • examples/commons/sequence_batch/batch.py requires attention for the actual_batch_size calculation after index_select when shuffling is enabled

Important Files Changed

  • examples/commons/distributed/batch_shuffler.py: new core batch shuffler implementation using the Karmarkar-Karp algorithm for workload balancing across data parallel ranks
  • examples/commons/perf_model/partitioner.py: Karmarkar-Karp partitioning algorithm implementation for equal-sized partition balancing (a simplified sketch follows this list)
  • examples/commons/sequence_batch/batch.py: new base batch class with support for index_select, allgather operations, and incomplete batch handling
  • examples/commons/datasets/hstu_batch.py: HSTU-specific batch class consolidating retrieval and ranking batches, with KeyedJaggedTensor labels support
  • examples/commons/pipeline/train_pipeline.py: integrated the batch shuffler into the training pipeline; added _batch_shuffle method in copy_batch_to_gpu_and_shuffle
  • examples/hstu/utils/hstu_batch_balancer.py: HSTU-specific batch balancer using HSTUAttentionTask to calculate workloads based on sequence length
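For intuition about what the partitioner above does, the following greedy approximation (largest workload first into the lightest partition that still has room) stands in for the actual Karmarkar-Karp implementation; the function name is hypothetical:

    import heapq
    from typing import List

    def greedy_equal_size_partition(workloads: List[float], k: int) -> List[List[int]]:
        """Greedy stand-in for the repo's Karmarkar-Karp partitioner.

        Assigns len(workloads) sample indices to k equal-sized partitions:
        heaviest sample first, always into the currently lightest partition
        that still has room.
        """
        assert len(workloads) % k == 0, "equal-size constraint"
        cap = len(workloads) // k
        parts: List[List[int]] = [[] for _ in range(k)]
        heap = [(0.0, p) for p in range(k)]  # (current load, partition id)
        heapq.heapify(heap)
        for i in sorted(range(len(workloads)), key=lambda j: workloads[j], reverse=True):
            load, p = heapq.heappop(heap)    # lightest partition with remaining room
            parts[p].append(i)
            if len(parts[p]) < cap:          # only re-insert partitions that still have room
                heapq.heappush(heap, (load + workloads[i], p))
        return parts

    # Each rank would then take sorted(parts[rank]), keeping padding indices at the tail.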

Sequence Diagram

sequenceDiagram
    participant DataLoader
    participant TrainPipeline
    participant BatchShuffler
    participant AllGather
    participant Partitioner
    participant Model

    DataLoader->>TrainPipeline: next(batch)
    TrainPipeline->>TrainPipeline: _to_device (H2D transfer)
    
    alt Balanced Shuffler Enabled
        TrainPipeline->>BatchShuffler: shuffle(batch, pg_group)
        BatchShuffler->>BatchShuffler: get_workloads(batch)
        Note over BatchShuffler: Calculate FLOPs based on<br/>sequence length & attention params
        BatchShuffler->>AllGather: gather_along_first_dim(workloads)
        AllGather-->>BatchShuffler: global_workloads
        BatchShuffler->>Partitioner: karmarkar_karp(workloads, num_ranks)
        Note over Partitioner: Partition with equal size<br/>constraint for balanced load
        Partitioner-->>BatchShuffler: partition_indices[rank]
        BatchShuffler->>BatchShuffler: sort(indices)
        Note over BatchShuffler: Sorting ensures padding<br/>indices at the end
        BatchShuffler->>AllGather: allgather_batch(batch)
        AllGather-->>BatchShuffler: global_batch
        BatchShuffler->>BatchShuffler: index_select(indices)
        BatchShuffler-->>TrainPipeline: shuffled_batch
    else Identity Shuffler
        Note over TrainPipeline: No shuffling, batch unchanged
    end
    
    TrainPipeline->>TrainPipeline: start_sparse_data_dist
    TrainPipeline->>Model: forward(batch)
    Model-->>TrainPipeline: loss, output
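Read as code, the shuffle branch of the diagram is roughly the sketch below. It reuses the greedy partitioner sketched earlier, assumes equal local batch sizes across ranks, and the allgather/index_select helpers on the batch object are assumptions about BaseBatch rather than its real signatures:

    import torch
    import torch.distributed as dist

    def balanced_shuffle(batch, workloads: torch.Tensor, pg: dist.ProcessGroup):
        """Hedged rendering of the shuffle step in the diagram; real signatures differ."""
        world = dist.get_world_size(pg)
        rank = dist.get_rank(pg)

        # 1. Every rank learns every sample's workload.
        gathered = [torch.empty_like(workloads) for _ in range(world)]
        dist.all_gather(gathered, workloads, group=pg)
        global_workloads = torch.cat(gathered)

        # 2. Partition the global batch into `world` equal-sized, load-balanced groups
        #    (the repo uses a Karmarkar-Karp partitioner here; see the greedy sketch above).
        parts = greedy_equal_size_partition(global_workloads.tolist(), world)

        # 3. Sort so padding indices land at the tail of this rank's slice.
        my_indices = torch.tensor(sorted(parts[rank]), device=workloads.device)

        # 4. Materialize the global batch and select this rank's samples.
        global_batch = batch.allgather(pg)            # assumed BaseBatch helper
        return global_batch.index_select(my_indices)  # assumed BaseBatch helper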

@greptile-apps greptile-apps bot left a comment

53 files reviewed, 3 comments

greptile-apps bot commented Jan 27, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

@JacoCheung JacoCheung changed the title [Draft]workload balancer workload balancer and datasets folder refactor Jan 27, 2026
@JacoCheung JacoCheung force-pushed the junzhang/workload_balancer branch from 1330adb to 4aad2bc Compare January 27, 2026 08:40
@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments

cd <root-to-repo>/examples/hstu &&
mkdir -p ./tmp_data && python3 ./preprocessor.py --dataset_name <"ml-1m"|"ml-20m"|"kuairand-pure"|"kuairand-1k"|"kuairand-27k">
cd <root-to-repo>/examples/commons &&
mkdir -p ./tmp_data && python3 ./hstu_data_preprocessor.py --dataset_name <"ml-1m"|"ml-20m"|"kuairand-pure"|"kuairand-1k"|"kuairand-27k">

Collaborator:

Is this needed? Can we keep data_preprocessor.py as before?

Collaborator Author (JacoCheung):

I'm not sure. Now that we have sid and hstu datasets, I assume they will not share the preprocessor (though we don't have a sid preprocessor yet).

Or we can consolidate them into one?

@@ -0,0 +1,187 @@
import sys

Collaborator:

Do we add the corresponding part in CI?

Collaborator Author (JacoCheung):

Yes, I did create an MR. See the 2nd CI line above in the description.

@shijieliu (Collaborator) commented Jan 27, 2026

This PR does not include the refactoring of modules/ops; I assume that will be another PR. Should #237 be merged after this one, or after the modules/ops refactor PR? @JacoCheung @geoffreyQiu

@JacoCheung (Collaborator, Author) commented Jan 27, 2026

I would say we might land this one first, and @geoffreyQiu follows. Otherwise his CI will fail due to a package name conflict.

@shijieliu

# 1. use jagged split to get [history_embs, candidate_embs]
# 2. use cat to interleave the history_embs and history_action_embs part
# 3. use jagged concat to append the candidate_embs

Collaborator Author (JacoCheung):

I happened to see the for-loop concat, and I think there is a chance to optimize it with the steps I commented above. Just for reference, @geoffreyQiu.
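For reference, a toy dense-tensor version of steps 2 and 3 (step 1, the jagged split into history and candidate parts, is assumed to have already happened; tensor names and sizes are made up, and the real code operates on jagged, batched tensors):

    import torch

    hidden = 8
    history_embs = torch.randn(5, hidden)         # 5 history items
    history_action_embs = torch.randn(5, hidden)  # their paired action embeddings
    candidate_embs = torch.randn(3, hidden)       # 3 candidates

    # Step 2: interleave -> [item0, act0, item1, act1, ...]
    interleaved = torch.stack((history_embs, history_action_embs), dim=1).reshape(-1, hidden)

    # Step 3: append the candidate part
    full_seq = torch.cat((interleaved, candidate_embs), dim=0)  # shape: [5 * 2 + 3, hidden]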

@greptile-apps greptile-apps bot left a comment

81 files reviewed, 4 comments

Comment on lines +137 to +141

    if self.actual_batch_size != self.batch_size:
        # we use the actual batch size to filter the indices.
        actual_batch_size = indices[indices < self.actual_batch_size].numel()
    else:
        actual_batch_size = indices.numel()

logic: The calculation indices[indices < self.actual_batch_size].numel() filters indices to compute the new actual_batch_size. However, after shuffling across ranks, indices reference the global batch, not the local batch. The comparison indices < self.actual_batch_size uses the old local actual_batch_size, which may not correctly identify valid vs padded samples in the global context.

Suggested change:

    if self.actual_batch_size != self.batch_size:
        # Filter indices to count only valid (non-padded) samples from the global batch
        # actual_batch_size from allgather represents total valid samples across all ranks
        actual_batch_size = indices.numel()
        # Count how many selected indices are below the global actual_batch_size
        if hasattr(self, '_global_actual_batch_size'):
            actual_batch_size = indices[indices < self._global_actual_batch_size].numel()
    else:
        actual_batch_size = indices.numel()

Comment on lines +36 to +37

    # 1. Allgather the workloads
    allgather_workloads = gather_along_first_dim(workloads, pg_group)

style: gather_along_first_dim is called on workloads before partitioning. For incomplete batches where actual_batch_size < batch_size, verify that workloads for padded samples are set to 0 or minimal values to avoid incorrect partitioning.

Comment on lines +123 to +125

    assert (
        len(workloads) % k_partitions == 0
    ), f"{len(workloads)} % {k_partitions} != 0"

style: The assertion len(workloads) % k_partitions == 0 will fail for incomplete batches where the global batch size is not evenly divisible by the number of ranks. Verify this is handled correctly by the caller, or handle incomplete batches differently.

Comment on lines 400 to +402

    batch = _to_device(batch, self._device, non_blocking=True)
    # TODO@junzhang, there are cpu ops / nccl comm and lots of sync in shuffle.
    batch = self._batch_shuffle(batch)

style: Batch shuffling with allgather and CPU-based partitioning happens in the memcpy stream. The comment mentions "lots of sync"; verify this doesn't significantly impact pipeline overlap, especially with large batches or high DP world sizes.
