
Summary

This PR fixes out-of-memory issues when training SetFit on large datasets by replacing eager O(n²) pair generation with a streaming approach.

Problem

Training with contrastive loss (e.g., CosineSimilarityLoss) on sufficiently large datasets runs out of memory before training even starts. The root cause is three layers of O(n²) memory allocation (sketched after the list):

  1. shuffle_combinations() creates all pair indices upfront
  2. ContrastiveDataset stores all pairs in pos_pairs/neg_pairs lists
  3. Dataset.from_list(list(...)) materializes the iterator again
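For context, a minimal sketch of the eager pattern (illustrative only, not the exact library code):

```python
import itertools

import numpy as np

def shuffle_combinations_eager(indices, seed=42):
    # Every C(n, 2) index pair is materialized up front: O(n^2) memory
    # allocated before a single training step runs.
    pairs = np.array(list(itertools.combinations(indices, 2)))
    np.random.default_rng(seed).shuffle(pairs)
    return pairs

# For n = 100_000 samples this is ~5e9 pairs of int64 indices,
# i.e. tens of gigabytes of RAM just for the index arrays.
```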

Solution

sampler.py:

  • Replace shuffle_combinations() with on-the-fly random pair sampling
  • ContrastiveDataset now stores only a label_to_indices mapping (O(n))
  • __iter__() generates pairs lazily, with set-based uniqueness tracking
  • The same changes apply to ContrastiveDistillationDataset (see the sketch below)
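A minimal sketch of the streaming idea, using hypothetical names (the real sampler.py also draws negative pairs and supports multilabel data):

```python
import random
from collections import defaultdict

class StreamingPairSampler:
    """Illustrative stand-in for the reworked ContrastiveDataset."""

    def __init__(self, labels, num_pairs, seed=42):
        # Only an O(n) label -> indices mapping is stored.
        self.label_to_indices = defaultdict(list)
        for idx, label in enumerate(labels):
            self.label_to_indices[label].append(idx)
        self.num_pairs = num_pairs  # target count, capped in the real code
        self.rng = random.Random(seed)  # seeded RNG for reproducibility

    def __iter__(self):
        seen = set()  # O(num_pairs) uniqueness tracking
        usable = [l for l, idxs in self.label_to_indices.items() if len(idxs) >= 2]
        while len(seen) < self.num_pairs:
            # Positive pair: two distinct indices sharing a label.
            label = self.rng.choice(usable)
            i, j = self.rng.sample(self.label_to_indices[label], 2)
            key = (min(i, j), max(i, j))
            if key in seen:
                continue  # resample instead of storing all pairs
            seen.add(key)
            yield {"idx_1": i, "idx_2": j, "label": 1.0}
```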

trainer.py / trainer_distillation.py:

  • Use IterableDataset.from_generator() instead of Dataset.from_list(list(...))
  • Compute max_steps up front, since an IterableDataset has no length for the trainer to infer step counts from (see the sketch below)
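Roughly how the trainer side changes, assuming the 🤗 datasets API and the sampler sketched above:

```python
from datasets import IterableDataset

# labels: list of per-sample class labels from the training set
sampler = StreamingPairSampler(labels, num_pairs=100_000)

# Stream pairs into training instead of Dataset.from_list(list(...)),
# which would materialize every pair in memory at once.
train_dataset = IterableDataset.from_generator(lambda: iter(sampler))

# An IterableDataset has no __len__, so the trainer cannot infer the
# number of steps from the dataset size; compute max_steps explicitly.
batch_size = 16
num_epochs = 1
max_steps = (sampler.num_pairs // batch_size) * num_epochs
```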

Memory Comparison

Component        Before    After
Index arrays     O(n²)     0
Pair lists       O(n²)     0
Label grouping   0         O(n)
Uniqueness set   0         O(num_pairs)
Dataset copy     O(n²)     0

Breaking Changes

  • ContrastiveDataset.pos_pairs and neg_pairs attributes are removed
  • ContrastiveDataset.len_pos_pairs / len_neg_pairs now represent target pair counts, not counts of stored pairs
  • An estimated_num_pairs property was added for logging (see the sketch below)
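In practice the migration looks roughly like this (hedged sketch; the constructor arguments shown are assumptions, not the exact API):

```python
# Hypothetical usage -- constructor arguments are illustrative.
ds = ContrastiveDataset(examples, multilabel=False)

# Before this PR, stored pairs could be indexed directly:
#     first = ds.pos_pairs[0]      # AttributeError after this PR

# After this PR, pairs are generated lazily; iterate instead:
first = next(iter(ds))

# len_pos_pairs is now a target, and estimated_num_pairs is for logging:
print(ds.len_pos_pairs, ds.estimated_num_pairs)
```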

Testing

  • Verified training completes on large datasets that previously OOM'd
  • Pairs maintain uniqueness via set-based tracking
  • Reproducibility preserved via seeded RNG
