# sample-dataset

**sample-dataset** is a small library for generating balanced, constraint-driven samples from tabular data, using:
- pandas for data handling
- Google OR-Tools CP-SAT for constraint optimization
It allows you to divide your dataset into buckets (e.g., train/test × attribute combinations) while respecting arbitrary minimum size requirements provided by the user.
This is useful for:
- Train/test splits with structural constraints
- Balanced dataset construction
- Controlled sampling for linguistic, NLP, or behavioral datasets
- Any application where "random sampling" must satisfy non-trivial rules
## Features

- Constraint-based sampling using OR-Tools CP-SAT
- Flexible bucket definitions via a separate minima dataframe
- Supports arbitrary bucket-defining columns (e.g., split, feature_a, feature_b, split1, split2, ...)
- Automatically infers which dataset rows are eligible for which buckets
- Guarantees minimum bucket sizes
- Supports multiple randomized feasible solutions
- Simple API (`assign_buckets`, `assign_buckets_multiple`)
## Installation

```
pip install sample-dataset
```

## Quick start

```python
import pandas as pd
from sample_dataset import assign_buckets
```

Your dataset (`df`) might look like:
```python
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "no", "yes", "no"],
    "context": ["...", "...", "...", "..."],
})
```

Then define the required bucket sizes in a separate minima dataframe (`df_minima`):

```python
df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test", "train", "test", "train", "test"],
    "feature_a": ["a", "a", "a", "a", "su", "su", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no", "yes", "yes", "no", "no"],
    "min_required": [150, 50, 150, 50, 150, 50, 150, 50],
})
```

Each row represents a bucket. All columns except `min_required` define the bucket identity. Note that `split` appears only in `df_minima`, not in `df`, so every dataset row is eligible for both the train and the test bucket of its feature combination.
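The library infers this eligibility automatically, but the rule is easy to reproduce: a dataset row matches a bucket when every bucket-defining column that also exists in the dataset agrees. A minimal sketch of that matching with plain pandas (this illustrates the idea, not the library's internals; `key_cols` and `eligible` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2],
    "feature_a": ["a", "su"],
    "feature_b": ["yes", "no"],
})
df_minima = pd.DataFrame({
    "split": ["train", "test"],
    "feature_a": ["a", "a"],
    "feature_b": ["yes", "yes"],
    "min_required": [1, 1],
})

# Key columns = bucket-defining columns that also exist in the data.
key_cols = [c for c in df_minima.columns
            if c != "min_required" and c in df.columns]

# Every (row, bucket) pair whose key columns agree.
eligible = df.merge(df_minima, on=key_cols)
```

Here row `ID=1` (`a`, `yes`) is eligible for both the train and the test bucket, while row `ID=2` (`su`, `no`) matches no bucket at all.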
Run the assignment:

```python
df_out = assign_buckets(df, df_minima, verbose=True)
print(df_out.head())
```

Output:

```
   ID feature_a feature_b context        bucket
0   1         a       yes     ...   train|a|yes
1   2         a        no     ...     test|a|no
2   3        su       yes     ...  train|su|yes
3   4        su        no     ...    test|su|no
```
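Because the result is an ordinary DataFrame with a `bucket` column, you can double-check the minima yourself with plain pandas. A sketch using toy minima sized for this 4-row example (the `min_required` values shown earlier assume a much larger dataset):

```python
import pandas as pd

# Output shape as in the example above: one bucket label per row.
df_out = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "bucket": ["train|a|yes", "test|a|no", "train|su|yes", "test|su|no"],
})

# Rows per bucket:
counts = df_out["bucket"].value_counts()

# Toy minima (illustrative values, not the ones from df_minima above):
minima = {"train|a|yes": 1, "test|a|no": 1}
ok = all(counts.get(b, 0) >= m for b, m in minima.items())
```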
## Multiple samples

To generate N different feasible assignments, use:

```python
from sample_dataset import assign_buckets_multiple

df_samples = assign_buckets_multiple(df, df_minima, n_samples=3)
print(df_samples.head())
```

You’ll get one column per sample:

```
      bucket_0     bucket_1     bucket_2
0  train|a|yes   test|a|yes   test|a|yes
...
```
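The wide layout (one `bucket_<i>` column per sample) can be reshaped into one row per (row, sample) pair with `DataFrame.melt`, which is often handier for comparing samples. A sketch assuming that column layout:

```python
import pandas as pd

# Assumed wide output: one column per generated sample.
df_samples = pd.DataFrame({
    "ID": [1, 2],
    "bucket_0": ["train|a|yes", "test|su|no"],
    "bucket_1": ["test|a|yes", "train|su|no"],
})

# One row per (ID, sample) pair:
long = df_samples.melt(id_vars="ID", var_name="sample", value_name="bucket")
```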
## How it works

- `df_minima` is interpreted as the full set of buckets
- A row is eligible for a bucket when all shared key columns (`key_cols`) match
- Minimum bucket sizes are enforced as hard constraints
- Each row is assigned to exactly one bucket
- A randomized objective produces diverse feasible assignments
- The model is solved with OR-Tools’ CP-SAT engine
## Requirements

- Python ≥ 3.9
- pandas
- numpy
- ortools

These are installed automatically with the package.