sample-dataset


sample-dataset is a small library for generating balanced, constraint-driven samples from tabular data, using:

  • pandas for data handling
  • Google OR-Tools CP-SAT for constraint optimization

It lets you divide your dataset into buckets (e.g., train/test × attribute combinations) while respecting user-specified minimum sizes for each bucket.

This is useful for:

  • Train/test splits with structural constraints
  • Balanced dataset construction
  • Controlled sampling for linguistic, NLP, or behavioral datasets
  • Any application where "random sampling" must satisfy non-trivial rules

Features

  • Constraint-based sampling using OR-Tools
  • Flexible bucket definitions via a separate minima dataframe
  • Supports arbitrary bucket-defining columns (e.g., split, feature_a, feature_b, split1, split2, ...)
  • Automatically infers which dataset rows are eligible for which buckets
  • Guarantees minimum bucket sizes
  • Supports multiple randomized feasible solutions
  • Simple API (assign_buckets, assign_buckets_multiple)

Installation

pip install sample-dataset

Quick Start

Import your data

import pandas as pd
from sample_dataset import assign_buckets

Your dataset (df) might look like:

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "no", "yes", "no"],
    "context": ["...", "...", "...", "..."],
})

Define the bucket structure + minima

df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [150, 50, 150, 50],
})

Each row represents a bucket. All columns except min_required define the bucket identity.
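
The bucket label shown in the output below is simply the key columns joined with "|". A quick way to preview the labels your minima define (a pandas-only sketch; the label format is inferred from the example output below, not from the library's internals):

key_cols = [c for c in df_minima.columns if c != "min_required"]
labels = df_minima[key_cols].astype(str).agg("|".join, axis=1)
print(labels.tolist())
# ['train|a|yes', 'test|a|no', 'train|su|yes', 'test|su|no']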

Assign buckets

df_out = assign_buckets(df, df_minima, verbose=True)
print(df_out.head())

Output:

   ID feature_a feature_b   context      bucket
0   1        a       yes       ...   train|a|yes
1   2        a        no       ...    test|a|no
2   3       su       yes       ...  train|su|yes
3   4       su        no       ...   test|su|no
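
On a dataset large enough to satisfy the minima, you can sanity-check the result by comparing per-bucket counts against df_minima (an illustrative check, not part of the library's API):

key_cols = [c for c in df_minima.columns if c != "min_required"]
labels = df_minima[key_cols].astype(str).agg("|".join, axis=1)
counts = df_out["bucket"].value_counts()
for label, minimum in zip(labels, df_minima["min_required"]):
    assert counts.get(label, 0) >= minimum, f"bucket {label} is below its minimum"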

Multiple randomized balanced samples

To generate N different feasible assignments, use:

from sample_dataset import assign_buckets_multiple

df_samples = assign_buckets_multiple(df, df_minima, n_samples=3)
print(df_samples.head())

You’ll get:

bucket_0      bucket_1      bucket_2
train|a|yes   test|a|yes    test|a|yes
...
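
Each bucket_i column is an independent feasible assignment. For example, you can inspect per-bucket sizes and see how much two samples differ (illustrative code, not part of the library's API):

for col in ["bucket_0", "bucket_1", "bucket_2"]:
    print(col, df_samples[col].value_counts().to_dict())

# share of rows assigned to a different bucket in sample 1 vs. sample 0
print((df_samples["bucket_0"] != df_samples["bucket_1"]).mean())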

How it works

  • Interprets df_minima as the full set of buckets
  • Matches rows to buckets when all shared key columns (key_cols) match
  • Enforces the minimum size of every bucket
  • Forces each row to belong to exactly one bucket
  • Uses a randomized objective to obtain diverse feasible assignments
  • Solves the model with OR-Tools’ CP-SAT engine (sketched below)
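
For intuition, here is a minimal, self-contained sketch of this kind of CP-SAT model. All names (x, eligible, minima) and the toy data are illustrative, not the library's internals:

import random
from ortools.sat.python import cp_model

rows = range(6)
buckets = ["train|a|yes", "test|a|yes"]
eligible = {r: buckets for r in rows}          # toy case: every row fits every bucket
minima = {"train|a|yes": 3, "test|a|yes": 2}

model = cp_model.CpModel()
x = {(r, b): model.NewBoolVar(f"x_{r}_{b}") for r in rows for b in eligible[r]}

# each row belongs to exactly one of its eligible buckets
for r in rows:
    model.AddExactlyOne(x[r, b] for b in eligible[r])

# each bucket receives at least its required number of rows
for b, m in minima.items():
    model.Add(sum(x[r, b] for r in rows if b in eligible[r]) >= m)

# random weights steer the solver toward a different feasible assignment each run
rng = random.Random()
model.Maximize(sum(rng.randint(0, 100) * v for v in x.values()))

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for r in rows:
        print(r, "->", next(b for b in eligible[r] if solver.Value(x[r, b])))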

Requirements

  • Python ≥ 3.9
  • pandas
  • numpy
  • ortools

All of these are installed automatically when you install the package with pip.
