Merged
6 changes: 3 additions & 3 deletions docs/source/model.rst
Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@ This is the documentation for the supported models in OpenTau.
pi05
----
- Pi05 is a state-of-the-art vision-language-action flow model for general robot control. It supports both autoregressive discrete actions and flow matching continuous actions.
- More details can be found in the `paper <https://www.pi.website/download/pi05.pdf>`_.
- More details can be found in the `pi05 paper <https://www.pi.website/download/pi05.pdf>`_.
- See the implementation in `src/opentau/policies/pi05/modeling_pi05.py`.
- A checkpoint of the model finetuned on the LIBERO dataset is available on Hugging Face: `TensorAuto/tPi05-Libero <https://huggingface.co/TensorAuto/tPi05-Libero>`_
- Disclaimer: Our implementation doesn't support sub-task prediction yet, as mentioned in the paper.
@@ -15,13 +15,13 @@ pi05
pi0
----
- Pi0 is a vision-language-action flow model that only supports flow matching continuous actions.
- More details can be found in the `paper <https://www.pi.website/download/pi0.pdf>`_.
- More details can be found in the `pi0 paper <https://www.pi.website/download/pi0.pdf>`_.
- See the implementation in `src/opentau/policies/pi0/modeling_pi0.py`.
- This model can be changed to pi0-star by setting the `advantage_always_on` flag to `on`/`use` in the config file.
- A checkpoint of the model finetuned on the LIBERO dataset is available on Hugging Face: `TensorAuto/tPi0-Libero <https://huggingface.co/TensorAuto/tPi0-Libero>`_

value
-----
- The value model is a vision-language model used to predict the value of the current state. It is used to train VLA policies with the RECAP framework.
- More details can be found in the `paper <https://www.pi.website/download/pistar06.pdf>`_.
- More details can be found in the `pi*06 paper <https://www.pi.website/download/pistar06.pdf>`_.
- See the implementation in `src/opentau/policies/value/modeling_value.py`.
16 changes: 7 additions & 9 deletions docs/source/tutorials/datasets.rst
@@ -4,12 +4,12 @@ Datasets
.. note::
Make sure you have followed the :doc:`/installation` guide before proceeding.

Building a dataset mixture
Building a dataset mixture
--------------------------

You can define a dataset mixture in your configuration file using the ``dataset_mixture`` key. Here is an example:

.. code-block:: json
.. code-block:: javascript

{
"dataset_mixture": {
@@ -30,21 +30,21 @@ You can define a dataset mixture in your configuration file using the ``dataset_
...
}
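The mixing behavior implied by this config can be sketched in plain Python. This is an illustrative sketch only, assuming each dataset entry carries a sampling weight and that sampling is proportional to those weights; the dataset names and the helper function here are hypothetical, and ``WeightedDatasetMixture``'s actual strategy may differ.

```python
import random


def sample_dataset_names(weights, n, seed=0):
    """Illustrative only: draw dataset names in proportion to mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    probs = list(weights.values())
    # Each draw picks one dataset, so over many draws the counts
    # approach the configured proportions.
    return [rng.choices(names, weights=probs, k=1)[0] for _ in range(n)]


# Hypothetical mixture: 70% of samples from one dataset, 30% from another.
draws = sample_dataset_names({"libero": 0.7, "bridge": 0.3}, 1000)
counts = {name: draws.count(name) for name in set(draws)}
print(counts)
```

With a large enough number of draws, the observed counts track the configured weights, which is the balancing effect a dataset mixture is meant to provide.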

For each new dataset, you must add an entry to ``opentau/datasets/standard_data_format_mapping.py`` to map the dataset features to the Standard Data Format (see the :ref:`Standard Data Format section <concepts/standard-data-format>` in the Concepts documentation).
For each new dataset, you must add an entry to ``opentau/datasets/standard_data_format_mapping.py`` to map the dataset features to the Standard Data Format.
Alternatively, you can provide a custom mapping in the dataset config using the ``data_features_name_mapping`` and ``loss_type_mapping`` keys.
For example:

.. code-block:: json
.. code-block:: javascript

{
"dataset_mixture": {
"datasets": [
{
"repo_id": "physical-intelligence/libero"
"repo_id": "physical-intelligence/libero",
"data_features_name_mapping": {
"camera0": "observation.images.exterior_image_1_left",
"camera1": "observation.images.exterior_image_2_left",
}
"camera1": "observation.images.exterior_image_2_left"
},
"loss_type_mapping": "MSE"
},
{
@@ -73,5 +73,3 @@ Each training config should contain a dataset mixture definition. To evaluate th
--num_workers=10

This will output a token count for each language key in the dataset mixture, and save it to ``outputs/stats/token_count.json``.
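The aggregation this command performs can be sketched as follows. This is a minimal illustration, assuming a simple whitespace tokenizer and invented sample data; the real script uses the model's tokenizer and the actual language keys of the mixture.

```python
import json


def count_tokens(samples):
    """Illustrative only: tally tokens per language key.

    samples is an iterable of (language_key, text) pairs; the real
    tokenizer is model-specific, not str.split().
    """
    counts = {}
    for key, text in samples:
        counts[key] = counts.get(key, 0) + len(text.split())
    return counts


# Hypothetical language keys and prompts.
samples = [
    ("task", "pick up the red block"),
    ("task", "close the drawer"),
    ("caption", "a robot arm"),
]
print(json.dumps(count_tokens(samples)))
```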


71 changes: 71 additions & 0 deletions src/opentau/datasets/__init__.py
@@ -11,3 +11,74 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Dataset management and processing utilities for robot learning and vision-language tasks.

This module provides a comprehensive toolkit for loading, creating, managing, and
processing datasets for training vision-language-action (VLA) models. It supports
both robot learning datasets (with actions and states) and vision-language
grounding datasets (for multimodal understanding tasks).

The module is organized into several key components:

- **Core Datasets**: LeRobotDataset for robot learning data with support for
temporal alignment, multi-modal data, and version compatibility.
- **Grounding Datasets**: Vision-language datasets (CLEVR, COCO-QA, PIXMO, VSR)
for training visual understanding without robot actions.
- **Dataset Mixtures**: WeightedDatasetMixture for combining multiple datasets
with controlled sampling proportions.
- **Data Processing**: Utilities for statistics computation, image/video
handling, transforms, and format standardization.
- **Factory Functions**: High-level functions for creating datasets and mixtures
from configuration objects.

Key Features:

- **HuggingFace Integration**: Seamless loading from HuggingFace Hub with
automatic version checking and backward compatibility.
- **Temporal Alignment**: Delta timestamps enable sampling features at
different time offsets with optional Gaussian noise for data augmentation.
- **Multi-modal Support**: Handles images, videos, state vectors, actions,
and text prompts with automatic format conversion.
- **Weighted Sampling**: Combine heterogeneous datasets with configurable
sampling weights for balanced training.
- **Standard Data Format**: Unified data format across all datasets for
consistent model input/output interfaces.
- **Statistics Management**: Automatic computation and aggregation of dataset
statistics for normalization.
- **Video Handling**: Multiple video backends (torchcodec, pyav, video_reader)
for efficient frame extraction and encoding.
- **Asynchronous I/O**: High-performance image writing for real-time data
recording without blocking.

Main Modules:

- **lerobot_dataset**: Core dataset implementation for robot learning data.
- **grounding**: Vision-language grounding datasets (CLEVR, COCO-QA, PIXMO, VSR).
- **dataset_mixture**: Weighted combination of multiple datasets.
- **factory**: Factory functions for creating datasets from configurations.
- **utils**: Utility functions for I/O, metadata management, and validation.
- **compute_stats**: Statistics computation and aggregation utilities.
- **transforms**: Image transformation pipelines for data augmentation.
- **video_utils**: Video encoding, decoding, and metadata extraction.
- **image_writer**: Asynchronous image writing for high-frequency recording.
- **sampler**: Episode-aware sampling with boundary frame filtering.
- **standard_data_format_mapping**: Feature name and loss type mappings.

Example:
Create a dataset mixture from configuration:

>>> from opentau.datasets.factory import make_dataset_mixture
>>> mixture = make_dataset_mixture(train_cfg)
>>> dataloader = mixture.get_dataloader()

Load a single dataset:

>>> from opentau.datasets.factory import make_dataset
>>> dataset = make_dataset(dataset_cfg, train_cfg)

Access grounding datasets:

>>> from opentau import available_grounding_datasets
>>> print(list(available_grounding_datasets.keys()))
['clevr', 'cocoqa', 'dummy', 'pixmo', 'vsr']
"""
26 changes: 16 additions & 10 deletions src/opentau/datasets/compute_stats.py
@@ -42,16 +42,22 @@
weighted variance across multiple statistics.

Functions:
estimate_num_samples: Heuristic to estimate optimal number of samples
based on dataset size.
sample_indices: Generate evenly spaced sample indices from a dataset.
auto_downsample_height_width: Automatically downsample large images.
sample_images: Load and downsample a subset of images from file paths.
get_feature_stats: Compute statistical measures for an array.
compute_episode_stats: Compute statistics for a single episode.
aggregate_feature_stats: Aggregate statistics for a feature across
multiple episodes.
aggregate_stats: Aggregate statistics from multiple episodes/datasets.
estimate_num_samples
Heuristic to estimate optimal number of samples based on dataset size.
sample_indices
Generate evenly spaced sample indices from a dataset.
auto_downsample_height_width
Automatically downsample large images.
sample_images
Load and downsample a subset of images from file paths.
get_feature_stats
Compute statistical measures for an array.
compute_episode_stats
Compute statistics for a single episode.
aggregate_feature_stats
Aggregate statistics for a feature across multiple episodes.
aggregate_stats
Aggregate statistics from multiple episodes/datasets.
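The weighted aggregation these functions describe can be sketched with the law of total variance. This is an illustrative sketch only, assuming per-episode statistics arrive as (mean, variance, count) triples; it is not the actual signature of ``aggregate_stats``.

```python
def aggregate_means_vars(stats):
    """Illustrative only: pool per-episode (mean, var, count) triples.

    Combines within-episode variance with the spread of episode means,
    weighting each episode by its sample count.
    """
    total = sum(n for _, _, n in stats)
    mean = sum(m * n for m, _, n in stats) / total
    # Law of total variance: within-episode variance plus the squared
    # deviation of each episode mean from the pooled mean.
    var = sum(n * (v + (m - mean) ** 2) for m, v, n in stats) / total
    return mean, var
```

Pooling this way matters because simply averaging per-episode variances would ignore how far the episode means sit from each other, underestimating the overall spread.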

Example:
Compute statistics for a single episode:
2 changes: 2 additions & 0 deletions src/opentau/datasets/factory.py
@@ -101,8 +101,10 @@ def resolve_delta_timestamps(

Returns:
A 2-tuple containing:

- At index 0, a 4-tuple containing delta timestamps mean, std, lower, and upper bounds for each group.
- At index 1, a dictionary mapping feature names to their corresponding group and index.

The delta timestamps and group mapping should follow the structure expected by LeRobotDataset.
"""
group = "input_group"
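The documented return shape can be illustrated with invented values. Everything below is hypothetical (the group names, the numbers, and the number of groups are made up for illustration); only the nesting structure follows the docstring above.

```python
# Hypothetical illustration of the documented 2-tuple (all values invented).
# Index 0: a 4-tuple of per-group delta-timestamp mean, std, lower, upper.
delta_timestamps = (
    (0.0, 0.0),    # means, one entry per group
    (0.01, 0.0),   # stds
    (-0.1, 0.0),   # lower bounds
    (0.1, 0.5),    # upper bounds
)
# Index 1: feature name -> (group name, index within that group).
feature_groups = {
    "observation.state": ("input_group", 0),
    "action": ("action_group", 1),
}

group, index = feature_groups["action"]
print(group, index)
```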
23 changes: 7 additions & 16 deletions src/opentau/datasets/grounding/pixmo.py
@@ -1,4 +1,3 @@

# Copyright 2026 Tensor Auto Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -12,27 +11,18 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Datasets for Image-Text Point Set grounding tasks.
"""Datasets for Image-Text Point Set grounding tasks.

This module provides the PIXMO (Pixel-level Manipulation) dataset implementation
for training vision-language models on part localization and object grounding
tasks. The dataset contains images with point annotations for object parts,
enabling models to learn fine-grained spatial understanding.
for training vision-language models on part localization and object grounding tasks.

The dataset contains images with point annotations for object parts, enabling models
to learn fine-grained spatial understanding.

The dataset is loaded from HuggingFace (allenai/pixmo-points) and includes
automatic retry logic for handling image download failures. Point coordinates
are normalized to a 255x255 grid and formatted as JSON strings in the postfix.

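The grid normalization described above can be sketched as follows. This is an illustrative sketch only, assuming linear scaling and rounding; the helper name and the exact rounding and JSON layout used by PixmoDataset are assumptions.

```python
import json


def points_to_grid(points, width, height, grid=255):
    """Illustrative only: map pixel (x, y) points from a width x height
    image onto a normalized grid x grid coordinate space."""
    return [
        {"x": round(x / width * grid), "y": round(y / height * grid)}
        for x, y in points
    ]


# A point at the center of a 640x480 image lands near the grid center.
postfix = json.dumps(points_to_grid([(320, 240)], width=640, height=480))
print(postfix)
```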
Key Features:
- Point set grounding: Provides pixel-level point annotations for object
parts with labels.
- Robust loading: Automatic retry with random sampling for failed image
downloads.
- Grid normalization: Converts pixel coordinates to normalized grid space
for consistent representation.

Classes:
PixmoDataset: Dataset class that loads and formats PIXMO data for part
localization tasks.
@@ -50,7 +40,8 @@
HTTP_TIMEOUT: HTTP request timeout in seconds.

Example:
Use PIXMO dataset in training:
Use PIXMO dataset in training::

>>> from opentau.configs.default import DatasetConfig
>>> cfg = DatasetConfig(grounding="pixmo")
>>> dataset = make_dataset(cfg, train_cfg)
11 changes: 5 additions & 6 deletions src/opentau/datasets/grounding/vsr.py
@@ -1,4 +1,3 @@

# Copyright 2026 Tensor Auto Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -12,7 +11,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""VSR (Visual Spatial Reasoning) dataset for true/false statement grounding.

This module provides the VSR dataset implementation for training vision-language
@@ -25,10 +23,10 @@
formatted as grounding tasks with true/false labels.

Key Features:
- Spatial reasoning: Tests understanding of spatial relationships between
* Spatial reasoning: Tests understanding of spatial relationships between
objects in images.
- Binary classification: Simple true/false format for clear learning signal.
- Robust loading: Automatic retry with random sampling for failed image
* Binary classification: Simple true/false format for clear learning signal.
* Robust loading: Automatic retry with random sampling for failed image
downloads.

Classes:
@@ -45,7 +43,8 @@
HTTP_TIMEOUT: HTTP request timeout in seconds.

Example:
Use VSR dataset in training:
Use VSR dataset in training::

>>> from opentau.configs.default import DatasetConfig
>>> cfg = DatasetConfig(grounding="vsr")
>>> dataset = make_dataset(cfg, train_cfg)
25 changes: 15 additions & 10 deletions src/opentau/datasets/image_writer.py
@@ -21,6 +21,7 @@
robots and recording data at high frame rates without blocking the main process.

The module supports two execution models:

1. Threading mode (num_processes=0): Creates a pool of worker threads
for concurrent image writing within a single process.
2. Multiprocessing mode (num_processes>0): Creates multiple processes,
@@ -39,18 +40,22 @@
even when exceptions occur.

Classes:
AsyncImageWriter: Main class for asynchronous image writing with
configurable threading or multiprocessing backends.

AsyncImageWriter
Main class for asynchronous image writing with configurable threading
or multiprocessing backends.

Functions:
image_array_to_pil_image: Convert numpy array to PIL Image with format
and type conversion.
write_image: Write an image (numpy array or PIL Image) to disk.
worker_thread_loop: Worker thread loop for processing image write queue.
worker_process: Worker process that manages multiple threads for image
writing.
safe_stop_image_writer: Decorator to safely stop image writer on
exceptions.
image_array_to_pil_image
Convert numpy array to PIL Image with format and type conversion.
write_image
Write an image (numpy array or PIL Image) to disk.
worker_thread_loop
Worker thread loop for processing image write queue.
worker_process
Worker process that manages multiple threads for image writing.
safe_stop_image_writer
Decorator to safely stop image writer on exceptions.

Example:
Create an async image writer with threading:
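The threading execution model described in this docstring can be sketched with the standard library alone. This is a minimal sketch, assuming a single worker thread, raw byte payloads instead of PIL image encoding, and a None sentinel for shutdown; AsyncImageWriter's real API and pool management differ.

```python
import queue
import tempfile
import threading
from pathlib import Path


def worker(q):
    # Illustrative only: drain (path, data) jobs until a None sentinel arrives.
    while True:
        job = q.get()
        if job is None:
            break
        path, data = job
        Path(path).write_bytes(data)
        q.task_done()


q = queue.Queue()
t = threading.Thread(target=worker, args=(q,), daemon=True)
t.start()

out = Path(tempfile.mkdtemp()) / "frame_000.bin"
q.put((out, b"fake image bytes"))  # enqueue without blocking the caller
q.join()                           # wait until pending writes complete
q.put(None)                        # signal shutdown
t.join()
```

Because the producer only enqueues, a control loop recording at high frequency never waits on disk I/O; the queue absorbs bursts and the worker drains them in the background.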