
Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Paper Page Hugging Face Collection WeChat

🔥 The first open-source diffusion vision-language-action model.

Contents

Overview | TODO | Installation | Download checkpoint | Train | Evaluation | Results

Overview

This repository is an improved fork of UniVLA. You can set up the environment by following UniVLA’s instructions. Below, we provide a step-by-step guide to using UD-VLA in the CALVIN setting.

TODO List

  • Release the code for the real-world implementation.
  • Explore distillation of the discrete diffusion VLA for more efficient inference.

Installation

1. Install the base environment.

# Create and activate conda environment
conda create -n udvla-calvin python=3.10 -y
conda activate udvla-calvin
# Clone and install the UD-VLA repo
git clone https://github.com/OpenHelix-Team/UD-VLA.git
cd UD-VLA
pip install -r requirements.txt
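
Optionally, as a quick sanity check (assuming PyTorch is installed via requirements.txt), you can confirm that your GPUs are visible before moving on:

# Optional sanity check: confirm that PyTorch is installed and sees your GPUs.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())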

2. Install the CALVIN environment.

This setup is only needed for evaluation. The following steps set up the environment:

# Install dependencies
cd reference/RoboVLMs

# This will install the required environment and download the CALVIN dataset.
bash scripts/setup_calvin.sh

# Only needed for the rendering environment.
bash scripts/setup_calvin_vla.sh

# Check if the environment is set up correctly
python eval/calvin/env_test.py

Download pretrained checkpoints

Emu3-base

Emu3-base is used as the text tokenizer.

Emu3-vision

Emu3-vision is used as the visual tokenizer.

Autoregressive World model

This model is trained in an autoregressive manner with causal attention.
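
If you prefer to script the downloads, a minimal sketch with huggingface_hub is shown below. The repository IDs are placeholders; use the links above for the actual checkpoints. The world-model directory matches MODEL_PATH in the training script further down.

# Minimal download sketch using huggingface_hub; the repo IDs are placeholders,
# substitute the actual IDs from the links above.
from huggingface_hub import snapshot_download

checkpoints = [
    ("<org>/Emu3-base", "pretrain/emu3_base"),        # text tokenizer (placeholder local path)
    ("<org>/Emu3-vision", "pretrain/emu3_vision"),    # visual (VQ) tokenizer (placeholder local path)
    ("<org>/UD-VLA-world-model", "logs/ckpts/WORLD_MODEL_POSTTRAIN"),  # used as MODEL_PATH below
]
for repo_id, local_dir in checkpoints:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)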

Dataset Preparation

# 1. Process the dataset
python tools/process/calvin_process.py

# 2. Extract the VQ tokens (adjust the dataset and output paths in the script)
bash scripts/tokenizer/extract_vq_emu3.sh

# 3. Generate the training pickle
python tools/pickle_gen/pickle_generation_calvin.py
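
You can sanity-check the generated meta pickle before training. The exact schema is defined by pickle_generation_calvin.py, so the snippet below is only an illustrative sketch; the path is the one referenced by the training script.

# Illustrative sketch: peek at the generated meta pickle.
# The exact schema is defined by tools/pickle_gen/pickle_generation_calvin.py.
import pickle

with open("processed_data/meta/calvin_abcd_norm.pkl", "rb") as f:  # adjust to your output path
    meta = pickle.load(f)

print(type(meta))
if isinstance(meta, dict):
    print("keys:", list(meta.keys())[:10])
elif isinstance(meta, (list, tuple)):
    print("entries:", len(meta))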

Model Training

FAST Tokenizer

Fit the FAST tokenizer on the corresponding dataset; you can also adjust the tokenizer scale for more fine-grained tokenization. For CALVIN ABCD→D, we set the action chunk size to 10.

python tools/action_tokenizer/fit_fast.py
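
For reference, a rough sketch of what fitting amounts to, assuming the fit/save_pretrained interface described on the physical-intelligence/fast model card, 7-dim CALVIN actions normalized to [-1, 1], and a chunk size of 10 (fit_fast.py may differ in details). The output path matches ACTION_TOKENIZER_PATH in the training script.

# Rough sketch of refitting the FAST action tokenizer on your own chunks
# (assumed 7-dim CALVIN actions, chunk size 10); fit_fast.py may differ in details.
import numpy as np
from transformers import AutoProcessor

tokenizer = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)

# action_chunks: [num_samples, chunk_size, action_dim]; dummy data for illustration.
action_chunks = np.random.uniform(-1, 1, size=(1000, 10, 7)).astype(np.float32)

new_tokenizer = tokenizer.fit(action_chunks)
new_tokenizer.save_pretrained("./pretrain/fast_calvin_abcd_a10_s50")  # ACTION_TOKENIZER_PATH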

Discrete diffusion training

bash scripts/simulator/calvin/train_calvin_abcd_video_i-ia_bid_mi.sh

We recommend at least 4×80 GB GPUs (e.g., A100/H100 80 GB). Each sample contains many image tokens, which results in long sequence lengths and increased memory usage.
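
For reference, the effective global batch size implied by the script below is NGPUS × per_device_train_batch_size × gradient_accumulation_steps = 4 × 4 × 4 = 64, matching the BS64 tag in the experiment name.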

We also release our checkpoint finetuned on CALVIN ABCD→D at UD-VLA_CALVIN.

# train_calvin_abcd_video_i-ia_bid_mi.sh
WORLD_SIZE=${WORLD_SIZE:-1}
RANK=${RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}
MASTER_PORT=${MASTER_PORT:-23456}
NGPUS=4
MODEL_PATH="logs/ckpts/WORLD_MODEL_POSTTRAIN"
DATAPATH='/share/user/iperror/data/udvla/processed_data/meta/calvin_abcd_norm.pkl'
ACTION_TOKENIZER_PATH="./pretrain/fast_calvin_abcd_a10_s50"

EXP_NAME="UNIVLA_CALVIN_ABCD_VIDEO_BS64_32k_I2IA_mi_0915"

export PYTHONPATH=$(pwd)

torchrun \
    --nproc_per_node=${NGPUS} \
    --nnodes=1 \
    --node_rank=${RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    train/train_moe.py \
    --model_name_or_path ${MODEL_PATH} \
    --model_config_path configs/moe_fast_video.json \
    --deepspeed scripts/sft/zero3.json \
    --output_dir "logs/"${EXP_NAME} \
    --learning_rate 8e-5 \
    --null_prompt_prob 0.15 \
    --weight_decay 0.1 \
    --min_learning_rate 5e-6 \
    --max_grad_norm 5.0 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --adam_epsilon 1e-6 \
    --bf16 True \
    --tf32 True \
    --data_path ${DATAPATH} \
    --max_steps 24000 \
    --dataloader_num_workers 16 \
    --lr_scheduler_type "cosine_with_min_lr" \
    --warmup_steps 50 \
    --per_device_train_batch_size 4 \
    --frames 2 \
    --action_frames 10 \
    --max_position_embeddings 1650 \
    --seed 42 \
    --logging_steps 20 \
    --gradient_checkpointing True \
    --gradient_accumulation_steps 4 \
    --save_strategy steps \
    --save_steps 3000 \
    --eval_strategy no \
    --apply_loss_on_only_vision False \
    --apply_loss_on_only_action False \
    --actions True \
    --actions_format "fast" \
    --use_gripper True \
    --video_format "interleave" \
    --action_tokenizer_path ${ACTION_TOKENIZER_PATH} \
    --with_i_ia True \
    --mask_image True \
    --use_blockwise_attn_mask True \
    --attn_type "None"

Key training flags

  • --with_i_ia True: enable joint generation of future image tokens and action tokens in the same denoising step (JD3P).

  • --mask_image True: apply <MASK> to future-image positions during training and compute cross-entropy only on masked positions (single-step mask-predict objective).

  • --use_blockwise_attn_mask True: use the blockwise hybrid attention sketched below:

    • bidirectional within the future-image block and within the action block;
    • causal across blocks.
  • --attn_type "None": do not use "flash" (FlashAttention).
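
A minimal sketch of such a blockwise hybrid mask is shown below. It is illustrative only, not the repository's exact implementation, and assumes a causal prefix of text/observation tokens followed by the future-image and action blocks.

# Illustrative sketch of a blockwise hybrid attention mask (True = may attend).
# Bidirectional inside the future-image block and inside the action block;
# causal across blocks (actions see images, images do not see actions).
import torch

def blockwise_attn_mask(prefix_len: int, image_len: int, action_len: int) -> torch.Tensor:
    total = prefix_len + image_len + action_len
    mask = torch.tril(torch.ones(total, total)).bool()   # start from a causal mask
    img = slice(prefix_len, prefix_len + image_len)
    act = slice(prefix_len + image_len, total)
    mask[img, img] = True   # bidirectional within the future-image block
    mask[act, act] = True   # bidirectional within the action block
    return mask

# Example: 8 prefix tokens, 4 future-image tokens, 3 action tokens.
print(blockwise_attn_mask(8, 4, 3).int())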

Model Evaluation

You can also download our checkpoint finetuned on CALVIN ABCD→D at UD-VLA_CALVIN.

cd reference/RoboVLMs

# Inference on 4 GPUs; we set the number of diffusion steps to 72.
bash scripts/run_eval_calvin_univla_i2ia_dis.sh 

# The command above writes 4 result files to the `results` folder; compute the final average score:
python tools/evaluation/calvin_score.py

Experiment Result

Performance on CALVIN ABCD→D Benchmark.

UniVLA* denotes the variant without historical frames, for a fair comparison. We evaluate 500 rollouts for our model, where each rollout involves a sequence of 5 consecutive sub-tasks.

| Method | Task | 1 | 2 | 3 | 4 | 5 | Avg. Len ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MCIL | ABCD→D | 0.373 | 0.027 | 0.002 | 0.000 | 0.000 | 0.40 |
| RT-1 | ABCD→D | 0.844 | 0.617 | 0.438 | 0.323 | 0.227 | 2.45 |
| Robo-Flamingo | ABCD→D | 0.964 | 0.896 | 0.824 | 0.740 | 0.660 | 4.09 |
| GR-1 | ABCD→D | 0.949 | 0.896 | 0.844 | 0.789 | 0.731 | 4.21 |
| ReconVLA | ABCD→D | 0.980 | 0.900 | 0.845 | 0.785 | 0.705 | 4.23 |
| UniVLA* | ABCD→D | 0.948 | 0.906 | 0.862 | 0.834 | 0.690 | 4.24 |
| UP-VLA | ABCD→D | 0.962 | 0.921 | 0.879 | 0.842 | 0.812 | 4.42 |
| UD-VLA (ours) | ABCD→D | 0.992 | 0.968 | 0.936 | 0.904 | 0.840 | 4.64 |
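
Columns 1-5 report the success rate of completing the first k sub-tasks in a row; Avg. Len is the average number of consecutively completed sub-tasks, which equals the sum of the five rates (e.g., for UD-VLA: 0.992 + 0.968 + 0.936 + 0.904 + 0.840 = 4.64).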

Performance on real-world tasks.

Our real-world setup consists of a 6-DoF UR5e robotic arm equipped with a 6-DoF Inspire RH56E2 robotic hand for dexterous manipulation.

Other Simulation Benchmark Setup

❤️ Acknowledgment

We thank UniVLA, Emu3, RoboVLMs, and Show-o for their open-source work!

We thank Yuqi Wang and Zhide Zhong for their guidance on the experiments!

📖 Citation

If you find UD-VLA useful, please consider citing our work🤗:

@article{udvla2025,
  title={Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process},
  author={Jiayi Chen and Wenxuan Song and Pengxiang Ding and Ziyang Zhou and Han Zhao and Feilong Tang and Donglin Wang and Haoang Li},
  year={2025},
  journal={arXiv preprint arXiv:2511.01718}
}
