
Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Paper Page Hugging Face Collection WeChat

🔥 The first open-source diffusion vision-language-action model.

Contents

Overview | TODO | Installation | Download checkpoint | Train | Evaluation | Results

Overview

This repository is an improved fork of UniVLA. You can set up the environment by following UniVLA’s instructions. Below, we provide a step-by-step guide to using UD-VLA in the CALVIN setting.

TODO List

  • Release the code for the real-world implementation.
  • Explore distillation of the discrete diffusion VLA for more efficient inference.

Installation

1. Install the base environment.

# Create and activate conda environment
conda create -n udvla-calvin python=3.10 -y
conda activate udvla-calvin
# Clone and install the UD-VLA repo
git clone https://github.com/OpenHelix-Team/UD-VLA.git
cd UD-VLA
pip install -r requirements.txt
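
Optionally, as a quick sanity check (assuming PyTorch is installed via requirements.txt), you can confirm that your GPUs are visible before moving on:

# Optional sanity check: confirm that PyTorch is installed and sees your GPUs.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())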

2. Install the CALVIN environment.

This setup is only needed for evaluation. The following steps set up the environment:

# Install dependencies
cd reference/RoboVLMs

# This will install the required environment and download the CALVIN dataset.
bash scripts/setup_calvin.sh

# Only needed for the rendering environment.
bash scripts/setup_calvin_vla.sh

# Check if the environment is set up correctly
python eval/calvin/env_test.py

Download pretrained checkpoints

Emu3-base

Emu3-base is used as the text tokenizer.

Emu3-vision

Emu3-vision is used as the visual tokenizer.

Autoregressive World model

This model is trained in an autoregressive manner with causal attention.
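
If you prefer to script the downloads, a minimal sketch with huggingface_hub is shown below. The repository IDs are placeholders; use the links above for the actual checkpoints. The world-model directory matches MODEL_PATH in the training script further down.

# Minimal download sketch using huggingface_hub; the repo IDs are placeholders,
# substitute the actual IDs from the links above.
from huggingface_hub import snapshot_download

checkpoints = [
    ("<org>/Emu3-base", "pretrain/emu3_base"),        # text tokenizer (placeholder local path)
    ("<org>/Emu3-vision", "pretrain/emu3_vision"),    # visual (VQ) tokenizer (placeholder local path)
    ("<org>/UD-VLA-world-model", "logs/ckpts/WORLD_MODEL_POSTTRAIN"),  # used as MODEL_PATH below
]
for repo_id, local_dir in checkpoints:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)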

Dataset Preparation

# 1. Process the dataset
python tools/process/calvin_process.py

# 2. Extract the VQ tokens (adjust the dataset and output paths in the script)
bash scripts/tokenizer/extract_vq_emu3.sh

# 3. Generate the training pickle
python tools/pickle_gen/pickle_generation_calvin.py
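
You can sanity-check the generated meta pickle before training. The exact schema is defined by pickle_generation_calvin.py, so the snippet below is only an illustrative sketch; the path is the one referenced by the training script.

# Illustrative sketch: peek at the generated meta pickle.
# The exact schema is defined by tools/pickle_gen/pickle_generation_calvin.py.
import pickle

with open("processed_data/meta/calvin_abcd_norm.pkl", "rb") as f:  # adjust to your output path
    meta = pickle.load(f)

print(type(meta))
if isinstance(meta, dict):
    print("keys:", list(meta.keys())[:10])
elif isinstance(meta, (list, tuple)):
    print("entries:", len(meta))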

Model Training

FAST Tokenizer

Fit the FAST tokenizer on the corresponding dataset; you can also adjust the tokenizer scale for more fine-grained tokenization. For CALVIN ABCD→D, we set the action chunk size to 10.

python tools/action_tokenizer/fit_fast.py
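
For reference, a rough sketch of what fitting amounts to, assuming the fit/save_pretrained interface described on the physical-intelligence/fast model card, 7-dim CALVIN actions normalized to [-1, 1], and a chunk size of 10 (fit_fast.py may differ in details). The output path matches ACTION_TOKENIZER_PATH in the training script.

# Rough sketch of refitting the FAST action tokenizer on your own chunks
# (assumed 7-dim CALVIN actions, chunk size 10); fit_fast.py may differ in details.
import numpy as np
from transformers import AutoProcessor

tokenizer = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)

# action_chunks: [num_samples, chunk_size, action_dim]; dummy data for illustration.
action_chunks = np.random.uniform(-1, 1, size=(1000, 10, 7)).astype(np.float32)

new_tokenizer = tokenizer.fit(action_chunks)
new_tokenizer.save_pretrained("./pretrain/fast_calvin_abcd_a10_s50")  # ACTION_TOKENIZER_PATH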

Discrete diffusion training

bash scripts/simulator/calvin/train_calvin_abcd_video_i-ia_bid_mi.sh

We recommend at least 4×80 GB GPUs (e.g., A100/H100 80 GB). Each sample contains many image tokens, which results in long sequence lengths and increased memory usage.
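
For reference, the effective global batch size implied by the script below is NGPUS × per_device_train_batch_size × gradient_accumulation_steps = 4 × 4 × 4 = 64, matching the BS64 tag in the experiment name.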

We also release our checkpoint finetuned on CALVIN ABCD→D at UD-VLA_CALVIN.

# train_calvin_abcd_video_i-ia_bid_mi.sh
WORLD_SIZE=${WORLD_SIZE:-1}
RANK=${RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}
MASTER_PORT=${MASTER_PORT:-23456}
NGPUS=4
MODEL_PATH="logs/ckpts/WORLD_MODEL_POSTTRAIN"
DATAPATH='/share/user/iperror/data/udvla/processed_data/meta/calvin_abcd_norm.pkl'
ACTION_TOKENIZER_PATH="./pretrain/fast_calvin_abcd_a10_s50"

EXP_NAME="UNIVLA_CALVIN_ABCD_VIDEO_BS64_32k_I2IA_mi_0915"

export PYTHONPATH=$(pwd)

torchrun \
    --nproc_per_node=${NGPUS} \
    --nnodes=1 \
    --node_rank=${RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    train/train_moe.py \
    --model_name_or_path ${MODEL_PATH} \
    --model_config_path configs/moe_fast_video.json \
    --deepspeed scripts/sft/zero3.json \
    --output_dir "logs/"${EXP_NAME} \
    --learning_rate 8e-5 \
    --null_prompt_prob 0.15 \
    --weight_decay 0.1 \
    --min_learning_rate 5e-6 \
    --max_grad_norm 5.0 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --adam_epsilon 1e-6 \
    --bf16 True \
    --tf32 True \
    --data_path ${DATAPATH} \
    --max_steps 24000 \
    --dataloader_num_workers 16 \
    --lr_scheduler_type "cosine_with_min_lr" \
    --warmup_steps 50 \
    --per_device_train_batch_size 4 \
    --frames 2 \
    --action_frames 10 \
    --max_position_embeddings 1650 \
    --seed 42 \
    --logging_steps 20 \
    --gradient_checkpointing True \
    --gradient_accumulation_steps 4 \
    --save_strategy steps \
    --save_steps 3000 \
    --eval_strategy no \
    --apply_loss_on_only_vision False \
    --apply_loss_on_only_action False \
    --actions True \
    --actions_format "fast" \
    --use_gripper True \
    --video_format "interleave" \
    --action_tokenizer_path ${ACTION_TOKENIZER_PATH} \
    --with_i_ia True \
    --mask_image True \
    --use_blockwise_attn_mask True \
    --attn_type "None"

Key training flags

  • --with_i_ia True: enable joint generation of future image tokens and action tokens in the same denoising step (JD3P).

  • --mask_image True: apply <MASK> to future-image positions during training and compute cross-entropy only on masked positions (single-step mask-predict objective).

  • --use_blockwise_attn_mask True: use the blockwise hybrid attention sketched below:

    • bidirectional within the future-image block and within the action block;
    • causal across blocks.
  • --attn_type "None": do not use "flash" (FlashAttention).
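
A minimal sketch of such a blockwise hybrid mask is shown below. It is illustrative only, not the repository's exact implementation, and assumes a causal prefix of text/observation tokens followed by the future-image and action blocks.

# Illustrative sketch of a blockwise hybrid attention mask (True = may attend).
# Bidirectional inside the future-image block and inside the action block;
# causal across blocks (actions see images, images do not see actions).
import torch

def blockwise_attn_mask(prefix_len: int, image_len: int, action_len: int) -> torch.Tensor:
    total = prefix_len + image_len + action_len
    mask = torch.tril(torch.ones(total, total)).bool()   # start from a causal mask
    img = slice(prefix_len, prefix_len + image_len)
    act = slice(prefix_len + image_len, total)
    mask[img, img] = True   # bidirectional within the future-image block
    mask[act, act] = True   # bidirectional within the action block
    return mask

# Example: 8 prefix tokens, 4 future-image tokens, 3 action tokens.
print(blockwise_attn_mask(8, 4, 3).int())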

Model Evaluation

You can also download our checkpoint finetuned on CALVIN ABCD→D at UD-VLA_CALVIN.

cd reference/RoboVLMs

# Inference on 4 GPUs; we set the number of diffusion steps to 72.
bash scripts/run_eval_calvin_univla_i2ia_dis.sh 

# The command above writes 4 result files to the `results` folder; compute the final average score:
python tools/evaluation/calvin_score.py

Experiment Result

Performance on CALVIN ABCD→D Benchmark.

UniVLA* denotes the variant without historical frames, for a fair comparison. We evaluate 500 rollouts for our model, where each rollout involves a sequence of 5 consecutive sub-tasks.

| Method | Task | 1 | 2 | 3 | 4 | 5 | Avg. Len ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MCIL | ABCD→D | 0.373 | 0.027 | 0.002 | 0.000 | 0.000 | 0.40 |
| RT-1 | ABCD→D | 0.844 | 0.617 | 0.438 | 0.323 | 0.227 | 2.45 |
| Robo-Flamingo | ABCD→D | 0.964 | 0.896 | 0.824 | 0.740 | 0.660 | 4.09 |
| GR-1 | ABCD→D | 0.949 | 0.896 | 0.844 | 0.789 | 0.731 | 4.21 |
| ReconVLA | ABCD→D | 0.980 | 0.900 | 0.845 | 0.785 | 0.705 | 4.23 |
| UniVLA* | ABCD→D | 0.948 | 0.906 | 0.862 | 0.834 | 0.690 | 4.24 |
| UP-VLA | ABCD→D | 0.962 | 0.921 | 0.879 | 0.842 | 0.812 | 4.42 |
| UD-VLA (ours) | ABCD→D | 0.992 | 0.968 | 0.936 | 0.904 | 0.840 | 4.64 |
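
Columns 1-5 report the success rate of completing the first k sub-tasks in a row; Avg. Len is the average number of consecutively completed sub-tasks, which equals the sum of the five rates (e.g., for UD-VLA: 0.992 + 0.968 + 0.936 + 0.904 + 0.840 = 4.64).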

Performance on real-world tasks.

Our real-world setup consists of a 6-DoF UR5e robotic arm equipped with a 6-DoF Inspire RH56E2 robotic hand for dexterous manipulation.

Other Simulation Benchmark Setup

❤️ Acknowledgment

We thank UniVLA, Emu3, RoboVLMs, and Show-o for their open-source work!

We thank Yuqi Wang and Zhide Zhong for their guidance on the experiments!

📖 Citation

If you find UD-VLA useful, please consider citing our work🤗:

@article{udvla2025,
  title={Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process},
  author={Jiayi Chen and Wenxuan Song and Pengxiang Ding and Ziyang Zhou and Han Zhao and Feilong Tang and Donglin Wang and Haoang Li},
  year={2025},
  journal={arXiv preprint arXiv:2511.01718}
}
