🔥 The first open-source diffusion vision-language-action model.
Overview | TODO | Installation | Download checkpoint | Train | Evaluation | Results
This repository is an improved fork of UniVLA. You can set up the environment by following UniVLA’s instructions. Below, we provide a step-by-step guide to using UD-VLA in the CALVIN setting.

- Code for the real-world implementation.
- Distillation of the discrete-diffusion VLA for more efficient inference.
# Create and activate conda environment
conda create -n udvla-calvin python=3.10 -y
conda activate udvla-calvin
# Clone and install the UD-VLA repo
git clone https://github.com/OpenHelix-Team/UD-VLA.git
cd UD-VLA
pip install -r requirements.txt
The following setup is required only for evaluation (it installs the CALVIN simulator environment):
# Install dependencies
cd reference/RoboVLMs
# This will install the required environment and download the CALVIN dataset.
bash scripts/setup_calvin.sh
# Only needed for the rendering environment.
bash scripts/setup_calvin_vla.sh
# Check if the environment is set up correctly
python eval/calvin/env_test.py

Download the following checkpoints:
- Emu3-base for the text tokenizer.
- Emu3-vision for the visual tokenizer.

This model is trained in an autoregressive manner with causal attention.
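If you prefer scripting these downloads, a minimal sketch using `huggingface_hub` is shown below. The repo ids are assumptions based on the public Emu3 releases; substitute whichever checkpoints are listed above.

```python
from huggingface_hub import snapshot_download

# Assumed repo ids for the Emu3 base (text) model and the Emu3 vision (VQ) tokenizer;
# adjust them to match the checkpoints referenced above.
snapshot_download(repo_id="BAAI/Emu3-Base", local_dir="./pretrain/Emu3-Base")
snapshot_download(repo_id="BAAI/Emu3-VisionTokenizer", local_dir="./pretrain/Emu3-VisionTokenizer")
```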
# 1. process the dataset
python tools/process/calvin_process.py
# 2. extract the VQ tokens (adjust the dataset & output paths)
bash scripts/tokenizer/extract_vq_emu3.sh
# 3. pickle generation for training
python tools/pickle_gen/pickle_generation_calvin.py

You can fit the FAST action tokenizer on the corresponding dataset, and you can adjust the tokenizer scale for more fine-grained action tokenization. For CALVIN ABCD→D, we set the action chunk size to 10.
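For illustration, here is a minimal sketch of what this fitting step might look like, assuming the public FAST processor (`physical-intelligence/fast`) with its `fit()` / `save_pretrained()` interface; the repository's `tools/action_tokenizer/fit_fast.py` (next command) is the authoritative version, and the tokenizer scale mentioned above is an additional knob exposed there.

```python
import numpy as np
from transformers import AutoProcessor

# Load the pretrained universal FAST action tokenizer (assumed HF repo id).
fast = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)

# Placeholder action chunks of shape (num_chunks, chunk_size, action_dim);
# for CALVIN ABCD->D we use chunk_size=10. Replace with real normalized CALVIN actions.
action_chunks = np.random.uniform(-1.0, 1.0, size=(1000, 10, 7)).astype(np.float32)

# Fit a dataset-specific tokenizer and save it so training can point
# --action_tokenizer_path at the resulting directory.
calvin_fast = fast.fit(action_chunks)
calvin_fast.save_pretrained("./pretrain/fast_calvin_abcd_a10_s50")
```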
python tools/action_tokenizer/fit_fast.py

To launch training on CALVIN ABCD→D:

bash scripts/simulator/calvin/train_calvin_abcd_video_i-ia_bid_mi.sh

We recommend at least 4×80 GB GPUs (e.g., A100/H100 80 GB). Each sample contains many image tokens, which results in long sequences and high memory usage.
We also release our checkpoint finetuned on CALVIN ABCD→D at UD-VLA_CALVIN.
# train_calvin_abcd_video_i-ia_bid_mi.sh
WORLD_SIZE=${WORLD_SIZE:-1}
RANK=${RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}
MASTER_PORT=${MASTER_PORT:-23456}
NGPUS=4
MODEL_PATH="logs/ckpts/WORLD_MODEL_POSTTRAIN"
DATAPATH='/share/user/iperror/data/udvla/processed_data/meta/calvin_abcd_norm.pkl'
ACTION_TOKENIZER_PATH="./pretrain/fast_calvin_abcd_a10_s50"
EXP_NAME="UNIVLA_CALVIN_ABCD_VIDEO_BS64_32k_I2IA_mi_0915"
export PYTHONPATH=$(pwd)
torchrun \
--nproc_per_node=${NGPUS} \
--nnodes=1 \
--node_rank=${RANK} \
train/train_moe.py \
--model_name_or_path ${MODEL_PATH} \
--model_config_path configs/moe_fast_video.json \
--deepspeed scripts/sft/zero3.json \
--output_dir "logs/"${EXP_NAME} \
--learning_rate 8e-5 \
--null_prompt_prob 0.15 \
--weight_decay 0.1 \
--min_learning_rate 5e-6 \
--max_grad_norm 5.0 \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--adam_epsilon 1e-6 \
--bf16 True \
--tf32 True \
--data_path ${DATAPATH} \
--max_steps 24000 \
--dataloader_num_workers 16 \
--lr_scheduler_type "cosine_with_min_lr" \
--warmup_steps 50 \
--per_device_train_batch_size 4 \
--frames 2 \
--action_frames 10 \
--max_position_embeddings 1650 \
--seed 42 \
--logging_steps 20 \
--gradient_checkpointing True \
--gradient_accumulation_steps 4 \
--save_strategy steps \
--save_steps 3000 \
--eval_strategy no \
--apply_loss_on_only_vision False \
--apply_loss_on_only_action False \
--actions True \
--actions_format "fast" \
--use_gripper True \
--video_format "interleave" \
--action_tokenizer_path ${ACTION_TOKENIZER_PATH} \
--with_i_ia True \
--mask_image True \
--use_blockwise_attn_mask True \
--attn_type "None"
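As a quick sanity check on this configuration, the effective batch size implied by the script (and reflected in the BS64 tag of EXP_NAME) works out as follows, assuming single-node training with the NGPUS=4 set above:

```python
# Effective batch size = per-device batch size x #GPUs x gradient accumulation steps.
per_device_train_batch_size = 4
ngpus = 4
gradient_accumulation_steps = 4

effective_batch_size = per_device_train_batch_size * ngpus * gradient_accumulation_steps
print(effective_batch_size)          # 64 samples per optimizer step
print(effective_batch_size * 24000)  # ~1.54M samples seen over max_steps=24000
```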
Key training flags

- `--with_i_ia True`: enables joint generation of *future image tokens* and *action tokens* in the same denoising step (JD3P).
- `--mask_image True`: applies `<MASK>` to future-image positions during training and computes cross-entropy only on the masked positions (single-step mask-predict objective).
- `--use_blockwise_attn_mask True`: uses blockwise hybrid attention: bidirectional within the future-image block and within the action block, causal across blocks (see the sketch after this list).
- `--attn_type "None"`: does not use `"flash"` (Flash-Attention).
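A minimal sketch of how such a blockwise hybrid mask can be constructed (illustration only; the actual implementation in the training code may differ in block layout and ordering):

```python
import torch

def blockwise_hybrid_mask(prefix_len: int, image_len: int, action_len: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) over [prefix | future-image | action] tokens.

    - Causal within the prefix (language / observation tokens).
    - Bidirectional within the future-image block and within the action block.
    - Causal across blocks: each block attends to earlier blocks, never to later ones.
    """
    total = prefix_len + image_len + action_len
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # start fully causal

    img = slice(prefix_len, prefix_len + image_len)
    act = slice(prefix_len + image_len, total)
    mask[img, img] = True  # bidirectional inside the future-image block
    mask[act, act] = True  # bidirectional inside the action block
    return mask

# Example: 6 prefix tokens, 4 future-image tokens, 3 action tokens.
print(blockwise_hybrid_mask(6, 4, 3).int())
```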
You can also download our checkpoint finetuned on CALVIN ABCD→D at UD-VLA_CALVIN.
cd reference/RoboVLMs
# Inference on 4 GPUs; we set the number of diffusion steps to 72.
bash scripts/run_eval_calvin_univla_i2ia_dis.sh
# The command above writes 4 result files to the `results` folder; compute the final average score with:
python tools/evaluation/calvin_score.py

UniVLA* denotes the variant without historical frames, for fair comparison. We evaluate our model over 500 rollouts, where each rollout involves a sequence of 5 consecutive sub-tasks.
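For reference, a minimal sketch of the final averaging step, assuming each of the four runs writes a JSON file with its per-horizon success rates (the actual file layout consumed by `calvin_score.py` may differ):

```python
import json
from pathlib import Path

import numpy as np

# Hypothetical layout: results/*.json, each holding the success rates for completing
# 1..5 consecutive sub-tasks, e.g. {"success_rates": [0.99, 0.96, 0.93, 0.90, 0.84]}.
runs = [json.loads(p.read_text()) for p in sorted(Path("results").glob("*.json"))]
rates = np.mean([r["success_rates"] for r in runs], axis=0)

print("success rates (1-5 sub-tasks):", rates.round(3).tolist())
print("avg. len:", float(rates.sum().round(3)))  # average number of completed sub-tasks
```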
| Method | Task | 1 | 2 | 3 | 4 | 5 | Avg. Len ↑ |
|---|---|---|---|---|---|---|---|
| MCIL | ABCD→D | 0.373 | 0.027 | 0.002 | 0.000 | 0.000 | 0.40 |
| RT-1 | ABCD→D | 0.844 | 0.617 | 0.438 | 0.323 | 0.227 | 2.45 |
| Robo-Flamingo | ABCD→D | 0.964 | 0.896 | 0.824 | 0.740 | 0.660 | 4.09 |
| GR-1 | ABCD→D | 0.949 | 0.896 | 0.844 | 0.789 | 0.731 | 4.21 |
| ReconVLA | ABCD→D | 0.980 | 0.900 | 0.845 | 0.785 | 0.705 | 4.23 |
| UniVLA* | ABCD→D | 0.948 | 0.906 | 0.862 | 0.834 | 0.690 | 4.24 |
| UP-VLA | ABCD→D | 0.962 | 0.921 | 0.879 | 0.842 | 0.812 | 4.42 |
| UD-VLA (ours) | ABCD→D | 0.992 | 0.968 | 0.936 | 0.904 | 0.840 | 4.64 |
Our real-world setup consists of a 6-DoF UR5e robotic arm equipped with a 6-DoF Inspire RH56E2 robotic hand for dexterous manipulation.

We thank UniVLA, Emu3, RoboVLMs, and Show-o for their open-source work!
We thank Yuqi Wang and Zhide Zhong for their guidance on the experiments!
If you find UD-VLA useful, please consider citing our work🤗:
@article{udvla2025,
title={Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process},
author={Jiayi Chen and Wenxuan Song and Pengxiang Ding and Ziyang Zhou and Han Zhao and Feilong Tang and Donglin Wang and Haoang Li},
year={2025},
journal={arXiv preprint arXiv:2511.01718}
}