This is the official repository for the paper:
MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
Ting Huang*, Dongjian Li*, Rui Yang*, Zeyu Zhang*†, Zida Yang, and Hao Tang#
*Equal contribution. †Project lead. #Corresponding author.
If you find our code or paper helpful, please consider starring ⭐ us and citing:
@article{huang2025mobilevla,
title={MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots},
author={Huang, Ting and Li, Dongjian and Yang, Rui and Zhang, Zeyu and Yang, Zida and Tang, Hao},
journal={arXiv preprint arXiv:2511.17889},
year={2025}
}

MobileVLA-R1 enables robust real-world quadruped control by unifying language reasoning and continuous action through structured CoT alignment and GRPO training.
Grounding natural-language instructions into continuous control for quadruped robots remains a fundamental challenge in vision-language-action learning. Existing methods struggle to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world. To address these issues, we present MobileVLA-R1, a unified vision–language–action framework that enables explicit reasoning and continuous control for quadruped robots. We construct MobileVLA-CoT, a large-scale dataset of multi-granularity chain-of-thought (CoT) annotations for embodied trajectories, providing structured reasoning supervision for alignment. Built on this foundation, we introduce a two-stage training paradigm that combines supervised CoT alignment with GRPO reinforcement learning to enhance reasoning consistency, control stability, and long-horizon execution. Extensive evaluations on VLN and VLA tasks demonstrate superior performance over strong baselines, with approximately a 5% improvement. Real-world deployment on a quadruped robot validates robust performance in complex environments.
2025/12/05: 📣 Our paper has been promoted by AI Era.
2025/11/30: 🔔 Our paper has been promoted by Embodied Intelligent Mind.
2025/11/27: 🎉 Our paper has been shared by AK.
- Upload our paper to arXiv and build project pages.
- Release MobileVLA-CoT dataset.
- Upload the code.
Our pipeline expects three synchronized modalities per observation: RGB frames (MP3D skybox crops), Depth Anything v2 maps, and point clouds derived from the depth maps. The default dataset used in this repo is Nav_CoT_FINAL_38K.jsonl, which augments R2R/RxR trajectories with CoT reasoning.
- Download CoT annotations
wget https://your-storage/Nav_CoT_FINAL_38K.jsonl -O ./Nav_CoT_FINAL_38K.jsonl
- Extract RGB frames
- Clone Matterport3D scans or reuse the official MP3D release.
- Create a root folder (e.g. /root/autodl-tmp/dataset/NavCoT/frames) that mirrors the path structure in the JSONL file. The loader automatically rewrites Windows-style paths via navcot_image_root.
- Generate Depth Anything v2 maps
- Run Depth Anything v2 on each RGB frame and save the outputs (png, npy, or pt) under /root/autodl-tmp/dataset/NavCoT/depth.
- Name each depth file with the same basename as its RGB frame so the loader can resolve it.
- (Optional) Pre-compute point clouds
- The training code can derive point clouds on the fly from depth maps (a minimal back-projection sketch follows this list). If you have higher-quality .npy point sets, place them next to your depth files and pass their path through navcot_use_point.
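For reference, deriving a point cloud from a depth map is a simple pinhole back-projection. The snippet below is only a minimal sketch, not the repo's loader: the intrinsics (fx, fy, cx, cy), the example file name, and the 8192-point subsample are placeholder assumptions; substitute the parameters of the camera that produced your frames and the value you pass to --navcot_pointcloud_points.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, num_points=8192):
    """Back-project an (H, W) metric depth map into an (N, 3) point cloud.

    fx/fy/cx/cy are pinhole intrinsics; the values used below are
    placeholders, not the repo's defaults.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grid (column, row)
    z = depth.astype(np.float32)
    x = (u - cx) * z / fx                            # back-project X
    y = (v - cy) * z / fy                            # back-project Y
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    points = points[points[:, 2] > 0]                # drop invalid depth
    if len(points) > num_points:                     # uniform subsample
        idx = np.random.choice(len(points), num_points, replace=False)
        points = points[idx]
    return points

# Example with a hypothetical .npy depth file saved next to its RGB frame
depth = np.load("/root/autodl-tmp/dataset/NavCoT/depth/frame_000.npy")
cloud = depth_to_pointcloud(depth, fx=320.0, fy=320.0, cx=320.0, cy=240.0)
```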
Key CLI arguments controlling the data loader live in llava/train/args.py, notably:
--navcot_image_root, --navcot_depth_root, --navcot_depth_format, --navcot_depth_scale, --navcot_use_depth, --navcot_use_point, --navcot_pointcloud_points, --navcot_depth_frames
These options also apply to GRPO generation via environment variables (see below).
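Before launching training, it is worth verifying that every frame referenced in the JSONL resolves to an RGB file under the image root and a depth file with the same basename under the depth root, as required above. The sketch below assumes the JSONL paths are relative to those roots and that each record lists its frames under an images key; that key name is a guess, so inspect one record and adjust it to the actual schema.

```python
import json
from pathlib import Path

IMAGE_ROOT = Path("/root/autodl-tmp/dataset/NavCoT/frames")
DEPTH_ROOT = Path("/root/autodl-tmp/dataset/NavCoT/depth")
DEPTH_EXTS = (".png", ".npy", ".pt")  # formats selectable via --navcot_depth_format

missing = []
with open("Nav_CoT_FINAL_38K.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # "images" is an assumed field name; confirm it against one record.
        for rel in record.get("images", []):
            rel = rel.replace("\\", "/")      # normalize Windows-style separators
            rgb = IMAGE_ROOT / rel
            stem = Path(rel).with_suffix("")  # depth shares the RGB basename
            has_depth = any((DEPTH_ROOT / stem).with_suffix(ext).exists()
                            for ext in DEPTH_EXTS)
            if not rgb.exists() or not has_depth:
                missing.append(rel)

print(f"{len(missing)} frames missing an RGB or depth file")
```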
The repo ships with a helper script that creates a compatible Conda environment, installs CUDA/FlashAttention2, links Hugging Face Transformers replacements, and installs all train/eval extras. Run it once per machine:
# Optional: pass a conda env name, otherwise it assumes you already activated one.
bash environment_setup.sh mobilevla

The script will:
- Create/activate a Python 3.10 environment (if you passed a name).
- Upgrade pip, install cuda-toolkit, FlashAttention 2.5.8 (CUDA 12.2, torch 2.3 build), and this project in editable mode with [train]/[eval] extras.
- Pull transformers==4.37.2 from source and copy our patched files from llava/train/{transformers_replace,deepspeed_replace} into your site-packages so the long-context + sequence-parallel features work.
If you manage environments manually, replicate the same steps (torch 2.3 + CUDA 12.2, FlashAttention2, transformers 4.37.2 with the provided patches) before launching the SFT/GRPO jobs.
Run the provided LoRA script, which already enables the depth/point encoders and mixes MobileVLA-CoT with the Nav-CoT JSONL:
WANDB_MODE=offline \
bash scripts/train/sft_8frames.sh \
--data_mixture cot+nav_cot_vln \
--navcot_image_root /root/autodl-tmp/dataset/NavCoT/frames \
--navcot_depth_root /root/autodl-tmp/dataset/NavCoT/depth \
--navcot_use_depth True \
--navcot_use_point True

Feel free to override any argument defined inside scripts/train/sft_8frames.sh for your cluster (batch size, LoRA rank, dataset paths, etc.).
- Reference model server
python ref_server.py
- Policy fine-tuning with DeepSpeed
deepspeed train.py
- Generative worker with multi-modal inputs
GEN_DATASET=/root/autodl-tmp/Nav_CoT_FINAL_38K.jsonl \
NAVCOT_IMAGE_ROOT=/root/autodl-tmp/dataset/NavCoT/frames \
NAVCOT_DEPTH_ROOT=/root/autodl-tmp/dataset/NavCoT/depth \
NAVCOT_USE_POINT=1 \
python gen_worker.py
The worker instantiates NaVILAImageInference, streams RGB/Depth/Point payloads into the policy, scores candidates via the reward calculator, and pushes normalized rewards plus log-probs back to the reference server. train.py consumes those batches and performs GRPO updates with periodic model refreshes for the generator.
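For context on what the normalized rewards mean here: GRPO (as introduced in DeepSeek-Math) scores a group of sampled candidates per prompt and standardizes each candidate's reward within its group, which replaces a learned value baseline. The snippet below is only an illustrative sketch of that normalization step, not the code in train.py.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) raw rewards of sampled candidates.

    GRPO uses group statistics as the baseline:
    advantage = (reward - mean(group)) / std(group).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 candidate trajectories each
rewards = torch.tensor([[1.0, 0.2, 0.8, 0.0],
                        [0.5, 0.5, 0.9, 0.1]])
adv = group_relative_advantages(rewards)   # zero-mean within each group
```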
We thank the authors of Qwen, NaVILA and DeepSeek-Math for their open-source code.
