This is the official repository for the paper:
MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
Ting Huang*, Dongjian Li*, Rui Yang*, Zeyu Zhang*†, Zida Yang, and Hao Tang#
*Equal contribution. †Project lead. #Corresponding author.
If you find our code or paper helpful, please consider starring ⭐ us and citing:
@article{huang2025mobilevla,
title={MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots},
author={Huang, Ting and Li, Dongjian and Yang, Rui and Zhang, Zeyu and Yang, Zida and Tang, Hao},
journal={arXiv preprint arXiv:2511.17889},
year={2025}
}

MobileVLA-R1 enables robust real-world quadruped control by unifying language reasoning and continuous action through structured CoT alignment and GRPO training.
Grounding natural-language instructions into continuous control for quadruped robots remains a fundamental challenge in vision-language-action learning. Existing methods struggle to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world. To address these issues, we present MobileVLA-R1, a unified vision–language–action framework that enables explicit reasoning and continuous control for quadruped robots. We construct MobileVLA-CoT, a large-scale dataset of multi-granularity chain-of-thought (CoT) annotations for embodied trajectories, providing structured reasoning supervision for alignment. Built on this foundation, we introduce a two-stage training paradigm that combines supervised CoT alignment with GRPO reinforcement learning to enhance reasoning consistency, control stability, and long-horizon execution. Extensive evaluations on VLN and VLA tasks demonstrate superior performance over strong baselines, with approximately a 5% improvement. Real-world deployment on a quadruped robot validates robust performance in complex environments.
2025/12/05: 📣 Our paper has been promoted by AI Era.
2025/11/30: 🔔 Our paper has been promoted by Embodied Intelligent Mind.
2025/11/27: 🎉 Our paper has been shared by AK.
- Upload our paper to arXiv and build project pages.
- Release MobileVLA-CoT dataset.
- Upload the code.
Our pipeline expects three synchronized modalities per observation: RGB frames (MP3D skybox crops), Depth Anything v2 maps, and point clouds derived from the depth maps. The default dataset used in this repo is Nav_CoT_FINAL_38K.jsonl, which augments R2R/RxR trajectories with CoT reasoning.
- Download CoT annotations
wget https://your-storage/Nav_CoT_FINAL_38K.jsonl -O ./Nav_CoT_FINAL_38K.jsonl
- Extract RGB frames
- Clone Matterport3D scans or reuse the official MP3D release.
- Create a root folder (e.g. /root/autodl-tmp/dataset/NavCoT/frames) that mirrors the path structure in the JSONL file. The loader automatically rewrites Windows-style paths via navcot_image_root.
- Generate Depth Anything v2 maps
- Run Depth Anything v2 on each RGB frame and save the outputs (png, npy, or pt) under /root/autodl-tmp/dataset/NavCoT/depth.
- Name each depth file with the same basename as its RGB frame so the loader can resolve it.
- (Optional) Pre-compute point clouds
- The training code can derive point clouds on the fly from depth maps (a minimal back-projection sketch follows this list). If you have higher-quality .npy point sets, place them next to your depth files and pass their path through navcot_use_point.
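For reference, deriving a point cloud from a depth map is a simple pinhole back-projection. The snippet below is only a minimal sketch, not the repo's loader: the intrinsics (fx, fy, cx, cy), the example file name, and the 8192-point subsample are placeholder assumptions; substitute the parameters of the camera that produced your frames and the value you pass to --navcot_pointcloud_points.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, num_points=8192):
    """Back-project an (H, W) metric depth map into an (N, 3) point cloud.

    fx/fy/cx/cy are pinhole intrinsics; the values used below are
    placeholders, not the repo's defaults.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grid (column, row)
    z = depth.astype(np.float32)
    x = (u - cx) * z / fx                            # back-project X
    y = (v - cy) * z / fy                            # back-project Y
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    points = points[points[:, 2] > 0]                # drop invalid depth
    if len(points) > num_points:                     # uniform subsample
        idx = np.random.choice(len(points), num_points, replace=False)
        points = points[idx]
    return points

# Example with a hypothetical .npy depth file saved next to its RGB frame
depth = np.load("/root/autodl-tmp/dataset/NavCoT/depth/frame_000.npy")
cloud = depth_to_pointcloud(depth, fx=320.0, fy=320.0, cx=320.0, cy=240.0)
```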
Key CLI arguments controlling the data loader live in llava/train/args.py, notably:
--navcot_image_root, --navcot_depth_root, --navcot_depth_format, --navcot_depth_scale, --navcot_use_depth, --navcot_use_point, --navcot_pointcloud_points, --navcot_depth_frames
These options also apply to GRPO generation via environment variables (see below).
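Before launching training, it is worth verifying that every frame referenced in the JSONL resolves to an RGB file under the image root and a depth file with the same basename under the depth root, as required above. The sketch below assumes the JSONL paths are relative to those roots and that each record lists its frames under an images key; that key name is a guess, so inspect one record and adjust it to the actual schema.

```python
import json
from pathlib import Path

IMAGE_ROOT = Path("/root/autodl-tmp/dataset/NavCoT/frames")
DEPTH_ROOT = Path("/root/autodl-tmp/dataset/NavCoT/depth")
DEPTH_EXTS = (".png", ".npy", ".pt")  # formats selectable via --navcot_depth_format

missing = []
with open("Nav_CoT_FINAL_38K.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # "images" is an assumed field name; confirm it against one record.
        for rel in record.get("images", []):
            rel = rel.replace("\\", "/")      # normalize Windows-style separators
            rgb = IMAGE_ROOT / rel
            stem = Path(rel).with_suffix("")  # depth shares the RGB basename
            has_depth = any((DEPTH_ROOT / stem).with_suffix(ext).exists()
                            for ext in DEPTH_EXTS)
            if not rgb.exists() or not has_depth:
                missing.append(rel)

print(f"{len(missing)} frames missing an RGB or depth file")
```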
The repo ships with a helper script that creates a compatible Conda environment, installs CUDA/FlashAttention2, links Hugging Face Transformers replacements, and installs all train/eval extras. Run it once per machine:
# Optional: pass a conda env name, otherwise it assumes you already activated one.
bash environment_setup.sh mobilevla

The script will:
- Create/activate a Python 3.10 environment (if you passed a name).
- Upgrade pip, install cuda-toolkit, FlashAttention 2.5.8 (CUDA 12.2, torch 2.3 build), and this project in editable mode with [train]/[eval] extras.
- Pull transformers==4.37.2 from source and copy our patched files from llava/train/{transformers_replace,deepspeed_replace} into your site-packages so the long-context + sequence-parallel features work.
If you manage environments manually, replicate the same steps (torch 2.3 + CUDA 12.2, FlashAttention2, transformers 4.37.2 with the provided patches) before launching the SFT/GRPO jobs.
Run the provided LoRA script, which already enables the depth/point encoders and mixes MobileVLA-CoT with the Nav-CoT JSONL:
WANDB_MODE=offline \
bash scripts/train/sft_8frames.sh \
--data_mixture cot+nav_cot_vln \
--navcot_image_root /root/autodl-tmp/dataset/NavCoT/frames \
--navcot_depth_root /root/autodl-tmp/dataset/NavCoT/depth \
--navcot_use_depth True \
--navcot_use_point True

Feel free to override any argument defined inside scripts/train/sft_8frames.sh for your cluster (batch size, LoRA rank, dataset paths, etc.).
- Reference model server
python ref_server.py
- Policy fine-tuning with DeepSpeed
deepspeed train.py
- Generative worker with multi-modal inputs
GEN_DATASET=/root/autodl-tmp/Nav_CoT_FINAL_38K.jsonl \
NAVCOT_IMAGE_ROOT=/root/autodl-tmp/dataset/NavCoT/frames \
NAVCOT_DEPTH_ROOT=/root/autodl-tmp/dataset/NavCoT/depth \
NAVCOT_USE_POINT=1 \
python gen_worker.py
The worker instantiates NaVILAImageInference, streams RGB/Depth/Point payloads into the policy, scores candidates via the reward calculator, and pushes normalized rewards plus log-probs back to the reference server. train.py consumes those batches and performs GRPO updates with periodic model refreshes for the generator.
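For context on what the normalized rewards mean here: GRPO (as introduced in DeepSeek-Math) scores a group of sampled candidates per prompt and standardizes each candidate's reward within its group, which replaces a learned value baseline. The snippet below is only an illustrative sketch of that normalization step, not the code in train.py.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) raw rewards of sampled candidates.

    GRPO uses group statistics as the baseline:
    advantage = (reward - mean(group)) / std(group).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 candidate trajectories each
rewards = torch.tensor([[1.0, 0.2, 0.8, 0.0],
                        [0.5, 0.5, 0.9, 0.1]])
adv = group_relative_advantages(rewards)   # zero-mean within each group
```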
We thank the authors of Qwen, NaVILA and DeepSeek-Math for their open-source code.
