Yanghong Mei*1,5,
Yirong Yang*2,
Longteng Guo†1,
Qunbo Wang3,
Ming-Ming Yu2,
Xingjian He1,
Wenjun Wu2,4,
Jing Liu1,5,
1Institute of Automation, Chinese Academy of Sciences
2Beihang University
3Beijing Jiaotong University
4Hangzhou International Innovation Institute
5School of Artificial Intelligence, University of Chinese Academy of Sciences
- [12/19/2025]: UrbanNav dataset is now available! Check out our dataset on Hugging Face for more details.
UrbanNav is a large-scale urban navigation dataset automatically constructed from web data with Qwen2.5-VL. It comprises 47k trajectories and 3M language instructions, and is designed for training language-guided embodied urban navigation policies and for fair offline evaluation. Our navigation policy trained on UrbanNav achieves state-of-the-art performance both on the benchmark and in real-world deployment.
The UrbanNav navigation policy is deployed on a four-wheel differential-drive robot and validated in diverse and challenging outdoor urban environments.
conda create -n urbannav python=3.10
conda activate urbannav
pip install -r requirements.txt
You can easily prepare the UrbanNav dataset by following the steps below:
All YouTube video IDs used by UrbanNav are listed in the video list. Download these videos at 360p resolution and 30 FPS and place them in a single directory.
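Any downloader that supports format selection works here. As a non-authoritative sketch, the snippet below uses the yt-dlp Python API to fetch each listed ID at 360p / 30 FPS; the list file name video_list.txt, the output path, and the exact format selector are assumptions to adapt to your setup.

# Sketch: download the listed YouTube videos at 360p / 30 FPS with yt-dlp (pip install yt-dlp).
# Assumes a plain-text file with one video ID per line; adjust the paths to your layout.
from pathlib import Path
from yt_dlp import YoutubeDL

video_ids = [line.strip() for line in Path("video_list.txt").read_text().splitlines() if line.strip()]

ydl_opts = {
    # Prefer a 360p video stream at <=30 FPS; fall back to the best stream capped at 360p.
    "format": "bv*[height<=360][fps<=30]/b[height<=360]",
    "outtmpl": "/path/to/videos/%(id)s.%(ext)s",  # keep all videos in a single directory
    "ignoreerrors": True,  # skip videos that are no longer available
}

with YoutubeDL(ydl_opts) as ydl:
    ydl.download([f"https://www.youtube.com/watch?v={vid}" for vid in video_ids])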
The trajectory data and instruction annotations are publicly available on Hugging Face. Please download annos.tar.gz and extract it:
wget https://huggingface.co/datasets/Vigar001/UrbanNav/resolve/main/annos.tar.gz
tar -xzf annos.tar.gz
Use split_video_parallel.py to split the raw videos into 120-second segments in parallel. After completion, the original videos can be safely deleted to save storage space.
python scripts/split_video_parallel.py \
--video-dir /path/to/videos \
--output-dir /path/to/video_clips \
--workers 32
Use extract_video_frames.py to extract frames in parallel from each video clip. UrbanNav samples frames at 1 FPS; if your downloaded videos are recorded at 30 FPS, set --stride 30 to align the extracted frames with our labels.
python scripts/extract_video_frames.py \
--input_dir /path/to/video_clips \
--output-dir /path/to/data_dir \
--stride 30 \
--workers 32
This is the final step in preparing the UrbanNav dataset! Run merge_annotations.py to copy annotation files into their corresponding trajectory folders.
python scripts/merge_annotations.py \
--data-dir /path/to/data_dir \
--anno-dir /path/to/annotation_dir
After running the script, your data will have the following structure:
UrbanNav/data
├── <video_name_0000>
│   ├── 0000.jpg
│   ├── 0001.jpg
│   ├── ...
│   ├── T_1.jpg
│   ├── traj_data.pkl
│   └── label.json
├── <video_name_0001>
│   ├── 0000.jpg
│   ├── 0001.jpg
│   ├── ...
│   ├── T_2.jpg
│   ├── traj_data.pkl
│   └── label.json
├── ...
└── <video_name_N>
    ├── 0000.jpg
    ├── 0001.jpg
    ├── ...
    ├── T_N.jpg
    ├── traj_data.pkl
    └── label.json
Note: Approximately 50% of the data were filtered out by the data cleaning pipeline and therefore do not have annotations. The filtered trajectories are listed in filtered_trajs.txt (generated by merge_annotations.py) and can be safely deleted to free up storage space.
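If you want to reclaim that space immediately, the short sketch below removes the folders listed in filtered_trajs.txt; it assumes the file contains one trajectory folder name per line and sits in your data directory, so adjust the paths if your layout differs.

# Sketch: delete the trajectory folders that were filtered out by the cleaning pipeline.
# Assumes filtered_trajs.txt lists one folder name per line, relative to data_dir.
import shutil
from pathlib import Path

data_dir = Path("/path/to/data_dir")              # same directory used in the steps above
filtered_list = data_dir / "filtered_trajs.txt"   # location is an assumption; adjust if needed

for name in filtered_list.read_text().splitlines():
    traj_dir = data_dir / name.strip()
    if name.strip() and traj_dir.is_dir():
        shutil.rmtree(traj_dir)                   # safe to remove: these folders have no annotations
        print(f"removed {traj_dir}")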
For a quick start, use urbannav_train.sh to train the navigation policy on the UrbanNav dataset,
bash launch/urbannav_train.sh
or use torchrun to customize the training configuration:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NUM_NODES=1
export CURR_NODE_RANK=0
export WORLD_SIZE=8
torchrun \
--nnodes=${NUM_NODES} \
--nproc-per-node=${WORLD_SIZE} \
--node-rank=${CURR_NODE_RANK} \
--rdzv-backend=c10d \
--rdzv-endpoint=localhost:12333 \
train.py --config configs/urbannav_train.yaml
You can also use a single GPU for debugging:
python train.py -c configs/urbannav_debug.yaml
We strongly recommend pre-extracting and caching the visual features before training. This significantly improves training efficiency and reduces GPU memory usage.
Use extract_features_cache.py to extract image features and save them in LMDB format:
python scripts/extract_features_cache.py \
--data-dir /path/to/data_dir \
--cache-dir /path/to/cache_dir \
--model-name <model_name> \
--gpus <gpu_ids>
Then, update the data/feat_file parameter in configs/urbannav_train.yaml to point to your LMDB cache path, and you're ready to start efficient training.
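Before launching a long run, it can be worth sanity-checking the cache. The sketch below opens the LMDB environment read-only and reports how many entries it holds; the cache path is a placeholder and the key/value layout shown is an assumption of this sketch, not something the script guarantees.

# Sketch: quickly inspect the LMDB feature cache produced by extract_features_cache.py.
# Pass subdir=False to lmdb.open if your cache is a single file rather than a directory.
import lmdb

cache_path = "/path/to/cache_dir"  # the same path you passed to --cache-dir

env = lmdb.open(cache_path, readonly=True, lock=False)
with env.begin() as txn:
    print("cached entries:", txn.stat()["entries"])   # number of stored records
    for key, value in txn.cursor():                   # peek at the first record
        print("first key:", key.decode(errors="replace"), "| value bytes:", len(value))
        break
env.close()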
Use test.py to evaluate the model on a single GPU.
python test.py \
--checkpoint /path/to/checkpoint \
--batch-size <test_batch_size> \
--gpu 0
The result.json will be saved in the project directory corresponding to the checkpoint. We plan to enhance the script to support multi-GPU parallel evaluation in the future.
If you find this repository or our paper useful, please consider starring this repository and citing our paper:
@misc{mei2025urbannavlearninglanguageguidedurban,
title={UrbanNav: Learning Language-Guided Urban Navigation from Web-Scale Human Trajectories},
author={Yanghong Mei and Yirong Yang and Longteng Guo and Qunbo Wang and Ming-Ming Yu and Xingjian He and Wenjun Wu and Jing Liu},
year={2025},
eprint={2512.09607},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2512.09607},
}
