Yanghong Mei*1,5,
Yirong Yang*2,
Longteng Guo†1,
Qunbo Wang3,
Ming-Ming Yu2,
Xingjian He1,
Wenjun Wu2,4,
Jing Liu1,5,
1Institute of Automation, Chinese Academy of Sciences
2Beihang University
3Beijing Jiaotong University
4Hangzhou International Innovation Institute
5School of Artificial Intelligence, University of Chinese Academy of Sciences
- [12/19/2025]: UrbanNav dataset is now available! Check out our dataset on Hugging Face for more details.
UrbanNav is a large-scale urban navigation dataset automatically constructed from web data with Qwen2.5-VL. It comprises 47k trajectories and 3M language instructions, and is designed for training language-guided embodied urban navigation policies and for fair offline evaluation. Our navigation policy trained on UrbanNav achieves state-of-the-art performance both on the benchmark and in real-world deployment.
The UrbanNav navigation policy is deployed on a four-wheel differential-drive robot and validated in diverse and challenging outdoor urban environments.
conda create -n urbannav python=3.10
conda activate urbannav
pip install -r requirements.txt
You can easily prepare the UrbanNav dataset by following the steps below:
All YouTube video IDs used by UrbanNav are listed in the video list. Download these videos at 360p resolution and 30 FPS and place them in a single directory.
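Any downloader that supports format selection works here. As a non-authoritative sketch, the snippet below uses the yt-dlp Python API to fetch each listed ID at 360p / 30 FPS; the list file name video_list.txt, the output path, and the exact format selector are assumptions to adapt to your setup.

# Sketch: download the listed YouTube videos at 360p / 30 FPS with yt-dlp (pip install yt-dlp).
# Assumes a plain-text file with one video ID per line; adjust the paths to your layout.
from pathlib import Path
from yt_dlp import YoutubeDL

video_ids = [line.strip() for line in Path("video_list.txt").read_text().splitlines() if line.strip()]

ydl_opts = {
    # Prefer a 360p video stream at <=30 FPS; fall back to the best stream capped at 360p.
    "format": "bv*[height<=360][fps<=30]/b[height<=360]",
    "outtmpl": "/path/to/videos/%(id)s.%(ext)s",  # keep all videos in a single directory
    "ignoreerrors": True,  # skip videos that are no longer available
}

with YoutubeDL(ydl_opts) as ydl:
    ydl.download([f"https://www.youtube.com/watch?v={vid}" for vid in video_ids])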
The trajectory data and instruction annotations are publicly available on Hugging Face. Please download annos.tar.gz and extract it:
wget https://huggingface.co/datasets/Vigar001/UrbanNav/resolve/main/annos.tar.gz
tar -xzf annos.tar.gz
Use split_video_parallel.py to split the raw videos into 120-second segments in parallel. After completion, the original videos can be safely deleted to save storage space.
python scripts/split_video_parallel.py \
--video-dir /path/to/videos \
--output-dir /path/to/video_clips \
--workers 32
Use extract_video_frames.py to extract frames in parallel from each video clip. UrbanNav samples frames at 1 FPS; if your downloaded videos are recorded at 30 FPS, set --stride 30 to align the extracted frames with our labels.
python scripts/extract_video_frames.py \
--input_dir /path/to/video_clips \
--output-dir /path/to/data_dir \
--stride 30 \
--workers 32
This is the final step in preparing the UrbanNav dataset! Run merge_annotations.py to copy annotation files into their corresponding trajectory folders.
python scripts/merge_annotations.py \
--data-dir /path/to/data_dir \
--anno-dir /path/to/annotation_dir
After running the script, your data will have the following structure:
UrbanNav/data
├── <video_name_0000>
│   ├── 0000.jpg
│   ├── 0001.jpg
│   ├── ...
│   ├── T_1.jpg
│   ├── traj_data.pkl
│   └── label.json
├── <video_name_0001>
│   ├── 0000.jpg
│   ├── 0001.jpg
│   ├── ...
│   ├── T_2.jpg
│   ├── traj_data.pkl
│   └── label.json
├── ...
└── <video_name_N>
    ├── 0000.jpg
    ├── 0001.jpg
    ├── ...
    ├── T_N.jpg
    ├── traj_data.pkl
    └── label.json
Note: Approximately 50% of the data were filtered out by the data cleaning pipeline and therefore do not have annotations. The filtered trajectories are listed in filtered_trajs.txt (generated by merge_annotations.py) and can be safely deleted to free up storage space.
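If you want to reclaim that space immediately, the short sketch below removes the folders listed in filtered_trajs.txt; it assumes the file contains one trajectory folder name per line and sits in your data directory, so adjust the paths if your layout differs.

# Sketch: delete the trajectory folders that were filtered out by the cleaning pipeline.
# Assumes filtered_trajs.txt lists one folder name per line, relative to data_dir.
import shutil
from pathlib import Path

data_dir = Path("/path/to/data_dir")              # same directory used in the steps above
filtered_list = data_dir / "filtered_trajs.txt"   # location is an assumption; adjust if needed

for name in filtered_list.read_text().splitlines():
    traj_dir = data_dir / name.strip()
    if name.strip() and traj_dir.is_dir():
        shutil.rmtree(traj_dir)                   # safe to remove: these folders have no annotations
        print(f"removed {traj_dir}")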
For a quick start, use urbannav_train.sh to train the navigation policy on the UrbanNav dataset,
bash launch/urbannav_train.sh
or use torchrun to customize the training configuration:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NUM_NODES=1
export CURR_NODE_RANK=0
export WORLD_SIZE=8
torchrun \
--nnodes=${NUM_NODES} \
--nproc-per-node=${WORLD_SIZE} \
--node-rank=${CURR_NODE_RANK} \
--rdzv-backend=c10d \
--rdzv-endpoint=localhost:12333 \
train.py --config configs/urbannav_train.yaml
You can also use a single GPU for debugging:
python train.py -c configs/urbannav_debug.yaml
We strongly recommend pre-extracting and caching the visual features before training. This significantly improves training efficiency and reduces GPU memory usage.
Use extract_features_cache.py to extract image features and save them in LMDB format:
python scripts/extract_features_cache.py \
--data-dir /path/to/data_dir \
--cache-dir /path/to/cache_dir \
--model-name <model_name> \
--gpus <gpu_ids>
Then, update the data/feat_file parameter in configs/urbannav_train.yaml to point to your LMDB cache path, and you're ready to start efficient training.
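Before launching a long run, it can be worth sanity-checking the cache. The sketch below opens the LMDB environment read-only and reports how many entries it holds; the cache path is a placeholder and the key/value layout shown is an assumption of this sketch, not something the script guarantees.

# Sketch: quickly inspect the LMDB feature cache produced by extract_features_cache.py.
# Pass subdir=False to lmdb.open if your cache is a single file rather than a directory.
import lmdb

cache_path = "/path/to/cache_dir"  # the same path you passed to --cache-dir

env = lmdb.open(cache_path, readonly=True, lock=False)
with env.begin() as txn:
    print("cached entries:", txn.stat()["entries"])   # number of stored records
    for key, value in txn.cursor():                   # peek at the first record
        print("first key:", key.decode(errors="replace"), "| value bytes:", len(value))
        break
env.close()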
Use test.py to evaluate the model on a single GPU.
python test.py \
--checkpoint /path/to/checkpoint \
--batch-size <test_batch_size> \
--gpu 0
The result.json will be saved in the project directory corresponding to the checkpoint. We plan to enhance the script to support multi-GPU parallel evaluation in the future.
If you find this repository or our paper useful, please consider starring this repository and citing our paper:
@misc{mei2025urbannavlearninglanguageguidedurban,
title={UrbanNav: Learning Language-Guided Urban Navigation from Web-Scale Human Trajectories},
author={Yanghong Mei and Yirong Yang and Longteng Guo and Qunbo Wang and Ming-Ming Yu and Xingjian He and Wenjun Wu and Jing Liu},
year={2025},
eprint={2512.09607},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2512.09607},
}
