Liyun Zhu<sup>1,2</sup>, Qixiang Chen<sup>1</sup>, Xi Shen<sup>3</sup>, Xiaodong Cun<sup>2</sup>

<sup>1</sup>Australian National University&nbsp;&nbsp;<sup>2</sup>GVC Lab, Great Bay University&nbsp;&nbsp;<sup>3</sup>Intellindust
🌐 Project Website | 📑 Paper | 🤗 Data
This repository contains the official implementation of our paper: VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning.
We propose VAU-R1, a Reinforcement Fine-Tuning (RFT) framework that improves the reasoning ability of MLLMs for video anomaly understanding (VAU). Specifically, we adopt Group Relative Policy Optimization (GRPO) to optimize the model with task-specific rewards, such as answer format, accuracy, and temporal Intersection-over-Union (IoU). We decompose the VAU task into four complementary tasks to facilitate comprehensive reasoning: multiple-choice QA, temporal anomaly grounding, anomaly reasoning, and anomaly classification.
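The sketch below illustrates the spirit of these three rewards; it is a simplified illustration, not the implementation in this repo, and the `<think>`/`<answer>` template and exact matching rules are assumptions on our side:

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>...</think><answer>...</answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

def accuracy_reward(response: str, gt_option: str) -> float:
    """1.0 if the option inside <answer>...</answer> matches the ground-truth option."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gt_option.strip() else 0.0

def temporal_iou_reward(pred: tuple, gt: tuple) -> float:
    """IoU of two (start, end) intervals in seconds, used for temporal anomaly grounding."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```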
We use Qwen2-VL and Qwen2.5-VL as our base models. We provide a simple installation:

```bash
pip install transformers
pip install qwen_vl_utils
```
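Below is a minimal inference sketch with the base model, following the public Qwen2.5-VL usage; the checkpoint name, video path, and prompt are illustrative, not tied to our released weights:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Illustrative checkpoint; substitute your own fine-tuned weights.
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4"},  # illustrative path
        {"type": "text", "text": "Is there an anomaly in this video? Explain your reasoning."},
    ],
}]

# Build the chat prompt and extract the video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```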
We construct VAU-Bench, a unified benchmark built from MSAD, UCF-Crime, and ECVA, enriched with Chain-of-Thought (CoT) annotations, including: (i) video descriptions, (ii) temporal boundaries, (iii) multiple-choice QA, and (iv) reasoning rationales.

Please download the original video files from UCF-Crime, ECVA, and MSAD for our experiments. Our Chain-of-Thought annotations for these three datasets can be found in the annotations/ folder or at the link here.
We use scripts/training/run_grpo_video_qa.sh to train the RFT model on the multiple-choice QA task:

```bash
sh scripts/training/run_grpo_video_qa.sh
```

We use scripts/training/run_grpo_video_tag.sh to train the RFT model on the temporal anomaly grounding task:

```bash
sh scripts/training/run_grpo_video_tag.sh
```

Please follow the evaluation scripts in the scripts/evaluation folder to evaluate model performance on the four tasks.
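The concrete output format is defined by those scripts; as a purely hypothetical illustration (the results file name and record fields below are assumptions), QA accuracy could be aggregated like this:

```python
import json

# Hypothetical results dump; the real format is defined by scripts/evaluation.
with open("outputs/qa_results.json") as f:
    records = json.load(f)

# Each record is assumed to store the predicted option and the ground-truth option.
correct = sum(r["prediction"] == r["answer"] for r in records)
print(f"Multiple-choice QA accuracy: {correct / len(records):.3f}")
```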
If you find our VAU-R1 useful in your research, please consider citing our work or starring our repo.
```bibtex
@misc{zhu2025vaur1,
      title={VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning},
      author={Liyun Zhu and Qixiang Chen and Xi Shen and Xiaodong Cun},
      year={2025},
      eprint={2505.23504},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.23504},
}
```

This codebase is built on top of VideoChat-R1, and we thank the authors for their work.
