Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Project Page | Paper
TL;DR: Taming Hallucinations introduces DualityForge, a controllable diffusion-based framework that turns real videos into counterfactual ones and automatically generates paired videos and QA data for contrastive training. Built on the large-scale DualityVidQA dataset and the proposed DNA-Train SFT-RL regime with ℓ1-normalized advantages, our approach reduces hallucinations in multimodal LLMs by 24% and generalizes well across benchmarks. Dataset and code will be released.
If you find this repository useful, please consider citing:
@article{huang2025taming,
title={Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation},
author={Huang, Zhe and Wen, Hao and Hao, Aiming and Song, Bingze and Wu, Meiqi and Wu, Jiahong and Chu, Xiangxiang and Lu, Sheng and Wang, Haoqian},
journal={arXiv preprint arXiv:2512.24271},
year={2025}
}