Keep Updating...
This repository is the official implementation of Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task (NeurIPS 2025 main track).
- Release all video tools and test scripts
- Release toolchain algorithm (STAR)
- Release evaluation scripts
In this work, we equip the MLLM with a comprehensive and extensible Video Toolkit to enhance its spatiotemporal reasoning capabilities while balancing the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench.
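The scheduling idea above (temporal tools first narrow the time span, then spatial tools narrow the region, then the model answers) can be sketched as follows. This is an illustrative sketch only: the function `star_answer` and the tool call signatures are hypothetical and do not reflect this repository's actual API.

```python
# Hypothetical sketch of the STAR scheduling loop described above.
# Tool names and signatures are illustrative, not the repository's API.

def star_answer(question, video, temporal_tools, spatial_tools, llm):
    """Progressively localize the key area: temporal tools first shrink the
    time segment, then spatial tools shrink the region, then the LLM answers."""
    segment = (0.0, video["duration"])           # start from the whole video
    for tool in temporal_tools:                  # temporal localization stage
        segment = tool(question, video, segment)
    region = None
    for tool in spatial_tools:                   # spatial localization stage
        region = tool(question, video, segment, region)
    return llm(question, video, segment, region)
```

The point of the fixed temporal-then-spatial ordering is to avoid toolchain shortcuts: a spatial tool only ever runs on an already-narrowed segment.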
- Clone the repository 📦:

```shell
git clone git@github.com:fansunqi/VideoTool.git
cd ToolChainVideo
```
- Create a virtual environment 🧹 and install the dependencies 🧑‍🍳:

```shell
conda create -n videotool python=3.9
conda activate videotool
pip install -r requirements.txt
```
- Set up your API key 🗝️ in `config/*.yaml`:

```yaml
openai:
  GPT_API_KEY: "put your openai api key here"
  PROXY: "put your openai base url here"
```
- Build related projects 🧩:

```shell
mkdir projects
cd projects
```
- Download Grounded-Video-LLM for temporal grounding and temporal QA:

```shell
git clone git@github.com:WHB139426/Grounded-Video-LLM.git
```
- Build LLaVA for image QA:

```shell
git clone git@github.com:fansunqi/LLaVA.git
cd LLaVA
pip install -e .
cd ..
```
Thanks to the authors of these excellent open-source projects.
Temporal Tools:
- Frame Selector
  - Selects frames of interest based on the current information, driven by an LLM.
- Temporal Grounding
- Grounded-Video-LLM-7B: https://github.com/WHB139426/Grounded-Video-LLM
- Temporal Referring
- Grounded-Video-LLM-7B: https://github.com/WHB139426/Grounded-Video-LLM
- Temporal QA
- Grounded-Video-LLM-7B: https://github.com/WHB139426/Grounded-Video-LLM
Spatial Tools:
- Object Tracking
- YOLO by ultralytics: https://github.com/ultralytics/ultralytics
- Image Captioning
- Image QA
Generalist Solution:
- Image Grid QA
- Image Grid QA driven by GPT-4o: https://github.com/microsoft/VLM-Video-Action-Localization
- Video QA
- Qwen2.5-VL-7B: https://github.com/QwenLM/Qwen2.5-VL
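One simple way to keep a toolkit like this extensible is a name-to-function registry that the scheduler can query by category. The sketch below is purely illustrative: `TOOL_REGISTRY`, `register_tool`, and the placeholder `frame_selector` body are hypothetical and not this repository's code.

```python
# Hypothetical tool registry; tool names mirror the lists above, but the
# registration API is an illustration, not this repository's code.
TOOL_REGISTRY = {}

def register_tool(name, kind):
    """Decorator that files a tool under its category ('temporal'/'spatial'/...)."""
    def wrap(fn):
        TOOL_REGISTRY[name] = {"kind": kind, "fn": fn}
        return fn
    return wrap

@register_tool("frame_selector", kind="temporal")
def frame_selector(question, frames):
    # Placeholder: a real implementation would ask the LLM to pick frames.
    return frames[:4]

def tools_of_kind(kind):
    """List registered tool names in a given category."""
    return [name for name, meta in TOOL_REGISTRY.items() if meta["kind"] == kind]
```

A registry like this lets new temporal or spatial tools be added without touching the scheduling logic.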
- NExT-QA:

```shell
git clone git@github.com:doc-doc/NExT-QA.git
```

Then specify your data path in `config/nextqa.yaml`.
