Keep Updating...
This repository is the official implementation of Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task (NeurIPS 2025 main track).
- Release all video tools and test scripts
- Release toolchain algorithm (STAR)
- Release evaluation scripts
In this work, we equip the MLLM with a comprehensive and extensible Video Toolkit to enhance its spatiotemporal reasoning capabilities while balancing the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench.
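The scheduling idea above (temporal tools first narrow the time span, then spatial tools narrow the region, then the model answers) can be sketched as follows. This is an illustrative sketch only: the function `star_answer` and the tool call signatures are hypothetical and do not reflect this repository's actual API.

```python
# Hypothetical sketch of the STAR scheduling loop described above.
# Tool names and signatures are illustrative, not the repository's API.

def star_answer(question, video, temporal_tools, spatial_tools, llm):
    """Progressively localize the key area: temporal tools first shrink the
    time segment, then spatial tools shrink the region, then the LLM answers."""
    segment = (0.0, video["duration"])           # start from the whole video
    for tool in temporal_tools:                  # temporal localization stage
        segment = tool(question, video, segment)
    region = None
    for tool in spatial_tools:                   # spatial localization stage
        region = tool(question, video, segment, region)
    return llm(question, video, segment, region)
```

The point of the fixed temporal-then-spatial ordering is to avoid toolchain shortcuts: a spatial tool only ever runs on an already-narrowed segment.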
- Clone the repository 📦:

```shell
git clone git@github.com:fansunqi/VideoTool.git
cd ToolChainVideo
```
- Create a virtual environment 🧹 and install the dependencies 🧑‍🍳:

```shell
conda create -n videotool python=3.9
conda activate videotool
pip install -r requirements.txt
```
- Set up your API key 🗝️ in `config/*.yaml`:

```yaml
openai:
  GPT_API_KEY: "put your openai api key here"
  PROXY: "put your openai base url here"
```
- Build related projects 🧩:

```shell
mkdir projects
cd projects
```
- Download Grounded-Video-LLM for temporal grounding and temporal QA:

```shell
git clone git@github.com:WHB139426/Grounded-Video-LLM.git
```
- Build LLaVA for image QA:

```shell
git clone git@github.com:fansunqi/LLaVA.git
cd LLaVA
pip install -e .
cd ..
```
Thanks to the authors of these excellent open-source projects.
Temporal Tools:
- Frame Selector
  - Selects frames of interest based on the current information, driven by an LLM.
- Temporal Grounding
- Grounded-Video-LLM-7B: https://github.com/WHB139426/Grounded-Video-LLM
- Temporal Referring
- Grounded-Video-LLM-7B: https://github.com/WHB139426/Grounded-Video-LLM
- Temporal QA
- Grounded-Video-LLM-7B: https://github.com/WHB139426/Grounded-Video-LLM
Spatial Tools:
- Object Tracking
- YOLO by ultralytics: https://github.com/ultralytics/ultralytics
- Image Captioning
- Image QA
Generalist Solution:
- Image Grid QA
- Image Grid QA driven by GPT-4o: https://github.com/microsoft/VLM-Video-Action-Localization
- Video QA
- Qwen2.5-VL-7B: https://github.com/QwenLM/Qwen2.5-VL
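One simple way to keep a toolkit like this extensible is a name-to-function registry that the scheduler can query by category. The sketch below is purely illustrative: `TOOL_REGISTRY`, `register_tool`, and the placeholder `frame_selector` body are hypothetical and not this repository's code.

```python
# Hypothetical tool registry; tool names mirror the lists above, but the
# registration API is an illustration, not this repository's code.
TOOL_REGISTRY = {}

def register_tool(name, kind):
    """Decorator that files a tool under its category ('temporal'/'spatial'/...)."""
    def wrap(fn):
        TOOL_REGISTRY[name] = {"kind": kind, "fn": fn}
        return fn
    return wrap

@register_tool("frame_selector", kind="temporal")
def frame_selector(question, frames):
    # Placeholder: a real implementation would ask the LLM to pick frames.
    return frames[:4]

def tools_of_kind(kind):
    """List registered tool names in a given category."""
    return [name for name, meta in TOOL_REGISTRY.items() if meta["kind"] == kind]
```

A registry like this lets new temporal or spatial tools be added without touching the scheduling logic.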
- NExT-QA:

```shell
git clone git@github.com:doc-doc/NExT-QA.git
```

Then specify your data path in `config/nextqa.yaml`.
