Mohammed Irfan K*, Jaseel Muhammad Kaithakkodan*, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, and Hisham Cholakkal
*Equally contributing first authors
- Overview
- Installation
- Data Preprocessing
- Dataset Generation
- Upload to Hugging Face
- Evaluation
- Acknowledgements
- Citation
LongShOT introduces a diagnostic benchmark and agentic framework for long-form multimodal video understanding. LongShOTBench features open-ended questions, multi-turn dialogues, and tasks requiring vision, speech, and audio reasoning with tool use. Each sample includes reference answers and graded rubrics for interpretable evaluation. LongShOTAgent employs preprocessing, search, and iterative refinement to analyze long videos. Current state-of-the-art models show significant gaps on LongShOTBench: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%, highlighting the challenge of real-world long video understanding.
Figure 1: Construction pipeline of LongShOTBench. The pipeline begins with raw video data where speech, visuals, and audio cues are extracted. These are passed into multimodal processing to generate segment-wise aligned and fused metadata. Only the distilled information flows to question design, where scenarios and question types are mapped, followed by the generation of questions and conversational answers. Next, verifiable rubrics are created to evaluate correctness and difficulty. Finally, the core dataset, comprising Q&A pairs and tailored evaluation rubrics, is manually reviewed and corrected by human validators, ensuring a clean, reliable benchmark.
Create and activate a conda environment with Python 3.11:
conda create -n longshot python=3.11 -y
conda activate longshot
pip install -r requirements.txt
Note: If you encounter cuDNN issues, install it separately for your CUDA version:
conda install -c conda-forge cudnn=8
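To confirm which cuDNN build your environment actually resolves, you can run a minimal check. This is a sketch that assumes requirements.txt pulls in PyTorch (which the vLLM servers below rely on):
# Minimal environment sanity check; assumes PyTorch is installed via requirements.txt.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("cuDNN version:", torch.backends.cudnn.version())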
The preprocessing pipeline downloads videos from YouTube and processes them to extract video, audio, and speech metadata/transcriptions.
To download videos from YouTube, you need to provide authentication cookies:
- Install the Get cookies.txt LOCALLY Chrome extension
- Navigate to YouTube and refresh the page
- Click on the extension icon and copy the cookies
- Paste the cookies into preprocess/cookies.txt (a quick way to verify them is sketched below)
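Optionally, verify that the exported cookies are usable before launching the full download. This is only a sketch: it assumes the download script relies on yt-dlp (so the yt_dlp package is available) and uses a placeholder video URL.
# Optional cookie check; assumes yt-dlp is installed and that you run this from
# the preprocess/ directory where cookies.txt lives. Replace VIDEO_ID with any
# real YouTube ID.
from yt_dlp import YoutubeDL

opts = {"cookiefile": "cookies.txt", "quiet": True}
with YoutubeDL(opts) as ydl:
    info = ydl.extract_info("https://www.youtube.com/watch?v=VIDEO_ID", download=False)
    print("OK:", info["title"])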
Navigate to the preprocess directory and download the Audio Flamingo model:
cd preprocess
hf download nvidia/audio-flamingo-3 --local-dir ./MODELS/audio-flamingo-3/
Download videos from YouTube:
python download_youtube_videos.py
Note: Some videos may fail to download through the script. Please manually download any missing videos and place them in the appropriate directory (preprocess/dataset/videos).
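To spot which videos still need a manual download, a small helper like the following can be used. It is hypothetical: it assumes videos are saved as <youtube_id>.mp4 and that a list of expected IDs is available (expected_ids.txt is a placeholder name, not a file shipped with this repo).
# Hypothetical helper for spotting videos that still need a manual download.
# expected_ids.txt is a placeholder: one expected YouTube ID per line.
from pathlib import Path

expected = set(Path("expected_ids.txt").read_text().split())
present = {p.stem for p in Path("dataset/videos").glob("*")}
print("missing:", sorted(expected - present))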
Start the VLM server in a separate terminal:
./vllm_start.sh vlm
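Before launching the processing stages (here and for the LLM server below), it can help to confirm the server is up. A minimal sketch, assuming vllm_start.sh exposes vLLM's standard OpenAI-compatible API on the default port 8000; adjust the port to whatever the script actually configures:
# Readiness probe for the vLLM server; port 8000 is an assumption taken from
# vLLM's default, not read from vllm_start.sh itself.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=5) as resp:
    print(json.load(resp))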
In another terminal, run the VLM processing stages:
python run.py vlm-stages
Start the LLM server in a separate terminal:
./vllm_start.sh llm
In another terminal, run the LLM processing stages:
python run.py llm-stages
Once complete, the raw dataset with video, audio, and speech descriptions will be ready in the preprocess/dataset/ directory.
Generate the LongShOTBench dataset from the preprocessed videos:
cd ../datagen
python main.py
The final dataset will be saved at datagen/results/final_dataset.jsonl.
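For a quick look at what was produced, you can count the samples and list the top-level fields of the first record (run from the datagen/ directory):
# Quick inspection of the generated benchmark file; assumes nothing about the
# per-sample schema beyond one JSON object per line.
import json

with open("results/final_dataset.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records), "samples")
print("fields in first sample:", sorted(records[0]))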
To share the dataset on Hugging Face Hub:
from datasets import Dataset
dataset = Dataset.from_json("results/clean_dataset.jsonl")
dataset.push_to_hub("your-org/longshot-bench", config_name="postvalid")
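As a quick round-trip check that the upload is readable, you can load it back. The repository id and config name below mirror the placeholders above, and the train split is assumed since push_to_hub uses it by default:
# Round-trip check; "your-org/longshot-bench" and "postvalid" are the same
# placeholders used in the upload snippet above.
from datasets import load_dataset

bench = load_dataset("your-org/longshot-bench", "postvalid", split="train")
print(bench)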
Evaluate model performance on the LongShOTBench dataset.
To generate responses from candidate models:
cd ../eval
bash generate.sh
To evaluate the generated responses:
bash eval.sh
Results will be saved in the eval/results_postvalid/ directory.
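The exact file layout under results_postvalid/ is determined by eval.sh, so the following is only a sketch for listing what was written (run from the eval/ directory):
# Sketch only: lists whatever eval.sh wrote, without assuming its file format.
from pathlib import Path

for path in sorted(Path("results_postvalid").rglob("*")):
    if path.is_file():
        print(path.relative_to("results_postvalid"))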
This work is partially supported by the Meta Regional Research Grant, Project OMER, the Google Gift Research Award, and the NVIDIA Academic Grant.
If you find this work useful, please cite our paper:
@misc{kurpath2025benchmarkagenticframeworkomnimodal,
title={A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos},
author={Mohammed Irfan Kurpath and Jaseel Muhammad Kaithakkodan and Jinxing Zhou and Sahal Shaji Mullappilly and Mohammad Almansoori and Noor Ahsan and Beknur Kalmakhanbet and Sambal Shikhar and Rishabh Lalla and Jean Lahoud and Mariette Awad and Fahad Shahbaz Khan and Salman Khan and Rao Muhammad Anwer and Hisham Cholakkal},
year={2025},
eprint={2512.16978},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16978},
}
