Mohammed Irfan K*, Jaseel Muhammad Kaithakkodan*, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, and Hisham Cholakkal
*Equally contributing first authors
- Overview
- Installation
- Data Preprocessing
- Dataset Generation
- Upload to Hugging Face
- Evaluation
- Acknowledgements
- Citation
LongShOT introduces a diagnostic benchmark and agentic framework for long-form multimodal video understanding. LongShOTBench features open-ended questions, multi-turn dialogues, and tasks requiring vision, speech, and audio reasoning with tool use. Each sample includes reference answers and graded rubrics for interpretable evaluation. LongShOTAgent employs preprocessing, search, and iterative refinement to analyze long videos. Current state-of-the-art models show significant gaps on LongShOTBench: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%, highlighting the challenge of real-world long video understanding.
Figure 1: Construction pipeline of LongShOTBench. The pipeline begins with raw video data where speech, visuals, and audio cues are extracted. These are passed into multimodal processing to generate segment-wise aligned and fused metadata. Only the distilled information flows to question design, where scenarios and question types are mapped, followed by the generation of questions and conversational answers. Next, verifiable rubrics are created to evaluate correctness and difficulty. Finally, the core dataset, comprising Q&A pairs and tailored evaluation rubrics, is manually reviewed and corrected by human validators, ensuring a clean, reliable benchmark.
Create and activate a conda environment with Python 3.11:
conda create -n longshot python=3.11 -y
conda activate longshot
pip install -r requirements.txt
Note: If you encounter cuDNN issues, install it separately for your CUDA version:
conda install -c conda-forge cudnn=8
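To confirm which cuDNN build your environment actually resolves, you can run a minimal check. This is a sketch that assumes requirements.txt pulls in PyTorch (which the vLLM servers below rely on):
# Minimal environment sanity check; assumes PyTorch is installed via requirements.txt.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("cuDNN version:", torch.backends.cudnn.version())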
The preprocessing pipeline downloads videos from YouTube and processes them to extract video, audio, and speech metadata/transcriptions.
To download videos from YouTube, you need to provide authentication cookies:
- Install the Get cookies.txt LOCALLY Chrome extension
- Navigate to YouTube and refresh the page
- Click on the extension icon and copy the cookies
- Paste the cookies into preprocess/cookies.txt (a quick way to verify them is sketched below)
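Optionally, verify that the exported cookies are usable before launching the full download. This is only a sketch: it assumes the download script relies on yt-dlp (so the yt_dlp package is available) and uses a placeholder video URL.
# Optional cookie check; assumes yt-dlp is installed and that you run this from
# the preprocess/ directory where cookies.txt lives. Replace VIDEO_ID with any
# real YouTube ID.
from yt_dlp import YoutubeDL

opts = {"cookiefile": "cookies.txt", "quiet": True}
with YoutubeDL(opts) as ydl:
    info = ydl.extract_info("https://www.youtube.com/watch?v=VIDEO_ID", download=False)
    print("OK:", info["title"])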
Navigate to the preprocess directory and download the Audio Flamingo model:
cd preprocess
hf download nvidia/audio-flamingo-3 --local-dir ./MODELS/audio-flamingo-3/
Download videos from YouTube:
python download_youtube_videos.py
Note: Some videos may fail to download through the script. Please manually download any missing videos and place them in the appropriate directory (preprocess/dataset/videos).
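To spot which videos still need a manual download, a small helper like the following can be used. It is hypothetical: it assumes videos are saved as <youtube_id>.mp4 and that a list of expected IDs is available (expected_ids.txt is a placeholder name, not a file shipped with this repo).
# Hypothetical helper for spotting videos that still need a manual download.
# expected_ids.txt is a placeholder: one expected YouTube ID per line.
from pathlib import Path

expected = set(Path("expected_ids.txt").read_text().split())
present = {p.stem for p in Path("dataset/videos").glob("*")}
print("missing:", sorted(expected - present))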
Start the VLM server in a separate terminal:
./vllm_start.sh vlm
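Before launching the processing stages (here and for the LLM server below), it can help to confirm the server is up. A minimal sketch, assuming vllm_start.sh exposes vLLM's standard OpenAI-compatible API on the default port 8000; adjust the port to whatever the script actually configures:
# Readiness probe for the vLLM server; port 8000 is an assumption taken from
# vLLM's default, not read from vllm_start.sh itself.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=5) as resp:
    print(json.load(resp))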
In another terminal, run the VLM processing stages:
python run.py vlm-stages
Start the LLM server in a separate terminal:
./vllm_start.sh llm
In another terminal, run the LLM processing stages:
python run.py llm-stages
Once complete, the raw dataset with video, audio, and speech descriptions will be ready in the preprocess/dataset/ directory.
Generate the LongShOTBench dataset from the preprocessed videos:
cd ../datagen
python main.py
The final dataset will be saved at datagen/results/final_dataset.jsonl.
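For a quick look at what was produced, you can count the samples and list the top-level fields of the first record (run from the datagen/ directory):
# Quick inspection of the generated benchmark file; assumes nothing about the
# per-sample schema beyond one JSON object per line.
import json

with open("results/final_dataset.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records), "samples")
print("fields in first sample:", sorted(records[0]))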
To share the dataset on Hugging Face Hub:
from datasets import Dataset
dataset = Dataset.from_json("results/clean_dataset.jsonl")
dataset.push_to_hub("your-org/longshot-bench", config_name="postvalid")
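As a quick round-trip check that the upload is readable, you can load it back. The repository id and config name below mirror the placeholders above, and the train split is assumed since push_to_hub uses it by default:
# Round-trip check; "your-org/longshot-bench" and "postvalid" are the same
# placeholders used in the upload snippet above.
from datasets import load_dataset

bench = load_dataset("your-org/longshot-bench", "postvalid", split="train")
print(bench)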
Evaluate model performance on the LongShOTBench dataset.
To generate responses from candidate models:
cd ../eval
bash generate.sh
To evaluate the generated responses:
bash eval.sh
Results will be saved in the eval/results_postvalid/ directory.
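The exact file layout under results_postvalid/ is determined by eval.sh, so the following is only a sketch for listing what was written (run from the eval/ directory):
# Sketch only: lists whatever eval.sh wrote, without assuming its file format.
from pathlib import Path

for path in sorted(Path("results_postvalid").rglob("*")):
    if path.is_file():
        print(path.relative_to("results_postvalid"))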
This work is partially supported by the Meta Regional Research Grant, Project OMER, the Google Gift Research Award, and the NVIDIA Academic Grant.
If you find this work useful, please cite our paper:
@misc{kurpath2025benchmarkagenticframeworkomnimodal,
title={A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos},
author={Mohammed Irfan Kurpath and Jaseel Muhammad Kaithakkodan and Jinxing Zhou and Sahal Shaji Mullappilly and Mohammad Almansoori and Noor Ahsan and Beknur Kalmakhanbet and Sambal Shikhar and Rishabh Lalla and Jean Lahoud and Mariette Awad and Fahad Shahbaz Khan and Salman Khan and Rao Muhammad Anwer and Hisham Cholakkal},
year={2025},
eprint={2512.16978},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16978},
}
