🔥2025.11.08: Our paper got accepted to AAAI 2026! Thanks to all co-authors and the anonymous reviewers🎉🎉
🔥2025.11.01: Data, Code, and Checkpoints are released!
If our work assists your research, feel free to give us a star ⭐ and cite us using:
```bibtex
@article{zhou2025think,
  title={Think before you segment: An object-aware reasoning agent for referring audio-visual segmentation},
  author={Zhou, Jinxing and Zhou, Yanghao and Han, Mingfei and Wang, Tong and Chang, Xiaojun and Cholakkal, Hisham and Anwer, Rao Muhammad},
  journal={arXiv preprint arXiv:2508.04418},
  year={2025}
}
```
```bash
git clone https://github.com/jasongief/TGS-Agent.git
cd TGS-Agent
conda env create -f think_environment.yml
conda activate think
```
Alternatively, you may also refer to Crab for environment installation.
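A quick sanity check for the think environment (this assumes think_environment.yml installs PyTorch and Hugging Face Transformers, which the LLaMA-2-based Ref-Thinker relies on; adapt it if your setup differs):
```bash
# Optional check -- not part of the official setup: verify that PyTorch and Transformers
# import cleanly and that CUDA is visible inside the "think" environment.
conda activate think
python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"
```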
```bash
cd ground_segment_scripts
git clone https://github.com/IDEA-Research/Grounded-SAM-2.git
cd Grounded-SAM-2
conda env create -f dinosam2_environment.yml
conda activate dino
```
Alternatively, you may refer to Grounded-SAM2 for environment installation.
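Likewise, you can sanity-check the grounding/segmentation environment. The import names below are assumptions based on the upstream Grounded-SAM-2 packages (sam2 and groundingdino); skip or adapt this step if your install exposes different modules:
```bash
# Optional check for the "dino" environment (module names are assumptions, not guaranteed
# by dinosam2_environment.yml).
conda activate dino
python -c "import torch, sam2, groundingdino; print('Grounded-SAM-2 env OK, CUDA:', torch.cuda.is_available())"
```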
```bash
cd ../..   # back to the TGS-Agent root
```
- Download the official Ref-AVSBench dataset from here and put it in ./REFAVS. The metadata (csv file) should also be copied to ./R2AVSBench.
- Download our instruction tuning data for Ref-Thinker training from here and put the json file into ./R2AVSBench.
- Download the metadata of our R^2-AVSBench from here and put the csv file into ./R2AVSBench (the expected layout is sketched after this list).
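For reference, here is a sketch of the directory layout we assume after these downloads; names in angle brackets are placeholders, so use the actual file names from the download links:
```
TGS-Agent/
├── REFAVS/                              # official Ref-AVSBench data
│   └── <refavs_metadata>.csv            # also copy this csv into R2AVSBench/
└── R2AVSBench/
    ├── <refavs_metadata>.csv            # copied from REFAVS
    ├── <ref_thinker_instruction>.json   # instruction tuning data for Ref-Thinker
    └── <r2_avsbench_meta>.csv           # metadata of R^2-AVSBench
```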
Download the necessary pre-trained backbones and put them in ./pretrained_weights (a layout sketch follows this list), including:

Multimodal Encoder Weights:
- download the visual encoder openai-clip-vit-large-patch14
- download the audio encoder BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2

LLM Weights:
- download LLaMA-2-Chat-HF

Pretrained Multimodal Projectors:
- download the pretrained audio projector: audio pretrain checkpoint
- download the pretrained visual projector: visual pretrain checkpoint
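Below is the layout we assume under ./pretrained_weights; the projector names are placeholders and whether each entry is a file or a folder depends on the download, so double-check the paths expected by the training scripts:
```
pretrained_weights/
├── openai-clip-vit-large-patch14/                  # visual encoder
├── BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2    # audio encoder
├── LLaMA-2-Chat-HF/                                # LLM weights
├── <audio_pretrain_checkpoint>                     # pretrained audio projector
└── <visual_pretrain_checkpoint>                    # pretrained visual projector
```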
Download the following checkpoints:
- download our pretrained Ref-Thinker and put it into results_real.
- run the following script to prepare the GroundingDINO weights:
  ```bash
  # from the TGS-Agent root
  cd ./ground_segment_scripts/Grounded-SAM-2/gdino_checkpoints
  bash download_ckpts.sh
  ```
- run the following script to prepare the SAM2 weights (a quick sanity check is sketched after this list):
  ```bash
  # from the TGS-Agent root
  cd ./ground_segment_scripts/Grounded-SAM-2/checkpoints
  bash download_ckpts.sh
  ```
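After both download scripts finish, you can quickly confirm the weights landed where the grounding and segmentation scripts will look for them (the glob patterns below are only a rough check, assuming the usual .pth/.pt extensions):
```bash
# Run from the TGS-Agent root; lists the downloaded GroundingDINO and SAM2 weights.
ls ./ground_segment_scripts/Grounded-SAM-2/gdino_checkpoints/*.pth
ls ./ground_segment_scripts/Grounded-SAM-2/checkpoints/*.pt
```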
Fine-tune Ref-Thinker:
```bash
cd TGS-Agent
conda activate think
bash scripts/finetune/finetune_hyperlora.sh
```
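If you need to pin the fine-tuning to specific GPUs, the standard CUDA environment variable works independently of the script internals (the GPU indices below are only an example):
```bash
# Example: restrict Ref-Thinker fine-tuning to GPUs 0 and 1 (indices are illustrative).
CUDA_VISIBLE_DEVICES=0,1 bash scripts/finetune/finetune_hyperlora.sh
```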
Run Ref-Thinker inference:
```bash
cd TGS-Agent
conda activate think
bash scripts/finetune/inference_hyper_lora.sh
```
This generates the object-aware reasoning chain for each given reference in the default Ref-AVSBench. After obtaining the fine-grained and simplified object description, we can start the subsequent Ground-and-Segment phase. To evaluate on our proposed R^2-AVSBench instead, change the test metadata csv path; a hedged example is sketched below.
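The TEST_META_CSV name below is purely illustrative and not the script's actual interface, so check scripts/finetune/inference_hyper_lora.sh for the real setting that points at the test metadata:
```bash
# Hypothetical example only: inside scripts/finetune/inference_hyper_lora.sh, repoint the
# test metadata from the default Ref-AVSBench csv to the R^2-AVSBench csv, e.g.
#   TEST_META_CSV=./REFAVS/<refavs_metadata>.csv  ->  TEST_META_CSV=./R2AVSBench/<r2_avsbench_meta>.csv
# then rerun the inference script.
bash scripts/finetune/inference_hyper_lora.sh
```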
```bash
cd ground_segment_scripts
conda activate dino
```
- Inference on Ref-AVSBench prompted by Ref-Thinker:
  ```bash
  python ground_segment_with_object_text_after_thinking_for_RefAVSBench.py
  ```
- Inference on Ref-AVSBench prompted by the original raw reference:
  ```bash
  python ground_segment_with_direct_reference_of_RefAVSBench.py
  ```
- Inference on R^2-AVSBench prompted by Ref-Thinker:
  ```bash
  python ground_segment_with_object_text_after_thinking_for_R2AVSBench.py
  ```
- Inference on R^2-AVSBench prompted by the original raw reference:
  ```bash
  python ground_segment_with_direct_reference_of_R2AVSBench.py
  ```
We thank Crab and Grounded-SAM2 for their open-source code, which helped a lot in this project.