
Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation


📰 News

🔥2025.11.08: Our paper got accepted to AAAI 2026! Thanks to all co-authors and the anonymous reviewers🎉🎉

🔥2025.11.01: Data, Code, and Checkpoints are released!

📄 Citation

If our work helps your research, feel free to give us a star ⭐ and cite us using:

@article{zhou2025think,
  title={Think before you segment: An object-aware reasoning agent for referring audio-visual segmentation},
  author={Zhou, Jinxing and Zhou, Yanghao and Han, Mingfei and Wang, Tong and Chang, Xiaojun and Cholakkal, Hisham and Anwer, Rao Muhammad},
  journal={arXiv preprint arXiv:2508.04418},
  year={2025}
}

⚙️ Installation

git clone https://github.com/jasongief/TGS-Agent.git
cd TGS-Agent

For Think Phase

conda env create -f think_environment.yml
conda activate think

Alternatively, you may refer to Crab for environment installation.

For Ground-Segment Phase

cd ground_segment_scripts

git clone https://github.com/IDEA-Research/Grounded-SAM-2.git
cd Grounded-SAM-2

conda env create -f dinosam2_environment.yml
conda activate dino

Alternatively, you may refer to Grounded-SAM2 for environment installation.

cd ../..  # back to the TGS-Agent repo root

🤗 Setup

Datasets

  • Download the official Ref-AVSBench dataset from here and put it in ./REFAVS. Its metadata (csv file) should also be copied to ./R2AVSBench.
  • Download our instruction-tuning data for Ref-Thinker training from here and put the json file into ./R2AVSBench.
  • Download the metadata of our R^2-AVSBench from here and put the csv file into ./R2AVSBench. The expected layout after these steps is sketched below.
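
For reference, a possible directory layout after these steps (a sketch only; the actual file names depend on the downloads, so treat the names below as placeholders):

TGS-Agent
├── REFAVS/                            # Ref-AVSBench videos, audio, and annotations
└── R2AVSBench/
    ├── refavs_meta.csv                # placeholder name: metadata copied from Ref-AVSBench
    ├── ref_thinker_instructions.json  # placeholder name: instruction-tuning data
    └── r2avs_meta.csv                 # placeholder name: R^2-AVSBench metadata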

Pretrained Backbones

Download the necessary pre-trained backbones and put them in ./pretrained_weights, including

  • Multimodal Encoder Weights
  • LLM Weights: download LLaMA-2-Chat-HF
  • Pretrained Multimodal Projector
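
A possible resulting layout (a sketch; the exact folder and file names depend on the released weights, so treat these as placeholders — the 7B model size in particular is an assumption):

pretrained_weights
├── multimodal_encoder/        # placeholder: audio/visual encoder weights
├── Llama-2-7b-chat-hf/        # LLaMA-2-Chat-HF (model size is an assumption)
└── mm_projector.bin           # placeholder: pretrained multimodal projector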

Checkpoints

Download the following checkpoints:

  • Download our pretrained Ref-Thinker and put it into ./results_real.
  • Run the following commands (from the repo root) to prepare GroundingDINO weights:

cd ./ground_segment_scripts/Grounded-SAM-2/gdino_checkpoints
bash download_ckpts.sh

  • Run the following commands (again starting from the repo root) to prepare SAM2 weights:

cd ./ground_segment_scripts/Grounded-SAM-2/checkpoints
bash download_ckpts.sh
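
A quick sanity check that everything landed where the later scripts look for it (paths taken from the steps above; the listed contents will vary with what download_ckpts.sh fetches):

ls ./results_real
ls ./ground_segment_scripts/Grounded-SAM-2/gdino_checkpoints
ls ./ground_segment_scripts/Grounded-SAM-2/checkpoints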

📌 Getting Started

Train Ref-Thinker

cd TGS-Agent
conda activate think
bash scripts/finetune/finetune_hyperlora.sh

Test Ref-Thinker

cd TGS-Agent
conda activate think
bash scripts/finetune/inference_hyper_lora.sh

This generates the object-aware reasoning chain for each given reference from the default Ref-AVSBench. To evaluate on our proposed R^2-AVSBench instead, change the test meta csv path (see the sketch below). After obtaining the fine-grained, simplified object description, we can start the subsequent Ground-Segment phase.
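
For example, switching the evaluation to R^2-AVSBench might look like the following (the argument name is an assumption; check scripts/finetune/inference_hyper_lora.sh for the actual flag, and the csv name is a placeholder):

# inside scripts/finetune/inference_hyper_lora.sh
--test_meta_csv ./R2AVSBench/r2avs_meta.csv   # hypothetical flag name and placeholder file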

Ground-Segment

cd ground_segment_scripts
conda activate dino
  • Inference on Ref-AVSBench prompted by Ref-Thinker:

python ground_segment_with_object_text_after_thinking_for_RefAVSBench.py

  • Inference on Ref-AVSBench prompted by the original raw reference:

python ground_segment_with_direct_reference_of_RefAVSBench.py

  • Inference on R^2-AVSBench prompted by Ref-Thinker:

python ground_segment_with_object_text_after_thinking_for_R2AVSBench.py

  • Inference on R^2-AVSBench prompted by the original raw reference:

python ground_segment_with_direct_reference_of_R2AVSBench.py
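
To reproduce all four settings in one go, a minimal convenience sketch (using only the script names listed above):

conda activate dino
cd ground_segment_scripts
for script in \
    ground_segment_with_object_text_after_thinking_for_RefAVSBench.py \
    ground_segment_with_direct_reference_of_RefAVSBench.py \
    ground_segment_with_object_text_after_thinking_for_R2AVSBench.py \
    ground_segment_with_direct_reference_of_R2AVSBench.py
do
    python "$script"   # run one Ground-Segment setting
done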

Acknowledgement

We thank Crab and Grounded-SAM2 for their open-source code, which helped a lot in this project.
