Official PyTorch implementation of the following papers:
Sound Source Localization is All About Cross-Modal Alignment
Arda Senocak*, Hyeonggon Ryu*, Junsik Kim*, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung (* Equal Contribution)
ICCV 2023
Toward Interactive Sound Source Localization: Better Align Sight and Sound!
Arda Senocak*, Hyeonggon Ryu*, Junsik Kim*, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung (* Equal Contribution)
TPAMI 2025
- Overview
- Interactive Synthetic Sound Source (IS3) Dataset
- Environment
- Model Checkpoints
- Inference
- Training
- Citation
The IS3 dataset is available here.
The IS3 data is organized as follows:
Note that in the IS3 dataset, each annotation is saved as a separate file. For example, the sample image accordion_baby_10467 contains two annotations, one for the accordion and one for the baby object. These annotations are saved as accordion_baby_10467_accordion and accordion_baby_10467_baby for straightforward use. You can always project the bounding boxes or segmentation maps onto the original image to see them all at once.
The images and audio_waw folders contain all the image and audio files, respectively.
The IS3_annotation.json file contains the ground-truth bounding box and category information for each annotation.
The gt_segmentation folder contains a segmentation map in binary image format for each annotation. You can query the file name in IS3_annotation.json to get the semantic category of each segmentation map.
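For convenience, below is a minimal sketch of how the per-annotation entries could be projected back onto the original image. The JSON layout (a list of dicts with file_id, bbox, and class fields), the [x_min, y_min, x_max, y_max] box format, and the .jpg extension are assumptions; inspect IS3_annotation.json and adjust accordingly.

```python
# Minimal sketch: draw every annotation of one IS3 image onto the original frame.
# NOTE: the JSON layout ("file_id", "bbox", "class"), the box format and the
# .jpg extension are assumptions; check IS3_annotation.json before using this.
import json
from PIL import Image, ImageDraw

with open("IS3_annotation.json") as f:
    annotations = json.load(f)

image_id = "accordion_baby_10467"
img = Image.open(f"images/{image_id}.jpg").convert("RGB")
draw = ImageDraw.Draw(img)

# Gather the per-object annotations belonging to this image,
# e.g. accordion_baby_10467_accordion and accordion_baby_10467_baby.
for ann in annotations:
    if ann["file_id"].startswith(image_id):
        x0, y0, x1, y1 = ann["bbox"]  # assumed [x_min, y_min, x_max, y_max]
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0, y0), ann["class"], fill="red")

img.save(f"{image_id}_all_annotations.png")
```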
The model checkpoints are available for the following experiments:
| Training Set | Test Set | Model Type | Performance (cIoU) | Checkpoint |
|---|---|---|---|---|
| VGGSound-144K | VGG-SS | NN w/ Sup. Pre. Enc. | 39.94 | Link |
| VGGSound-144K | VGG-SS | NN w/ Self-Sup. Pre. Enc. | 39.16 | Link |
| VGGSound-144K | VGG-SS | NN w/ Sup. Pre. Enc. Pre-trained Vision | 41.42 | Link |
| Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Sup. Pre. Enc. | 85.20 | Link |
| Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Self-Sup. Pre. Enc. | 84.80 | Link |
| Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Sup. Pre. Enc. Pre-trained Vision | 86.00 | Link |
Put checkpoint files into the 'checkpoints' directory:
inference
│   test.py
│   datasets.py
│   model.py
│
└───checkpoints
        ours_sup_previs.pth.tar
        ours_sup.pth.tar
        ours_selfsup.pth.tar
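If you want to sanity-check a downloaded checkpoint outside of test.py, a minimal sketch is shown below; the wrapper key 'model' is an assumption about how the weights are stored inside the .pth.tar file.

```python
# Minimal sketch: inspect one of the released checkpoints.
# NOTE: the wrapper key ("model") is an assumption; print the top-level keys
# to see how the file is actually organized.
import torch

ckpt = torch.load("checkpoints/ours_sup.pth.tar", map_location="cpu")
print(ckpt.keys())

state_dict = ckpt["model"] if "model" in ckpt else ckpt  # fall back to a raw state dict
print(f"{len(state_dict)} tensors in the state dict")
```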
To evaluate a trained model, run:
python test.py --testset {testset_name} --pth_name {pth_name}
| Test Set | testset_name |
|---|---|
| VGG-SS | vggss |
| Flickr-SoundNet | flickr |
| IS3 | is3 |
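For example, assuming the released checkpoint filenames can be passed directly as --pth_name:

python test.py --testset vggss --pth_name ours_sup.pth.tar

and similarly with --testset flickr or --testset is3.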
To evaluate other methods, simply save their checkpoint files as '{method_name}_{put_your_own_message}.pth', e.g. 'ezvsl_flickr.pth'; the method-specific settings are already handled based on the pth_name.
| Paper title | pth_name must contain |
|---|---|
| Localizing Visual Sounds the Hard Way (CVPR 21) [Paper] | lvs |
| Localizing Visual Sounds the Easy Way (ECCV 22) [Paper] | ezvsl |
| A Closer Look at Weakly-Supervised Audio-Visual Source Localization (NeurIPS 22) [Paper] | slavc |
| Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation (ACMMM 22) [Paper] | ssltie |
| Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning (CVPR 23) [Paper] | fnac |
Example
python test.py --testset flickr --pth_name ezvsl_flickr.pth
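As a sketch of this naming convention (the source filename below is a placeholder, and storing the renamed file under the checkpoints directory is an assumption), a downloaded EZ-VSL checkpoint could be copied into place like this:

```python
# Minimal sketch: copy a downloaded checkpoint into the expected
# '{method_name}_{put_your_own_message}.pth' naming so the method can be
# recognized from the pth_name.
# NOTE: "downloaded_best.pth" is a placeholder filename, and placing it
# under checkpoints/ is an assumption.
import shutil

shutil.copy("downloaded_best.pth", "checkpoints/ezvsl_flickr.pth")
```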
Training code is coming soon!
If you find this code useful, please consider giving a star ⭐ and citing us:
@inproceedings{senocak2023sound,
title={Sound source localization is all about cross-modal alignment},
author={Senocak, Arda and Ryu, Hyeonggon and Kim, Junsik and Oh, Tae-Hyun and Pfister, Hanspeter and Chung, Joon Son},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={7777--7787},
year={2023}
}

If you use this dataset, please consider giving a star ⭐ and citing us:
@article{senocak2025align,
author={Senocak, Arda and Ryu, Hyeonggon and Kim, Junsik and Oh, Tae-Hyun and Pfister, Hanspeter and Chung, Joon Son},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={Toward Interactive Sound Source Localization: Better Align Sight and Sound!},
year={2025},
}
