Official PyTorch implementation of the following papers:
Sound Source Localization is All About Cross-Modal Alignment
Arda Senocak*, Hyeonggon Ryu*, Junsik Kim*, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung (* Equal Contribution)
ICCV 2023
Toward Interactive Sound Source Localization: Better Align Sight and Sound!
Arda Senocak*, Hyeonggon Ryu*, Junsik Kim*, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung (* Equal Contribution)
TPAMI 2025
- Overview
- Interactive Synthetic Sound Source (IS3) Dataset
- Environment
- Model Checkpoints
- Inference
- Training
- Citation
The IS3 dataset is available here.
The IS3 data is organized as follows:
Note that in the IS3 dataset, each annotation is saved as a separate file. For example, the sample image accordion_baby_10467 contains two annotations, one for the accordion and one for the baby object. These annotations are saved as accordion_baby_10467_accordion and accordion_baby_10467_baby for straightforward use. You can always project the bounding boxes or segmentation maps onto the original image to see them all at once.
The images and audio_waw folders contain all the image and audio files, respectively.
The IS3_annotation.json file contains the ground-truth bounding box and category information for each annotation.
The gt_segmentation folder contains a segmentation map in binary image format for each annotation. You can query the file name in IS3_annotation.json to get the semantic category of each segmentation map.
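For convenience, below is a minimal sketch of how the per-annotation entries could be projected back onto the original image. The JSON layout (a list of dicts with file_id, bbox, and class fields), the [x_min, y_min, x_max, y_max] box format, and the .jpg extension are assumptions; inspect IS3_annotation.json and adjust accordingly.

```python
# Minimal sketch: draw every annotation of one IS3 image onto the original frame.
# NOTE: the JSON layout ("file_id", "bbox", "class"), the box format and the
# .jpg extension are assumptions; check IS3_annotation.json before using this.
import json
from PIL import Image, ImageDraw

with open("IS3_annotation.json") as f:
    annotations = json.load(f)

image_id = "accordion_baby_10467"
img = Image.open(f"images/{image_id}.jpg").convert("RGB")
draw = ImageDraw.Draw(img)

# Gather the per-object annotations belonging to this image,
# e.g. accordion_baby_10467_accordion and accordion_baby_10467_baby.
for ann in annotations:
    if ann["file_id"].startswith(image_id):
        x0, y0, x1, y1 = ann["bbox"]  # assumed [x_min, y_min, x_max, y_max]
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0, y0), ann["class"], fill="red")

img.save(f"{image_id}_all_annotations.png")
```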
The model checkpoints are available for the following experiments:
| Training Set | Test Set | Model Type | Performance (cIoU) | Checkpoint |
|---|---|---|---|---|
| VGGSound-144K | VGG-SS | NN w/ Sup. Pre. Enc. | 39.94 | Link |
| VGGSound-144K | VGG-SS | NN w/ Self-Sup. Pre. Enc. | 39.16 | Link |
| VGGSound-144K | VGG-SS | NN w/ Sup. Pre. Enc. Pre-trained Vision | 41.42 | Link |
| Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Sup. Pre. Enc. | 85.20 | Link |
| Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Self-Sup. Pre. Enc. | 84.80 | Link |
| Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Sup. Pre. Enc. Pre-trained Vision | 86.00 | Link |
Put checkpoint files into the 'checkpoints' directory:
inference
│   test.py
│   datasets.py
│   model.py
│
└───checkpoints
        ours_sup_previs.pth.tar
        ours_sup.pth.tar
        ours_selfsup.pth.tar
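If you want to sanity-check a downloaded checkpoint outside of test.py, a minimal sketch is shown below; the wrapper key 'model' is an assumption about how the weights are stored inside the .pth.tar file.

```python
# Minimal sketch: inspect one of the released checkpoints.
# NOTE: the wrapper key ("model") is an assumption; print the top-level keys
# to see how the file is actually organized.
import torch

ckpt = torch.load("checkpoints/ours_sup.pth.tar", map_location="cpu")
print(ckpt.keys())

state_dict = ckpt["model"] if "model" in ckpt else ckpt  # fall back to a raw state dict
print(f"{len(state_dict)} tensors in the state dict")
```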
To evaluate a trained model, run:
python test.py --testset {testset_name} --pth_name {pth_name}
| Test Set | testset_name |
|---|---|
| VGG-SS | vggss |
| Flickr-SoundNet | flickr |
| IS3 | is3 |
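For example, assuming the released checkpoint filenames can be passed directly as --pth_name:

python test.py --testset vggss --pth_name ours_sup.pth.tar

and similarly with --testset flickr or --testset is3.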
To evaluate other methods, simply save their checkpoint files as '{method_name}_{put_your_own_message}.pth', e.g. 'ezvsl_flickr.pth'; the method-specific settings are already handled based on the pth_name.
| Paper title | pth_name must contain |
|---|---|
| Localizing Visual Sounds the Hard Way (CVPR 21) [Paper] | lvs |
| Localizing Visual Sounds the Easy Way (ECCV 22) [Paper] | ezvsl |
| A Closer Look at Weakly-Supervised Audio-Visual Source Localization (NeurIPS 22) [Paper] | slavc |
| Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation (ACMMM 22) [Paper] | ssltie |
| Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning (CVPR 23) [Paper] | fnac |
Example
python test.py --testset flickr --pth_name ezvsl_flickr.pth
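As a sketch of this naming convention (the source filename below is a placeholder, and storing the renamed file under the checkpoints directory is an assumption), a downloaded EZ-VSL checkpoint could be copied into place like this:

```python
# Minimal sketch: copy a downloaded checkpoint into the expected
# '{method_name}_{put_your_own_message}.pth' naming so the method can be
# recognized from the pth_name.
# NOTE: "downloaded_best.pth" is a placeholder filename, and placing it
# under checkpoints/ is an assumption.
import shutil

shutil.copy("downloaded_best.pth", "checkpoints/ezvsl_flickr.pth")
```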
Training code is coming soon!
If you find this code useful, please consider giving a star ⭐ and citing us:
@inproceedings{senocak2023sound,
title={Sound source localization is all about cross-modal alignment},
author={Senocak, Arda and Ryu, Hyeonggon and Kim, Junsik and Oh, Tae-Hyun and Pfister, Hanspeter and Chung, Joon Son},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={7777--7787},
year={2023}
}

If you use this dataset, please consider giving a star ⭐ and citing us:
@article{senocak2025align,
author={Senocak, Arda and Ryu, Hyeonggon and Kim, Junsik and Oh, Tae-Hyun and Pfister, Hanspeter and Chung, Joon Son},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={Toward Interactive Sound Source Localization: Better Align Sight and Sound!},
year={2025},
}
