Query by Vocal Imitation (QVIM) enables users to search a database of sounds via a vocal imitation of the desired sound. This offers sound designers an intuitively expressive way of navigating large sound effects databases.
We invite participants to submit systems that accept a vocal imitation query and retrieve a perceptually similar recording from a large database of sound effects.
Important Dates
- Challenge start: April 1, 2025
- Challenge end: June 15, 2025
- Challenge results announcement: July 15, 2025
For more details, please have a look at our website.
For updates, please register here.
This repository contains the baseline system for the AES AIMLA Challenge 2025. The architecture and the training procedure are based on "Improving Query-by-Vocal Imitation with Contrastive Learning and Audio Pretraining" (DCASE2024 Workshop).
- The training loop is implemented using PyTorch and PyTorch Lightning.
- Logging is implemented using Weights and Biases.
- It uses a MobileNetV3 (MN) pretrained on AudioSet to encode audio recordings.
- The system is trained on VimSketch and evaluated on the public evaluation dataset described on our website. (A minimal sketch of the retrieval step is given below.)
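To make the retrieval step concrete, here is a minimal sketch, not the actual baseline code: an encoder (in the baseline, the pretrained MobileNetV3) maps both the vocal imitation and every reference sound to a fixed-size embedding, and references are ranked by cosine similarity to the query. The function name `rank_references` and the tensor shapes are illustrative assumptions.

```python
# Minimal retrieval sketch (illustrative, not the baseline implementation).
# Assumes an encoder has already produced fixed-size embeddings for the
# imitation query and for every reference sound in the database.
import torch
import torch.nn.functional as F

def rank_references(query_emb: torch.Tensor, ref_embs: torch.Tensor) -> torch.Tensor:
    """Return reference indices sorted by descending cosine similarity.

    query_emb: (D,) embedding of the vocal imitation.
    ref_embs:  (N, D) embeddings of the sound-effects database.
    """
    query = F.normalize(query_emb.unsqueeze(0), dim=-1)  # (1, D), unit norm
    refs = F.normalize(ref_embs, dim=-1)                 # (N, D), unit norm
    sims = (query @ refs.T).squeeze(0)                   # (N,) cosine similarities
    return torch.argsort(sims, descending=True)          # best match first
```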
Prerequisites
- Linux (tested on Ubuntu 24.04)
- conda, e.g., Miniconda3-latest-Linux-x86_64.sh
- Clone this repository.
git clone https://github.com/qvim-aes/qvim-baseline.git
- Create and activate a conda environment with Python 3.10:
conda env create -f environment.yml
conda activate qvim-baseline
- Install 7z, e.g.,
# (on Linux)
sudo apt install p7zip-full
# (on Windows)
conda install -c conda-forge 7zip
For Linux users: do not use the conda package p7zip - it is based on the outdated version 16.02 of 7zip; to extract the dataset, you need a more recent version.
- If you have not used Weights and Biases for logging before, you can create a free account. On your machine, run
wandb login
and copy your API key into the command line when prompted.
To start the training, run the following commands:
cd MAIN_FOLDER_OF_THIS_REPOSITORY
export PYTHONPATH=$(pwd)/src
python src/qvim_mn_baseline/ex_qvim.py
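The training objective follows the contrastive approach of the cited paper. As a rough, generic sketch, the loss below is an NT-Xent-style objective over a batch of paired imitation/reference embeddings; the function name, temperature value, and symmetric formulation are illustrative assumptions, not the exact baseline loss.

```python
# Generic contrastive (NT-Xent-style) loss sketch for paired
# imitation/reference embeddings. An assumption about the general form
# of the objective, not the exact loss used in the baseline.
import torch
import torch.nn.functional as F

def contrastive_loss(imit_embs: torch.Tensor, ref_embs: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """imit_embs, ref_embs: (B, D) embeddings of B matching pairs."""
    imit = F.normalize(imit_embs, dim=-1)
    ref = F.normalize(ref_embs, dim=-1)
    logits = imit @ ref.T / temperature                        # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = match
    # Symmetric cross-entropy: imitation->reference and reference->imitation.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```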
| Model Name | MRR (exact match) | NDCG (category match) |
|---|---|---|
| random | 0.0444 | ~0.337 |
| 2DFT | 0.1262 | 0.4793 |
| MN baseline | 0.2726 | 0.6463 |
- The Mean Reciprocal Rank (MRR) is the metric used to select submitted systems for the subjective evaluation. The MRR is the inverse rank $\frac{1}{r_i}$ of the reference sound $i$, averaged over all imitation queries $Q$:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i \in Q} \frac{1}{r_i}$$
- The Normalized Discounted Cumulative Gain (NDCG) measures a system's ability to retrieve sounds of the imitated category (i.e., how well the system retrieves an arbitrary dog bark when a specific dog bark was imitated). The NDCG will not be used for ranking. A sketch of both metrics is given after this list.
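For reference, here is a rough sketch of how both metrics can be computed, assuming a similarity matrix `sims` of shape (Q, N) between Q imitations and N reference sounds; the names `target_idx` and `categories` are illustrative, and this is not the official evaluation code.

```python
# Sketch of MRR and category-match NDCG over a (Q, N) similarity matrix.
# target_idx[q]: index of the imitated reference for query q (hypothetical name).
# categories[n]: category label of reference sound n (hypothetical name).
import numpy as np

def mean_reciprocal_rank(sims: np.ndarray, target_idx: np.ndarray) -> float:
    order = np.argsort(-sims, axis=1)                   # descending similarity
    ranks = np.array([np.where(order[q] == target_idx[q])[0][0] + 1
                      for q in range(sims.shape[0])])   # 1-based rank r_i
    return float(np.mean(1.0 / ranks))

def ndcg_category(sims: np.ndarray, target_idx: np.ndarray,
                  categories: np.ndarray) -> float:
    scores = []
    for q in range(sims.shape[0]):
        order = np.argsort(-sims[q])
        # Binary relevance: 1 if the retrieved sound shares the imitated category.
        rel = (categories[order] == categories[target_idx[q]]).astype(float)
        discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
        dcg = float(np.sum(rel * discounts))
        ideal = float(np.sum(np.sort(rel)[::-1] * discounts))  # ideal ordering
        scores.append(dcg / ideal if ideal > 0 else 0.0)
    return float(np.mean(scores))
```

Note that this NDCG sketch uses binary relevance (category match or not), so the ideal ordering simply places all same-category sounds first.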
For questions or inquiries, please contact paul.primus@jku.at.
@inproceedings{Greif2024,
author = "Greif, Jonathan and Schmid, Florian and Primus, Paul and Widmer, Gerhard",
title = "Improving Query-By-Vocal Imitation with Contrastive Learning and Audio Pretraining",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)",
address = "Tokyo, Japan",
month = "October",
year = "2024",
pages = "51--55"
}