Official implementation of "Revealing Cross-Lingual Bias in Synthetic Speech Detection under Controlled Conditions"
Victor Moreno¹, João Lima¹, Flávio Simões², Ricardo Violato², Mário Uliani Neto², Fernando Runstein², Paula Costa¹
¹Universidade Estadual de Campinas (UNICAMP), Brazil
²CPQD, Brazil
This work investigates whether language identity influences the detectability of synthetic speech in state-of-the-art countermeasure systems. We train a detector on English-only data (ASVspoof 5) and evaluate it under controlled conditions using spoofed samples in ten languages synthesized by Meta's MMS TTS system. Despite uniform synthesis settings, we observe significant language-dependent disparities in detection performance, revealing systematic bias in cross-lingual generalization.
- Language-dependent detection bias confirmed: Detection performance varies significantly across languages despite identical TTS synthesis conditions
- Counter-intuitive results: English (training language) does not achieve the best detection scores
- Extreme performance gaps: Romanian (mean CM score 0.99) vs. Ukrainian (mean CM score 0.12)
- Statistical significance: Mann-Whitney U tests confirm systematic differences (p < 0.001 for most pairs)
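The pairwise comparisons rely on Mann-Whitney U tests and the common-language effect size (CLES); the full analysis lives in scripts/statistical_tests.py. As a minimal sketch of the computation with SciPy, using two hypothetical score arrays:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical CM score samples for two languages (real scores come from the evaluation step).
rng = np.random.default_rng(0)
scores_lang_a = rng.uniform(0.9, 1.0, 1000)
scores_lang_b = rng.uniform(0.0, 0.3, 1000)

# Two-sided Mann-Whitney U test on the two score distributions.
u_stat, p_value = mannwhitneyu(scores_lang_a, scores_lang_b, alternative="two-sided")

# Common-language effect size: probability that a random score from language A
# exceeds a random score from language B.
cles = u_stat / (len(scores_lang_a) * len(scores_lang_b))

print(f"U = {u_stat:.1f}, p = {p_value:.3g}, CLES = {cles:.3f}")
```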
- ASVspoof 5 Track 01: English-only corpus
  - 145,000+ utterances (balanced bonafide/spoof)
  - Various TTS and VC systems
  - Used for monolingual training
- MLAAD MMS Subset: Controlled multilingual evaluation
  - 10 languages: Finnish, German, Russian, Swahili, Ukrainian, English, French, Dutch, Hungarian, Romanian
  - 1,000 spoofed utterances per language
  - All generated with identical MMS TTS settings (VITS-based, 100k steps, same architecture)
  - Enables isolation of language as the only variable (see the loading sketch below)
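To inspect the controlled subset yourself, here is a minimal sketch assuming the MMS files have been extracted into one folder per language (e.g. data/mlaad_mms/<lang>/*.wav, a hypothetical layout; adjust to your actual extraction):

```python
from pathlib import Path

# Assumed layout: data/mlaad_mms/<language code>/<utterance>.wav
MMS_ROOT = Path("data/mlaad_mms")
LANGUAGES = ["fi", "de", "ru", "sw", "uk", "en", "fr", "nl", "hu", "ro"]

# Build a manifest of up to 1,000 spoofed utterances per language.
manifest = {lang: sorted((MMS_ROOT / lang).glob("*.wav"))[:1000] for lang in LANGUAGES}

for lang, files in manifest.items():
    print(f"{lang}: {len(files)} utterances")
```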
- Python 3.8+
- CUDA-capable GPU (recommended)
- 16GB+ RAM
# Clone repository
git clone https://github.com/victorgmoreno/crosslingual_bias_audiodeepfake.git
cd crosslingual_bias_audiodeepfake
# Create environment
conda create -n crosslingual python=3.8
conda activate crosslingual
# Install PyTorch (adjust CUDA version)
pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
# Install dependencies
pip install -r requirements.txt

Contents of requirements.txt:

numpy>=1.20.0
scipy>=1.7.0
pandas>=1.3.0
scikit-learn>=1.0.0
matplotlib>=3.4.0
seaborn>=0.11.0
tqdm>=4.62.0
transformers>=4.20.0
fairseq>=0.12.0

Model architecture:

Raw Audio (16kHz)
↓
wav2vec2 XLS-R 300M (Front-end)
↓
2D Self-Attentive Pooling
↓
AASIST (Back-end)
  - Heterogeneous Graph Attention
  - Spectro-temporal modeling
↓
CM Score [0,1]
- Front-end: wav2vec2 XLS-R 300M (pretrained on 436K hours multilingual speech)
- Back-end: AASIST with graph attention networks
- Output: Countermeasure scores (higher = more likely spoof)
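As an illustrative, highly simplified sketch of how such a front-end → pooling → back-end pipeline fits together (the class names and layer sizes below are stand-ins, not the repository's actual modules in SSL_Anti-spoofing/models/):

```python
import torch
import torch.nn as nn

class ToySelfAttentivePooling(nn.Module):
    """Toy attentive pooling over the time axis (stand-in for the 2D self-attentive pooling)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.attention = nn.Sequential(nn.Linear(feat_dim, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, x):                                  # x: (batch, time, feat_dim)
        weights = torch.softmax(self.attention(x), dim=1)  # (batch, time, 1)
        return (weights * x).sum(dim=1)                    # (batch, feat_dim)

class ToyCountermeasure(nn.Module):
    """Illustrative front-end -> pooling -> back-end composition (not the repository's real classes)."""
    def __init__(self, ssl_dim: int = 1024):
        super().__init__()
        # Stand-in for the wav2vec2 XLS-R 300M front-end (XLS-R emits ~1024-dim frame embeddings).
        self.frontend = nn.Sequential(nn.Conv1d(1, ssl_dim, kernel_size=400, stride=320), nn.ReLU())
        self.pooling = ToySelfAttentivePooling(ssl_dim)
        # Stand-in for the AASIST graph-attention back-end, reduced to a small MLP here.
        self.backend = nn.Sequential(nn.Linear(ssl_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, wav):                                       # wav: (batch, samples), raw 16 kHz audio
        feats = self.frontend(wav.unsqueeze(1)).transpose(1, 2)   # (batch, time, ssl_dim)
        pooled = self.pooling(feats)                              # (batch, ssl_dim)
        return torch.sigmoid(self.backend(pooled)).squeeze(-1)    # CM score in [0, 1], higher = spoof

model = ToyCountermeasure()
scores = model(torch.randn(2, 16000))  # two dummy one-second utterances
print(scores)
```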
crosslingual_bias_audiodeepfake/
├── SSL_Anti-spoofing/           # Main detection framework
│   ├── fairseq-*/               # Modified fairseq for wav2vec2
│   ├── core_scripts/            # Data I/O and utilities
│   ├── models/
│   │   ├── aasist.py            # AASIST implementation
│   │   └── wav2vec2_ssl.py      # wav2vec2 front-end
│   └── main.py                  # Training script
├── evaluate_mlaad/              # Cross-lingual evaluation
│   ├── main.py                  # Evaluation entry point
│   ├── config.py                # Configuration
│   ├── data.py                  # MLAAD data handling
│   ├── evaluate.py              # Bias analysis
│   └── visualize.py             # Generate paper figures
├── scripts/
│   └── statistical_tests.py     # Mann-Whitney U, CLES
├── configs/                     # Configuration files
├── requirements.txt
└── README.md
# Download ASVspoof 5 Track 01 (registration required)
# https://www.asvspoof.org/
# Download MLAAD dataset
# https://github.com/piotrkawa/mlaad
# Extract only the MMS subset for evaluation

python scripts/prepare_data.py \
--asvspoof_dir /path/to/asvspoof5 \
--mlaad_dir /path/to/mlaad \
--output_dir data/

# Train the detector on ASVspoof 5 (English only)
python SSL_Anti-spoofing/main.py \
--config configs/aasist_wav2vec2.yaml \
--train_data data/asvspoof5/train \
--val_data data/asvspoof5/dev \
--epochs 100 \
--batch_size 32 \
--lr 1e-4

# Evaluate cross-lingually on the MLAAD MMS subset
cd evaluate_mlaad/
python main.py \
--model_path ../models/aasist_wav2vec2_asvspoof5.pth \
--mlaad_path ../data/mlaad_mms/ \
--languages fi,de,ru,sw,uk,en,fr,nl,hu,ro \
--output_dir ../results/

| Language | Code | Mean CM Score | Std Dev | Detection Quality |
|---|---|---|---|---|
| Romanian | ro | 0.99 | 0.05 | Excellent |
| French | fr | 0.97 | 0.15 | Excellent |
| Russian | ru | 0.97 | 0.14 | Excellent |
| Finnish | fi | 0.95 | 0.18 | Very Good |
| English | en | 0.84 | 0.31 | |
| German | de | 0.82 | 0.32 | |
| Dutch | nl | 0.82 | 0.30 | |
| Hungarian | hu | 0.74 | 0.38 | |
| Swahili | sw | 0.48 | 0.41 | Poor |
| Ukrainian | uk | 0.12 | 0.27 | Very Poor |
# Generate paper figures
python ./scripts/statistical_tests.py \
--csv_path <path to results file> \
--output_dir <desired output folder>

The input CSV file must contain the following pipe-separated columns:
original_file|language|is_original_language|duration|training_data|model_name|architecture|transcript|language_folder|model_folder|meta_csv_path|absolute_audio_path|score|gender
The script writes the following plots to disk:
- Figure 2: Score distributions (violin plots)
- Figure 3: P-value and CLES heatmap
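For a quick look at the per-language score statistics outside visualize.py, here is a minimal sketch assuming the pipe-separated results CSV described above (the file path is hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Results CSV with the pipe-separated columns listed above (path is hypothetical).
df = pd.read_csv("results/mlaad_mms_scores.csv", sep="|")

# Per-language mean and standard deviation of the CM scores.
summary = df.groupby("language")["score"].agg(["mean", "std"]).sort_values("mean", ascending=False)
print(summary)

# Quick violin plot of the score distributions (Figure 2 in the paper is the polished version).
languages = summary.index.tolist()
plt.violinplot([df.loc[df["language"] == lang, "score"] for lang in languages], showmeans=True)
plt.xticks(range(1, len(languages) + 1), languages, rotation=45)
plt.ylabel("CM score")
plt.tight_layout()
plt.savefig("score_distributions.png")
```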
# 1. Train on ASVspoof 5 (English only)
./scripts/train_english_only.sh
# 2. Evaluate on MLAAD MMS subset
./scripts/evaluate_crosslingual.sh
# 3. Statistical analysis
./scripts/statistical_tests.py

Expected results:
- Model achieves ~5.16% EER on ASVspoof 5 evaluation set
- Significant language-dependent performance gaps
- P-values < 0.001 for most language pairs
- CLES values showing large effect sizes
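For reference, the EER is the operating point where the miss rate and false-alarm rate coincide. A minimal sketch of that computation on two hypothetical score arrays, using this repository's convention that higher CM scores mean "more likely spoof":

```python
import numpy as np

def compute_eer(spoof_scores, bonafide_scores):
    """Equal error rate for CM scores where higher means 'more likely spoof'."""
    thresholds = np.sort(np.unique(np.concatenate([spoof_scores, bonafide_scores])))
    # Miss rate: spoof scored below the threshold; false alarm: bonafide scored at/above it.
    miss = np.array([(spoof_scores < t).mean() for t in thresholds])
    false_alarm = np.array([(bonafide_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(miss - false_alarm))
    return (miss[idx] + false_alarm[idx]) / 2

# Dummy example with synthetic scores.
rng = np.random.default_rng(0)
print(f"EER: {compute_eer(rng.uniform(0.6, 1.0, 500), rng.uniform(0.0, 0.4, 500)):.2%}")
```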
Key parameters in configs/aasist_wav2vec2.yaml:
model:
  frontend: "wav2vec2-xls-r-300m"
  backend: "aasist"
  pooling: "2d_self_attentive"

training:
  epochs: 100
  batch_size: 32
  learning_rate: 1e-4
  optimizer: "adam"
  loss: "binary_cross_entropy"

evaluation:
  languages: ["fi", "de", "ru", "sw", "uk", "en", "fr", "nl", "hu", "ro"]
  samples_per_language: 1000
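If a custom script needs to stay in sync with these values, one way to read them is via PyYAML (a sketch; PyYAML is not pinned in requirements.txt, so install it separately if it is missing):

```python
import yaml  # PyYAML; `pip install pyyaml` if not already present

with open("configs/aasist_wav2vec2.yaml") as f:
    config = yaml.safe_load(f)

print(config["model"]["frontend"])        # "wav2vec2-xls-r-300m"
print(config["training"]["batch_size"])   # 32
print(config["evaluation"]["languages"])  # ["fi", "de", ...]
```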
Citation: coming soon.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
This project is licensed under the MIT License - see LICENSE file.
- Ethical Considerations: This research reveals biases that should be addressed in deployment, not exploited
- Reproducibility: Random seeds are fixed throughout (see the seed-fixing sketch after this list)
- Computational Requirements: ~2-3 hours on single GPU for full evaluation
- Data Access: ASVspoof 5 requires registration; MLAAD is publicly available
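A minimal sketch of the kind of seed fixing referred to above (the exact seeds and calls used by the repository's scripts may differ):

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness; the repository's actual seed values may differ."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
```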
- CAPES β Finance Code 001
- FAPESP Horus project (Grant #2023/12865-8)
- FAPESP BI0S project (Grant #2020/09838-0)
- CPQD Company
- Authors of the MLAAD dataset (Müller et al., 2024)
- Authors of "Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation" (Tak et al., 2022)
- ASVspoof challenge organizers
For questions: paulad@unicamp.br
Note: This work analyzes bias in existing detection systems. The MLAAD dataset was created by Müller et al. (2024) and is used here for controlled evaluation.