
ASR models that simulate hearing with hearing loss. To this end, the models are trained on audio in which specific frequency bands have been filtered out.

AI-based speech recognition as a method for investigating hearing loss

Table of Contents

  1. Requirements

    1.1. Virtual Environment

    1.2. Working Environment

  2. Scripts

  3. Data - CommonVoice English Dataset

    3.1. Setup Authentication

    3.2. Download the Dataset

    3.3. Download Options

  4. Data Preprocessing

    4.1. Audio Masking for Hearing Loss Simulation

    4.1.1. Overview

    4.1.2. Usage

    4.1.3. Advanced Options

    4.1.4. Processing Parameters

    4.1.5. Technical Details

    4.2. Log Mel-Frequency Spectrograms

    4.2.1. Overview

    4.2.2. Individual Processing Script

    4.2.3. Batch Processing Script

  5. Whisper

    5.1. Training of Whisper

    5.2. Prediction with Whisper

  6. LDL-AURIS

    6.1. Installation of the Montreal Forced Aligner (MFA)

    6.2. Data Processing with MFA

    6.2.1. Forced Alignment Script

    6.2.2. Extract Word Audio Script

    6.2.3. Mask Word Audio Script

  7. SLURM Processing

    7.1. Audio Masking

    7.2. Log Mel-Frequency Spectrograms

    7.3. Whisper Training

    7.4. Whisper Prediction

    7.5. Create Forced Alignment

    7.6. Extract Word Audios

    7.7. Mask Word Audios

1. Requirements

1.1. Virtual Environment

To ensure that the same requirements are met across different operating systems and machines, it is recommended to create a virtual environment. This can be set up with UV.

which uv || echo "UV not found" # checks the UV installation

If UV is not installed, it can be installed as follows.

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env # Add UV to PATH for current session or restart the terminal.

Afterwards, the virtual environment can be created and activated.

uv venv .venv # creates a virtual environment with the name ".venv"
source .venv/bin/activate # activates the virtual environment

Then the required packages are installed. UV ensures that the exact versions are installed.

uv sync --active  # installs exact versions

1.2. Working Environment

Note: This project is designed to run on the AISC cluster at HPI. The SLURM batch scripts are pre-configured with AISC-specific settings (--account=aisc, --partition=aisc, --qos=aisc). If you're using a different HPC cluster, you'll need to modify the SLURM directives in the .sbatch files accordingly.

Getting Access to the AISC Cluster:

Before running any SLURM scripts, you need to configure your personal working directory:

  1. Copy the environment template:

    cp .env.local.template .env.local
  2. Edit .env.local to set your working directory:

    # Open in your preferred editor
    nano .env.local
    # or
    vim .env.local
  3. Update the PROJECT_ROOT variable to point to your personal working directory:

    # Example for user "john.doe":
    PROJECT_ROOT=/sc/home/john.doe/pilotproject-hearing-loss
    
    # Example for different mount point:
    PROJECT_ROOT=/home/username/projects/pilotproject-hearing-loss
  4. Verify your configuration:

    source .env.local
    echo "Project root: $PROJECT_ROOT"

Note: The .env.local file is ignored by git, so your personal configuration won't be committed to the repository.

2. Scripts

All scripts are located in the scripts folder.

3. Data - CommonVoice English Dataset

⚠️ CRITICAL: Dataset Access Has Changed (Updated 18.11.2025)
Mozilla has removed CommonVoice datasets from HuggingFace. The dataset is now exclusively available through the Mozilla Data Collective platform.

Status Update:

  • The original CommonVoice 16.1 dataset (66 GB, 1.7M clips) is no longer accessible through HuggingFace
  • Mozilla Data Collective only provides access to the latest version (CommonVoice 23.0: 86.83 GB, 2.54M clips)
  • Older dataset versions can no longer be downloaded easily; you need to contact the Mozilla Data Collective via email and state which version you need and why. See: Mozilla's community discussion on accessing older versions
  • Migration Required: This repository must be updated to work with CommonVoice 23.0

Action Required:

  • The instructions below reference the old HuggingFace location and no longer work
  • The download_commonvoice.py script needs to be rewritten to use the Mozilla Data Collective API
  • Dataset format conversion from TSV (Mozilla) to Arrow (HuggingFace) format is required
  • See: Mozilla's FAQ on dataset access

This migration is planned and will be implemented in a separate pull request.

Create a folder where the data will be stored. Because the dataset is relatively large, the data is not included in this GitHub repository and has to be downloaded with the scripts below.

mkdir data

This project uses the CommonVoice English dataset from HuggingFace. Note: This dataset requires HuggingFace authentication.

3.1. Setup Authentication

  1. Create a HuggingFace account at huggingface.co
  2. Get an access token: Go to Settings → Access Tokens → Create new token (Read access is sufficient)
  3. Accept the dataset terms: Visit CommonVoice 16.1 and accept the terms

3.2. Download the Dataset

# Method 1: Using environment variable (recommended)
export HF_TOKEN="your_token_here"
python scripts/download_commonvoice.py

# Method 2: Using command line argument
python scripts/download_commonvoice.py --token "your_token_here"

This will download the dataset to data/CommonVoiceEN/ by default.

3.3. Download Options

The script supports several options:

# Download with full caching (recommended for development)
python scripts/download_commonvoice.py

# Download in streaming mode (for large-scale processing)
python scripts/download_commonvoice.py --streaming

# Download only specific splits
python scripts/download_commonvoice.py --splits train validation

# Custom output directory
python scripts/download_commonvoice.py --output-dir /path/to/custom/location

# Use different CommonVoice version (if needed)
python scripts/download_commonvoice.py --version mozilla-foundation/common_voice_17_0
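
After a successful download, the dataset is stored on disk in the HuggingFace datasets (Arrow) format. The following is a minimal sketch for inspecting it in Python; it assumes the script saved the data with save_to_disk into data/CommonVoiceEN/ and that the columns follow the usual CommonVoice schema (for example, sentence and audio). Adjust the path if you used --output-dir.

# Sketch: load and inspect the downloaded dataset (assumes a HuggingFace save_to_disk layout).
from datasets import load_from_disk

dataset = load_from_disk("data/CommonVoiceEN")   # path may differ if --output-dir was used
print(dataset)                                   # lists the splits and their sizes
sample = dataset["train"][0]                     # first training example
print(sample["sentence"])                        # transcript column in the CommonVoice schema
print(sample["audio"]["sampling_rate"])          # decoded audio and its sampling rate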

4. Data Preprocessing

4.1. Audio Masking for Hearing Loss Simulation

This project includes functionality to simulate different types of hearing loss by applying frequency-specific attenuation masks to the CommonVoice dataset.

4.1.1. Overview

The scripts/mask_audio.py script processes CommonVoice datasets to create three variants:

  • Normal hearing baseline (*_normal): 10 dB threshold across all frequencies
  • Low-frequency hearing loss (*_lfloss): High attenuation at low frequencies (125 Hz: 100 dB → 8000 Hz: 10 dB)
  • High-frequency hearing loss (*_hfloss): High attenuation at high frequencies (125 Hz: 10 dB → 8000 Hz: 100 dB)
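
The profiles can be thought of as simple frequency-to-attenuation curves that are later interpolated onto the STFT frequency bins. A minimal illustration with hypothetical anchor points (the exact frequencies and values used by scripts/mask_audio.py may differ):

# Illustrative hearing loss profiles: attenuation (dB) at two anchor frequencies (Hz).
import numpy as np

PROFILE_FREQS_HZ = np.array([125.0, 8000.0])
PROFILES_DB = {
    "normal": np.array([10.0, 10.0]),     # flat 10 dB threshold
    "lfloss": np.array([100.0, 10.0]),    # heavy attenuation at low frequencies
    "hfloss": np.array([10.0, 100.0]),    # heavy attenuation at high frequencies
}

def attenuation_db(condition, freqs_hz):
    """Interpolate a profile onto arbitrary frequency bins (e.g. STFT bin centres)."""
    return np.interp(freqs_hz, PROFILE_FREQS_HZ, PROFILES_DB[condition])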

4.1.2. Usage

# Basic usage - process the downloaded CommonVoice dataset
python scripts/mask_audio.py

# This creates three new datasets:
# - data/CommonVoiceEN_normal/
# - data/CommonVoiceEN_lfloss/
# - data/CommonVoiceEN_hfloss/

4.1.3. Advanced Options

# Specify input and output directories
python scripts/mask_audio.py \
    --input-dir data/CommonVoiceEN \
    --output-base data/MyProcessedDataset

# Configure processing parameters
python scripts/mask_audio.py \
    --sample-rate 16000 \
    --batch-size 64 \
    --num-workers 8

# Enable debug logging
python scripts/mask_audio.py --log-level DEBUG

4.1.4. Processing Parameters

  • --input-dir: Path to the input CommonVoice dataset (default: data/CommonVoiceEN)
  • --output-base: Base name for output directories (default: same as input directory)
  • --sample-rate: Target sample rate in Hz (default: 16000, required for Whisper models)
  • --batch-size: Number of audio samples to process per batch (default: 32)
  • --num-workers: Number of CPU cores for parallel processing (default: 4)
  • --log-level: Logging verbosity: DEBUG, INFO, WARNING, ERROR (default: INFO)

4.1.5. Technical Details

Audio Processing Pipeline:

  1. Resampling: Audio is resampled to the target sample rate (16 kHz by default)
  2. STFT: Short-Time Fourier Transform with 2048-sample window and 512-sample hop
  3. Frequency Masking: Interpolated attenuation based on hearing loss profiles
  4. Reconstruction: Inverse STFT to reconstruct audio signals
  5. Normalization: Audio amplitude normalization to prevent clipping
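
A compact sketch of this pipeline, reusing the attenuation_db helper sketched in the Overview above. It is illustrative only; the windowing details and the dB-to-gain mapping inside scripts/mask_audio.py may differ.

# Sketch of the masking pipeline (resample -> STFT -> attenuate -> iSTFT -> normalise).
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly, stft, istft

def apply_hearing_loss(path, condition, target_sr=16000, n_fft=2048, hop=512):
    audio, sr = sf.read(path)                                   # assumes mono input
    if sr != target_sr:
        audio = resample_poly(audio, target_sr, sr)             # 1. resample to the target rate
    freqs, _, spec = stft(audio, fs=target_sr,
                          nperseg=n_fft, noverlap=n_fft - hop)  # 2. STFT: 2048 window, 512 hop
    gain = 10.0 ** (-attenuation_db(condition, freqs) / 20.0)   # 3. per-bin attenuation from the profile
    _, masked = istft(spec * gain[:, None], fs=target_sr,
                      nperseg=n_fft, noverlap=n_fft - hop)      # 4. inverse STFT
    peak = np.max(np.abs(masked))
    return masked / peak if peak > 0 else masked                # 5. normalise to prevent clipping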

Memory Management:

  • Uses batch processing to handle large datasets efficiently
  • Supports multiprocessing for faster execution
  • Automatically manages memory cleanup between batches

Output Structure: Each output dataset preserves the exact structure of the input dataset, including:

  • All data splits (train, validation, test, etc.)
  • Complete metadata (transcriptions, speaker information, etc.)
  • HuggingFace dataset format compatibility

4.2. Log Mel-Frequency Spectrograms

This project includes functionality to convert the hearing loss datasets into Log Mel-Frequency Spectrograms suitable for Whisper and LDL-AURIS model training. The preprocessing pipeline converts audio into 128-dimensional Log-Mel spectrograms required by Whisper Large V3.

4.2.1. Overview

The Log-Mel preprocessing creates training-ready datasets from the hearing loss variants:

  • Normal hearing spectrograms from *_normal datasets
  • Low-frequency hearing loss spectrograms from *_lfloss datasets
  • High-frequency hearing loss spectrograms from *_hfloss datasets

Each dataset is converted to Log-Mel spectrograms with proper tokenization for Whisper training. For large-scale processing on computing clusters, see SLURM Processing → Log Mel-Frequency Spectrograms.
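
A minimal sketch of what this conversion involves, using the transformers feature extractor and tokenizer; scripts/DataSet2LogMel.py may organise the same steps differently:

# Sketch: 16 kHz audio -> 128-bin Log-Mel features and token labels for Whisper Large V3.
import numpy as np
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3",
                                             language="en", task="transcribe")

audio = np.zeros(16000 * 5, dtype=np.float32)    # placeholder: 5 s of silence at 16 kHz
features = feature_extractor(audio, sampling_rate=16000, return_tensors="np")
print(features["input_features"].shape)          # (1, 128, 3000): 128 mel bins over a padded 30 s window

labels = tokenizer("where was his horse").input_ids   # token ids used as training labels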

4.2.2. Individual Processing Script

Use scripts/DataSet2LogMel.py to convert a single hearing loss dataset:

# Convert normal hearing dataset
python scripts/DataSet2LogMel.py \
    --input_dataset data/CommonVoiceEN_normal/dataset \
    --output_dataset data/CommonVoiceEN_normal_logmel

# Convert low-frequency hearing loss dataset
python scripts/DataSet2LogMel.py \
    --input_dataset data/CommonVoiceEN_lfloss/dataset \
    --output_dataset data/CommonVoiceEN_lfloss_logmel

Command-line Arguments

  • --input_dataset (required): Path to input CommonVoice dataset folder
  • --output_dataset (required): Path where preprocessed dataset will be saved
  • --model_size: Whisper model size for feature extraction (default: "large-v3")
  • --num_cpus: Number of CPU cores to use (default: all available)
  • --batch_size: Processing batch size (default: 1000)
  • --writer_batch_size: Writer batch size for disk saving (default: 100)
  • --max_memory_per_worker: Maximum memory per worker in GB (default: 4.0)
  • --language: Language for tokenizer (default: "en")
  • --task: Task type for tokenizer: "transcribe" or "translate" (default: "transcribe")
  • --shuffle_seed: Random seed for shuffling (default: 42)
  • --max_samples: Maximum samples per split for testing (default: all)

4.2.3. Batch Processing Script

Use scripts/DataSet2LogMelBatch.py to automatically process all three hearing loss variants:

# Process all datasets with default settings
python scripts/DataSet2LogMelBatch.py

# Process with custom resource allocation
python scripts/DataSet2LogMelBatch.py \
    --cpus-per-task 64 \
    --memory 500G \
    --batch-size 2000

Command-line Arguments

SLURM Resource Parameters:

  • --cpus-per-task: CPU cores per SLURM task (default: 48)
  • --memory: Memory allocation per job (default: "400G")
  • --time: Time limit per job (default: "32:00:00")

Processing Parameters:

  • --model-size: Whisper model size (default: "large-v3")
  • --batch-size: Processing batch size (default: 1000)
  • --max-samples: Maximum samples per split for testing (default: all)
  • --skip-existing: Skip datasets with existing output directories

5. Whisper

5.1. Training of Whisper

The train_whisper.py script fine-tunes OpenAI's Whisper model on hearing loss datasets. It supports training on normal audio as well as high-frequency and low-frequency hearing loss simulations.

Basic Usage:

python scripts/train_whisper.py \
    --model_name_or_path openai/whisper-large-v3 \
    --train_dataset_path data/CommonVoiceEN_normal_logmel \
    --output_dir results/whisper_finetuned_normal \
    --num_train_epochs 3

Key Parameters:

  • --model_name_or_path: Base Whisper model (default: "openai/whisper-large-v3")
  • --train_dataset_path: Path to Log-Mel spectrogram dataset
  • --output_dir: Directory to save model checkpoints
  • --num_train_epochs: Number of training epochs (default: 3)
  • --per_device_train_batch_size: Batch size per GPU (default: 8)
  • --gradient_accumulation_steps: Gradient accumulation steps (default: 2)
  • --learning_rate: Learning rate (default: 1e-5)
  • --warmup_steps: Number of warmup steps (default: 500)
  • --save_steps: Save checkpoint every N steps (default: 1000)
  • --eval_steps: Evaluate every N steps (default: 1000)
  • --logging_steps: Log metrics every N steps (default: 25)
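
The flags above correspond closely to the Hugging Face Seq2SeqTrainingArguments used for sequence-to-sequence fine-tuning. A rough sketch of an equivalent configuration (assuming the script builds on the transformers Trainer API; the actual names in train_whisper.py may differ):

# Sketch: how the training parameters above map onto transformers' Seq2SeqTrainingArguments.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="results/whisper_finetuned_normal",  # --output_dir
    num_train_epochs=3,                             # --num_train_epochs
    per_device_train_batch_size=8,                  # --per_device_train_batch_size
    gradient_accumulation_steps=2,                  # --gradient_accumulation_steps
    learning_rate=1e-5,                             # --learning_rate
    warmup_steps=500,                               # --warmup_steps
    save_steps=1000,                                # --save_steps
    eval_steps=1000,                                # --eval_steps
    eval_strategy="steps",                          # older transformers versions call this evaluation_strategy
    logging_steps=25,                               # --logging_steps
    report_to=["tensorboard"],                      # metrics can then be viewed with TensorBoard (see below)
)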

Training on Different Hearing Loss Variants:

# Normal hearing
python scripts/train_whisper.py \
    --train_dataset_path data/CommonVoiceEN_normal_logmel \
    --output_dir results/whisper_finetuned_normal

# High-frequency hearing loss
python scripts/train_whisper.py \
    --train_dataset_path data/CommonVoiceEN_hfloss_logmel \
    --output_dir results/whisper_finetuned_hfloss

# Low-frequency hearing loss
python scripts/train_whisper.py \
    --train_dataset_path data/CommonVoiceEN_lfloss_logmel \
    --output_dir results/whisper_finetuned_lfloss

Monitoring Training:

# View training progress with TensorBoard
python scripts/tensorboard_visualise_runs.py --logdir results/whisper_finetuned_normal

For large-scale training on computing clusters, see SLURM Processing → Whisper Training.

5.2. Prediction with Whisper

The analyse_with_whisper.py script performs comprehensive analysis of Whisper model predictions, extracting detailed metrics including token probabilities, embeddings, entropy, and semantic similarity measures.

Basic Usage:

python scripts/analyse_with_whisper.py \
    --input-folder data/MALD \
    --output-path results/whisper_predictions_normal/analysis.json \
    --model-path results/whisper_finetuned_normal/checkpoint-2000 \
    --num-workers 4 \
    --top-k 1000

Key Parameters:

  • --input-folder: Directory containing audio files (ground truth = filename)
  • --output-path: Path for main JSON output file
  • --model-path: Path to Whisper model checkpoint
  • --num-workers: Number of dataloader workers (default: 20)
  • --num-threads: CPU threads for processing (default: 8)
  • --num-gpus: Number of GPUs to use (default: 1, 0 for CPU only)
  • --batch-size: Batch size for processing (default: 1)
  • --top-k: Number of top predictions to save (default: 1000)

Output Structure:

The script generates:

  1. Main JSON file (analysis.json): Contains all results with model metadata
  2. Individual JSON files: One per audio file with detailed per-token metrics

Metrics Extracted:

Per-Token Metrics:

  • Predicted token probability and rank
  • Entropy and semantic density
  • Top-k alternative predictions with probabilities and ranks
  • Hidden state embeddings (1280-dimensional)
  • Ground truth comparison (cosine similarity, correlation)

Normalized Metrics:

  • Case-insensitive token grouping
  • Aggregated probabilities across token variants
  • Normalized embeddings (probability-weighted, simple average, most probable)

Pooled Metrics (across all predicted tokens):

  • Average rank, probability, and entropy
  • Pooled hidden states
  • Pooled ground truth comparison metrics
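
For intuition, the core per-token quantities can be computed from a model's logits and hidden states as in the following sketch; analyse_with_whisper.py may compute them differently:

# Sketch: per-token metrics from logits and hidden states (illustrative only).
import torch
import torch.nn.functional as F

def token_metrics(logits, hidden, reference_hidden, token_id):
    """logits: (vocab,), hidden/reference_hidden: (1280,), token_id: predicted token id."""
    probs = F.softmax(logits, dim=-1)
    prob = probs[token_id].item()                                  # predicted token probability
    rank = int((probs > probs[token_id]).sum().item()) + 1         # 1-based rank among all tokens
    entropy = -(probs * torch.log(probs + 1e-12)).sum().item()     # uncertainty of the distribution
    top_probs, top_ids = probs.topk(5)                             # top-k alternatives (the script uses --top-k)
    cosine = F.cosine_similarity(hidden, reference_hidden, dim=0).item()  # ground truth comparison
    return {"prob": prob, "rank": rank, "entropy": entropy,
            "top_ids": top_ids.tolist(), "cosine_to_reference": cosine}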

Example Analysis:

# Analyze single audio file
python scripts/analyse_with_whisper.py \
    --input-folder test/transcendentalists \
    --output-path results/single_analysis.json \
    --model-path results/whisper_finetuned_normal/checkpoint-2000 \
    --num-workers 0

# Analyze MALD dataset with multiple models
python scripts/analyse_with_whisper.py \
    --input-folder data/MALD \
    --output-path results/whisper_predictions_normal/mald_analysis.json \
    --model-path results/whisper_finetuned_normal/checkpoint-2000

For detailed documentation of all metrics, see README_whisper_metrics.md. For batch processing on computing clusters, see SLURM Processing → Whisper Prediction.

6. LDL-AURIS

6.1. Installation of the Montreal Forced Aligner (MFA)

Montreal Forced Aligner (MFA) is used for forced alignment to generate word and phoneme boundaries from audio files and their transcriptions. On the HPI cluster, MFA is installed using enroot containers since direct pip/conda installation is not available.

Initial Setup (One-Time)

To set up MFA on your account, you need to import the Docker image and create a container. This only needs to be done once per user account.

Step 1: Import the Docker image

Run on a compute node with sufficient memory (64GB recommended):

srun -A aisc -p aisc-interactive --mem=64G --cpus-per-task=8 --time=01:00:00 --constraint=ARCH:X86 \
  enroot import docker://mmcauliffe/montreal-forced-aligner:latest

This downloads and converts the Docker image to a squashfs file (mmcauliffe+montreal-forced-aligner+latest.sqsh, ~4.5 GB) in your current directory.

Step 2: Create the container

srun -A aisc -p aisc-interactive --mem=32G --cpus-per-task=4 --time=00:30:00 --constraint=ARCH:X86 \
  enroot create --name mfa mmcauliffe+montreal-forced-aligner+latest.sqsh

This unpacks the squashfs file into a container stored in ~/.local/share/enroot/mfa/.

Step 3: Download acoustic models and dictionary

Use --rw to make changes persistent:

srun -A aisc -p aisc-interactive --mem=32G --cpus-per-task=4 --time=00:30:00 --constraint=ARCH:X86 \
  enroot start --rw --mount $PWD:/workspace mfa \
  bash -c "mfa model download acoustic english_mfa && mfa model download dictionary english_mfa"

Verifying the Installation

# Check MFA version
srun -A aisc -p aisc-interactive --mem=8G --cpus-per-task=2 --time=00:10:00 --constraint=ARCH:X86 \
  enroot start --mount $PWD:/workspace mfa mfa version

# List available acoustic models
srun -A aisc -p aisc-interactive --mem=8G --cpus-per-task=2 --time=00:10:00 --constraint=ARCH:X86 \
  enroot start --mount $PWD:/workspace mfa mfa model list acoustic

# Expected output: ['english_mfa']

6.2. Data Processing with MFA

For generating word-level alignments and extracting individual word audio segments, we provide two Python scripts that use the torchaudio MMS_FA model instead of MFA's TextGrid output.

Input Requirements

Both scripts require audio files and corresponding transcription files in the same directory:

  • Audio files: .wav format
  • Transcription files: .txt format (same filename, different extension)
  • Example: 478.wav and 478.txt (containing just the transcription text)

The transcription file should contain only the transcript text, e.g.:

Where was his horse?
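
A quick way to sanity-check that every audio file has a matching transcription is sketched below (a hypothetical helper, not part of the repository; the split path is an assumption based on the defaults used later):

# Sketch: verify that each .wav file has a matching .txt transcription in the same folder.
from pathlib import Path

def check_pairs(directory):
    missing = [wav.name for wav in Path(directory).glob("*.wav")
               if not wav.with_suffix(".txt").exists()]
    print(f"{directory}: {len(missing)} wav files without a transcription")
    return missing

check_pairs("data/CommonVoiceENraw/train")   # assumed split layout; adjust to your data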

6.2.1 Forced Alignment Script (scripts/forced_alignment.py)

Performs GPU-accelerated forced alignment using torchaudio's MMS_FA model to generate word-level timing information stored as JSON files.

Usage:

# Basic usage with defaults
python scripts/forced_alignment.py

# Custom directories
python scripts/forced_alignment.py --input-dir data/CommonVoiceENraw --output-dir data/CommonVoiceENJSON

# Process specific splits with multi-GPU support
python scripts/forced_alignment.py --splits train test --batch-size 32 --num-gpus 4

Key Arguments:

  • --input-dir: Input directory with wav/txt pairs (default: data/CommonVoiceENraw)
  • --output-dir: Output directory for JSON files (default: data/CommonVoiceENJSON)
  • --splits: Splits to process (default: train test validation)
  • --num-gpus: Number of GPUs to use (default: 1)
  • --batch-size: Number of files per batch (default: 16)
  • --skip-existing: Skip files that already have JSON output

Output: JSON Files

The script generates JSON files with word boundaries for each audio file:

Example output (10110.json):

{
  "file_id": "10110",
  "original_transcript": "I really liked the film we saw last week.",
  "normalized_transcript": "i really liked the film we saw last week",
  "duration": 5.88,
  "words": [
    {"word": "i", "start": 1.1238, "end": 1.1439, "score": 0.9977},
    {"word": "really", "start": 1.2041, "end": 1.4449, "score": 0.9496},
    {"word": "liked", "start": 1.5051, "end": 1.7459, "score": 0.9624},
    ...
  ]
}
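
At its core, alignment with torchaudio's MMS_FA bundle looks roughly like the sketch below, which follows torchaudio's forced-alignment tutorial; the project script adds batching, multi-GPU support, transcript normalization, and the JSON output shown above.

# Sketch: word-level alignment with torchaudio's MMS_FA bundle (single mono file).
import torch
import torchaudio

bundle = torchaudio.pipelines.MMS_FA
model = bundle.get_model(with_star=False)
tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()

waveform, sr = torchaudio.load("data/CommonVoiceENraw/train/478.wav")   # path is an example
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
words = "where was his horse".split()              # normalized transcript from 478.txt

with torch.inference_mode():
    emission, _ = model(waveform)                  # frame-level label probabilities
    spans = aligner(emission[0], tokenizer(words)) # one list of token spans per word

seconds_per_frame = waveform.size(1) / emission.size(1) / bundle.sample_rate
for word, word_spans in zip(words, spans):
    start = word_spans[0].start * seconds_per_frame
    end = word_spans[-1].end * seconds_per_frame
    print(f"{word}: {start:.2f}s - {end:.2f}s")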

6.2.2 Extract Word Audio Script (scripts/extract_word_audio.py)

Extracts individual word audio segments from full audio files using the word boundary JSON files generated by forced_alignment.py.

Usage:

# Basic usage with defaults
python scripts/extract_word_audio.py

# Custom directories
python scripts/extract_word_audio.py --input-json data/CommonVoiceENJSON --input-audio data/CommonVoiceENraw --output-audio data/CommonVoiceENWords

# Adjust parallelism
python scripts/extract_word_audio.py --nworkers 32 --batch-size 200

Key Arguments:

  • --input-json: Directory containing word alignment JSON files (default: data/CommonVoiceENJSON)
  • --input-audio: Directory containing source audio files (default: data/CommonVoiceENraw)
  • --output-audio: Directory to write extracted word audio files (default: data/CommonVoiceENWords)
  • --nworkers: Number of parallel workers (default: 24)
  • --batch-size: Number of files to process per batch (default: 100)
  • --splits: Splits to process (default: train test validation)

Output: Individual Word Audio Files

The script creates one WAV file per word with the naming convention:

{file_id}_{word_index}_{word}.wav

Example output structure:

data/CommonVoiceENWords/
├── train/
│   ├── 10110_1_i.wav
│   ├── 10110_2_really.wav
│   ├── 10110_3_liked.wav
│   └── ...
├── test/
│   └── ...
└── validation/
    └── ...

Word indices are 1-based, and apostrophes are preserved in filenames (e.g., 12345_5_don't.wav).
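
Conceptually, the extraction step slices the source waveform at the JSON word boundaries and writes one file per word, as in this minimal sketch (the actual script adds batching and parallel workers; the split paths are assumptions based on the defaults above):

# Sketch: cut word-level segments out of a full recording using the alignment JSON.
import json
from pathlib import Path
import soundfile as sf

def extract_words(json_path, audio_dir, output_dir):
    info = json.loads(Path(json_path).read_text())
    audio, sr = sf.read(Path(audio_dir) / f"{info['file_id']}.wav")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, entry in enumerate(info["words"], start=1):           # word indices are 1-based
        segment = audio[int(entry["start"] * sr):int(entry["end"] * sr)]
        sf.write(out / f"{info['file_id']}_{index}_{entry['word']}.wav", segment, sr)

extract_words("data/CommonVoiceENJSON/train/10110.json",
              "data/CommonVoiceENraw/train",
              "data/CommonVoiceENWords/train")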

6.2.3 Mask Word Audio Script (scripts/mask_word_audio.py)

Applies hearing loss masks to individual word audio files, creating three versions: normal hearing (copy), low-frequency loss, and high-frequency loss. Output remains as WAV files (no mel spectrogram conversion).

Usage:

# Basic usage with defaults
python scripts/mask_word_audio.py

# Custom directories
python scripts/mask_word_audio.py --input-dir data/CommonVoiceENWords --output-base data/CommonVoiceENWords

# Adjust parallelism
python scripts/mask_word_audio.py --nworkers 32 --batch-size 200

# Process only specific conditions
python scripts/mask_word_audio.py --conditions hfloss lfloss

Key Arguments:

  • --input-dir: Directory containing input word audio files (default: data/CommonVoiceENWords)
  • --output-base: Base name for output directories (default: data/CommonVoiceENWords)
  • --nworkers: Number of parallel workers (default: 24)
  • --batch-size: Number of files to process per batch (default: 100)
  • --splits: Splits to process (default: train test validation)
  • --conditions: Hearing loss conditions to generate (default: normal lfloss hfloss)

Output: Masked Word Audio Files

Creates three output directories with the same structure as input:

data/CommonVoiceENWords_normal/{train,test,validation}/*.wav   # Copies of original files
data/CommonVoiceENWords_lfloss/{train,test,validation}/*.wav   # Low-frequency hearing loss
data/CommonVoiceENWords_hfloss/{train,test,validation}/*.wav   # High-frequency hearing loss

Hearing Loss Profiles:

  • normal: 10 dB threshold across all frequencies (essentially unchanged)
  • lfloss: 100 dB at low frequencies → 10 dB at high frequencies (attenuates low frequencies)
  • hfloss: 10 dB at low frequencies → 100 dB at high frequencies (attenuates high frequencies)

7. SLURM Processing

7.1. Audio Masking

For processing the full CommonVoice dataset (1.7M+ samples), use the SLURM batch script:

# Make sure you've configured your working environment first (see Requirements > Working Environment)
# Basic SLURM submission with default settings
sbatch scripts/mask_audio.sbatch

The script will automatically:

  1. Load your personal working directory from .env.local
  2. Navigate to your project directory
  3. Activate the virtual environment
  4. Run the audio masking processing

Advanced SLURM Options:

# Override specific parameters via environment variables
BATCH_SIZE=256 NUM_WORKERS=32 sbatch scripts/mask_audio.sbatch

# Or pass arguments directly to the underlying script
sbatch scripts/mask_audio.sbatch --batch-size 256 --num-workers 32

7.2. Log Mel-Frequency Spectrograms

For processing large hearing loss datasets to Log-Mel spectrograms using SLURM:

# Process all three datasets (normal, lfloss, hfloss) automatically
python scripts/DataSet2LogMelBatch.py

# This will submit 3 SLURM jobs:
# - Job 1: CommonVoiceEN_normal → CommonVoiceEN_normal_logmel
# - Job 2: CommonVoiceEN_lfloss → CommonVoiceEN_lfloss_logmel  
# - Job 3: CommonVoiceEN_hfloss → CommonVoiceEN_hfloss_logmel

Individual SLURM Job Submission:

# Process a single dataset via SLURM
sbatch scripts/DataSet2LogMel.sbatch \
    --input_dataset data/CommonVoiceEN_normal/dataset \
    --output_dataset data/CommonVoiceEN_normal_logmel

The SLURM script will automatically:

  1. Load your personal working directory from .env.local
  2. Navigate to your project directory
  3. Activate the virtual environment
  4. Run the Log-Mel preprocessing with optimized memory settings

SLURM Resource Configuration:

  • Default: 48 CPU cores, 400GB memory, 32-hour time limit
  • Typical runtime: Jobs for ~60K samples usually complete in 8-12 minutes
  • Output: Each dataset produces ~87GB of Log-Mel spectrograms ready for Whisper training

7.3. Whisper Training

For training Whisper models on hearing loss datasets using SLURM, use train_whisper.sbatch. The script uses environment variables to specify input and output directories.

Basic SLURM Submission:

# Train on normal hearing dataset
export INPUT_FOLDER="data/CommonVoiceEN_normal_logmel"
export OUTPUT_FOLDER="results/whisper_finetuned_normal"
sbatch scripts/train_whisper.sbatch

# Train on high-frequency hearing loss dataset
export INPUT_FOLDER="data/CommonVoiceEN_hfloss_logmel"
export OUTPUT_FOLDER="results/whisper_finetuned_hfloss"
sbatch scripts/train_whisper.sbatch

# Train on low-frequency hearing loss dataset
export INPUT_FOLDER="data/CommonVoiceEN_lfloss_logmel"
export OUTPUT_FOLDER="results/whisper_finetuned_lfloss"
sbatch scripts/train_whisper.sbatch

SLURM Resource Configuration:

  • GPUs: 4× NVIDIA H100 80GB HBM3
  • CPUs: 24 cores per task
  • Memory: 200GB
  • Time Limit: 24 hours
  • Partition: aisc-batch

Training Configuration:

  • Batch size: 8 per device (32 total with 4 GPUs)
  • Gradient accumulation: 2 steps
  • Learning rate: 1e-5 with warmup
  • Evaluation: Every 1000 steps
  • Checkpoints: Saved every 1000 steps

The script automatically:

  1. Activates the virtual environment
  2. Runs training with train_whisper.py
  3. Logs training metrics to TensorBoard
  4. Saves model checkpoints to the output directory

Monitoring Training:

# Check job status
squeue -u $USER

# View training logs
tail -f logs/train_whisper_[JOBID].err

# Launch TensorBoard (after training starts)
python scripts/tensorboard_visualise_runs.py --logdir results/whisper_finetuned_normal

7.4. Whisper Prediction

For batch prediction and analysis on large datasets using SLURM, use analyse_with_whisper.sbatch. The script forwards all command-line arguments to analyse_with_whisper.py.

Single Model Analysis:

sbatch scripts/analyse_with_whisper.sbatch \
    --input-folder data/MALD \
    --output-path results/whisper_predictions_normal/mald_analysis.json \
    --model-path results/whisper_finetuned_normal/checkpoint-2000 \
    --num-workers 20 \
    --top-k 1000

Sequential Multi-Model Analysis:

Use run_mald_analysis.sh to analyze with all three models sequentially using SLURM job dependencies:

./scripts/run_mald_analysis.sh

This script will:

  1. Submit Job 1: Normal model → results/whisper_predictions_normal/
  2. Submit Job 2: HF loss model (after Job 1) → results/whisper_predictions_hfloss/
  3. Submit Job 3: LF loss model (after Job 2) → results/whisper_predictions_lfloss/

SLURM Resource Configuration:

  • GPUs: 1× NVIDIA H100 80GB HBM3
  • CPUs: 24 cores (for 20 dataloader workers)
  • Memory: 200GB
  • Time Limit: 24 hours
  • Partition: aisc-batch

Processing Performance:

  • Rate: ~1.3 samples/second (~78 samples/minute)
  • Dataset: 26,793 audio files (MALD)
  • Estimated Time: ~6 hours per model

Monitoring Progress:

# Check job status and dependencies
squeue -u $USER

# Monitor live progress
tail -f logs/whisper_normal_[JOBID].err | grep "Processing batches"

# Check number of processed samples
ls results/whisper_predictions_normal/*.json 2>/dev/null | grep -v "analysis.json" | wc -l

Output Structure:

  • Main JSON: mald_analysis.json (contains model_path and all results)
  • Individual JSONs: One file per audio sample with detailed metrics
  • Logs: Stored in logs/whisper_[model]_[JOBID].{out,err}

7.5. Create Forced Alignment (forced_alignment.sbatch)

For running forced alignment on the full CommonVoice dataset using SLURM:

# Basic submission with defaults
sbatch scripts/forced_alignment.sbatch

# Process only specific splits
sbatch scripts/forced_alignment.sbatch --splits train

# Resume interrupted job (skip already processed files)
sbatch scripts/forced_alignment.sbatch --skip-existing

# Use multiple GPUs with larger batch size
sbatch --gres=gpu:4 scripts/forced_alignment.sbatch --num-gpus 4 --batch-size 32

SLURM Resource Configuration:

  • GPUs: 1× GPU (configurable)
  • CPUs: 24 cores
  • Memory: 128GB
  • Time Limit: 72 hours
  • Partition: aisc-batch

Default Input/Output:

  • Input: data/CommonVoiceENraw (audio + transcript files)
  • Output: data/CommonVoiceENJSON (JSON files with word boundaries)

Monitoring:

# Check job status
squeue -u $USER

# Monitor progress
tail -f logs/forced_alignment_[JOBID].err

7.6. Extract Word Audios (extract_word_audio.sbatch)

For extracting individual word audio segments from aligned data using SLURM:

# Basic submission with defaults
sbatch scripts/extract_word_audio.sbatch

# Process only specific splits
sbatch scripts/extract_word_audio.sbatch --splits train

# Adjust parallelism
sbatch scripts/extract_word_audio.sbatch --nworkers 32 --batch-size 200

# Chain with forced alignment job (run after alignment completes)
sbatch --dependency=afterok:<ALIGNMENT_JOB_ID> scripts/extract_word_audio.sbatch

SLURM Resource Configuration:

  • CPUs: 24 cores
  • Memory: 64GB
  • Time Limit: 24 hours
  • Partition: aisc-batch

Default Input/Output:

  • Input JSON: data/CommonVoiceENJSON (word boundary JSON files)
  • Input Audio: data/CommonVoiceENraw (source audio files)
  • Output: data/CommonVoiceENWords (individual word audio files)

Output Naming Convention:

{file_id}_{word_index}_{word}.wav

Example: 10110_2_really.wav (word index 2, word "really" from file 10110)

Monitoring:

# Check job status
squeue -u $USER

# Monitor progress
tail -f logs/extract_word_audio_[JOBID].err

# Count extracted words
ls data/CommonVoiceENWords/train/*.wav 2>/dev/null | wc -l

7.7. Mask Word Audios (mask_word_audio.sbatch)

For applying hearing loss masks to extracted word audio files using SLURM:

# Basic submission with defaults
sbatch scripts/mask_word_audio.sbatch

# Process only specific splits
sbatch scripts/mask_word_audio.sbatch --splits train

# Process only specific conditions
sbatch scripts/mask_word_audio.sbatch --conditions hfloss lfloss

# Adjust parallelism
sbatch scripts/mask_word_audio.sbatch --nworkers 32 --batch-size 200

# Chain with extract word audio job (run after extraction completes)
sbatch --dependency=afterok:<EXTRACT_JOB_ID> scripts/mask_word_audio.sbatch

SLURM Resource Configuration:

  • CPUs: 48 cores
  • Memory: 128GB
  • Time Limit: 48 hours
  • Partition: aisc-batch

Default Input/Output:

  • Input: data/CommonVoiceENWords/{train,test,validation}/*.wav
  • Output:
    • data/CommonVoiceENWords_normal/ (copies of original files)
    • data/CommonVoiceENWords_lfloss/ (low-frequency hearing loss)
    • data/CommonVoiceENWords_hfloss/ (high-frequency hearing loss)

Monitoring:

# Check job status
squeue -u $USER

# Monitor progress
tail -f logs/mask_word_audio_[JOBID].err

# Count masked files per condition
ls data/CommonVoiceENWords_hfloss/train/*.wav 2>/dev/null | wc -l
