- 1. Requirements
  - 1.1. Virtual Environment
  - 1.2. Working Environment
- 2. Scripts
- 3. Data - CommonVoice English Dataset
  - 3.1. Setup Authentication
  - 3.2. Download the Dataset
  - 3.3. Download Options
- 4. Preprocessing
  - 4.1. Audio Masking for Hearing Loss Simulation
    - 4.1.1. Overview
    - 4.1.2. Usage
    - 4.1.3. Advanced Options
    - 4.1.4. Processing Parameters
    - 4.1.5. Technical Details
  - 4.2. Log Mel-Frequency Spectrograms
    - 4.2.1. Overview
    - 4.2.2. Individual Processing Script
    - 4.2.3. Batch Processing Script
- 5. Whisper
  - 5.1. Training of Whisper
- 6. Forced Alignment
  - 6.1. Installation Montreal Forced Aligner (MFA)
  - 6.2.1. Forced Alignment Script
  - 6.2.2. Extract Word Audio Script
  - 6.2.3. Mask Word Audio Script
- 7. SLURM Processing
  - 7.1. Audio Masking
  - 7.2. Log Mel-Frequency Spectrograms
  - 7.3. Whisper Training
  - 7.4. Whisper Prediction
  - 7.5. Forced Alignment
  - 7.6. Extract Word Audios
  - 7.7. Mask Word Audios
To ensure that the same requirements are met across different operating systems and machines, it is recommended to create a virtual environment. This can be set up with UV.
which uv || echo "UV not found"  # checks the UV installation

If UV is not installed, it can be installed as follows.

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env  # add UV to PATH for the current session, or restart the terminal

Afterwards, the virtual environment can be created and activated.

uv venv .venv              # creates a virtual environment with the name ".venv"
source .venv/bin/activate  # activates the virtual environment

Then the required packages are installed. UV ensures that the exact versions are installed.

uv sync --active  # installs exact versions

Note: This project is designed to run on the AISC cluster at HPI. The SLURM batch scripts are pre-configured with AISC-specific settings (--account=aisc, --partition=aisc, --qos=aisc). If you're using a different HPC cluster, you'll need to modify the SLURM directives in the .sbatch files accordingly.
Getting Access to the AISC Cluster:
- General HPI cluster documentation: https://docs.sc.hpi.de/
- AISC-specific documentation: https://aisc.hpi.de/doc/doku.php?id=start
Before running any SLURM scripts, you need to configure your personal working directory:
1. Copy the environment template:

   cp .env.local.template .env.local

2. Edit .env.local to set your working directory:

   # Open in your preferred editor
   nano .env.local  # or vim .env.local

3. Update the PROJECT_ROOT variable to point to your personal working directory:

   # Example for user "john.doe":
   PROJECT_ROOT=/sc/home/john.doe/pilotproject-hearing-loss
   # Example for a different mount point:
   PROJECT_ROOT=/home/username/projects/pilotproject-hearing-loss

4. Verify your configuration:

   source .env.local
   echo "Project root: $PROJECT_ROOT"
Note: The .env.local file is ignored by git, so your personal configuration won't be committed to the repository.
All scripts are located in the scripts folder.
⚠️ CRITICAL: Dataset Access Has Changed (Updated 18.11.2025)
Mozilla has removed the CommonVoice datasets from HuggingFace. The dataset is now exclusively available through the Mozilla Data Collective platform.

Status Update:
- The original CommonVoice 16.1 dataset (66 GB, 1.7M clips) is no longer accessible through HuggingFace
- Mozilla Data Collective only provides access to the latest version (CommonVoice 23.0: 86.83 GB, 2.54M clips)
- Older dataset versions can no longer be downloaded easily; you would need to contact Mozilla Data Collective via email and explain which version you need and why. See: Mozilla's community discussion on accessing older versions
- Migration Required: This repository must be updated to work with CommonVoice 23.0
Action Required:
- The instructions below reference the old HuggingFace location and no longer work
- The download_commonvoice.py script needs to be rewritten to use the Mozilla Data Collective API
- Dataset format conversion from TSV (Mozilla) to Arrow (HuggingFace) format is required
- See: Mozilla's FAQ on dataset access
This migration is planned and will be implemented in a separate pull request.
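As a rough illustration of what that conversion could look like, the sketch below loads the Mozilla TSV splits with the HuggingFace datasets library and saves them in Arrow format. The file layout, column names ("path", "sentence"), and output location are assumptions based on the classic CommonVoice TSV releases, not the final implementation.

```python
from datasets import load_dataset, Audio

# Hypothetical paths to a Mozilla Data Collective TSV release (adjust to the
# actual extraction layout of CommonVoice 23.0).
splits = {
    "train": "cv-corpus/en/train.tsv",
    "validation": "cv-corpus/en/dev.tsv",
    "test": "cv-corpus/en/test.tsv",
}

ds = load_dataset("csv", data_files=splits, delimiter="\t")
# Resolve the relative clip names to full paths and decode them as 16 kHz audio.
ds = ds.map(lambda row: {"path": f"cv-corpus/en/clips/{row['path']}"})
ds = ds.cast_column("path", Audio(sampling_rate=16000))
ds.save_to_disk("data/CommonVoiceEN/dataset")  # Arrow format, as expected by the later steps
```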
Create a folder where the data will be stored. Because the dataset is relatively large, it is not included in this GitHub repository and has to be downloaded with the scripts below.

mkdir data

This project uses the CommonVoice English dataset from HuggingFace. Note: This dataset requires HuggingFace authentication.
- Create a HuggingFace account at huggingface.co
- Get an access token: Go to Settings → Access Tokens → Create new token (Read access is sufficient)
- Accept the dataset terms: Visit CommonVoice 16.1 and accept the terms
# Method 1: Using environment variable (recommended)
export HF_TOKEN="your_token_here"
python scripts/download_commonvoice.py
# Method 2: Using command line argument
python scripts/download_commonvoice.py --token "your_token_here"

This will download the dataset to data/CommonVoiceEN/ by default.
The script supports several options:
# Download with full caching (recommended for development)
python scripts/download_commonvoice.py
# Download in streaming mode (for large-scale processing)
python scripts/download_commonvoice.py --streaming
# Download only specific splits
python scripts/download_commonvoice.py --splits train validation
# Custom output directory
python scripts/download_commonvoice.py --output-dir /path/to/custom/location
# Use different CommonVoice version (if needed)
python scripts/download_commonvoice.py --version mozilla-foundation/common_voice_17_0

This project includes functionality to simulate different types of hearing loss by applying frequency-specific attenuation masks to the CommonVoice dataset.
The scripts/mask_audio.py script processes CommonVoice datasets to create three variants:
- Normal hearing baseline (*_normal): 10 dB threshold across all frequencies
- Low-frequency hearing loss (*_lfloss): high attenuation at low frequencies (125 Hz: 100 dB → 8000 Hz: 10 dB)
- High-frequency hearing loss (*_hfloss): high attenuation at high frequencies (125 Hz: 10 dB → 8000 Hz: 100 dB)
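The sketch below shows one way these profiles can be turned into per-frequency attenuation factors: anchor points in dB are interpolated across the FFT bin frequencies and converted to linear gains. The anchor values come from the list above; the log-frequency interpolation and the dB-to-gain mapping are assumptions, not necessarily what scripts/mask_audio.py does internally.

```python
import numpy as np

# Anchor points (frequency in Hz -> threshold in dB) taken from the profiles above.
PROFILES = {
    "normal": {125.0: 10.0, 8000.0: 10.0},
    "lfloss": {125.0: 100.0, 8000.0: 10.0},   # attenuates low frequencies
    "hfloss": {125.0: 10.0, 8000.0: 100.0},   # attenuates high frequencies
}

def attenuation_db(freqs_hz: np.ndarray, profile: str) -> np.ndarray:
    """Interpolate the profile (on a log-frequency axis) onto arbitrary frequencies."""
    anchors = PROFILES[profile]
    f = np.array(sorted(anchors))
    db = np.array([anchors[k] for k in sorted(anchors)])
    clipped = np.clip(freqs_hz, f[0], f[-1])  # hold the edge values outside 125-8000 Hz
    return np.interp(np.log10(clipped), np.log10(f), db)

def db_to_gain(threshold_db: np.ndarray) -> np.ndarray:
    """Map an elevated hearing threshold to a linear attenuation factor."""
    return 10.0 ** (-threshold_db / 20.0)

# Example: gains for the frequency bins of a 2048-point FFT at 16 kHz.
freqs = np.fft.rfftfreq(2048, d=1.0 / 16000)
gains_hf = db_to_gain(attenuation_db(freqs, "hfloss"))
```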
# Basic usage - process the downloaded CommonVoice dataset
python scripts/mask_audio.py
# This creates three new datasets:
# - data/CommonVoiceEN_normal/
# - data/CommonVoiceEN_lfloss/
# - data/CommonVoiceEN_hfloss/

# Specify input and output directories
python scripts/mask_audio.py \
--input-dir data/CommonVoiceEN \
--output-base data/MyProcessedDataset
# Configure processing parameters
python scripts/mask_audio.py \
--sample-rate 16000 \
--batch-size 64 \
--num-workers 8
# Enable debug logging
python scripts/mask_audio.py --log-level DEBUG

- --input-dir: Path to the input CommonVoice dataset (default: data/CommonVoiceEN)
- --output-base: Base name for output directories (default: same as input directory)
- --sample-rate: Target sample rate in Hz (default: 16000, required for Whisper models)
- --batch-size: Number of audio samples to process per batch (default: 32)
- --num-workers: Number of CPU cores for parallel processing (default: 4)
- --log-level: Logging verbosity: DEBUG, INFO, WARNING, ERROR (default: INFO)
Audio Processing Pipeline:
- Resampling: Audio is resampled to the target sample rate (16 kHz by default)
- STFT: Short-Time Fourier Transform with 2048-sample window and 512-sample hop
- Frequency Masking: Interpolated attenuation based on hearing loss profiles
- Reconstruction: Inverse STFT to reconstruct audio signals
- Normalization: Audio amplitude normalization to prevent clipping
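A minimal sketch of this pipeline is shown below, assuming a mono waveform tensor and a vector of per-bin linear gains (for example produced with the profile sketch above via np.fft.rfftfreq(2048, 1/16000)). It is illustrative only; the actual scripts/mask_audio.py implementation may differ in windowing and normalization details.

```python
import torch

def apply_hearing_loss(waveform: torch.Tensor, gains_per_bin: torch.Tensor,
                       n_fft: int = 2048, hop: int = 512) -> torch.Tensor:
    """STFT -> per-frequency-bin attenuation -> inverse STFT -> peak normalization.
    `waveform` is a 1-D tensor at the target sample rate; `gains_per_bin` has
    n_fft // 2 + 1 linear gain values (one per STFT frequency bin)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    spec = spec * gains_per_bin.unsqueeze(-1)         # attenuate each frequency bin
    out = torch.istft(spec, n_fft=n_fft, hop_length=hop,
                      window=window, length=waveform.shape[-1])
    peak = out.abs().max()
    return out / peak if peak > 1.0 else out          # avoid clipping after reconstruction
```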
Memory Management:
- Uses batch processing to handle large datasets efficiently
- Supports multiprocessing for faster execution
- Automatically manages memory cleanup between batches
Output Structure: Each output dataset preserves the exact structure of the input dataset, including:
- All data splits (train, validation, test, etc.)
- Complete metadata (transcriptions, speaker information, etc.)
- HuggingFace dataset format compatibility
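Because the outputs stay in HuggingFace dataset format, a quick structural check can be done with datasets.load_from_disk. The exact on-disk path (here the dataset subfolder used elsewhere in this README) may differ depending on your configuration.

```python
from datasets import load_from_disk

# Sanity check: the masked dataset should expose the same splits and metadata
# columns as the original CommonVoice dataset.
ds = load_from_disk("data/CommonVoiceEN_hfloss/dataset")
print(ds)                    # splits and number of rows per split
print(ds["train"].features)  # audio column plus the original metadata fields
```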
This project includes functionality to convert the hearing loss datasets into Log Mel-Frequency Spectrograms suitable for Whisper and LDL-AURIS model training. The preprocessing pipeline converts audio into 128-dimensional Log-Mel spectrograms required by Whisper Large V3.
The Log-Mel preprocessing creates training-ready datasets from the hearing loss variants:
- Normal hearing spectrograms from *_normal datasets
- Low-frequency hearing loss spectrograms from *_lfloss datasets
- High-frequency hearing loss spectrograms from *_hfloss datasets
Each dataset is converted to Log-Mel spectrograms with proper tokenization for Whisper training. For large-scale processing on computing clusters, see SLURM Processing → Log Mel-Frequency Spectrograms.
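Conceptually, the conversion maps each example's raw 16 kHz audio to 128-bin Log-Mel features and its transcript to Whisper token ids. The sketch below uses the standard HuggingFace Whisper feature extractor and tokenizer; the column names ("audio", "sentence") follow the CommonVoice dataset layout and are assumptions about what DataSet2LogMel.py operates on.

```python
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3",
                                             language="English", task="transcribe")

def prepare(example):
    audio = example["audio"]
    # 128 x 3000 log-Mel matrix for Whisper Large V3 (30 s of padded audio)
    example["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # transcript -> token ids used as training labels
    example["labels"] = tokenizer(example["sentence"]).input_ids
    return example

# dataset = dataset.map(prepare, remove_columns=dataset["train"].column_names)
```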
Use scripts/DataSet2LogMel.py to convert a single hearing loss dataset:
# Convert normal hearing dataset
python scripts/DataSet2LogMel.py \
--input_dataset data/CommonVoiceEN_normal/dataset \
--output_dataset data/CommonVoiceEN_normal_logmel
# Convert low-frequency hearing loss dataset
python scripts/DataSet2LogMel.py \
--input_dataset data/CommonVoiceEN_lfloss/dataset \
--output_dataset data/CommonVoiceEN_lfloss_logmel

- --input_dataset (required): Path to the input CommonVoice dataset folder
- --output_dataset (required): Path where the preprocessed dataset will be saved
- --model_size: Whisper model size for feature extraction (default: "large-v3")
- --num_cpus: Number of CPU cores to use (default: all available)
- --batch_size: Processing batch size (default: 1000)
- --writer_batch_size: Writer batch size for disk saving (default: 100)
- --max_memory_per_worker: Maximum memory per worker in GB (default: 4.0)
- --language: Language for tokenizer (default: "en")
- --task: Task type for tokenizer: "transcribe" or "translate" (default: "transcribe")
- --shuffle_seed: Random seed for shuffling (default: 42)
- --max_samples: Maximum samples per split for testing (default: all)
Use scripts/DataSet2LogMelBatch.py to automatically process all three hearing loss variants:
# Process all datasets with default settings
python scripts/DataSet2LogMelBatch.py
# Process with custom resource allocation
python scripts/DataSet2LogMelBatch.py \
--cpus-per-task 64 \
--memory 500G \
--batch-size 2000

SLURM Resource Parameters:
- --cpus-per-task: CPU cores per SLURM task (default: 48)
- --memory: Memory allocation per job (default: "400G")
- --time: Time limit per job (default: "32:00:00")

Processing Parameters:
- --model-size: Whisper model size (default: "large-v3")
- --batch-size: Processing batch size (default: 1000)
- --max-samples: Maximum samples per split for testing (default: all)
- --skip-existing: Skip datasets with existing output directories
The train_whisper.py script fine-tunes OpenAI's Whisper model on hearing loss datasets. It supports training on normal audio as well as high-frequency and low-frequency hearing loss simulations.
Basic Usage:
python scripts/train_whisper.py \
--model_name_or_path openai/whisper-large-v3 \
--train_dataset_path data/CommonVoiceEN_normal_logmel \
--output_dir results/whisper_finetuned_normal \
--num_train_epochs 3

Key Parameters:
- --model_name_or_path: Base Whisper model (default: "openai/whisper-large-v3")
- --train_dataset_path: Path to the Log-Mel spectrogram dataset
- --output_dir: Directory to save model checkpoints
- --num_train_epochs: Number of training epochs (default: 3)
- --per_device_train_batch_size: Batch size per GPU (default: 8)
- --gradient_accumulation_steps: Gradient accumulation steps (default: 2)
- --learning_rate: Learning rate (default: 1e-5)
- --warmup_steps: Number of warmup steps (default: 500)
- --save_steps: Save checkpoint every N steps (default: 1000)
- --eval_steps: Evaluate every N steps (default: 1000)
- --logging_steps: Log metrics every N steps (default: 25)
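For orientation, the following is a condensed sketch of the kind of HuggingFace Seq2SeqTrainer setup these parameters map onto. It is not the train_whisper.py implementation; the collator, dataset split names, and omitted evaluation settings are simplifying assumptions.

```python
from dataclasses import dataclass
import torch
from datasets import load_from_disk
from transformers import (WhisperForConditionalGeneration, WhisperProcessor,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

@dataclass
class SpeechCollator:
    """Pads log-Mel features and label ids; padded label positions are set to -100."""
    processor: WhisperProcessor
    def __call__(self, features):
        batch = self.processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features], return_tensors="pt")
        labels = self.processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt")
        batch["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
        return batch

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
dataset = load_from_disk("data/CommonVoiceEN_normal_logmel")

args = Seq2SeqTrainingArguments(
    output_dir="results/whisper_finetuned_normal",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    num_train_epochs=3,
    save_steps=1000,
    logging_steps=25,
    fp16=torch.cuda.is_available(),
)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=dataset["train"],
                         data_collator=SpeechCollator(processor))
trainer.train()
```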
Training on Different Hearing Loss Variants:
# Normal hearing
python scripts/train_whisper.py \
--train_dataset_path data/CommonVoiceEN_normal_logmel \
--output_dir results/whisper_finetuned_normal
# High-frequency hearing loss
python scripts/train_whisper.py \
--train_dataset_path data/CommonVoiceEN_hfloss_logmel \
--output_dir results/whisper_finetuned_hfloss
# Low-frequency hearing loss
python scripts/train_whisper.py \
--train_dataset_path data/CommonVoiceEN_lfloss_logmel \
--output_dir results/whisper_finetuned_lfloss

Monitoring Training:
# View training progress with TensorBoard
python scripts/tensorboard_visualise_runs.py --logdir results/whisper_finetuned_normal

For large-scale training on computing clusters, see SLURM Processing → Whisper Training.
The analyse_with_whisper.py script performs comprehensive analysis of Whisper model predictions, extracting detailed metrics including token probabilities, embeddings, entropy, and semantic similarity measures.
Basic Usage:
python scripts/analyse_with_whisper.py \
--input-folder data/MALD \
--output-path results/whisper_predictions_normal/analysis.json \
--model-path results/whisper_finetuned_normal/checkpoint-2000 \
--num-workers 4 \
--top-k 1000

Key Parameters:
- --input-folder: Directory containing audio files (ground truth = filename)
- --output-path: Path for the main JSON output file
- --model-path: Path to the Whisper model checkpoint
- --num-workers: Number of dataloader workers (default: 20)
- --num-threads: CPU threads for processing (default: 8)
- --num-gpus: Number of GPUs to use (default: 1, 0 for CPU only)
- --batch-size: Batch size for processing (default: 1)
- --top-k: Number of top predictions to save (default: 1000)
Output Structure:
The script generates:
- Main JSON file (analysis.json): Contains all results with model metadata
- Individual JSON files: One per audio file with detailed per-token metrics
Metrics Extracted:
Per-Token Metrics:
- Predicted token probability and rank
- Entropy and semantic density
- Top-k alternative predictions with probabilities and ranks
- Hidden state embeddings (1280-dimensional)
- Ground truth comparison (cosine similarity, correlation)
Normalized Metrics:
- Case-insensitive token grouping
- Aggregated probabilities across token variants
- Normalized embeddings (probability-weighted, simple average, most probable)
Pooled Metrics (across all predicted tokens):
- Average rank, probability, and entropy
- Pooled hidden states
- Pooled ground truth comparison metrics
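To make the per-token quantities concrete, the sketch below shows how predicted-token probability, rank, entropy, and top-k alternatives can be read off Whisper's generation scores. It is an illustration of the metrics, not the analyse_with_whisper.py code; the audio path is a placeholder and the checkpoint path is only an example.

```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained(
    "results/whisper_finetuned_normal/checkpoint-2000")

waveform, sr = torchaudio.load("data/MALD/example_word.wav")  # placeholder file name
audio = torchaudio.functional.resample(waveform.mean(0), sr, 16000).numpy()

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
out = model.generate(inputs.input_features,
                     return_dict_in_generate=True, output_scores=True)

# out.scores holds one logit vector per generated token.
generated = out.sequences[0, -len(out.scores):]
for token_id, logits in zip(generated, out.scores):
    probs = logits[0].softmax(-1)
    p = probs[token_id]
    rank = int((probs > p).sum()) + 1
    entropy = float(-(probs * probs.clamp_min(1e-12).log()).sum())
    top5 = torch.topk(probs, k=5)
    print(processor.tokenizer.decode([token_id]), float(p), rank, entropy,
          processor.tokenizer.convert_ids_to_tokens(top5.indices.tolist()))
```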
Example Analysis:
# Analyze single audio file
python scripts/analyse_with_whisper.py \
--input-folder test/transcendentalists \
--output-path results/single_analysis.json \
--model-path results/whisper_finetuned_normal/checkpoint-2000 \
--num-workers 0
# Analyze MALD dataset with multiple models
python scripts/analyse_with_whisper.py \
--input-folder data/MALD \
--output-path results/whisper_predictions_normal/mald_analysis.json \
--model-path results/whisper_finetuned_normal/checkpoint-2000

For detailed documentation of all metrics, see README_whisper_metrics.md. For batch processing on computing clusters, see SLURM Processing → Whisper Prediction.
Montreal Forced Aligner (MFA) is used for forced alignment to generate word and phoneme boundaries from audio files and their transcriptions. On the HPI cluster, MFA is installed using enroot containers since direct pip/conda installation is not available.
To set up MFA on your account, you need to import the Docker image and create a container. This only needs to be done once per user account.
Step 1: Import the Docker image
Run on a compute node with sufficient memory (64GB recommended):
srun -A aisc -p aisc-interactive --mem=64G --cpus-per-task=8 --time=01:00:00 --constraint=ARCH:X86 \
enroot import docker://mmcauliffe/montreal-forced-aligner:latest

This downloads and converts the Docker image to a squashfs file (mmcauliffe+montreal-forced-aligner+latest.sqsh, ~4.5 GB) in your current directory.
Step 2: Create the container
srun -A aisc -p aisc-interactive --mem=32G --cpus-per-task=4 --time=00:30:00 --constraint=ARCH:X86 \
enroot create --name mfa mmcauliffe+montreal-forced-aligner+latest.sqsh

This unpacks the squashfs file into a container stored in ~/.local/share/enroot/mfa/.
Step 3: Download acoustic models and dictionary
Use --rw to make changes persistent:
srun -A aisc -p aisc-interactive --mem=32G --cpus-per-task=4 --time=00:30:00 --constraint=ARCH:X86 \
enroot start --rw --mount $PWD:/workspace mfa \
bash -c "mfa model download acoustic english_mfa && mfa model download dictionary english_mfa"# Check MFA version
srun -A aisc -p aisc-interactive --mem=8G --cpus-per-task=2 --time=00:10:00 --constraint=ARCH:X86 \
enroot start --mount $PWD:/workspace mfa mfa version
# List available acoustic models
srun -A aisc -p aisc-interactive --mem=8G --cpus-per-task=2 --time=00:10:00 --constraint=ARCH:X86 \
enroot start --mount $PWD:/workspace mfa mfa model list acoustic
# Expected output: ['english_mfa']

For generating word-level alignments and extracting individual word audio segments, we provide two Python scripts that use the torchaudio MMS_FA model instead of MFA's TextGrid output.
Both scripts require audio files and corresponding transcription files in the same directory:
- Audio files: .wav format
- Transcription files: .txt format (same filename, different extension)
- Example: 478.wav and 478.txt (containing just the transcription text)
The transcription file should contain only the transcript text, e.g.:
Where was his horse?
Performs GPU-accelerated forced alignment using torchaudio's MMS_FA model to generate word-level timing information stored as JSON files.
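For context, the sketch below shows the core of a torchaudio MMS_FA alignment pass: emit character-level CTC probabilities, force-align them against the transcript, merge frames into token spans, and group the spans back into word boundaries in seconds. The example file path is illustrative, and forced_alignment.py will differ in batching, normalization, and scoring details.

```python
import torch
import torchaudio
import torchaudio.functional as F

bundle = torchaudio.pipelines.MMS_FA
model = bundle.get_model()
dictionary = bundle.get_dict()  # character -> token id

waveform, sr = torchaudio.load("data/CommonVoiceENraw/train/478.wav")  # example path
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

transcript = "where was his horse".split()  # normalized transcript
tokens = [dictionary[c] for word in transcript for c in word]

with torch.inference_mode():
    emission, _ = model(waveform)
log_probs = torch.log_softmax(emission, dim=-1)

alignment, scores = F.forced_align(log_probs,
                                   torch.tensor([tokens], dtype=torch.int32), blank=0)
spans = F.merge_tokens(alignment[0], scores[0].exp())

# Group character spans into words and convert frame indices to seconds.
seconds_per_frame = waveform.size(1) / emission.size(1) / bundle.sample_rate
i = 0
for word in transcript:
    word_spans, i = spans[i:i + len(word)], i + len(word)
    print(word,
          round(word_spans[0].start * seconds_per_frame, 3),
          round(word_spans[-1].end * seconds_per_frame, 3))
```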
Usage:
# Basic usage with defaults
python scripts/forced_alignment.py
# Custom directories
python scripts/forced_alignment.py --input-dir data/CommonVoiceENraw --output-dir data/CommonVoiceENJSON
# Process specific splits with multi-GPU support
python scripts/forced_alignment.py --splits train test --batch-size 32 --num-gpus 4

Key Arguments:
- --input-dir: Input directory with wav/txt pairs (default: data/CommonVoiceENraw)
- --output-dir: Output directory for JSON files (default: data/CommonVoiceENJSON)
- --splits: Splits to process (default: train test validation)
- --num-gpus: Number of GPUs to use (default: 1)
- --batch-size: Number of files per batch (default: 16)
- --skip-existing: Skip files that already have JSON output
Output: JSON Files
The script generates JSON files with word boundaries for each audio file:
Example output (10110.json):
{
"file_id": "10110",
"original_transcript": "I really liked the film we saw last week.",
"normalized_transcript": "i really liked the film we saw last week",
"duration": 5.88,
"words": [
{"word": "i", "start": 1.1238, "end": 1.1439, "score": 0.9977},
{"word": "really", "start": 1.2041, "end": 1.4449, "score": 0.9496},
{"word": "liked", "start": 1.5051, "end": 1.7459, "score": 0.9624},
...
]
}

Extracts individual word audio segments from full audio files using the word boundary JSON files generated by forced_alignment.py.
Usage:
# Basic usage with defaults
python scripts/extract_word_audio.py
# Custom directories
python scripts/extract_word_audio.py --input-json data/CommonVoiceENJSON --input-audio data/CommonVoiceENraw --output-audio data/CommonVoiceENWords
# Adjust parallelism
python scripts/extract_word_audio.py --nworkers 32 --batch-size 200

Key Arguments:
- --input-json: Directory containing word alignment JSON files (default: data/CommonVoiceENJSON)
- --input-audio: Directory containing source audio files (default: data/CommonVoiceENraw)
- --output-audio: Directory to write extracted word audio files (default: data/CommonVoiceENWords)
- --nworkers: Number of parallel workers (default: 24)
- --batch-size: Number of files to process per batch (default: 100)
- --splits: Splits to process (default: train test validation)
Output: Individual Word Audio Files
The script creates one WAV file per word with the naming convention:
{file_id}_{word_index}_{word}.wav
Example output structure:
data/CommonVoiceENWords/
├── train/
│ ├── 10110_1_i.wav
│ ├── 10110_2_really.wav
│ ├── 10110_3_liked.wav
│ └── ...
├── test/
│ └── ...
└── validation/
└── ...
Word indices are 1-based, and apostrophes are preserved in filenames (e.g., 12345_5_don't.wav).
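The per-file extraction step boils down to slicing the waveform at the JSON word boundaries and writing each slice under the naming convention above. The sketch below illustrates this for a single file (the split subdirectories in the paths are assumptions); the actual script additionally parallelizes over batches and splits.

```python
import json
from pathlib import Path
import torchaudio

def extract_words(json_path: str, audio_path: str, out_dir: str) -> None:
    """Cut every word out of one recording using its alignment JSON."""
    meta = json.loads(Path(json_path).read_text())
    waveform, sr = torchaudio.load(audio_path)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for index, w in enumerate(meta["words"], start=1):   # 1-based word index
        start, end = int(w["start"] * sr), int(w["end"] * sr)
        torchaudio.save(f"{out_dir}/{meta['file_id']}_{index}_{w['word']}.wav",
                        waveform[:, start:end], sr)

# Example (hypothetical split layout):
# extract_words("data/CommonVoiceENJSON/train/10110.json",
#               "data/CommonVoiceENraw/train/10110.wav",
#               "data/CommonVoiceENWords/train")
```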
Applies hearing loss masks to individual word audio files, creating three versions: normal hearing (copy), low-frequency loss, and high-frequency loss. Output remains as WAV files (no mel spectrogram conversion).
Usage:
# Basic usage with defaults
python scripts/mask_word_audio.py
# Custom directories
python scripts/mask_word_audio.py --input-dir data/CommonVoiceENWords --output-base data/CommonVoiceENWords
# Adjust parallelism
python scripts/mask_word_audio.py --nworkers 32 --batch-size 200
# Process only specific conditions
python scripts/mask_word_audio.py --conditions hfloss lfloss

Key Arguments:
- --input-dir: Directory containing input word audio files (default: data/CommonVoiceENWords)
- --output-base: Base name for output directories (default: data/CommonVoiceENWords)
- --nworkers: Number of parallel workers (default: 24)
- --batch-size: Number of files to process per batch (default: 100)
- --splits: Splits to process (default: train test validation)
- --conditions: Hearing loss conditions to generate (default: normal lfloss hfloss)
Output: Masked Word Audio Files
Creates three output directories with the same structure as input:
data/CommonVoiceENWords_normal/{train,test,validation}/*.wav # Copies of original files
data/CommonVoiceENWords_lfloss/{train,test,validation}/*.wav # Low-frequency hearing loss
data/CommonVoiceENWords_hfloss/{train,test,validation}/*.wav # High-frequency hearing loss
Hearing Loss Profiles:
- normal: 10 dB threshold across all frequencies (essentially unchanged)
- lfloss: 100 dB at low frequencies → 10 dB at high frequencies (attenuates low frequencies)
- hfloss: 10 dB at low frequencies → 100 dB at high frequencies (attenuates high frequencies)
For processing the full CommonVoice dataset (1.7M+ samples), use the SLURM batch script:
# Make sure you've configured your working environment first (see Requirements > Working Environment)
# Basic SLURM submission with default settings
sbatch scripts/mask_audio.sbatch

The script will automatically:
- Load your personal working directory from .env.local
- Navigate to your project directory
- Activate the virtual environment
- Run the audio masking processing
Advanced SLURM Options:
# Override specific parameters via environment variables
BATCH_SIZE=256 NUM_WORKERS=32 sbatch scripts/mask_audio.sbatch
# Or pass arguments directly to the underlying script
sbatch scripts/mask_audio.sbatch --batch-size 256 --num-workers 32

For processing large hearing loss datasets to Log-Mel spectrograms using SLURM:
# Process all three datasets (normal, lfloss, hfloss) automatically
python scripts/DataSet2LogMelBatch.py
# This will submit 3 SLURM jobs:
# - Job 1: CommonVoiceEN_normal → CommonVoiceEN_normal_logmel
# - Job 2: CommonVoiceEN_lfloss → CommonVoiceEN_lfloss_logmel
# - Job 3: CommonVoiceEN_hfloss → CommonVoiceEN_hfloss_logmel

Individual SLURM Job Submission:
# Process a single dataset via SLURM
sbatch scripts/DataSet2LogMel.sbatch \
--input_dataset data/CommonVoiceEN_normal/dataset \
--output_dataset data/CommonVoiceEN_normal_logmel

The SLURM script will automatically:
- Load your personal working directory from .env.local
- Navigate to your project directory
- Activate the virtual environment
- Run the Log-Mel preprocessing with optimized memory settings
SLURM Resource Configuration:
- Default: 48 CPU cores, 400GB memory, 32-hour time limit
- Recommended: For ~60K samples, jobs typically complete in 8-12 minutes
- Output: Each dataset produces ~87GB of Log-Mel spectrograms ready for Whisper training
For training Whisper models on hearing loss datasets using SLURM, use train_whisper.sbatch. The script uses environment variables to specify input and output directories.
Basic SLURM Submission:
# Train on normal hearing dataset
export INPUT_FOLDER="data/CommonVoiceEN_normal_logmel"
export OUTPUT_FOLDER="results/whisper_finetuned_normal"
sbatch scripts/train_whisper.sbatch
# Train on high-frequency hearing loss dataset
export INPUT_FOLDER="data/CommonVoiceEN_hfloss_logmel"
export OUTPUT_FOLDER="results/whisper_finetuned_hfloss"
sbatch scripts/train_whisper.sbatch
# Train on low-frequency hearing loss dataset
export INPUT_FOLDER="data/CommonVoiceEN_lfloss_logmel"
export OUTPUT_FOLDER="results/whisper_finetuned_lfloss"
sbatch scripts/train_whisper.sbatch

SLURM Resource Configuration:
- GPUs: 4× NVIDIA H100 80GB HBM3
- CPUs: 24 cores per task
- Memory: 200GB
- Time Limit: 24 hours
- Partition: aisc-batch
Training Configuration:
- Batch size: 8 per device (32 total with 4 GPUs)
- Gradient accumulation: 2 steps
- Learning rate: 1e-5 with warmup
- Evaluation: Every 1000 steps
- Checkpoints: Saved every 1000 steps
The script automatically:
- Activates the virtual environment
- Runs training with train_whisper.py
- Logs training metrics to TensorBoard
- Saves model checkpoints to the output directory
Monitoring Training:
# Check job status
squeue -u $USER
# View training logs
tail -f logs/train_whisper_[JOBID].err
# Launch TensorBoard (after training starts)
python scripts/tensorboard_visualise_runs.py --logdir results/whisper_finetuned_normal

For batch prediction and analysis on large datasets using SLURM, use analyse_with_whisper.sbatch. The script forwards all command-line arguments to analyse_with_whisper.py.
Single Model Analysis:
sbatch scripts/analyse_with_whisper.sbatch \
--input-folder data/MALD \
--output-path results/whisper_predictions_normal/mald_analysis.json \
--model-path results/whisper_finetuned_normal/checkpoint-2000 \
--num-workers 20 \
--top-k 1000

Sequential Multi-Model Analysis:
Use run_mald_analysis.sh to analyze with all three models sequentially using SLURM job dependencies:
./scripts/run_mald_analysis.sh

This script will:
- Submit Job 1: Normal model → results/whisper_predictions_normal/
- Submit Job 2: HF loss model (after Job 1) → results/whisper_predictions_hfloss/
- Submit Job 3: LF loss model (after Job 2) → results/whisper_predictions_lfloss/
SLURM Resource Configuration:
- GPUs: 1× NVIDIA H100 80GB HBM3
- CPUs: 24 cores (for 20 dataloader workers)
- Memory: 200GB
- Time Limit: 24 hours
- Partition: aisc-batch
Processing Performance:
- Rate: ~1.3 samples/second (~78 samples/minute)
- Dataset: 26,793 audio files (MALD)
- Estimated Time: ~6 hours per model
Monitoring Progress:
# Check job status and dependencies
squeue -u $USER
# Monitor live progress
tail -f logs/whisper_normal_[JOBID].err | grep "Processing batches"
# Check number of processed samples
ls results/whisper_predictions_normal/*.json 2>/dev/null | grep -v "analysis.json" | wc -l

Output Structure:
- Main JSON: mald_analysis.json (contains model_path and all results)
- Individual JSONs: One file per audio sample with detailed metrics
- Logs: Stored in logs/whisper_[model]_[JOBID].{out,err}
For running forced alignment on the full CommonVoice dataset using SLURM:
# Basic submission with defaults
sbatch scripts/forced_alignment.sbatch
# Process only specific splits
sbatch scripts/forced_alignment.sbatch --splits train
# Resume interrupted job (skip already processed files)
sbatch scripts/forced_alignment.sbatch --skip-existing
# Use multiple GPUs with larger batch size
sbatch --gres=gpu:4 scripts/forced_alignment.sbatch --num-gpus 4 --batch-size 32

SLURM Resource Configuration:
- GPUs: 1× GPU (configurable)
- CPUs: 24 cores
- Memory: 128GB
- Time Limit: 72 hours
- Partition: aisc-batch
Default Input/Output:
- Input: data/CommonVoiceENraw (audio + transcript files)
- Output: data/CommonVoiceENJSON (JSON files with word boundaries)
Monitoring:
# Check job status
squeue -u $USER
# Monitor progress
tail -f logs/forced_alignment_[JOBID].err

For extracting individual word audio segments from aligned data using SLURM:
# Basic submission with defaults
sbatch scripts/extract_word_audio.sbatch
# Process only specific splits
sbatch scripts/extract_word_audio.sbatch --splits train
# Adjust parallelism
sbatch scripts/extract_word_audio.sbatch --nworkers 32 --batch-size 200
# Chain with forced alignment job (run after alignment completes)
sbatch --dependency=afterok:<ALIGNMENT_JOB_ID> scripts/extract_word_audio.sbatch

SLURM Resource Configuration:
- CPUs: 24 cores
- Memory: 64GB
- Time Limit: 24 hours
- Partition: aisc-batch
Default Input/Output:
- Input JSON: data/CommonVoiceENJSON (word boundary JSON files)
- Input Audio: data/CommonVoiceENraw (source audio files)
- Output: data/CommonVoiceENWords (individual word audio files)
Output Naming Convention:
{file_id}_{word_index}_{word}.wav
Example: 10110_2_really.wav (word index 2, word "really" from file 10110)
Monitoring:
# Check job status
squeue -u $USER
# Monitor progress
tail -f logs/extract_word_audio_[JOBID].err
# Count extracted words
ls data/CommonVoiceENWords/train/*.wav 2>/dev/null | wc -l

For applying hearing loss masks to extracted word audio files using SLURM:
# Basic submission with defaults
sbatch scripts/mask_word_audio.sbatch
# Process only specific splits
sbatch scripts/mask_word_audio.sbatch --splits train
# Process only specific conditions
sbatch scripts/mask_word_audio.sbatch --conditions hfloss lfloss
# Adjust parallelism
sbatch scripts/mask_word_audio.sbatch --nworkers 32 --batch-size 200
# Chain with extract word audio job (run after extraction completes)
sbatch --dependency=afterok:<EXTRACT_JOB_ID> scripts/mask_word_audio.sbatch

SLURM Resource Configuration:
- CPUs: 48 cores
- Memory: 128GB
- Time Limit: 48 hours
- Partition: aisc-batch
Default Input/Output:
- Input: data/CommonVoiceENWords/{train,test,validation}/*.wav
- Output:
  - data/CommonVoiceENWords_normal/ (copies of original files)
  - data/CommonVoiceENWords_lfloss/ (low-frequency hearing loss)
  - data/CommonVoiceENWords_hfloss/ (high-frequency hearing loss)
Monitoring:
# Check job status
squeue -u $USER
# Monitor progress
tail -f logs/mask_word_audio_[JOBID].err
# Count masked files per condition
ls data/CommonVoiceENWords_hfloss/train/*.wav 2>/dev/null | wc -l