A deep learning-based Voice Activity Detection (VAD) system implemented from scratch using Bidirectional LSTM (BiLSTM) networks. This project demonstrates the application of modern AI techniques in speech processing, focusing on detecting speech and non-speech regions in audio signals, particularly in noisy environments.
- Deep Learning Architecture: Utilizes BiLSTM for temporal modeling of speech patterns
- Noise Robustness: Trained with noise augmentation to handle low SNR conditions
- Real-time Capable: Optimized for frame-level predictions suitable for streaming audio
- Open Source: Built with PyTorch and publicly available datasets
- Educational: Comprehensive implementation with detailed explanations
- Python 3.8 or higher
- uv package manager
- Jupyter Notebook or JupyterLab
- Clone the repository:

  ```bash
  git clone https://github.com/hasithdd/VAD-Model-from-Scratch.git
  cd VAD-Model-from-Scratch
  ```

- Install dependencies with uv:

  ```bash
  uv sync
  ```

  This will create a virtual environment and install all required packages as specified in `pyproject.toml`.

- Activate the environment (optional, uv handles this automatically):

  ```bash
  uv run python --version
  ```
This project uses the Google Speech Commands Dataset v0.01, a public dataset containing:
- 65,000 one-second audio clips of 30 different words
- Background noise samples
- Mono WAV files at 16 kHz sample rate
The dataset is automatically downloaded and prepared during notebook execution.
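For reference, the same data can also be fetched with torchaudio's built-in dataset class. A minimal sketch, assuming `torchaudio` is installed; the `./data` directory name is an arbitrary choice:

```python
import os

import torchaudio

# Fetch Speech Commands v0.01 (about 1.4 GB) into ./data.
# The notebook performs its own download; this is only an equivalent reference.
os.makedirs("./data", exist_ok=True)
dataset = torchaudio.datasets.SPEECHCOMMANDS(
    root="./data",
    url="speech_commands_v0.01",
    download=True,
)

# Each item is a mono 16 kHz clip with its label and speaker metadata.
waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
print(sample_rate, label, tuple(waveform.shape))  # e.g. 16000 "bed" (1, 16000)
```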
The VAD system employs the following components (a PyTorch sketch follows the list):
- Feature Extraction: Mel-Frequency Cepstral Coefficients (MFCCs)
- Network: Bidirectional LSTM with 2 layers and 128 hidden units
- Output: Frame-level binary classification (speech/non-speech)
- Sequence Length: 8-second overlapping windows
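A minimal PyTorch sketch of this architecture. The class name, the input feature size (`n_mfcc=40`), and the single-logit output head are illustrative assumptions, not the notebook's exact code:

```python
import torch
import torch.nn as nn


class BiLSTMVAD(nn.Module):
    """Frame-level VAD: MFCC frames in, per-frame speech logits out."""

    def __init__(self, n_mfcc: int = 40, hidden: int = 128, layers: int = 2):
        super().__init__()
        # 2-layer bidirectional LSTM with 128 hidden units, as described above.
        self.lstm = nn.LSTM(
            input_size=n_mfcc,
            hidden_size=hidden,
            num_layers=layers,
            batch_first=True,
            bidirectional=True,
        )
        # 2 * hidden because forward and backward states are concatenated.
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_mfcc) -> logits: (batch, frames)
        out, _ = self.lstm(x)
        return self.classifier(out).squeeze(-1)


model = BiLSTMVAD()
# 4 windows of 800 frames each (8 s of audio at a 10 ms hop).
logits = model(torch.randn(4, 800, 40))
probs = torch.sigmoid(logits)  # per-frame speech probability
```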
- Start Jupyter Notebook:

  ```bash
  uv run jupyter notebook
  ```

- Open the main notebook: navigate to `Cw1_w1987535_HasithVikasithaDharmarathna.ipynb` and open it.
- Run the cells sequentially:
  - The notebook covers data preparation, model training, and evaluation
  - Training runs for approximately 10 epochs; a GPU-enabled system is recommended
  - Evaluation produces a confusion matrix and performance metrics
- Key sections:
  - Data loading and preprocessing
  - Feature extraction (MFCC; see the sketch after this list)
  - Model training with BiLSTM
  - Validation and performance analysis
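A sketch of the MFCC front end using `torchaudio.transforms.MFCC`. The parameter values (40 coefficients, 25 ms windows, 10 ms hop at 16 kHz) are typical choices and may differ from the notebook's settings:

```python
import torch
import torchaudio

# MFCC extractor: n_fft=400 and hop_length=160 correspond to
# 25 ms windows with a 10 ms hop at a 16 kHz sample rate.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=40,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 64},
)

waveform = torch.randn(1, 16000)  # one second of placeholder audio
features = mfcc(waveform)         # (1, 40, frames)
frames = features.squeeze(0).T    # (frames, 40), ready for the BiLSTM
```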
The model achieves:
- Accuracy: 86.3%
- Precision: 86.1%
- Recall: 82.7%
- F1 Score: 84.4%
Evaluated on noisy validation data with -10 dB SNR.
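These figures come from frame-level predictions. For illustration, a sketch of how such metrics can be computed with scikit-learn, assuming it is available; the arrays below are placeholders, not the project's data:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Placeholder frame-level labels (1 = speech) and thresholded predictions.
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 0])

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```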
To add new packages:
```bash
uv add package-name
```

To run the test suite:

```bash
uv run python -m pytest
```

For deployment, the trained model can be exported to ONNX:

```bash
uv run python export_model.py  # (if you create this script)
```
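A hypothetical `export_model.py` along these lines, assuming the `BiLSTMVAD` sketch shown earlier and a checkpoint saved as `model.pt` (neither ships with this repo):

```python
# export_model.py -- hypothetical sketch; BiLSTMVAD and model.pt are
# assumptions based on the model sketch above, not files in this repo.
import torch

from model import BiLSTMVAD  # hypothetical module holding the sketch

model = BiLSTMVAD()
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()

dummy = torch.randn(1, 800, 40)  # (batch, frames, n_mfcc)
torch.onnx.export(
    model,
    dummy,
    "vad.onnx",
    input_names=["mfcc"],
    output_names=["logits"],
    # Allow variable batch size and window length at inference time.
    dynamic_axes={
        "mfcc": {0: "batch", 1: "frames"},
        "logits": {0: "batch", 1: "frames"},
    },
)
print("Exported to vad.onnx")
```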
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Based on the coursework for 6COSC020W Applied AI
- Inspired by modern VAD systems like Silero VAD and pyannote.audio
Author: P.A. Hasith Vikasitha Dharmarathna
ID: 20223265
GitHub: hasithdd