
Voice Activity Detection (VAD) Model from Scratch


A deep learning-based Voice Activity Detection (VAD) system implemented from scratch using Bidirectional LSTM (BiLSTM) networks. This project demonstrates the application of modern AI techniques in speech processing, focusing on detecting speech and non-speech regions in audio signals, particularly in noisy environments.

🚀 Features

  • Deep Learning Architecture: Utilizes BiLSTM for temporal modeling of speech patterns
  • Noise Robustness: Trained with noise augmentation to handle low SNR conditions
  • Real-time Capable: Optimized for frame-level predictions suitable for streaming audio
  • Open Source: Built with PyTorch and publicly available datasets
  • Educational: Comprehensive implementation with detailed explanations

📋 Prerequisites

  • Python 3.8 or higher
  • uv package manager
  • Jupyter Notebook or JupyterLab

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/hasithdd/VAD-Model-from-Scratch.git
    cd VAD-Model-from-Scratch
  2. Install dependencies with uv:

    uv sync

    This will create a virtual environment and install all required packages as specified in pyproject.toml.

  3. Verify the environment (optional; uv activates it automatically when you run commands through uv run):

    uv run python --version

📊 Dataset

This project uses the Google Speech Commands Dataset v0.01, a public dataset containing:

  • About 65,000 one-second audio clips covering 30 short words
  • Background noise samples
  • Mono WAV files at 16 kHz sample rate

The dataset is automatically downloaded and prepared during notebook execution.
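
For reference, the same dataset can also be pulled directly with torchaudio's built-in loader. A minimal sketch (the ./data root is an arbitrary choice, not the notebook's path):

    import os
    import torchaudio

    os.makedirs("./data", exist_ok=True)
    # Downloads and extracts Speech Commands v0.01 on first use.
    dataset = torchaudio.datasets.SPEECHCOMMANDS(
        root="./data",
        url="speech_commands_v0.01",
        download=True,
    )

    waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
    print(waveform.shape, sample_rate, label)  # e.g. torch.Size([1, 16000]) 16000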

🏗️ Model Architecture

The VAD system employs:

  • Feature Extraction: Mel-Frequency Cepstral Coefficients (MFCCs)
  • Network: Bidirectional LSTM with 2 layers and 128 hidden units
  • Output: Frame-level binary classification (speech/non-speech)
  • Sequence Length: 8-second overlapping windows
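
A minimal PyTorch sketch of this stack follows. The layer count and hidden size match the list above; the MFCC settings (n_mfcc, FFT and hop sizes) are illustrative assumptions rather than values taken from the notebook.

    import torch
    import torch.nn as nn
    import torchaudio

    # Front end: frame-level MFCCs from 16 kHz mono audio.
    # n_mfcc=40 and the FFT/hop sizes are assumptions, not the notebook's values.
    mfcc = torchaudio.transforms.MFCC(
        sample_rate=16000,
        n_mfcc=40,
        melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 64},
    )

    class BiLSTMVAD(nn.Module):
        """2-layer BiLSTM, 128 hidden units, one speech/non-speech logit per frame."""

        def __init__(self, n_mfcc=40, hidden=128, layers=2):
            super().__init__()
            self.lstm = nn.LSTM(
                input_size=n_mfcc,
                hidden_size=hidden,
                num_layers=layers,
                batch_first=True,
                bidirectional=True,
            )
            # Bidirectional output is 2 * hidden wide; project to one logit.
            self.head = nn.Linear(2 * hidden, 1)

        def forward(self, x):
            # x: (batch, frames, n_mfcc) -> logits: (batch, frames)
            out, _ = self.lstm(x)
            return self.head(out).squeeze(-1)

    # One 8-second window of 16 kHz audio -> per-frame speech logits.
    audio = torch.randn(1, 16000 * 8)
    feats = mfcc(audio).transpose(1, 2)  # (1, frames, 40)
    logits = BiLSTMVAD()(feats)

Applying a sigmoid to the logits and thresholding (e.g. at 0.5) yields the binary speech/non-speech decision for each frame.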

🚀 Usage

  1. Start Jupyter Notebook:

    uv run jupyter notebook
  2. Open the main notebook: navigate to Cw1_w1987535_HasithVikasithaDharmarathna.ipynb.

  3. Run the cells sequentially:

    • The notebook includes data preparation, model training, and evaluation
    • Training runs for approximately 10 epochs; a GPU-enabled system is recommended
    • Evaluation produces a confusion matrix and standard performance metrics (see the sketch after this list)
  4. Key sections:

    • Data loading and preprocessing
    • Feature extraction (MFCC)
    • Model training with BiLSTM
    • Validation and performance analysis
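
The confusion matrix and the headline metrics reported under Results can be computed from flattened frame-level labels and predictions. A minimal sketch, assuming scikit-learn is available and that y_true and y_pred are 0/1 arrays with one entry per frame (names are illustrative):

    from sklearn.metrics import (
        accuracy_score,
        confusion_matrix,
        f1_score,
        precision_score,
        recall_score,
    )

    def report(y_true, y_pred):
        # y_true / y_pred: flat 0/1 arrays, one entry per frame.
        print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
        print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
        print(f"Precision: {precision_score(y_true, y_pred):.3f}")
        print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
        print(f"F1 score:  {f1_score(y_true, y_pred):.3f}")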

📈 Results

The model achieves:

  • Accuracy: 86.3%
  • Precision: 86.1%
  • Recall: 82.7%
  • F1 Score: 84.4%

Evaluated on noisy validation data at -10 dB SNR.
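
An SNR this low means the noise carries ten times the power of the speech. A minimal sketch of how noise is typically mixed at a target SNR (the function name and shapes are illustrative; this is not the notebook's exact augmentation code):

    import torch

    def mix_at_snr(speech, noise, snr_db):
        """Scale `noise` so the mixture hits the requested SNR in dB."""
        noise = noise[..., : speech.shape[-1]]  # trim noise to speech length
        p_speech = speech.pow(2).mean()
        p_noise = noise.pow(2).mean().clamp_min(1e-10)
        # SNR(dB) = 10 * log10(p_speech / (gain^2 * p_noise)); solve for gain.
        gain = torch.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + gain * noise

    noisy = mix_at_snr(torch.randn(16000), torch.randn(16000), snr_db=-10.0)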

🔧 Development

Adding Dependencies

To add new packages:

uv add package-name

Running Tests

uv run python -m pytest

Exporting Model

For deployment, the trained model can be exported to ONNX:

uv run python export_model.py  # hypothetical script; a sketch follows
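
No such script ships with the repository. A hedged sketch of what it might contain, reusing the BiLSTMVAD class from the Model Architecture sketch above (the checkpoint filename and input shape are assumptions):

    # export_model.py -- hypothetical; BiLSTMVAD is the class sketched
    # under Model Architecture. The checkpoint name is an assumption.
    import torch

    model = BiLSTMVAD()
    model.load_state_dict(torch.load("vad_bilstm.pt", map_location="cpu"))
    model.eval()

    # Dummy input shaped (batch, frames, n_mfcc); dynamic axes let the
    # exported graph accept any batch size and sequence length.
    dummy = torch.randn(1, 800, 40)
    torch.onnx.export(
        model,
        dummy,
        "vad_bilstm.onnx",
        input_names=["mfcc"],
        output_names=["logits"],
        dynamic_axes={"mfcc": {0: "batch", 1: "frames"},
                      "logits": {0: "batch", 1: "frames"}},
        opset_version=17,
    )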

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Based on the coursework for 6COSC020W Applied AI
  • Inspired by modern VAD systems like Silero VAD and pyannote.audio

Author: P.A. Hasith Vikasitha Dharmarathna
ID: 20223265
GitHub: hasithdd
