This repository contains the Mini Project Report (MPR) notebook for "AI-Generated Voice Cloning Using GANs", focused on developing a system that synthesizes human-like voices using Generative Adversarial Networks (GANs). The project utilizes the Mozilla Common Voice dataset to train and evaluate the generative model on diverse speech samples.
To build an AI system capable of cloning human voices by training a GAN architecture on real-world multilingual voice data. The aim is to replicate the natural tone, pitch, and speaking style of a given speaker through synthesized speech samples.
Generative Adversarial Network (GAN) Architecture:
- Custom generator and discriminator models tailored for raw audio signal generation.
- Feature extraction applied to the audio before it is fed to the GAN (see the model sketch below).
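A minimal PyTorch sketch of the generator/discriminator pairing, assuming the GAN operates on fixed-size Mel-spectrogram patches; the layer sizes, latent dimension, and the choice of PyTorch are illustrative assumptions, not the notebook's exact models:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a latent noise vector to a Mel-spectrogram patch (illustrative sizes)."""
    def __init__(self, latent_dim=100, n_mels=80, n_frames=64):
        super().__init__()
        self.n_mels, self.n_frames = n_mels, n_frames
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, n_mels * n_frames), nn.Tanh(),  # output scaled to [-1, 1]
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, self.n_mels, self.n_frames)


class Discriminator(nn.Module):
    """Scores a real or generated Mel-spectrogram patch as real/fake (raw logit)."""
    def __init__(self, n_mels=80, n_frames=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * n_frames, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),  # raw score; pair with BCEWithLogitsLoss
        )

    def forward(self, x):
        return self.net(x)
```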
Voice Preprocessing & Feature Engineering:
- Audio normalization, silence trimming, and spectrogram generation.
- Conversion to Mel spectrograms for stable GAN training (see the preprocessing sketch below).
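A sketch of this preprocessing chain using librosa; the sample rate, FFT size, and hop length are assumed defaults, not necessarily the notebook's settings:

```python
import librosa
import numpy as np

def preprocess(path, sr=16000, n_mels=80, n_fft=1024, hop_length=256):
    """Load a clip, normalize it, trim silence, and convert it to a log-Mel spectrogram."""
    audio, _ = librosa.load(path, sr=sr)                # decode and resample to a common rate
    audio = audio / (np.max(np.abs(audio)) + 1e-9)      # peak-normalize amplitude
    audio, _ = librosa.effects.trim(audio, top_db=30)   # trim leading/trailing silence
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)         # log scale stabilizes GAN training
```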
Training Loop:
- Balanced generator-discriminator training cycle.
- Loss functions customized for speech-signal characteristics (a simplified training-step sketch follows).
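A simplified sketch of one balanced training step using the standard binary cross-entropy GAN loss; the notebook's customized losses and optimizer settings may differ:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def train_step(gen, disc, real_mels, opt_g, opt_d, latent_dim=100, device="cpu"):
    """One discriminator update followed by one generator update."""
    batch = real_mels.size(0)
    real_labels = torch.ones(batch, 1, device=device)
    fake_labels = torch.zeros(batch, 1, device=device)

    # Discriminator: distinguish real Mel patches from generated ones
    opt_d.zero_grad()
    fake_mels = gen(torch.randn(batch, latent_dim, device=device))
    d_loss = bce(disc(real_mels), real_labels) + bce(disc(fake_mels.detach()), fake_labels)
    d_loss.backward()
    opt_d.step()

    # Generator: produce patches the discriminator labels as real
    opt_g.zero_grad()
    g_loss = bce(disc(fake_mels), real_labels)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```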
Voice Cloning Evaluation:
- Comparison of real vs. synthesized audio using waveform visualization and audio playback.
- Metric evaluation, e.g. Spectral Convergence and Signal-to-Noise Ratio (see the metric sketch below).
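A sketch of the two named metrics in their common formulations (not necessarily the notebook's exact implementation); both assume the real and synthesized clips are time-aligned:

```python
import numpy as np
import librosa

def spectral_convergence(real, synth, n_fft=1024, hop_length=256):
    """Frobenius-norm distance between magnitude spectrograms, relative to the real one (lower is better)."""
    n = min(len(real), len(synth))
    R = np.abs(librosa.stft(real[:n], n_fft=n_fft, hop_length=hop_length))
    S = np.abs(librosa.stft(synth[:n], n_fft=n_fft, hop_length=hop_length))
    return np.linalg.norm(R - S, "fro") / np.linalg.norm(R, "fro")

def snr_db(real, synth):
    """Signal-to-noise ratio in dB, treating (real - synth) as the noise term."""
    n = min(len(real), len(synth))
    noise = real[:n] - synth[:n]
    return 10 * np.log10(np.sum(real[:n] ** 2) / (np.sum(noise ** 2) + 1e-12))
```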
This project uses the Mozilla Common Voice Dataset:
- Open-source, multilingual dataset of speech samples.
- Provides thousands of validated clips across multiple speakers, languages, and accents.
- Used for both the training and evaluation phases (a minimal loading sketch follows).
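A minimal loading sketch, assuming the standard Common Voice release layout (a metadata TSV such as validated.tsv plus a clips/ folder of audio files) placed under the data/ directory mentioned in the setup steps below:

```python
import os
import pandas as pd
import librosa

def iter_common_voice(root="data", tsv="validated.tsv", sr=16000, limit=None):
    """Yield (audio, speaker_id) pairs from a Common Voice release."""
    meta = pd.read_csv(os.path.join(root, tsv), sep="\t")
    if limit is not None:
        meta = meta.head(limit)
    for _, row in meta.iterrows():
        clip_path = os.path.join(root, "clips", row["path"])  # 'path' holds the clip file name
        audio, _ = librosa.load(clip_path, sr=sr)             # decode and resample the clip
        yield audio, row["client_id"]                         # 'client_id' identifies the speaker
```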
voice-cloning-gan/
├── AAI MPR.ipynb      # Main project notebook
└── README.md          # Project documentation
Clone the repository:
git clone https://github.com/yourusername/voice-cloning-gan.git
cd voice-cloning-gan
Install dependencies:
pip install -r requirements.txt
Download dataset:
- Visit Common Voice and download your preferred language version.
- Extract the audio files and place them in a data/ directory (if loading the dataset from outside the notebook).
Run the notebook:
jupyter notebook "AAI MPR.ipynb"
- Synthesized voice samples generated after each training epoch.
- Visual comparison between input voice spectrograms and generated outputs.
- Evaluation through waveform plots and perceptual listening (see the plotting sketch below).
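A sketch of how per-epoch samples and waveform comparisons could be written out; the file names and the soundfile/matplotlib choices are assumptions, not the notebook's exact code:

```python
import os
import matplotlib.pyplot as plt
import soundfile as sf

def save_epoch_sample(real, synth, sr, epoch, out_dir="samples"):
    """Write the synthesized clip to disk and plot real vs. generated waveforms."""
    os.makedirs(out_dir, exist_ok=True)
    sf.write(os.path.join(out_dir, f"epoch_{epoch:03d}.wav"), synth, sr)

    fig, axes = plt.subplots(2, 1, figsize=(10, 4), sharex=True)
    axes[0].plot(real)
    axes[0].set_title("Real waveform")
    axes[1].plot(synth)
    axes[1].set_title(f"Generated waveform (epoch {epoch})")
    fig.tight_layout()
    fig.savefig(os.path.join(out_dir, f"epoch_{epoch:03d}.png"))
    plt.close(fig)
```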
- ✅ Core GAN architecture implemented
- 🚧 Currently testing across multiple speakers and languages
- 🔜 Future improvements: attention layers and multi-speaker conditioning