Univox

Univox is a comprehensive AI-powered application for voice cloning, video dubbing, and cross-lingual audio translation. It leverages Retrieval-based Voice Conversion (RVC) to create high-quality voice models and integrates them into a full-stack web interface for dubbing video content.

🚀 Features

  • Voice Cloning (RVC): Train custom voice models using Retrieval-based Voice Conversion (RVC v2).
  • Video Dubbing: Automatically merge cloned or translated audio with video templates using moviepy.
  • Audio Translation: Translate spoken audio from one language to another (e.g., English to Spanish) using GoogleTrans and gTTS.
  • Web Interface: User-friendly React frontend for uploading audio/video and viewing results.

🧠 Voice Cloning Process (RVC Model)

The voice cloning functionality is powered by the RVC v2 Disconnected notebook (Copy_of_RVC_v2_Disconnected.ipynb). This process uses deep learning to learn the timbre and pitch of a target speaker.

1. Environment Setup

The model runs in a Python environment (typically Google Colab) and requires specific deep learning libraries:

  • Fairseq: For handling the Hubert soft-content encoder.
  • Faiss-GPU: For high-speed vector similarity search (used in the retrieval index).
  • PyTorch: The core deep learning framework.
  • FFmpeg & Praat: For audio signal processing and pitch extraction.
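
The notebook installs these in an early cell; roughly the following (a sketch, not the exact cell — package pins and extras vary by notebook revision, PyTorch ships preinstalled on Colab, and praat-parselmouth is the Python binding for Praat):

!pip install fairseq faiss-gpu praat-parselmouth
!apt-get -y install ffmpeg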

2. Preprocessing Pipeline

Before training, the input audio (dataset) undergoes several transformation steps:

  • Sanitization: Audio files are converted to WAV format, and non-audio files are removed.
  • Sample Rate Conversion: Audio is resampled to the target rate (e.g., 40k, 48k).
  • Pitch Extraction (f0): The system extracts pitch data using algorithms like RMVPE (Robust Model for Vocal Pitch Estimation) or CREPE to ensure the cloned voice captures the correct intonation.
  • Feature Extraction: The Hubert model extracts "soft speech units" (content features) from the audio. These features represent what is being said, separate from how it sounds.
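
Condensed into Python, the pipeline looks roughly like this (a sketch only: librosa and parselmouth stand in for the notebook's internal scripts, and the Hubert call is schematic):

# Preprocessing sketch; the notebook drives these steps through RVC's own scripts.
import librosa
import parselmouth

SR = 40000  # target sample rate for a 40k model

# Sample-rate conversion: librosa resamples on load
wav, _ = librosa.load("dataset/clip.wav", sr=SR, mono=True)

# Pitch extraction (f0): Praat via parselmouth, standing in for RMVPE/CREPE
pitch = parselmouth.Sound(wav, sampling_frequency=SR).to_pitch(time_step=0.01)
f0 = pitch.selected_array["frequency"]  # Hz per frame; 0 marks unvoiced frames

# Feature extraction (schematic): a Hubert encoder yields the content features
# feats = hubert.extract_features(wav)  # roughly (frames, 768) for RVC v2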

3. Training

The training process consists of two main components:

  • Generator & Discriminator: The model trains a Generator (G) to synthesize speech that sounds like the target speaker and a Discriminator (D) to distinguish between real and synthesized speech. It uses pretrained base models (e.g., OV2Super, TITAN) to accelerate learning.
  • Index Training (Faiss): A feature index is trained on the extracted Hubert features. This index lets the model "retrieve" matching timbre details from the training audio during inference, reducing timbre leakage and improving similarity to the target speaker.
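
At heart, the index step is a small amount of Faiss code; a minimal sketch, assuming RVC v2's 768-dimensional Hubert features and a hypothetical total_fea.npy dump of them:

# Index training sketch: cluster the dataset's Hubert features so nearest
# neighbours can be retrieved quickly at inference time.
import faiss
import numpy as np

feats = np.load("logs/my_voice/total_fea.npy").astype("float32")  # hypothetical path, shape (N, 768)

index = faiss.index_factory(feats.shape[1], "IVF256,Flat")  # inverted-file index, 256 clusters
index.train(feats)   # learn the cluster centroids
index.add(feats)     # store the feature vectors
faiss.write_index(index, "my_voice.index")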

4. Inference

Once trained, the model (.pth file) and index (.index file) are used to convert source audio into the target voice while preserving the original speech rhythm and pitch.
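
At inference time the index supplies nearest-neighbour content vectors that are blended with the source features before synthesis. A schematic sketch (the paths, index_rate value, and generator call are illustrative stand-ins for RVC's pipeline):

# Retrieval sketch: pull each source frame toward its nearest training frame.
import faiss
import numpy as np

index = faiss.read_index("my_voice.index")
big_npy = np.load("logs/my_voice/total_fea.npy")        # the same features the index stores

feats = np.load("source_hubert.npy").astype("float32")  # hypothetical: precomputed source content features, (T, 768)
_, ids = index.search(feats, k=1)                       # nearest training frame per source frame
retrieved = big_npy[ids[:, 0]]

index_rate = 0.75                                       # how strongly to pull toward the target timbre
feats = index_rate * retrieved + (1 - index_rate) * feats
# audio = generator(feats, f0)  # the trained .pth generator synthesises the waveform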

🛠️ Tech Stack

AI & Model

  • RVC v2 (Retrieval-based Voice Conversion)
  • PyTorch
  • Fairseq
  • Faiss (Facebook AI Similarity Search)

Backend

  • FastAPI: High-performance web framework.
  • MoviePy: For video editing and audio/video merging.
  • GoogleTrans & gTTS: For translation and text-to-speech.
  • PyDub & SpeechRecognition: For audio manipulation and transcription.
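
As a hedged sketch of how these pieces chain together for audio translation (function and path names are illustrative; the actual server.py may structure this differently):

# Translate spoken English audio into Spanish speech:
# transcribe -> translate -> re-synthesise.
import speech_recognition as sr
from googletrans import Translator
from gtts import gTTS

def translate_audio(wav_path: str, dest: str = "es") -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:                      # SpeechRecognition: audio -> text
        text = recognizer.recognize_google(recognizer.record(source))

    translated = Translator().translate(text, dest=dest).text   # googletrans: text -> text

    out_path = "output_audios/translated.mp3"                   # illustrative output location
    gTTS(translated, lang=dest).save(out_path)                  # gTTS: text -> speech
    return out_path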

Frontend

  • React.js: UI Component library.
  • Axios: For API requests.
  • CSS Modules: For component-scoped styling.

📂 Project Structure

univox/
├── backend/
│   ├── server.py           # Main FastAPI backend
│   ├── input_videos/       # Source videos
│   ├── output_audios/      # Translated/Cloned audio
│   └── uploads/            # User uploaded files
├── frontend/
│   ├── src/components/     # React components (Demo, Result, Translate)
│   └── public/             # Static assets
├── Univox-.../             # RVC Model Training Folder
│   └── Copy_of_RVC_v2_Disconnected.ipynb  # RVC Training Notebook
└── README.md

⚙️ Setup & Installation

1. Voice Model Training

To train your own voice model:

  1. Open Copy_of_RVC_v2_Disconnected.ipynb in Google Colab.
  2. Upload your dataset (ZIP of WAV files) to your Google Drive in a folder named rvcDisconnected.
  3. Run the notebook cells sequentially to:
  • Install dependencies.
  • Preprocess data.
  • Extract features.
  • Train the model and index.
  4. Download the resulting .pth and .index files for use.

2. Web Application Setup

Backend:

cd backend
pip install fastapi uvicorn moviepy googletrans==3.1.0a0 gTTS pydub SpeechRecognition python-multipart
uvicorn server:app --reload
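
For orientation, a minimal sketch of the kind of endpoint server.py exposes (the route name, template path, and MoviePy 1.x calls here are illustrative, not the project's actual API):

# Accept an audio upload and merge it onto a template video.
from fastapi import FastAPI, UploadFile
from moviepy.editor import AudioFileClip, VideoFileClip

app = FastAPI()

@app.post("/dub")
async def dub(audio: UploadFile):
    audio_path = f"uploads/{audio.filename}"
    with open(audio_path, "wb") as f:
        f.write(await audio.read())                        # save the uploaded file

    video = VideoFileClip("input_videos/template.mp4")     # illustrative template path
    dubbed = video.set_audio(AudioFileClip(audio_path))    # replace the soundtrack
    out_path = "uploads/dubbed.mp4"                        # illustrative output location
    dubbed.write_videofile(out_path, codec="libx264", audio_codec="aac")
    return {"result": out_path}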

Frontend:

cd frontend
npm install
npm start

📝 Usage

  1. Train: Use the notebook to create a voice model of your desired speaker.
  2. Upload: Use the React web app to upload a voice recording.
  3. Process: The backend processes the audio (translating or dubbing) and merges it with the video template.
  4. Result: View and download the final dubbed video directly in the browser.
