Univox is a comprehensive AI-powered application for voice cloning, video dubbing, and cross-lingual audio translation. It leverages Retrieval-based Voice Conversion (RVC) to create high-quality voice models and integrates them into a full-stack web interface for dubbing video content.
- Voice Cloning (RVC): Train custom voice models using Retrieval-based Voice Conversion (RVC v2).
- Video Dubbing: Automatically merge cloned or translated audio with video templates using `moviepy`.
- Audio Translation: Translate spoken audio from one language to another (e.g., English to Spanish) using `GoogleTrans` and `gTTS`.
- Web Interface: User-friendly React frontend for uploading audio/video and viewing results.
The voice cloning functionality is powered by the RVC v2 Disconnected notebook (`Copy_of_RVC_v2_Disconnected.ipynb`). This process uses deep learning to learn the timbre and pitch of a target speaker.
The model runs on a Python environment (typically Google Colab) and requires specific deep learning libraries:
- Fairseq: For handling the HuBERT soft-content encoder.
- Faiss-GPU: For high-speed vector similarity search (used in the retrieval index).
- PyTorch: The core deep learning framework.
- FFmpeg & Praat: For audio signal processing and pitch extraction.
Before training, the input audio (dataset) undergoes several transformation steps:
- Sanitization: Audio files are converted to WAV format, and non-audio files are removed.
- Sample Rate Conversion: Audio is resampled to the target rate (e.g., 40k, 48k).
- Pitch Extraction (f0): The system extracts pitch data using algorithms like RMVPE (Robust Model for Vocal Pitch Estimation) or CREPE to ensure the cloned voice captures the correct intonation.
- Feature Extraction: The HuBERT model extracts "soft speech units" (content features) from the audio. These features represent what is being said, separate from how it sounds (see the sketch below).
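As a rough illustration of the resampling and pitch steps (the notebook uses its own pipeline), the sketch below uses librosa, with `pyin` standing in for RMVPE/CREPE; the file paths and the 40k rate are illustrative assumptions.

```python
# Illustrative preprocessing sketch; librosa's pyin stands in here for
# RMVPE/CREPE, and the paths/rates are assumptions, not the notebook's code.
import librosa
import soundfile as sf

SRC = "dataset/clip01.mp3"  # hypothetical input file
TARGET_SR = 40000           # e.g., the 40k training configuration

# Sanitize + resample: decode any supported format, downmix to mono,
# and resample to the target rate
y, sr = librosa.load(SRC, sr=TARGET_SR, mono=True)
sf.write("dataset/clip01.wav", y, sr)  # store as WAV for training

# Pitch (f0) extraction over a typical vocal range
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
print(f0.shape)  # one estimate per analysis frame (NaN where unvoiced)
```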
The training process consists of two main components:
- Generator & Discriminator: The model trains a Generator (G) to synthesize speech that sounds like the target speaker and a Discriminator (D) to distinguish between real and synthesized speech. It uses pretrained base models (e.g., `OV2Super`, `TITAN`) to accelerate learning.
- Index Training (Faiss): A feature index is trained on the extracted HuBERT features (see the sketch after this list). This index allows the model to "retrieve" style details from the reference audio during inference, reducing audio leakage and improving similarity.
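As a concrete, hedged illustration of the index step, this sketch builds an IVF index over saved HuBERT features with Faiss; the feature file, the 768-dim layout, and the cell-count heuristic are assumptions rather than the notebook's exact code.

```python
# Minimal sketch of index training: cluster the HuBERT features and
# store them for nearest-neighbour retrieval at inference time.
import faiss
import numpy as np

# Hypothetical path; shape assumed (N, 768) for RVC v2 features
feats = np.load("logs/my_voice/features.npy").astype("float32")
dim = feats.shape[1]
# Heuristic cell count, capped so each cell gets enough training points
n_ivf = max(1, min(int(16 * np.sqrt(len(feats))), len(feats) // 39))

index = faiss.index_factory(dim, f"IVF{n_ivf},Flat")
index.train(feats)  # learn the coarse quantizer (k-means over features)
index.add(feats)    # add every training feature to the index
faiss.write_index(index, "logs/my_voice/added.index")
```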
Once trained, the model (`.pth` file) and index (`.index` file) are used to convert source audio into the target voice while preserving the original speech rhythm and pitch.
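To make the "retrieve" step concrete, here is a hedged sketch of how the index can be queried at inference time and blended into the source features; `k=8`, the inverse-square distance weighting, and `index_rate` mirror common RVC settings but are assumptions here.

```python
# Sketch of feature retrieval at inference: each HuBERT frame of the
# source audio is blended with its nearest neighbours from the index.
import faiss
import numpy as np

index = faiss.read_index("logs/my_voice/added.index")
big_npy = np.load("logs/my_voice/features.npy").astype("float32")

# Stand-in for the real HuBERT frames of the audio being converted
source_feats = np.random.rand(200, 768).astype("float32")

score, ix = index.search(source_feats, 8)  # 8 nearest frames per query
weight = np.square(1.0 / (score + 1e-8))   # closer neighbours weigh more
weight /= weight.sum(axis=1, keepdims=True)
retrieved = np.sum(big_npy[ix] * weight[:, :, None], axis=1)

index_rate = 0.75  # how strongly to lean on retrieved (target-voice) features
blended = index_rate * retrieved + (1 - index_rate) * source_feats
```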
Voice Cloning:
- RVC v2 (Retrieval-based Voice Conversion)
- PyTorch
- Fairseq
- Faiss (Facebook AI Similarity Search)

Backend:
- FastAPI: High-performance web framework.
- MoviePy: For video editing and audio/video merging.
- GoogleTrans & gTTS: For translation and text-to-speech.
- PyDub & SpeechRecognition: For audio manipulation and transcription (see the pipeline sketch below).

Frontend:
- React.js: UI component library.
- Axios: For API requests.
- CSS Modules: For component-scoped styling.
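To show how the backend libraries fit together, here is a minimal end-to-end sketch of the transcribe, translate, synthesize, and merge flow (assuming MoviePy 1.x); all file names and language codes are illustrative, not Univox's actual code.

```python
# Illustrative translation pipeline using the backend libraries;
# file names and wiring are assumptions, not Univox's exact code.
import speech_recognition as sr
from googletrans import Translator
from gtts import gTTS
from moviepy.editor import AudioFileClip, VideoFileClip  # MoviePy 1.x API

# 1. Transcribe the uploaded recording
recognizer = sr.Recognizer()
with sr.AudioFile("uploads/input.wav") as source:
    audio = recognizer.record(source)
text = recognizer.recognize_google(audio, language="en-US")

# 2. Translate the transcript (e.g., English -> Spanish)
translated = Translator().translate(text, src="en", dest="es").text

# 3. Synthesize the translated speech
gTTS(translated, lang="es").save("output_audios/dubbed.mp3")

# 4. Merge the new audio with the video template
video = VideoFileClip("input_videos/template.mp4")
video.set_audio(AudioFileClip("output_audios/dubbed.mp3")).write_videofile(
    "output_audios/result.mp4", codec="libx264", audio_codec="aac"
)
```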
univox/
├── backend/
│ ├── server.py # Main FastAPI backend
│ ├── input_videos/ # Source videos
│ ├── output_audios/ # Translated/Cloned audio
│ └── uploads/ # User uploaded files
├── frontend/
│ ├── src/components/ # React components (Demo, Result, Translate)
│ └── public/ # Static assets
├── Univox-.../ # RVC Model Training Folder
│ └── Copy_of_RVC_v2_Disconnected.ipynb # RVC Training Notebook
└── README.md
To train your own voice model:
- Open `Copy_of_RVC_v2_Disconnected.ipynb` in Google Colab.
- Upload your dataset (a ZIP of WAV files; see the packaging sketch below) to your Google Drive in a folder named `rvcDisconnected`.
- Run the notebook cells sequentially to:
  - Install dependencies.
  - Preprocess data.
  - Extract features.
  - Train the model and index.
- Download the resulting `.pth` and `.index` files for use.
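If your clips are loose WAV files, a small hypothetical helper like this (not part of the notebook) can produce the expected ZIP:

```python
# Hypothetical helper that packages a folder of WAV clips into the ZIP
# the notebook expects in the rvcDisconnected Drive folder.
import pathlib
import zipfile

clips = sorted(pathlib.Path("my_voice_clips").glob("*.wav"))
with zipfile.ZipFile("my_voice.zip", "w", zipfile.ZIP_DEFLATED) as z:
    for clip in clips:
        z.write(clip, arcname=clip.name)  # flat layout: just the WAV files
print(f"Packed {len(clips)} clips into my_voice.zip")
```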
Backend:
cd backend
pip install fastapi uvicorn moviepy googletrans==3.1.0a0 gTTS pydub SpeechRecognition python-multipart
uvicorn server:app --reload
Frontend:
cd frontend
npm install
npm start
- Train: Use the notebook to create a voice model of your desired speaker.
- Upload: Use the React web app to upload a voice recording.
- Process: The backend processes the audio (translating or dubbing) and merges it with the video template.
- Result: View and download the final dubbed video directly in the browser.
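For orientation, here is a hedged sketch of what that upload-to-result round trip could look like in `server.py`; the `/dub` route and `process_upload` helper are illustrative names, not the actual API.

```python
# Hypothetical sketch of the upload -> process -> result round trip;
# the /dub route and process_upload helper are illustrative names.
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import FileResponse

app = FastAPI()

@app.post("/dub")
async def dub(file: UploadFile = File(...)):
    # Persist the user's recording, then run the dubbing pipeline
    path = f"uploads/{file.filename}"
    with open(path, "wb") as f:
        f.write(await file.read())
    result_path = process_upload(path)  # e.g., translate + merge with the template
    return FileResponse(result_path, media_type="video/mp4")
```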