Univox is a comprehensive AI-powered application for voice cloning, video dubbing, and cross-lingual audio translation. It leverages Retrieval-based Voice Conversion (RVC) to create high-quality voice models and integrates them into a full-stack web interface for dubbing video content.
- Voice Cloning (RVC): Train custom voice models using Retrieval-based Voice Conversion (RVC v2).
- Video Dubbing: Automatically merge cloned or translated audio with video templates using `moviepy`.
- Audio Translation: Translate spoken audio from one language to another (e.g., English to Spanish) using `GoogleTrans` and `gTTS`.
- Web Interface: User-friendly React frontend for uploading audio/video and viewing results.
The voice cloning functionality is powered by the RVC v2 Disconnected notebook (`Copy_of_RVC_v2_Disconnected.ipynb`). This process uses deep learning to learn the timbre and pitch of a target speaker.
The model runs on a Python environment (typically Google Colab) and requires specific deep learning libraries:
- Fairseq: For handling the HuBERT soft-content encoder.
- Faiss-GPU: For high-speed vector similarity search (used in the retrieval index).
- PyTorch: The core deep learning framework.
- FFmpeg & Praat: For audio signal processing and pitch extraction.
Before training, the input audio (dataset) undergoes several transformation steps:
- Sanitization: Audio files are converted to WAV format, and non-audio files are removed.
- Sample Rate Conversion: Audio is resampled to the target rate (e.g., 40k, 48k).
- Pitch Extraction (f0): The system extracts pitch data using algorithms like RMVPE (Robust Model for Vocal Pitch Estimation) or CREPE to ensure the cloned voice captures the correct intonation.
- Feature Extraction: The HuBERT model extracts "soft speech units" (content features) from the audio. These features represent what is being said, separate from how it sounds (see the sketch below).
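As a rough illustration of the resampling and pitch steps (the notebook uses its own pipeline), the sketch below uses librosa, with `pyin` standing in for RMVPE/CREPE; the file paths and the 40k rate are illustrative assumptions.

```python
# Illustrative preprocessing sketch; librosa's pyin stands in here for
# RMVPE/CREPE, and the paths/rates are assumptions, not the notebook's code.
import librosa
import soundfile as sf

SRC = "dataset/clip01.mp3"  # hypothetical input file
TARGET_SR = 40000           # e.g., the 40k training configuration

# Sanitize + resample: decode any supported format, downmix to mono,
# and resample to the target rate
y, sr = librosa.load(SRC, sr=TARGET_SR, mono=True)
sf.write("dataset/clip01.wav", y, sr)  # store as WAV for training

# Pitch (f0) extraction over a typical vocal range
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
print(f0.shape)  # one estimate per analysis frame (NaN where unvoiced)
```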
The training process consists of two main components:
- Generator & Discriminator: The model trains a Generator (G) to synthesize speech that sounds like the target speaker and a Discriminator (D) to distinguish between real and synthesized speech. It uses pretrained base models (e.g., `OV2Super`, `TITAN`) to accelerate learning.
- Index Training (Faiss): A feature index is trained on the extracted HuBERT features (see the sketch after this list). This index allows the model to "retrieve" style details from the reference audio during inference, reducing audio leakage and improving similarity.
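As a concrete, hedged illustration of the index step, this sketch builds an IVF index over saved HuBERT features with Faiss; the feature file, the 768-dim layout, and the cell-count heuristic are assumptions rather than the notebook's exact code.

```python
# Minimal sketch of index training: cluster the HuBERT features and
# store them for nearest-neighbour retrieval at inference time.
import faiss
import numpy as np

# Hypothetical path; shape assumed (N, 768) for RVC v2 features
feats = np.load("logs/my_voice/features.npy").astype("float32")
dim = feats.shape[1]
# Heuristic cell count, capped so each cell gets enough training points
n_ivf = max(1, min(int(16 * np.sqrt(len(feats))), len(feats) // 39))

index = faiss.index_factory(dim, f"IVF{n_ivf},Flat")
index.train(feats)  # learn the coarse quantizer (k-means over features)
index.add(feats)    # add every training feature to the index
faiss.write_index(index, "logs/my_voice/added.index")
```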
Once trained, the model (`.pth` file) and index (`.index` file) are used to convert source audio into the target voice while preserving the original speech rhythm and pitch.
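To make the "retrieve" step concrete, here is a hedged sketch of how the index can be queried at inference time and blended into the source features; `k=8`, the inverse-square distance weighting, and `index_rate` mirror common RVC settings but are assumptions here.

```python
# Sketch of feature retrieval at inference: each HuBERT frame of the
# source audio is blended with its nearest neighbours from the index.
import faiss
import numpy as np

index = faiss.read_index("logs/my_voice/added.index")
big_npy = np.load("logs/my_voice/features.npy").astype("float32")

# Stand-in for the real HuBERT frames of the audio being converted
source_feats = np.random.rand(200, 768).astype("float32")

score, ix = index.search(source_feats, 8)  # 8 nearest frames per query
weight = np.square(1.0 / (score + 1e-8))   # closer neighbours weigh more
weight /= weight.sum(axis=1, keepdims=True)
retrieved = np.sum(big_npy[ix] * weight[:, :, None], axis=1)

index_rate = 0.75  # how strongly to lean on retrieved (target-voice) features
blended = index_rate * retrieved + (1 - index_rate) * source_feats
```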
Voice Cloning:
- RVC v2 (Retrieval-based Voice Conversion)
- PyTorch
- Fairseq
- Faiss (Facebook AI Similarity Search)

Backend:
- FastAPI: High-performance web framework.
- MoviePy: For video editing and audio/video merging.
- GoogleTrans & gTTS: For translation and text-to-speech.
- PyDub & SpeechRecognition: For audio manipulation and transcription (see the pipeline sketch below).

Frontend:
- React.js: UI component library.
- Axios: For API requests.
- CSS Modules: For component-scoped styling.
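To show how the backend libraries fit together, here is a minimal end-to-end sketch of the transcribe, translate, synthesize, and merge flow (assuming MoviePy 1.x); all file names and language codes are illustrative, not Univox's actual code.

```python
# Illustrative translation pipeline using the backend libraries;
# file names and wiring are assumptions, not Univox's exact code.
import speech_recognition as sr
from googletrans import Translator
from gtts import gTTS
from moviepy.editor import AudioFileClip, VideoFileClip  # MoviePy 1.x API

# 1. Transcribe the uploaded recording
recognizer = sr.Recognizer()
with sr.AudioFile("uploads/input.wav") as source:
    audio = recognizer.record(source)
text = recognizer.recognize_google(audio, language="en-US")

# 2. Translate the transcript (e.g., English -> Spanish)
translated = Translator().translate(text, src="en", dest="es").text

# 3. Synthesize the translated speech
gTTS(translated, lang="es").save("output_audios/dubbed.mp3")

# 4. Merge the new audio with the video template
video = VideoFileClip("input_videos/template.mp4")
video.set_audio(AudioFileClip("output_audios/dubbed.mp3")).write_videofile(
    "output_audios/result.mp4", codec="libx264", audio_codec="aac"
)
```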
univox/
├── backend/
│ ├── server.py # Main FastAPI backend
│ ├── input_videos/ # Source videos
│ ├── output_audios/ # Translated/Cloned audio
│ └── uploads/ # User uploaded files
├── frontend/
│ ├── src/components/ # React components (Demo, Result, Translate)
│ └── public/ # Static assets
├── Univox-.../ # RVC Model Training Folder
│ └── Copy_of_RVC_v2_Disconnected.ipynb # RVC Training Notebook
└── README.md
To train your own voice model:
- Open `Copy_of_RVC_v2_Disconnected.ipynb` in Google Colab.
- Upload your dataset (a ZIP of WAV files; see the packaging sketch below) to your Google Drive in a folder named `rvcDisconnected`.
- Run the notebook cells sequentially to:
  - Install dependencies.
  - Preprocess data.
  - Extract features.
  - Train the model and index.
- Download the resulting `.pth` and `.index` files for use.
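If your clips are loose WAV files, a small hypothetical helper like this (not part of the notebook) can produce the expected ZIP:

```python
# Hypothetical helper that packages a folder of WAV clips into the ZIP
# the notebook expects in the rvcDisconnected Drive folder.
import pathlib
import zipfile

clips = sorted(pathlib.Path("my_voice_clips").glob("*.wav"))
with zipfile.ZipFile("my_voice.zip", "w", zipfile.ZIP_DEFLATED) as z:
    for clip in clips:
        z.write(clip, arcname=clip.name)  # flat layout: just the WAV files
print(f"Packed {len(clips)} clips into my_voice.zip")
```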
Backend:
cd backend
pip install fastapi uvicorn moviepy googletrans==3.1.0a0 gTTS pydub SpeechRecognition python-multipart
uvicorn server:app --reload
Frontend:
cd frontend
npm install
npm start
- Train: Use the notebook to create a voice model of your desired speaker.
- Upload: Use the React web app to upload a voice recording.
- Process: The backend processes the audio (translating or dubbing) and merges it with the video template.
- Result: View and download the final dubbed video directly in the browser.
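For orientation, here is a hedged sketch of what that upload-to-result round trip could look like in `server.py`; the `/dub` route and `process_upload` helper are illustrative names, not the actual API.

```python
# Hypothetical sketch of the upload -> process -> result round trip;
# the /dub route and process_upload helper are illustrative names.
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import FileResponse

app = FastAPI()

@app.post("/dub")
async def dub(file: UploadFile = File(...)):
    # Persist the user's recording, then run the dubbing pipeline
    path = f"uploads/{file.filename}"
    with open(path, "wb") as f:
        f.write(await file.read())
    result_path = process_upload(path)  # e.g., translate + merge with the template
    return FileResponse(result_path, media_type="video/mp4")
```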