This project is a real-time speech transcription and translation application developed using Python. It leverages advanced machine learning models and provides a user-friendly interface for capturing audio, transcribing speech, and translating it into English in real-time.
Note: This is currently a conceptual project and contains known bugs. It serves as a demonstration and requires additional development for production use.
- Real-Time Audio Capture: Utilizes the microphone to capture audio streams with minimal latency.
- Advanced Speech Recognition: Implements OpenAI's Whisper Large-V3 Turbo model for high-accuracy transcription.
- Language Support: Capable of transcribing over 90 languages, with automatic language detection.
- Real-Time Translation: Translates transcribed text into English on-the-fly using the same model.
- Voice Activity Detection (VAD): Integrates Silero VAD for efficient and accurate speech segment detection.
- Customizable Settings: Offers adjustable parameters for VAD and transcription to optimize performance.
- Graphical User Interface: Built with PyQt5, providing an intuitive and responsive user experience.
- Multithreaded Processing: Employs threading for concurrent audio capture, processing, and UI updates without blocking.
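The capture path can be sketched as a callback that copies each microphone block into a bounded, thread-safe queue. This is an illustrative sketch, not the project's actual code: the names `audio_queue` and `audio_callback` are hypothetical, and in the real application the callback would be handed to `sounddevice.InputStream` (shown in comments).

```python
import queue

import numpy as np

# Bounded queue: a stalled consumer cannot grow memory without limit.
audio_queue: "queue.Queue[np.ndarray]" = queue.Queue(maxsize=100)

def audio_callback(indata, frames, time_info, status):
    """Callback with the signature sounddevice.InputStream expects.

    Copies the incoming block (the driver reuses its buffer) and
    drops it if the queue is full rather than blocking the audio
    thread, which must return quickly to avoid glitches.
    """
    try:
        audio_queue.put_nowait(indata.copy())
    except queue.Full:
        pass  # drop the block; never block inside the audio callback

# In the real application, something like:
# import sounddevice as sd
# with sd.InputStream(samplerate=16000, channels=1, callback=audio_callback):
#     ...  # consumer thread drains audio_queue
```

Dropping blocks under backpressure (instead of blocking) is the usual choice here, since the audio callback runs on a latency-sensitive driver thread.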
- Python 3.7+
- PyQt5: For the graphical user interface.
- PyTorch: As the deep learning framework.
- Hugging Face Transformers: To access and run the Whisper model.
- SoundDevice: For real-time audio capture.
- NumPy: For efficient numerical computations.
- Silero VAD: For voice activity detection.
- Python Threading (standard library): To handle real-time processing efficiently.
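Device selection for the Whisper model typically follows the standard PyTorch pattern: half precision on a CUDA GPU for speed, full precision on CPU for stability. The helper below is a hypothetical sketch of that logic as a pure function (so it can be shown without a GPU); the commented lines illustrate how it might feed a Transformers pipeline.

```python
def pick_device(cuda_available: bool) -> tuple[str, str]:
    """Choose device and dtype for running the Whisper model.

    float16 on GPU for throughput, float32 on CPU for numerical
    stability. In the application the flag would come from
    torch.cuda.is_available().
    """
    if cuda_available:
        return "cuda:0", "float16"
    return "cpu", "float32"

# In the real application, roughly:
# import torch
# from transformers import pipeline
# device, dtype = pick_device(torch.cuda.is_available())
# asr = pipeline("automatic-speech-recognition",
#                model="openai/whisper-large-v3-turbo",
#                device=device, torch_dtype=dtype)
```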
- Modular Codebase: The project is divided into clear modules for audio capture, processing, transcription, and UI.
- Thread-Safe Operations: Utilizes locks and thread-safe queues to manage data between threads safely.
- Asynchronous Processing: Ensures that audio capture and transcription do not block the main thread, providing a smooth user experience.
- Optimized for Performance:
  - GPU Acceleration: Automatically detects and utilizes CUDA-compatible GPUs for running the Whisper model.
  - Efficient Memory Management: Uses deques and queues with maximum sizes to prevent unbounded memory growth.
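The two bounded-buffer tools mentioned above behave differently when full, and that difference drives where each is used. A minimal demonstration (standard library only; the variable names are illustrative): a `deque` with `maxlen` silently evicts the oldest item, which suits "keep the last N chunks of audio", while a bounded `queue.Queue` applies backpressure by raising `queue.Full`.

```python
import queue
from collections import deque

# deque(maxlen=...) discards the oldest entry once full:
recent_chunks = deque(maxlen=3)
for i in range(5):
    recent_chunks.append(i)
# recent_chunks now holds [2, 3, 4]

# A bounded Queue applies backpressure instead of discarding:
work_queue = queue.Queue(maxsize=2)
work_queue.put_nowait("a")
work_queue.put_nowait("b")
overflowed = False
try:
    work_queue.put_nowait("c")  # third item exceeds maxsize=2
except queue.Full:
    overflowed = True
```

`Queue` is also thread-safe out of the box, whereas a shared `deque` generally needs an external lock for multi-step operations.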
- Python 3.7 or higher
- Pip package manager
- Virtual Environment (recommended)
- CUDA-compatible GPU (optional for performance enhancement)
- Microphone connected to your computer
- Clone the Repository

  ```bash
  git clone https://github.com/DataDrivenDNA/real-time-transcription.git
  cd real-time-transcription
  ```

- Set Up a Virtual Environment

  ```bash
  python -m venv venv
  # Activate the virtual environment:
  # On Windows:
  venv\Scripts\activate
  # On Unix or macOS:
  source venv/bin/activate
  ```

- Install Dependencies

  ```bash
  pip install --upgrade pip
  pip install -r requirements.txt
  ```

- Run the Application

  ```bash
  python main.py
  ```
- Start Button: Begins capturing audio and starts the transcription and translation process.
- Stop Button: Stops the audio capture and processing.
- Language Selection Dropdown: Allows the user to select the source language or opt for automatic detection.
- VAD Settings:
  - VAD Window Size: Adjusts the duration of audio analyzed for speech detection.
  - VAD Hop Size: Sets the step size between consecutive VAD windows.
  - No Speech Threshold: Configures the sensitivity for detecting the end of speech.
- Transcription Area: Displays the real-time transcribed text.
- Translation Area: Shows the translated text in English.
- Ctrl+S: Start transcription and translation.
- Ctrl+Q: Stop the process.
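The window and hop settings interact in a simple way: each window covers a fixed span of samples, and consecutive windows start one hop apart, so a smaller hop yields more (overlapping) windows and finer-grained speech/no-speech decisions at higher compute cost. A sketch of that arithmetic (the function name is hypothetical, not from the project):

```python
def vad_frame_count(num_samples: int, window: int, hop: int) -> int:
    """Number of full VAD windows that fit over a sample buffer.

    The first window starts at sample 0; each subsequent window
    starts `hop` samples later. Only windows that fit entirely
    within the buffer are counted.
    """
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop

# One second at 16 kHz with 512-sample windows and a 256-sample hop:
# vad_frame_count(16000, 512, 256)  # → 61
```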
- Real-Time Processing: Achieved low-latency audio capture and processing by optimizing buffer management and using efficient data structures such as `deque`.
- Thread Synchronization: Ensured thread-safe operations with locks and thread-safe queues to prevent race conditions.
- Model Integration: Integrated the large Whisper model efficiently, handling device management (CPU/GPU) and data types for optimal performance.
- VAD Accuracy: Fine-tuned the VAD parameters and implemented post-processing steps to improve speech segment detection and reduce false positives.
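The thread-synchronization pattern described above can be illustrated with a small lock-guarded buffer (this class is a hypothetical sketch, not the project's code): worker threads append transcribed segments while the UI thread takes consistent snapshots, and the lock ensures the two never race.

```python
import threading

class TranscriptBuffer:
    """Accumulates transcribed segments from worker threads.

    A lock guards the underlying list so concurrent appends from
    a transcription thread and reads from a UI thread cannot
    interleave mid-operation.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._segments: list[str] = []

    def append(self, text: str) -> None:
        with self._lock:
            self._segments.append(text)

    def snapshot(self) -> str:
        with self._lock:
            return " ".join(self._segments)

# Several threads appending concurrently; all segments survive:
buf = TranscriptBuffer()
threads = [
    threading.Thread(target=buf.append, args=(f"seg{i}",)) for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In CPython the GIL makes single `list.append` calls atomic anyway, but the lock is still needed for compound operations (read-modify-write, join-while-appending) and keeps the design correct under any Python implementation.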
- Dockerization: Containerizing the application for easier deployment.
- Cross-Platform Support: Ensuring compatibility across different operating systems.
- Enhanced UI/UX: Adding features like dark mode, custom themes, and more interactive controls.
- Additional Language Translation: Extending translation capabilities beyond English.
- Machine Learning: Practical experience with state-of-the-art NLP models.
- Software Development: Applied Python programming and software architecture design.
- Multithreading and Concurrency: Managed complex threading scenarios to maintain application responsiveness.
- User Interface Design: Developed a functional and user-friendly GUI using PyQt5.
- Performance Optimization: Optimized resource usage for real-time processing requirements.
- Problem-Solving: Addressed and resolved challenges related to audio processing and model integration.
Published as part of a portfolio to showcase expertise in machine learning, real-time systems, and software development.