Warning
This is an ongoing personal project currently in active development. Features are subject to change.
Make your screen talk back. Real-time AI desktop narration.
ScreenBanter is an amateur project that marries Google’s Gemini Vision (for high-speed OCR) and the new Gemini 2.5 Flash Preview TTS (for cloud audio) to provide real-time desktop narration.
New in v0.3.0: The application is now distributed as a lightweight Lite Client (~100MB) with Cloud TTS support out-of-the-box. Local neural TTS (via Microsoft VibeVoice-0.5B) is fully supported through a "Bring Your Own Engine" (BYOE) model, allowing power users to host their own inference server.
- Cloud TTS (Default): High-quality, low-latency narration using Gemini 2.5 Flash Preview (`gemini-2.5-flash-preview-tts`) with 30+ native voices. Zero local GPU load.
- Local TTS (BYOE): Connect to your own local VibeVoice instance for private, offline, neural speech generation.
- Smart Vision: Uses Gemini 2.5 Flash Lite (`models/gemini-flash-lite-latest`) via Gemini API for intelligent text extraction and context-aware merging of multiple screenshots.
The easiest way to use ScreenBanter is to download the latest pre-compiled build:
- Go to the GitHub Actions tab.
- Select the latest successful "Build Windows Executable" run.
- Scroll down to Artifacts and download `ScreenBanter_Windows_Executable`.
- Extract the ZIP, create a `.env` file with your `GEMINI_KEY`, and run `ScreenBanter.exe`.
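For reference, a `.env` file is a plain list of `KEY=value` lines. The sketch below shows one way such a file could be read using only the Python standard library. The helper names `load_env_file` and `require_gemini_key` are hypothetical, and ScreenBanter's own loader may work differently.

```python
from pathlib import Path
from typing import Dict


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse KEY=value lines from a .env file (hypothetical helper)."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def require_gemini_key(env: Dict[str, str]) -> str:
    """Return GEMINI_KEY, or raise a clear error if it is missing or empty."""
    key = env.get("GEMINI_KEY", "")
    if not key:
        raise RuntimeError("GEMINI_KEY is missing; add it to your .env file.")
    return key
```

A quick check before launching would be `require_gemini_key(load_env_file())`, which fails early with a readable message instead of a later API error.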
Lite Client (Cloud Only)
- OS: Windows 10/11 (Required for `DXcam` and Win32 tray integration).
- Internet: Active connection for Gemini API.
Local TTS Engine (Optional)
- GPU: NVIDIA GPU with CUDA 12.1 support (RTX 3060+ recommended).
- Python: 3.10+ installed.
- Git: Installed.
- Clone the Repository: Run `git clone https://github.com/alfred1137/ScreenBanter.git`, then `cd ScreenBanter`.
- Set Up Environment Variables: Create a `.env` file from the example with `cp .env.example .env`, then edit `.env` and add your `GEMINI_KEY` from Google AI Studio.
- Install Dependencies (Lite): Using `uv`, run `uv sync`.
- Optional: Set Up Local TTS: To use VibeVoice locally, follow the Local TTS Setup Guide. If developing locally, run `uv sync --extra local-tts`.
1. Launch the Application
Starts the system tray app: `uv run python -m app.main`
Wait for the announcement: "ScreenBanter is active."
2. Controls (Default Hotkeys)
| Hotkey | Action | Description |
|---|---|---|
| Ctrl + Alt + S | Instant Capture | Narrates the current screen/region immediately. |
| F10 | Queue Screenshot | Adds current view to buffer (confirmed by a beep). |
| F11 | Process Queue | Merges all queued captures and narrates the result. |
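
The queue-then-process flow behind F10 and F11 can be pictured as a small buffer. This is only a sketch: `CaptureQueue` and its `merge` callback are hypothetical names, and the callback stands in for the Gemini-based merging the app performs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CaptureQueue:
    """Buffers captures until the user asks to process them (illustrative only)."""

    # Stand-in for the step that merges several screenshots into one text.
    merge: Callable[[List[bytes]], str]
    _frames: List[bytes] = field(default_factory=list, init=False)

    def enqueue(self, frame: bytes) -> int:
        """Add one capture (the F10 action) and return the new queue length."""
        self._frames.append(frame)
        return len(self._frames)

    def process(self) -> Optional[str]:
        """Merge all queued captures (the F11 action) and clear the buffer."""
        if not self._frames:
            return None  # Nothing queued, so there is nothing to narrate.
        frames, self._frames = self._frames, []
        return self.merge(frames)
```

Swapping the buffer out before calling `merge` means a capture queued while merging is in progress lands in the next batch rather than being lost.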
3. Banter HUD
ScreenBanter features a non-intrusive HUD that appears automatically during operation:
- Scanning: Indicates Gemini is analyzing the screen.
- Thinking: Displays the extracted text for verification.
- Speaking: Shows playback status.

The HUD is "click-through" and will not steal focus from your active game.
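
One way to think about the three HUD phases is as a tiny state machine that cycles once per narration. The sketch below is illustrative only: the `HudState` enum, the extra `IDLE` state, and the strict ordering are assumptions, not the app's actual implementation.

```python
from enum import Enum


class HudState(Enum):
    IDLE = "idle"          # Assumed resting state; not named in the list above.
    SCANNING = "scanning"  # Gemini is analyzing the screen.
    THINKING = "thinking"  # Extracted text is shown for verification.
    SPEAKING = "speaking"  # Audio playback is in progress.


# Assumed order of phases for a single narration cycle.
_NEXT_STATE = {
    HudState.IDLE: HudState.SCANNING,
    HudState.SCANNING: HudState.THINKING,
    HudState.THINKING: HudState.SPEAKING,
    HudState.SPEAKING: HudState.IDLE,
}


def advance(state: HudState) -> HudState:
    """Return the phase that follows `state` in one narration cycle."""
    return _NEXT_STATE[state]
```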
Access settings by right-clicking the Loudspeaker icon in the system tray.
- Hotkeys: Rebind any action to your preferred key combinations.
- Audio:
  - Cloud: Select Gemini Model and Voice (e.g., `Puck`, `Kore`).
  - Local: Configure external engine path and select VibeVoice presets.
- Capture Mode: Toggle between `Fullscreen` and `Region`. In Region mode, use the interactive selector to define your capture area.
- HUD / UI: Toggle the Banter HUD, adjust opacity, and configure focus behavior (Immersive vs. Focus mode).
- Performance: Configure "Process Priority" and "Playback Buffer" to optimize for your hardware.
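
To make the options above concrete, here is a minimal sketch of how such settings could be modeled, validated, and saved as JSON. Every field name, default value, and the `Settings` class itself are hypothetical; the real schema lives in `app/settings.py` and may look quite different.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


def _default_hotkeys() -> Dict[str, str]:
    # Mirrors the default hotkeys listed in the Controls table.
    return {
        "instant_capture": "ctrl+alt+s",
        "queue_screenshot": "f10",
        "process_queue": "f11",
    }


@dataclass
class Settings:
    """Hypothetical settings model; names are illustrative, not the app's schema."""

    hotkeys: Dict[str, str] = field(default_factory=_default_hotkeys)
    tts_backend: str = "cloud"        # "cloud" or "local"
    voice: str = "Kore"
    capture_mode: str = "fullscreen"  # "fullscreen" or "region"
    hud_enabled: bool = True
    hud_opacity: float = 0.8          # Assumed range: 0.0 to 1.0

    def validate(self) -> None:
        """Raise ValueError if any option holds an unsupported value."""
        if self.tts_backend not in ("cloud", "local"):
            raise ValueError(f"unknown TTS backend: {self.tts_backend}")
        if self.capture_mode not in ("fullscreen", "region"):
            raise ValueError(f"unknown capture mode: {self.capture_mode}")
        if not 0.0 <= self.hud_opacity <= 1.0:
            raise ValueError("hud_opacity must be between 0.0 and 1.0")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Settings":
        settings = cls(**json.loads(text))
        settings.validate()
        return settings
```

Validating on load keeps a hand-edited settings file from putting the app into an undefined state.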
ScreenBanter/
├── app/ # Frontend Daemon & GUI
│ ├── main.py # Application entry & Tray management
│ ├── capture.py # DXcam screen capture logic
│ ├── vision.py # Gemini API integration
│ ├── audio_client.py # Threaded PyAudio playback
│ ├── settings.py # Configuration management
│ ├── settings_window.py# CustomTkinter Settings GUI
│ └── region_selector.py# Transparent overlay for region selection
├── server/ # Local Inference Server Logic
│ ├── tts_server.py # FastAPI application (for local dev/BYOE)
│ └── model_loader.py # VibeVoice initialization
├── docs/ # Documentation
│ └── setup_local_tts.md# Guide for external engine setup
├── third_party/ # VibeVoice source code (submodule)
graph TD
User[User] -->|Hotkey| Trigger[Capture Trigger]
Trigger -->|DXcam| Capture[Screen/Region Capture]
Capture -->|Image Data| Vision[Gemini Vision Engine]
Vision -->|Extracted Text| Client[Audio Client]
Client -->|Option A: API| Cloud[Gemini Cloud TTS]
Client -->|Option B: Subprocess| Local[External VibeVoice Server]
Cloud -->|Audio Bytes| Playback[PyAudio Stream]
Local -->|Audio Bytes| Playback
Playback -->|Sound| Speakers[User Speakers]
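
The data flow in the diagram can be sketched as a pipeline with a swappable TTS backend. This is a simplified illustration under stated assumptions: `TtsBackend`, `FakeCloudTts`, `FakeLocalTts`, and `narrate` are hypothetical names, and the fake backends return placeholder bytes instead of calling Gemini or a VibeVoice server.

```python
from typing import Callable, Protocol


class TtsBackend(Protocol):
    """Anything that turns text into audio bytes (Option A or Option B)."""

    def synthesize(self, text: str) -> bytes: ...


class FakeCloudTts:
    """Placeholder for the cloud path; a real backend would call the Gemini API."""

    def synthesize(self, text: str) -> bytes:
        return b"cloud:" + text.encode("utf-8")


class FakeLocalTts:
    """Placeholder for the local path; a real backend would query the external engine."""

    def synthesize(self, text: str) -> bytes:
        return b"local:" + text.encode("utf-8")


def narrate(
    capture: Callable[[], bytes],
    extract_text: Callable[[bytes], str],
    tts: TtsBackend,
    play: Callable[[bytes], None],
) -> None:
    """Run one capture -> vision -> TTS -> playback cycle."""
    image = capture()
    text = extract_text(image)
    if not text.strip():
        return  # No readable text on screen, so skip synthesis and playback.
    play(tts.synthesize(text))
```

Because `narrate` depends only on the `synthesize` method, switching between cloud and local speech is a matter of passing a different backend object.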
- Instant Narration
- Batch Mode (Queueing)
- Custom Region Capture
- Settings GUI (Modern UI)
- Local Neural TTS Integration (BYOE)
- Cloud TTS (Gemini) Integration
- Dynamic Voice/Device Selection
- Standalone "Lite" Client Build
See the open issues for a full list of proposed features.
Contributions are welcome! If you have suggestions or bug fixes:
- Fork the Project.
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`).
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`).
- Push to the Branch (`git push origin feature/AmazingFeature`).
- Open a Pull Request.
Note: As an amateur project, PR reviews might take some time!
Distributed under the MIT License. See LICENSE for more information.
Alfred T - GitHub Profile
Project Link: https://github.com/alfred1137/ScreenBanter
- Microsoft/VibeVoice - Exceptional local TTS.
- Google Gemini Vision - High-speed multi-modal OCR.
- DXcam - Ultra-fast Windows screen capture.
- CustomTkinter - Modernizing Python GUIs.
- 2026-01-25: Fixed build dependencies and migrated to PyInstaller for CI stability. (v0.3.5)
- 2026-01-24: Refactored to Lite Client architecture (v0.3.0).
- 2026-01-20: Verified full integration workflow (HUD, 4-bit TTS, Region Capture) on Windows 11 with CUDA 12.1.
- 2026-01-16: Added Performance Mode (4-bit quantization, priority boosting) and Banter HUD for seamless gaming integration.
- 2026-01-14: Enhanced documentation, added Region Capture and Settings GUI polish.
- 2026-01-10: Implemented Settings GUI and dynamic configuration infrastructure.
- 2026-01-05: Initial MVP release with Gemini OCR and VibeVoice TTS integration.