
Forks Stargazers Issues MIT License

Python FastAPI PyTorch UV Google Gemini

Warning

This is an ongoing personal project currently in active development. Features are subject to change.


Logo

ScreenBanter

Make your screen talk back. Real-time AI desktop narration.

Report Bug · Request Feature

Table of Contents
  1. 📖 Overview
  2. ✨ Features
  3. 📦 Technologies
  4. 🚀 Installation & Setup
  5. 🛠️ Usage
  6. 🔧 Configuration
  7. 🗂️ Repository Structure
  8. 🔗 Flow Chart
  9. Roadmap
  10. 🤝 Contributing
  11. License
  12. Contact
  13. ❤️ Acknowledgements
  14. 📝 Changelog

📖 Overview

ScreenBanter is an amateur project that pairs Google’s Gemini Vision (for high-speed OCR) with Gemini 2.5 Flash Preview TTS (for cloud audio) to provide real-time desktop narration.

New in v0.3.0: The application is now distributed as a lightweight Lite Client (~100 MB) with Cloud TTS support out of the box. Local neural TTS (via Microsoft VibeVoice-0.5B) is fully supported through a "Bring Your Own Engine" (BYOE) model, allowing power users to host their own inference server.

🌟 Key Features

  • Cloud TTS (Default): High-quality, low-latency narration using Gemini 2.5 Flash Preview (gemini-2.5-flash-preview-tts) with 30+ native voices. Zero local GPU load.
  • Local TTS (BYOE): Connect to your own local VibeVoice instance for private, offline, neural speech generation.
  • Smart Vision: Uses Gemini 2.5 Flash Lite (models/gemini-flash-lite-latest) via the Gemini API for intelligent text extraction and context-aware merging of multiple screenshots (see the sketch below).
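
For illustration, here is a minimal sketch of these two Gemini calls using the google-genai Python SDK. The model names come from this README; the prompt, voice choice, and response handling are assumptions rather than the application's actual code.

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_GEMINI_KEY")

# 1) Smart Vision: extract on-screen text from a screenshot (PNG bytes).
with open("screenshot.png", "rb") as f:
    image_bytes = f.read()

vision = client.models.generate_content(
    model="models/gemini-flash-lite-latest",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Extract the readable text from this screenshot.",
    ],
)
extracted_text = vision.text

# 2) Cloud TTS: synthesize speech with one of the prebuilt voices (e.g. Kore).
tts = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=extracted_text,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)
# Raw PCM audio bytes; the exact field layout may vary between SDK versions.
audio_bytes = tts.candidates[0].content.parts[0].inline_data.data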

(back to top)

🚀 Installation & Setup

📥 Download (Lite Client)

The easiest way to use ScreenBanter is to download the latest pre-compiled build:

  1. Go to the GitHub Actions tab.
  2. Select the latest successful "Build Windows Executable" run.
  3. Scroll down to Artifacts and download ScreenBanter_Windows_Executable.
  4. Extract the ZIP, create a .env file with your GEMINI_KEY, and run ScreenBanter.exe.

✅ Requirements

Lite Client (Cloud Only)

  • OS: Windows 10/11 (Required for DXcam and Win32 tray integration).
  • Internet: Active connection for Gemini API.

Local TTS Engine (Optional)

  • GPU: NVIDIA GPU with CUDA 12.1 support (RTX 3060+ recommended).
  • Python: 3.10+ installed.
  • Git: Installed.

Installation (Source)

  1. Clone the Repository

    git clone https://github.com/alfred1137/ScreenBanter.git
    cd ScreenBanter
  2. Setup Environment Variables. Create a .env file from the example:

    cp .env.example .env

    Edit .env and add your GEMINI_KEY from Google AI Studio.
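
    For reference, a minimal .env needs only this one key (assuming .env.example follows the same KEY=value layout; the value below is a placeholder):

    GEMINI_KEY=your-google-ai-studio-api-key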

  3. Install Dependencies (Lite). Using uv:

    uv sync
  4. Optional: Setup Local TTS. To use VibeVoice locally, follow the Local TTS Setup Guide.

    If developing locally:

    uv sync --extra local-tts
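
By way of illustration, a BYOE client call to a locally hosted VibeVoice server might look like the sketch below. The URL, route, and payload fields are placeholders; the real contract is defined by server/tts_server.py and the Local TTS Setup Guide.

import requests

# Hypothetical request to a local VibeVoice FastAPI server; the endpoint,
# port, and JSON fields are illustrative placeholders, not the actual API.
resp = requests.post(
    "http://127.0.0.1:8000/tts",
    json={"text": "Hello from ScreenBanter", "preset": "default"},
    timeout=60,
)
resp.raise_for_status()
audio_bytes = resp.content  # assumed: raw audio returned in the response body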

(back to top)

🛠️ Usage

1. Launch the Application. This starts the system tray app:

uv run python -m app.main

Wait for the announcement: "ScreenBanter is active."

2. Controls (Default Hotkeys)

Hotkey           Action            Description
Ctrl + Alt + S   Instant Capture   Narrates the current screen/region immediately.
F10              Queue Screenshot  Adds the current view to the buffer (confirmed by a beep).
F11              Process Queue     Merges all queued captures and narrates the result.
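
The README does not specify which library handles the global hotkeys; purely as an illustration, the keyboard package could wire up the three defaults like this (the callbacks are stand-ins for the real capture logic):

import keyboard  # assumption: the app may use a different hotkey library

def instant_capture():
    print("capture and narrate the current screen/region")

def queue_screenshot():
    print("add the current view to the buffer")

def process_queue():
    print("merge queued captures and narrate the result")

keyboard.add_hotkey("ctrl+alt+s", instant_capture)
keyboard.add_hotkey("f10", queue_screenshot)
keyboard.add_hotkey("f11", process_queue)
keyboard.wait()  # block and keep listening for hotkeys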

3. Banter HUD. ScreenBanter features a non-intrusive HUD that appears automatically during operation:

  • Scanning: Indicates Gemini is analyzing the screen.
  • Thinking: Displays the extracted text for verification.
  • Speaking: Shows playback status.

The HUD is "click-through" and will not steal focus from your active game.
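
"Click-through" overlays are usually achieved with Win32 extended window styles; a minimal sketch with pywin32 is shown below (how the actual HUD implements this may differ):

import win32con
import win32gui

def make_click_through(hwnd: int) -> None:
    # Add the layered, transparent, and no-activate extended styles so the
    # window ignores mouse input and never steals keyboard focus.
    style = win32gui.GetWindowLong(hwnd, win32con.GWL_EXSTYLE)
    style |= (win32con.WS_EX_LAYERED
              | win32con.WS_EX_TRANSPARENT
              | win32con.WS_EX_NOACTIVATE)
    win32gui.SetWindowLong(hwnd, win32con.GWL_EXSTYLE, style)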

(back to top)

🔧 Configuration

Access settings by right-clicking the Loudspeaker icon in the system tray.

  • Hotkeys: Rebind any action to your preferred key combinations.
  • Audio:
    • Cloud: Select Gemini Model and Voice (e.g., Puck, Kore).
    • Local: Configure external engine path and select VibeVoice presets.
  • Capture Mode: Toggle between Fullscreen and Region. In Region mode, use the interactive selector to define your capture area.
  • HUD / UI: Toggle the Banter HUD, adjust opacity, and configure focus behavior (Immersive vs. Focus mode).
  • Performance: Configure "Process Priority" and "Playback Buffer" to optimize for your hardware.
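
To give a sense of what these options cover, a hypothetical defaults dictionary is sketched below; the keys and values are illustrative only, and the real schema lives in app/settings.py:

# Hypothetical defaults; illustrative only. See app/settings.py for the real schema.
DEFAULT_SETTINGS = {
    "hotkeys": {"instant_capture": "ctrl+alt+s", "queue_screenshot": "f10", "process_queue": "f11"},
    "audio": {"backend": "cloud", "voice": "Kore", "local_engine_path": ""},
    "capture": {"mode": "fullscreen", "region": None},
    "hud": {"enabled": True, "opacity": 0.85, "focus_mode": "immersive"},
    "performance": {"process_priority": "high", "playback_buffer_ms": 200},
}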

(back to top)

🗂️ Repository Structure

ScreenBanter/
├── app/                  # Frontend Daemon & GUI
│   ├── main.py           # Application entry & Tray management
│   ├── capture.py        # DXcam screen capture logic
│   ├── vision.py         # Gemini API integration
│   ├── audio_client.py   # Threaded PyAudio playback
│   ├── settings.py       # Configuration management
│   ├── settings_window.py # CustomTkinter Settings GUI
│   └── region_selector.py # Transparent overlay for region selection
├── server/               # Local Inference Server Logic
│   ├── tts_server.py     # FastAPI application (for local dev/BYOE)
│   └── model_loader.py   # VibeVoice initialization
├── docs/                 # Documentation
│   └── setup_local_tts.md # Guide for external engine setup
├── third_party/          # VibeVoice source code (submodule)

(back to top)

🔗 Flow Chart

graph TD
    User[User] -->|Hotkey| Trigger[Capture Trigger]
    Trigger -->|DXcam| Capture[Screen/Region Capture]
    Capture -->|Image Data| Vision[Gemini Vision Engine]
    Vision -->|Extracted Text| Client[Audio Client]
    
    Client -->|Option A: API| Cloud[Gemini Cloud TTS]
    Client -->|Option B: Subprocess| Local[External VibeVoice Server]
    
    Cloud -->|Audio Bytes| Playback[PyAudio Stream]
    Local -->|Audio Bytes| Playback
    
    Playback -->|Sound| Speakers[User Speakers]
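
To make the two ends of this pipeline concrete, the sketch below grabs a frame with DXcam and plays raw PCM with PyAudio. The frame handling, sample rate, and format are assumptions (16-bit mono at 24 kHz is typical for Gemini TTS output), not the app's exact code.

import dxcam
import pyaudio

# Capture: grab one desktop frame as a numpy array (returns None if no new frame).
camera = dxcam.create()
frame = camera.grab()  # pass region=(left, top, right, bottom) for region capture

# ... frame -> Gemini Vision -> TTS happens in between (see the earlier sketches) ...
audio_bytes = b""  # placeholder for the PCM bytes returned by the TTS backend

# Playback: stream raw PCM through the default output device.
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
stream.write(audio_bytes)
stream.stop_stream()
stream.close()
pa.terminate()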

(back to top)

Roadmap

  • Instant Narration
  • Batch Mode (Queueing)
  • Custom Region Capture
  • Settings GUI (Modern UI)
  • Local Neural TTS Integration (BYOE)
  • Cloud TTS (Gemini) Integration
  • Dynamic Voice/Device Selection
  • Standalone "Lite" Client Build

See the open issues for a full list of proposed features.

(back to top)

🤝 Contributing

Contributions are welcome! If you have suggestions or bug fixes:

  1. Fork the Project.
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature).
  3. Commit your Changes (git commit -m 'Add some AmazingFeature').
  4. Push to the Branch (git push origin feature/AmazingFeature).
  5. Open a Pull Request.

Note: As this is an amateur project, PR reviews might take some time!

(back to top)

License

Distributed under the MIT License. See LICENSE for more information.

(back to top)

Contact

Alfred T - GitHub Profile

Project Link: https://github.com/alfred1137/ScreenBanter

(back to top)

❤️ Acknowledgements

(back to top)

📝 Changelog

  • 2026-01-25: Fixed build dependencies and migrated to PyInstaller for CI stability. (v0.3.5)
  • 2026-01-24: Refactored to Lite Client architecture (v0.3.0).
  • 2026-01-20: Verified full integration workflow (HUD, 4-bit TTS, Region Capture) on Windows 11 with CUDA 12.1.
  • 2026-01-16: Added Performance Mode (4-bit quantization, priority boosting) and Banter HUD for seamless gaming integration.
  • 2026-01-14: Enhanced documentation, added Region Capture and Settings GUI polish.
  • 2026-01-10: Implemented Settings GUI and dynamic configuration infrastructure.
  • 2026-01-05: Initial MVP release with Gemini OCR and VibeVoice TTS integration.

(back to top)
