
Forks Stargazers Issues MIT License

Python FastAPI PyTorch UV Google Gemini

Warning

This is an ongoing personal project currently in active development. Features are subject to change.


Logo

ScreenBanter

Make your screen talk back. Real-time AI desktop narration.

Report Bug · Request Feature

Table of Contents
  1. 📖 Overview
  2. ✨ Features
  3. 📦 Technologies
  4. 🚀 Installation & Setup
  5. 🛠️ Usage
  6. 🔧 Configuration
  7. 🗂️ Repository Structure
  8. 🔗 Flow Chart
  9. Roadmap
  10. 🤝 Contributing
  11. License
  12. Contact
  13. ❤️ Acknowledgements
  14. 📝 Changelog

📖 Overview

ScreenBanter is an amateur project that pairs Google’s Gemini Vision (for high-speed OCR) with Gemini 2.5 Flash Preview TTS (for cloud audio) to provide real-time desktop narration.

New in v0.3.0: The application is now distributed as a lightweight Lite Client (~100 MB) with Cloud TTS support out of the box. Local neural TTS (via Microsoft VibeVoice-0.5B) is fully supported through a "Bring Your Own Engine" (BYOE) model, allowing power users to host their own inference server.

🌟 Key Features

  • Cloud TTS (Default): High-quality, low-latency narration using Gemini 2.5 Flash Preview (gemini-2.5-flash-preview-tts) with 30+ native voices. Zero local GPU load.
  • Local TTS (BYOE): Connect to your own local VibeVoice instance for private, offline, neural speech generation.
  • Smart Vision: Uses Gemini 2.5 Flash Lite (models/gemini-flash-lite-latest) via the Gemini API for intelligent text extraction and context-aware merging of multiple screenshots (see the sketch below).
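
For illustration, here is a minimal sketch of these two Gemini calls using the google-genai Python SDK. The model names come from this README; the prompt, voice choice, and response handling are assumptions rather than the application's actual code.

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_GEMINI_KEY")

# 1) Smart Vision: extract on-screen text from a screenshot (PNG bytes).
with open("screenshot.png", "rb") as f:
    image_bytes = f.read()

vision = client.models.generate_content(
    model="models/gemini-flash-lite-latest",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Extract the readable text from this screenshot.",
    ],
)
extracted_text = vision.text

# 2) Cloud TTS: synthesize speech with one of the prebuilt voices (e.g. Kore).
tts = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=extracted_text,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)
# Raw PCM audio bytes; the exact field layout may vary between SDK versions.
audio_bytes = tts.candidates[0].content.parts[0].inline_data.data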

(back to top)

🚀 Installation & Setup

📥 Download (Lite Client)

The easiest way to use ScreenBanter is to download the latest pre-compiled build:

  1. Go to the GitHub Actions tab.
  2. Select the latest successful "Build Windows Executable" run.
  3. Scroll down to Artifacts and download ScreenBanter_Windows_Executable.
  4. Extract the ZIP, create a .env file with your GEMINI_KEY, and run ScreenBanter.exe.

✅ Requirements

Lite Client (Cloud Only)

  • OS: Windows 10/11 (Required for DXcam and Win32 tray integration).
  • Internet: Active connection for Gemini API.

Local TTS Engine (Optional)

  • GPU: NVIDIA GPU with CUDA 12.1 support (RTX 3060+ recommended).
  • Python: 3.10+ installed.
  • Git: Installed.

Installation (Source)

  1. Clone the Repository

    git clone https://github.com/alfred1137/ScreenBanter.git
    cd ScreenBanter
  2. Setup Environment Variables. Create a .env file from the example:

    cp .env.example .env

    Edit .env and add your GEMINI_KEY from Google AI Studio.
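
    For reference, a minimal .env needs only this one key (assuming .env.example follows the same KEY=value layout; the value below is a placeholder):

    GEMINI_KEY=your-google-ai-studio-api-key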

  3. Install Dependencies (Lite). Using uv:

    uv sync
  4. Optional: Setup Local TTS. To use VibeVoice locally, follow the Local TTS Setup Guide.

    If developing locally:

    uv sync --extra local-tts
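
By way of illustration, a BYOE client call to a locally hosted VibeVoice server might look like the sketch below. The URL, route, and payload fields are placeholders; the real contract is defined by server/tts_server.py and the Local TTS Setup Guide.

import requests

# Hypothetical request to a local VibeVoice FastAPI server; the endpoint,
# port, and JSON fields are illustrative placeholders, not the actual API.
resp = requests.post(
    "http://127.0.0.1:8000/tts",
    json={"text": "Hello from ScreenBanter", "preset": "default"},
    timeout=60,
)
resp.raise_for_status()
audio_bytes = resp.content  # assumed: raw audio returned in the response body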

(back to top)

🛠️ Usage

1. Launch the Application. This starts the system tray app:

uv run python -m app.main

Wait for the announcement: "ScreenBanter is active."

2. Controls (Default Hotkeys)

Hotkey           Action            Description
Ctrl + Alt + S   Instant Capture   Narrates the current screen/region immediately.
F10              Queue Screenshot  Adds the current view to the buffer (confirmed by a beep).
F11              Process Queue     Merges all queued captures and narrates the result.
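
The README does not specify which library handles the global hotkeys; purely as an illustration, the keyboard package could wire up the three defaults like this (the callbacks are stand-ins for the real capture logic):

import keyboard  # assumption: the app may use a different hotkey library

def instant_capture():
    print("capture and narrate the current screen/region")

def queue_screenshot():
    print("add the current view to the buffer")

def process_queue():
    print("merge queued captures and narrate the result")

keyboard.add_hotkey("ctrl+alt+s", instant_capture)
keyboard.add_hotkey("f10", queue_screenshot)
keyboard.add_hotkey("f11", process_queue)
keyboard.wait()  # block and keep listening for hotkeys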

3. Banter HUD. ScreenBanter features a non-intrusive HUD that appears automatically during operation:

  • Scanning: Indicates Gemini is analyzing the screen.
  • Thinking: Displays the extracted text for verification.
  • Speaking: Shows playback status.

The HUD is "click-through" and will not steal focus from your active game.
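
"Click-through" overlays are usually achieved with Win32 extended window styles; a minimal sketch with pywin32 is shown below (how the actual HUD implements this may differ):

import win32con
import win32gui

def make_click_through(hwnd: int) -> None:
    # Add the layered, transparent, and no-activate extended styles so the
    # window ignores mouse input and never steals keyboard focus.
    style = win32gui.GetWindowLong(hwnd, win32con.GWL_EXSTYLE)
    style |= (win32con.WS_EX_LAYERED
              | win32con.WS_EX_TRANSPARENT
              | win32con.WS_EX_NOACTIVATE)
    win32gui.SetWindowLong(hwnd, win32con.GWL_EXSTYLE, style)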

(back to top)

🔧 Configuration

Access settings by right-clicking the Loudspeaker icon in the system tray.

  • Hotkeys: Rebind any action to your preferred key combinations.
  • Audio:
    • Cloud: Select Gemini Model and Voice (e.g., Puck, Kore).
    • Local: Configure external engine path and select VibeVoice presets.
  • Capture Mode: Toggle between Fullscreen and Region. In Region mode, use the interactive selector to define your capture area.
  • HUD / UI: Toggle the Banter HUD, adjust opacity, and configure focus behavior (Immersive vs. Focus mode).
  • Performance: Configure "Process Priority" and "Playback Buffer" to optimize for your hardware.
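
To give a sense of what these options cover, a hypothetical defaults dictionary is sketched below; the keys and values are illustrative only, and the real schema lives in app/settings.py:

# Hypothetical defaults; illustrative only. See app/settings.py for the real schema.
DEFAULT_SETTINGS = {
    "hotkeys": {"instant_capture": "ctrl+alt+s", "queue_screenshot": "f10", "process_queue": "f11"},
    "audio": {"backend": "cloud", "voice": "Kore", "local_engine_path": ""},
    "capture": {"mode": "fullscreen", "region": None},
    "hud": {"enabled": True, "opacity": 0.85, "focus_mode": "immersive"},
    "performance": {"process_priority": "high", "playback_buffer_ms": 200},
}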

(back to top)

🗂️ Repository Structure

ScreenBanter/
├── app/                  # Frontend Daemon & GUI
│   ├── main.py           # Application entry & Tray management
│   ├── capture.py        # DXcam screen capture logic
│   ├── vision.py         # Gemini API integration
│   ├── audio_client.py   # Threaded PyAudio playback
│   ├── settings.py       # Configuration management
│   ├── settings_window.py # CustomTkinter Settings GUI
│   └── region_selector.py # Transparent overlay for region selection
├── server/               # Local Inference Server Logic
│   ├── tts_server.py     # FastAPI application (for local dev/BYOE)
│   └── model_loader.py   # VibeVoice initialization
├── docs/                 # Documentation
│   └── setup_local_tts.md # Guide for external engine setup
├── third_party/          # VibeVoice source code (submodule)

(back to top)

🔗 Flow Chart

graph TD
    User[User] -->|Hotkey| Trigger[Capture Trigger]
    Trigger -->|DXcam| Capture[Screen/Region Capture]
    Capture -->|Image Data| Vision[Gemini Vision Engine]
    Vision -->|Extracted Text| Client[Audio Client]
    
    Client -->|Option A: API| Cloud[Gemini Cloud TTS]
    Client -->|Option B: Subprocess| Local[External VibeVoice Server]
    
    Cloud -->|Audio Bytes| Playback[PyAudio Stream]
    Local -->|Audio Bytes| Playback
    
    Playback -->|Sound| Speakers[User Speakers]
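
To make the two ends of this pipeline concrete, the sketch below grabs a frame with DXcam and plays raw PCM with PyAudio. The frame handling, sample rate, and format are assumptions (16-bit mono at 24 kHz is typical for Gemini TTS output), not the app's exact code.

import dxcam
import pyaudio

# Capture: grab one desktop frame as a numpy array (returns None if no new frame).
camera = dxcam.create()
frame = camera.grab()  # pass region=(left, top, right, bottom) for region capture

# ... frame -> Gemini Vision -> TTS happens in between (see the earlier sketches) ...
audio_bytes = b""  # placeholder for the PCM bytes returned by the TTS backend

# Playback: stream raw PCM through the default output device.
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
stream.write(audio_bytes)
stream.stop_stream()
stream.close()
pa.terminate()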

(back to top)

Roadmap

  • Instant Narration
  • Batch Mode (Queueing)
  • Custom Region Capture
  • Settings GUI (Modern UI)
  • Local Neural TTS Integration (BYOE)
  • Cloud TTS (Gemini) Integration
  • Dynamic Voice/Device Selection
  • Standalone "Lite" Client Build

See the open issues for a full list of proposed features.

(back to top)

🤝 Contributing

Contributions are welcome! If you have suggestions or bug fixes:

  1. Fork the Project.
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature).
  3. Commit your Changes (git commit -m 'Add some AmazingFeature').
  4. Push to the Branch (git push origin feature/AmazingFeature).
  5. Open a Pull Request.

Note: As this is an amateur project, PR reviews might take some time!

(back to top)

License

Distributed under the MIT License. See LICENSE for more information.

(back to top)

Contact

Alfred T - GitHub Profile

Project Link: https://github.com/alfred1137/ScreenBanter

(back to top)

❤️ Acknowledgements

(back to top)

📝 Changelog

  • 2026-01-25: Fixed build dependencies and migrated to PyInstaller for CI stability. (v0.3.5)
  • 2026-01-24: Refactored to Lite Client architecture (v0.3.0).
  • 2026-01-20: Verified full integration workflow (HUD, 4-bit TTS, Region Capture) on Windows 11 with CUDA 12.1.
  • 2026-01-16: Added Performance Mode (4-bit quantization, priority boosting) and Banter HUD for seamless gaming integration.
  • 2026-01-14: Enhanced documentation, added Region Capture and Settings GUI polish.
  • 2026-01-10: Implemented Settings GUI and dynamic configuration infrastructure.
  • 2026-01-05: Initial MVP release with Gemini OCR and VibeVoice TTS integration.

(back to top)
