Local, zero-shot text-to-video generation — lightweight, transparent, and privacy-respecting.
Synthesize short video clips on consumer hardware, with no internet connection and no external services.
ZeroScope is a minimal, locally executable implementation of zero-shot text-to-video generation, designed for research, prototyping, and creative exploration. Built on diffusion priors and temporal interpolation, it generates coherent short clips from text prompts — entirely on your machine.
Unlike cloud-based alternatives, ZeroScope requires no API keys, no telemetry, and no internet connection after initial setup. Your prompts stay private. Your outputs belong to you.
ZeroScope is not a product — it’s a reference implementation.
It prioritizes:
- Simplicity over feature bloat
- Transparency over black-box magic
- Local execution over cloud dependency
- Reproducibility over hidden optimizations
This project is intended for technical users, educators, and researchers who wish to understand or extend zero-shot video generation — not for production-grade video pipelines.
- Text-to-video generation (2–6 seconds)
- Resolution: 256×256 (optimized for speed & memory)
- Frame rate: 8–12 FPS
- Works on GPU (CUDA) and CPU
- Outputs standard MP4 files via FFmpeg
- Built on Hugging Face `diffusers` and `transformers`
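Since generation runs through the Hugging Face `diffusers` library, the core loop can be sketched roughly as follows. This is a minimal, untested sketch, not the project's actual entry point: the model id is the one listed in this README, and the exact pipeline output shape (`.frames[0]`) may differ across `diffusers` versions. Heavy dependencies are imported lazily so the snippet loads without them.

```python
# Frame budget derived from the ranges above: 8-12 FPS, 2-6 second clips.
FPS = 8
SECONDS = 3
NUM_FRAMES = FPS * SECONDS  # 24 frames for a 3-second clip at 8 FPS

def generate(prompt: str, out_path: str = "clip.mp4") -> str:
    """Generate a short clip from a text prompt, entirely locally."""
    # Lazy imports: torch and diffusers are only needed at generation time.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained(
        "cerspense/zeroscope-v2-576w",  # model id as listed in this README
        torch_dtype=torch.float16,       # half precision; see the --fp16 note
    )
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    result = pipe(prompt, num_frames=NUM_FRAMES, height=256, width=256)
    frames = result.frames[0]  # first (and only) clip in the batch
    return export_to_video(frames, output_video_path=out_path, fps=FPS)
```

On a 12 GB GPU this fits comfortably in the ~5.8 GB VRAM envelope reported in the benchmarks below; on CPU, expect the much longer wall-clock times listed there.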
Note: Quality is limited by model size and temporal coherence constraints. Outputs may exhibit artifacts or motion inconsistencies — this is expected behavior for a lightweight zero-shot system.
| Hardware | Time | VRAM Usage |
|---|---|---|
| RTX 3060 (12 GB) | ~2.1 min | ~5.8 GB |
| RTX 4070 (12 GB) | ~1.7 min | ~5.8 GB |
| Apple M1 Pro | ~4.3 min | ~6.2 GB |
| CPU (i7-12700) | ~28 min | ~9.1 GB RAM |
Use `--fp16` for faster inference on compatible GPUs.
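The precision selection behind a flag like `--fp16` amounts to a small fallback rule; the helper below is a hypothetical illustration (the name `pick_dtype` is not from the codebase), showing the safe behavior of reverting to full precision when half precision is not supported.

```python
def pick_dtype(fp16_requested: bool, cuda_available: bool) -> str:
    """Return the dtype name the pipeline should be loaded with."""
    if fp16_requested and cuda_available:
        return "float16"  # halves weight memory and speeds up inference
    return "float32"      # safe default for CPU and unsupported GPUs
```

Falling back silently (rather than erroring) keeps the same command line usable across the GPU and CPU configurations in the table above.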
ZeroScope must not be used to:
- Generate synthetic media of real individuals without consent
- Create misleading or harmful content
- Circumvent copyright protections
- Automate disinformation
The model weights are derived from open academic releases and are provided for non-commercial research and educational purposes only.
- Based on the Text2Video-Zero methodology (Khachatryan et al., 2023)
- Model: `cerspense/zeroscope-v2-576w` (from Hugging Face)
- Code license: MIT
- Model license: CreativeML Open RAIL-M
- Requires ~3.2 GB disk space for weights
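Since the weights occupy roughly 3.2 GB, a preflight check before the initial download can save a failed setup. A standard-library sketch (the path default and 10% margin are assumptions, not project conventions):

```python
import shutil

# ~3.2 GB for the model weights, as noted above.
WEIGHTS_BYTES = int(3.2 * 1024**3)

def has_room(path: str = ".", required: int = WEIGHTS_BYTES,
             margin: float = 1.1) -> bool:
    """True if the filesystem at `path` can hold the weights plus a 10% margin."""
    return shutil.disk_usage(path).free >= int(required * margin)
```

Running this against the intended cache directory before the first launch avoids a partially downloaded checkpoint on a full disk.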
This is not a fork of commercial systems. It is a reimplementation for local use, with no affiliation to any corporate entity.
- Report bugs or inconsistencies: Issues
- Discuss use cases or limitations: Discussions
- Improve documentation or performance: PRs welcome
We value honesty about limitations as much as technical excellence.
ZeroScope: A small window into the future of video — open, local, and modest in its claims.