Ragged - : Leveraging Video Container Formats for Efficient Vector Database Distribution

🎥 Revolutionary Vector Database Distribution: Encode your knowledge base into MP4 video files and distribute globally through CDNs for lightning-fast semantic search.

Ragged transforms how we think about vector database distribution by leveraging the mature video streaming infrastructure. Instead of complex database deployments, simply upload an MP4 file to any CDN and get instant global semantic search capabilities.

🚀 Quick Start

# Install dependencies
poetry install

# configure the R2 bucket by creating a .env with the following variables. Any s3-compatible provider should be ok. I have only tested with R2
R2_BUCKET=ragged
R2_ENDPOINT=<cloudflare-r2-endpoint>
R2_ACCESS_KEY=<cloudflare-r2-access-key>
R2_SECRET_KEY=<cloudflare-r2-secret-key>

# next step is to start the model server. We dont have to but warming the embedding model reduces the processing time
python3 ragged/video/model_server.py --start

# now lets build the mp4 and other data from wikipedia. This will automtically upload the files to R2
python3 ragged/video/wiki_upload.py --max-articles 1000 

# Search the knowledge base we just built
 python3 ragged/video/search.py "machine" --show-performance --detailed

# If you want to run benchmarks
python3 ragged/video/benchmarks.py --benchmark

Caveats

First search run will be slow as the faiss index and the manifest will be populated one time from the cloud. Even this can be warmed up (future enhancement)
A seperate model server helps a lot with performance. I strongly recommend running that.
You might notice that the similarity results are somewhat low. That will be a fair critique but the point of this library and demo is to show the Mp4 storage and cloud retrieval functionality. People way smarter and efficient than myself have solved those problems and with some effort the quality of results can be improved (future enhancement)

Demo

a. Run model server --

Makua-202506-2815.38.10.mp4

b. Encode knowledge base --

Makua-202506-2815.43.32.mp4

c. Query first run --

Makua-202506-2814.59.46.mp4

d. Query subsequent runs --

Makua-202506-2815.03.31.mp4

🌟 What Makes Ragged Special?

Traditional Vector Databases 😰

Complex server deployments
Expensive hosting infrastructure
Cold-start penalties
Regional latency issues
Database connection limits

Ragged Approach 🎯

MP4 files → Upload anywhere (Cloudflare R2, AWS S3, etc.)
CDN distribution → Global edge caching automatically
HTTP range requests → Download only what you need
Zero servers → Serverless and edge-computing ready
Infinite scale → No connection limits

🏗️ How It Works

graph TD
    A[📄 Documents] --> B[🔤 Text Chunking]
    B --> C[🧮 Vector Encoding]
    C --> D[📦 MP4 Fragments]
    D --> E[🎬 MP4 Container]
    E --> F[🌍 CDN Distribution]
    F --> G[🔍 Global Search]
    
    H[📊 FAISS Index] --> F
    I[📋 JSON Manifest] --> F

📄 Input: Your documents (PDFs, text files, web content)
🔤 Processing: Smart chunking with overlap and topic extraction
🧮 Encoding: Convert to vectors using sentence-transformers
📦 Packaging: Encode vectors into MP4 fragments with metadata
🌍 Distribution: Upload to any CDN (Cloudflare R2, AWS CloudFront, etc.)
🔍 Search: Lightning-fast semantic search from anywhere in the world

🎯 Core Features

🎬 MP4 Vector Encoding

Standards Compliant: ISO/IEC 14496-12 MP4 containers
Fragment-Based: Optimized chunk sizes for CDN performance
Rich Metadata: Topic classification, timestamps, source attribution
Binary Efficiency: Float32 vectors with JSON metadata

🌐 CDN-Optimized Distribution

HTTP Range Requests: Surgical data access, download only needed fragments
Intelligent Prefetching: Background loading of adjacent fragments
Multi-Level Caching: Memory, disk, and CDN edge caching
Global Performance: Consistent search speed worldwide

🔍 Advanced Search Capabilities

Semantic Search: Natural language queries using sentence-transformers
Topic Filtering: Search within specific topics or domains
FAISS Integration: Exact and approximate similarity search
Similarity Thresholds: Configurable result quality filtering

⚡ Performance & Scalability

Cold-Start Optimization: 3-second initialization vs 45+ seconds for traditional DBs
Infinite Readers: No database connection limits
Edge Computing: Works in Cloudflare Workers, AWS Lambda, etc.
Bandwidth Efficient: 92% less data transfer for initial loads

📁 Project Structure

ragged/
├── 📁 ragged/                     # Core package
│   ├── 📁 video/                  # Video encoding/decoding
│   │   ├── encoder.py            # MP4 vector encoding
│   │   ├── decoder.py            # CDN-optimized decoding
│   │   └── config.py             # Video codec settings
│   ├── 📁 api/                    # FastAPI web service
│   │   └── v1/endpoints/         # REST API endpoints
│   ├── 📁 services/              # Business logic
│   │   └── uploader/             # CDN upload services
│   ├── 📁 enterprise/            # Enterprise features (WIP)
│   └── main.py                   # FastAPI app entry point
├── 📁 examples/                   # Usage examples
└── 📋 pyproject.toml             # Dependencies

📊 Performance Characteristics

📊 BENCHMARK SUMMARY - Obtained by running benchmarks.py against a random dataset. To be honest i feel while the system is very good, these numbers are a bit generous. Critiques on the benchmark script are welcome.

⚡ Performance Grade: A (10.0ms avg) 🎯 Quality Grade: F (43.3% relevance) 🚀 Throughput: 100.9 queries/sec 💾 Cache Hit Rate: 100.0%

📈 Detailed Metrics: Cold Start p95: 129ms Warm Search p95: 10ms Query Encoding: 8.0ms Result Diversity: 40.0% Memory Usage: 607.5MB

ps: Quality is highly dependent on the articles that you get from the wiki dataset. This will vary from run to run.

🎯 Use Cases

✅ Perfect For

📚 Knowledge Base Search: Documentation, FAQs, internal wikis
🤖 RAG Applications: Retrieval-augmented generation systems
🌐 Global Applications: Multi-region deployments with consistent performance
⚡ Edge Computing: Serverless functions, IoT devices, mobile apps
💰 Cost-Sensitive Deployments: Startups, side projects, research

❌ Not Ideal For

🔄 Frequent Updates: Real-time indexing requirements
🔍 Complex Queries: Multi-stage filtering, analytical workloads
📊 Traditional CRUD: Applications needing database transactions

🛣️ Roadmap

🎯 Current Focus (v0.1)

🚀 Next Phase (v0.2)

Multi-modal vectors (images, audio)
Streaming updates (incremental changes)
Advanced search (hybrid, faceted)
Enterprise SSO integration

🌟 Future Vision (v1.0)

Standard MP4 boxes for vectors
P2P distribution networks
Edge AI processing
Ecosystem integrations

🤝 Contributions

Welcome

📝 Documentation

Update README for new features
Add docstrings for new functions
Include usage examples

📚 Learn More

📖 Academic Paper

Read our arXiv paper: "Ragged: ragged.pdf Leveraging Video Container Formats for Efficient Vector Database Distribution"

🎥 Inspiration

This project was inspired by Memvid, which demonstrated storing data in video formats. Ragged extends this concept with vector-specific optimizations, CDN distribution, and semantic search capabilities.

🔗 Related Projects

FAISS: Efficient similarity search
Sentence Transformers: Text embedding models
FastAPI: Modern web framework

📄 License

MIT License - see LICENSE file for details.

🎉 Acknowledgments

Video Streaming Community: For the mature CDN infrastructure we leverage
FAISS Team: For efficient similarity search algorithms
Sentence-Transformers: For high-quality text embeddings
Memvid: For the initial inspiration of storing data in video formats
Open Source Community: For the foundational libraries that make this possible

🌟 Star this repo if you find it useful! 🌟

Questions? Open an issue or start a discussion!

💡 Fun Fact: Your entire knowledge base is now a video file that can be streamed, cached, and distributed just like any YouTube video - but instead of cat videos, it's semantic search! 🐱➡️🔍

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
examples		examples
ragged		ragged
tests		tests
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

License

nikitph/ragged

Folders and files

Latest commit

History

Repository files navigation