nikitph/ragged

Ragged: Leveraging Video Container Formats for Efficient Vector Database Distribution

Python 3.11+ License: MIT arXiv

🎥 Revolutionary Vector Database Distribution: Encode your knowledge base into MP4 video files and distribute them globally through CDNs for lightning-fast semantic search.

Ragged transforms how we think about vector database distribution by leveraging the mature video streaming infrastructure. Instead of complex database deployments, simply upload an MP4 file to any CDN and get instant global semantic search capabilities.

🚀 Quick Start

# Install dependencies
poetry install

# Configure the R2 bucket by creating a .env file with the following variables. Any S3-compatible provider should work. I have only tested with R2.
R2_BUCKET=ragged
R2_ENDPOINT=<cloudflare-r2-endpoint>
R2_ACCESS_KEY=<cloudflare-r2-access-key>
R2_SECRET_KEY=<cloudflare-r2-secret-key>

# Next, start the model server. This is optional, but warming the embedding model reduces processing time.
python3 ragged/video/model_server.py --start

# Now build the MP4 and supporting data from Wikipedia. This automatically uploads the files to R2.
python3 ragged/video/wiki_upload.py --max-articles 1000

# Search the knowledge base we just built
python3 ragged/video/search.py "machine" --show-performance --detailed

# If you want to run benchmarks
python3 ragged/video/benchmarks.py --benchmark
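
Before running the upload step, it can help to confirm that the .env file is complete. The sketch below is not part of Ragged: `parse_env` and `missing_keys` are hypothetical helpers, and the only thing taken from this README is the list of four variable names. It treats values still wrapped in `<...>` as unfilled placeholders.

```python
# Hypothetical pre-flight check for the .env file described above.
# Only the four variable names come from this README.
REQUIRED_KEYS = ("R2_BUCKET", "R2_ENDPOINT", "R2_ACCESS_KEY", "R2_SECRET_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent, empty, or still a <placeholder>."""
    return [
        key
        for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith("<")
    ]
```

For example, `missing_keys(parse_env(open(".env").read()))` returns an empty list once all four values are filled in.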

Caveats

  • The first search run will be slow because the FAISS index and the manifest are downloaded from the cloud once. Even this can be warmed up (future enhancement).
  • A separate model server helps a lot with performance. I strongly recommend running it.
  • You might notice that the similarity scores are somewhat low. That is a fair critique, but the point of this library and demo is to show the MP4 storage and cloud retrieval functionality. People far smarter and more efficient than I am have solved those problems, and with some effort the quality of the results can be improved (future enhancement).
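
The first caveat describes a download-once pattern: the index and manifest are fetched on the first run and reused afterwards. A minimal sketch of that idea follows, assuming a caller-supplied `fetch` function that returns the remote bytes. `cached_fetch` and the file name in the usage note are illustrative and not taken from Ragged's code.

```python
from pathlib import Path
from typing import Callable


def cached_fetch(name: str, fetch: Callable[[str], bytes], cache_dir: Path) -> bytes:
    """Return the artifact from cache_dir, calling fetch(name) only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / name
    if path.exists():
        # Warm path: no network access at all.
        return path.read_bytes()
    data = fetch(name)  # Cold path: one remote download.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    tmp.replace(path)  # Rename into place so readers never see a partial file.
    return data
```

Calling `cached_fetch("index.faiss", download, Path("~/.cache/ragged").expanduser())` twice would invoke the hypothetical `download` only once. Warming up simply means making that first call ahead of the first query.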

Demo

a. Run model server --

Makua-202506-2815.38.10.mp4

b. Encode knowledge base --

Makua-202506-2815.43.32.mp4

c. Query first run --

Makua-202506-2814.59.46.mp4

d. Query subsequent runs --

Makua-202506-2815.03.31.mp4

🌟 What Makes Ragged Special?

Traditional Vector Databases 😰

  • Complex server deployments
  • Expensive hosting infrastructure
  • Cold-start penalties
  • Regional latency issues
  • Database connection limits

Ragged Approach 🎯

  • MP4 files → Upload anywhere (Cloudflare R2, AWS S3, etc.)
  • CDN distribution → Global edge caching automatically
  • HTTP range requests → Download only what you need
  • Zero servers → Serverless and edge-computing ready
  • Infinite scale → No connection limits

🏗️ How It Works

graph TD
    A[📄 Documents] --> B[🔤 Text Chunking]
    B --> C[🧮 Vector Encoding]
    C --> D[📦 MP4 Fragments]
    D --> E[🎬 MP4 Container]
    E --> F[🌐 CDN Distribution]
    F --> G[🔍 Global Search]

    H[📊 FAISS Index] --> F
    I[📋 JSON Manifest] --> F

  1. 📄 Input: Your documents (PDFs, text files, web content)
  2. 🔤 Processing: Smart chunking with overlap and topic extraction
  3. 🧮 Encoding: Convert to vectors using sentence-transformers
  4. 📦 Packaging: Encode vectors into MP4 fragments with metadata
  5. 🌐 Distribution: Upload to any CDN (Cloudflare R2, AWS CloudFront, etc.)
  6. 🔍 Search: Lightning-fast semantic search from anywhere in the world
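
Step 2 mentions chunking with overlap. As a rough illustration, the sketch below splits text into word windows that share a few words with their neighbors. The word-based splitting and the default sizes are assumptions made for the example. Ragged's actual chunker may work differently, for example on sentences or tokens.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of chunk_size words; neighbors share overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # How far each new window advances.
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # This window already reached the end of the text.
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some words twice.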

🎯 Core Features

🎬 MP4 Vector Encoding

  • Standards Compliant: ISO/IEC 14496-12 MP4 containers
  • Fragment-Based: Optimized chunk sizes for CDN performance
  • Rich Metadata: Topic classification, timestamps, source attribution
  • Binary Efficiency: Float32 vectors with JSON metadata
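
For readers unfamiliar with the container format: an ISO/IEC 14496-12 box starts with a 32-bit big-endian size (which counts the 8-byte header) followed by a four-character type, then the payload. The sketch below packs float32 vectors and JSON metadata into one such box. The box type `vect` and the payload layout (three little-endian counters, the JSON, then the floats) are invented for this example and are not Ragged's actual on-disk format.

```python
import json
import struct

BOX_TYPE = b"vect"  # Hypothetical four-character code, chosen for this sketch.


def pack_box(box_type: bytes, payload: bytes) -> bytes:
    """ISO BMFF box: big-endian uint32 size (header included), 4-byte type, payload."""
    if len(box_type) != 4:
        raise ValueError("box type must be exactly 4 bytes")
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def pack_fragment(vectors: list[list[float]], metadata: dict) -> bytes:
    """Payload layout (assumed): count, dim, meta_len, JSON metadata, float32 data."""
    dim = len(vectors[0]) if vectors else 0
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    meta = json.dumps(metadata).encode("utf-8")
    flat = [x for v in vectors for x in v]
    header = struct.pack("<III", len(vectors), dim, len(meta))
    return pack_box(BOX_TYPE, header + meta + struct.pack(f"<{len(flat)}f", *flat))


def unpack_fragment(box: bytes) -> tuple[list[list[float]], dict]:
    """Inverse of pack_fragment; validates the declared size and type."""
    size, box_type = struct.unpack_from(">I4s", box, 0)
    if size != len(box) or box_type != BOX_TYPE:
        raise ValueError("not a complete fragment box")
    count, dim, meta_len = struct.unpack_from("<III", box, 8)
    meta_start = 8 + 12  # Box header plus the three counters.
    metadata = json.loads(box[meta_start:meta_start + meta_len])
    flat = struct.unpack_from(f"<{count * dim}f", box, meta_start + meta_len)
    vectors = [list(flat[i * dim:(i + 1) * dim]) for i in range(count)]
    return vectors, metadata
```

A file containing only boxes like this would not play as video. A real MP4 also needs at least an `ftyp` box and a valid box hierarchy, which this sketch leaves out.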

🌐 CDN-Optimized Distribution

  • HTTP Range Requests: Surgical data access, download only needed fragments
  • Intelligent Prefetching: Background loading of adjacent fragments
  • Multi-Level Caching: Memory, disk, and CDN edge caching
  • Global Performance: Consistent search speed worldwide

🔍 Advanced Search Capabilities

  • Semantic Search: Natural language queries using sentence-transformers
  • Topic Filtering: Search within specific topics or domains
  • FAISS Integration: Exact and approximate similarity search
  • Similarity Thresholds: Configurable result quality filtering
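
The interaction between topic filtering, similarity thresholds, and top-k ranking can be shown with a brute-force search in plain Python. This is only an illustration of the logic. In Ragged the similarity search itself is delegated to FAISS, and the item fields used here (`vector`, `topic`, `text`) are assumed names for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, items, top_k=5, min_score=0.0, topic=None):
    """Return up to top_k (score, text) pairs, best first.

    items: dicts with 'vector', 'topic', and 'text' keys (assumed layout).
    """
    scored = []
    for item in items:
        if topic is not None and item["topic"] != topic:
            continue  # Topic filter is applied before scoring.
        score = cosine(query, item["vector"])
        if score >= min_score:  # Similarity threshold.
            scored.append((score, item["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

This exhaustive scan is what an exact (flat) FAISS index computes far more efficiently. Approximate index types such as IVF or HNSW trade a little recall for speed on larger collections.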

⚡ Performance & Scalability

  • Cold-Start Optimization: 3-second initialization vs 45+ seconds for traditional DBs
  • Infinite Readers: No database connection limits
  • Edge Computing: Works in Cloudflare Workers, AWS Lambda, etc.
  • Bandwidth Efficient: 92% less data transfer for initial loads

📁 Project Structure

ragged/
├── 📁 ragged/                     # Core package
│   ├── 📁 video/                  # Video encoding/decoding
│   │   ├── encoder.py            # MP4 vector encoding
│   │   ├── decoder.py            # CDN-optimized decoding
│   │   └── config.py             # Video codec settings
│   ├── 📁 api/                    # FastAPI web service
│   │   └── v1/endpoints/         # REST API endpoints
│   ├── 📁 services/              # Business logic
│   │   └── uploader/             # CDN upload services
│   ├── 📁 enterprise/            # Enterprise features (WIP)
│   └── main.py                   # FastAPI app entry point
├── 📁 examples/                   # Usage examples
└── 📋 pyproject.toml             # Dependencies

📊 Performance Characteristics

📊 BENCHMARK SUMMARY: Obtained by running benchmarks.py against a random dataset. To be honest, I feel that while the system is very good, these numbers are a bit generous. Critiques of the benchmark script are welcome.

  • ⚡ Performance Grade: A (10.0ms avg)
  • 🎯 Quality Grade: F (43.3% relevance)
  • 🚀 Throughput: 100.9 queries/sec
  • 💾 Cache Hit Rate: 100.0%

📈 Detailed Metrics:

  • Cold Start p95: 129ms
  • Warm Search p95: 10ms
  • Query Encoding: 8.0ms
  • Result Diversity: 40.0%
  • Memory Usage: 607.5MB

P.S. Quality depends heavily on which articles are pulled from the wiki dataset, so it will vary from run to run.
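
To help when reading or critiquing these numbers, here is one way the average, p95, and throughput figures could be derived from raw per-query latencies. It is a sketch and not the logic in benchmarks.py. It assumes a nearest-rank percentile and queries issued one at a time, so throughput is simply the inverse of the average latency.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Average, p95, and sequential throughput for a list of query latencies."""
    avg = sum(latencies_ms) / len(latencies_ms)
    return {
        "avg_ms": avg,
        "p95_ms": percentile(latencies_ms, 95),
        "qps": 1000.0 / avg,  # Assumes one query at a time, no concurrency.
    }
```

Under that sequential assumption, a 10.0 ms average corresponds to about 100 queries/sec, which is in line with the 100.9 queries/sec reported above.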

🎯 Use Cases

✅ Perfect For

  • 📚 Knowledge Base Search: Documentation, FAQs, internal wikis
  • 🤖 RAG Applications: Retrieval-augmented generation systems
  • 🌐 Global Applications: Multi-region deployments with consistent performance
  • ⚡ Edge Computing: Serverless functions, IoT devices, mobile apps
  • 💰 Cost-Sensitive Deployments: Startups, side projects, research

❌ Not Ideal For

  • 🔄 Frequent Updates: Real-time indexing requirements
  • 🔍 Complex Queries: Multi-stage filtering, analytical workloads
  • 📊 Traditional CRUD: Applications needing database transactions

🛣️ Roadmap

🎯 Current Focus (v0.1)

  • Core MP4 encoding/decoding
  • CDN-optimized distribution
  • FastAPI web service
  • PDF upload pipeline
  • Production deployment guides
  • Performance optimization

🚀 Next Phase (v0.2)

  • Multi-modal vectors (images, audio)
  • Streaming updates (incremental changes)
  • Advanced search (hybrid, faceted)
  • Enterprise SSO integration

🌟 Future Vision (v1.0)

  • Standard MP4 boxes for vectors
  • P2P distribution networks
  • Edge AI processing
  • Ecosystem integrations

🤝 Contributions

Contributions are welcome.

📝 Documentation

  • Update README for new features
  • Add docstrings for new functions
  • Include usage examples

📚 Learn More

📖 Academic Paper

Read our arXiv paper (ragged.pdf): "Ragged: Leveraging Video Container Formats for Efficient Vector Database Distribution"

🎥 Inspiration

This project was inspired by Memvid, which demonstrated storing data in video formats. Ragged extends this concept with vector-specific optimizations, CDN distribution, and semantic search capabilities.

🔗 Related Projects

📄 License

MIT License - see LICENSE file for details.

🎉 Acknowledgments

  • Video Streaming Community: For the mature CDN infrastructure we leverage
  • FAISS Team: For efficient similarity search algorithms
  • Sentence-Transformers: For high-quality text embeddings
  • Memvid: For the initial inspiration of storing data in video formats
  • Open Source Community: For the foundational libraries that make this possible

🌟 Star this repo if you find it useful! 🌟

Questions? Open an issue or start a discussion!

💡 Fun Fact: Your entire knowledge base is now a video file that can be streamed, cached, and distributed just like any YouTube video, but instead of cat videos, it's semantic search! 🐱➡️🔍
