VideoRAG: Multimodal Retrieval-Augmented Generation over Video Corpora

VideoRAG is a production-grade system for multimodal retrieval-augmented generation over video collections. Ask natural language questions about your videos and get timestamped, grounded answers with source clips.

Features

  • Multimodal Retrieval: Hybrid search across visual (CLIP) and textual (transcript) modalities
  • Automatic Transcription: Whisper-based ASR with speaker diarization support
  • Temporal Chunking: Intelligent video segmentation with configurable overlap
  • Grounded Generation: LLM answers with source citations and exact timestamps
  • Interactive UI: Streamlit interface with video playback and keyframe previews
  • Vector Search: Milvus-powered efficient similarity search at scale
  • Docker Ready: Complete stack deployment with compose
  • Evaluation Tools: Built-in metrics (Hit@K, Precision@K, MRR)

Architecture

┌─────────────┐
│ Video Files │
└──────┬──────┘
       │
       ▼
┌────────────────────────────────────────────────────┐
│  Ingestion Pipeline                                │
│  ┌──────────┐   ┌───────────┐   ┌───────────────┐  │
│  │ Chunking │ → │ Keyframes │ → │ Transcription │  │
│  └──────────┘   └───────────┘   └───────────────┘  │
└──────────────────────────┬─────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────┐
│  Embedding & Indexing                              │
│  ┌────────────┐        ┌───────────────┐           │
│  │ CLIP (img) │        │ Text Embedder │           │
│  └──────┬─────┘        └───────┬───────┘           │
│         │                      │                   │
│         └──────────┬───────────┘                   │
│                    ▼                               │
│            ┌──────────────┐                        │
│            │ Milvus Index │                        │
│            └──────────────┘                        │
└──────────────────────────┬─────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────┐
│  Retrieval & Generation                            │
│  ┌──────────┐   ┌──────────┐   ┌───────────────┐   │
│  │ Query    │ → │ Hybrid   │ → │ Grounded LLM  │   │
│  │ Encoding │   │ Search   │   │ Answer        │   │
│  └──────────┘   └──────────┘   └───────────────┘   │
└──────────────────────────┬─────────────────────────┘
                           │
                           ▼
                    ┌───────────┐
                    │ Streamlit │
                    │    UI     │
                    └───────────┘

Quick Start

Prerequisites

  • Python 3.11+
  • ffmpeg
  • Docker & Docker Compose (for Milvus)
  • API key for OpenAI or Anthropic (for answer generation)

Installation

# Clone repository
git clone https://github.com/AyhamJo7/VideoRAG.git
cd VideoRAG

# Create virtual environment and install
make setup

# Or manually:
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[dev]"

Configuration

# Copy environment template
cp .env.example .env

# Edit .env with your settings:
# - Add OpenAI or Anthropic API key
# - Adjust chunk length, embedding models, etc.
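
As a rough illustration, a populated .env might look like the snippet below. The variable names follow the configuration table later in this README; the API key variable name is an assumption, so treat .env.example as the authoritative reference.

# Illustrative .env values (see .env.example for the authoritative names)
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...                # assumed variable name; check .env.example
CHUNK_LENGTH_SEC=30
CHUNK_OVERLAP_SEC=5
KEYFRAME_SAMPLE_RATE=2.0
WHISPER_MODEL=base
CLIP_MODEL=openai/clip-vit-base-patch32
TEXT_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
TOP_K=5
CLIP_WEIGHT=0.5
TEXT_WEIGHT=0.5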

Usage

1. Start Milvus

make up  # Starts Milvus with docker-compose

2. Add Videos

Place your video files in data/videos/. Supported formats: MP4, AVI, MKV, MOV, WEBM.

IMPORTANT: Only use videos you have rights to! See data/videos/README.md for recommendations.

3. Process Videos

# Process all videos: chunk → extract keyframes → transcribe → embed
python scripts/process_videos.py

This pipeline:

  • Chunks videos into 30s segments with configurable overlap (see the timeline sketch after this list)
  • Extracts keyframes at 2 FPS (configurable)
  • Transcribes audio with Whisper
  • Computes CLIP and text embeddings

4. Build Index

# Insert all processed data into Milvus
python scripts/build_index.py

5. Launch UI

make ui  # Or: streamlit run src/videorag/ui/app.py

Open http://localhost:8501 and start querying!

Example Queries

  • "Find where the lecture explains backpropagation"
  • "Show clips about data preprocessing"
  • "When does the speaker discuss ethical AI?"

Development

Run Tests

make test

Linting & Formatting

make format  # Run black + isort
make lint    # Run ruff + mypy

Pre-commit Hooks

pre-commit install
pre-commit run --all-files

Configuration

Key settings in .env:

Variable               Description                    Default
CHUNK_LENGTH_SEC       Video chunk duration (s)       30
CHUNK_OVERLAP_SEC      Chunk overlap (s)              5
KEYFRAME_SAMPLE_RATE   Keyframes per second           2.0
CLIP_MODEL             Hugging Face CLIP model        openai/clip-vit-base-patch32
TEXT_EMBEDDING_MODEL   Sentence-transformers model    sentence-transformers/all-MiniLM-L6-v2
WHISPER_MODEL          Whisper model size             base
TOP_K                  Number of retrieval results    5
CLIP_WEIGHT            Visual similarity weight       0.5
TEXT_WEIGHT            Text similarity weight         0.5
LLM_PROVIDER           openai or anthropic            openai
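
CLIP_WEIGHT and TEXT_WEIGHT suggest a weighted late fusion of the visual and transcript similarity scores per chunk. A minimal sketch of that idea follows; it is an assumption about the scoring, and the actual fusion logic in src/videorag/rag/ may differ.

# Minimal sketch of the score fusion implied by CLIP_WEIGHT / TEXT_WEIGHT.
# Hypothetical helper; the project's retrieval code may combine scores differently.

def rank_chunks(candidates, clip_weight=0.5, text_weight=0.5, top_k=5):
    """candidates: iterable of (chunk_id, clip_score, text_score) tuples.
    Returns the top_k (chunk_id, fused_score) pairs, best first."""
    scored = [
        (chunk_id, clip_weight * clip_score + text_weight * text_score)
        for chunk_id, clip_score, text_score in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

print(rank_chunks([("intro", 0.2, 0.9), ("backprop", 0.8, 0.7), ("outro", 0.1, 0.3)]))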

API Usage

from videorag.index.milvus_client import MilvusClient
from videorag.rag.pipeline import VideoRAGPipeline
from videorag.text.embedder import TextEmbedder
from videorag.vision.clip_embedder import CLIPEmbedder

# Initialize components
milvus_client = MilvusClient()
milvus_client.collection = milvus_client.create_collection()
milvus_client.load_collection()

text_embedder = TextEmbedder()
clip_embedder = CLIPEmbedder()

# Create pipeline
pipeline = VideoRAGPipeline(
    milvus_client=milvus_client,
    text_embedder=text_embedder,
    clip_embedder=clip_embedder,
)

# Query
result = pipeline.query(
    query="Find lectures about neural networks",
    top_k=5,
    generate_answer=True,
)

print(result["answer"])
for chunk in result["chunks"]:
    print(f"{chunk['video_id']} @ {chunk['start_time']:.1f}s: {chunk['transcript'][:100]}")

Evaluation

The system includes retrieval metrics:

  • Hit@K: Whether any relevant item appears in the top-k
  • Precision@K: Fraction of top-k that are relevant
  • Recall@K: Fraction of all relevant items in top-k
  • MRR: Mean Reciprocal Rank

See src/videorag/eval/metrics.py for implementations.
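
For reference, a minimal self-contained sketch of these metrics (independent of the project's metrics.py) is shown below; ranked results are lists of chunk ids and relevance judgments are sets of ids.

# Reference sketch of the retrieval metrics; not the project's metrics.py.

def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return float(any(item in relevant for item in ranked[:k]))

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    return sum(item in relevant for item in ranked[:k]) / max(len(relevant), 1)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none found)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        rank = next((i + 1 for i, item in enumerate(ranked) if item in relevant), None)
        total += 1.0 / rank if rank else 0.0
    return total / max(len(ranked_lists), 1)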

Docker Deployment

# Start full stack
docker compose up -d

# View logs
docker compose logs -f

# Stop services
docker compose down

Project Structure

VideoRAG/
├── src/videorag/          # Main package
│   ├── config/            # Settings & configuration
│   ├── io/                # Video I/O and chunking
│   ├── asr/               # Whisper transcription
│   ├── vision/            # CLIP embeddings & keyframes
│   ├── text/              # Text embeddings
│   ├── index/             # Milvus client & schema
│   ├── rag/               # Retrieval & generation pipeline
│   ├── ui/                # Streamlit app
│   ├── eval/              # Evaluation metrics
│   └── utils/             # Logging, paths, etc.
├── scripts/               # CLI processing scripts
├── tests/                 # Pytest unit tests
├── docker/                # Dockerfile
├── compose.yaml           # Docker Compose for Milvus
└── data/                  # Data directories (gitignored)

Troubleshooting

Milvus connection failed

  • Ensure Milvus is running: docker compose ps
  • Check logs: docker compose logs milvus

Out of memory during embedding

  • Reduce batch sizes in scripts
  • Use smaller models (e.g., WHISPER_MODEL=tiny)
  • Set EMBEDDING_DEVICE=cpu if the GPU runs out of memory

Poor retrieval quality

  • Increase TOP_K for more candidates
  • Adjust CLIP_WEIGHT / TEXT_WEIGHT balance
  • Use longer context with larger CHUNK_LENGTH_SEC

Contributing

See CONTRIBUTING.md for development guidelines.

License

MIT License - see LICENSE

Citation

If you use VideoRAG in research, please cite:

@software{videorag2025,
  author = {Ayham Jo},
  title = {VideoRAG: Multimodal Retrieval-Augmented Generation over Video Corpora},
  year = {2025},
  url = {https://github.com/AyhamJo7/VideoRAG}
}

Acknowledgments

Built with:

  • OpenAI Whisper (transcription)
  • CLIP via Hugging Face Transformers (visual embeddings)
  • Sentence-Transformers (text embeddings)
  • Milvus (vector similarity search)
  • Streamlit (interactive UI)
  • ffmpeg (video and audio processing)

For questions or issues, please open a GitHub issue.
