Content-based movie recommendation system using vectorization and cosine similarity.

VibeStream


VibeStream is a content-based movie recommendation system that uses vector-space similarity to recommend movies based on metadata rather than user behavior. The system is designed to solve the cold-start problem without relying on user history or collaborative filtering.


TL;DR

VibeStream converts movie metadata into high-dimensional feature vectors and computes cosine similarity to recommend semantically similar movies. The system prioritizes determinism, explainability, and fast inference over black-box personalization.




Problem Statement

Traditional recommendation systems rely heavily on user interaction data (collaborative filtering), which introduces two major issues:

  • Cold-start problem for new users with no interaction history
  • Echo chambers driven by popularity bias and feedback loops

VibeStream approaches recommendation as a pure content similarity problem, enabling meaningful recommendations without any prior user data. This makes it ideal for:

  • New platforms with limited user data
  • Privacy-conscious applications that avoid tracking
  • Scenarios requiring explainable recommendations

ML Pipeline

1. Dataset

  • TMDB 5000 Movies Dataset (~4,800 movies after cleaning)
  • Metadata fields used:
    • Title: Movie name
    • Overview: Plot summary (primary semantic signal)
    • Genres: Action, Drama, Sci-Fi, etc.
    • Keywords: Thematic tags (e.g., "revenge", "space travel")
    • Cast: Top 3 actors (name recognition matters)
    • Crew: Director (auteur theory signal)

Design Decision: Limited cast to top 3 to prevent vector dilution from minor roles.


2. Preprocessing

Text Normalization

# Remove spaces from multi-word names so the vectorizer treats them as
# single tokens: "Christopher Nolan" → "ChristopherNolan"
def collapse_spaces(names):
    return [name.replace(" ", "") for name in names]

Feature Fusion

  • Concatenate all metadata fields into unified "tags" string
  • Split overview into tokens while keeping names intact
  • Combine: overview + genres + keywords + cast + crew

Why this matters: Treating director names as atomic units preserves semantic meaning (e.g., "ChristopherNolan" carries more signal than "Christopher" + "Nolan").
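The fusion step can be sketched with a single hypothetical record (field names mirror the dataset's schema; this is illustrative, not verbatim project code):

```python
# Hypothetical per-movie record after preprocessing: the overview is
# tokenized, and multi-word names are already collapsed to single tokens.
row = {
    "overview": ["a", "thief", "steals", "corporate", "secrets"],
    "genres": ["Action", "ScienceFiction"],
    "keywords": ["dream", "heist"],
    "cast": ["LeonardoDiCaprio"],
    "crew": ["ChristopherNolan"],
}

# Concatenate every field into one unified "tags" string
tags = " ".join(row["overview"] + row["genres"] + row["keywords"]
                + row["cast"] + row["crew"])
```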

Cleaning

  • Drop null values (reduces dataset from 5,000 → 4,803 movies)
  • Parse JSON-formatted fields using ast.literal_eval()
  • Remove English stopwords ("the", "is", "and") to reduce noise
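A minimal sketch of the cleaning steps, run on a tiny inline stand-in for data/tmdb_5000_movies.csv (the real notebook loads the full CSV):

```python
import ast
import pandas as pd

def parse_names(raw):
    # TMDB stores genres/keywords as JSON-formatted strings, e.g.
    # '[{"id": 28, "name": "Action"}]' — parse and keep only the names.
    return [item["name"] for item in ast.literal_eval(raw)]

# Tiny stand-in for the TMDB CSV
movies = pd.DataFrame({
    "title": ["Avatar", "Unknown"],
    "overview": ["A marine on an alien moon.", None],
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
        "[]",
    ],
})

movies = movies.dropna(subset=["overview"])        # rows with null overviews are dropped
movies["genres"] = movies["genres"].apply(parse_names)
```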

3. Feature Engineering

Vectorization Strategy: CountVectorizer

CountVectorizer(max_features=5000, stop_words='english')

Why CountVectorizer over TF-IDF?

  • Simplicity: Raw term frequency captures presence of key features
  • Interpretability: Easy to explain which words drove similarity
  • Performance: Faster computation for precomputed similarity matrix
  • Domain fit: Movie metadata is short and dense (not long documents)

Trade-off acknowledged: TF-IDF would downweight common genre terms, but in movie metadata, genre repetition is actually a strong signal.

Vocabulary Constraints

  • Max features: 5,000 (balances expressiveness vs. sparsity)
  • Stop words: English stopwords removed
  • Output: Sparse matrix of shape (4803, 5000)

Sparsity consideration: Most movies share <5% vocabulary overlap, making cosine similarity effective at distinguishing content.


4. Similarity Computation

Cosine Similarity

similarity = cosine_similarity(vectors)
# Output: (4803, 4803) dense matrix

Why Cosine Similarity?

  • Scale-invariant: Measures angle, not magnitude (long vs. short descriptions don't bias results)
  • Interpretable: Range [0, 1] maps to "completely different" → "identical"
  • Efficient: Dot product of normalized vectors

Alternative considered: Euclidean distance is sensitive to vector magnitude, which would unfairly penalize movies with shorter metadata.

Precomputation Strategy

  • Full similarity matrix computed offline during model training
  • Stored as similarity.pkl (176 MB)
  • Enables O(1) lookup at inference time (just index into matrix)

Trade-off: High memory cost (4803² floats) for instant recommendations. Acceptable for dataset of this size.
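The offline precomputation might look like this sketch (toy data; the pickle call is shown but commented out):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tags = [
    "heist dream Action Thriller ChristopherNolan",
    "dream heist crew Action ChristopherNolan",
    "ship iceberg romance Drama JamesCameron",
]
vectors = CountVectorizer(stop_words="english").fit_transform(tags)
similarity = cosine_similarity(vectors)   # dense, symmetric (n, n) matrix

# Offline persistence (the real pipeline writes the full 4803x4803 matrix):
# import pickle
# pickle.dump(similarity, open("artifacts/similarity.pkl", "wb"))
```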


5. Recommendation Logic

Algorithm

def recommend(movie_title):
    # 1. Find the movie's row index in the dataset
    idx = movies[movies['title'] == movie_title].index[0]

    # 2. Retrieve its precomputed similarity scores
    scores = similarity[idx]

    # 3. Rank all movies by similarity (descending)
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

    # 4. Return the top 10 titles (position 0 is the movie itself)
    return [movies.iloc[i]['title'] for i, _ in ranked[1:11]]

Complexity: O(n log n) for sorting, but only 4,803 items → negligible latency.

Why Top-10?

  • Balances diversity and relevance
  • Prevents overwhelming users with too many options
  • Aligns with typical UI constraints (grid layout)

Evaluation & Design Considerations

VibeStream does not use traditional accuracy metrics (RMSE, precision@k), as recommendation quality is subjective and context-dependent.

Instead, the system is evaluated on:

1. Semantic Coherence

  • Do recommended movies share meaningful themes/genres?
  • Manual inspection of top-10 results for popular titles

2. Determinism

  • Same input always produces same output (no randomness)
  • Critical for debugging and user trust

3. Inference Latency

  • Target: <100ms for recommendation retrieval
  • Achieved: ~10ms (precomputed matrix lookup)

4. Cold-Start Handling

  • Works for any movie in dataset without user history
  • No "new user" problem

5. Explainability

  • Can trace recommendations back to shared keywords/cast/genres
  • Users can understand why a movie was recommended

Limitation acknowledged: No A/B testing or user feedback loop to validate real-world effectiveness.


Tech Stack

Core Language

  • Python 3.9+

ML & NLP

  • scikit-learn: CountVectorizer, cosine_similarity
  • NumPy: Numerical operations on vectors and the similarity matrix
  • Pandas: Data manipulation and cleaning

Data

  • TMDB 5000 Movies Dataset (Kaggle)
  • TMDB API: Poster images (runtime only)

UI

  • Streamlit: Stateless web interface for rapid prototyping
  • Requests: API calls for movie posters

Serialization

  • Pickle: Model persistence (movie_list.pkl, similarity.pkl)

Project Structure

VibeStream/
├── app.py                      # Streamlit application
├── movie_recommendation.ipynb  # EDA + model development
├── requirements.txt            # Python dependencies
├── README.md                   # This file
├── .gitignore                  # Git ignore rules
│
├── data/                       # Raw datasets
│   ├── tmdb_5000_movies.csv
│   └── tmdb_5000_credits.csv
│
└── artifacts/                  # Trained models & preprocessed data
    ├── movie_list.pkl          # Preprocessed movie features
    └── similarity.pkl          # Precomputed similarity matrix

Installation & Usage

Setup

# Clone repository
git clone https://github.com/ladHarsh/VIBESTREAM.git
cd VIBESTREAM

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Run Application

streamlit run app.py

Rebuild Model (Optional)

If you modify the data pipeline:

  1. Open movie_recommendation.ipynb in Jupyter
  2. Run all cells to regenerate .pkl files
  3. Restart Streamlit app

Engineering Learnings

1. Recommendation Without User History

  • Content-based filtering eliminates cold-start problem
  • Trade-off: No personalization or collaborative signals

2. Sparse Vector Spaces

  • 5,000-dimensional vectors with ~95% zeros
  • Cosine similarity handles sparsity naturally (dot product ignores zero dimensions)

3. CountVectorizer vs. TF-IDF

  • CountVectorizer: Better for short, feature-rich metadata
  • TF-IDF: Better for long documents with varying term importance
  • Chose CountVectorizer for interpretability and domain fit

4. Precomputation Strategy

  • Offline computation of similarity matrix trades memory for speed
  • Critical for real-time user experience (<100ms latency)

5. Avoiding Popularity Bias

  • Content-based approach treats all movies equally
  • No "rich get richer" effect from collaborative filtering

6. Feature Engineering Matters

  • Removing spaces from names: "ChristopherNolan" vs. "Christopher" + "Nolan"
  • Limiting cast to top 3 prevents vector dilution
  • Genre as atomic token (not split into words)

Limitations

1. No Personalization

  • Recommendations are identical for all users
  • Cannot adapt to individual preferences over time

2. No Learning from Feedback

  • System is static after training
  • Cannot improve from user interactions (clicks, ratings)

3. Metadata Quality Dependency

  • Poor/missing metadata → poor recommendations
  • Relies on TMDB data completeness

4. Semantic Understanding

  • Bag-of-words model ignores word order and context
  • "not good" treated same as "good"

5. Scalability

  • Full similarity matrix (O(n²) space) doesn't scale to millions of movies
  • Would need approximate nearest neighbors (Annoy, FAISS) for larger datasets

6. Diversity

  • May recommend very similar movies (e.g., all sequels)
  • No diversity penalty or genre balancing

Future Improvements

1. Embedding-Based Models

  • Replace CountVectorizer with sentence transformers (BERT, SBERT)
  • Capture semantic meaning beyond keyword matching
  • Example: "revenge thriller" and "vendetta drama" would be similar

2. Hybrid Recommender

  • Combine content similarity with collaborative filtering
  • Use implicit feedback (views, clicks) to personalize
  • Formula: score = α·content_sim + (1-α)·collab_sim
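That formula translates directly into code; `alpha` here is a hypothetical tuning weight, not something the current codebase defines:

```python
def hybrid_score(content_sim, collab_sim, alpha=0.7):
    # alpha → 1.0 leans entirely on content similarity;
    # alpha → 0.0 leans entirely on collaborative signals.
    # Both inputs are assumed to be normalized to [0, 1].
    return alpha * content_sim + (1 - alpha) * collab_sim
```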

3. User Feedback Loop

  • Collect thumbs up/down on recommendations
  • Fine-tune similarity weights based on feedback
  • A/B test different vectorization strategies

4. Diversity-Aware Ranking

  • Penalize recommendations that are too similar to each other
  • Ensure genre diversity in top-10 results
  • Maximal Marginal Relevance (MMR) algorithm
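A greedy MMR re-ranker can be sketched in a few lines (illustrative only, not part of the current codebase):

```python
def mmr_rerank(query_idx, similarity, k=10, lam=0.7):
    # Maximal Marginal Relevance: greedily pick items that are relevant
    # to the query but not redundant with already-selected items.
    candidates = [i for i in range(len(similarity)) if i != query_idx]
    selected = []
    while candidates and len(selected) < k:
        def mmr(i):
            relevance = similarity[query_idx][i]
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

# Items 1 and 2 are near-duplicates of each other; a plain top-k
# ranking would return both, MMR skips the redundant one.
sim = [
    [1.00, 0.90, 0.85, 0.50],
    [0.90, 1.00, 0.95, 0.10],
    [0.85, 0.95, 1.00, 0.10],
    [0.50, 0.10, 0.10, 1.00],
]
picks = mmr_rerank(0, sim, k=2, lam=0.7)   # [1, 3]: duplicate 2 is skipped
```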

5. Approximate Nearest Neighbors

  • Replace full similarity matrix with ANN index (FAISS, Annoy)
  • Enables scaling to millions of movies
  • Trade slight accuracy for massive speedup
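The query pattern is the same either way; this sketch uses scikit-learn's exact NearestNeighbors as a stand-in, where FAISS or Annoy would swap in an approximate index with sublinear search time:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
vectors = rng.random((100, 50))            # stand-in for movie feature vectors

# Exact k-NN here; an ANN index would be built and queried the same way.
nn = NearestNeighbors(n_neighbors=11, metric="cosine").fit(vectors)
distances, indices = nn.kneighbors(vectors[0:1])
recommendations = indices[0][1:]           # drop the query movie itself
```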

6. Temporal Awareness

  • Boost recent movies or classics based on user preference
  • Decay factor for older movies
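One possible decay function, sketched with hypothetical parameters:

```python
def recency_weight(release_year, current_year=2025, half_life=10.0):
    # Hypothetical exponential decay: a movie's boost halves every
    # `half_life` years; future release dates are clamped to age 0.
    age = max(0, current_year - release_year)
    return 0.5 ** (age / half_life)
```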

Why This Project Matters

VibeStream demonstrates:

  • ML fundamentals: Vectorization, similarity metrics, feature engineering
  • Engineering trade-offs: Memory vs. speed, accuracy vs. interpretability
  • Problem-solving: Solving cold-start without user data
  • Production thinking: Precomputation, caching, latency optimization

This is not a black-box LLM wrapper. Every design decision is intentional and explainable.


Author

Harsh Lad
Applied AI Engineer
📧 harshlad.dev@gmail.com
🔗 https://github.com/ladHarsh
🌐 https://harshlad.vercel.app/


License

MIT License - See LICENSE file for details


VibeStream — Content-based recommendations without the cold-start problem.
