VibeStream is a content-based movie recommendation system that uses vector-space similarity to recommend movies based on metadata rather than user behavior. The system is designed to solve the cold-start problem without relying on user history or collaborative filtering.
VibeStream converts movie metadata into high-dimensional feature vectors and computes cosine similarity to recommend semantically similar movies. The system prioritizes determinism, explainability, and fast inference over black-box personalization.
- Video Demo: https://youtu.be/0zz2Y1T1cqg
- Live Application: https://vibestream.streamlit.app/
- Portfolio: https://harshlad.vercel.app/
Traditional recommendation systems rely heavily on user interaction data (collaborative filtering), which introduces two major issues:
- Cold-start problem for new users with no interaction history
- Echo chambers driven by popularity bias and feedback loops
VibeStream approaches recommendation as a pure content similarity problem, enabling meaningful recommendations without any prior user data. This makes it ideal for:
- New platforms with limited user data
- Privacy-conscious applications that avoid tracking
- Scenarios requiring explainable recommendations
- TMDB 5000 Movies Dataset (~4,800 movies after cleaning)
- Metadata fields used:
- Title: Movie name
- Overview: Plot summary (primary semantic signal)
- Genres: Action, Drama, Sci-Fi, etc.
- Keywords: Thematic tags (e.g., "revenge", "space travel")
- Cast: Top 3 actors (name recognition matters)
- Crew: Director (auteur theory signal)
Design Decision: Limited cast to top 3 to prevent vector dilution from minor roles.
# Remove spaces from multi-word names to treat them as single tokens
# "Christopher Nolan" → "ChristopherNolan"
# Prevents vectorizer from splitting director names

- Concatenate all metadata fields into a unified "tags" string
- Split overview into tokens while keeping names intact
- Combine: overview + genres + keywords + cast + crew
Why this matters: Treating director names as atomic units preserves semantic meaning (e.g., "ChristopherNolan" carries more signal than "Christopher" + "Nolan").
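A minimal sketch of this step, assuming the genre/keyword/cast/crew fields have already been parsed into Python lists (the column and helper names here are illustrative, not necessarily the notebook's):

```python
def collapse_spaces(names):
    # "Christopher Nolan" -> "ChristopherNolan" so the vectorizer keeps the name whole
    return [name.replace(" ", "") for name in names]

for col in ['genres', 'keywords', 'cast', 'crew']:
    movies[col] = movies[col].apply(collapse_spaces)

# Unified "tags" column: overview tokens + atomic genre/keyword/name tokens
movies['tags'] = (
    movies['overview'].apply(str.split)
    + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
)
movies['tags'] = movies['tags'].apply(lambda tokens: " ".join(tokens).lower())
```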
- Drop null values (reduces dataset from 5,000 → 4,803 movies)
- Parse JSON-formatted fields using ast.literal_eval()
- Remove English stopwords ("the", "is", "and") to reduce noise
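A hedged sketch of the parsing step. The TMDB CSVs store these fields as JSON-like strings of dicts with a "name" key (and a "job" key for crew); the helper names below are illustrative:

```python
import ast

def parse_names(json_str, limit=None):
    # '[{"id": 28, "name": "Action"}, ...]' -> ["Action", ...]
    names = [item['name'] for item in ast.literal_eval(json_str)]
    return names[:limit] if limit else names

def parse_director(json_str):
    # Keep only the director from the full crew list
    return [m['name'] for m in ast.literal_eval(json_str) if m.get('job') == 'Director']

movies.dropna(inplace=True)                                                # ~5,000 -> 4,803 rows
movies['genres'] = movies['genres'].apply(parse_names)
movies['keywords'] = movies['keywords'].apply(parse_names)
movies['cast'] = movies['cast'].apply(lambda s: parse_names(s, limit=3))   # top 3 actors only
movies['crew'] = movies['crew'].apply(parse_director)
```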
CountVectorizer(max_features=5000, stop_words='english')

Why CountVectorizer over TF-IDF?
- Simplicity: Raw term frequency captures presence of key features
- Interpretability: Easy to explain which words drove similarity
- Performance: Faster computation for precomputed similarity matrix
- Domain fit: Movie metadata is short and dense (not long documents)
Trade-off acknowledged: TF-IDF would downweight common genre terms, but in movie metadata, genre repetition is actually a strong signal.
- Max features: 5,000 (balances expressiveness vs. sparsity)
- Stop words: English stopwords removed
- Output: Sparse matrix of shape (4803, 5000)
Sparsity consideration: Most movies share <5% vocabulary overlap, making cosine similarity effective at distinguishing content.
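A sketch of the vectorization step, assuming the tags column built earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(movies['tags'])        # sparse bag-of-words matrix

print(vectors.shape)                              # (4803, 5000)
print(f"non-zero entries: {vectors.nnz / (vectors.shape[0] * vectors.shape[1]):.2%}")
```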
similarity = cosine_similarity(vectors)
# Output: (4803, 4803) dense matrix

Why Cosine Similarity?
- Scale-invariant: Measures angle, not magnitude (long vs. short descriptions don't bias results)
- Interpretable: Range [0, 1] maps to "completely different" → "identical"
- Efficient: Dot product of normalized vectors
Alternative considered: Euclidean distance is sensitive to vector magnitude, which would unfairly penalize movies with shorter metadata.
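A sketch that makes the "dot product of normalized vectors" point concrete, using the count vectors from the step above:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

similarity = cosine_similarity(vectors)            # dense (4803, 4803), values in [0, 1]

# Same result as L2-normalizing each row and taking pairwise dot products
unit = normalize(vectors)                          # each row scaled to unit length
assert np.allclose(similarity, (unit @ unit.T).toarray(), atol=1e-8)
```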
- Full similarity matrix computed offline during model training
- Stored as similarity.pkl (176 MB)
- Enables O(1) lookup at inference time (just index into the matrix)
Trade-off: High memory cost (4803² floats) for instant recommendations. Acceptable for dataset of this size.
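A sketch of the offline precompute-and-serialize step; the paths match the project layout below, though the exact serialization details may differ from the notebook:

```python
import pickle

# Offline: compute once during "training", then ship the artifacts with the app
with open('artifacts/movie_list.pkl', 'wb') as f:
    pickle.dump(movies, f)
with open('artifacts/similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)

# Online (app startup): load once; each recommendation is then a single row lookup
with open('artifacts/similarity.pkl', 'rb') as f:
    similarity = pickle.load(f)
```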
def recommend(movie_title):
    # 1. Find the movie's index in the dataset
    idx = movies[movies['title'] == movie_title].index[0]
    # 2. Retrieve precomputed similarity scores
    scores = similarity[idx]
    # 3. Sort by similarity (descending)
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    # 4. Return top 10 (excluding the movie itself at position 0)
    return ranked[1:11]

Complexity: O(n log n) for sorting, but only 4,803 items → negligible latency.
- Balances diversity and relevance
- Prevents overwhelming users with too many options
- Aligns with typical UI constraints (grid layout)
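A usage sketch, assuming the movies DataFrame and similarity matrix are loaded as above (any title present in the dataset works):

```python
for idx, score in recommend("Inception"):
    print(f"{movies.iloc[idx]['title']}  (similarity: {score:.3f})")
```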
VibeStream does not use traditional accuracy metrics (RMSE, precision@k), as recommendation quality is subjective and context-dependent.
Instead, the system is evaluated on:
- Semantic coherence: Do recommended movies share meaningful themes/genres? Checked by manual inspection of top-10 results for popular titles.
- Determinism: The same input always produces the same output (no randomness), which is critical for debugging and user trust.
- Latency: Target <100ms for recommendation retrieval; achieved ~10ms via the precomputed matrix lookup.
- Cold-start handling: Works for any movie in the dataset without user history; there is no "new user" problem.
- Explainability: Recommendations can be traced back to shared keywords/cast/genres, so users can understand why a movie was recommended (see the sketch below).
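As a hedged illustration of the explainability point, a hypothetical helper (not part of the repo) that lists the vocabulary terms two movies share, given the fitted CountVectorizer and count vectors:

```python
import numpy as np

def shared_tokens(i, j, cv, vectors, top_k=10):
    # Terms present in both movies' tag vectors, i.e. what drove their cosine similarity
    vocab = np.array(cv.get_feature_names_out())
    overlap = vectors[i].multiply(vectors[j]).toarray().ravel()
    top = overlap.argsort()[::-1][:top_k]
    return [vocab[t] for t in top if overlap[t] > 0]
```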
Limitation acknowledged: No A/B testing or user feedback loop to validate real-world effectiveness.
- Python 3.9+
- scikit-learn: CountVectorizer, cosine_similarity
- NumPy: Numerical operations on sparse matrices
- Pandas: Data manipulation and cleaning
- TMDB 5000 Movies Dataset (Kaggle)
- TMDB API: Poster images (runtime only)
- Streamlit: Stateless web interface for rapid prototyping
- Requests: API calls for movie posters
- Pickle: Model persistence (movie_list.pkl, similarity.pkl)
VibeStream/
├── app.py # Streamlit application
├── movie_recommendation.ipynb # EDA + model development
├── requirements.txt # Python dependencies
├── README.md # This file
├── .gitignore # Git ignore rules
│
├── data/ # Raw datasets
│ ├── tmdb_5000_movies.csv
│ └── tmdb_5000_credits.csv
│
└── artifacts/ # Trained models & preprocessed data
├── movie_list.pkl # Preprocessed movie features
└── similarity.pkl # Precomputed similarity matrix
# Clone repository
git clone https://github.com/ladHarsh/VIBESTREAM.git
cd VIBESTREAM
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

# Run the Streamlit app
streamlit run app.py

If you modify the data pipeline:
- Open movie_recommendation.ipynb in Jupyter
- Run all cells to regenerate the .pkl files
- Restart the Streamlit app
- Content-based filtering eliminates cold-start problem
- Trade-off: No personalization or collaborative signals
- 5,000-dimensional vectors with ~95% zeros
- Cosine similarity handles sparsity naturally (dot product ignores zero dimensions)
- CountVectorizer: Better for short, feature-rich metadata
- TF-IDF: Better for long documents with varying term importance
- Chose CountVectorizer for interpretability and domain fit
- Offline computation of similarity matrix trades memory for speed
- Critical for real-time user experience (<100ms latency)
- Content-based approach treats all movies equally
- No "rich get richer" effect from collaborative filtering
- Removing spaces from names: "ChristopherNolan" vs. "Christopher" + "Nolan"
- Limiting cast to top 3 prevents vector dilution
- Genre as atomic token (not split into words)
- Recommendations are identical for all users
- Cannot adapt to individual preferences over time
- System is static after training
- Cannot improve from user interactions (clicks, ratings)
- Poor/missing metadata → poor recommendations
- Relies on TMDB data completeness
- Bag-of-words model ignores word order and context
- "not good" treated same as "good"
- Full similarity matrix (O(n²) space) doesn't scale to millions of movies
- Would need approximate nearest neighbors (Annoy, FAISS) for larger datasets
- May recommend very similar movies (e.g., all sequels)
- No diversity penalty or genre balancing
- Replace CountVectorizer with sentence transformers (BERT, SBERT)
- Capture semantic meaning beyond keyword matching
- Example: "revenge thriller" and "vendetta drama" would be similar
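A hedged sketch of what that upgrade could look like (the model name is just an example; none of this is part of the current repo):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')            # any compact SBERT model
embeddings = model.encode(movies['overview'].tolist())     # dense semantic vectors
semantic_sim = cosine_similarity(embeddings)               # "revenge thriller" ≈ "vendetta drama"
```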
- Combine content similarity with collaborative filtering
- Use implicit feedback (views, clicks) to personalize
- Formula: score = α·content_sim + (1-α)·collab_sim
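A sketch of the blend; α would be tuned, and α = 1 recovers the current pure content-based system:

```python
def hybrid_score(content_sim, collab_sim, alpha=0.7):
    # Weighted blend of content-based and collaborative similarity scores
    return alpha * content_sim + (1 - alpha) * collab_sim
```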
- Collect thumbs up/down on recommendations
- Fine-tune similarity weights based on feedback
- A/B test different vectorization strategies
- Penalize recommendations that are too similar to each other
- Ensure genre diversity in top-10 results
- Maximal Marginal Relevance (MMR) algorithm
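A hedged sketch of MMR re-ranking over the existing similarity matrix (parameter values are illustrative):

```python
def mmr_rerank(query_idx, similarity, k=10, lam=0.7):
    # Maximal Marginal Relevance: balance relevance to the query against redundancy
    # with items already selected (lam=1.0 reduces to plain top-k by similarity)
    candidates = [i for i in range(similarity.shape[0]) if i != query_idx]
    selected = []
    while candidates and len(selected) < k:
        def mmr(i):
            relevance = similarity[query_idx, i]
            redundancy = max((similarity[i, j] for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```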
- Replace full similarity matrix with ANN index (FAISS, Annoy)
- Enables scaling to millions of movies
- Trade slight accuracy for massive speedup
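A hedged sketch using Annoy (FAISS would look similar); it assumes the count vectors from above and trades exact ranking for sub-linear lookups:

```python
from annoy import AnnoyIndex

dim = vectors.shape[1]                      # 5,000 vocabulary dimensions
index = AnnoyIndex(dim, 'angular')          # angular distance is monotone in cosine similarity
for i in range(vectors.shape[0]):
    index.add_item(i, vectors[i].toarray().ravel())
index.build(50)                             # 50 trees: more trees = better recall, slower build
index.save('artifacts/movies.ann')

# Approximate top-10 neighbors for movie 0 (the first hit is the movie itself)
neighbors = index.get_nns_by_item(0, 11)[1:]
```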
- Boost recent movies or classics based on user preference
- Decay factor for older movies
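One possible decay, as a hypothetical sketch (not part of the current system):

```python
from datetime import date

def recency_weight(release_year, half_life=15):
    # Exponential decay: a movie's boost halves every `half_life` years
    age = date.today().year - release_year
    return 0.5 ** (age / half_life)

# e.g. multiply into the final score: score * recency_weight(year)
```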
VibeStream demonstrates:
- ML fundamentals: Vectorization, similarity metrics, feature engineering
- Engineering trade-offs: Memory vs. speed, accuracy vs. interpretability
- Problem-solving: Solving cold-start without user data
- Production thinking: Precomputation, caching, latency optimization
This is not a black-box LLM wrapper. Every design decision is intentional and explainable.
Harsh Lad
Applied AI Engineer
📧 harshlad.dev@gmail.com
🔗 https://github.com/ladHarsh
🌐 https://harshlad.vercel.app/
MIT License - See LICENSE file for details
VibeStream — Content-based recommendations without the cold-start problem.