Content-based movie recommendation system using vectorization and cosine similarity.

VibeStream


VibeStream is a content-based movie recommendation system that uses vector-space similarity to recommend movies based on metadata rather than user behavior. The system is designed to solve the cold-start problem without relying on user history or collaborative filtering.


TL;DR

VibeStream converts movie metadata into high-dimensional feature vectors and computes cosine similarity to recommend semantically similar movies. The system prioritizes determinism, explainability, and fast inference over black-box personalization.




Problem Statement

Traditional recommendation systems rely heavily on user interaction data (collaborative filtering), which introduces two major issues:

  • Cold-start problem for new users with no interaction history
  • Echo chambers driven by popularity bias and feedback loops

VibeStream approaches recommendation as a pure content similarity problem, enabling meaningful recommendations without any prior user data. This makes it ideal for:

  • New platforms with limited user data
  • Privacy-conscious applications that avoid tracking
  • Scenarios requiring explainable recommendations

ML Pipeline

1. Dataset

  • TMDB 5000 Movies Dataset (~4,800 movies after cleaning)
  • Metadata fields used:
    • Title: Movie name
    • Overview: Plot summary (primary semantic signal)
    • Genres: Action, Drama, Sci-Fi, etc.
    • Keywords: Thematic tags (e.g., "revenge", "space travel")
    • Cast: Top 3 actors (name recognition matters)
    • Crew: Director (auteur theory signal)

Design Decision: Limited cast to top 3 to prevent vector dilution from minor roles.


2. Preprocessing

Text Normalization

# Remove spaces from multi-word names so the vectorizer treats them as
# single tokens: "Christopher Nolan" → "ChristopherNolan"
def collapse_spaces(names):
    return [name.replace(" ", "") for name in names]

Feature Fusion

  • Concatenate all metadata fields into unified "tags" string
  • Split overview into tokens while keeping names intact
  • Combine: overview + genres + keywords + cast + crew

Why this matters: Treating director names as atomic units preserves semantic meaning (e.g., "ChristopherNolan" carries more signal than "Christopher" + "Nolan").
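The fusion step can be sketched with a single hypothetical record (field names mirror the dataset's schema; this is illustrative, not verbatim project code):

```python
# Hypothetical per-movie record after preprocessing: the overview is
# tokenized, and multi-word names are already collapsed to single tokens.
row = {
    "overview": ["a", "thief", "steals", "corporate", "secrets"],
    "genres": ["Action", "ScienceFiction"],
    "keywords": ["dream", "heist"],
    "cast": ["LeonardoDiCaprio"],
    "crew": ["ChristopherNolan"],
}

# Concatenate every field into one unified "tags" string
tags = " ".join(row["overview"] + row["genres"] + row["keywords"]
                + row["cast"] + row["crew"])
```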

Cleaning

  • Drop null values (reduces dataset from 5,000 → 4,803 movies)
  • Parse JSON-formatted fields using ast.literal_eval()
  • Remove English stopwords ("the", "is", "and") to reduce noise
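A minimal sketch of the cleaning steps, run on a tiny inline stand-in for data/tmdb_5000_movies.csv (the real notebook loads the full CSV):

```python
import ast
import pandas as pd

def parse_names(raw):
    # TMDB stores genres/keywords as JSON-formatted strings, e.g.
    # '[{"id": 28, "name": "Action"}]' — parse and keep only the names.
    return [item["name"] for item in ast.literal_eval(raw)]

# Tiny stand-in for the TMDB CSV
movies = pd.DataFrame({
    "title": ["Avatar", "Unknown"],
    "overview": ["A marine on an alien moon.", None],
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
        "[]",
    ],
})

movies = movies.dropna(subset=["overview"])        # rows with null overviews are dropped
movies["genres"] = movies["genres"].apply(parse_names)
```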

3. Feature Engineering

Vectorization Strategy: CountVectorizer

CountVectorizer(max_features=5000, stop_words='english')

Why CountVectorizer over TF-IDF?

  • Simplicity: Raw term frequency captures presence of key features
  • Interpretability: Easy to explain which words drove similarity
  • Performance: Faster computation for precomputed similarity matrix
  • Domain fit: Movie metadata is short and dense (not long documents)

Trade-off acknowledged: TF-IDF would downweight common genre terms, but in movie metadata, genre repetition is actually a strong signal.

Vocabulary Constraints

  • Max features: 5,000 (balances expressiveness vs. sparsity)
  • Stop words: English stopwords removed
  • Output: Sparse matrix of shape (4803, 5000)

Sparsity consideration: Most movies share <5% vocabulary overlap, making cosine similarity effective at distinguishing content.


4. Similarity Computation

Cosine Similarity

similarity = cosine_similarity(vectors)
# Output: (4803, 4803) dense matrix

Why Cosine Similarity?

  • Scale-invariant: Measures angle, not magnitude (long vs. short descriptions don't bias results)
  • Interpretable: Range [0, 1] maps to "completely different" → "identical"
  • Efficient: Dot product of normalized vectors

Alternative considered: Euclidean distance is sensitive to vector magnitude, which would unfairly penalize movies with shorter metadata.

Precomputation Strategy

  • Full similarity matrix computed offline during model training
  • Stored as similarity.pkl (176 MB)
  • Enables O(1) lookup at inference time (just index into matrix)

Trade-off: High memory cost (4803² floats) for instant recommendations. Acceptable for dataset of this size.
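The offline precomputation might look like this sketch (toy data; the pickle call is shown but commented out):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tags = [
    "heist dream Action Thriller ChristopherNolan",
    "dream heist crew Action ChristopherNolan",
    "ship iceberg romance Drama JamesCameron",
]
vectors = CountVectorizer(stop_words="english").fit_transform(tags)
similarity = cosine_similarity(vectors)   # dense, symmetric (n, n) matrix

# Offline persistence (the real pipeline writes the full 4803x4803 matrix):
# import pickle
# pickle.dump(similarity, open("artifacts/similarity.pkl", "wb"))
```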


5. Recommendation Logic

Algorithm

def recommend(movie_title):
    # 1. Find the movie's row index in the dataset
    idx = movies[movies['title'] == movie_title].index[0]

    # 2. Retrieve its precomputed similarity scores
    scores = similarity[idx]

    # 3. Rank all movies by similarity (descending)
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

    # 4. Return the top 10 titles (position 0 is the movie itself)
    return [movies.iloc[i]['title'] for i, _ in ranked[1:11]]

Complexity: O(n log n) for sorting, but only 4,803 items → negligible latency.

Why Top-10?

  • Balances diversity and relevance
  • Prevents overwhelming users with too many options
  • Aligns with typical UI constraints (grid layout)

Evaluation & Design Considerations

VibeStream does not use traditional accuracy metrics (RMSE, precision@k), as recommendation quality is subjective and context-dependent.

Instead, the system is evaluated on:

1. Semantic Coherence

  • Do recommended movies share meaningful themes/genres?
  • Manual inspection of top-10 results for popular titles

2. Determinism

  • Same input always produces same output (no randomness)
  • Critical for debugging and user trust

3. Inference Latency

  • Target: <100ms for recommendation retrieval
  • Achieved: ~10ms (precomputed matrix lookup)

4. Cold-Start Handling

  • Works for any movie in dataset without user history
  • No "new user" problem

5. Explainability

  • Can trace recommendations back to shared keywords/cast/genres
  • Users can understand why a movie was recommended

Limitation acknowledged: No A/B testing or user feedback loop to validate real-world effectiveness.


Tech Stack

Core Language

  • Python 3.9+

ML & NLP

  • scikit-learn: CountVectorizer, cosine_similarity
  • NumPy: Numerical operations on vectors and the similarity matrix
  • Pandas: Data manipulation and cleaning

Data

  • TMDB 5000 Movies Dataset (Kaggle)
  • TMDB API: Poster images (runtime only)

UI

  • Streamlit: Stateless web interface for rapid prototyping
  • Requests: API calls for movie posters

Serialization

  • Pickle: Model persistence (movie_list.pkl, similarity.pkl)

Project Structure

VibeStream/
├── app.py                      # Streamlit application
├── movie_recommendation.ipynb  # EDA + model development
├── requirements.txt            # Python dependencies
├── README.md                   # This file
├── .gitignore                  # Git ignore rules
│
├── data/                       # Raw datasets
│   ├── tmdb_5000_movies.csv
│   └── tmdb_5000_credits.csv
│
└── artifacts/                  # Trained models & preprocessed data
    ├── movie_list.pkl          # Preprocessed movie features
    └── similarity.pkl          # Precomputed similarity matrix

Installation & Usage

Setup

# Clone repository
git clone https://github.com/ladHarsh/VIBESTREAM.git
cd VIBESTREAM

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Run Application

streamlit run app.py

Rebuild Model (Optional)

If you modify the data pipeline:

  1. Open movie_recommendation.ipynb in Jupyter
  2. Run all cells to regenerate .pkl files
  3. Restart Streamlit app

Engineering Learnings

1. Recommendation Without User History

  • Content-based filtering eliminates cold-start problem
  • Trade-off: No personalization or collaborative signals

2. Sparse Vector Spaces

  • 5,000-dimensional vectors with ~95% zeros
  • Cosine similarity handles sparsity naturally (dot product ignores zero dimensions)

3. CountVectorizer vs. TF-IDF

  • CountVectorizer: Better for short, feature-rich metadata
  • TF-IDF: Better for long documents with varying term importance
  • Chose CountVectorizer for interpretability and domain fit

4. Precomputation Strategy

  • Offline computation of similarity matrix trades memory for speed
  • Critical for real-time user experience (<100ms latency)

5. Avoiding Popularity Bias

  • Content-based approach treats all movies equally
  • No "rich get richer" effect from collaborative filtering

6. Feature Engineering Matters

  • Removing spaces from names: "ChristopherNolan" vs. "Christopher" + "Nolan"
  • Limiting cast to top 3 prevents vector dilution
  • Genre as atomic token (not split into words)

Limitations

1. No Personalization

  • Recommendations are identical for all users
  • Cannot adapt to individual preferences over time

2. No Learning from Feedback

  • System is static after training
  • Cannot improve from user interactions (clicks, ratings)

3. Metadata Quality Dependency

  • Poor/missing metadata → poor recommendations
  • Relies on TMDB data completeness

4. Semantic Understanding

  • Bag-of-words model ignores word order and context
  • "not good" treated same as "good"

5. Scalability

  • Full similarity matrix (O(n²) space) doesn't scale to millions of movies
  • Would need approximate nearest neighbors (Annoy, FAISS) for larger datasets

6. Diversity

  • May recommend very similar movies (e.g., all sequels)
  • No diversity penalty or genre balancing

Future Improvements

1. Embedding-Based Models

  • Replace CountVectorizer with sentence transformers (BERT, SBERT)
  • Capture semantic meaning beyond keyword matching
  • Example: "revenge thriller" and "vendetta drama" would be similar

2. Hybrid Recommender

  • Combine content similarity with collaborative filtering
  • Use implicit feedback (views, clicks) to personalize
  • Formula: score = α·content_sim + (1-α)·collab_sim
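That formula translates directly into code; `alpha` here is a hypothetical tuning weight, not something the current codebase defines:

```python
def hybrid_score(content_sim, collab_sim, alpha=0.7):
    # alpha → 1.0 leans entirely on content similarity;
    # alpha → 0.0 leans entirely on collaborative signals.
    # Both inputs are assumed to be normalized to [0, 1].
    return alpha * content_sim + (1 - alpha) * collab_sim
```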

3. User Feedback Loop

  • Collect thumbs up/down on recommendations
  • Fine-tune similarity weights based on feedback
  • A/B test different vectorization strategies

4. Diversity-Aware Ranking

  • Penalize recommendations that are too similar to each other
  • Ensure genre diversity in top-10 results
  • Maximal Marginal Relevance (MMR) algorithm
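A greedy MMR re-ranker can be sketched in a few lines (illustrative only, not part of the current codebase):

```python
def mmr_rerank(query_idx, similarity, k=10, lam=0.7):
    # Maximal Marginal Relevance: greedily pick items that are relevant
    # to the query but not redundant with already-selected items.
    candidates = [i for i in range(len(similarity)) if i != query_idx]
    selected = []
    while candidates and len(selected) < k:
        def mmr(i):
            relevance = similarity[query_idx][i]
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

# Items 1 and 2 are near-duplicates of each other; a plain top-k
# ranking would return both, MMR skips the redundant one.
sim = [
    [1.00, 0.90, 0.85, 0.50],
    [0.90, 1.00, 0.95, 0.10],
    [0.85, 0.95, 1.00, 0.10],
    [0.50, 0.10, 0.10, 1.00],
]
picks = mmr_rerank(0, sim, k=2, lam=0.7)   # [1, 3]: duplicate 2 is skipped
```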

5. Approximate Nearest Neighbors

  • Replace full similarity matrix with ANN index (FAISS, Annoy)
  • Enables scaling to millions of movies
  • Trade slight accuracy for massive speedup
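The query pattern is the same either way; this sketch uses scikit-learn's exact NearestNeighbors as a stand-in, where FAISS or Annoy would swap in an approximate index with sublinear search time:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
vectors = rng.random((100, 50))            # stand-in for movie feature vectors

# Exact k-NN here; an ANN index would be built and queried the same way.
nn = NearestNeighbors(n_neighbors=11, metric="cosine").fit(vectors)
distances, indices = nn.kneighbors(vectors[0:1])
recommendations = indices[0][1:]           # drop the query movie itself
```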

6. Temporal Awareness

  • Boost recent movies or classics based on user preference
  • Decay factor for older movies
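One possible decay function, sketched with hypothetical parameters:

```python
def recency_weight(release_year, current_year=2025, half_life=10.0):
    # Hypothetical exponential decay: a movie's boost halves every
    # `half_life` years; future release dates are clamped to age 0.
    age = max(0, current_year - release_year)
    return 0.5 ** (age / half_life)
```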

Why This Project Matters

VibeStream demonstrates:

  • ML fundamentals: Vectorization, similarity metrics, feature engineering
  • Engineering trade-offs: Memory vs. speed, accuracy vs. interpretability
  • Problem-solving: Solving cold-start without user data
  • Production thinking: Precomputation, caching, latency optimization

This is not a black-box LLM wrapper. Every design decision is intentional and explainable.


Author

Harsh Lad
Applied AI Engineer
📧 harshlad.dev@gmail.com
🔗 https://github.com/ladHarsh
🌐 https://harshlad.vercel.app/


License

MIT License - See LICENSE file for details


VibeStream — Content-based recommendations without the cold-start problem.
