Both anime.csv and rating_complete.csv datasets are clean, consistent, and well-structured.
There are no missing or unrated values, and the data distribution reflects typical user behavior —
a strong preference for popular titles and high average ratings.
These characteristics make the dataset well-suited for developing a hybrid recommendation system
that combines Collaborative Filtering (CF) with Content-Based (CB) embeddings.

What it does
- Reads `anime.csv` and normalizes the identifier `MAL_ID` to `int`.
- Fills missing values:
  - Text/multi-valued fields → `""` (empty string, so TF-IDF won't crash)
  - Categorical fields (`Type`, `Source`, `Rating`, `Premiered`, `Duration`) → `"Unknown"`
- Ensures all downstream transformers receive valid inputs (see the sketch below).
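A minimal sketch of this step; the file and column names follow the dataset described in this document:

```python
import pandas as pd

# Load the metadata and normalize the join key.
meta_df = pd.read_csv("anime.csv")
meta_df["MAL_ID"] = meta_df["MAL_ID"].astype(int)

# Text / multi-valued fields: empty strings keep TF-IDF from crashing.
text_cols = ["Genres", "Producers", "Studios"]
meta_df[text_cols] = meta_df[text_cols].fillna("")

# Categorical fields: a sentinel value instead of NaN.
cat_cols = ["Type", "Source", "Rating", "Premiered", "Duration"]
meta_df[cat_cols] = meta_df[cat_cols].fillna("Unknown")
```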
Why it matters
- Vectorizers/encoders don't accept `NaN`.
- Consistent typing on IDs prevents silent mismatches in later joins.
Pitfalls & mitigations
- Ambiguous categories: using `"Unknown"` groups all missing categories together — fine for modeling, but call it out in limitations.
- Unicode & punctuation: TF-IDF tokenization later uses a strict pattern; commas/spaces separate tokens.
Quick checks
- `meta_df.isna().sum()`
- `meta_df['MAL_ID'].dtype`

What it does
- Applies `LabelEncoder` to: `Type`, `Source`, `Rating`, `Premiered`, `Duration`.
- Stores each fitted encoder in `label_encoders` for reproducibility/inference.
Why it matters
- Tree models and similarity pipelines consume numeric inputs.
- Persisted encoders allow consistent mapping when new/held-out items arrive.
Pitfalls & mitigations
- Unseen categories at inference: `LabelEncoder` will error if a new label appears.
  - Option 1️⃣: Pre-map unknowns to a reserved index (see the sketch below).
  - Option 2️⃣: Refit encoders on the union of train + incoming batch (document this policy).
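A minimal sketch of the encoding plus Option 1; `encode_with_unknown` is a hypothetical helper, not part of the original pipeline:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit one encoder per categorical column and keep it for inference.
label_encoders = {}
for col in ["Type", "Source", "Rating", "Premiered", "Duration"]:
    le = LabelEncoder()
    meta_df[f"{col}_encoded"] = le.fit_transform(meta_df[col])
    label_encoders[col] = le

def encode_with_unknown(le, values, unknown_index=-1):
    """Option 1 (hypothetical): map labels unseen at fit time to a reserved index."""
    mapping = {cls: idx for idx, cls in enumerate(le.classes_)}
    return np.array([mapping.get(v, unknown_index) for v in values])

# e.g. encode_with_unknown(label_encoders["Type"], ["TV", "SomeNewType"])
```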
Quick checks
- `meta_df.filter(like='_encoded').nunique()`

What it does
- Runs TF-IDF with:
  - `token_pattern=r'[^, ]+'` (split by commas/spaces)
  - `stop_words='english'`
  - `max_features=100` per field.
- Produces three dense matrices with interpretable columns like
  `Genre_Action`, `Prod_Aniplex`, `Studio_KyoAni` (see the sketch below).
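A minimal sketch under these settings; the `tfidf_field` helper and the exact feature-name prefixes are illustrative assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_field(series, prefix, max_features=100):
    """Vectorize a comma-separated multi-label field into a dense frame."""
    vec = TfidfVectorizer(token_pattern=r"[^, ]+",
                          stop_words="english",
                          max_features=max_features)
    mat = vec.fit_transform(series)
    cols = [f"{prefix}_{name}" for name in vec.get_feature_names_out()]
    return vec, pd.DataFrame(mat.toarray(), columns=cols, index=series.index)

vec_genres, tfidf_genres = tfidf_field(meta_df["Genres"], "Genre")
vec_producers, tfidf_producers = tfidf_field(meta_df["Producers"], "Prod")
vec_studios, tfidf_studios = tfidf_field(meta_df["Studios"], "Studio")
```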
Why it matters
- Converts multi-label text fields into semantic vectors.
- Capped dimensionality reduces memory usage and speeds up cosine similarity later.
Pitfalls & mitigations
- Vocabulary drift: save the vectorizers for inference (`tfidf_and_encoders.pkl`) to ensure a consistent feature space.
- Over-sparsity: using `max_features=100` balances expressiveness and compute efficiency.
Quick checks
- `tfidf_genres.shape[1] <= 100` (`max_features` is an upper bound; a field with fewer unique tokens yields fewer columns)
- `vec_genres.get_feature_names_out()[:10]`

What it does
- Casts numeric columns
  `['Score', 'Episodes', 'Ranked', 'Popularity', 'Members', 'Favorites']`
  to numeric (`errors='coerce'`).
- Fills remaining `NaN`s with column means.
- Applies `MinMaxScaler` to range `[0, 1]` (see the sketch below).
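A minimal sketch of this step, assuming the pipeline mutates `meta_df` in place:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

numeric_cols = ['Score', 'Episodes', 'Ranked', 'Popularity', 'Members', 'Favorites']

# Coerce non-numeric entries (e.g., "Unknown") to NaN, then mean-impute.
meta_df[numeric_cols] = meta_df[numeric_cols].apply(pd.to_numeric, errors="coerce")
meta_df[numeric_cols] = meta_df[numeric_cols].fillna(meta_df[numeric_cols].mean())

# Squash everything into [0, 1] so no single feature dominates distances.
scaler = MinMaxScaler()
meta_df[numeric_cols] = scaler.fit_transform(meta_df[numeric_cols])
```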
Why it matters
- Mixed-scale features (e.g., `Members` in millions vs. `Score` ~ [0–10]) can distort distance metrics and degrade model performance.
- Normalization stabilizes training and improves cosine similarity reliability.
Pitfalls & mitigations
- Outliers: `MinMaxScaler` is sensitive — if the distribution is heavy-tailed (e.g., `Members`), consider `RobustScaler` in future.
- Imputation bias: mean imputation is simple but can bias toward central values; mention this in your limitations section.
Quick checks
- `meta_df[numeric_cols].min().ge(0).all()`
- `meta_df[numeric_cols].max().le(1).all()`

What it does
- Concatenates:
  - `MAL_ID`
  - Scaled numerics
  - Encoded categoricals
  - TF-IDF matrices (`Genres`, `Producers`, `Studios`)
- Produces a dense, model-ready Content-Based (CB) vector per anime (see the sketch below).
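A minimal assembly sketch reusing names from the sketches above; the exact column selection in the original pipeline may differ:

```python
import pandas as pd

# All frames share meta_df's index, so axis=1 concat stays row-aligned.
encoded_cols = [c for c in meta_df.columns if c.endswith("_encoded")]
meta_processed = pd.concat(
    [meta_df[["MAL_ID"]],
     meta_df[numeric_cols],        # scaled in the previous step
     meta_df[encoded_cols],
     tfidf_genres, tfidf_producers, tfidf_studios],
    axis=1,
)
assert meta_processed.shape[0] == meta_df.shape[0]
```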
Why it matters
- Combines all preprocessed data sources into a single, unified representation.
- This unified table can be directly used for:
  - Cosine similarity computation between items
  - Feature input for the hybrid meta-learner
Pitfalls & mitigations
- Column alignment: ensure consistent row alignment when concatenating multiple feature sets; `pd.concat(..., axis=1)` relies on aligned indices.
- Shape mismatch: verify that TF-IDF outputs and encoded numerics have identical row counts.
Quick checks
- `meta_processed.shape[0] == meta_df.shape[0]`

What it does
- Reads `rating_complete.csv` and casts `anime_id` to `int`.
- Intersects `MAL_ID` from the metadata with `anime_id` from the ratings file.
- Keeps only the overlapping records — ensures every CB vector has a corresponding CF entry.
- Logs the number of matched items for transparency (see the sketch below).
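A minimal sketch of the intersection step:

```python
import pandas as pd

rating_df = pd.read_csv("rating_complete.csv")
rating_df["anime_id"] = rating_df["anime_id"].astype(int)

# Keep only items that exist on both sides of the hybrid.
common_ids = set(meta_processed["MAL_ID"]) & set(rating_df["anime_id"])
meta_processed = meta_processed[meta_processed["MAL_ID"].isin(common_ids)]
rating_df = rating_df[rating_df["anime_id"].isin(common_ids)]

print(f"matched items: {len(common_ids)} / {meta_df['MAL_ID'].nunique()}")
```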
Why it matters
- Prevents cold-start leakage during hybrid model training and evaluation.
- Guarantees 1:1 correspondence between items in the CB feature matrix
and those in the CF (Collaborative Filtering) dataset.
Pitfalls & mitigations
- Coverage drop: titles without rating data will be dropped from the hybrid dataset. It's recommended to log the retained ratio for your report (e.g., `matched / total`).
- ID mismatch: ensure both sides (`MAL_ID`, `anime_id`) are of the same integer type to avoid silent filtering errors.
Quick checks
- `(meta_processed['MAL_ID'].isin(rating_df['anime_id'])).all()`

What it does
- Saves (a serialization sketch follows below):
  - `meta_preprocessed.csv` → the final, dense CB feature table
  - `tfidf_and_encoders.pkl` → serialized preprocessing objects containing:
    - `label_encoders` (for categorical columns)
    - `vec_genres`, `vec_producers`, `vec_studios`
    - `scaler`
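A minimal serialization sketch; bundling everything into a single dict is an assumption about the pickle's internal layout:

```python
import pickle

# Persist the dense CB feature table.
meta_processed.to_csv("meta_preprocessed.csv", index=False)

# Bundle all fitted preprocessing objects for train/serve consistency.
artifacts = {
    "label_encoders": label_encoders,
    "vec_genres": vec_genres,
    "vec_producers": vec_producers,
    "vec_studios": vec_studios,
    "scaler": scaler,
}
with open("tfidf_and_encoders.pkl", "wb") as f:
    pickle.dump(artifacts, f)
```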
Why it matters
- Ensures identical preprocessing during both training and inference.
- Prevents train–serve feature drift by preserving:
- TF-IDF vocabularies
- LabelEncoder mappings
- Scaler parameters
Pitfalls & mitigations
- Always reload the pickle and test
transform()on a sample before deployment
to confirm version compatibility and consistent output dimensions.
Quick checks
- `import pickle`
- `obj = pickle.load(open("tfidf_and_encoders.pkl", "rb"))`
- `type(obj)`

- `rating_complete.csv`
  - Columns: `user_id`, `anime_id`, `rating`
  - Only ratings > 0 are used.
- `rating_test.csv`
  - Used for baseline CF models (e.g., SVD, NeuMF).
- `meta_preprocessed.csv`
  - Item-level content features (genres, tags, staff, etc.) + `MAL_ID`.
  - Output of the preprocessing pipeline.
- `anime.csv`
  - Mapping table for `MAL_ID ↔ anime title` (for human-readable outputs).
- **Collaborative Filtering (CF) – SVD**
  - Learns latent user/item factors.
  - Outputs: `cf_score(user_id, anime_id)`.
- **Content-Based (CB) – Cosine Similarity**
  - Uses standardized content embeddings from `meta_preprocessed.csv`.
  - Builds per-user content profiles.
  - Outputs: `cb_score(user_id, anime_id)`.
- **Meta-Training Dataset**
  - For sampled user–item pairs: collect `(cf_score, cb_score, true_rating)`.
  - Used as input for meta-learners.
- **Meta-Learners (XGBoost / LightGBM / CatBoost)**
  - Learn a mapping: $\hat{y} = f_{\text{meta}}(y_{\text{cf}}, y_{\text{cb}})$
  - Final hybrid rating predictor.
- **Evaluation Metrics**
  - RMSE (raw): absolute prediction error.
  - Precision@K / Recall@K: ranking-based accuracy.
- **Model-based item recommendation**
  - “Users who liked X also tend to like Y” based on meta-model predictions.
- **Library:** `surprise` (SVD)
- **Predictive model:**
  $\hat{r}_{ui} = \mu + b_u + b_i + p_u^\top q_i$
  - $\mu$: global mean
  - $b_u$, $b_i$: user/item bias
  - $p_u$, $q_i$: latent factor vectors
| Parameter | Value | Meaning | Effect |
|---|---|---|---|
| `n_factors` | 100 | Latent dimension | Higher → expressive, risk of overfitting |
| `n_epochs` | 20 | Training epochs | Ensures convergence |
| `lr_all` | 0.005 | Learning rate | Stable, small updates |
| `reg_all` | 0.02 | Regularization | Prevents overfitting |
- Output: `cf_score(u, i) = svd.predict(u, i).est` (see the training sketch below).
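A minimal training sketch with the `surprise` library under the hyperparameters above; the 80/20 validation split is an assumed detail:

```python
import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

# rating_complete.csv columns: user_id, anime_id, rating (1–10 scale).
rating_df = pd.read_csv("rating_complete.csv")
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(rating_df[["user_id", "anime_id", "rating"]], reader)
trainset, valset = train_test_split(data, test_size=0.2, random_state=42)

svd = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02, random_state=42)
svd.fit(trainset)
accuracy.rmse(svd.test(valset))  # prints the validation RMSE

# CF score for any (user, item) pair:
cf_score = svd.predict(uid=123, iid=456).est
```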
Concept
- NeuMF embeds users/items via:
  - GMF (Generalized Matrix Factorization): linear interaction.
  - MLP: non-linear interaction.
- Embeddings:
  - User: $p_u$
  - Item: $q_i$
Key components:
- **GMF:** $\phi_{\text{GMF}}(p_u, q_i) = p_u \odot q_i$
- **MLP:** stacked non-linear layers on concatenated $(p_u, q_i)$.
- **Fusion:** final prediction from the concatenation of the GMF and MLP representations (see the sketch below).
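An illustrative PyTorch sketch of this architecture; layer widths and embedding dimensions are assumptions (the actual model here had ~33K parameters):

```python
import torch
import torch.nn as nn

class NeuMF(nn.Module):
    """Minimal NeuMF sketch: GMF branch + MLP branch, fused by a linear head."""
    def __init__(self, n_users, n_items, dim=8):
        super().__init__()
        # Separate embedding tables per branch, as in the NeuMF paper.
        self.gmf_user = nn.Embedding(n_users, dim)
        self.gmf_item = nn.Embedding(n_items, dim)
        self.mlp_user = nn.Embedding(n_users, dim)
        self.mlp_item = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 16), nn.ReLU(),
            nn.Linear(16, 8), nn.ReLU(),
        )
        self.head = nn.Linear(dim + 8, 1)  # fuses GMF and MLP outputs

    def forward(self, user, item):
        gmf = self.gmf_user(user) * self.gmf_item(item)  # p_u ⊙ q_i
        mlp = self.mlp(torch.cat([self.mlp_user(user),
                                  self.mlp_item(item)], dim=-1))
        return self.head(torch.cat([gmf, mlp], dim=-1)).squeeze(-1)
```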
Analysis
- NeuMF can capture complex non-linear patterns that SVD cannot.
- However, in our setting:
  - Rating skew: ~70% of ratings fall in the 8–9 range.
  - Model size: ~33K parameters.
- This resulted in overfitting:
  - Train RMSE: 1.27
  - Val RMSE: 1.80
- Final decision: SVD selected as the primary CF model:
  - Stable performance
  - Validation RMSE: 1.70
- Uses `meta_preprocessed.csv`.
- Applies `StandardScaler` to CB features.
- Ensures cosine similarity is not dominated by large-scale features.
- For items $i, j$ with vectors $x_i, x_j$:
  $\text{sim}(i, j) = \dfrac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}$
- Similarity matrix shape: `(n_items × n_items)`.
For each user
- Define liked items: $L_u = \{\, i \mid r_{ui} \ge \tau \,\}$, with default $\tau = 8.0$.
- For each liked item $i \in L_u$: keep only the Top-K most similar neighbors $S_i$.
- User-specific CB score for candidate item $j$:

$$
\text{CB}_u(j) = \frac{\sum_{i \in L_u} \mathbf{1}\{\, j \in \text{TopK}(S_i) \,\} \cdot \text{sim}(i, j)}
                     {\sum_{i \in L_u} \mathbf{1}\{\, j \in \text{TopK}(S_i) \,\} + \varepsilon}
$$

- Defaults: `top_k = 30`, `like_th = 8.0`
Rationale
- Focuses on the strongest semantic neighbors.
- Reduces noise from weakly related items (see the sketch below).
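A minimal sketch of the scoring formula above; integer item indices aligned with the similarity matrix are assumed:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Standardize CB features, then build the (n_items × n_items) similarity matrix.
feats = pd.read_csv("meta_preprocessed.csv")
X = StandardScaler().fit_transform(feats.drop(columns=["MAL_ID"]))
sim = cosine_similarity(X)

def cb_scores(user_ratings, sim, top_k=30, like_th=8.0, eps=1e-8):
    """user_ratings: {item_index: rating}. Returns CB_u(j) for every item j."""
    liked = [i for i, r in user_ratings.items() if r >= like_th]
    num = np.zeros(sim.shape[0])
    den = np.zeros(sim.shape[0])
    for i in liked:
        # Top-K most similar neighbors of liked item i (position 0 is i itself).
        topk = np.argsort(sim[i])[::-1][1:top_k + 1]
        num[topk] += sim[i, topk]
        den[topk] += 1.0
    return num / (den + eps)
```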
- Randomly sample up to 3000 users for efficiency.
- For each sampled user:
  - Build a CB profile.
  - For each rated item $(u, i)$:
    - Compute `cf_score(u, i)` from SVD.
    - Compute `cb_score(u, i)` from the CB module.
    - Take `true_rating(u, i)` from `rating_complete.csv`.
    - Store `(user_id, anime_id, cf_score, cb_score, true_rating)`.
- Save as `meta_train_ready.csv`.
- Used as supervised training data for the meta-learners.
The meta-learner is a fusion model:
- Input:
  - $y_{\text{cf}}$: CF-predicted rating
  - $y_{\text{cb}}$: CB-predicted rating
- Output: $\hat{y} = f_{\text{meta}}(y_{\text{cf}}, y_{\text{cb}})$
Objective:
- Learns how to combine:
- Behavioral signal (CF)
- Semantic signal (CB)
- **Base Model Training**
  - CF (SVD): trained on `rating_complete.csv`.
  - CB: similarity + user content profiles from `meta_preprocessed.csv`.
- **Meta-Dataset Construction**
  - For each $(u, i)$:
    - Features: $[\, y_{\text{cf}}(u, i),\ y_{\text{cb}}(u, i) \,]$
    - Target: `true_rating(u, i)`
  - Stored in `meta_train_ready.csv`.
- **Meta-Learner Training**
  - Train LightGBM / XGBoost / CatBoost: $\hat{y}_{ui} = f_{\text{meta}}(y_{\text{cf}}, y_{\text{cb}})$ (see the training sketch after this list).
- **Recommendation Generation**
  - For each user:
    - Predict $\hat{y}_{ui}$ for candidate items.
    - Sort descending → Top-N recommendations.
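A minimal sketch of the meta-learner training step with LightGBM (the chosen meta-learner); the hyperparameters here are illustrative, not the project's tuned values:

```python
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Two features only: the CF and CB predictions per (user, item) pair.
meta = pd.read_csv("meta_train_ready.csv")
X = meta[["cf_score", "cb_score"]]
y = meta["true_rating"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
model.fit(X_tr, y_tr)

rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
print("meta-learner val RMSE:", rmse)
```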
(1) LightGBM
- Leaf-wise, histogram-based Gradient Boosting.
- Key:
  - Fast convergence
  - Efficient memory usage
  - GOSS sampling
- Chosen as final meta-learner in this project:
- Best trade-off between speed, performance, and interpretability.
(2) XGBoost
- Level-wise tree growth.
- L1/L2 regularization.
- Robust, stable, handles missing values well.
- Slightly slower but very reliable.
(3) CatBoost
- Ordered boosting + Oblivious Trees.
- Great with categorical features & overfitting control.
- In our case (continuous inputs), still stable but less beneficial than LightGBM.
| Model | Strengths | Limitations |
|---|---|---|
| LightGBM | Fast, accurate, low memory, interpretable | Leaf-wise can overfit |
| XGBoost | Stable, strong regularization, robust | Slower than LightGBM |
| CatBoost | Great with categoricals, robust to overfit | Slower on large-scale tasks |
The hybrid prediction can be viewed as a weighted combination:

$\hat{y} \approx w_1 \cdot y_{\text{cf}} + w_2 \cdot y_{\text{cb}}$

- But:
  - In Gradient Boosting Trees, $w_1, w_2$ are implicit, non-linear, and context-dependent.
  - The meta-learner:
    - Reinforces agreement between CF & CB.
    - Downweights the less reliable model per user/genre pattern.
Conclusion
- The meta-learner acts as an adaptive fusion layer.
- Among candidates, LightGBM showed the best overall performance and was selected as the final meta-model.
- Computed on raw predictions (no scaling/clipping).
- Lower RMSE → better numeric accuracy.
- **Relevant items:** $r_{ui} \ge 8.0$
- **Metrics:**

  $$ \text{Precision@K} = \frac{|T_u \cap R_u|}{K}, \quad \text{Recall@K} = \frac{|T_u \cap R_u|}{|R_u|} $$

  - $T_u$: Top-K recommended items.
  - $R_u$: Relevant (truly liked) items.
- **Why use raw predictions**
  - Rank-based metrics are invariant to monotonic transforms.
  - No clipping/scaling before ranking → preserves relative order (see the sketch below).
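A minimal sketch of the ranking metrics for a single user, matching the definitions above:

```python
import numpy as np

def precision_recall_at_k(pred_scores, true_ratings, k=10, rel_th=8.0):
    """pred_scores, true_ratings: aligned 1-D arrays over one user's candidates."""
    order = np.argsort(pred_scores)[::-1]                 # rank by raw scores
    top_k = set(order[:k])                                # T_u
    relevant = set(np.where(true_ratings >= rel_th)[0])   # R_u
    hits = len(top_k & relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```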
- Select a random user.
- Generate Top-10 lists using:
  - XGBoost-based meta-learner
  - LightGBM-based meta-learner
  - CatBoost-based meta-learner
- Display: `Predicted Rating / True Rating` per item.
- Provides qualitative insight into each model’s personalization behavior.
- Recommend items that users who liked a given anime also tend to like.
- Uses meta-model predictions instead of pure co-occurrence:
  - Higher semantic + behavioral resolution.
- Find the target anime by name → get its `MAL_ID`.
- Find users who rated that anime ≥ 8.0.
- Collect other items those users rated.
- For each `(user, item)` pair:
  - Gather `cf_score`, `cb_score` (from `meta_train_ready` or recomputation).
  - Predict via the meta-model.
- Average predicted scores per item.
- Sort in descending order → Top-N similar items (see the sketch below).
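A minimal sketch of this procedure; the `Name` column and the `meta_feats` table (standing in for the `meta_train_ready` data) are assumptions about exact naming:

```python
import pandas as pd

def similar_items(title, anime_df, rating_df, meta_feats, model,
                  top_n=10, like_th=8.0):
    """'Users who liked X also like Y' via meta-model predictions (illustrative)."""
    mal_id = anime_df.loc[anime_df["Name"] == title, "MAL_ID"].iloc[0]
    # Users who liked the target anime.
    fans = rating_df[(rating_df["anime_id"] == mal_id) &
                     (rating_df["rating"] >= like_th)]["user_id"].unique()
    # Other items those fans rated, with their precomputed cf/cb scores.
    cand = meta_feats[meta_feats["user_id"].isin(fans) &
                      (meta_feats["anime_id"] != mal_id)].copy()
    cand["pred"] = model.predict(cand[["cf_score", "cb_score"]])
    # Average the predicted rating per item and rank descending.
    return (cand.groupby("anime_id")["pred"].mean()
                .sort_values(ascending=False).head(top_n))
```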
- Clipping:
  - Predictions may be clipped to `[1, 10]` for display only.
  - Training & evaluation use raw values.
Example interpretation:
- “Items that users who liked Koe no Katachi are also highly likely to enjoy.”
- **Top-K neighbor masking**
  - Reduces noise; focuses on the strongest neighbors.
- **Raw predictions (no global scaling)**
  - Each model learns the rating scale directly.
  - Avoids distortions in ranking and RMSE.
- **User sampling (3000 users)**
  - Practical runtime with sufficient diversity.
  - For production: use distributed compute or ANN libraries (e.g., FAISS).
| Component | Complexity | Comment |
|---|---|---|
| Item similarity | O(n_items² · d) | Expensive; ANN/FAISS/Annoy recommended at scale |
| Meta dataset build | O(sampled users × their rated items) | Bounded by the 3000-user sample |
| Model training | Linear in samples | Only 2 features (cf, cb) → very fast |
| Reproducibility | — | `random_state=42`, `np.random.seed(42)` used |
- **RMSE**
  - Lower = better absolute predictive accuracy.
- **Precision@10 / Recall@10**
  - Reflect ranking quality of recommendations.
- **User Top-10 lists**
  - Reveal each model's tendency:
    - Safe vs. exploratory
    - Niche vs. popular bias
We can see that the model recommended anime in the "Action" and "Fantasy" genres, which the user likes. Here we used the LightGBM meta-learner, which has the best performance. The recommendations also show that titles of the same genre cluster together.
On the validation set, the NeuMF model did not outperform SVD: its low training RMSE (1.27) came with a higher validation RMSE (1.80 vs. SVD's 1.70), indicating overfitting.
Among the three meta-learner models (XGBoost, LightGBM, and CatBoost),
LightGBM achieved the best overall performance in both RMSE and ranking metrics (Precision@K, Recall@K).
This can be attributed to several key factors:
- **Histogram-based Gradient Boosting Efficiency**
  LightGBM groups continuous feature values into discrete bins (histogram-based splitting).
  This approach enables faster training, smoother optimization, and noise reduction,
  particularly with small meta-feature dimensions such as `(cf_score, cb_score)`.
- **Better Handling of Continuous Features**
  Since the meta-input consists of two continuous scores (from CF and CB),
  LightGBM's leaf-wise growth captures subtle nonlinear interactions more effectively
  than XGBoost's level-wise tree expansion.
- **Regularization and Feature Usage**
  LightGBM automatically prunes weak leaves during training, providing natural regularization.
  This results in better generalization across users with diverse rating behaviors.
- **Optimized for Small, Dense Features**
  While CatBoost excels with high-cardinality categorical data,
  LightGBM performs exceptionally well on low-dimensional numeric features,
  avoiding unnecessary complexity.
Summary:
Among all ensemble meta-learners, LightGBM demonstrated the highest predictive accuracy.
Its histogram-based optimization and efficient handling of low-dimensional continuous features
provided smoother gradient updates, reduced overfitting, and yielded superior generalization performance.