Both anime.csv and rating_complete.csv datasets are clean, consistent, and well-structured.
There are no missing or unrated values, and the data distribution reflects typical user behavior —
a strong preference for popular titles and high average ratings.
These characteristics make the dataset well-suited for developing a hybrid recommendation system
that combines Collaborative Filtering (CF) with Content-Based (CB) embeddings.

What it does
- Reads `anime.csv` and normalizes the identifier `MAL_ID` to `int`.
- Fills missing values:
  - Text/multi-valued fields → `""` (empty string, so TF-IDF won't crash)
  - Categorical fields (`Type`, `Source`, `Rating`, `Premiered`, `Duration`) → `"Unknown"`
- Ensures all downstream transformers receive valid inputs (see the sketch below).
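A minimal sketch of this step; the file and column names follow the dataset described in this document:

```python
import pandas as pd

# Load the metadata and normalize the join key.
meta_df = pd.read_csv("anime.csv")
meta_df["MAL_ID"] = meta_df["MAL_ID"].astype(int)

# Text / multi-valued fields: empty strings keep TF-IDF from crashing.
text_cols = ["Genres", "Producers", "Studios"]
meta_df[text_cols] = meta_df[text_cols].fillna("")

# Categorical fields: a sentinel value instead of NaN.
cat_cols = ["Type", "Source", "Rating", "Premiered", "Duration"]
meta_df[cat_cols] = meta_df[cat_cols].fillna("Unknown")
```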
Why it matters
- Vectorizers/encoders don't accept `NaN`.
- Consistent typing on IDs prevents silent mismatches in later joins.
Pitfalls & mitigations
- Ambiguous categories: using `"Unknown"` groups all missing categories together — fine for modeling, but call it out in limitations.
- Unicode & punctuation: TF-IDF tokenization later uses a strict pattern; commas/spaces separate tokens.
Quick checks
- `meta_df.isna().sum()`
- `meta_df['MAL_ID'].dtype`

What it does
- Applies `LabelEncoder` to: `Type`, `Source`, `Rating`, `Premiered`, `Duration`.
- Stores each fitted encoder in `label_encoders` for reproducibility/inference.
Why it matters
- Tree models and similarity pipelines consume numeric inputs.
- Persisted encoders allow consistent mapping when new/held-out items arrive.
Pitfalls & mitigations
- Unseen categories at inference: `LabelEncoder` will error if a new label appears.
  - Option 1️⃣: Pre-map unknowns to a reserved index (see the sketch below).
  - Option 2️⃣: Refit encoders on the union of train + incoming batch (document this policy).
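A minimal sketch of the encoding plus Option 1; `encode_with_unknown` is a hypothetical helper, not part of the original pipeline:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit one encoder per categorical column and keep it for inference.
label_encoders = {}
for col in ["Type", "Source", "Rating", "Premiered", "Duration"]:
    le = LabelEncoder()
    meta_df[f"{col}_encoded"] = le.fit_transform(meta_df[col])
    label_encoders[col] = le

def encode_with_unknown(le, values, unknown_index=-1):
    """Option 1 (hypothetical): map labels unseen at fit time to a reserved index."""
    mapping = {cls: idx for idx, cls in enumerate(le.classes_)}
    return np.array([mapping.get(v, unknown_index) for v in values])

# e.g. encode_with_unknown(label_encoders["Type"], ["TV", "SomeNewType"])
```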
Quick checks
- `meta_df.filter(like='_encoded').nunique()`

What it does
- Runs TF-IDF with:
  - `token_pattern=r'[^, ]+'` (split by commas/spaces)
  - `stop_words='english'`
  - `max_features=100` per field.
- Produces three dense matrices with interpretable columns like
  `Genre_Action`, `Prod_Aniplex`, `Studio_KyoAni` (see the sketch below).
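A minimal sketch under these settings; the `tfidf_field` helper and the exact feature-name prefixes are illustrative assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_field(series, prefix, max_features=100):
    """Vectorize a comma-separated multi-label field into a dense frame."""
    vec = TfidfVectorizer(token_pattern=r"[^, ]+",
                          stop_words="english",
                          max_features=max_features)
    mat = vec.fit_transform(series)
    cols = [f"{prefix}_{name}" for name in vec.get_feature_names_out()]
    return vec, pd.DataFrame(mat.toarray(), columns=cols, index=series.index)

vec_genres, tfidf_genres = tfidf_field(meta_df["Genres"], "Genre")
vec_producers, tfidf_producers = tfidf_field(meta_df["Producers"], "Prod")
vec_studios, tfidf_studios = tfidf_field(meta_df["Studios"], "Studio")
```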
Why it matters
- Converts multi-label text fields into semantic vectors.
- Capped dimensionality reduces memory usage and speeds up cosine similarity later.
Pitfalls & mitigations
- Vocabulary drift: save the vectorizers for inference (`tfidf_and_encoders.pkl`) to ensure a consistent feature space.
- Over-sparsity: using `max_features=100` balances expressiveness and compute efficiency.
Quick checks
- `tfidf_genres.shape[1] <= 100` (`max_features` is an upper bound; a field with fewer unique tokens yields fewer columns)
- `vec_genres.get_feature_names_out()[:10]`

What it does
- Casts numeric columns
  `['Score', 'Episodes', 'Ranked', 'Popularity', 'Members', 'Favorites']`
  to numeric (`errors='coerce'`).
- Fills remaining `NaN`s with column means.
- Applies `MinMaxScaler` to range `[0, 1]` (see the sketch below).
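A minimal sketch of this step, assuming the pipeline mutates `meta_df` in place:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

numeric_cols = ['Score', 'Episodes', 'Ranked', 'Popularity', 'Members', 'Favorites']

# Coerce non-numeric entries (e.g., "Unknown") to NaN, then mean-impute.
meta_df[numeric_cols] = meta_df[numeric_cols].apply(pd.to_numeric, errors="coerce")
meta_df[numeric_cols] = meta_df[numeric_cols].fillna(meta_df[numeric_cols].mean())

# Squash everything into [0, 1] so no single feature dominates distances.
scaler = MinMaxScaler()
meta_df[numeric_cols] = scaler.fit_transform(meta_df[numeric_cols])
```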
Why it matters
- Mixed-scale features (e.g., `Members` in millions vs. `Score` ~ [0–10]) can distort distance metrics and degrade model performance.
- Normalization stabilizes training and improves cosine similarity reliability.
Pitfalls & mitigations
- Outliers: `MinMaxScaler` is sensitive — if the distribution is heavy-tailed (e.g., `Members`), consider `RobustScaler` in future.
- Imputation bias: mean imputation is simple but can bias toward central values; mention this in your limitations section.
Quick checks
- `meta_df[numeric_cols].min().ge(0).all()`
- `meta_df[numeric_cols].max().le(1).all()`

What it does
- Concatenates:
  - `MAL_ID`
  - Scaled numerics
  - Encoded categoricals
  - TF-IDF matrices (`Genres`, `Producers`, `Studios`)
- Produces a dense, model-ready Content-Based (CB) vector per anime (see the sketch below).
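A minimal assembly sketch reusing names from the sketches above; the exact column selection in the original pipeline may differ:

```python
import pandas as pd

# All frames share meta_df's index, so axis=1 concat stays row-aligned.
encoded_cols = [c for c in meta_df.columns if c.endswith("_encoded")]
meta_processed = pd.concat(
    [meta_df[["MAL_ID"]],
     meta_df[numeric_cols],        # scaled in the previous step
     meta_df[encoded_cols],
     tfidf_genres, tfidf_producers, tfidf_studios],
    axis=1,
)
assert meta_processed.shape[0] == meta_df.shape[0]
```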
Why it matters
- Combines all preprocessed data sources into a single, unified representation.
- This unified table can be directly used for:
  - Cosine similarity computation between items
  - Feature input for the hybrid meta-learner
Pitfalls & mitigations
- Column alignment: ensure consistent row alignment when concatenating multiple feature sets; `pd.concat(..., axis=1)` relies on aligned indices.
- Shape mismatch: verify that TF-IDF outputs and encoded numerics have identical row counts.
Quick checks
- `meta_processed.shape[0] == meta_df.shape[0]`

What it does
- Reads `rating_complete.csv` and casts `anime_id` to `int`.
- Intersects `MAL_ID` from the metadata with `anime_id` from the ratings file.
- Keeps only the overlapping records — ensures every CB vector has a corresponding CF entry.
- Logs the number of matched items for transparency (see the sketch below).
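A minimal sketch of the intersection step:

```python
import pandas as pd

rating_df = pd.read_csv("rating_complete.csv")
rating_df["anime_id"] = rating_df["anime_id"].astype(int)

# Keep only items that exist on both sides of the hybrid.
common_ids = set(meta_processed["MAL_ID"]) & set(rating_df["anime_id"])
meta_processed = meta_processed[meta_processed["MAL_ID"].isin(common_ids)]
rating_df = rating_df[rating_df["anime_id"].isin(common_ids)]

print(f"matched items: {len(common_ids)} / {meta_df['MAL_ID'].nunique()}")
```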
Why it matters
- Prevents cold-start leakage during hybrid model training and evaluation.
- Guarantees 1:1 correspondence between items in the CB feature matrix
and those in the CF (Collaborative Filtering) dataset.
Pitfalls & mitigations
- Coverage drop: titles without rating data will be dropped from the hybrid dataset. It's recommended to log the retained ratio for your report (e.g., `matched / total`).
- ID mismatch: ensure both sides (`MAL_ID`, `anime_id`) are of the same integer type to avoid silent filtering errors.
Quick checks
- `(meta_processed['MAL_ID'].isin(rating_df['anime_id'])).all()`

What it does
- Saves (a serialization sketch follows below):
  - `meta_preprocessed.csv` → the final, dense CB feature table
  - `tfidf_and_encoders.pkl` → serialized preprocessing objects containing:
    - `label_encoders` (for categorical columns)
    - `vec_genres`, `vec_producers`, `vec_studios`
    - `scaler`
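A minimal serialization sketch; bundling everything into a single dict is an assumption about the pickle's internal layout:

```python
import pickle

# Persist the dense CB feature table.
meta_processed.to_csv("meta_preprocessed.csv", index=False)

# Bundle all fitted preprocessing objects for train/serve consistency.
artifacts = {
    "label_encoders": label_encoders,
    "vec_genres": vec_genres,
    "vec_producers": vec_producers,
    "vec_studios": vec_studios,
    "scaler": scaler,
}
with open("tfidf_and_encoders.pkl", "wb") as f:
    pickle.dump(artifacts, f)
```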
Why it matters
- Ensures identical preprocessing during both training and inference.
- Prevents train–serve feature drift by preserving:
- TF-IDF vocabularies
- LabelEncoder mappings
- Scaler parameters
Pitfalls & mitigations
- Always reload the pickle and test
transform()on a sample before deployment
to confirm version compatibility and consistent output dimensions.
Quick checks
- `import pickle`
- `obj = pickle.load(open("tfidf_and_encoders.pkl", "rb"))`
- `type(obj)`

- `rating_complete.csv`
  - Columns: `user_id`, `anime_id`, `rating`
  - Only ratings > 0 are used.
- `rating_test.csv`
  - Used for baseline CF models (e.g., SVD, NeuMF).
- `meta_preprocessed.csv`
  - Item-level content features (genres, tags, staff, etc.) + `MAL_ID`.
  - Output of the preprocessing pipeline.
- `anime.csv`
  - Mapping table for `MAL_ID ↔ anime title` (for human-readable outputs).
- **Collaborative Filtering (CF) – SVD**
  - Learns latent user/item factors.
  - Outputs: `cf_score(user_id, anime_id)`.
- **Content-Based (CB) – Cosine Similarity**
  - Uses standardized content embeddings from `meta_preprocessed.csv`.
  - Builds per-user content profiles.
  - Outputs: `cb_score(user_id, anime_id)`.
- **Meta-Training Dataset**
  - For sampled user–item pairs: collect `(cf_score, cb_score, true_rating)`.
  - Used as input for meta-learners.
- **Meta-Learners (XGBoost / LightGBM / CatBoost)**
  - Learn a mapping: $\hat{y} = f_{\text{meta}}(y_{\text{cf}}, y_{\text{cb}})$
  - Final hybrid rating predictor.
- **Evaluation Metrics**
  - RMSE (raw): absolute prediction error.
  - Precision@K / Recall@K: ranking-based accuracy.
- **Model-based item recommendation**
  - “Users who liked X also tend to like Y” based on meta-model predictions.
- **Library:** `surprise` (SVD)
- **Predictive model:**
  $\hat{r}_{ui} = \mu + b_u + b_i + p_u^\top q_i$
  - $\mu$: global mean
  - $b_u$, $b_i$: user/item bias
  - $p_u$, $q_i$: latent factor vectors
| Parameter | Value | Meaning | Effect |
|---|---|---|---|
| `n_factors` | 100 | Latent dimension | Higher → expressive, risk of overfitting |
| `n_epochs` | 20 | Training epochs | Ensures convergence |
| `lr_all` | 0.005 | Learning rate | Stable, small updates |
| `reg_all` | 0.02 | Regularization | Prevents overfitting |
- Output: `cf_score(u, i) = svd.predict(u, i).est` (see the training sketch below).
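A minimal training sketch with the `surprise` library under the hyperparameters above; the 80/20 validation split is an assumed detail:

```python
import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

# rating_complete.csv columns: user_id, anime_id, rating (1–10 scale).
rating_df = pd.read_csv("rating_complete.csv")
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(rating_df[["user_id", "anime_id", "rating"]], reader)
trainset, valset = train_test_split(data, test_size=0.2, random_state=42)

svd = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02, random_state=42)
svd.fit(trainset)
accuracy.rmse(svd.test(valset))  # prints the validation RMSE

# CF score for any (user, item) pair:
cf_score = svd.predict(uid=123, iid=456).est
```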
Concept
- NeuMF embeds users/items via:
  - GMF (Generalized Matrix Factorization): linear interaction.
  - MLP: non-linear interaction.
- Embeddings:
  - User: $p_u$
  - Item: $q_i$
Key components:
- **GMF:** $\phi_{\text{GMF}}(p_u, q_i) = p_u \odot q_i$
- **MLP:** stacked non-linear layers on concatenated $(p_u, q_i)$.
- **Fusion:** final prediction from the concatenation of the GMF and MLP representations (see the sketch below).
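An illustrative PyTorch sketch of this architecture; layer widths and embedding dimensions are assumptions (the actual model here had ~33K parameters):

```python
import torch
import torch.nn as nn

class NeuMF(nn.Module):
    """Minimal NeuMF sketch: GMF branch + MLP branch, fused by a linear head."""
    def __init__(self, n_users, n_items, dim=8):
        super().__init__()
        # Separate embedding tables per branch, as in the NeuMF paper.
        self.gmf_user = nn.Embedding(n_users, dim)
        self.gmf_item = nn.Embedding(n_items, dim)
        self.mlp_user = nn.Embedding(n_users, dim)
        self.mlp_item = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 16), nn.ReLU(),
            nn.Linear(16, 8), nn.ReLU(),
        )
        self.head = nn.Linear(dim + 8, 1)  # fuses GMF and MLP outputs

    def forward(self, user, item):
        gmf = self.gmf_user(user) * self.gmf_item(item)  # p_u ⊙ q_i
        mlp = self.mlp(torch.cat([self.mlp_user(user),
                                  self.mlp_item(item)], dim=-1))
        return self.head(torch.cat([gmf, mlp], dim=-1)).squeeze(-1)
```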
Analysis
- NeuMF can capture complex non-linear patterns that SVD cannot.
- However, in our setting:
  - Rating skew: ~70% of ratings fall in the 8–9 range.
  - Model size: ~33K parameters.
- This resulted in overfitting:
  - Train RMSE: 1.27
  - Val RMSE: 1.80
- Final decision: SVD selected as the primary CF model:
  - Stable performance
  - Validation RMSE: 1.70
- Uses `meta_preprocessed.csv`.
- Applies `StandardScaler` to CB features.
- Ensures cosine similarity is not dominated by large-scale features.
- For items $i, j$ with vectors $x_i, x_j$:
  $\text{sim}(i, j) = \dfrac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}$
- Similarity matrix shape: `(n_items × n_items)`.
For each user
- Define liked items: $L_u = \{\, i \mid r_{ui} \ge \tau \,\}$, with default $\tau = 8.0$.
- For each liked item $i \in L_u$: keep only the Top-K most similar neighbors $S_i$.
- User-specific CB score for candidate item $j$:

$$
\text{CB}_u(j) = \frac{\sum_{i \in L_u} \mathbf{1}\{\, j \in \text{TopK}(S_i) \,\} \cdot \text{sim}(i, j)}
                     {\sum_{i \in L_u} \mathbf{1}\{\, j \in \text{TopK}(S_i) \,\} + \varepsilon}
$$

- Defaults: `top_k = 30`, `like_th = 8.0`
Rationale
- Focuses on the strongest semantic neighbors.
- Reduces noise from weakly related items (see the sketch below).
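A minimal sketch of the scoring formula above; integer item indices aligned with the similarity matrix are assumed:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Standardize CB features, then build the (n_items × n_items) similarity matrix.
feats = pd.read_csv("meta_preprocessed.csv")
X = StandardScaler().fit_transform(feats.drop(columns=["MAL_ID"]))
sim = cosine_similarity(X)

def cb_scores(user_ratings, sim, top_k=30, like_th=8.0, eps=1e-8):
    """user_ratings: {item_index: rating}. Returns CB_u(j) for every item j."""
    liked = [i for i, r in user_ratings.items() if r >= like_th]
    num = np.zeros(sim.shape[0])
    den = np.zeros(sim.shape[0])
    for i in liked:
        # Top-K most similar neighbors of liked item i (position 0 is i itself).
        topk = np.argsort(sim[i])[::-1][1:top_k + 1]
        num[topk] += sim[i, topk]
        den[topk] += 1.0
    return num / (den + eps)
```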
- Randomly sample up to 3000 users for efficiency.
- For each sampled user:
  - Build a CB profile.
  - For each rated item $(u, i)$:
    - Compute `cf_score(u, i)` from SVD.
    - Compute `cb_score(u, i)` from the CB module.
    - Take `true_rating(u, i)` from `rating_complete.csv`.
    - Store `(user_id, anime_id, cf_score, cb_score, true_rating)`.
- Save as `meta_train_ready.csv`.
- Used as supervised training data for the meta-learners.
The meta-learner is a fusion model:
- Input:
  - $y_{\text{cf}}$: CF-predicted rating
  - $y_{\text{cb}}$: CB-predicted rating
- Output: $\hat{y} = f_{\text{meta}}(y_{\text{cf}}, y_{\text{cb}})$
Objective:
- Learns how to combine:
- Behavioral signal (CF)
- Semantic signal (CB)
- **Base Model Training**
  - CF (SVD): trained on `rating_complete.csv`.
  - CB: similarity + user content profiles from `meta_preprocessed.csv`.
- **Meta-Dataset Construction**
  - For each $(u, i)$:
    - Features: $[\, y_{\text{cf}}(u, i),\ y_{\text{cb}}(u, i) \,]$
    - Target: `true_rating(u, i)`
  - Stored in `meta_train_ready.csv`.
- **Meta-Learner Training**
  - Train LightGBM / XGBoost / CatBoost: $\hat{y}_{ui} = f_{\text{meta}}(y_{\text{cf}}, y_{\text{cb}})$ (see the training sketch after this list).
- **Recommendation Generation**
  - For each user:
    - Predict $\hat{y}_{ui}$ for candidate items.
    - Sort descending → Top-N recommendations.
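A minimal sketch of the meta-learner training step with LightGBM (the chosen meta-learner); the hyperparameters here are illustrative, not the project's tuned values:

```python
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Two features only: the CF and CB predictions per (user, item) pair.
meta = pd.read_csv("meta_train_ready.csv")
X = meta[["cf_score", "cb_score"]]
y = meta["true_rating"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
model.fit(X_tr, y_tr)

rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
print("meta-learner val RMSE:", rmse)
```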
(1) LightGBM
- Leaf-wise, histogram-based Gradient Boosting.
- Key:
  - Fast convergence
  - Efficient memory usage
  - GOSS sampling
- Chosen as final meta-learner in this project:
- Best trade-off between speed, performance, and interpretability.
(2) XGBoost
- Level-wise tree growth.
- L1/L2 regularization.
- Robust, stable, handles missing values well.
- Slightly slower but very reliable.
(3) CatBoost
- Ordered boosting + Oblivious Trees.
- Great with categorical features & overfitting control.
- In our case (continuous inputs), still stable but less beneficial than LightGBM.
| Model | Strengths | Limitations |
|---|---|---|
| LightGBM | Fast, accurate, low memory, interpretable | Leaf-wise can overfit |
| XGBoost | Stable, strong regularization, robust | Slower than LightGBM |
| CatBoost | Great with categoricals, robust to overfit | Slower on large-scale tasks |
The hybrid prediction can be viewed as a weighted combination:

$\hat{y} \approx w_1 \cdot y_{\text{cf}} + w_2 \cdot y_{\text{cb}}$

- But:
  - In Gradient Boosting Trees, $w_1, w_2$ are implicit, non-linear, and context-dependent.
  - The meta-learner:
    - Reinforces agreement between CF & CB.
    - Downweights the less reliable model per user/genre pattern.
Conclusion
- The meta-learner acts as an adaptive fusion layer.
- Among candidates, LightGBM showed the best overall performance and was selected as the final meta-model.
- Computed on raw predictions (no scaling/clipping).
- Lower RMSE → better numeric accuracy.
- **Relevant items:** $r_{ui} \ge 8.0$
- **Metrics:**

  $$ \text{Precision@K} = \frac{|T_u \cap R_u|}{K}, \quad \text{Recall@K} = \frac{|T_u \cap R_u|}{|R_u|} $$

  - $T_u$: Top-K recommended items.
  - $R_u$: Relevant (truly liked) items.
- **Why use raw predictions**
  - Rank-based metrics are invariant to monotonic transforms.
  - No clipping/scaling before ranking → preserves relative order (see the sketch below).
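A minimal sketch of the ranking metrics for a single user, matching the definitions above:

```python
import numpy as np

def precision_recall_at_k(pred_scores, true_ratings, k=10, rel_th=8.0):
    """pred_scores, true_ratings: aligned 1-D arrays over one user's candidates."""
    order = np.argsort(pred_scores)[::-1]                 # rank by raw scores
    top_k = set(order[:k])                                # T_u
    relevant = set(np.where(true_ratings >= rel_th)[0])   # R_u
    hits = len(top_k & relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```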
- Select a random user.
- Generate Top-10 lists using:
  - XGBoost-based meta-learner
  - LightGBM-based meta-learner
  - CatBoost-based meta-learner
- Display: `Predicted Rating / True Rating` per item.
- Provides qualitative insight into each model’s personalization behavior.
- Recommend items that users who liked a given anime also tend to like.
- Uses meta-model predictions instead of pure co-occurrence:
  - Higher semantic + behavioral resolution.
- Find the target anime by name → get its `MAL_ID`.
- Find users who rated that anime ≥ 8.0.
- Collect other items those users rated.
- For each `(user, item)` pair:
  - Gather `cf_score`, `cb_score` (from `meta_train_ready` or recomputation).
  - Predict via the meta-model.
- Average predicted scores per item.
- Sort in descending order → Top-N similar items (see the sketch below).
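A minimal sketch of this procedure; the `Name` column and the `meta_feats` table (standing in for the `meta_train_ready` data) are assumptions about exact naming:

```python
import pandas as pd

def similar_items(title, anime_df, rating_df, meta_feats, model,
                  top_n=10, like_th=8.0):
    """'Users who liked X also like Y' via meta-model predictions (illustrative)."""
    mal_id = anime_df.loc[anime_df["Name"] == title, "MAL_ID"].iloc[0]
    # Users who liked the target anime.
    fans = rating_df[(rating_df["anime_id"] == mal_id) &
                     (rating_df["rating"] >= like_th)]["user_id"].unique()
    # Other items those fans rated, with their precomputed cf/cb scores.
    cand = meta_feats[meta_feats["user_id"].isin(fans) &
                      (meta_feats["anime_id"] != mal_id)].copy()
    cand["pred"] = model.predict(cand[["cf_score", "cb_score"]])
    # Average the predicted rating per item and rank descending.
    return (cand.groupby("anime_id")["pred"].mean()
                .sort_values(ascending=False).head(top_n))
```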
- Clipping:
  - Predictions may be clipped to `[1, 10]` for display only.
  - Training & evaluation use raw values.
Example interpretation:
- “Items that users who liked Koe no Katachi are also highly likely to enjoy.”
- **Top-K neighbor masking**
  - Reduces noise; focuses on the strongest neighbors.
- **Raw predictions (no global scaling)**
  - Each model learns the rating scale directly.
  - Avoids distortions in ranking and RMSE.
- **User sampling (3000 users)**
  - Practical runtime with sufficient diversity.
  - For production: use distributed compute or ANN libraries (e.g., FAISS).
| Component | Complexity | Comment |
|---|---|---|
| Item similarity | O(n_items² · d) | Expensive; ANN/FAISS/Annoy recommended at scale |
| Meta dataset build | O(sampled users × their rated items) | Bounded by the 3000-user sample |
| Model training | Linear in samples | Only 2 features (cf, cb) → very fast |
| Reproducibility | — | `random_state=42`, `np.random.seed(42)` used |
- **RMSE**
  - Lower = better absolute predictive accuracy.
- **Precision@10 / Recall@10**
  - Reflect ranking quality of recommendations.
- **User Top-10 lists**
  - Reveal each model's tendency:
    - Safe vs. exploratory
    - Niche vs. popular bias
We can see that the model recommended anime in the "Action" and "Fantasy" genres, which the user likes. Here we used the LightGBM meta-learner, which has the best performance. The recommendations also show that titles of the same genre cluster together.
On the validation set, the NeuMF model did not outperform SVD: its low training RMSE (1.27) came with a higher validation RMSE (1.80 vs. SVD's 1.70), indicating overfitting.
Among the three meta-learner models (XGBoost, LightGBM, and CatBoost),
LightGBM achieved the best overall performance in both RMSE and ranking metrics (Precision@K, Recall@K).
This can be attributed to several key factors:
- **Histogram-based Gradient Boosting Efficiency**
  LightGBM groups continuous feature values into discrete bins (histogram-based splitting).
  This approach enables faster training, smoother optimization, and noise reduction,
  particularly with small meta-feature dimensions such as `(cf_score, cb_score)`.
- **Better Handling of Continuous Features**
  Since the meta-input consists of two continuous scores (from CF and CB),
  LightGBM's leaf-wise growth captures subtle nonlinear interactions more effectively
  than XGBoost's level-wise tree expansion.
- **Regularization and Feature Usage**
  LightGBM automatically prunes weak leaves during training, providing natural regularization.
  This results in better generalization across users with diverse rating behaviors.
- **Optimized for Small, Dense Features**
  While CatBoost excels with high-cardinality categorical data,
  LightGBM performs exceptionally well on low-dimensional numeric features,
  avoiding unnecessary complexity.
Summary:
Among all ensemble meta-learners, LightGBM demonstrated the highest predictive accuracy.
Its histogram-based optimization and efficient handling of low-dimensional continuous features
provided smoother gradient updates, reduced overfitting, and yielded superior generalization performance.