Click-Through Rate (CTR) Prediction using XGBoost

A machine learning project implementing multiple modeling approaches to predict click-through rates, with the final XGBoost model achieving a test RMSE of 0.059.

πŸ“Š Project Overview

This project explores various machine learning techniques to predict Click-Through Rates (CTR) for digital advertising. Through iterative experimentation with seasonality-based models, Random Forests, and gradient boosting methods, the final XGBoost implementation with hyperparameter tuning achieved superior performance.

Key Results:

  • Final Test RMSE: 0.059
  • Training RMSE: 0.086
  • Model: XGBoost with hyperparameter tuning
  • Improvement: Significant reduction from baseline models (seasonality RMSE: 0.999)

🎯 Business Impact

Accurate CTR prediction enables:

  • Improved ad targeting through better understanding of user engagement patterns
  • Optimized ad placement by predicting which ads will perform best
  • Enhanced user experience by serving more relevant advertisements
  • Increased ROI for advertising campaigns

πŸ› οΈ Technical Stack

  • Language: R
  • Key Libraries:
    • xgboost - Gradient boosting implementation
    • vtreat - Feature engineering and encoding
    • randomForest - Tree-based modeling
    • dplyr - Data manipulation
    • caret - Model training utilities
    • ggplot2 - Visualization (recommended for extensions)

πŸ“ Project Structure

ctr-prediction-xgboost/
β”œβ”€β”€ code/
β”‚   β”œβ”€β”€ 01_seasonality_model.R      # Initial seasonal analysis
β”‚   β”œβ”€β”€ 02_random_forest_model.R    # Random Forest implementation
β”‚   β”œβ”€β”€ 03_simple_boosting.R        # LightGBM model
β”‚   └── 04_xgboost_final.R          # Final XGBoost with tuning
β”œβ”€β”€ docs/
β”‚   β”œβ”€β”€ project_report.pdf          # Detailed analysis report
β”‚   └── presentation_slides.pdf     # Project presentation
β”œβ”€β”€ visualizations/
β”‚   └── (Generated plots and charts)
β”œβ”€β”€ data/
β”‚   └── README.md                   # Data documentation
└── README.md

πŸ” Methodology

1. Data Exploration & Preprocessing

Data Characteristics:

  • Target variable (CTR) with right-skewed distribution
  • Mix of numerical and categorical features
  • Missing values across multiple columns
  • Temporal features: time_of_day, day_of_week
  • Demographic features: age_group

Preprocessing Pipeline:

  • Missing Value Imputation:
    • Numerical: Median imputation (robust to outliers)
    • Categorical: Mode imputation with "Missing" level
  • Feature Encoding:
    • One-hot encoding for categorical variables
    • vtreat package for consistent preprocessing
  • Train-Test Split: 80-20 split with stratification

2. Model Development Journey

Approach 1: Seasonality-Based Linear Model

Hypothesis: User behavior varies by time of day and day of week

  • Morning: Higher engagement during commute/routine
  • Afternoon: Reduced engagement during work hours
  • Evening: Increased browsing during leisure time

Results:

  • Training RMSE: 0.999
  • Limitation: Seasonal trends alone insufficient for accurate prediction

Approach 2: Random Forest

Features: all available predictors, including age_group, time_of_day, and day_of_week

  • Ntree: 1000
  • Mtry: sqrt(number of features)

Challenges:

  • Mismatched factor levels between train and test
  • High cardinality in categorical variables
  • Lesson: Importance of careful categorical variable alignment

Approach 3: Simple Boosting (LightGBM)

Configuration:

  • Objective: Regression
  • Learning rate: 0.1
  • Number of leaves: 31
  • Rounds: 100

Results: Moderate performance; served as the foundation for the final XGBoost model

Approach 4: XGBoost with Hyperparameter Tuning βœ…

Feature Engineering:

  • vtreat preprocessing for robust encoding
  • Automatic handling of categorical levels
  • Feature importance analysis

Hyperparameter Grid Search:

param_grid:
  - eta: [0.01, 0.05, 0.1]
  - max_depth: [3, 6, 9]
  - subsample: [0.8, 1.0]
  - colsample_bytree: [0.8, 1.0]

Optimization Strategy:

  • 5-fold cross-validation
  • Early stopping (100 rounds)
  • Maximum 10,000 boosting rounds
  • RMSE as evaluation metric

Final Model Configuration:

  • Best parameters selected via grid search
  • Regularization to prevent overfitting
  • Watchlist monitoring for train-test convergence

πŸ“ˆ Results & Performance

Model Comparison

Model             Train RMSE   Test RMSE   Key Insight
Seasonality       0.999        N/A         Limited predictive power
Random Forest     N/A          N/A         Categorical encoding challenges
Simple Boosting   N/A          N/A         Good baseline performance
XGBoost (Final)   0.086        0.059       Best performance with tuning

Key Findings

  1. Feature Importance: XGBoost automatically identified most predictive features
  2. Regularization: Critical for preventing overfitting (train RMSE 0.086 vs test 0.059)
  3. Ensemble Methods: Significantly outperformed linear and single-tree approaches
  4. Preprocessing: vtreat package essential for consistent categorical encoding
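
For finding 1, a minimal sketch of extracting gain-based importance from the fitted booster (xgb_final from the training sketch above):

imp <- xgb.importance(model = xgb_final)
xgb.plot.importance(imp, top_n = 15)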

πŸš€ Usage

Prerequisites

install.packages(c("xgboost", "vtreat", "randomForest", "dplyr", "caret"))

Running the Final Model

# Load the final XGBoost script
source("code/04_xgboost_final.R")

# The script will:
# 1. Load and preprocess data
# 2. Perform hyperparameter tuning
# 3. Train final model
# 4. Generate predictions
# 5. Output submission file

Model Training Steps

  1. Data Loading: Read training and scoring datasets
  2. Preprocessing: Apply vtreat transformation
  3. Hyperparameter Tuning: Grid search with cross-validation
  4. Model Training: Train XGBoost with optimal parameters
  5. Evaluation: Calculate RMSE on train and test sets
  6. Prediction: Generate CTR predictions for scoring data

πŸŽ“ Key Learnings

Technical Insights

  1. Seasonal patterns require strong periodicity to be effective predictors
  2. Random Forests need careful categorical variable handling
  3. XGBoost excels with proper tuning and regularization
  4. Feature engineering is critical but must be domain-appropriate

Best Practices Applied

  • βœ… Consistent preprocessing pipeline
  • βœ… Cross-validation for model selection
  • βœ… Early stopping to prevent overfitting
  • βœ… Systematic hyperparameter tuning
  • βœ… Regular validation checks

πŸ”„ Future Improvements

Model Enhancements

  • Feature Engineering: Create interaction terms and polynomial features
  • Ensemble Methods: Stack multiple models (XGBoost + Random Forest + LightGBM)
  • Automated Tuning: Implement Bayesian optimization for hyperparameters
  • Deep Learning: Explore neural networks for complex patterns

Visualization & Analysis

  • Correlation Heatmaps: Feature relationship analysis
  • Distribution Plots: CTR and feature distributions
  • Hyperparameter Sensitivity: Visualize parameter impact on RMSE
  • Feature Importance Plots: SHAP values for interpretability
  • Learning Curves: Track model performance over iterations

Production Considerations

  • Model Deployment: Containerize model for production serving
  • Monitoring: Implement drift detection for model performance
  • A/B Testing: Framework for comparing model versions
  • Real-time Inference: Optimize for low-latency predictions

πŸ‘€ Author

Tracey Thanh Ho
Master of Science in Applied Analytics | Columbia University
Expected Graduation: December 2025

πŸ“„ License

This project is available for educational and portfolio purposes.


This project demonstrates proficiency in machine learning, R programming, hyperparameter optimization, and iterative model development.
