A machine learning project implementing multiple modeling approaches to predict click-through rates, with XGBoost achieving a test RMSE of 0.059.
This project explores various machine learning techniques for predicting Click-Through Rates (CTR) in digital advertising. Through iterative experimentation with seasonality-based models, Random Forests, and gradient boosting methods, a hyperparameter-tuned XGBoost model ultimately delivered the best performance of the approaches tried.
Key Results:
- Final Test RMSE: 0.059
- Training RMSE: 0.086
- Model: XGBoost with hyperparameter tuning
- Improvement: Significant reduction from baseline models (seasonality RMSE: 0.999)
Accurate CTR prediction enables:
- Improved ad targeting through better understanding of user engagement patterns
- Optimized ad placement by predicting which ads will perform best
- Enhanced user experience by serving more relevant advertisements
- Increased ROI for advertising campaigns
- Language: R
- Key Libraries:
  - xgboost - Gradient boosting implementation
  - vtreat - Feature engineering and encoding
  - randomForest - Tree-based modeling
  - dplyr - Data manipulation
  - caret - Model training utilities
  - ggplot2 - Visualization (recommended for extensions)
```
ctr-prediction-xgboost/
├── code/
│   ├── 01_seasonality_model.R     # Initial seasonal analysis
│   ├── 02_random_forest_model.R   # Random Forest implementation
│   ├── 03_simple_boosting.R       # LightGBM model
│   └── 04_xgboost_final.R         # Final XGBoost with tuning
├── docs/
│   ├── project_report.pdf         # Detailed analysis report
│   └── presentation_slides.pdf    # Project presentation
├── visualizations/
│   └── (Generated plots and charts)
├── data/
│   └── README.md                  # Data documentation
└── README.md
```
Data Characteristics:
- Target variable (CTR) with right-skewed distribution
- Mix of numerical and categorical features
- Missing values across multiple columns
- Temporal features: time_of_day, day_of_week
- Demographic features: age_group
Preprocessing Pipeline:
- Missing Value Imputation:
- Numerical: Median imputation (robust to outliers)
- Categorical: Mode imputation with "Missing" level
- Feature Encoding:
- One-hot encoding for categorical variables
- vtreat package for consistent preprocessing
- Train-Test Split: 80-20 split with stratification
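
The sketch below shows one way to implement this pipeline in R. The data path, the target column name (ctr), and the use of caret::createDataPartition for the stratified split are assumptions for illustration, not the project's exact code.

```r
library(caret)

# Hypothetical path and column names; adjust to the actual dataset
train_raw <- read.csv("data/train.csv")

# Median imputation for numeric columns (robust to outliers)
num_cols <- names(train_raw)[sapply(train_raw, is.numeric)]
for (col in num_cols) {
  train_raw[[col]][is.na(train_raw[[col]])] <- median(train_raw[[col]], na.rm = TRUE)
}

# Categorical columns: add an explicit "Missing" level for absent values
cat_cols <- names(train_raw)[sapply(train_raw, is.character)]
for (col in cat_cols) {
  train_raw[[col]][is.na(train_raw[[col]])] <- "Missing"
  train_raw[[col]] <- factor(train_raw[[col]])
}

# 80/20 split; createDataPartition stratifies a numeric target by quantile bins
set.seed(42)
idx   <- createDataPartition(train_raw$ctr, p = 0.8, list = FALSE)
train <- train_raw[idx, ]
test  <- train_raw[-idx, ]
```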
Hypothesis: User behavior varies by time of day and day of week
- Morning: Higher engagement during commute/routine
- Afternoon: Reduced engagement during work hours
- Evening: Increased browsing during leisure time
Results:
- Training RMSE: 0.999
- Limitation: Seasonal trends alone insufficient for accurate prediction
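
As a rough illustration, a seasonality-only baseline can be expressed as a regression on the temporal factors. This sketch assumes the train data frame from the preprocessing step above, with ctr, time_of_day, and day_of_week columns; it is not necessarily the project's exact formulation.

```r
# Seasonality-only baseline: CTR modeled purely from temporal factors
seasonal_fit <- lm(ctr ~ factor(time_of_day) + factor(day_of_week), data = train)

seasonal_preds <- predict(seasonal_fit, newdata = train)
seasonal_rmse  <- sqrt(mean((train$ctr - seasonal_preds)^2))
seasonal_rmse  # the project reports roughly 0.999 for this style of model
```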
Features: age_group, time_of_day, day_of_week, and all available predictors
- Ntree: 1000
- Mtry: sqrt(number of features)
Challenges:
- Mismatched factor levels between train and test
- High cardinality in categorical variables
- Lesson: Importance of careful categorical variable alignment
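
A sketch of the Random Forest setup and the level-alignment fix is shown below. It assumes the train/test frames from the preprocessing sketch and folds unseen test levels into an existing level (e.g. "Missing"), which is one of several reasonable ways to handle the mismatch.

```r
library(randomForest)

# Align categorical levels so test contains no levels unseen during training
for (col in names(train)[sapply(train, is.factor)]) {
  lv   <- levels(train[[col]])
  vals <- as.character(test[[col]])
  fallback <- if ("Missing" %in% lv) "Missing" else lv[1]
  vals[!vals %in% lv] <- fallback
  test[[col]] <- factor(vals, levels = lv)
}

# Note: randomForest caps factors at 53 levels, so high-cardinality
# variables may need grouping or re-encoding first
set.seed(42)
rf_fit <- randomForest(
  ctr ~ .,
  data  = train,
  ntree = 1000,
  mtry  = floor(sqrt(ncol(train) - 1))  # sqrt of the number of predictors
)

rf_preds <- predict(rf_fit, newdata = test)
```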
Configuration:
- Objective: Regression
- Learning rate: 0.1
- Number of leaves: 31
- Rounds: 100
Results: Moderate performance; served as a foundation for the final XGBoost model
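
A sketch of this configuration using the lightgbm R package (not listed among the key libraries above, so treat it as an assumption) might look like the following; model.matrix is used here only to obtain a numeric feature matrix.

```r
library(lightgbm)

# One-hot encode predictors into a numeric matrix for LightGBM
x_train    <- model.matrix(ctr ~ . - 1, data = train)
dtrain_lgb <- lgb.Dataset(data = x_train, label = train$ctr)

params <- list(
  objective     = "regression",
  metric        = "rmse",
  learning_rate = 0.1,
  num_leaves    = 31
)

lgb_fit <- lgb.train(params = params, data = dtrain_lgb, nrounds = 100)
```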
Feature Engineering:
- vtreat preprocessing for robust encoding
- Automatic handling of categorical levels
- Feature importance analysis
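
The vtreat step could look roughly like this. The designTreatmentsN/prepare pattern is the standard approach for a numeric outcome; the project's script may instead use mkCrossFrameNExperiment for stricter leakage control.

```r
library(vtreat)

outcome <- "ctr"
varlist <- setdiff(names(train), outcome)

# Design the treatment plan on training data only
treat_plan <- designTreatmentsN(train, varlist, outcome, verbose = FALSE)

# Apply the same plan everywhere so categorical levels are encoded consistently
train_treated <- prepare(treat_plan, train)
test_treated  <- prepare(treat_plan, test)
```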
Hyperparameter Grid Search:
param_grid:
- eta: [0.01, 0.05, 0.1]
- max_depth: [3, 6, 9]
- subsample: [0.8, 1.0]
- colsample_bytree: [0.8, 1.0]

Optimization Strategy:
- 5-fold cross-validation
- Early stopping (100 rounds)
- Maximum 10,000 boosting rounds
- RMSE as evaluation metric
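
A sketch of the search loop is shown below, assuming xgb.DMatrix objects built from the vtreat-treated frames above; the parameter grid matches the values listed, while the surrounding code is illustrative.

```r
library(xgboost)

dtrain <- xgb.DMatrix(
  data  = as.matrix(train_treated[, setdiff(names(train_treated), "ctr")]),
  label = train_treated$ctr
)

grid <- expand.grid(
  eta              = c(0.01, 0.05, 0.1),
  max_depth        = c(3, 6, 9),
  subsample        = c(0.8, 1.0),
  colsample_bytree = c(0.8, 1.0)
)

cv_results <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  params <- c(list(objective = "reg:squarederror", eval_metric = "rmse"),
              as.list(grid[i, ]))
  cv <- xgb.cv(
    params                = params,
    data                  = dtrain,
    nrounds               = 10000,          # generous cap; early stopping ends sooner
    nfold                 = 5,
    early_stopping_rounds = 100,
    verbose               = FALSE
  )
  data.frame(grid[i, ],
             best_iteration = cv$best_iteration,
             cv_rmse        = min(cv$evaluation_log$test_rmse_mean))
}))

best <- cv_results[which.min(cv_results$cv_rmse), ]
```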
Final Model Configuration:
- Best parameters selected via grid search
- Regularization to prevent overfitting
- Watchlist monitoring for train-test convergence
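
Given the best row from the grid search above, the final fit with watchlist monitoring could be sketched as follows (again assuming the treated train/test frames).

```r
dtest <- xgb.DMatrix(
  data  = as.matrix(test_treated[, setdiff(names(test_treated), "ctr")]),
  label = test_treated$ctr
)

final_params <- list(
  objective        = "reg:squarederror",
  eval_metric      = "rmse",
  eta              = best$eta,
  max_depth        = best$max_depth,
  subsample        = best$subsample,
  colsample_bytree = best$colsample_bytree
)

xgb_final <- xgb.train(
  params                = final_params,
  data                  = dtrain,
  nrounds               = 10000,
  watchlist             = list(train = dtrain, test = dtest),  # monitor both sets
  early_stopping_rounds = 100,
  verbose               = 0
)
```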
| Model | Train RMSE | Test RMSE | Key Insight |
|---|---|---|---|
| Seasonality | 0.999 | N/A | Limited predictive power |
| Random Forest | N/A | N/A | Categorical encoding challenges |
| Simple Boosting | N/A | N/A | Good baseline performance |
| XGBoost (Final) | 0.086 | 0.059 | Best performance with tuning |
- Feature Importance: XGBoost automatically identified most predictive features
- Regularization: Critical for preventing overfitting (train RMSE 0.086 vs test 0.059)
- Ensemble Methods: Significantly outperformed linear and single-tree approaches
- Preprocessing: vtreat package essential for consistent categorical encoding
```r
install.packages(c("xgboost", "vtreat", "randomForest", "dplyr", "caret"))

# Load the final XGBoost script
source("code/04_xgboost_final.R")

# The script will:
# 1. Load and preprocess data
# 2. Perform hyperparameter tuning
# 3. Train final model
# 4. Generate predictions
# 5. Output submission file
```

- Data Loading: Read training and scoring datasets
- Preprocessing: Apply vtreat transformation
- Hyperparameter Tuning: Grid search with cross-validation
- Model Training: Train XGBoost with optimal parameters
- Evaluation: Calculate RMSE on train and test sets
- Prediction: Generate CTR predictions for scoring data
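
In code, the last three steps reduce to a few lines. The scoring file name and submission format are assumptions, and the objects (xgb_final, treat_plan, dtrain, dtest) come from the sketches above.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train_treated$ctr, predict(xgb_final, dtrain))
test_rmse  <- rmse(test_treated$ctr,  predict(xgb_final, dtest))

# Score new data through the same vtreat plan, then write predictions
scoring_treated <- prepare(treat_plan, read.csv("data/scoring.csv"))
scoring_matrix  <- as.matrix(scoring_treated[, setdiff(names(scoring_treated), "ctr")])
write.csv(
  data.frame(predicted_ctr = predict(xgb_final, scoring_matrix)),
  "submission.csv", row.names = FALSE
)
```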
- Seasonal patterns require strong periodicity to be effective predictors
- Random Forests need careful categorical variable handling
- XGBoost excels with proper tuning and regularization
- Feature engineering is critical but must be domain-appropriate
- Consistent preprocessing pipeline
- Cross-validation for model selection
- Early stopping to prevent overfitting
- Systematic hyperparameter tuning
- Regular validation checks
- Feature Engineering: Create interaction terms and polynomial features
- Ensemble Methods: Stack multiple models (XGBoost + Random Forest + LightGBM)
- Automated Tuning: Implement Bayesian optimization for hyperparameters
- Deep Learning: Explore neural networks for complex patterns
- Correlation Heatmaps: Feature relationship analysis
- Distribution Plots: CTR and feature distributions
- Hyperparameter Sensitivity: Visualize parameter impact on RMSE
- Feature Importance Plots: SHAP values for interpretability
- Learning Curves: Track model performance over iterations
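
As a starting point for the feature-importance and learning-curve items, XGBoost's built-in helpers can be applied to the fitted model from the sketches above; SHAP-based plots would require an additional package such as SHAPforxgboost.

```r
library(xgboost)
library(ggplot2)

# Gain-based feature importance from the fitted model
importance <- xgb.importance(model = xgb_final)
xgb.plot.importance(importance, top_n = 15)

# Learning curve: train vs. test RMSE per boosting round (from the watchlist log)
eval_log <- xgb_final$evaluation_log
ggplot(eval_log, aes(x = iter)) +
  geom_line(aes(y = train_rmse, colour = "train")) +
  geom_line(aes(y = test_rmse,  colour = "test")) +
  labs(x = "Boosting round", y = "RMSE", colour = NULL)
```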
- Model Deployment: Containerize model for production serving
- Monitoring: Implement drift detection for model performance
- A/B Testing: Framework for comparing model versions
- Real-time Inference: Optimize for low-latency predictions
- XGBoost Documentation
- vtreat Package Guide
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System
Tracey Thanh Ho
Master of Science in Applied Analytics | Columbia University
Expected Graduation: December 2025
This project is available for educational and portfolio purposes.
This project demonstrates proficiency in machine learning, R programming, hyperparameter optimization, and iterative model development.