A machine learning project implementing multiple modeling approaches to predict click-through rates, with XGBoost achieving a test RMSE of 0.059.
This project explores various machine learning techniques for predicting Click-Through Rates (CTR) in digital advertising. Through iterative experimentation with seasonality-based models, Random Forests, and gradient boosting methods, a hyperparameter-tuned XGBoost model ultimately delivered the best performance of the approaches tried.
Key Results:
- Final Test RMSE: 0.059
- Training RMSE: 0.086
- Model: XGBoost with hyperparameter tuning
- Improvement: Significant reduction from baseline models (seasonality RMSE: 0.999)
Accurate CTR prediction enables:
- Improved ad targeting through better understanding of user engagement patterns
- Optimized ad placement by predicting which ads will perform best
- Enhanced user experience by serving more relevant advertisements
- Increased ROI for advertising campaigns
- Language: R
- Key Libraries:
  - xgboost - Gradient boosting implementation
  - vtreat - Feature engineering and encoding
  - randomForest - Tree-based modeling
  - dplyr - Data manipulation
  - caret - Model training utilities
  - ggplot2 - Visualization (recommended for extensions)
```
ctr-prediction-xgboost/
├── code/
│   ├── 01_seasonality_model.R     # Initial seasonal analysis
│   ├── 02_random_forest_model.R   # Random Forest implementation
│   ├── 03_simple_boosting.R       # LightGBM model
│   └── 04_xgboost_final.R         # Final XGBoost with tuning
├── docs/
│   ├── project_report.pdf         # Detailed analysis report
│   └── presentation_slides.pdf    # Project presentation
├── visualizations/
│   └── (Generated plots and charts)
├── data/
│   └── README.md                  # Data documentation
└── README.md
```
Data Characteristics:
- Target variable (CTR) with right-skewed distribution
- Mix of numerical and categorical features
- Missing values across multiple columns
- Temporal features: time_of_day, day_of_week
- Demographic features: age_group
Preprocessing Pipeline:
- Missing Value Imputation:
- Numerical: Median imputation (robust to outliers)
- Categorical: Mode imputation with "Missing" level
- Feature Encoding:
- One-hot encoding for categorical variables
- vtreat package for consistent preprocessing
- Train-Test Split: 80-20 split with stratification
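
The sketch below shows one way to implement this pipeline in R. The data path, the target column name (ctr), and the use of caret::createDataPartition for the stratified split are assumptions for illustration, not the project's exact code.

```r
library(caret)

# Hypothetical path and column names; adjust to the actual dataset
train_raw <- read.csv("data/train.csv")

# Median imputation for numeric columns (robust to outliers)
num_cols <- names(train_raw)[sapply(train_raw, is.numeric)]
for (col in num_cols) {
  train_raw[[col]][is.na(train_raw[[col]])] <- median(train_raw[[col]], na.rm = TRUE)
}

# Categorical columns: add an explicit "Missing" level for absent values
cat_cols <- names(train_raw)[sapply(train_raw, is.character)]
for (col in cat_cols) {
  train_raw[[col]][is.na(train_raw[[col]])] <- "Missing"
  train_raw[[col]] <- factor(train_raw[[col]])
}

# 80/20 split; createDataPartition stratifies a numeric target by quantile bins
set.seed(42)
idx   <- createDataPartition(train_raw$ctr, p = 0.8, list = FALSE)
train <- train_raw[idx, ]
test  <- train_raw[-idx, ]
```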
Hypothesis: User behavior varies by time of day and day of week
- Morning: Higher engagement during commute/routine
- Afternoon: Reduced engagement during work hours
- Evening: Increased browsing during leisure time
Results:
- Training RMSE: 0.999
- Limitation: Seasonal trends alone insufficient for accurate prediction
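
As a rough illustration, a seasonality-only baseline can be expressed as a regression on the temporal factors. This sketch assumes the train data frame from the preprocessing step above, with ctr, time_of_day, and day_of_week columns; it is not necessarily the project's exact formulation.

```r
# Seasonality-only baseline: CTR modeled purely from temporal factors
seasonal_fit <- lm(ctr ~ factor(time_of_day) + factor(day_of_week), data = train)

seasonal_preds <- predict(seasonal_fit, newdata = train)
seasonal_rmse  <- sqrt(mean((train$ctr - seasonal_preds)^2))
seasonal_rmse  # the project reports roughly 0.999 for this style of model
```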
Features: age_group, time_of_day, day_of_week, and all available predictors
- Ntree: 1000
- Mtry: sqrt(number of features)
Challenges:
- Mismatched factor levels between train and test
- High cardinality in categorical variables
- Lesson: Importance of careful categorical variable alignment
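
A sketch of the Random Forest setup and the level-alignment fix is shown below. It assumes the train/test frames from the preprocessing sketch and folds unseen test levels into an existing level (e.g. "Missing"), which is one of several reasonable ways to handle the mismatch.

```r
library(randomForest)

# Align categorical levels so test contains no levels unseen during training
for (col in names(train)[sapply(train, is.factor)]) {
  lv   <- levels(train[[col]])
  vals <- as.character(test[[col]])
  fallback <- if ("Missing" %in% lv) "Missing" else lv[1]
  vals[!vals %in% lv] <- fallback
  test[[col]] <- factor(vals, levels = lv)
}

# Note: randomForest caps factors at 53 levels, so high-cardinality
# variables may need grouping or re-encoding first
set.seed(42)
rf_fit <- randomForest(
  ctr ~ .,
  data  = train,
  ntree = 1000,
  mtry  = floor(sqrt(ncol(train) - 1))  # sqrt of the number of predictors
)

rf_preds <- predict(rf_fit, newdata = test)
```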
Configuration:
- Objective: Regression
- Learning rate: 0.1
- Number of leaves: 31
- Rounds: 100
Results: Moderate performance; served as a foundation for the final XGBoost model
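
A sketch of this configuration using the lightgbm R package (not listed among the key libraries above, so treat it as an assumption) might look like the following; model.matrix is used here only to obtain a numeric feature matrix.

```r
library(lightgbm)

# One-hot encode predictors into a numeric matrix for LightGBM
x_train    <- model.matrix(ctr ~ . - 1, data = train)
dtrain_lgb <- lgb.Dataset(data = x_train, label = train$ctr)

params <- list(
  objective     = "regression",
  metric        = "rmse",
  learning_rate = 0.1,
  num_leaves    = 31
)

lgb_fit <- lgb.train(params = params, data = dtrain_lgb, nrounds = 100)
```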
Feature Engineering:
- vtreat preprocessing for robust encoding
- Automatic handling of categorical levels
- Feature importance analysis
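
The vtreat step could look roughly like this. The designTreatmentsN/prepare pattern is the standard approach for a numeric outcome; the project's script may instead use mkCrossFrameNExperiment for stricter leakage control.

```r
library(vtreat)

outcome <- "ctr"
varlist <- setdiff(names(train), outcome)

# Design the treatment plan on training data only
treat_plan <- designTreatmentsN(train, varlist, outcome, verbose = FALSE)

# Apply the same plan everywhere so categorical levels are encoded consistently
train_treated <- prepare(treat_plan, train)
test_treated  <- prepare(treat_plan, test)
```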
Hyperparameter Grid Search:
param_grid:
- eta: [0.01, 0.05, 0.1]
- max_depth: [3, 6, 9]
- subsample: [0.8, 1.0]
- colsample_bytree: [0.8, 1.0]

Optimization Strategy:
- 5-fold cross-validation
- Early stopping (100 rounds)
- Maximum 10,000 boosting rounds
- RMSE as evaluation metric
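
A sketch of the search loop is shown below, assuming xgb.DMatrix objects built from the vtreat-treated frames above; the parameter grid matches the values listed, while the surrounding code is illustrative.

```r
library(xgboost)

dtrain <- xgb.DMatrix(
  data  = as.matrix(train_treated[, setdiff(names(train_treated), "ctr")]),
  label = train_treated$ctr
)

grid <- expand.grid(
  eta              = c(0.01, 0.05, 0.1),
  max_depth        = c(3, 6, 9),
  subsample        = c(0.8, 1.0),
  colsample_bytree = c(0.8, 1.0)
)

cv_results <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  params <- c(list(objective = "reg:squarederror", eval_metric = "rmse"),
              as.list(grid[i, ]))
  cv <- xgb.cv(
    params                = params,
    data                  = dtrain,
    nrounds               = 10000,          # generous cap; early stopping ends sooner
    nfold                 = 5,
    early_stopping_rounds = 100,
    verbose               = FALSE
  )
  data.frame(grid[i, ],
             best_iteration = cv$best_iteration,
             cv_rmse        = min(cv$evaluation_log$test_rmse_mean))
}))

best <- cv_results[which.min(cv_results$cv_rmse), ]
```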
Final Model Configuration:
- Best parameters selected via grid search
- Regularization to prevent overfitting
- Watchlist monitoring for train-test convergence
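
Given the best row from the grid search above, the final fit with watchlist monitoring could be sketched as follows (again assuming the treated train/test frames).

```r
dtest <- xgb.DMatrix(
  data  = as.matrix(test_treated[, setdiff(names(test_treated), "ctr")]),
  label = test_treated$ctr
)

final_params <- list(
  objective        = "reg:squarederror",
  eval_metric      = "rmse",
  eta              = best$eta,
  max_depth        = best$max_depth,
  subsample        = best$subsample,
  colsample_bytree = best$colsample_bytree
)

xgb_final <- xgb.train(
  params                = final_params,
  data                  = dtrain,
  nrounds               = 10000,
  watchlist             = list(train = dtrain, test = dtest),  # monitor both sets
  early_stopping_rounds = 100,
  verbose               = 0
)
```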
| Model | Train RMSE | Test RMSE | Key Insight |
|---|---|---|---|
| Seasonality | 0.999 | N/A | Limited predictive power |
| Random Forest | N/A | N/A | Categorical encoding challenges |
| Simple Boosting | N/A | N/A | Good baseline performance |
| XGBoost (Final) | 0.086 | 0.059 | Best performance with tuning |
- Feature Importance: XGBoost automatically identified most predictive features
- Regularization: Critical for preventing overfitting (train RMSE 0.086 vs test 0.059)
- Ensemble Methods: Significantly outperformed linear and single-tree approaches
- Preprocessing: vtreat package essential for consistent categorical encoding
```r
install.packages(c("xgboost", "vtreat", "randomForest", "dplyr", "caret"))

# Load the final XGBoost script
source("code/04_xgboost_final.R")

# The script will:
# 1. Load and preprocess data
# 2. Perform hyperparameter tuning
# 3. Train final model
# 4. Generate predictions
# 5. Output submission file
```

- Data Loading: Read training and scoring datasets
- Preprocessing: Apply vtreat transformation
- Hyperparameter Tuning: Grid search with cross-validation
- Model Training: Train XGBoost with optimal parameters
- Evaluation: Calculate RMSE on train and test sets
- Prediction: Generate CTR predictions for scoring data
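
In code, the last three steps reduce to a few lines. The scoring file name and submission format are assumptions, and the objects (xgb_final, treat_plan, dtrain, dtest) come from the sketches above.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train_treated$ctr, predict(xgb_final, dtrain))
test_rmse  <- rmse(test_treated$ctr,  predict(xgb_final, dtest))

# Score new data through the same vtreat plan, then write predictions
scoring_treated <- prepare(treat_plan, read.csv("data/scoring.csv"))
scoring_matrix  <- as.matrix(scoring_treated[, setdiff(names(scoring_treated), "ctr")])
write.csv(
  data.frame(predicted_ctr = predict(xgb_final, scoring_matrix)),
  "submission.csv", row.names = FALSE
)
```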
- Seasonal patterns require strong periodicity to be effective predictors
- Random Forests need careful categorical variable handling
- XGBoost excels with proper tuning and regularization
- Feature engineering is critical but must be domain-appropriate
- Consistent preprocessing pipeline
- Cross-validation for model selection
- Early stopping to prevent overfitting
- Systematic hyperparameter tuning
- Regular validation checks
- Feature Engineering: Create interaction terms and polynomial features
- Ensemble Methods: Stack multiple models (XGBoost + Random Forest + LightGBM)
- Automated Tuning: Implement Bayesian optimization for hyperparameters
- Deep Learning: Explore neural networks for complex patterns
- Correlation Heatmaps: Feature relationship analysis
- Distribution Plots: CTR and feature distributions
- Hyperparameter Sensitivity: Visualize parameter impact on RMSE
- Feature Importance Plots: SHAP values for interpretability
- Learning Curves: Track model performance over iterations
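
As a starting point for the feature-importance and learning-curve items, XGBoost's built-in helpers can be applied to the fitted model from the sketches above; SHAP-based plots would require an additional package such as SHAPforxgboost.

```r
library(xgboost)
library(ggplot2)

# Gain-based feature importance from the fitted model
importance <- xgb.importance(model = xgb_final)
xgb.plot.importance(importance, top_n = 15)

# Learning curve: train vs. test RMSE per boosting round (from the watchlist log)
eval_log <- xgb_final$evaluation_log
ggplot(eval_log, aes(x = iter)) +
  geom_line(aes(y = train_rmse, colour = "train")) +
  geom_line(aes(y = test_rmse,  colour = "test")) +
  labs(x = "Boosting round", y = "RMSE", colour = NULL)
```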
- Model Deployment: Containerize model for production serving
- Monitoring: Implement drift detection for model performance
- A/B Testing: Framework for comparing model versions
- Real-time Inference: Optimize for low-latency predictions
- XGBoost Documentation
- vtreat Package Guide
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System
Tracey Thanh Ho
Master of Science in Applied Analytics | Columbia University
Expected Graduation: December 2025
This project is available for educational and portfolio purposes.
This project demonstrates proficiency in machine learning, R programming, hyperparameter optimization, and iterative model development.