🤖 ML Project: Predicting Workers' Compensation Claim Severity 🗽

✨ Project Overview

This project, completed for the Machine Learning course within the Master's in Data Science and Advanced Analytics program at NOVA IMS, focuses on applying supervised learning techniques to predict the severity of workers' compensation claims handled by the New York Work Compensation Board (NWCB). The primary objective is to develop a robust multiclass classification model to predict one of eight possible Claim Injury Type categories, aiming to automate and streamline the claim adjudication process, addressing the increasing volume and manual review time.

📚 Context

Analyzing and adjudicating workers' compensation claims is a critical but resource-intensive task for the NWCB. Facing an upward trend in claim submissions (as noted in the NWCB 2023 Annual Report), this project leverages machine learning to predict claim severity, potentially freeing up resources and improving processing efficiency.

👥 Team

TP2 | TBL Group 33

André Silvestre, 20240502
João Henriques, 20240499
Simone Genovese, 20241459
Steven Carlson, 20240554
Vinícius Pinto, 20211682
Zofia Wojcik, 20240654

🏗️ Project Methodology (CRISP-DM)

The project meticulously followed the CRISP-DM (Cross Industry Standard Process for Data Mining) framework, ensuring a structured approach from problem definition to solution deployment.

Figure 1: Overall Project Flow (CRISP-DM Cycle)

Here's a breakdown of the activities undertaken in each phase:

Business Understanding: 💡
- Problem: High volume and manual processing time for NWCB workers' compensation claims.
- Goal: Develop an ML model to automatically predict claim severity (Claim Injury Type) based on initial claim data.
- Objective: Build a multiclass classification model distinguishing between 8 injury types, aiming for high F1-Macro performance.
Data Understanding: 🔍
- Data Collection: Utilized NWCB public data (Train: ~593K rows, 33 cols; Test: ~388K rows, 30 cols).
- Initial Exploration: Analyzed feature types, distributions (Appendix C), identified the target (Claim Injury Type) and unique ID (Claim Identifier). Key challenge identified: significant class imbalance (Figure C1).
- Quality Assessment: Detected missing values, potential outliers (Age, Dates), and inconsistencies (non-numeric Zips, date ranges).
- Link to Notebook: 1_BU&EDA_MLProject_Group33.ipynb
Data Preparation: 🛠️
- Initial Cleaning: Dropped irrelevant/non-test columns, handled rows with many missing values, addressed specific anomalies (Table 2.2 / C1).
- Missing Values: Imputed numerical features using KNN Imputer; created 'Unknown' category for Industry Code Description.
- Outlier Handling: Analyzed using IQR, retained most outliers to preserve data variability, but created binary features (e.g., IME-4 Reported, Average Weekly Wage Reported) to mitigate the impact of extreme values/missingness patterns.
- Feature Engineering: Extracted temporal components (Year, Month, Day, Weekday) from dates; cleaned Age at Injury/Birth Year and created Age at Injury Group; bucketed high-cardinality categoricals (WCIO codes, Carrier Type); created binary flags for missing dates/specific reports.
- Encoding: Applied OrdinalEncoder (Age at Injury Group) and OneHotEncoder (other nominal/binary categoricals).
- Data Splitting: Used a 75% Training / 25% Validation Hold-Out split.
- Feature Selection: Implemented a multi-faceted strategy (visualized below) combining Filter (Spearman, Cramér's V, Chi-Squared, VIF), Wrapper (RFE), and Embedded (Lasso, Ridge) methods. A 2/3 majority vote selected the final 27 features (Appendix D).
- Scaling: Tested MinMaxScaler, StandardScaler, RobustScaler to prepare data for scale-sensitive algorithms and feature selection steps (Annex D).
- Link to Notebook: 2_FeatureEngineering_MLProject_Group33.ipynb
Figure 2: Feature Selection Process Flowchart
Modeling: 🤖
- Algorithms: Trained and evaluated Logistic Regression, Naive Bayes (Gaussian/Categorical), KNN, MLP Neural Network, Decision Tree, Random Forest, CatBoost, ExtraTrees, Bagging (LR base), and Stacking (RF + LR).
- Strategy: Models were tested on original and scaled (MinMax, Standard, Robust) data. K-Means SMOTE resampling was tested (Annex F) but ultimately discarded as it increased overfitting without improving validation performance compared to models trained on original imbalanced data (Table G1). The Hold-Out strategy was maintained (Figure 3).
- Hyperparameter Tuning: Optimized Random Forest using GridSearchCV (Table H1), improving validation F1-Macro from 0.40 to 0.42. Base parameters were used for CatBoost due to computational cost.
- Link to Notebook: 3_Modeling&Evaluation_MLProject_Group33.ipynb
Figure 3: Model Training and Evaluation Strategy Flowchart
Evaluation: ✅
- Metrics: Performance assessed using Accuracy, Precision, Recall, F1-Score (Macro) (primary due to imbalance), and AUROC.
- Selection Criteria: Focused on high validation F1-Macro (>0.4), low overfitting (Train-Validation F1 difference < 0.1), and strong secondary metrics (Accuracy, AUROC), detailed in Table F1.
- Results: CatBoost (on original data) and Random Forest were the top models. CatBoost achieved the best performance on the final Kaggle test set evaluation (Appendix G).
Deployment: 🚀
- Web Application: Developed an interactive Streamlit dashboard (Live App) featuring:
  - A prediction interface using the final CatBoost model.
  - A data exploration section for interactive analysis (Figure I1).
- Interpretability: Integrated LIME to explain individual predictions, showing feature contributions for transparency and actionable insights (Figure J1).

📈 Deliverables & Outputs

Jupyter Notebooks: Organized by CRISP-DM phase within the main project repository.
Excel Report: Comprehensive results, analysis, and model comparisons in ML_Excel_ReportResults_Group33.xlsx.
Streamlit Web App:
- Deployed Application: https://mlproject-wcb-group33.streamlit.app/
- Source Code Repository: https://github.com/Silvestre17/ML_WebApp_Group33

🔑 Keywords

Workers' Compensation Claims; Machine Learning; Multiclass Classification; Classification Models; Ensemble Learning; Random Forest; CatBoost; Feature Engineering; Feature Selection; Data Exploration; Imbalanced Data; KMeansSMOTE; CRISP-DM; Streamlit; LIME; Predictive Modeling; Data Science.

📔 Conclusion

This project successfully navigated the CRISP-DM process to develop and evaluate machine learning models for predicting workers' compensation claim severity. Ensemble methods, particularly CatBoost and Random Forest, demonstrated the strongest performance despite the challenge of significant class imbalance. The implemented Streamlit application provides a practical interface for prediction, while LIME enhances model transparency. While resampling techniques like K-Means SMOTE did not improve final results in this case, the thorough feature engineering and selection process proved crucial. This work establishes a strong foundation for data-driven decision support within the NWCB, with potential for further refinement through techniques like textual analysis or more advanced resampling/validation strategies.

Explore the notebooks and the live application for a deeper dive into the methodology and results!

Name		Name	Last commit message	Last commit date
Latest commit History 158 Commits
[ML]_Project_EDAOutputs_Group33		[ML]_Project_EDAOutputs_Group33
data		data
img		img
submissions		submissions
.gitattributes		.gitattributes
.gitignore		.gitignore
1_BU&EDA_MLProject_Group33.ipynb		1_BU&EDA_MLProject_Group33.ipynb
2_FeatureEngineering_MLProject_Group33.ipynb		2_FeatureEngineering_MLProject_Group33.ipynb
3_Modeling&Evaluation_MLProject_Group33.ipynb		3_Modeling&Evaluation_MLProject_Group33.ipynb
LICENSE		LICENSE
ML_Excel_ReportResults_Group33.xlsx		ML_Excel_ReportResults_Group33.xlsx
ML_Group33_Report.pdf		ML_Group33_Report.pdf
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🤖 ML Project: Predicting Workers' Compensation Claim Severity 🗽

✨ Project Overview

📚 Context

👥 Team

🏗️ Project Methodology (CRISP-DM)

📈 Deliverables & Outputs

🔑 Keywords

📔 Conclusion

About

Uh oh!

Contributors 3

Uh oh!

Languages

License

Silvestre17/ML_PredictingWorkersCompensationClaimSeverity_MasterProject

Folders and files

Latest commit

History

Repository files navigation

🤖 ML Project: Predicting Workers' Compensation Claim Severity 🗽

✨ Project Overview

📚 Context

👥 Team

🏗️ Project Methodology (CRISP-DM)

📈 Deliverables & Outputs

🔑 Keywords

📔 Conclusion

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors 3

Uh oh!

Languages