
πŸ€– Machine Learning project to predict the severity of workers' compensation claims using CatBoost & Random Forest. Deployed as an interactive Streamlit app with LIME for interpretability. Developed for the ML course at NOVA IMS.


WCB Group33 Banner

πŸ€– ML Project: Predicting Workers' Compensation Claim Severity πŸ—½

✨ Project Overview

This project, completed for the Machine Learning course within the Master's in Data Science and Advanced Analytics program at NOVA IMS, applies supervised learning to predict the severity of workers' compensation claims handled by the New York Workers' Compensation Board (NWCB). The primary objective is to develop a robust multiclass classification model that predicts one of eight possible Claim Injury Type categories, helping automate and streamline the claim adjudication process in the face of growing claim volumes and manual review times.

GitHub Repo Live Streamlit App WebApp Repo

πŸ“š Context

Analyzing and adjudicating workers' compensation claims is a critical but resource-intensive task for the NWCB. Facing an upward trend in claim submissions (as noted in the NWCB 2023 Annual Report), this project leverages machine learning to predict claim severity, potentially freeing up resources and improving processing efficiency.

πŸ‘₯ Team

TP2 | TBL Group 33

  • AndrΓ© Silvestre, 20240502
  • JoΓ£o Henriques, 20240499
  • Simone Genovese, 20241459
  • Steven Carlson, 20240554
  • VinΓ­cius Pinto, 20211682
  • Zofia Wojcik, 20240654

πŸ—οΈ Project Methodology (CRISP-DM)

The project meticulously followed the CRISP-DM (Cross Industry Standard Process for Data Mining) framework, ensuring a structured approach from problem definition to solution deployment.

Project Flowchart

Figure 1: Overall Project Flow (CRISP-DM Cycle)

Here's a breakdown of the activities undertaken in each phase:

  1. Business Understanding: πŸ’‘

    • Problem: High volume and manual processing time for NWCB workers' compensation claims.
    • Goal: Develop an ML model to automatically predict claim severity (Claim Injury Type) based on initial claim data.
    • Objective: Build a multiclass classification model distinguishing between 8 injury types, aiming for high F1-Macro performance.

    Python
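Under heavy class imbalance, macro-averaged F1 treats every class equally regardless of its frequency, which is why it serves as the primary metric here. A minimal sketch of the difference between macro and micro averaging (the labels are illustrative, not real claim data):

```python
from sklearn.metrics import f1_score

# Hypothetical predictions over 3 of the 8 Claim Injury Type classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2]

# Macro-F1 averages per-class F1 equally, so minority classes weigh
# as much as the dominant class; micro-F1 reduces to accuracy here.
macro = f1_score(y_true, y_pred, average="macro")
micro = f1_score(y_true, y_pred, average="micro")
print(round(macro, 3), round(micro, 3))  # → 0.739 0.75
```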

  2. Data Understanding: πŸ”

    • Data Collection: Utilized NWCB public data (Train: ~593K rows, 33 cols; Test: ~388K rows, 30 cols).
    • Initial Exploration: Analyzed feature types, distributions (Appendix C), identified the target (Claim Injury Type) and unique ID (Claim Identifier). Key challenge identified: significant class imbalance (Figure C1).
    • Quality Assessment: Detected missing values, potential outliers (Age, Dates), and inconsistencies (non-numeric Zips, date ranges).
    • Link to Notebook: 1_BU&EDA_MLProject_Group33.ipynb

    Pandas NumPy Matplotlib Seaborn SciPy Stats
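The class imbalance in the target can be surfaced with a simple frequency check. A sketch on a synthetic stand-in for the training data (category labels and proportions are illustrative, not the real NWCB distribution):

```python
import pandas as pd

# Synthetic stand-in for the NWCB training data: the real target has
# 8 categories and a different (but similarly skewed) distribution.
df = pd.DataFrame({
    "Claim Injury Type": ["2. NON-COMP"] * 70 + ["4. TEMPORARY"] * 20
                         + ["3. MED ONLY"] * 8 + ["8. DEATH"] * 2
})

# Relative class frequencies reveal the imbalance that motivates
# F1-Macro over plain accuracy as the primary metric.
shares = df["Claim Injury Type"].value_counts(normalize=True)
print(shares)
```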

  3. Data Preparation: πŸ› οΈ

    • Initial Cleaning: Dropped irrelevant columns and columns absent from the test set, removed rows with many missing values, and addressed specific anomalies (Table 2.2 / C1).
    • Missing Values: Imputed numerical features using KNN Imputer; created 'Unknown' category for Industry Code Description.
    • Outlier Handling: Analyzed using IQR, retained most outliers to preserve data variability, but created binary features (e.g., IME-4 Reported, Average Weekly Wage Reported) to mitigate the impact of extreme values/missingness patterns.
    • Feature Engineering: Extracted temporal components (Year, Month, Day, Weekday) from dates; cleaned Age at Injury/Birth Year and created Age at Injury Group; bucketed high-cardinality categoricals (WCIO codes, Carrier Type); created binary flags for missing dates/specific reports.
    • Encoding: Applied OrdinalEncoder (Age at Injury Group) and OneHotEncoder (other nominal/binary categoricals).
    • Data Splitting: Used a 75% Training / 25% Validation Hold-Out split.
    • Feature Selection: Implemented a multi-faceted strategy (visualized below) combining Filter (Spearman, CramΓ©r's V, Chi-Squared, VIF), Wrapper (RFE), and Embedded (Lasso, Ridge) methods. A 2/3 majority vote selected the final 27 features (Appendix D).
    • Scaling: Tested MinMaxScaler, StandardScaler, RobustScaler to prepare data for scale-sensitive algorithms and feature selection steps (Annex D).
    • Link to Notebook: 2_FeatureEngineering_MLProject_Group33.ipynb
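The imputation, encoding, and hold-out steps above can be sketched with scikit-learn; the column names and toy values below are hypothetical stand-ins for the claims data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for the claims data (names/values illustrative).
df = pd.DataFrame({
    "Age at Injury": [25, 40, np.nan, 55, 33, 47],
    "Average Weekly Wage": [800.0, np.nan, 1200.0, 950.0, 700.0, 1100.0],
    "Carrier Type": ["PRIVATE", "SIF", "PRIVATE", "SELF", "PRIVATE", "SIF"],
})
y = pd.Series([0, 1, 0, 1, 0, 1])

# 75% train / 25% validation hold-out, as in the project.
X_tr, X_val, y_tr, y_val = train_test_split(df, y, test_size=0.25,
                                            random_state=33)

pre = ColumnTransformer([
    # KNN imputation for numerical gaps...
    ("num", KNNImputer(n_neighbors=2),
     ["Age at Injury", "Average Weekly Wage"]),
    # ...one-hot encoding for nominal categoricals.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Carrier Type"]),
])
X_tr_t = pre.fit_transform(X_tr)      # fit on training data only
X_val_t = pre.transform(X_val)        # apply the same transforms to validation
print(X_tr_t.shape, X_val_t.shape)
```

Fitting the preprocessor on the training split only (and merely transforming validation) avoids leaking validation statistics into imputation and encoding.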

    Feature Selection Flowchart

    Figure 2: Feature Selection Process Flowchart

    Scikit-Learn Imbalanced-Learn Statsmodels SciPy Stats
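The 2/3 majority vote across method families can be sketched as a simple tally; the votes and feature names below are hypothetical, not the project's actual selection results:

```python
import pandas as pd

# Hypothetical binary votes from the three method families; in the
# project each family itself aggregates several tests (e.g. Spearman,
# Cramér's V, VIF for the filter family).
votes = pd.DataFrame(
    {
        "filter":   [1, 1, 0, 0],
        "wrapper":  [1, 0, 1, 0],
        "embedded": [1, 1, 0, 0],
    },
    index=["Age at Injury", "IME-4 Count", "Birth Year", "Zip Code"],
)

# Keep features selected by at least 2 of the 3 method families.
selected = votes.index[votes.sum(axis=1) >= 2].tolist()
print(selected)  # → ['Age at Injury', 'IME-4 Count']
```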

  4. Modeling: πŸ€–

    • Algorithms: Trained and evaluated Logistic Regression, Naive Bayes (Gaussian/Categorical), KNN, MLP Neural Network, Decision Tree, Random Forest, CatBoost, ExtraTrees, Bagging (LR base), and Stacking (RF + LR).
    • Strategy: Models were tested on original and scaled (MinMax, Standard, Robust) data. K-Means SMOTE resampling was tested (Annex F) but ultimately discarded as it increased overfitting without improving validation performance compared to models trained on original imbalanced data (Table G1). The Hold-Out strategy was maintained (Figure 3).
    • Hyperparameter Tuning: Optimized Random Forest using GridSearchCV (Table H1), improving validation F1-Macro from 0.40 to 0.42. Base parameters were used for CatBoost due to computational cost.
    • Link to Notebook: 3_Modeling&Evaluation_MLProject_Group33.ipynb

    Modeling Flowchart

    Figure 3: Model Training and Evaluation Strategy Flowchart

    Scikit-Learn CatBoost Imbalanced-Learn
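The GridSearchCV tuning of Random Forest with F1-Macro scoring can be sketched as follows; the parameter grid and data are illustrative, not the actual settings from Table H1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic multiclass problem standing in for the claims data.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=4, random_state=33)

# Grid values are illustrative only.
grid = GridSearchCV(
    RandomForestClassifier(random_state=33),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    scoring="f1_macro",  # primary metric under class imbalance
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```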

  5. Evaluation: βœ…

    • Metrics: Performance was assessed using Accuracy, Precision, Recall, F1-Score (Macro), the primary metric given the class imbalance, and AUROC.
    • Selection Criteria: Focused on high validation F1-Macro (>0.4), low overfitting (Train-Validation F1 difference < 0.1), and strong secondary metrics (Accuracy, AUROC), detailed in Table F1.
    • Results: CatBoost (on original data) and Random Forest were the top models. CatBoost achieved the best performance on the final Kaggle test set evaluation (Appendix G).
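The selection rule above (validation F1-Macro above 0.4 with a train-validation gap under 0.1) can be sketched on synthetic data; the thresholds come from the project, everything else is a stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic multiclass data standing in for the claims dataset.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=33)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=33)

model = RandomForestClassifier(max_depth=4, random_state=33).fit(X_tr, y_tr)
f1_tr = f1_score(y_tr, model.predict(X_tr), average="macro")
f1_val = f1_score(y_val, model.predict(X_val), average="macro")

# Selection rule: strong validation F1-Macro and a small
# train-validation gap to flag overfitting.
keeps = (f1_val > 0.4) and (f1_tr - f1_val < 0.1)
print(round(f1_tr, 3), round(f1_val, 3), keeps)
```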
  6. Deployment: πŸš€

    • Web Application: Developed an interactive Streamlit dashboard (Live App) featuring:
      • A prediction interface using the final CatBoost model.
      • A data exploration section for interactive analysis (Figure I1).
    • Interpretability: Integrated LIME to explain individual predictions, showing feature contributions for transparency and actionable insights (Figure J1).

    Streamlit App LIME Plotly

πŸ“ˆ Deliverables & Outputs

πŸ”‘ Keywords

Workers' Compensation Claims; Machine Learning; Multiclass Classification; Classification Models; Ensemble Learning; Random Forest; CatBoost; Feature Engineering; Feature Selection; Data Exploration; Imbalanced Data; KMeansSMOTE; CRISP-DM; Streamlit; LIME; Predictive Modeling; Data Science.

πŸ“” Conclusion

This project successfully navigated the CRISP-DM process to develop and evaluate machine learning models for predicting workers' compensation claim severity. Ensemble methods, particularly CatBoost and Random Forest, demonstrated the strongest performance despite the challenge of significant class imbalance. The implemented Streamlit application provides a practical interface for prediction, while LIME enhances model transparency. While resampling techniques like K-Means SMOTE did not improve final results in this case, the thorough feature engineering and selection process proved crucial. This work establishes a strong foundation for data-driven decision support within the NWCB, with potential for further refinement through techniques like textual analysis or more advanced resampling/validation strategies.


Explore the notebooks and the live application for a deeper dive into the methodology and results!
