This project builds a predictive model to estimate medical insurance charges based on an individual’s characteristics such as age, sex, BMI, number of children, smoking status, and region. The model uses a Random Forest Regressor and is built using Python and common machine learning libraries.
The dataset used is `insurance.csv`, which contains:
- `age`: Age of the primary beneficiary
- `sex`: Gender (male/female)
- `bmi`: Body Mass Index
- `children`: Number of dependents
- `smoker`: Smoking status (yes/no)
- `region`: Residential area (southeast, southwest, northeast, northwest)
- `charges`: Medical charges billed by health insurance

- Loaded the dataset using `pandas`.
- Checked for null values and data types.
- Performed exploratory data analysis (EDA) to understand distributions.
- Used `matplotlib` and `seaborn` to visualize:
  - Age distribution by gender
  - Regional breakdown of gender counts
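
The loading and exploration steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the rows in `SAMPLE_CSV` are made-up values in the same column layout as `insurance.csv`, and the plots use `matplotlib` through pandas rather than `seaborn` to keep the example small.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows in the insurance.csv column layout (assumption for illustration);
# in the project this would be pd.read_csv("insurance.csv").
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
23,female,24.5,0,no,northeast,3200.00
45,male,31.2,2,yes,southeast,38000.00
36,female,28.0,1,no,southwest,6100.00
52,male,26.3,3,no,southeast,11000.00
29,male,35.1,0,yes,northwest,33500.00
61,female,29.8,0,no,northeast,13800.00
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Null values and data types
null_counts = df.isnull().sum()
print(null_counts)
print(df.dtypes)

fig, (ax_age, ax_region) = plt.subplots(1, 2, figsize=(10, 4))

# Age distribution by gender: one overlaid histogram per group
for sex, group in df.groupby("sex"):
    ax_age.hist(group["age"], bins=5, alpha=0.5, label=sex)
ax_age.set_xlabel("age")
ax_age.legend()

# Regional breakdown of gender counts: cross-tabulate, then draw grouped bars
region_by_sex = pd.crosstab(df["region"], df["sex"])
region_by_sex.plot(kind="bar", ax=ax_region)
ax_region.set_ylabel("count")

fig.tight_layout()
# Call fig.savefig(...) or plt.show() to view the figure.
```
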
- Label encoded categorical variables (`sex`, `smoker`, `region`).
- Used Recursive Feature Elimination (`RFE`) with `LinearRegression` to select the top 5 features.
- Applied a log transformation to the `charges` column to reduce skewness and improve regression accuracy.
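
A rough sketch of these preprocessing steps is shown below. Several details are assumptions, since the write-up does not state them: the data frame is a randomly generated stand-in for `insurance.csv`, RFE is fit against the raw `charges` (following the order of the list), and the log transformation is taken to be `np.log`, stored in a hypothetical `log_charges` column.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

# Random stand-in for insurance.csv (values are synthetic, not real data)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["male", "female"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(["southeast", "southwest", "northeast", "northwest"], n),
})
df["charges"] = np.exp(rng.normal(9, 0.9, n))  # positive, right-skewed target

# Label encode each categorical column; keep the fitted encoders so the
# same mapping can be applied to new input later
encoders = {}
for col in ["sex", "smoker", "region"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Recursive Feature Elimination with a linear model, keeping 5 of the 6 features
X = df.drop(columns=["charges"])
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X, df["charges"])
selected = X.columns[rfe.support_].tolist()
print(selected)

# Log-transform the target to reduce skewness (np.log is an assumption)
df["log_charges"] = np.log(df["charges"])
```

On synthetic data the features RFE keeps are arbitrary; the project reports its own selection on the real dataset.
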
- Selected features: `age`, `bmi`, `children`, `smoker`, `region`.
- Split the data into training and testing sets using `train_test_split`.
- Standardized the features using `StandardScaler`.
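
The split and scaling steps might look like the sketch below. The arrays are random placeholders for the five selected features and the log-transformed target, and the `test_size` and `random_state` values are assumptions, as the write-up does not give them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 5))  # placeholder for the 5 encoded features
y = rng.normal(size=100)                           # placeholder for log charges

# test_size and random_state are assumed values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test rows
```

Fitting the scaler on the training split alone keeps test-set statistics from leaking into preprocessing.
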
- Trained a `RandomForestRegressor` on the scaled training data.
Evaluated the model using:
- Mean Squared Error (MSE): ~0.128
- R² Score: ~0.857
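
Training and evaluation can be sketched as follows. The data here is synthetic, so the printed scores will not match the figures reported above, and the hyperparameters (`n_estimators`, `random_state`) are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 5 scaled features and a log-scale target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 9 + 1.5 * X[:, 3] + 0.3 * X[:, 0] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters are assumed values
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Both metrics are computed on the held-out rows, in the target's (log) units
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.3f}, R2: {r2:.3f}")
```
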
- Accepted new user input in DataFrame format.
- Encoded and scaled the input before prediction.
- Output the predicted insurance charges on the log scale; applying the inverse of the log transformation recovers charges in the original units.
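
The prediction flow for a new user could be sketched like this. Everything below is illustrative: the training frame is random, `FEATURES`, `prepare`, and `new_user` are hypothetical names, and converting back with `np.exp` assumes the target was transformed with `np.log`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler

FEATURES = ["age", "bmi", "children", "smoker", "region"]

# Random stand-in for the training data (not real records)
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["southeast", "southwest", "northeast", "northwest"], n),
})
log_charges = rng.normal(9, 0.9, n)

# Fit one encoder per categorical feature on the training data
encoders = {col: LabelEncoder().fit(train[col]) for col in ["smoker", "region"]}


def prepare(frame: pd.DataFrame) -> pd.DataFrame:
    """Select the model features and apply the fitted label encoders."""
    out = frame[FEATURES].copy()
    for col, enc in encoders.items():
        out[col] = enc.transform(out[col])
    return out


scaler = StandardScaler().fit(prepare(train))
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(scaler.transform(prepare(train)), log_charges)

# New user input in DataFrame format
new_user = pd.DataFrame(
    [{"age": 40, "bmi": 27.5, "children": 2, "smoker": "no", "region": "northwest"}]
)

# Encode and scale with the already-fitted objects, then predict
log_pred = model.predict(scaler.transform(prepare(new_user)))
charges_pred = np.exp(log_pred)  # back to original units, assuming np.log was used
print(log_pred, charges_pred)
```

Reusing the encoders and scaler fitted during training matters here: refitting them on a single input row would produce a different mapping from the one the model was trained on.
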
Libraries used: `pandas`, `numpy`, `matplotlib`, `seaborn`, `sklearn`.