Machine Learning on two different datasets - Predicting Adult Income and Car Insurance Claims
Author: Nian Vrey
Analyze and understand the data provided, then create a model to predict the outcome based on the data.
The goal is to analyze the data and build as accurate a model as possible.
The data provided describes some features of an adult. Below are the distributions of some of these features; more can be found in the notebook. In each pair of graphs, the one on the left shows the overall distribution of the feature, while the one on the right shows the same distribution broken down by Income.
The Distribution of Age graph above shows a good spread across working-age adults, with a few elderly datapoints as well. There are noticeable outliers, both in count and in Age.
The Distribution of Occupation graph above again shows a good spread of data. It is worth noting that a noticeable number of records are missing an Occupation value.
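The two views described above (overall distribution, and distribution with regard to Income) can be sketched as plain counts before any plotting. This is a minimal illustration using only the standard library, not the notebook's own code: the inline sample rows, the column names `occupation` and `income`, and the use of `?` as the missing-value marker are all assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for adult.csv. The column names and the
# "?" missing-value marker are assumptions, not taken from the notebook.
SAMPLE = """age,occupation,income
39,Adm-clerical,<=50K
50,Exec-managerial,>50K
38,?,<=50K
53,Handlers-cleaners,<=50K
28,Prof-specialty,>50K
37,?,<=50K
"""


def summarize(rows, feature, target="income", missing="?"):
    """Count a feature overall, split by target, and count missing entries."""
    overall = Counter()
    by_target = Counter()
    for row in rows:
        value = row[feature].strip()
        overall[value] += 1                               # left-hand graph
        by_target[(value, row[target].strip())] += 1      # right-hand graph
    return overall, by_target, overall.get(missing, 0)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
overall, by_target, n_missing = summarize(rows, "occupation")

print("overall counts:", dict(overall))
print("counts by income:", dict(by_target))
print("records with missing occupation:", n_missing)
```

To run this against the real file, the `io.StringIO(SAMPLE)` source would be replaced with an opened `adult.csv`, after checking the actual column names and how missing values are encoded there.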
The chosen model, found in the notebook, is the Base Random Forest Model. It is far from perfect, however: as the accuracy stats below show, it does not do well at predicting the >50K class.
<=50K Accuracy: 92%
>50K Accuracy: 61%
Overall Accuracy: 85%
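The gap between the per-class and overall figures is easier to read with a small worked example. The sketch below assumes that "accuracy" for a class means the fraction of records truly in that class that were predicted correctly (often called per-class recall); the notebook may define it differently. The function name `per_class_accuracy` and the label lists are made up for illustration and are not the model's real predictions.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return ({label: fraction of that label predicted correctly}, overall accuracy)."""
    correct = Counter()
    total = Counter()
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    per_class = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / len(y_true)
    return per_class, overall


# Hypothetical labels: 6 records are truly <=50K and 4 are truly >50K.
y_true = ["<=50K"] * 6 + [">50K"] * 4
# Hypothetical predictions: 5 of the 6 <=50K and 2 of the 4 >50K are correct.
y_pred = ["<=50K"] * 5 + [">50K"] + [">50K"] * 2 + ["<=50K"] * 2

per_class, overall = per_class_accuracy(y_true, y_pred)
for label, acc in per_class.items():
    print(f"{label} accuracy: {acc:.0%}")
print(f"Overall accuracy: {overall:.0%}")
```

Because the overall figure is weighted by how many records each class has, it sits closer to the majority class. This is consistent with an 85% overall accuracy alongside a much weaker 61% on >50K, if <=50K is the larger class in the data.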
Machine_Learning.ipynb - Main Google Colab Notebook for the project.
adult.csv - Adult Income Dataset
Car_Insurance_Claim.csv - Car Insurance Claim Dataset

