Analysis and Modeling of Adult Income using the US Census Dataset

This notebook contains my approach to building an estimator that predicts whether an individual in the US Census dataset earns more than $50,000 per year. After an exploratory data analysis to understand the variables in the dataset, I fine-tuned and compared a penalized Logistic Regression Classifier and a Random Forest Classifier, and also compared different preprocessing pipelines for the data.

The model I favor is a Logistic Regression Classifier that uses only 8 input variables and achieves an accuracy of 94.24% on the test data set.
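
As a rough illustration of the kind of comparison described above, here is a minimal scikit-learn sketch. It is not the notebook's actual pipeline: the data is a synthetic stand-in with 8 numeric features, the hyperparameters are scikit-learn defaults rather than the tuned values, and the names `candidates` and `compare_models` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data (assumption: 8 numeric features,
# binary target). The real notebook works on the census CSV files instead.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, random_state=0
)

# Two candidate models. LogisticRegression applies an L2 penalty by default;
# C controls its strength (smaller C means stronger regularization).
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}


def compare_models(models, X, y, cv=5):
    """Return the mean cross-validated accuracy for each named model."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


scores = compare_models(candidates, X, y)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Scaling is placed inside the pipeline so that it is refit on each training fold, which keeps the cross-validated scores free of information from the held-out fold.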

I ran this notebook in Google Colab, where all plots are available interactively. I included screenshots for those who open the file with Jupyter and don't want to rerun the calculations.

Information about the data

This US Census dataset contains detailed but anonymized information for approximately 300,000 people. (download: http://thomasdata.s3.amazonaws.com/ds/us_census_full.zip)

The archive contains 3 files:

A large training file (csv)
A test file (csv)
A metadata file (txt) describing the columns of the two csv files (identical for both)

To execute the notebook, the files should be arranged as follows:

.
├── Census_Analysis.ipynb
├── eda_report.html
└── us-census-full
    ├── census_income_learn.csv
    └── census_income_test.csv
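
Before running the notebook, the layout above can be checked with a few lines of Python. This is only a convenience sketch: the helper `missing_files` and the constant `EXPECTED_FILES` are hypothetical names and not part of the notebook, and the check assumes it is run from the directory that contains `Census_Analysis.ipynb`.

```python
from pathlib import Path

# Data files the notebook expects, relative to the notebook's directory.
EXPECTED_FILES = [
    "us-census-full/census_income_learn.csv",
    "us-census-full/census_income_test.csv",
]


def missing_files(root="."):
    """Return the expected data files that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


missing = missing_files()
if missing:
    print("Missing data files:", ", ".join(missing))
else:
    print("All expected data files are in place.")
```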
