For my Mathematics for Computer Science course, we had to put our theoretical knowledge into practice in code (Python). For this report, we used the dataset from the paper "Estimation of Obesity Levels Based On Eating Habits and Physical Condition", which includes synthetic data on individuals from Mexico, Peru and Colombia. It contains 17 features and 2111 records, labeled with a class variable divided into 4 types:
- Insufficient Weight
- Normal Weight
- Overweight (Level I, Level II)
- Obesity (Type I, Type II and Type III)
This assignment was thorough in the methods used, the explanation of the code cells and the comparison of results using different functions where needed. It was developed in Jupyter Notebooks on Google Colab with libraries such as NumPy, SciPy, Pandas and StatsModels. The features were divided into 4 types:
- Continuous
- Integer (Discrete)
- Binary
- Categorical
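
As a rough illustration of this split, here is a minimal sketch of a heuristic that guesses the type of a single column. The helper name `infer_feature_type` and its rules are assumptions made for this example, not the logic used in the notebook.

```python
import numpy as np

def infer_feature_type(values):
    """Heuristic guess: 'binary', 'categorical', 'integer' or 'continuous'."""
    arr = np.asarray(values)
    # Exactly two distinct values (e.g. yes/no, 0/1) is treated as binary.
    if len(np.unique(arr)) == 2:
        return "binary"
    # Non-numeric columns with more than two levels are categorical.
    if not np.issubdtype(arr.dtype, np.number):
        return "categorical"
    # Numeric columns holding only whole numbers are integer (discrete).
    if np.all(arr == np.round(arr)):
        return "integer"
    return "continuous"

# Toy columns (made up for illustration, not taken from the dataset).
print(infer_feature_type(["yes", "no", "yes"]))
print(infer_feature_type([1.5, 2.25, 3.0]))
```

A heuristic like this can mislabel columns (for example, a continuous feature that happens to contain only whole numbers), so the result is worth checking by hand.
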
Grouping the features by data type made it possible to apply the statistical methods best suited to each type. Some of the methods used:
- Probability Mass Function (PMF)
- Probability Density Function (PDF)
- Cumulative Distribution Function (CDF)
- Analysis of Variance (ANOVA)
- False Discovery Rate (FDR)
- Bootstrap
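
To make three of these methods concrete, here is a small sketch on made-up data: a one-way ANOVA with `scipy.stats.f_oneway`, a hand-written Benjamini-Hochberg adjustment as one common way to control the FDR, and a percentile bootstrap interval for a mean. The toy groups and the helper names `benjamini_hochberg` and `bootstrap_mean_ci` are assumptions for illustration. The notebook may do this differently, for instance with `multipletests(..., method="fdr_bh")` from StatsModels.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data (made up): one continuous feature measured in three groups.
groups = [
    rng.normal(loc=20.0, scale=2.0, size=50),
    rng.normal(loc=21.0, scale=2.0, size=50),
    rng.normal(loc=25.0, scale=2.0, size=50),
]

# One-way ANOVA: tests the null hypothesis that all group means are equal.
f_stat, p_value = stats.f_oneway(*groups)

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, rng=rng):
    """Percentile bootstrap confidence interval for the mean."""
    sample = np.asarray(sample, dtype=float)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(sample), size=(n_resamples, len(sample)))
    means = sample[idx].mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lower, upper

print("ANOVA p-value:", p_value)
print("BH-adjusted:", benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
print("95% CI for the mean of group 0:", bootstrap_mean_ci(groups[0]))
```

When one test is run per feature, the adjusted p-values are the ones to compare against the chosen significance level.
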
Based on the information in the dataset, the research question can be stated as: "How do eating habits and physical activity patterns influence obesity levels among individuals?"
From this general research question, and based on the 6 main tasks related to statistical analysis, there are some general objectives to complete:
- Describe the distributional behavior of all features, conditioned on obesity levels (target).
- Statistically assess how eating habits and physical activity differ across obesity levels using hypothesis testing for continuous, binary and categorical variables.
- Evaluate the discriminative ability of each feature by identifying which behaviors and physical activity patterns best predict obesity levels.
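
The first objective, describing a feature's distribution conditioned on the target, can be sketched as follows for a discrete feature. The arrays and the helpers `empirical_pmf` and `empirical_cdf` are invented for this example and only illustrate the idea.

```python
import numpy as np

def empirical_pmf(values):
    """Map each distinct value to its relative frequency."""
    uniques, counts = np.unique(np.asarray(values), return_counts=True)
    return dict(zip(uniques.tolist(), (counts / counts.sum()).tolist()))

def empirical_cdf(values, x):
    """Fraction of observations less than or equal to x."""
    return float(np.mean(np.asarray(values) <= x))

# Toy data (made up): a discrete feature and a class label per record.
feature = np.array([1, 3, 3, 2, 3, 1, 3, 3])
target = np.array(["Normal", "Normal", "Obesity", "Normal",
                   "Obesity", "Normal", "Obesity", "Obesity"])

# Condition on the target: compute the PMF separately within each class.
pmf_by_class = {
    label: empirical_pmf(feature[target == label])
    for label in np.unique(target).tolist()
}

print(pmf_by_class)
print(empirical_cdf(feature[target == "Normal"], 2))
```

Comparing the per-class PMFs or CDFs side by side gives a first visual sense of which features separate the classes before any formal test is run.
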
All the solutions, graphs and comparisons can be found in the file annotated-Statistics.pdf. For the coding part, there's a .ipynb file which contains all the code cells 😀.