A comprehensive, modular Python package for teaching ecological analyses using World Bank data.
This workshop teaches ecological analysis methods for global health research. Students learn to:
- Understand limitations: The ecological fallacy and its implications
- Acquire data: Programmatic access to World Bank indicators
- Explore patterns: Correlation analysis, clustering, and visualisation
- Detect spatial patterns: Moran's I, LISA, hot/cold spot analysis
- Build models: Progressive regression with full diagnostics
- Apply ML: AutoML comparison with SHAP interpretation
- Simulate policies: In-silico experiments with uncertainty quantification
```
ecological_workshop/
│
├── src/ # Source modules
│ ├── __init__.py # Package initialisation
│ ├── data_acquisition.py # World Bank data fetching
│ ├── eda.py # Exploratory data analysis
│ ├── clustering.py # K-means and hierarchical clustering
│ ├── spatial.py # Moran's I, LISA, hot spots
│ ├── regression.py # OLS with diagnostics
│ ├── ml_models.py # PyCaret and SHAP
│ ├── policy_simulation.py # In-silico experiments
│ ├── visualisation.py # Publication-ready plots
│ ├── dashboard.py # Gradio interactive interface
│ └── utils.py # Helper functions
│
├── notebooks/ # Jupyter notebooks
│ └── 01_main_workshop.ipynb # Main workshop notebook
│
├── data/ # Data directory (auto-populated)
├── outputs/ # Generated outputs
├── docs/ # Additional documentation
│
├── requirements.txt # Python dependencies
└── README.md # This file
```
```bash
# Clone or download the workshop
cd ecological_workshop
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux/Mac
# or: venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# For spatial analysis (optional)
# apt-get install graphviz   # System package for DAG visualisation
```

```bash
# Start Jupyter
jupyter notebook
# Open: notebooks/01_main_workshop.ipynb
```

```python
# Import modules
from src.data_acquisition import fetch_world_bank_data
from src.spatial import create_spatial_weights, calculate_moran_i
from src.visualisation import create_funnel_plot
# Fetch data
df, distal, inter, prox = fetch_world_bank_data()
# Create spatial weights and test for autocorrelation
W, ids, df_spatial = create_spatial_weights(df, k=5)
moran = calculate_moran_i(df_spatial, 'crude_death_rate', W, ids)
# Create visualisations
funnel_fig = create_funnel_plot(df)
funnel_fig.show()
```

`src/data_acquisition.py`: fetches and cleans World Bank data using the wbgapi library.

```python
from src.data_acquisition import fetch_world_bank_data, create_data_dictionary
# Fetch most recent data
df, distal_vars, inter_vars, prox_vars = fetch_world_bank_data(verbose=True)
# Create documentation
data_dict = create_data_dictionary(df)
```

Key Features:
- Hierarchical variable organisation (Distal → Intermediate → Proximate)
- Automatic aggregate removal (World, Regions)
- Geographic coordinates for spatial analysis
- Missing data reporting
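The hierarchy and the missing-data reporting can be inspected directly on the objects returned above. A minimal sketch, assuming the `*_vars` lists contain column names of `df`:

```python
# Inspect the variable hierarchy and audit completeness of one tier
# (assumes the *_vars lists returned above are column names of df)
print("Distal:      ", distal_vars)
print("Intermediate:", inter_vars)
print("Proximate:   ", prox_vars)

# Share of countries missing each distal indicator
missing_share = df[distal_vars].isna().mean().sort_values(ascending=False)
print(missing_share.round(2))
```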
`src/eda.py`: demonstrates the ecological fallacy and correlation analysis.

```python
from src.eda import demonstrate_ecological_fallacy, analyze_correlations_hierarchical
# CRITICAL: Run this first to understand limitations!
fig = demonstrate_ecological_fallacy()
# Hierarchical correlation analysis
corr_fig, correlations = analyze_correlations_hierarchical(df)
```
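To see the fallacy in numbers rather than in the packaged figure, the classic pattern can be simulated in a few lines: the group-level (ecological) correlation is strongly positive even though the correlation within every group is negative. This toy example uses synthetic data, not the World Bank indicators:

```python
# Toy demonstration of the ecological fallacy (synthetic data):
# within each group the exposure-outcome correlation is negative,
# yet the correlation of the group means is strongly positive.
import numpy as np

rng = np.random.default_rng(42)
group_means = np.arange(5) * 10.0          # five "countries" with rising exposure
x_groups, y_groups = [], []
for m in group_means:
    x = m + rng.normal(0, 2, size=200)                        # individual exposure
    y = 2 * m - 1.5 * (x - m) + rng.normal(0, 1, size=200)    # negative slope within group
    x_groups.append(x)
    y_groups.append(y)

within = np.corrcoef(x_groups[0], y_groups[0])[0, 1]            # individual level
ecological = np.corrcoef([x.mean() for x in x_groups],
                         [y.mean() for y in y_groups])[0, 1]    # group level
print(f"Within-group correlation: {within:.2f}")     # strongly negative
print(f"Ecological correlation:   {ecological:.2f}") # close to +1
```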
`src/clustering.py`: groups countries by development and health profiles.

```python
from src.clustering import find_optimal_clusters, cluster_countries
# Find optimal number of clusters
optimal_k, diag_fig = find_optimal_clusters(df)
# Cluster countries
df_clustered = cluster_countries(df, n_clusters=optimal_k)
```
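Once countries are grouped, it is worth profiling what distinguishes the clusters. A minimal sketch, assuming `cluster_countries` adds a `cluster` column to the returned DataFrame (check the actual column name in your output) and using indicator names that appear elsewhere in this README:

```python
# Profile clusters by mean indicator values and size
# (assumes a 'cluster' column; adjust names to match your DataFrame)
indicators = ['gni_per_capita', 'adult_literacy', 'crude_death_rate']
profile = df_clustered.groupby('cluster')[indicators].mean().round(1)
print(profile)
print(df_clustered['cluster'].value_counts().sort_index())
```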
`src/spatial.py`: tests for and visualises spatial autocorrelation.

```python
from src.spatial import (create_spatial_weights, calculate_moran_i,
                         calculate_lisa, plot_hotspot_map)
# Create weights matrix
W, ids, df_spatial = create_spatial_weights(df, method='knn', k=5)
# Global test
moran = calculate_moran_i(df_spatial, 'crude_death_rate', W, ids)
# Local clusters (LISA)
lisa = calculate_lisa(df_spatial, 'crude_death_rate', W, ids)
# Hot spot map
hotspot_fig = plot_hotspot_map(df_spatial, 'crude_death_rate', W, ids, lisa)
```

Interpretation Guide:
- Moran's I > 0: Clustering (similar values near each other)
- Moran's I < 0: Dispersion (dissimilar values near each other)
- HH (Hot Spot): High value surrounded by high values
- LL (Cold Spot): Low value surrounded by low values
- HL/LH: Spatial outliers
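To demystify the global statistic, Moran's I can be computed by hand from a spatial weights matrix. The sketch below uses plain NumPy and is independent of the module's implementation:

```python
# Moran's I from first principles:
# I = (n / S0) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2
import numpy as np

def morans_i(x, W):
    """x: 1-D array of values; W: (n, n) spatial weights matrix."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    s0 = W.sum()                 # sum of all weights
    numerator = z @ W @ z        # sum_ij w_ij * z_i * z_j
    denominator = (z ** 2).sum()
    return (len(x) / s0) * numerator / denominator

# Tiny worked example: four locations on a line, adjacent pairs as neighbours
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(morans_i([1, 2, 3, 4], W))   # ≈ 0.33: positive, neighbouring values are similar
```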
`src/regression.py`: progressive modelling with comprehensive diagnostics.

```python
from src.regression import (build_progressive_models, regression_diagnostics,
                            sensitivity_analysis, robust_standard_errors)
# Build hierarchical models
results, models = build_progressive_models(df)
# Full diagnostics
diagnostics = regression_diagnostics(models['Model 3: Full Model'], X)
# Identify influential observations
influential, fig = sensitivity_analysis(model, X, y, df)
# Robust standard errors if needed
robust_comparison = robust_standard_errors(model, cov_type='HC3')
```

Diagnostic Tests Included:
- VIF (multicollinearity)
- Breusch-Pagan (heteroscedasticity)
- Shapiro-Wilk (normality)
- Durbin-Watson (autocorrelation)
- Cook's Distance (influential observations)
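All of these tests are available directly in `statsmodels` and `scipy`. The sketch below shows how they might be run on a fitted OLS result; it illustrates what `regression_diagnostics` presumably wraps rather than copying it, and it assumes `model` is a statsmodels OLS results object and `X` its predictor DataFrame:

```python
# Running the listed diagnostics on a fitted statsmodels OLS result
# (illustrative; `model` is an OLS results object, `X` its predictor DataFrame)
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

vif = [variance_inflation_factor(np.asarray(X), i) for i in range(X.shape[1])]
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, model.model.exog)
sw_stat, sw_pvalue = shapiro(model.resid)
dw = durbin_watson(model.resid)
cooks_d, _ = model.get_influence().cooks_distance   # influential observations

print(f"Max VIF: {max(vif):.1f} | Breusch-Pagan p: {bp_pvalue:.3f} | "
      f"Shapiro-Wilk p: {sw_pvalue:.3f} | Durbin-Watson: {dw:.2f} | "
      f"Points with Cook's D > 4/n: {(cooks_d > 4 / len(cooks_d)).sum()}")
```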
`src/ml_models.py`: AutoML with PyCaret and SHAP interpretation.

```python
from src.ml_models import setup_pycaret, compare_ml_models, interpret_model
# Setup environment
exp = setup_pycaret(df, target='crude_death_rate')
# Compare models
best_model, _ = compare_ml_models(n_select=1)
# Interpret with SHAP
interpretation = interpret_model(best_model, df)
```
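If you want to see what the SHAP step involves outside the module wrapper, a bare-bones version looks roughly like the following; treating `best_model` as a fitted regressor with a scikit-learn-style `predict()` method and `X` as the predictor DataFrame is an assumption here:

```python
# Bare-bones SHAP interpretation (illustrative; assumes best_model exposes
# a scikit-learn-style predict() and X is the predictor DataFrame)
import shap

explainer = shap.Explainer(best_model.predict, X)   # model-agnostic explainer
shap_values = explainer(X)

shap.plots.beeswarm(shap_values)       # global importance and direction of effects
shap.plots.waterfall(shap_values[0])   # contribution breakdown for one country
```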
`src/policy_simulation.py`: in-silico experiments with uncertainty quantification.

```python
from src.policy_simulation import (run_policy_simulation_with_uncertainty,
                                   compare_policy_scenarios,
                                   create_country_report_card)
# Compare policies with bootstrap CIs
scenarios = {
'Education (+20%)': ('adult_literacy', 0.20),
'Healthcare (+20%)': ('physicians_density', 0.20),
}
results = compare_policy_scenarios(model, df, scenarios, with_uncertainty=True)
# Country-specific report
report = create_country_report_card(model, df, country='Nigeria', scenarios=scenarios)
```
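Conceptually, the uncertainty quantification is a bootstrap over countries: resample, apply the scenario, and record the mean predicted change. The stripped-down sketch below illustrates the idea only (the module's own algorithm may differ) and assumes `model` exposes a scikit-learn-style `predict()` and `features` lists the predictor columns:

```python
# Conceptual bootstrap CI for a simulated policy effect (illustrative only;
# assumes model.predict() and a `features` list of predictor columns)
import numpy as np

def bootstrap_policy_effect(model, df, features, var, pct_change, n_boot=1000):
    effects = []
    for i in range(n_boot):
        sample = df.sample(frac=1.0, replace=True, random_state=i)   # resample countries
        scenario = sample.copy()
        scenario[var] = scenario[var] * (1 + pct_change)             # e.g. +20% literacy
        delta = model.predict(scenario[features]) - model.predict(sample[features])
        effects.append(delta.mean())                                 # mean predicted change
    low, high = np.percentile(effects, [2.5, 97.5])
    return np.mean(effects), (low, high)

# Hypothetical example: effect of a 20% increase in adult literacy
# mean_effect, ci = bootstrap_policy_effect(model, df, features, 'adult_literacy', 0.20)
```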
`src/visualisation.py`: publication-ready plots.

```python
from src.visualisation import (create_funnel_plot, create_bubble_plot,
                               create_world_map, create_dag_visualisation)
# Funnel plot (outlier detection)
funnel = create_funnel_plot(df)
# Bubble chart (Hans Rosling style)
bubble = create_bubble_plot(df, x_var='gni_per_capita', log_x=True)
# Choropleth map
world_map = create_world_map(df, variable='crude_death_rate')
# Causal DAG
dag = create_dag_visualisation()
```
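For a quick custom map outside the module, a Plotly Express choropleth is a one-liner; the `iso3` and `country` column names below are assumptions, so substitute whatever identifiers your DataFrame carries:

```python
# Quick custom choropleth with Plotly Express (illustrative; the 'iso3' and
# 'country' column names are assumptions; adjust to your DataFrame)
import plotly.express as px

fig = px.choropleth(
    df,
    locations='iso3',                  # ISO-3166 alpha-3 country codes
    color='crude_death_rate',          # variable to map
    hover_name='country',
    color_continuous_scale='Viridis',
)
fig.show()
```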
`src/dashboard.py`: Gradio-based teaching interface.

```python
from src.dashboard import launch_dashboard
# Launch interactive simulator
launch_dashboard(model, df, share=True)
```
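Under the hood, a Gradio interface of this kind takes only a few lines. Below is a minimal stand-alone sketch (not the module's actual layout) that wires a slider to a prediction function; the `model.predict()` interface and the `features` list of predictor columns are assumptions:

```python
# Minimal stand-alone Gradio sketch (illustrative; not the module's actual UI).
# Assumes `model` has a scikit-learn-style predict() and `features` lists
# the predictor columns used to fit it.
import gradio as gr

def simulate(literacy_change_pct):
    scenario = df.copy()
    scenario['adult_literacy'] *= (1 + literacy_change_pct / 100)
    delta = model.predict(scenario[features]) - model.predict(df[features])
    return f"Mean predicted change in crude death rate: {delta.mean():.2f}"

demo = gr.Interface(
    fn=simulate,
    inputs=gr.Slider(-50, 50, value=20, label="Change in adult literacy (%)"),
    outputs="text",
    title="Policy simulator (toy version)",
)
demo.launch()
```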
Suggested schedule:

| Module | Duration | Key Learning |
|---|---|---|
| 1. Ecological Fallacy | 15 min | Limitations of ecological analysis |
| 2. Data Acquisition | 10 min | Reproducible data access |
| 3. EDA | 20 min | Correlation and visualisation |
| 4. Clustering | 15 min | Country groupings |
| 5. Spatial Analysis | 20 min | Moran's I and hot spots |
| 6. Regression | 20 min | Progressive modelling |
| 7-9. ML/Policy/Dashboard | 20 min | Advanced topics |
Discussion questions for students:

- Why can't we infer individual effects from ecological data?
  - Confounding at the aggregate level
  - Aggregation bias
  - The cross-level inference problem
- Why does spatial autocorrelation matter?
  - It violates the independence assumption
  - It underestimates standard errors
  - It inflates Type I error rates
- How should we interpret policy simulations?
  - As model predictions, not causal effects
  - Uncertainty is crucial
  - Real-world implementation differs
Key limitations to keep in mind:

- Ecological Fallacy: results describe country-level associations, NOT individual-level effects
- Cross-sectional Design: cannot establish causation
- Data Quality: World Bank data has varying completeness across countries
- Spatial Approximation: KNN on country centroids is approximate; a full analysis needs boundary shapefiles
- Small Sample: ~200 countries limits complex ML models
- World Bank Open Data: https://data.worldbank.org
- wbgapi documentation: https://wbgapi.readthedocs.io
This workshop material is provided for educational purposes.
Contributions welcome! Please submit issues or pull requests.
Olalekan Uthman - olalekan.uthman@warwick.ac.uk