olaTechie/ecological_workshop

🌍 Ecological Analysis of Global Mortality

PhD Workshop in Global Health Informatics

A comprehensive, modular Python package for teaching ecological analyses using World Bank data.


📚 Overview

This workshop teaches ecological analysis methods for global health research. Students learn to:

  1. Understand limitations: The ecological fallacy and its implications
  2. Acquire data: Programmatic access to World Bank indicators
  3. Explore patterns: Correlation analysis, clustering, and visualisation
  4. Detect spatial patterns: Moran's I, LISA, hot/cold spot analysis
  5. Build models: Progressive regression with full diagnostics
  6. Apply ML: AutoML comparison with SHAP interpretation
  7. Simulate policies: In-silico experiments with uncertainty quantification

🗂️ Project Structure

ecological_workshop/
│
├── src/                          # Source modules
│   ├── __init__.py              # Package initialisation
│   ├── data_acquisition.py      # World Bank data fetching
│   ├── eda.py                   # Exploratory data analysis
│   ├── clustering.py            # K-means and hierarchical clustering
│   ├── spatial.py               # Moran's I, LISA, hot spots
│   ├── regression.py            # OLS with diagnostics
│   ├── ml_models.py             # PyCaret and SHAP
│   ├── policy_simulation.py     # In-silico experiments
│   ├── visualisation.py         # Publication-ready plots
│   ├── dashboard.py             # Gradio interactive interface
│   └── utils.py                 # Helper functions
│
├── notebooks/                    # Jupyter notebooks
│   └── 01_main_workshop.ipynb   # Main workshop notebook
│
├── data/                        # Data directory (auto-populated)
├── outputs/                     # Generated outputs
├── docs/                        # Additional documentation
│
├── requirements.txt             # Python dependencies
└── README.md                    # This file

🚀 Quick Start

Installation

# Clone or download the workshop
cd ecological_workshop

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# For DAG visualisation (optional)
# apt-get install graphviz  # System package (Debian/Ubuntu)

Running the Workshop

# Start Jupyter
jupyter notebook

# Open: notebooks/01_main_workshop.ipynb

Quick Usage

# Import modules
from src.data_acquisition import fetch_world_bank_data
from src.spatial import create_spatial_weights, calculate_moran_i
from src.visualisation import create_funnel_plot

# Fetch data
df, distal, inter, prox = fetch_world_bank_data()

# Create spatial weights and test for autocorrelation
W, ids, df_spatial = create_spatial_weights(df, k=5)
moran = calculate_moran_i(df_spatial, 'crude_death_rate', W, ids)

# Create visualisations
funnel_fig = create_funnel_plot(df)
funnel_fig.show()

📖 Module Documentation

1. Data Acquisition (data_acquisition.py)

Fetches and cleans World Bank data using the wbgapi library.

from src.data_acquisition import fetch_world_bank_data, create_data_dictionary

# Fetch most recent data
df, distal_vars, inter_vars, prox_vars = fetch_world_bank_data(verbose=True)

# Create documentation
data_dict = create_data_dictionary(df)

Key Features:

  • Hierarchical variable organisation (Distal → Intermediate → Proximate)
  • Automatic aggregate removal (World, Regions)
  • Geographic coordinates for spatial analysis
  • Missing data reporting
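
As a rough illustration of the aggregate-removal and missing-data steps listed above, the sketch below filters a toy table by World Bank aggregate codes and reports the share of missing values per variable. It does not reproduce `fetch_world_bank_data`; the `AGGREGATE_CODES` set, the `missing_report` helper, and the numeric values are placeholders invented for the example.

```python
# Toy records keyed by World Bank economy code.
# The numbers are placeholders for illustration, not real statistics.
records = {
    "NGA": {"crude_death_rate": 1.0, "adult_literacy": 2.0},
    "GBR": {"crude_death_rate": 3.0, "adult_literacy": None},
    "WLD": {"crude_death_rate": 4.0, "adult_literacy": 5.0},   # World aggregate
    "SSF": {"crude_death_rate": 6.0, "adult_literacy": None},  # Sub-Saharan Africa aggregate
}

# A small, incomplete set of aggregate codes (assumption: the real
# package may identify aggregates differently).
AGGREGATE_CODES = {"WLD", "SSF", "EAS", "HIC"}

# Keep only rows that are individual countries.
countries = {
    code: row for code, row in records.items() if code not in AGGREGATE_CODES
}


def missing_report(rows):
    """Return the fraction of missing (None) values for each variable."""
    variables = sorted({name for row in rows.values() for name in row})
    n = len(rows)
    return {
        name: sum(row.get(name) is None for row in rows.values()) / n
        for name in variables
    }


report = missing_report(countries)
for name, fraction in report.items():
    print(f"{name}: {fraction:.0%} missing")
```

Reporting missingness after removing aggregates matters because aggregates are usually complete and would otherwise understate the gaps in country-level data.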

2. Exploratory Data Analysis (eda.py)

Demonstrates the ecological fallacy and correlation analysis.

from src.eda import demonstrate_ecological_fallacy, analyze_correlations_hierarchical

# CRITICAL: Run this first to understand limitations!
fig = demonstrate_ecological_fallacy()

# Hierarchical correlation analysis
corr_fig, correlations = analyze_correlations_hierarchical(df)
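
The ecological fallacy can also be illustrated with a few lines of simulation, independent of whatever `demonstrate_ecological_fallacy()` does internally. The sketch below uses purely synthetic data in which group means rise together while the relationship inside every group is negative; all names and parameters are assumptions made for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)
groups = []  # one (xs, ys) pair per simulated "country"
for g in range(5):
    centre_x, centre_y = 10.0 * g, 10.0 * g  # group means rise together
    xs = [centre_x + rng.gauss(0, 2) for _ in range(200)]
    # Within a group, y falls as x rises (slope -1), plus noise.
    ys = [centre_y - (x - centre_x) + rng.gauss(0, 1) for x in xs]
    groups.append((xs, ys))

# Ecological (aggregate) correlation: computed on the five group means.
group_means_x = [sum(xs) / len(xs) for xs, _ in groups]
group_means_y = [sum(ys) / len(ys) for _, ys in groups]
ecological_r = pearson(group_means_x, group_means_y)

# Individual-level correlation inside each group.
within_r = [pearson(xs, ys) for xs, ys in groups]

print(f"Correlation of group means: {ecological_r:+.2f}")
print("Within-group correlations:", [f"{r:+.2f}" for r in within_r])
```

The aggregate correlation is strongly positive while every within-group correlation is negative, which is exactly the situation in which reading country-level associations as individual effects goes wrong.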

3. Clustering (clustering.py)

Groups countries by development and health profiles.

from src.clustering import find_optimal_clusters, cluster_countries

# Find optimal number of clusters
optimal_k, diag_fig = find_optimal_clusters(df)

# Cluster countries
df_clustered = cluster_countries(df, n_clusters=optimal_k)
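
One common way to choose the number of clusters is the elbow heuristic on k-means inertia. The sketch below is a minimal pure-Python version on synthetic 2-D points; it is not the implementation behind `find_optimal_clusters`, which may use a different criterion (for example silhouette scores). The `kmeans` and `elbow` helpers are hypothetical names.

```python
import random


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with farthest-point seeding; returns (centroids, inertia)."""
    def d2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(d2(p, c) for c in centroids))
        )

    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: d2(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids

    inertia = sum(min(d2(p, c) for c in centroids) for p in points)
    return centroids, inertia


def elbow(inertias):
    """Pick the k with the largest second difference of inertia."""
    ks = sorted(inertias)
    curvature = {
        k: inertias[k - 1] - 2 * inertias[k] + inertias[k + 1] for k in ks[1:-1]
    }
    return max(curvature, key=curvature.get)


# Synthetic data: three well-separated blobs of 30 points each.
rng = random.Random(0)
centres = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.7)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in centres
    for _ in range(30)
]

inertias = {k: kmeans(points, k)[1] for k in range(1, 7)}
elbow_k = elbow(inertias)
print("Inertia by k:", {k: round(v, 1) for k, v in inertias.items()})
print("Elbow at k =", elbow_k)
```

With real country data the elbow is rarely this sharp, so it is worth checking the suggested k against a second criterion and against substantive interpretability.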

4. Spatial Analysis (spatial.py)

Tests for and visualises spatial autocorrelation.

from src.spatial import (create_spatial_weights, calculate_moran_i,
                         calculate_lisa, plot_hotspot_map)

# Create weights matrix
W, ids, df_spatial = create_spatial_weights(df, method='knn', k=5)

# Global test
moran = calculate_moran_i(df_spatial, 'crude_death_rate', W, ids)

# Local clusters (LISA)
lisa = calculate_lisa(df_spatial, 'crude_death_rate', W, ids)

# Hot spot map
hotspot_fig = plot_hotspot_map(df_spatial, 'crude_death_rate', W, ids, lisa)

Interpretation Guide:

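  • Moran's I > 0: Clustering (similar values near each other)
  • Moran's I ≈ 0: No spatial autocorrelation (the expected value under the null is -1/(n-1), which is close to 0 for large n)
  • Moran's I < 0: Dispersion (dissimilar values near each other)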
  • HH (Hot Spot): High value surrounded by high values
  • LL (Cold Spot): Low value surrounded by low values
  • HL/LH: Spatial outliers
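
To make the sign of Moran's I concrete, the sketch below computes it by hand on a toy 5×5 grid with row-standardised k-nearest-neighbour weights. It is an independent illustration, not the code behind `calculate_moran_i`; the helpers `knn_weights` and `morans_i` are hypothetical, and plain Euclidean distance is used, whereas real country centroids would call for great-circle distance.

```python
def knn_weights(coords, k):
    """Row-standardised KNN weights: each of the k nearest neighbours gets 1/k."""
    weights = []
    for i, (xi, yi) in enumerate(coords):
        dists = sorted(
            ((xi - xj) ** 2 + (yi - yj) ** 2, j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        weights.append({j: 1.0 / k for _, j in dists[:k]})
    return weights


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i ** 2)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row.values()) for row in weights)
    cross = sum(
        w * z[i] * z[j] for i, row in enumerate(weights) for j, w in row.items()
    )
    return (n / s0) * cross / sum(zi ** 2 for zi in z)


coords = [(x, y) for x in range(5) for y in range(5)]
W = knn_weights(coords, k=4)

# Smooth west-to-east gradient: neighbours have similar values.
gradient = [float(x) for x, _ in coords]
# Checkerboard: adjacent cells alternate between 0 and 1.
checkerboard = [float((x + y) % 2) for x, y in coords]

moran_gradient = morans_i(gradient, W)
moran_checker = morans_i(checkerboard, W)
print(f"Gradient pattern:     I = {moran_gradient:+.2f}")
print(f"Checkerboard pattern: I = {moran_checker:+.2f}")
```

The gradient yields a clearly positive statistic (clustering) and the checkerboard a clearly negative one (dispersion), matching the first and third bullets of the guide. Significance testing, usually done by permutation, is left out of this sketch.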

5. Regression Analysis (regression.py)

Progressive modelling with comprehensive diagnostics.

from src.regression import (build_progressive_models, regression_diagnostics,
                            sensitivity_analysis, robust_standard_errors)

# Build hierarchical models
results, models = build_progressive_models(df)
model = models['Model 3: Full Model']

# X and y are the predictor matrix and outcome the model was fitted on;
# they are not returned by the call above and must be prepared separately.

# Full diagnostics
diagnostics = regression_diagnostics(model, X)

# Identify influential observations
influential, fig = sensitivity_analysis(model, X, y, df)

# Robust standard errors if needed
robust_comparison = robust_standard_errors(model, cov_type='HC3')

Diagnostic Tests Included:

  • VIF (multicollinearity)
  • Breusch-Pagan (heteroscedasticity)
  • Shapiro-Wilk (normality)
  • Durbin-Watson (autocorrelation)
  • Cook's Distance (influential observations)
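
As an example of what one of these diagnostics measures, the sketch below computes the variance inflation factor for the special case of two predictors, where VIF = 1 / (1 - r²) and r is their Pearson correlation. It uses synthetic data and hypothetical helper names, and is not the code used by `regression_diagnostics`; with more than two predictors, r² is replaced by the R² from regressing each predictor on all the others.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF for either predictor in a model with exactly two predictors."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


rng = random.Random(1)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2_collinear = [a + rng.gauss(0, 0.1) for a in x1]      # nearly a copy of x1
x2_independent = [rng.gauss(0, 1) for _ in range(200)]  # unrelated to x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_independent)
print(f"VIF with a near-duplicate predictor: {vif_high:.1f}")
print(f"VIF with an independent predictor:   {vif_low:.2f}")
```

A frequently quoted rule of thumb treats VIF above about 10 as a sign of problematic multicollinearity, although the threshold is a convention rather than a formal test.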

6. Machine Learning (ml_models.py)

AutoML with PyCaret and SHAP interpretation.

from src.ml_models import setup_pycaret, compare_ml_models, interpret_model

# Setup environment
exp = setup_pycaret(df, target='crude_death_rate')

# Compare models
best_model, _ = compare_ml_models(n_select=1)

# Interpret with SHAP
interpretation = interpret_model(best_model, df)

7. Policy Simulation (policy_simulation.py)

In-silico experiments with uncertainty quantification.

from src.policy_simulation import (run_policy_simulation_with_uncertainty,
                                   compare_policy_scenarios,
                                   create_country_report_card)

# Compare policies with bootstrap CIs
scenarios = {
    'Education (+20%)': ('adult_literacy', 0.20),
    'Healthcare (+20%)': ('physicians_density', 0.20),
}
results = compare_policy_scenarios(model, df, scenarios, with_uncertainty=True)

# Country-specific report
report = create_country_report_card(model, df, country='Nigeria', scenarios=scenarios)
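
The idea of a policy scenario with bootstrap confidence intervals can be sketched on synthetic data as follows. This is an assumption-laden illustration, not the logic of `compare_policy_scenarios`: it fits a one-predictor OLS model, raises the predictor by 20%, and resamples rows to obtain a percentile interval for the predicted change in the mean outcome. The variable names and data-generating values are invented for the example.

```python
import random


def fit_ols(xs, ys):
    """Simple least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


def simulated_change(xs, ys, relative_increase):
    """Predicted change in the mean outcome if every x rises by the given fraction."""
    _, slope = fit_ols(xs, ys)
    mean_x = sum(xs) / len(xs)
    return slope * relative_increase * mean_x


# Synthetic data (not real statistics): outcome falls as the predictor rises.
rng = random.Random(7)
literacy = [rng.uniform(40, 100) for _ in range(150)]
death_rate = [15 - 0.08 * x + rng.gauss(0, 1) for x in literacy]

point_estimate = simulated_change(literacy, death_rate, 0.20)

# Nonparametric bootstrap: resample rows with replacement and refit each time.
n = len(literacy)
boot = []
for _ in range(1000):
    idx = [rng.randrange(n) for _ in range(n)]
    boot.append(
        simulated_change(
            [literacy[i] for i in idx], [death_rate[i] for i in idx], 0.20
        )
    )
boot.sort()
ci_low, ci_high = boot[24], boot[974]  # approximate 95% percentile interval

print(f"Predicted change: {point_estimate:.2f} "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f})")
```

The interval reflects sampling uncertainty in the fitted model only. It says nothing about whether the association is causal, so the result remains a model prediction rather than an estimate of what an intervention would achieve.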

8. Visualisation (visualisation.py)

Publication-ready plots.

from src.visualisation import (create_funnel_plot, create_bubble_plot,
                               create_world_map, create_dag_visualisation)

# Funnel plot (outlier detection)
funnel = create_funnel_plot(df)

# Bubble chart (Hans Rosling style)
bubble = create_bubble_plot(df, x_var='gni_per_capita', log_x=True)

# Choropleth map
world_map = create_world_map(df, variable='crude_death_rate')

# Causal DAG
dag = create_dag_visualisation()

9. Interactive Dashboard (dashboard.py)

Gradio-based teaching interface.

from src.dashboard import launch_dashboard

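# Launch interactive simulator (model is a previously fitted model)
# Note: in Gradio, share=True normally requests a temporary public link;
# use share=False to keep the dashboard local.
launch_dashboard(model, df, share=True)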

🎓 Teaching Notes

Workshop Structure (2 hours)

| Module | Duration | Key Learning |
|---|---|---|
| 1. Ecological Fallacy | 15 min | Limitations of ecological analysis |
| 2. Data Acquisition | 10 min | Reproducible data access |
| 3. EDA | 20 min | Correlation and visualisation |
| 4. Clustering | 15 min | Country groupings |
| 5. Spatial Analysis | 20 min | Moran's I and hot spots |
| 6. Regression | 20 min | Progressive modelling |
| 7-9. ML/Policy/Dashboard | 20 min | Advanced topics |

Key Discussion Points

  1. Why can't we infer individual effects from ecological data?

    • Confounding at aggregate level
    • Aggregation bias
    • Cross-level inference problem
  2. Why does spatial autocorrelation matter?

    • Violates the independence assumption of standard regression
    • Standard errors are underestimated when it is ignored
    • Type I error rates are inflated as a result
  3. How should we interpret policy simulations?

    • Model predictions, not causal effects
    • Uncertainty is crucial
    • Real-world implementation differs

⚠️ Limitations and Caveats

  1. Ecological Fallacy: Results describe country-level associations, NOT individual effects

  2. Cross-sectional Design: Cannot establish causation

  3. Data Quality: World Bank data has varying completeness across countries

  4. Spatial Approximation: KNN on centroids is approximate; proper analysis needs shapefiles

  5. Small Sample: ~200 countries limits complex ML models


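Data Source

World Bank indicators, retrieved programmatically with the wbgapi library (see data_acquisition.py).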


📝 License

This workshop material is provided for educational purposes.


🤝 Contributing

Contributions welcome! Please submit issues or pull requests.


📧 Contact

Olalekan Uthman - olalekan.uthman@warwick.ac.uk
