A comprehensive, modular Python package for teaching ecological analyses using World Bank data.
This workshop teaches ecological analysis methods for global health research. Students learn to:
- Understand limitations: The ecological fallacy and its implications
- Acquire data: Programmatic access to World Bank indicators
- Explore patterns: Correlation analysis, clustering, and visualisation
- Detect spatial patterns: Moran's I, LISA, hot/cold spot analysis
- Build models: Progressive regression with full diagnostics
- Apply ML: AutoML comparison with SHAP interpretation
- Simulate policies: In-silico experiments with uncertainty quantification
```
ecological_workshop/
│
├── src/ # Source modules
│ ├── __init__.py # Package initialisation
│ ├── data_acquisition.py # World Bank data fetching
│ ├── eda.py # Exploratory data analysis
│ ├── clustering.py # K-means and hierarchical clustering
│ ├── spatial.py # Moran's I, LISA, hot spots
│ ├── regression.py # OLS with diagnostics
│ ├── ml_models.py # PyCaret and SHAP
│ ├── policy_simulation.py # In-silico experiments
│ ├── visualisation.py # Publication-ready plots
│ ├── dashboard.py # Gradio interactive interface
│ └── utils.py # Helper functions
│
├── notebooks/ # Jupyter notebooks
│ └── 01_main_workshop.ipynb # Main workshop notebook
│
├── data/ # Data directory (auto-populated)
├── outputs/ # Generated outputs
├── docs/ # Additional documentation
│
├── requirements.txt # Python dependencies
└── README.md # This file
```
```bash
# Clone or download the workshop
cd ecological_workshop
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux/Mac
# or: venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# For spatial analysis (optional)
# apt-get install graphviz   # System package for DAG visualisation
```

```bash
# Start Jupyter
jupyter notebook
# Open: notebooks/01_main_workshop.ipynb
```

```python
# Import modules
from src.data_acquisition import fetch_world_bank_data
from src.spatial import create_spatial_weights, calculate_moran_i
from src.visualisation import create_funnel_plot
# Fetch data
df, distal, inter, prox = fetch_world_bank_data()
# Create spatial weights and test for autocorrelation
W, ids, df_spatial = create_spatial_weights(df, k=5)
moran = calculate_moran_i(df_spatial, 'crude_death_rate', W, ids)
# Create visualisations
funnel_fig = create_funnel_plot(df)
funnel_fig.show()
```

`src/data_acquisition.py`: fetches and cleans World Bank data using the wbgapi library.

```python
from src.data_acquisition import fetch_world_bank_data, create_data_dictionary
# Fetch most recent data
df, distal_vars, inter_vars, prox_vars = fetch_world_bank_data(verbose=True)
# Create documentation
data_dict = create_data_dictionary(df)
```

Key Features:
- Hierarchical variable organisation (Distal → Intermediate → Proximate)
- Automatic aggregate removal (World, Regions)
- Geographic coordinates for spatial analysis
- Missing data reporting
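The hierarchy and the missing-data reporting can be inspected directly on the objects returned above. A minimal sketch, assuming the `*_vars` lists contain column names of `df`:

```python
# Inspect the variable hierarchy and audit completeness of one tier
# (assumes the *_vars lists returned above are column names of df)
print("Distal:      ", distal_vars)
print("Intermediate:", inter_vars)
print("Proximate:   ", prox_vars)

# Share of countries missing each distal indicator
missing_share = df[distal_vars].isna().mean().sort_values(ascending=False)
print(missing_share.round(2))
```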
`src/eda.py`: demonstrates the ecological fallacy and correlation analysis.

```python
from src.eda import demonstrate_ecological_fallacy, analyze_correlations_hierarchical
# CRITICAL: Run this first to understand limitations!
fig = demonstrate_ecological_fallacy()
# Hierarchical correlation analysis
corr_fig, correlations = analyze_correlations_hierarchical(df)
```
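To see the fallacy in numbers rather than in the packaged figure, the classic pattern can be simulated in a few lines: the group-level (ecological) correlation is strongly positive even though the correlation within every group is negative. This toy example uses synthetic data, not the World Bank indicators:

```python
# Toy demonstration of the ecological fallacy (synthetic data):
# within each group the exposure-outcome correlation is negative,
# yet the correlation of the group means is strongly positive.
import numpy as np

rng = np.random.default_rng(42)
group_means = np.arange(5) * 10.0          # five "countries" with rising exposure
x_groups, y_groups = [], []
for m in group_means:
    x = m + rng.normal(0, 2, size=200)                        # individual exposure
    y = 2 * m - 1.5 * (x - m) + rng.normal(0, 1, size=200)    # negative slope within group
    x_groups.append(x)
    y_groups.append(y)

within = np.corrcoef(x_groups[0], y_groups[0])[0, 1]            # individual level
ecological = np.corrcoef([x.mean() for x in x_groups],
                         [y.mean() for y in y_groups])[0, 1]    # group level
print(f"Within-group correlation: {within:.2f}")     # strongly negative
print(f"Ecological correlation:   {ecological:.2f}") # close to +1
```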
`src/clustering.py`: groups countries by development and health profiles.

```python
from src.clustering import find_optimal_clusters, cluster_countries
# Find optimal number of clusters
optimal_k, diag_fig = find_optimal_clusters(df)
# Cluster countries
df_clustered = cluster_countries(df, n_clusters=optimal_k)
```
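Once countries are grouped, it is worth profiling what distinguishes the clusters. A minimal sketch, assuming `cluster_countries` adds a `cluster` column to the returned DataFrame (check the actual column name in your output) and using indicator names that appear elsewhere in this README:

```python
# Profile clusters by mean indicator values and size
# (assumes a 'cluster' column; adjust names to match your DataFrame)
indicators = ['gni_per_capita', 'adult_literacy', 'crude_death_rate']
profile = df_clustered.groupby('cluster')[indicators].mean().round(1)
print(profile)
print(df_clustered['cluster'].value_counts().sort_index())
```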
`src/spatial.py`: tests for and visualises spatial autocorrelation.

```python
from src.spatial import (create_spatial_weights, calculate_moran_i,
                         calculate_lisa, plot_hotspot_map)
# Create weights matrix
W, ids, df_spatial = create_spatial_weights(df, method='knn', k=5)
# Global test
moran = calculate_moran_i(df_spatial, 'crude_death_rate', W, ids)
# Local clusters (LISA)
lisa = calculate_lisa(df_spatial, 'crude_death_rate', W, ids)
# Hot spot map
hotspot_fig = plot_hotspot_map(df_spatial, 'crude_death_rate', W, ids, lisa)
```

Interpretation Guide:
- Moran's I > 0: Clustering (similar values near each other)
- Moran's I < 0: Dispersion (dissimilar values near each other)
- HH (Hot Spot): High value surrounded by high values
- LL (Cold Spot): Low value surrounded by low values
- HL/LH: Spatial outliers
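To demystify the global statistic, Moran's I can be computed by hand from a spatial weights matrix. The sketch below uses plain NumPy and is independent of the module's implementation:

```python
# Moran's I from first principles:
# I = (n / S0) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2
import numpy as np

def morans_i(x, W):
    """x: 1-D array of values; W: (n, n) spatial weights matrix."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    s0 = W.sum()                 # sum of all weights
    numerator = z @ W @ z        # sum_ij w_ij * z_i * z_j
    denominator = (z ** 2).sum()
    return (len(x) / s0) * numerator / denominator

# Tiny worked example: four locations on a line, adjacent pairs as neighbours
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(morans_i([1, 2, 3, 4], W))   # ≈ 0.33: positive, neighbouring values are similar
```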
`src/regression.py`: progressive modelling with comprehensive diagnostics.

```python
from src.regression import (build_progressive_models, regression_diagnostics,
                            sensitivity_analysis, robust_standard_errors)
# Build hierarchical models
results, models = build_progressive_models(df)
# Full diagnostics
diagnostics = regression_diagnostics(models['Model 3: Full Model'], X)
# Identify influential observations
influential, fig = sensitivity_analysis(model, X, y, df)
# Robust standard errors if needed
robust_comparison = robust_standard_errors(model, cov_type='HC3')
```

Diagnostic Tests Included:
- VIF (multicollinearity)
- Breusch-Pagan (heteroscedasticity)
- Shapiro-Wilk (normality)
- Durbin-Watson (autocorrelation)
- Cook's Distance (influential observations)
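All of these tests are available directly in `statsmodels` and `scipy`. The sketch below shows how they might be run on a fitted OLS result; it illustrates what `regression_diagnostics` presumably wraps rather than copying it, and it assumes `model` is a statsmodels OLS results object and `X` its predictor DataFrame:

```python
# Running the listed diagnostics on a fitted statsmodels OLS result
# (illustrative; `model` is an OLS results object, `X` its predictor DataFrame)
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

vif = [variance_inflation_factor(np.asarray(X), i) for i in range(X.shape[1])]
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, model.model.exog)
sw_stat, sw_pvalue = shapiro(model.resid)
dw = durbin_watson(model.resid)
cooks_d, _ = model.get_influence().cooks_distance   # influential observations

print(f"Max VIF: {max(vif):.1f} | Breusch-Pagan p: {bp_pvalue:.3f} | "
      f"Shapiro-Wilk p: {sw_pvalue:.3f} | Durbin-Watson: {dw:.2f} | "
      f"Points with Cook's D > 4/n: {(cooks_d > 4 / len(cooks_d)).sum()}")
```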
`src/ml_models.py`: AutoML with PyCaret and SHAP interpretation.

```python
from src.ml_models import setup_pycaret, compare_ml_models, interpret_model
# Setup environment
exp = setup_pycaret(df, target='crude_death_rate')
# Compare models
best_model, _ = compare_ml_models(n_select=1)
# Interpret with SHAP
interpretation = interpret_model(best_model, df)
```
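If you want to see what the SHAP step involves outside the module wrapper, a bare-bones version looks roughly like the following; treating `best_model` as a fitted regressor with a scikit-learn-style `predict()` method and `X` as the predictor DataFrame is an assumption here:

```python
# Bare-bones SHAP interpretation (illustrative; assumes best_model exposes
# a scikit-learn-style predict() and X is the predictor DataFrame)
import shap

explainer = shap.Explainer(best_model.predict, X)   # model-agnostic explainer
shap_values = explainer(X)

shap.plots.beeswarm(shap_values)       # global importance and direction of effects
shap.plots.waterfall(shap_values[0])   # contribution breakdown for one country
```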
`src/policy_simulation.py`: in-silico experiments with uncertainty quantification.

```python
from src.policy_simulation import (run_policy_simulation_with_uncertainty,
                                   compare_policy_scenarios,
                                   create_country_report_card)
# Compare policies with bootstrap CIs
scenarios = {
'Education (+20%)': ('adult_literacy', 0.20),
'Healthcare (+20%)': ('physicians_density', 0.20),
}
results = compare_policy_scenarios(model, df, scenarios, with_uncertainty=True)
# Country-specific report
report = create_country_report_card(model, df, country='Nigeria', scenarios=scenarios)
```
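Conceptually, the uncertainty quantification is a bootstrap over countries: resample, apply the scenario, and record the mean predicted change. The stripped-down sketch below illustrates the idea only (the module's own algorithm may differ) and assumes `model` exposes a scikit-learn-style `predict()` and `features` lists the predictor columns:

```python
# Conceptual bootstrap CI for a simulated policy effect (illustrative only;
# assumes model.predict() and a `features` list of predictor columns)
import numpy as np

def bootstrap_policy_effect(model, df, features, var, pct_change, n_boot=1000):
    effects = []
    for i in range(n_boot):
        sample = df.sample(frac=1.0, replace=True, random_state=i)   # resample countries
        scenario = sample.copy()
        scenario[var] = scenario[var] * (1 + pct_change)             # e.g. +20% literacy
        delta = model.predict(scenario[features]) - model.predict(sample[features])
        effects.append(delta.mean())                                 # mean predicted change
    low, high = np.percentile(effects, [2.5, 97.5])
    return np.mean(effects), (low, high)

# Hypothetical example: effect of a 20% increase in adult literacy
# mean_effect, ci = bootstrap_policy_effect(model, df, features, 'adult_literacy', 0.20)
```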
`src/visualisation.py`: publication-ready plots.

```python
from src.visualisation import (create_funnel_plot, create_bubble_plot,
                               create_world_map, create_dag_visualisation)
# Funnel plot (outlier detection)
funnel = create_funnel_plot(df)
# Bubble chart (Hans Rosling style)
bubble = create_bubble_plot(df, x_var='gni_per_capita', log_x=True)
# Choropleth map
world_map = create_world_map(df, variable='crude_death_rate')
# Causal DAG
dag = create_dag_visualisation()
```
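For a quick custom map outside the module, a Plotly Express choropleth is a one-liner; the `iso3` and `country` column names below are assumptions, so substitute whatever identifiers your DataFrame carries:

```python
# Quick custom choropleth with Plotly Express (illustrative; the 'iso3' and
# 'country' column names are assumptions; adjust to your DataFrame)
import plotly.express as px

fig = px.choropleth(
    df,
    locations='iso3',                  # ISO-3166 alpha-3 country codes
    color='crude_death_rate',          # variable to map
    hover_name='country',
    color_continuous_scale='Viridis',
)
fig.show()
```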
`src/dashboard.py`: Gradio-based teaching interface.

```python
from src.dashboard import launch_dashboard
# Launch interactive simulator
launch_dashboard(model, df, share=True)
```
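Under the hood, a Gradio interface of this kind takes only a few lines. Below is a minimal stand-alone sketch (not the module's actual layout) that wires a slider to a prediction function; the `model.predict()` interface and the `features` list of predictor columns are assumptions:

```python
# Minimal stand-alone Gradio sketch (illustrative; not the module's actual UI).
# Assumes `model` has a scikit-learn-style predict() and `features` lists
# the predictor columns used to fit it.
import gradio as gr

def simulate(literacy_change_pct):
    scenario = df.copy()
    scenario['adult_literacy'] *= (1 + literacy_change_pct / 100)
    delta = model.predict(scenario[features]) - model.predict(df[features])
    return f"Mean predicted change in crude death rate: {delta.mean():.2f}"

demo = gr.Interface(
    fn=simulate,
    inputs=gr.Slider(-50, 50, value=20, label="Change in adult literacy (%)"),
    outputs="text",
    title="Policy simulator (toy version)",
)
demo.launch()
```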
Suggested schedule:

| Module | Duration | Key Learning |
|---|---|---|
| 1. Ecological Fallacy | 15 min | Limitations of ecological analysis |
| 2. Data Acquisition | 10 min | Reproducible data access |
| 3. EDA | 20 min | Correlation and visualisation |
| 4. Clustering | 15 min | Country groupings |
| 5. Spatial Analysis | 20 min | Moran's I and hot spots |
| 6. Regression | 20 min | Progressive modelling |
| 7-9. ML/Policy/Dashboard | 20 min | Advanced topics |
Discussion questions for students:

- Why can't we infer individual effects from ecological data?
  - Confounding at the aggregate level
  - Aggregation bias
  - The cross-level inference problem
- Why does spatial autocorrelation matter?
  - It violates the independence assumption
  - It underestimates standard errors
  - It inflates Type I error rates
- How should we interpret policy simulations?
  - As model predictions, not causal effects
  - Uncertainty is crucial
  - Real-world implementation differs
Key limitations to keep in mind:

- Ecological Fallacy: results describe country-level associations, NOT individual-level effects
- Cross-sectional Design: cannot establish causation
- Data Quality: World Bank data has varying completeness across countries
- Spatial Approximation: KNN on country centroids is approximate; a full analysis needs boundary shapefiles
- Small Sample: ~200 countries limits complex ML models
- World Bank Open Data: https://data.worldbank.org
- wbgapi documentation: https://wbgapi.readthedocs.io
This workshop material is provided for educational purposes.
Contributions welcome! Please submit issues or pull requests.
Olalekan Uthman - olalekan.uthman@warwick.ac.uk