[Data & Code] Predicting poverty and wealth from mobile phone metadata #11

@chengjun

Predicting poverty and wealth from mobile phone metadata

 
Joshua Blumenstock, Gabriel Cadamuro, Robert On
 

Mobile phone data were supplied by an anonymous service provider in Rwanda and are not available for distribution.

All other data and code, including all intermediate data needed to replicate these results and apply these methods in other contexts, are available through the Inter-university Consortium for Political and Social Research (http://doi.org/10.3886/E50592V2). See also https://www.openicpsr.org/openicpsr/project/100144/version/V5/view

INTRODUCTION

The code and data required to replicate the findings in the paper are organized into three top-level directories:
/modeling: Used to construct features from call records, and to perform supervised learning on the phone survey sample
/validation: Used to aggregate and spatially join phone and DHS data
/analysis: Used to construct most figures and tables

SYSTEM REQUIREMENTS
* Python running sklearn 0.16.1 and graphlab 1.6.1
* JVM
* Spark
* SBT
* Edit paths in run_local.sh
* Additional details on individual files, including dependencies, are provided in this readme for convenience:

CODE OVERVIEW
Note: ICPSR does not allow certain file extensions. Files with those extensions are automatically given a .txt extension, which should be changed back to the original before running the code.
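The renaming can be scripted. The following is a minimal sketch, not part of the replication package. It assumes the original extension is still present in front of the added .txt (for example, a hypothetical `fig1.R.txt`); if the extension was replaced outright, the mapping has to be specified by hand. The function names and the extension list are illustrative.

```python
from pathlib import Path

# Assumption: code extensions that appear in this readme and may have been masked.
CODE_EXTENSIONS = {".py", ".R", ".sh"}


def restored_name(path: Path) -> Path:
    """Return the path with a masking '.txt' suffix removed, if one applies."""
    if path.suffix == ".txt" and Path(path.stem).suffix in CODE_EXTENSIONS:
        return path.with_suffix("")  # 'fig1.R.txt' -> 'fig1.R'
    return path


def restore_extensions(root: str, dry_run: bool = True) -> list:
    """Collect (old, new) renames under root; apply them unless dry_run is True."""
    renames = []
    for path in sorted(Path(root).rglob("*.txt")):
        target = restored_name(path)
        if target != path:
            renames.append((path, target))
            if not dry_run:
                path.rename(target)
    return renames
```

Calling `restore_extensions(".")` with the default `dry_run=True` only lists the proposed renames, so they can be checked before any file is touched.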

/analysis
/analysis/fig1.R

  • used to construct figure 1 scatter plot and ROC curves
  • requires: /modeling/data/train/PhoneSurveyFeats.csv
    /modeling/data/train/ROC.csv

/analysis/fig3.R

  • produces Figure 3 and SM Figure 6
  • requires: /validation/data/out/districts2010.csv
    /validation/data/out/clusters2010.csv

/analysis/figS2.R

  • produces SM Figure 2
  • requires: /modeling/data/train/sampleSize.csv

/analysis/figS3.R

  • produces SM Figure 3
  • requires: /modeling/data/train/feature_r2.csv
  • requires: /analysis/vioPlot.R

/analysis/figS8.R

  • produces SM Figure 8
  • requires: /validation/data/out/districts2010.csv
    /validation/data/out/districts2007.csv
    /analysis/rwmap/District_Boundary_2012/

/analysis/figS4/code/figS4.R

  • produces SM Figure 4
  • requires: /modeling/data/train/feature_r2.csv
    /analysis/figS4/data/in/trainFeats.csv
    /analysis/figS4/data/in/trainObj.csv

/modeling
Note: all Python files in /modeling/code/scripts require the path to a config file as the first command-line argument

/modeling/code/scripts/PhoneSurveyPreProc.py

  • Produces the composite wealth index using PCA
  • requires /modeling/data/misc/districtWeights.csv
    survey population weights
  • output
    /modeling/data/preprocessing/survey_weights.csv
    /modeling/data/preprocessing/survey_means.csv
    /modeling/data/preprocessing/survey_stds.csv
    PCA vectors, used to construct projections later
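As an illustration of the idea behind this step, a composite index can be taken as the first principal component of the standardized survey responses. The sketch below is not the code in PhoneSurveyPreProc.py. The function name, the argument layout, and the way the population weights enter the calculation are assumptions made for the example.

```python
import numpy as np


def wealth_index(responses, weights=None):
    """First principal component of standardized survey responses.

    responses: (n_respondents, n_questions) array of numeric answers.
    weights:   optional per-respondent population weights (assumed usage).
    Returns (index, means, stds, loadings).
    """
    X = np.asarray(responses, dtype=float)
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Weighted mean and standard deviation of each survey question.
    # A constant column would give a zero standard deviation; drop such columns first.
    means = w @ X
    stds = np.sqrt(w @ (X - means) ** 2)
    Z = (X - means) / stds

    # Weighted covariance of the standardized responses; its leading
    # eigenvector is the first principal component.
    cov = (Z * w[:, None]).T @ Z
    eigvals, eigvecs = np.linalg.eigh(cov)
    loadings = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order

    # Eigenvectors have an arbitrary sign; orient so the loadings sum to a positive value.
    if loadings.sum() < 0:
        loadings = -loadings
    return Z @ loadings, means, stds, loadings
```

The returned means, standard deviations, and loadings are the quantities one would store in order to project new observations later, as `((new - means) / stds) @ loadings`.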

/modeling/code/scripts/FeatureGenerator.py
Performs feature engineering on the original CDR files using graphlab. The DFA structure is specified in the config file.

  • requires /modeling/data/cdr/domestic_cdr.csv
    /modeling/data/cdr/sms_cdr.csv
    /modeling/data/cdr/intl_cdr.csv
    Original call detail records (redacted)
    /modeling/data/cdr/allIds.csv
    List of all ids appearing in any CDR (redacted)
    /modeling/code/libs/AggregateFeatureLib.py
    /modeling/code/libs/FeatureLib.py
    /modeling/code/libs/LocalNetworkLib.py
  • output
    /modeling/data/train/PhoneSurveyFeats.csv
    Contains survey responses and CDR features for all survey respondents, including actual and predicted wealth composite (redacted).
    /modeling/data/train/PhoneSurveyFeatsHash.csv
    Same as above, but includes original data and masks the call metrics for each subscriber. Can be used for replication of e.g. Fig 1.
    /modeling/data/features/finalFeatures.csv
    CDR features for all subscribers in Rwanda (redacted)

/modeling/code/scripts/TrainModel.py
Fits supervised learning models using cross-validation on the training set and generates out-of-sample predictions. Additional command-line arguments generate the analysis files used to construct figures (the required argument is listed next to each output file).
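For readers unfamiliar with the procedure, the sketch below shows how out-of-sample predictions can be obtained with k-fold cross-validation. It is not the code in TrainModel.py. Ordinary least squares stands in for whatever models that script fits, and the function name and parameters are hypothetical.

```python
import numpy as np


def cross_val_predict_ols(X, y, n_folds=5, seed=0):
    """Out-of-fold predictions from ordinary least squares (a stand-in model)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # add an intercept column
    order = np.random.RandomState(seed).permutation(n)
    preds = np.empty(n)
    for held_out in np.array_split(order, n_folds):
        # Fit on every row outside the held-out fold, then predict the fold.
        train = np.setdiff1d(order, held_out)
        coef, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        preds[held_out] = design[held_out] @ coef
    return preds
```

Each respondent's prediction comes from a model that never saw that respondent, so comparing `preds` with `y` gives an estimate of out-of-sample accuracy.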

  • requires /modeling/data/train/PhoneSurveyFeats.csv
    /modeling/data/train/PhoneSurveyFeatsHash.csv
    /modeling/data/features/finalFeatures.csv
    /modeling/data/aux/districtWeights.csv
    /modeling/code/libs/TrainLib.py
    /modeling/code/libs/FeatureNamerLib.py
  • output
    /modeling/data/train/allPredictions.csv
    Predicted values of wealth and assets for all mobile subscribers
    /modeling/data/train/_roc.csv [with arg = “roc”]
    Data used to generate ROC curves, based on predictions
    /modeling/data/train/feature_r2.csv [with arg = “singler2”]
    Bivariate R2 values from a regression of wealth index on each individual CDR feature. Reports AUC values for binary assets
    /modeling/data/train/sampleSize.csv [with arg = “samplesize”]
    Model performance as a function of sample size
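The bivariate R2 values in feature_r2.csv come from regressing the wealth index on one CDR feature at a time. A minimal sketch of that calculation follows, using the fact that for a one-variable regression with an intercept, R2 equals the squared Pearson correlation. The function name and the dictionary layout are assumptions for the example, and the AUC calculation for binary assets is not shown.

```python
import numpy as np


def bivariate_r2(features, wealth):
    """R^2 from regressing wealth on each feature separately (with intercept).

    features: dict mapping a feature name to a sequence of values.
    wealth:   sequence of wealth index values, one per respondent.
    """
    y = np.asarray(wealth, dtype=float)
    result = {}
    for name, values in features.items():
        x = np.asarray(values, dtype=float)
        r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between feature and wealth
        result[name] = r ** 2
    return result
```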

/validation/
/validation/code/run_joint.sh
Joins CDR predictions with DHS data to validate the predictions. The script first reads the asset and wealth predictions derived from CDR data, together with the DHS data. It then reads location data for CDR towers and DHS clusters, and performs a spatial join of wealth imputed at CDR locations to wealth recorded at DHS locations. Results are aggregated to the district and cluster level.

  • requires: data/in/DhsRwandaClusters0708.csv - 2007 DHS Cluster Data
    data/in/DhsRwandaClusters10.csv - 2010 DHS Cluster Data
    data/in/DhsRwandaHH0708.tsv - 2007 DHS Household Survey Data
    data/in/DhsRwandaHH10.tsv - 2010 DHS Household Survey Data
    Note that the Dhs* data are available at dhsprogram.com; the files included here contain only the header
    data/in/Towers.csv - Cell tower lat/long coordinates (redacted)
    data/in/*_HourlyCOG.csv - Approximate location of each subscriber (redacted)
    /modeling/data/train/allPredictions.csv - see above
  • output
    /validation/data/out/districts2010.csv
    /validation/data/out/clusters2010.csv
    /validation/data/out/districts2007.csv
    /validation/data/out/clusters2007.csv
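To make the join-and-aggregate step concrete, the sketch below assigns each subscriber location to the nearest DHS cluster by great-circle distance and averages predicted wealth per cluster. The nearest-cluster rule is a simplification chosen for illustration and is not necessarily the join that run_joint.sh performs. The function names and input layouts are hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_wealth_by_cluster(subscribers, clusters):
    """Average predicted wealth of the subscribers nearest to each cluster.

    subscribers: iterable of (lat, lon, predicted_wealth) tuples.
    clusters:    dict mapping cluster id to (lat, lon).
    """
    totals = defaultdict(lambda: [0.0, 0])  # cluster id -> [sum, count]
    for lat, lon, wealth in subscribers:
        nearest = min(clusters, key=lambda cid: haversine_km(lat, lon, *clusters[cid]))
        totals[nearest][0] += wealth
        totals[nearest][1] += 1
    return {cid: total / count for cid, (total, count) in totals.items()}
```

The per-cluster means produced this way can then be compared with the wealth recorded for the same clusters in the DHS data. Aggregating to districts works the same way with district identifiers in place of cluster identifiers.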
