Predicting poverty and wealth from mobile phone metadata
Joshua Blumenstock, Gabriel Cadamuro, Robert On

Mobile phone data were supplied by an anonymous service provider in Rwanda and are not available for distribution.
All other data and code, including all intermediate data needed to replicate these results and apply these methods in other contexts, are available through the Inter-university Consortium for Political and Social Research (http://doi.org/10.3886/E50592V2). https://www.openicpsr.org/openicpsr/project/100144/version/V5/view
INTRODUCTION
The code and data required to replicate the findings in the paper are organized into three top-level directories:
/modeling: Used to construct features from call records, and to perform supervised learning on the phone survey sample
/validation: Used to aggregate and spatially join phone and DHS data
/analysis: Used to construct most figures and tables
SYSTEM REQUIREMENTS
* Python running sklearn 0.16.1 and graphlab 1.6.1
* JVM
* Spark
* SBT
* Edit paths in run_local.sh
Additional details on individual files, including dependencies, are provided in this readme for convenience:
CODE OVERVIEW
Note: ICPSR does not allow certain file extensions. Affected files are automatically given a .txt extension and should be renamed back to their original extensions.
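The renaming step can be scripted. The sketch below is a minimal illustration, not part of the replication package: restore_extensions is a hypothetical helper, and it assumes ICPSR replaced the original extension (for example fig1.R became fig1.txt) rather than appending to it. The expected file names are taken from the listing in this readme.

```python
from pathlib import Path


def restore_extensions(root, expected_paths):
    """Rename `<stem>.txt` back to each expected path that is missing.

    `expected_paths` are paths relative to `root`, e.g. "analysis/fig1.R".
    Assumes the original extension was replaced by .txt (an assumption).
    Returns the list of (old_path, new_path) renames performed.
    """
    root = Path(root)
    renamed = []
    for rel in expected_paths:
        target = root / rel
        placeholder = target.with_suffix(".txt")
        # Only rename when the expected file is absent and a .txt stand-in exists.
        if not target.exists() and placeholder.exists():
            placeholder.rename(target)
            renamed.append((placeholder, target))
    return renamed
```

Calling restore_extensions(".", ["analysis/fig1.R", "analysis/fig3.R"]) from the top-level directory would rename only those files whose .txt stand-in is present and leave everything else untouched.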
/analysis
/analysis/fig1.R
- used to construct figure 1 scatter plot and ROC curves
- requires: /modeling/data/train/PhoneSurveyFeats.csv
/modeling/data/train/ROC.csv
/analysis/fig3.R
- produces Figure 3 and SM Figure 6
- requires: /validation/data/out/districts2010.csv
/validation/data/out/clusters2010.csv
/analysis/figS2.R
- produces SM Figure 2
- requires: /modeling/data/train/sampleSize.csv
/analysis/figS3.R
- produces SM Figure 3
- requires: /modeling/data/train/feature_r2.csv
- requires: /analysis/vioPlot.R
/analysis/figS8.R
- produces SM Figure 8
- requires: /validation/data/out/districts2010.csv
/validation/data/out/districts2007.csv
/analysis/rwmap/District_Boundary_2012/
/analysis/figS4/code/figS4.R
- produces SM Figure 4
- requires: /modeling/data/train/feature_r2.csv
/analysis/figS4/data/in/trainFeats.csv
/analysis/figS4/data/in/trainObj.csv
/modeling
Note: all python files in /modeling/code/scripts require a link to a config file as the first command line argument
/modeling/code/scripts/PhoneSurveyPreProc.py
- Produces the composite wealth index using PCA
- requires /modeling/data/misc/districtWeights.csv
survey population weights
- output
/modeling/data/preprocessing/survey_weights.csv
/modeling/data/preprocessing/survey_means.csv
/modeling/data/preprocessing/survey_stds.csv
PCA vectors, used to construct projections later
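As a rough illustration of how a composite index can be built with PCA, the sketch below standardizes each survey column and takes the first principal component as the index weights. It is a minimal, pure-Python sketch under stated assumptions, not the code in PhoneSurveyPreProc.py: the function names are hypothetical, survey population weights are ignored, and power iteration stands in for a library PCA routine. The returned means, standard deviations and weights play the role this readme describes for the preprocessing outputs.

```python
import math
import statistics


def fit_wealth_index(rows, n_iter=200):
    """Fit a one-component PCA index on rows of numeric survey responses.

    Returns (means, stds, weights); weights is the first principal
    component of the standardized data, found by power iteration.
    """
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]  # assumes no constant column
    z = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    n, k = len(rows), len(cols)
    # Covariance of standardized columns, i.e. the correlation matrix.
    cov = [[sum(r[i] * r[j] for r in z) / n for j in range(k)] for i in range(k)]
    w = [1.0 / math.sqrt(k)] * k  # uniform start; fixes the sign of the result
    for _ in range(n_iter):
        w_new = [sum(cov[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w_new))
        w = [x / norm for x in w_new]
    return means, stds, w


def project(row, means, stds, weights):
    """Project one response vector onto the fitted component."""
    return sum(wt * (v - m) / s for v, m, s, wt in zip(row, means, stds, weights))
```

Saving the means, standard deviations and weights separately is what allows the same projection to be applied later to other respondents without refitting.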
/modeling/code/scripts/FeatureGenerator.py
Performs feature engineering on the original CDR files using graphlab. The DFA structure is specified in the config file
- requires /modeling/data/cdr/domestic_cdr.csv
/modeling/data/cdr/sms_cdr.csv
/modeling/data/cdr/intl_cdr.csv
Original call detail records (redacted)
/modeling/data/cdr/allIds.csv
List of all ids appearing in any CDR (redacted)
/modeling/code/libs/AggregateFeatureLib.py
/modeling/code/libs/FeatureLib.py
/modeling/code/libs/LocalNetworkLib.py
- output
/modeling/data/train/PhoneSurveyFeats.csv
contains survey responses and CDR features for all survey respondents, including actual and predicted wealth composite (redacted).
/modeling/data/train/PhoneSurveyFeatsHash.csv
same as above, but includes original data and masks the call metrics for each subscriber.
Can be used for replication of e.g. Fig 1.
/modeling/data/features/finalFeatures.csv
CDR Features for all subscribers in Rwanda (redacted)
/modeling/code/scripts/TrainModel.py
Fits supervised learning models using cross-validation on the training set and generates out-of-sample predictions. Accepts command-line arguments to generate additional analysis files used to construct figures (the required argument is listed next to each output file)
- requires /modeling/data/train/PhoneSurveyFeats.csv
/modeling/data/train/PhoneSurveyFeatsHash.csv
/modeling/data/features/finalFeatures.csv
/modeling/data/aux/districtWeights.csv
/modeling/code/libs/TrainLib.py
/modeling/code/libs/FeatureNamerLib.py
- output
/modeling/data/train/allPredictions.csv
Predicted values of wealth and assets for all mobile subscribers
/modeling/data/train/_roc.csv [with arg = "roc"]
Data used to generate ROC curves, based on predictions
/modeling/data/train/feature_r2.csv [with arg = "singler2"]
Bivariate R2 values from a regression of wealth index on each individual CDR feature. Reports AUC values for binary assets
/modeling/data/train/sampleSize.csv [with arg = "samplesize"]
Model performance as a function of sample size
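For a single-feature linear regression with an intercept, R2 equals the squared Pearson correlation between the feature and the outcome. The sketch below shows how a table like feature_r2.csv could be computed for the continuous wealth index. It is illustrative only and not the code in TrainModel.py: the function names and inputs are hypothetical, and the AUC values reported for binary assets are not covered.

```python
import statistics


def bivariate_r2(y, x):
    """R2 of an OLS regression of y on a single feature x (with intercept)."""
    my, mx = statistics.fmean(y), statistics.fmean(x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature or outcome explains no variance
    return sxy * sxy / (sxx * syy)


def feature_r2_table(wealth, features):
    """Return (feature_name, R2) pairs, sorted from most to least predictive.

    `features` maps a feature name to a list of values aligned with `wealth`.
    """
    table = [(name, bivariate_r2(wealth, vals)) for name, vals in features.items()]
    return sorted(table, key=lambda item: item[1], reverse=True)
```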
/validation/
/validation/code/run_joint.sh
Joins CDR predictions with DHS data to validate the predictions. The script first reads in asset and wealth predictions from the CDR data and the DHS data, then reads location data from CDR towers and DHS cluster locations. It performs a spatial join of wealth imputed from CDR locations to wealth recorded in DHS locations, and aggregates the results to the district and cluster level
- requires: data/in/DhsRwandaClusters0708.csv - 2007 DHS Cluster Data
data/in/DhsRwandaClusters10.csv - 2010 DHS Cluster Data
data/in/DhsRwandaHH0708.tsv - 2007 DHS Household Survey Data
data/in/DhsRwandaHH10.tsv - 2010 DHS Household Survey Data
Note that the Dhs* data are available at dhsprogram.com; these files contain only the header
data/in/Towers.csv - Cell tower lat/long coordinates (redacted)
data/in/*_HourlyCOG.csv - Approximate location of each subscriber (redacted)
modeling/data/train/allPredictions.csv - see above
- output
/validation/data/out/districts2010.csv
/validation/data/out/clusters2010.csv
/validation/data/out/districts2007.csv
/validation/data/out/clusters2007.csv
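The spatial join and aggregation described for run_joint.sh can be pictured with a small sketch. The version below assigns each subscriber location to the nearest DHS cluster by great-circle distance and averages predicted wealth per cluster. This is an assumption made for illustration: the actual join rule used by the Spark job is not stated in this readme, and the function names and input shapes are hypothetical.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def aggregate_to_clusters(subscribers, clusters):
    """Average predicted wealth over the subscribers nearest to each cluster.

    subscribers: iterable of (lat, lon, predicted_wealth)
    clusters: dict mapping cluster id -> (lat, lon)
    Nearest-cluster assignment is an illustrative assumption.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lat, lon, wealth in subscribers:
        nearest = min(clusters, key=lambda c: haversine_km(lat, lon, *clusters[c]))
        sums[nearest] += wealth
        counts[nearest] += 1
    return {c: sums[c] / counts[c] for c in sums}
```

The resulting per-cluster means could then be compared with the wealth recorded for the same clusters in the DHS data, and the same grouping idea applies at the district level.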