[Data & Code] Predicting poverty and wealth from mobile phone metadata

## Predicting poverty and wealth from mobile phone metadata
 
Joshua Blumenstock, Gabriel Cadamuro, Robert On
 
![image](https://user-images.githubusercontent.com/543384/89919942-73f0f900-dc2e-11ea-80aa-a26c6e3f84d9.png)

Mobile phone data were supplied by an anonymous service provider in Rwanda and are not available for distribution. 

**All other data and code**, including all intermediate data needed to replicate these results and apply these methods in other contexts, are available through the Inter-university Consortium for Political and Social Research (http://doi.org/10.3886/E50592V2).  https://www.openicpsr.org/openicpsr/project/100144/version/V5/view


INTRODUCTION

The code and data required to replicate the findings in the paper are organized into three top-level directories:
	/modeling: Used to construct features from call records, and to perform supervised learning on the phone survey sample
	/validation: Used to aggregate and spatially join phone and DHS data
	/analysis: Used to construct most figures and tables

SYSTEM REQUIREMENTS
	* Python running sklearn 0.16.1 and graphlab 1.6.1
	* JVM
	* Spark
	* SBT
	* Edit paths in run_local.sh
	* Additional details on individual files, including dependencies, are provided below for convenience.

CODE OVERVIEW
	Note: ICPSR does not allow certain file extensions; these are automatically replaced by .txt and should be changed back.

/analysis
/analysis/fig1.R 
 - used to construct figure 1 scatter plot and ROC curves
 - requires: 	/modeling/data/train/PhoneSurveyFeats.csv
				/modeling/data/train/<asset>ROC.csv 

/analysis/fig3.R
  - produces Figure 3 and SM Figure 6
  - requires: 	/validation/data/out/districts2010.csv
				/validation/data/out/clusters2010.csv

/analysis/figS2.R
  - produces SM Figure 2
  - requires: 	/modeling/data/train/sampleSize.csv

/analysis/figS3.R
  - produces SM Figure 3
  - requires: 	/modeling/data/train/feature_r2.csv
  - requires: 	/analysis/vioPlot.R

/analysis/figS8.R
  - produces SM Figure 8
  - requires: 	/validation/data/out/districts2010.csv
                /validation/data/out/districts2007.csv
				/analysis/rwmap/District_Boundary_2012/


/analysis/figS4/code/figS4.R
  - produces SM Figure 4
  - requires: 	/modeling/data/train/feature_r2.csv
                /analysis/figS4/data/in/trainFeats.csv
				/analysis/figS4/data/in/trainObj.csv

/modeling
Note: all Python files in /modeling/code/scripts require the path to a config file as the first command-line argument

/modeling/code/scripts/PhoneSurveyPreProc.py
 - Produces the composite wealth index using PCA
 - requires		/modeling/data/misc/districtWeights.csv
					survey population weights 
 - output
				/modeling/data/preprocessing/survey_weights.csv
				/modeling/data/preprocessing/survey_means.csv
				/modeling/data/preprocessing/survey_stds.csv
					PCA vectors, used to construct projections later
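
A minimal sketch of how a PCA-based composite index of this kind can be built and later re-applied to new rows. It is not the code in PhoneSurveyPreProc.py: the function names `fit_wealth_index` and `project` are hypothetical, the use of NumPy's SVD is an assumption, and the three returned arrays only loosely mirror the means, standard deviations, and weights files listed above.

```python
import numpy as np


def fit_wealth_index(responses):
    """Fit a first-principal-component index (hypothetical helper).

    responses: 2-D array-like, one row per respondent, one column per item.
    Returns (means, stds, weights) so that new rows can be projected later.
    """
    X = np.asarray(responses, dtype=float)
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    stds[stds == 0] = 1.0  # guard against constant columns
    Z = (X - means) / stds
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    weights = vt[0]
    # The sign of a principal component is arbitrary; fix it for stability.
    if weights.sum() < 0:
        weights = -weights
    return means, stds, weights


def project(responses, means, stds, weights):
    """Apply stored means, stds, and weights to get index scores."""
    X = np.asarray(responses, dtype=float)
    return ((X - means) / stds) @ weights


# Toy survey with two perfectly correlated items (made-up values).
toy = [[1, 2], [2, 4], [3, 6], [4, 8]]
means, stds, weights = fit_wealth_index(toy)
scores = project(toy, means, stds, weights)
```

Storing the standardization constants alongside the weights is what allows the same projection to be applied to rows that were not part of the fit.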

/modeling/code/scripts/FeatureGenerator.py
	Performs feature engineering on original CDR files using graphlab. DFA structure specified in config file
 - requires		/modeling/data/cdr/domestic_cdr.csv
				/modeling/data/cdr/sms_cdr.csv
				/modeling/data/cdr/intl_cdr.csv
					Original call detail records (redacted)
				/modeling/data/cdr/allIds.csv
					List of all ids appearing in any CDR (redacted)
				/modeling/code/libs/AggregateFeatureLib.py
				/modeling/code/libs/FeatureLib.py
				/modeling/code/libs/LocalNetworkLib.py
 - output
				/modeling/data/train/PhoneSurveyFeats.csv 
					contains survey responses and CDR features for all survey respondents, including actual and predicted wealth composite (redacted).

### /modeling/data/train/PhoneSurveyFeatsHash.csv
			
Same as PhoneSurveyFeats.csv above, but includes the original data and masks the call metrics for each subscriber.

**Can be used for replication of e.g. Fig 1.**

				/modeling/data/features/finalFeatures.csv
					CDR Features for all subscribers in Rwanda (redacted)

/modeling/code/scripts/TrainModel.py
	Fits supervised learning models using cross-validation on the training set and generates out-of-sample predictions. Command-line arguments produce the additional analysis files used to construct figures (the required argument is listed next to each output file).
 - requires 	/modeling/data/train/PhoneSurveyFeats.csv 
				/modeling/data/train/PhoneSurveyFeatsHash.csv
				/modeling/data/features/finalFeatures.csv
				/modeling/data/aux/districtWeights.csv
				/modeling/code/libs/TrainLib.py
				/modeling/code/libs/FeatureNamerLib.py
 - output
				/modeling/data/train/allPredictions.csv
					Predicted values of wealth and assets for all mobile subscribers 
				/modeling/data/train/<asset>_roc.csv [with arg = "roc"]
					Data used to generate ROC curves, based on predictions
				/modeling/data/train/feature_r2.csv [with arg = "singler2"]
					Bivariate R2 values from a regression of wealth index on each individual CDR feature. Reports AUC values for binary assets
				/modeling/data/train/sampleSize.csv [with arg = "samplesize"]
					Model performance as a function of sample size
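
To illustrate what a bivariate R2 per feature means, here is a small sketch that regresses a wealth index on one feature at a time. It is an illustration only, not the logic in TrainModel.py: `bivariate_r2`, `feature_r2_table`, and the feature names in the example are hypothetical, and the AUC calculation for binary assets is not shown.

```python
def bivariate_r2(x, y):
    """R^2 of a least-squares regression of y on a single feature x.

    With an intercept, this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains no variance
    return (sxy * sxy) / (sxx * syy)


def feature_r2_table(features, wealth):
    """Map each feature name to its bivariate R^2 against the wealth index."""
    return {name: bivariate_r2(values, wealth) for name, values in features.items()}


# Made-up example values; the feature names are placeholders.
wealth = [1.0, 2.0, 3.0, 4.0]
features = {
    "calls_per_day": [2.0, 4.0, 6.0, 8.0],
    "constant_feature": [1.0, 1.0, 1.0, 1.0],
}
table = feature_r2_table(features, wealth)
```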

/validation/
/validation/code/run_joint.sh
	Joins CDR predictions with DHS data to validate the predictions. The script first reads asset and wealth predictions from the CDR data and the DHS data, then reads location data for CDR towers and DHS clusters. It performs a spatial join of wealth imputed at CDR locations to wealth recorded at DHS locations, and aggregates the results to the district and cluster levels.
 - requires:	data/in/DhsRwandaClusters0708.csv - 2007 DHS Cluster Data
				data/in/DhsRwandaClusters10.csv - 2010 DHS Cluster Data
				data/in/DhsRwandaHH0708.tsv - 2007 DHS Household Survey Data
				data/in/DhsRwandaHH10.tsv - 2010 DHS Household Survey Data
					Note that Dhs* data are available at dhsprogram.com; these files contain only the header
				data/in/Towers.csv - Cell tower lat/long coordinates (redacted)
				data/in/*_HourlyCOG.csv - Approximate location of each subscriber (redacted)
				modeling/data/train/allPredictions.csv - see above
 - output
				/validation/data/out/districts2010.csv
				/validation/data/out/clusters2010.csv
				/validation/data/out/districts2007.csv
				/validation/data/out/clusters2007.csv
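
The join-and-aggregate step described for run_joint.sh can be pictured with the sketch below. It is a simplified stand-in, not the actual job: assigning each subscriber location to the nearest DHS cluster by great-circle distance is an assumption made for illustration, and the function names, cluster ids, and coordinates are hypothetical.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def mean_wealth_by_cluster(subscribers, clusters):
    """Average predicted wealth per cluster.

    subscribers: list of (lat, lon, predicted_wealth) tuples.
    clusters: dict mapping cluster id to (lat, lon).
    Each subscriber is assigned to the nearest cluster (an assumed rule).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for lat, lon, wealth in subscribers:
        nearest = min(clusters, key=lambda cid: haversine_km(lat, lon, *clusters[cid]))
        totals[nearest] += wealth
        counts[nearest] += 1
    return {cid: totals[cid] / counts[cid] for cid in totals}


# Made-up coordinates and predictions for illustration.
clusters = {"A": (-1.95, 30.06), "B": (-2.60, 29.74)}
subscribers = [(-1.94, 30.05, 1.0), (-1.96, 30.07, 3.0), (-2.61, 29.75, -1.0)]
cluster_means = mean_wealth_by_cluster(subscribers, clusters)
```

Cluster-level averages produced this way could then be compared against the wealth recorded in the DHS data for the same clusters.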
