predicting-poverty's People

Contributors

brunosan, nealjean, sangmichaelxie, wmadavis

predicting-poverty's Issues

How to create the trained CNN model?

Hi Neal,
I am trying to replicate this work to predict poverty in another country. However, you have already provided the trained CNN (predicting_poverty_trained.caffemodel), which extract_features.py uses to extract 4096 image features for each cluster. Since I want to build a model for another country using training images from that country, I would like to know how you built the trained CNN model.

  1. Do we train the model to perform a classification task, since I noticed you used a SOFTMAX in the last layer? What is the label for each image? (I don't think we have labels or classes for the images.)
  2. Is there any reason why you use the features from the conv7 layer?

Thank you so much for your help. Looking forward to hearing from you.
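For context, the document's other issues confirm that the training step is a transfer-learning classification task: the CNN is fine-tuned to predict binned nightlight intensity (a proxy label) for each daytime image, which is why the last layer is a softmax. Below is a minimal sketch in PyTorch rather than the original Caffe; the VGG16 backbone, the 3-class bin count, and the hyperparameters are illustrative stand-ins, not the repo's actual configuration.

import torch
import torch.nn as nn
import torchvision.models as models

# Fine-tune an ImageNet-pretrained CNN to classify daytime images into
# nightlight-intensity bins; the 4096-d penultimate layer then serves as
# the feature extractor (the repo's conv7 plays this role in Caffe).
model = models.vgg16(weights='IMAGENET1K_V1')
model.classifier[6] = nn.Linear(4096, 3)           # 3 nightlight bins (illustrative)
criterion = nn.CrossEntropyLoss()                  # softmax classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Dummy batch standing in for (daytime image, nightlight bin) pairs.
dataloader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 3, (8,)))]
for images, nightlight_bins in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(images), nightlight_bins)
    loss.backward()
    optimizer.step()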

Why not getting good results?

Hi Neal,
Why am I not getting very good results? For the following figure, I chose the first 800 images from candidate_download_locs.txt, extracted features from them, and generated Figure 3.

[figure: nigeria]

For the next figure, I used the cluster lat/lon directly to download images, then extracted features and generated Figure 3.

[figure: nigeria1]

Could you please help me out? It would be highly appreciated.

Is the satellite imagery georeferenced?

Hey, I am looking into using satellite imagery to predict economic activity. I saw previous questions about how the images are downloaded. I just wanted to ask: are your images georeferenced?

400*400 pixel daytime image

Hey, I saw in one of the issues about the watermark at the bottom that you downloaded a slightly larger image. May I know what pixel size you used to download the images? And did it affect the square-km area of the images downloaded?
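For illustration, the workaround referenced above can be as simple as downloading a slightly taller tile and cropping the watermark strip off the bottom. A sketch with PIL; the file names and the 400x425 size are assumptions, not the repo's actual values.

from PIL import Image

# Download a slightly taller tile (e.g. 400x425), then crop away the
# bottom rows that carry the watermark, leaving a clean 400x400 image.
img = Image.open('tile_400x425.png')
img.crop((0, 0, 400, 400)).save('tile_400x400.png')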

Issues in ProcessSurveyData.R and the README

Hi! I am trying to reproduce your research to learn more about applied machine learning with satellite imagery. I ran into a few issues I thought you might want to hear about:

First, on line 131 of ProcessSurveyData.R (for Malawi), the nl function is called with two arguments, but I get an unused-argument error:

Error in nl(., mwi13.vars, 2013) : unused argument (2013)
In addition: There were 15 warnings (use warnings() to see them)

My guess from the other code is that this function used to take multiple parameters, and that the vars argument is no longer needed and should be removed:

nl(mwi13.vars, 2013) -> nl(2013)

Second, in the README.md, you mention that the Tanzania data from LSMS should be relabeled to DATA:

  3. Unzip these files so that **data/input/LSMS** contains the following folders of data:
       1. UGA_2011_UNPS_v01_M_STATA
       2. DATA (formerly TZA_2012_LSMS_v01_M_STATA_English_labels before a re-upload in January 2016)
       3. NGA_2012_LSMS_v03_M_STATA
       4. MWI_2013_IHPS_v01_M_STATA

But in the code in ProcessSurveyData.R you have "DATA" as the directory for the Nigeria data:

## Nigeria ##
nga13.cons <- read.dta('data/input/LSMS/DATA/cons_agg_w2.dta') %$%
  data.frame(hhid = hhid, cons = pcexp_dr_w2/365)
nga13.cons$cons <- nga13.cons$cons*110.84/(79.53*100)
nga13.geo <- read.dta('data/input/LSMS/DATA/Geodata Wave 2/NGA_HouseholdGeovars_Y2.dta')
nga13.coords <- data.frame(hhid = nga13.geo$hhid, lat = nga13.geo$LAT_DD_MOD, lon = nga13.geo$LON_DD_MOD)
nga13.rururb <- data.frame(hhid = nga13.geo$hhid, rururb = nga13.geo$sector, stringsAsFactors = F)
nga13.weight <- read.dta('data/input/LSMS/DATA/HHTrack.dta')[,c('hhid', 'wt_wave2')]
names(nga13.weight)[2] <- 'weight'
nga13.phhh8 <- read.dta('data/input/LSMS/DATA/Post Harvest Wave 2/Household/sect8_harvestw2.dta')
nga13.room <- data.frame(hhid = nga13.phhh8$hhid, room = nga13.phhh8$s8q9)
nga13.metal <- data.frame(hhid = nga13.phhh8$hhid, metal = nga13.phhh8$s8q7=='IRON SHEETS')
nga13.elev <- raster('data/input/DIVA-GIS/NGA_alt.gri') %>%
  extract(., nga13.coords[,c('lon', 'lat')]) %>%
  data.frame(hhid = nga13.coords$hhid, elev = .) %>% na.omit()

Which should be fixed: the code or the README?

Different output file names in different scripts

Hi!

The script 'extract_features.py' stores the CNN features and other aspects of the model as 'conv_features.npy' and 'image_counts.npy'. But the functions 'load_country_lsms' and 'load_country_dhs' in 'fig_utils.py' look for 'cluster_conv_features.npy' and 'cluster_image_counts.npy', which are not generated at any other point in the workflow. The same goes for 'nightlights.npy', 'consumptions.npy', and 'households.npy'. Am I missing something here, or are both scripts supposed to refer to the same files?

Thanks
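If the two scripts are indeed meant to refer to the same arrays, a small shim like the following (a guess, not a confirmed fix) would bridge the naming gap until the repo is updated:

import os

# Map extract_features.py output names onto the names fig_utils.py expects.
for src, dst in [('conv_features.npy', 'cluster_conv_features.npy'),
                 ('image_counts.npy', 'cluster_image_counts.npy')]:
    if os.path.exists(src) and not os.path.exists(dst):
        os.rename(src, dst)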

fig D - Tanzania differs

Hi Neal,

Maybe not an issue, just a note.
After replicating Fig. 1, I noticed that in the original paper, panel D for Tanzania has a much different shape than the other countries. This could be caused by not enough data points in the higher-consumption segment; actually, it looks like much more data is now available for Tanzania for that period.

My figure for this country looks similar to Uganda and Nigeria; I can supply it if needed. I used the same data, downloaded these days.

Btw, nice work, regards
Tom

Out-of-sample/cluster Prediction

I've succeeded in replicating your results (great work, by the way), but I'm now trying to predict consumption/assets in places outside the original DHS/LSMS clusters, i.e. out-of-sample predictions from additional satellite imagery of non-DHS/LSMS locations. I can see from extract_features.py that features are estimated for every image before being aggregated to the cluster level, so this should be feasible. But I'm unsure how to use these image-specific features in the regression model produced in fig_utils.py, partly because everything is coded at the cluster level (reflecting the available level of the DHS/LSMS data), and partly because I'm not familiar with the cross-validation approach. How would you advise applying the regression model to make predictions for individual images? Thanks,
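One plausible approach (a sketch, not the authors' confirmed workflow): fit the cluster-level ridge regression on the saved feature and consumption arrays, then apply the fitted model to features extracted from the new images. The file new_conv_features.npy below is hypothetical, standing in for per-image features from extract_features.py run on the new locations.

import numpy as np
from sklearn.linear_model import RidgeCV

X_train = np.load('cluster_conv_features.npy')   # (n_clusters, 4096)
y_train = np.log(np.load('consumptions.npy'))    # paper regresses log consumption
X_new = np.load('new_conv_features.npy')         # (n_images, 4096), hypothetical

model = RidgeCV(alphas=np.logspace(-3, 3, 7))    # cross-validated ridge penalty
model.fit(X_train, y_train)
pred = model.predict(X_new)  # per-image predictions; average within an area if needed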

pixel2coord in get_image_download.py

Hi Neal,

My name is Kishen. I was looking at your code.

In the pixel2coord function in scripts/get_image_download.py: every pixel spans some range of latitude and longitude. Shouldn't you use the mean (center) of the pixel's latitude and longitude? Otherwise the result corresponds to a ~0.45 km shift on the ground.
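For reference, here is a pixel-center version of such a conversion using a standard GDAL geotransform; this is a generic sketch, not the repo's exact code.

def pixel2coord_center(gt, col, row):
    # gt is a GDAL geotransform; the +0.5 offsets map to the pixel's
    # center rather than its corner, avoiding the ~half-pixel shift
    # (~0.45 km for 1 km pixels) described above.
    lon = gt[0] + (col + 0.5) * gt[1] + (row + 0.5) * gt[2]
    lat = gt[3] + (col + 0.5) * gt[4] + (row + 0.5) * gt[5]
    return lon, lat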

Training data and testing data for Caffe Model

Greetings Authors,
I have a few questions about the training data set for the Caffe model. After going over your code, it seems that all modified coordinates from the LSMS were used to create downloaded_locs.txt, and downloaded_locs.txt is used by extract_features.py, so those locations become the test set before being used in the regression. To simplify: were the training set and the test set mutually exclusive? If yes, is it possible for you to share the training coordinates for each country?

Thanks,
Vinit

Per pixel area of night time light?

The nightlight rasters are very large and cannot be viewed in a normal photo editor. What is the resolution of the nightlight images, and how much area does each pixel cover (per-pixel area of the nightlights)?

Thanks in advance
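For what it's worth, the DMSP-OLS composites (the 0-62 digital-number rasters referenced elsewhere in these issues) are on a 30 arc-second grid, so per-pixel ground area is roughly 0.86 km² at the equator and shrinks with latitude. A quick back-of-the-envelope check:

import math

# 30 arc-seconds in degrees, times ~111.32 km per degree of latitude;
# east-west extent scales with cos(latitude).
def pixel_area_km2(lat_deg, arcsec=30.0):
    deg = arcsec / 3600.0
    height_km = deg * 111.32
    width_km = height_km * math.cos(math.radians(lat_deg))
    return width_km * height_km

print(pixel_area_km2(0.0))   # ~0.86 km^2 at the equator
print(pixel_area_km2(9.0))   # ~0.85 km^2 around Nigeria's latitude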

n*4096 features

Hey
So I have extracted the n*4096 features from the satellite images. I was wondering: are all of the features meaningful for your poverty measure, and from the output, do you know which feature represents what?

Also, in one of your YouTube videos I saw that the output you refer to is a linear combination of the extracted features. Is that a summation of the features or something else?
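To make the "linear combination" concrete: the prediction is a weighted sum of the 4096 features plus an intercept, with the weights learned by regression; it is not an unweighted summation. A toy illustration with dummy arrays:

import numpy as np

features = np.random.rand(4096)              # one image's CNN features
weights, bias = np.random.rand(4096), 0.1    # learned by (ridge) regression
prediction = features @ weights + bias       # weighted sum, not a plain sum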

Object "pcexp_dr_w2" not found

Hello,

I've tried running the ProcessSurveyData.R script, but I keep getting the following error:

Error in data.frame(hhid = hhid, cons = pcexp_dr_w2/365) : object 'pcexp_dr_w2' not found

The error comes from the following part of the code:

nga13.cons <- read.dta('./data/input/LSMS/DATA/cons_agg_wave2_visit2.dta') %$%
  data.frame(hhid = hhid, cons = pcexp_dr_w2/365)
nga13.cons$cons <- nga13.cons$cons*110.84/(79.53*100)
nga13.geo <- read.dta('./data/input/LSMS/DATA/Geodata Wave 2/NGA_HouseholdGeovars_Y2.dta')

How can I fix the code?

the 2nd step

Hey, so for the second training step, nightlights are used. Are they in image form, or values ranging from 0 to 62?
Also, the second training step predicts nightlight intensities from the daytime images. Does that mean you also derive a new data set of predicted light-intensity values from your training, along with the third row of images in Figure 2?
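On the 0-62 question: the DMSP digital numbers are scalar intensities per pixel, and for the classification step they can be binned into a small number of intensity classes. A sketch; the cutoffs here are illustrative, not the repo's actual values.

import numpy as np

def nightlight_class(dn, cutoffs=(3, 35)):
    # 0 = low, 1 = medium, 2 = high nightlight intensity
    return int(np.digitize(dn, cutoffs))

print(nightlight_class(0), nightlight_class(10), nightlight_class(60))  # 0 1 2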

Missing Model Weights - FOUND

The saved weights for the trained model are missing. In extract_features.py (see here) we expect the weights file to be located at ../model/predicting_poverty_trained.caffemodel, but such a file does not exist in the repo.

Downloading images from Google Map API at correct coordinates

Greetings Authors,
Thanks for sharing your code. As the README in the repository mentions, each line of candidate_download_locs.txt has the form [image_lat] [image_long] [cluster_lat] [cluster_long], and these coordinates give locations for downloading 1 km x 1 km RGB satellite images of size 400x400 pixels. In the context of using the Google Maps API to download images, I assumed that [image_lat], [image_long], [cluster_lat], and [cluster_long] were the rectangular coordinates of the geometry object used to download the 400x400 image, i.e. top-left corner = ([image_lat], [image_long]) and bottom-right corner = ([cluster_lat], [cluster_long]). To verify this assumption I used the haversine distance formula, but I obtained areas greater than 25 km² in some cases. So now I am assuming that you instead took 1 km x 1 km patches around ([image_lat], [image_long]), i.e. treated ([image_lat], [image_long]) as the center point. Is this what you did, or was some other method used?
Thank you,
Vinit
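A quick way to sanity-check the center-point assumption is to build the 1 km x 1 km box around ([image_lat], [image_long]) and verify its corners with the haversine formula, as the question does. A small sketch using approximate spherical-Earth constants; the example coordinates are arbitrary.

import math

def box_around(lat, lon, half_km=0.5):
    # Corners of a ~1 km x 1 km box centered on (lat, lon).
    dlat = half_km / 111.32
    dlon = half_km / (111.32 * math.cos(math.radians(lat)))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

print(box_around(9.06, 7.49))  # example point near Abuja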

Deriving specific filtered images

Hi,
I have run the extract-features script on some images to derive the convolutional features and the filtered images. I was able to derive the n*4096 array conv_features.npy and the 64 filtered images. But I see from Figure 2 of your paper that you identified specific convolutional filters. I was wondering if you ran a separate script to identify a particular convolutional feature such as roads, buildings, concrete structures, etc. In particular, is it possible to extract values (in a tabular format) that measure the total extent of particular features in an image? For example, out of the total number of pixels, X pixels have the features of concrete structures, Y pixels are roads, etc.
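As far as the repo shows, there is no script that labels filters as "roads" or "buildings"; those interpretations come from inspecting what each filter responds to. If a crude tabular measure is wanted anyway, one option is to threshold a filter's activation map and count the fraction of responding pixels. A sketch with a dummy activation map; the threshold and helper name are illustrative.

import numpy as np

def active_pixel_fraction(activation_map, thresh=0.5):
    # Normalize to [0, 1], then count pixels above the threshold.
    a = activation_map / (activation_map.max() + 1e-8)
    return float((a > thresh).mean())

amap = np.random.rand(100, 100)      # stand-in for one filter's activations
print(active_pixel_fraction(amap))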

How to actually download satellite images?

Hi there!

Thanks for the detailed description of how to get and process the data. It seems to me that the only missing piece is how to actually download the satellite images (and where to download them from). Is it possible to do that automatically using GDAL? It would be wonderful if you could share the script you used to retrieve the images.

Thanks,
Maruan
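As the question notes, the download step isn't included in the repo. One common approach is the Google Static Maps API (subject to its terms of use); a minimal sketch, assuming zoom 16 as an approximation of the 1 km / 400 px scale near the equator:

import requests

def download_tile(lat, lon, api_key, out_path):
    # Fetch a 400x400 satellite tile centered on (lat, lon); at zoom 16
    # this covers roughly 1 km x 1 km near the equator.
    params = {'center': '%f,%f' % (lat, lon), 'zoom': 16,
              'size': '400x400', 'maptype': 'satellite', 'key': api_key}
    r = requests.get('https://maps.googleapis.com/maps/api/staticmap',
                     params=params)
    r.raise_for_status()
    with open(out_path, 'wb') as f:
        f.write(r.content)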

Problem replicating results after using extract_features.py

Hello,

I'm having some trouble replicating the figures after running extract_features.py myself, and am getting results like those attached below for the Figure 3 cluster-level consumptions:

[screenshots: four Figure 3 cluster-level consumption plots, 2017-12-12]

I'm pretty sure I have followed all the steps correctly, unless I missed something. Do you have any idea what I may have done wrong?

Thanks!

trainable?

Is this model trainable? I want to train the model further on some other data.

Training Data

Hi,
In the third step of predicting poverty, we require the survey data along with the corresponding extracted features. If so, then we need some sort of corresponding training data for the images in order to predict poverty. So how can we calculate a poverty or economic measure just from the daytime images? I don't know if I am missing something. My question is: how can we predict values of poverty or economic activity from the daytime images without any survey or training data?

Out of sample training

Hey, I was able to train for some countries, replicating your work. Now I want to do some out-of-sample predictions. I see you used countries with similar characteristics for the out-of-sample prediction. Do you think we can use a model trained on a country that is very different in terms of economic development? For instance, using a model trained on, say, the Netherlands to do out-of-sample prediction for Nigeria?
