
UNICEF AI4D Relative Wealth Mapping Project - datasets, models, and scripts for building relative wealth estimation models across Southeast Asia (SEA)

Home Page: https://thinkingmachines.github.io/unicef-ai4d-poverty-mapping

License: MIT License



UNICEF AI4D Relative Wealth Project




📜 Description

The UNICEF AI4D Relative Wealth Project aims to develop open datasets and machine learning (ML) models for relative wealth estimation across nine countries in Southeast Asia (SEA).

We also aim to open-source all the scripts, experiments, and other artifacts used to develop these datasets and models, so that others can replicate our work, collaborate with us, and extend it for their own use cases.

This project is part of Thinking Machines' overall push for open science through the AI4D (AI for Development) Research Bank, which aims to accelerate the development and adoption of effective machine learning (ML) models across Southeast Asia.

Documentation covering our methodology and experiments can be found here.



💻 Replicating model training and rollout for a country

Our final trained models, and their use to produce nationwide estimates, can be replicated through our notebooks, assuming you've completed the Data and Local Development setup described below.

All the output files (models, datasets, intermediate files) can be downloaded from here.



📚 Data Setup

DHS Data

Due to the sensitive nature of the data and the DHS program terms of use, we cannot provide the raw DHS data used in our experiments.

You will have to request access to the raw data yourself on the DHS website.

Generally, the experiment notebooks in this repo assume that the contents of the DHS Stata and Shape zip files are unzipped into their own folders under data/dhs/<iso-country-code>/, where <iso-country-code> is the two-letter ISO country code.

For example, the data for the Philippines will have this directory structure:

data/
    dhs/
        ph/
            PHGE71FL/
                DHS_README.txt
                GPS_Displacement_README.txt
                PHGE71FL.cpg
                PHGE71FL.dbf
                PHGE71FL.prj
                PHGE71FL.sbn
                PHGE71FL.sbx
                PHGE71FL.shp
                PHGE71FL.shp.xml
                PHGE71FL.shx
            PHHR71DT/
                PHHR71FL.DCT
                PHHR71FL.DO
                PHHR71FL.DTA
                PHHR71FL.FRQ
                PHHR71FL.FRW
                PHHR71FL.MAP

If you create your own notebook, you are of course free to modify these filepath conventions. Out of the box, however, this is what our notebooks assume.
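For instance, a notebook might resolve these paths as follows (a minimal sketch with a hypothetical helper; the PHGE71FL/PHHR71DT folder names match the example tree above, but your own survey codes will differ):

from pathlib import Path

# Hypothetical helper: build the expected DHS file paths for a country
# from its two-letter ISO code and the unzipped survey folder names.
def dhs_paths(iso_code, shape_dir, stata_dir):
    base = Path("data/dhs") / iso_code
    shp = next((base / shape_dir).glob("*.shp"))  # cluster GPS shapefile
    dta = next((base / stata_dir).glob("*.DTA"))  # household recode Stata file
    return shp, dta

# Example using the Philippines tree shown above:
ph_shp, ph_dta = dhs_paths("ph", "PHGE71FL", "PHHR71DT")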

Night Lights from EOG

The only other data access requirement is the EOG Nightlights Data, which requires registering for an account. The notebooks use these credentials (username and password) to download the nightlights data automatically.
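If you'd rather not type credentials into a notebook cell each session, one option (an assumption, not a repo convention) is to supply them through environment variables:

import os
from getpass import getpass

# Hypothetical convention: stash EOG credentials in environment variables
# so notebook cells can read them without hardcoding secrets.
if "EOG_USER" not in os.environ:
    os.environ["EOG_USER"] = input("EOG username: ")
if "EOG_PASSWORD" not in os.environ:
    os.environ["EOG_PASSWORD"] = getpass("EOG password: ")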

General Dataset Notes

All the other datasets used in this project are publicly available, and the notebooks provide the code necessary to automatically download and cache the data.

Due to the size of the datasets, please make sure you have enough disk space (at least 40-50 GB) to accommodate all the data used in building the models.



⚙️ Local Setup for Development

This repo assumes the use of miniconda for simplicity in installing GDAL.

Requirements

  1. Python 3.9
  2. make
  3. miniconda

🐍 One-time Set-up

Run these steps the very first time you are setting up the project on a machine, to create a local Python environment for this project.

  1. Install miniconda for your environment if you don't have it yet.
wget "https://repo.anaconda.com/miniconda/Miniconda3-latest-$(uname)-$(uname -m).sh"
bash Miniconda3-latest-$(uname)-$(uname -m).sh
  2. Create a local conda env and activate it. This will create a conda env folder in your project directory.
make conda-env
conda activate ./env
  3. Run the one-time set-up make command.
make setup
  4. To test if the setup was successful, run the tests. You should get a message that all the tests passed.
make test

At this point, you should be ready to run all the existing notebooks locally.

📦 Dependencies

Over the course of development, you will likely introduce new library dependencies. This repo uses pip-tools to manage the Python dependencies.

There are two main files involved:

  • requirements.in - contains the high-level requirements; this is what we edit when adding/removing libraries
  • requirements.txt - contains the exact list of Python libraries (including dependencies of the main libraries) your environment needs to run the repo code; compiled from requirements.in

When you add new Python libs, please do the following:

  1. Add the library to the requirements.in file. You may optionally pin the version if you need a particular version of the library.

  2. Run make requirements to compile a new version of the requirements.txt file and update your python env.

  3. Commit both the requirements.in and requirements.txt files so other devs can get the updated list of project requirements.

Note: When you are updating your Python env to pick up library changes from other devs (reflected in an updated requirements.txt file), simply run pip-sync requirements.txt

📜 Documentation

We are using Quarto to maintain the UNICEF AI4D Relative Wealth documentation site.

Here are some quick tips for running Quarto and updating the doc site, assuming you're on Linux.

For other platforms, please refer to Quarto's website.

  • Download:
wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.2.247/quarto-1.2.247-linux-amd64.deb
  • Install:
sudo dpkg -i quarto-1.2.247-linux-amd64.deb
  • Preview the site locally:
quarto preview --port 4444 --no-browser
  • Update the site (must have maintainer role):
quarto publish gh-pages --no-browser
  • Pro-tip: If you are using VS Code as your code editor, install the Quarto extension to make editing and previewing the doc site a lot smoother.

☸️ Running in Docker

We have created a docker image (ghcr.io/butchtm/povmap-jupyter) of the poverty mapping repo for those who want to view the notebooks or roll out the models for new countries and new data (e.g. new nightlights and Ookla years).

To run this docker image, copy and paste the following scripts into your Linux, macOS, or Windows (WSL) terminal:

  • View the notebooks This will run a Jupyter notebook server for browsing the repo's notebooks
curl -s https://raw.githubusercontent.com/thinkingmachines/unicef-ai4d-poverty-mapping/main/localscripts/run-povmap-jupyter-notebook.sh > run-povmap-jupyter-notebook.sh && \
chmod +x run-povmap-jupyter-notebook.sh && \
./run-povmap-jupyter-notebook.sh
  • Country-wide rollout This will run an interactive dialog that rolls out the poverty mapping models for different countries and time periods
curl -s https://raw.githubusercontent.com/thinkingmachines/unicef-ai4d-poverty-mapping/main/localscripts/run-povmap-rollout.sh > run-povmap-rollout.sh && \
chmod +x run-povmap-rollout.sh && \
./run-povmap-rollout.sh
  • Copy rollout to local directory This will copy the rollout notebooks and rollout data into rollout-data and rollout-output-notebooks in your current directory (after running a new rollout)
curl -s https://raw.githubusercontent.com/thinkingmachines/unicef-ai4d-poverty-mapping/main/localscripts/copy-rollout-to-local.sh > copy-rollout-to-local.sh && \
chmod +x copy-rollout-to-local.sh && \
./copy-rollout-to-local.sh

Note: These commands assume that curl is installed; they download the scripts, make them executable, and run them. After the initial download, you can simply rerun the scripts, which will have been saved to your current directory.

Note: The scripts create and use a docker volume named povmap-data, which contains the outputs and caches the public datasets used for generating the features.

Note: Rolling out the notebooks requires downloading EOG nightlights data, so a user ID and password are required, as detailed in the Night Lights section above.

unicef-ai4d-poverty-mapping's People

Contributors

alronlam, ardieorden, butchtm, dependabot[bot], levymedina, tm-jace-peralta, tm-jc-nacpil

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

unicef-ai4d-poverty-mapping's Issues

Add DHS data file path instructions

Encountered a file path error and had to manually navigate and edit the path to get to the right folder level. I think it's best if we provide instructions, or a folder structure guide, for the data/ folder.

data preparation

DHS Data

  • Philippines
  • Cambodia
  • Myanmar
  • Timor Leste

OSM

  • Philippines
  • Cambodia
  • Myanmar
  • Timor Leste
  • Laos
  • Malaysia
  • Thailand
  • Vietnam
  • Indonesia

Ookla

  • Philippines
  • Cambodia
  • Myanmar
  • Timor Leste
  • Thailand
  • Laos
  • Malaysia
  • Vietnam
  • Indonesia

NTL

  • Philippines
  • Cambodia
  • Myanmar
  • Timor Leste
  • Laos
  • Malaysia
  • Thailand
  • Vietnam
  • Indonesia

data collection

DHS Data

  • Philippines
  • Cambodia
  • Myanmar
  • Timor Leste

OSM

  • Philippines
  • Cambodia
  • Myanmar
  • Laos
  • Malaysia
  • Thailand
  • Vietnam
  • Indonesia

Ookla

  • Philippines
  • Cambodia
  • Myanmar
  • Timor Leste
  • Laos
  • Malaysia
  • Vietnam
  • Indonesia

NTL

  • Philippines
  • Cambodia
  • Myanmar
  • Timor Leste
  • Laos
  • Malaysia
  • Thailand
  • Vietnam
  • Indonesia

Outlier cluster in PH data

Hi @butchtm, just wanted to flag an outlier in the dataset. When reading the data straight from the cloud storage bucket, there seems to be an outlier cluster to the far west of the rest of the PH, see below:
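One quick way to surface such a point (a sketch only; the file path and the 115°E cutoff are assumptions for illustration):

import geopandas as gpd

# Hypothetical path to the PH cluster-level output; PH longitudes fall
# roughly between 116E and 127E, so anything far west of that is suspect.
gdf = gpd.read_file("data/outputs/ph_cluster_data.geojson")
suspect = gdf[gdf.geometry.x < 115]
print(suspect)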

Parametrize / automate `process_string` parameter in nightlights module

Hi @butchtm, just wanted to flag this for future improvements. I have a feeling the process_suffix will change if they make changes in the preprocessing stages. It's possible that they will upload new versions with a different process_suffix. In that case, since it's hardcoded, we might not find the correct URL.

Possible Solution

  • could we parametrize make_url to accept a custom process_string? (see the sketch below)

Originally posted by @tm-jc-nacpil in #75 (comment)
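One possible shape for this (a hedged sketch; the actual make_url signature and URL layout in the nightlights module may differ):

# Hypothetical sketch, not the repo's actual code: expose the hardcoded
# suffix as a keyword argument that defaults to the current value, so
# existing callers keep working while new callers can override it.
DEFAULT_PROCESS_STRING = "<current-hardcoded-suffix>"  # placeholder

def make_url(base_url, year, process_string=DEFAULT_PROCESS_STRING):
    # Same URL construction as before, with the suffix now swappable.
    return f"{base_url}/{year}/{process_string}"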

pygeos error in installation step

Hello everyone, I'm getting an error in the make setup step. It points to a JSONDecodeError in the pygeos module. Did anyone else encounter this?

Building wheel for pygeos (pyproject.toml) ... done
Created wheel for pygeos: filename=pygeos-0.12.0-cp39-cp39-linux_x86_64.whl size=466415 sha256=b5825f6c13d1bcc3f760555744d4aea66940b5b3562ce9f722ecd49b973d7c09
Stored in directory: /home/jace/.cache/pip/wheels/67/0f/37/46f75689d7cacbc71c82facc1fa05b4e9b88a557ff89923644
ERROR: Exception:
Traceback (most recent call last):
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/pip/_internal/cli/base_command.py", line 160, in exc_logging_wrapper
status = run_func(*args)
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/pip/_internal/cli/req_command.py", line 247, in wrapper
return func(self, options, args)
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/pip/_internal/commands/install.py", line 450, in run
_, build_failures = build(
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/pip/_internal/wheel_builder.py", line 361, in build
wheel_cache.record_download_origin(cache_dir, req.download_info)
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/pip/_internal/cache.py", line 282, in record_download_origin
origin = DirectUrl.from_json(origin_path.read_text())
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/pip/_internal/models/direct_url.py", line 206, in from_json
return cls.from_dict(json.loads(s))
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
WARNING: Ignoring invalid distribution -umpy (/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages)
WARNING: Ignoring invalid distribution -umpy (/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages)
WARNING: Ignoring invalid distribution -umpy (/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages)
Traceback (most recent call last):
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/bin/pip-sync", line 8, in
sys.exit(cli())
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/piptools/scripts/sync.py", line 174, in cli
sync.sync(
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/piptools/sync.py", line 240, in sync
run( # nosec
File "/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/mnt/c/Users/JCPeralta/Documents/Github/unicef-ai4d-poverty-mapping/env/bin/python3.9', '-m', 'pip', 'install', '-r', '/tmp/tmp6y59070e', '--no-binary', 'pygeos,shapely']' returned non-zero exit status 2.
make: *** [Makefile:14: setup] Error 1

osm.py downloads error

There seems to be a path error in creating the cache for OSM downloads; it might be due to a repeated folder in the target directory string:
OSM POIs for philippines being loaded from ~/.geowrangler/osm/osm/philippines/gis_osm_pois_free_1.shp

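The duplication is easy to reproduce if the cache root already ends in osm and the target path appends osm again (a hypothetical illustration, not the actual repo code):

from pathlib import Path

# Hypothetical illustration of the suspected bug: the cache root already
# ends in "osm", and the target path appends "osm" once more.
cache_root = Path.home() / ".geowrangler" / "osm"
target = cache_root / "osm" / "philippines" / "gis_osm_pois_free_1.shp"
print(target)  # ~/.geowrangler/osm/osm/philippines/gis_osm_pois_free_1.shp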

Implementing DHS cross country data manager

Hi @alronlam! I'd like to sanity-check an approach for returning the cross-country datasets. :D

  • The main idea is to create a class DHSDataManager to help manage the DHS data for multiple countries
  • When we run generate_cluster/household_level_data, it stores the output in the class, indexed by country name
  • If we can store it this way, we can run the usual dhs_data = DHSDataManager.generate_data() pattern while also keeping track of all the countries we've run it on
  • Afterwards, since the data is stored in the class, we can have functions that return the concatenated cross-country datasets based on a list of specified countries. We can also apply the necessary preprocessing of the cross-country output in the same functions

Pseudocode

class DHSDataManager:
    def __init__(self):
        # Initialize storage for processed country datasets
        # separated into household and cluster level
        self.household_data = {
            "ph": <dataframe>,
            "tl": <dataframe>,
            ...    
        }
        self.cluster_data = {
            "ph": <dataframe>,
            "tl": <dataframe>,
            ...    
        }

    def generate_dhs_cluster_level_data(self, <same parameters as before>, return_data=True):
        <same logic here>

        # Store the output gdf into cluster_data
        self.cluster_data[country_name] = gdf

        if return_data:
            return gdf

    def generate_dhs_household_level_data(self, <same parameters as before>, return_data=True):
        <same logic here>

        # Store the output df into household_data
        self.household_data[country_name] = df

        if return_data:
            return df

    def get_cluster_level_data_by_country(self, country_list):
        "Concatenate all cluster level dataframes for each specified country"
        for country in country_list:
            ...
        return concatenated_gdf

    def get_household_level_data_by_country(self, country_list):
        "Concatenate all household level dataframes for each specified country"
        for country in country_list:
            ...
        return concatenated_df

    def recalculate_index(self, country_list):
        "Recalculates wealth index based on the specified countries"
        return df

Ideal usage

# Get cluster level data for four countries using the data manager,
# with the same parameters as the original generate_dhs_cluster_level_data() function.
# This stores the output in the data manager and optionally returns the corresponding dataframe.
dhs_data_manager = DHSDataManager()
dhs_ph_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<ph parameters>, return_data=True)
dhs_tl_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<tl parameters>, return_data=True)
dhs_mm_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<mm parameters>, return_data=True)
dhs_kh_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<kh parameters>, return_data=True)

# After loading the data, the manager can take care of combining the datasets by country
country_cluster_data = dhs_data_manager.get_cluster_level_data_by_country(<list of countries>)

# Similarly for household data
country_household_data = dhs_data_manager.get_household_level_data_by_country(<list of countries>)

# Run PCA and recalculate index accordingly
index = PCA(country_household_data)
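For what it's worth, the storage-and-concat core of this proposal is easy to validate in isolation. Here is a minimal runnable sketch (class and method names abbreviated; the toy dataframes are illustrative only):

import pandas as pd

# Toy sketch of the storage-and-concat pattern proposed above.
class DHSDataManagerSketch:
    def __init__(self):
        self.household_data = {}  # country code -> dataframe

    def add_household_data(self, country, df):
        self.household_data[country] = df

    def get_household_level_data_by_country(self, country_list):
        # Concatenate the stored dataframes, tagging rows by country code.
        available = [c for c in country_list if c in self.household_data]
        return pd.concat(
            [self.household_data[c] for c in available],
            keys=available,
            names=["country", "row"],
        )

manager = DHSDataManagerSketch()
manager.add_household_data("ph", pd.DataFrame({"wealth": [1, 2]}))
manager.add_household_data("tl", pd.DataFrame({"wealth": [3]}))
print(manager.get_household_level_data_by_country(["ph", "tl"]))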

