intersect-community-data's Introduction

Repository Title: Intersect Community Data

Point of Contact: Nathanael Rosenheim: [email protected]

Required Citation:

Rosenheim, Nathanael, Roberto Guidotti, Paolo Gardoni & Walter Gillis Peacock. (2021). Integration of detailed household and housing unit characteristic data with critical infrastructure for post-hazard resilience modeling. Sustainable and Resilient Infrastructure. 6(6), 385-401. https://doi.org/10.1080/23789689.2019.1681821

Rosenheim, Nathanael. (2022). "Detailed Household and Housing Unit Characteristics: Data and Replication Code." [Version 2] DesignSafe-CI. https://doi.org/10.17603/ds2-jwf6-s535

Project Description

People are the most important part of community resilience planning. However, models for community resilience planning tend to focus on buildings and infrastructure. This repository provides a solution that connects people to buildings for community resilience models. The housing unit inventory method transforms aggregated population data into disaggregated housing unit data that includes occupied and vacant housing unit characteristics. Detailed household characteristics include size, race, ethnicity, income, group quarters type, vacancy type, and census block. Applications use the housing unit allocation method to assign the housing unit inventory to structures within each census block through a reproducible and randomized process. The benefits of the housing unit inventory include community resilience statistics that intersect detailed population characteristics with hazard impacts on infrastructure; uncertainty propagation; and a means to identify gaps in infrastructure data such as limited building data. This repository includes all of the Python code files. Python is an open source programming language, and the code files give future users the tools to generate a 2010 housing unit inventory for any county in the United States. Applications of the method are reproducible in IN-CORE (Interdependent Networked Community Resilience Modeling Environment).
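
The reproducible, randomized assignment described above can be illustrated with a minimal sketch. This is not the repository's allocation code: the function name allocate_housing_units, the tuple layout, and the sample identifiers are assumptions made for illustration. The sketch shuffles the housing units in each census block with a seeded generator and pairs them with structures in the same block, so the same seed always gives the same assignment.

```python
import random
from collections import defaultdict


def allocate_housing_units(housing_units, structures, seed):
    """Assign housing units to structures within the same census block.

    housing_units: list of (huid, blockid) tuples    (hypothetical layout)
    structures:    list of (strctid, blockid) tuples (hypothetical layout)
    Returns a dict huid -> strctid, or None when the block has no structure left.
    """
    rng = random.Random(seed)  # local generator keeps the result reproducible

    structures_by_block = defaultdict(list)
    for strctid, blockid in sorted(structures):
        structures_by_block[blockid].append(strctid)

    units_by_block = defaultdict(list)
    for huid, blockid in sorted(housing_units):
        units_by_block[blockid].append(huid)

    allocation = {}
    for blockid in sorted(units_by_block):
        units = units_by_block[blockid]
        rng.shuffle(units)  # randomized order within the block
        available = structures_by_block.get(blockid, [])
        for i, huid in enumerate(units):
            # Units left without a structure point to gaps in the building data
            allocation[huid] = available[i] if i < len(available) else None
    return allocation


# Hypothetical sample data: two blocks, one of them without any structure
units = [("hu1", "B1"), ("hu2", "B1"), ("hu3", "B2")]
buildings = [("s1", "B1"), ("s2", "B1")]
first = allocate_housing_units(units, buildings, seed=1000)
second = allocate_housing_units(units, buildings, seed=1000)
```

Running the function twice with the same seed returns identical dictionaries, and the unit in the block without structures is mapped to None, which is one simple way to surface limited building data.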

Funding Source: Award name and number

Center for Risk-Based Community Resilience Planning : NIST-70NANB20H008, NIST-70NANB15H044 (http://resilience.colostate.edu/)

Related Works

Interdependent Networked Community Resilience Modeling Environment (IN-CORE) : https://incore.ncsa.illinois.edu/

Hazard Reduction and Recovery Center (HRRC): https://www.arch.tamu.edu/impact/centers-institutes-outreach/hazard-reduction-recovery-center/

Keywords: Social and Economic Population Data, Housing Unit, US Census, Synthetic Population

Output Data License: Open Data Commons Attribution License (https://opendatacommons.org/licenses/by/summary/)

Program Code: Mozilla Public License Version 2.0 (https://www.mozilla.org/en-US/MPL/2.0/)

DesignSafe-CI project identification number

Project-2961 - Available at: https://www.designsafe-ci.org/data/browser/public/designsafe.storage.published/PRJ-2961

Important Acronyms

  • acg = api.census.gov - primary data source
  • HRRC = Hazard Reduction and Recovery Center
  • hui = housing unit inventory
  • ncoda = Intersect Community Data
  • IN-CORE = Interdependent Networked Community Resilience Modeling Environment

Census Data Terms of Use

This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau. https://www.census.gov/data/developers/about/terms-of-service.html

"Users will not use these data, alone or in combination with any other Census or non-Census data, to:

  1. identify any individual person, household, business or other entity
  2. not link or combine these data with information in any other Census or non-Census dataset in a manner that identifies an individual person, household, business or other entity
  3. not publish information from these data files, particularly in combination with any other Census or non-Census data, in a manner that identifies any individual person, household, business or other entity; and
  4. not use the identity of any person or establishment discovered inadvertently and advise the Director of the Census Bureau of any such discovery by calling the U.S. Census Bureau’s Policy Coordination Office at 301-763-6440."

Setup Environment

  1. Install Anaconda (https://www.anaconda.com/products/individual)
  2. Install VS Code
  3. Download Repository
  4. See environment.yml file for dependencies
  5. Run the primary Jupyter Notebook in the main folder

Data File Naming Convention

The housing unit inventory file names include the following convention:

  • Acronym = hui - for housing unit inventory
  • version = v2-0-0 - version of the file
  • community = community name - name of the community ex. Lumberton, NC
  • year = year of the data - ex. 2010
  • random seed = random seed used to generate the data - ex. rs1000

Example description of the file name hui_v2-0-0_Lumberton_NC_2010_rs1000.csv:

  • Version 2 Housing Unit Inventory for Lumberton, NC, using 2010 Census data. Set the random seed to 1000 to replicate the data.
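
The naming convention above can be sketched as a pair of helper functions. The function names and the regular expression are assumptions made for illustration and are not part of the repository; the rule that turns "Lumberton, NC" into Lumberton_NC is inferred from the single example file name.

```python
import re

# Pattern inferred from the example hui_v2-0-0_Lumberton_NC_2010_rs1000.csv
HUI_PATTERN = re.compile(
    r"^(?P<acronym>[a-z]+)_"
    r"v(?P<version>\d+-\d+-\d+)_"
    r"(?P<community>.+)_"
    r"(?P<year>\d{4})_"
    r"rs(?P<seed>\d+)\.csv$"
)


def build_hui_filename(community, year, seed, version="2-0-0", acronym="hui"):
    """Compose a data file name from its parts."""
    community_part = community.replace(",", "").replace(" ", "_")
    return f"{acronym}_v{version}_{community_part}_{year}_rs{seed}.csv"


def parse_hui_filename(filename):
    """Split a data file name back into its parts."""
    match = HUI_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {filename}")
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    parts["seed"] = int(parts["seed"])
    return parts


name = build_hui_filename("Lumberton, NC", 2010, 1000)
parts = parse_hui_filename(name)
```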

Each data file is a comma separated value (CSV) file and each data file has a PDF file associated with it. The PDF file is the Codebook for the data file. The Codebook includes an overview of the project, details and notes on each variable, links to verify the data, and figures for exploring the data. NOTE - To understand the data please refer to the Codebook.

Program File Naming Convention

Program file names are designed to be unique and descriptive. The file names provide a way to identify the source of the data (based on the acronym), the order the programs should be run, and the version of the program. All program files (python files) are named with the following convention:

  • Acronym (e.g. ICD - for Intersect Community Data)
  • Data science workflow step:
    • 00 - Administrative, Overview of Data, General Utility Functions
    • 01 - Obtain Data (read in data from api, wget, data parsing, etc.)
    • 02 - Data Cleaning (merging files, adding new columns, checking keys, etc.)
    • 03 - Data Analysis
    • 04 - Data Visualization
    • 05 - Data Output (combined workflow steps for producing csv files)
    • 06 - Data Production (codebook, data sharing, archiving, etc.)
    • 07 - Data workflows (produce files from other programs)
    • 08 - Validation workflows
    • 09 - Uncertainty propagation workflows
  • Program run order step (e.g. run a before b, or b depends on code in a)
  • version (e.g. v2 - version 2 of the file)
    • GitHub is used for incremental version control; the version number is updated when the program file has a major change.
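
As a sketch of how these parts could be read back from a program file name, the snippet below parses the notebook name ncoda_06dv1_run_HUI_v2_workflow.ipynb that appears in this repository. The regular expression and the function name are assumptions inferred from that one file name, not a rule the repository enforces.

```python
import re

# Workflow step labels copied from the list above
WORKFLOW_STEPS = {
    "00": "Administrative",
    "01": "Obtain Data",
    "02": "Data Cleaning",
    "03": "Data Analysis",
    "04": "Data Visualization",
    "05": "Data Output",
    "06": "Data Production",
    "07": "Data workflows",
    "08": "Validation workflows",
    "09": "Uncertainty propagation workflows",
}

# Assumed layout: acronym_<step><run order>v<version>_<description>.<extension>
PROGRAM_PATTERN = re.compile(
    r"^(?P<acronym>[A-Za-z]+)_"
    r"(?P<step>\d{2})"
    r"(?P<run_order>[a-z])"
    r"v(?P<version>\d+)_"
    r"(?P<description>.+)\."
    r"(?P<extension>py|ipynb)$"
)


def parse_program_filename(filename):
    """Split a program file name into its naming-convention parts."""
    match = PROGRAM_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {filename}")
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    parts["step_label"] = WORKFLOW_STEPS.get(parts["step"], "Unknown step")
    return parts


program = parse_program_filename("ncoda_06dv1_run_HUI_v2_workflow.ipynb")
```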

Directory Design

The directory structure is as follows:

├───OutputData [ignored by github folder]
├───OutputData_LogFiles [ignored by github folder]
├───SampleOutputData [manually added to github repository to show example results]
└───pyncoda
    ├───CommunitySourceData
    │   ├───api_census_gov

The pyncoda directory contains the source code to replicate the data. In the pyncoda directory, the notebook file ncoda_06dv1_run_HUI_v2_workflow.ipynb contains the workflow for the data replication.

The replication code creates the folder OutputData to store the data files and log files. This folder is not included in the GitHub repository or the DesignSafe-CI archive: it is ignored by GitHub and was removed to limit the size of the DesignSafe-CI archive. The file directory_file_tree_2021-04-01.txt contains an archive of all files generated by the replication code.

The Archive directory contains the program files that were used to develop code but are no longer needed. These files are kept in case the code needs to be reused.

The folder pyncoda contains the source code to replicate the data - this source code archives the version of the code at the time the data was generated. A current version of the source code is available on GitHub at https://github.com/npr99/intersect-community-data/.

Within the pyncoda folder, the folder CommunitySourceData contains the code to obtain, clean, and process the source data for a community. For the Housing Unit Inventory, the source data is obtained from the Census Bureau API (api.census.gov). The folder name api_census_gov reflects the URL where the source data can be found - note that underscores (_) are used in place of periods (.) so that the folder can be used as part of the Python package.
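
The folder naming rule can be sketched in a few lines. The helper name url_to_package_name is hypothetical; the check simply confirms that the result is a valid Python identifier, which is what allows the folder to be imported as a package. Replacing hyphens is an extra assumption that the text above does not mention.

```python
import keyword


def url_to_package_name(host):
    """Turn a host name such as api.census.gov into an importable folder name."""
    name = host.replace(".", "_").replace("-", "_")
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"Cannot build a valid package name from: {host}")
    return name


census_folder = url_to_package_name("api.census.gov")
# The raw host name is not a valid identifier, so it could not be imported
raw_is_valid = "api.census.gov".isidentifier()
```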

The folder OutputData_LogFiles is included in the DesignSafe archive but ignored by GitHub. This directory contains the log files generated by the replication code. This folder was created manually to archive the details stored in the log files. The log files include all of the web links and steps in the workflow.

Acknowledgements

[Work on acknowledgments] Mastura Safayet - PhD Student Texas A&M University [has fork of repository running in VS Code on local machine]

intersect-community-data's People

Contributors

albeck2, aminenderami, mastura10, npr99

intersect-community-data's Issues

Research Idea

We could use IPUMS to see the general likelihood that households have 0 to x vulnerabilities.
We could also perform PCF to see how vulnerabilities relate to each other and which ones are highly correlated. The way the different categories are weighted could help guide which synthetic population variables to include.
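
A minimal sketch of the first idea, assuming household records with one true/false flag per vulnerability. The flag names and the four sample households are hypothetical and do not come from IPUMS; the function only tabulates the share of households by number of vulnerabilities.

```python
from collections import Counter


def vulnerability_count_shares(households, vulnerability_flags):
    """Return the share of households having 0, 1, 2, ... vulnerabilities."""
    counts = Counter(
        sum(1 for flag in vulnerability_flags if household.get(flag))
        for household in households
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {n: counts[n] / total for n in sorted(counts)}


# Hypothetical flags and records, for illustration only
flags = ["low_income", "no_vehicle"]
sample_households = [
    {"low_income": False, "no_vehicle": False},
    {"low_income": True, "no_vehicle": False},
    {"low_income": False, "no_vehicle": True},
    {"low_income": True, "no_vehicle": True},
]
shares = vulnerability_count_shares(sample_households, flags)
```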

Get "Used By" section on repo

requirements.txt

https://github.com/Kanaries/pygwalker/tree/main

https://docs.github.com/en/code-security/supply-chain-security/understanding-your-software-supply-chain/about-the-dependency-graph#supported-package-ecosystems

I noticed on the pygwalker GitHub page that they have a "Used by" section. The package also has a nice readme page.

It looks like a package that uses another package needs to have a requirements.txt file - then GitHub can identify the package that depends on another package and will make a "Used by" section on the repository.

workflow for prechui

getting ready to publish example files for the person record files, with student and staff examples.
prechui_v0-2-0_37155_2010_rs9876.csv
prec_v0-2-0_Lumberton_NC_2010_rs9876_schoolstaff.csv
prec_v0-2-0_Lumberton_NC_2010_rs9876_students.csv

I need to work on getting this workflow up-to-date on GitHub. Right now it is not clear how these files could be replicated.
The code was from 2022-03-02

Run HUA with non incore building inventory

Working on a workflow to run the Housing Unit Inventory on a non-IN-CORE dataset.
For example - in southeast Texas we have National Address Data Inventory and Parcel data. What do I need to do to make the HUA work with this type of file?

Run code in codespace

GitHub has a new option to run code in a virtual environment called codespace. This sounds like a good idea, but when I initially launch the codespace the environment is not set up. I had expected the environment to be based on the .yml file. But it looks like I need to add a dev container configuration (a .devcontainer folder).

Following these instructions:
https://docs.github.com/en/codespaces/setting-up-your-project-for-codespaces/adding-a-dev-container-configuration/setting-up-your-python-project-for-codespaces

found bug - expected total seems to decrease

Issue moved over from: https://github.com/npr99/Labor_Market_Allocation/issues/1

The expected total should be the same across loops of MCMC SA. Noticed with block 371559610003026 that the expected total decreases from 1205 to 1184 to 1159 to 1147....
MCMC SA settings:
random_accept_threshold = 0.1
start_reduction_threshold = 0.1
max_reduction_threshold = 0.4
seed = 133234
This should not happen. It is possible that the random select function has a bug.
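
One way to catch this early is a small invariant check on the expected total after each loop. The sketch below is an assumption about how such a check could look, not code from lodes_mcmcsa_loops.py; the function name find_total_drift is hypothetical, and the totals are the ones reported in this issue.

```python
def find_total_drift(expected_totals):
    """Return (loop, previous_total, current_total) for every loop whose
    expected total differs from the loop before it."""
    drift = []
    for loop in range(1, len(expected_totals)):
        previous, current = expected_totals[loop - 1], expected_totals[loop]
        if current != previous:
            drift.append((loop, previous, current))
    return drift


# Totals reported in this issue for block 371559610003026
reported_totals = [1205, 1184, 1159, 1147]
drift = find_total_drift(reported_totals)
```

An empty result means the expected total stayed constant, which is the behavior the issue expects.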

Labor_Market_Allocation/pyincoredata_addons/SourceData/lehd_ces_census_gov/lodes_mcmcsa_loops.py

Lines 84 to 99 in 731d71a

Should probably make length_expected using the wac job list, not the random selection. But the random selection should always select the same number - so the error is probably in the random selection process.

Labor_Market_Allocation/pyincoredata_addons/SourceData/lehd_ces_census_gov/lodes_mcmcsa_util.py

Lines 350 to 358 in 731d71a

if k > 2:
    seed_k = seedk + k
    df = rand_select_jobs(df, seed_k)
    # Check if number of jobs selected for each od pair
    # matches the expected number of jobs

 # New function here 

 total_mcmcsa[k,1] = calculate_total_fitness(\

Note line 356 in lodes_mcmcsa_util.py references same issue

Try using unique job id
It might be that jobidac is not merging correctly and therefore dropping obs
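
A quick way to test the merge hypothesis is to compare the identifiers on both sides before merging. The sketch below uses plain lists of job ids; the function name check_merge_keys and the sample ids are hypothetical, and only the column name jobidac comes from this issue. Non-unique or unmatched ids are the usual reasons a merge changes the number of observations.

```python
from collections import Counter


def check_merge_keys(left_ids, right_ids):
    """Report duplicated and unmatched identifiers before a merge."""
    left_counts = Counter(left_ids)
    right_counts = Counter(right_ids)
    return {
        "duplicated_left": sorted(k for k, n in left_counts.items() if n > 1),
        "duplicated_right": sorted(k for k, n in right_counts.items() if n > 1),
        "left_only": sorted(set(left_counts) - set(right_counts)),
        "right_only": sorted(set(right_counts) - set(left_counts)),
    }


# Hypothetical jobidac values, for illustration only
report = check_merge_keys(["j1", "j2", "j2", "j3"], ["j1", "j2", "j4"])
```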

change main folder pyhui

pyhui stands for python housing unit inventory... need to change the name but have not decided on the future package name

options
pyicd - python intersect community data
pyincoda
pyintcomdat
pyintersectcd
pyntrsctcmntydt

pyixcommunitydata

Consider adding Homeless

https://data.census.gov/table?q=foster+child&g=0100000US_0500000US37155_1600000US3739700&tid=ACSST5Y2019.S0901

https://data.census.gov/cedsci/table?q=foster%20child&g=0100000US_0500000US37155_1600000US3739700&tid=ACSST5Y2019.S0901

Homeless students
Homeless students include those who share housing with other persons due to loss of housing or economic hardship; live in hotels or motels, trailer parks, or campgrounds due to lack of alternative arrangements; are awaiting foster care placement; live in substandard housing; or are children of migrant workers (CRS, 2018).

good list of characteristics - some of these are in the housing unit and person record inventory. Shared housing, trailer parks - could be identified in the data. Foster care might be something to summarize - the census bureau has counts of people in foster care.

I will see if I can flag shared housing. The person record inventory has household structure - could be based on number of adults to children. I would assume that shared housing would have multiple adults (3-4+) and multiple children (3+)
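
The assumption above can be written as a simple flag. The function name and parameter names are hypothetical; the default thresholds (3 or more adults and 3 or more children) follow the guess stated above and would need to be checked against the data.

```python
def flag_shared_housing(num_adults, num_children, min_adults=3, min_children=3):
    """Flag a household as possible shared housing based on its structure."""
    return num_adults >= min_adults and num_children >= min_children


# Hypothetical households given as (adults, children)
examples = [(4, 3), (2, 3), (3, 2)]
flags = [flag_shared_housing(adults, children) for adults, children in examples]
```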

According to the National School Boards Association, during the 2018-2019 school year, 77% of homeless students lived in shared housing, 12% lived in shelters, transitional housing, or were awaiting foster care, 7% lived in hotels or motels, and 4% were unsheltered (Cai, 2021).

Cai, Jinghong. (2021). Homeless Students in Public Schools Across America: Down but Not Out. National School Boards Association.
Congressional Research Service (CRS). (2018). Homelessness: Targeted Federal Programs. CRS Report.

Issue with family variable in Oceana County

Running code for Oceana County.
Found that there is an issue with the family variable having missing values.
The code did not run the first time - error at the point of making the figure for family income.
The code ran fine the second time through.

Also an error with building available table. Might be an issue with the data not having anyone in missing buildings.
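
A check like the following could be run before the figure step to see how many records are missing the family value. It is a sketch under assumptions: the records are dictionaries such as those produced by csv.DictReader, the column name family is assumed, and only empty strings, None, and absent keys are treated as missing.

```python
def count_missing(records, column, missing_values=("", None)):
    """Count records where the column is absent or holds a missing marker."""
    return sum(1 for record in records if record.get(column) in missing_values)


# Hypothetical records, for illustration only
records = [
    {"huid": "h1", "family": "1"},
    {"huid": "h2", "family": ""},
    {"huid": "h3"},
]
missing = count_missing(records, "family")
```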

rework joblist to prec merge

Once @aminenderami has his joblist that combines the wac-od-rac files, here is a possible workflow for combining the joblist with the person record files.

@npr99 did this for Lumberton School Staff.

Example data published on DesignSafe
https://www.designsafe-ci.org/data/browser/public/designsafe.storage.published//PRJ-2961v3/PersonRecordFiles_Lumberton_2022-03-02?doi=10.17603%2Fds2-jwf6-s535

prec_v0-2-0_Lumberton_NC_2010_rs9876_schoolstaff.pdf

adding code to NPR sandbox

found code in
G:\Shared drives\HRRC_IN-CORE\Tasks\M5.2-01 Pop inventory\github_com\npr99\Population_Inventory\pyincore_data_addons

Archived code in this folder:
pyncoda\99_SandboxCode\SandboxNPR\prechui_lodes_code

The file
ncoda_07fv1_HUA_PREC_NSI.ipynb

generates the HUI and PREC files (note: need to add code that combines the HUI and PREC files). This code is in the NPR Sandbox but has not been updated.
