yu-group / covid19-severity-prediction

Extensive and accessible COVID-19 data + forecasting for counties and hospitals. 📈

Home Page: https://arxiv.org/abs/2005.07882

License: MIT License

covid-19 covid-19-data covid-19-data-analysis python3 visualization county-health-data ventilator outbreak risk-assessment risk-modelling


Covid Severity Forecasting

Data and models (updated daily) for forecasting COVID-19 severity for individual counties and hospitals in the US. The data includes confirmed cases/deaths, demographics, risk factors, social distancing data, and much more.

Table of contents
Overview • Quickstart • Acknowledgements
Resources
Data csv • Paper • Website • Modeling docs • Dashboard code

Overview

Note: This repo is actively maintained - for any questions, please file an issue.

  • Data (updated daily): We have cleaned, merged, and documented a large corpus of hospital- and county-level data from a variety of public sources to aid data science efforts to combat COVID-19.
    • At the hospital level, the data include the location of the hospital, the number of ICU beds, the total number of employees, the hospital type, and contact information
    • At the county level, our data include socioeconomic factors, social distancing scores, and COVID-19 cases/deaths from USA Facts and NYT
    • Easily downloadable as processed csv or full pipeline
    • Extensive documentation available here
  • Paper link: "Curating a COVID-19 data repository and forecasting county-level death counts in the United States"
  • Project website: http://covidseverity.com/
  • Modeling: Using this data, we have developed a short-term (3-5 days) forecasting model for mortality at the county level. This model combines a county-specific exponential growth model and a shared exponential growth model through a weighted average, where the weights depend on past prediction accuracy.
  • Severity index: The COVID pandemic severity index (CPSI) is designed to help direct the distribution of medical resources to hospitals. It takes on three values (3: High, 2: Medium, 1: Low), indicating the severity of the COVID-19 outbreak for a hospital on a given day. It is calculated in three steps (see the sketch after this list).
    1. county-level predictions for the number of deaths are generated
    2. county-level predictions are allocated to hospitals within each county in proportion to their total number of employees
    3. the final value is determined by thresholding the cumulative predicted death count for each hospital (current recorded deaths + predicted future deaths)
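
A minimal sketch of this forecasting-to-severity pipeline in Python; the inverse-error weighting, the function names, and the threshold values are illustrative assumptions, not the repository's exact implementation:

import numpy as np

def combine_forecasts(pred_county, pred_shared, err_county, err_shared):
    # step 1 (sketch): weighted average of the county-specific and shared
    # exponential growth forecasts; inverse past error as the weight is an
    # illustrative stand-in for the paper's accuracy-based weighting
    w_c = 1.0 / (err_county + 1e-8)
    w_s = 1.0 / (err_shared + 1e-8)
    return (w_c * pred_county + w_s * pred_shared) / (w_c + w_s)

def hospital_severity(county_pred_deaths, employees, recorded_deaths,
                      thresholds=(10, 100)):  # hypothetical cutoffs
    # step 2: allocate the county forecast to its hospitals in proportion
    # to their total employee counts
    employees = np.asarray(employees, dtype=float)
    predicted = employees / employees.sum() * county_pred_deaths
    # step 3: threshold cumulative predicted deaths into
    # 1 (Low) / 2 (Medium) / 3 (High)
    return np.digitize(np.asarray(recorded_deaths) + predicted, thresholds) + 1

For example, hospital_severity(40.0, [2000, 500], [5, 1]) allocates 32 and 8 predicted deaths to the two hospitals and returns severities [2, 1] (Medium, Low) under the hypothetical cutoffs.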

Quickstart with the data + models

You can download, load, and merge the data via:

import load_data
# first time it runs, downloads and caches the data
df = load_data.load_county_level(data_dir='/path/to/data') 
  • for more data details, see ./data/readme.md
  • see also the quickstart notebook
  • we are constantly monitoring and adding new data sources (+ relevant data news here)
  • output from running the daily updates is stored here

To get death predictions from our current best-performing model, the simplest way is to call the add_preds function (for more details, see ./modeling/readme.md):

from modeling.fit_and_predict import add_preds
df = add_preds(df, NUM_DAYS_LIST=[1, 3, 5]) # adds keys like "Predicted Deaths 1-day", "Predicted Deaths 3-day"
# NUM_DAYS_LIST is list of number of days in the future to predict

Related county-level projects

Acknowledgements

The UC Berkeley Departments of Statistics and EECS, led by Professor Bin Yu (group members are listed alphabetically by last name)

  • Yu group team (Data/modeling): Nick Altieri, Rebecca Barter, James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robbie Netzorg, Briton Park, Chandan Singh (student lead), Yan Shuo Tan, Tiffany Tang, Yu Wang
    • Summer team: Abhineet Agarwal, Maya Shen, Danqing Wang, Chao Zhang
  • Response4Life (Organization/distribution) team and volunteers, particularly Don Landwirth and Rick Brennan
  • Medical team (Advice from a medical perspective): Roger Chaufournier, Aaron Kornblith, David Jaffe
  • Hospital-info collection: Matthew Shen, Anthony Rio, Miles Bishop, Josh Davis, and Dylan Goetting
  • Kolak group team (Geospatial visualization): Qinyun Lin
  • Support from Google: Cat Allman and Peter Norvig
  • Shen Group team (IEOR): Junyu Cao, Shunan Jiang, Pelagie Elimbi Moudio
  • Helpful input from many including: SriSatish Ambati, Rob Crockett, Tina Eliassi-Rad, Marty Elisco, Nick Jewell, Valerie Isham, Valerie Karplus, Andreas Lange, Ying Lu, Samuel Scarpino, Jas Sekhon, Philip Stark, Jacob Steinhardt, Suzanne Tamang, Brian Yandell, Tarek Zohdi
  • Thanks to support from AWS and Google
  • Additionally, we would like to thank our sources, which can be found in the data readme

To reference, please cite the paper:

@article{altieri2020Curating,
  journal = {Harvard Data Science Review},
  doi = {10.1162/99608f92.1d4e0dae},
  note = {https://hdsr.mitpress.mit.edu/pub/p6isyf0g},
  title = {Curating a COVID-19 Data Repository and Forecasting County-Level Death Counts in the United States},
  url = {https://hdsr.mitpress.mit.edu/pub/p6isyf0g},
  author = {Altieri, Nick and Barter, Rebecca L and Duncan, James and Dwivedi, Raaz and Kumbier, Karl and Li, Xiao and Netzorg, Robert and Park, Briton and Singh, Chandan and Tan, Yan Shuo and Tang, Tiffany and Wang, Yu and Zhang, Chao and Yu, Bin},
  date = {2020-11-03},
  year = {2020},
  month = {11},
  day = {3},
}


Issues

FIPS code 00001 in usafacts_infections

The line in usafacts_infections with FIPS code 00001 corresponds to this line in the raw file:
1,New York City Unallocated/Probable,NY
Do you really want it in the processed file?

Load dataset code: "Error: No such file"

Per the instructions, I cloned the repo and started a program to run in the root directory:

import data
# unabridged
df_unabridged = data.load_county_data(data_dir = "data", cached = False, abridged = False)

Running this code produces an error: FileNotFoundError: [Errno 2] No such file or directory: 'File ../../raw/ahrf_health/ahrf_health.csv does not exist'. I have been able to reproduce this error in a totally separate environment. The problem is likely in clean.py.

Not sure if I'm missing something obvious here.

Other info: macOS

consistency between usafacts and nytimes

If a goal is to put county_level data from various sources into a common format, then consider:

  1. Sorting the columns so that the columns in usafacts_infections and nytimes_infections are in the same order. Currently, the #Cases_ columns come before the #Deaths_ columns in usafacts_infections, and the reverse is true for nytimes_infections.

  2. All the numbers in nytimes_infections end in ".0", e.g. 0.0,0.0,1.0,1.0,... They are integers in usafacts_infections. I suggest removing the .0 in nytimes_infections (a sketch of one way to normalize both files follows below).
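
A minimal sketch of the suggested normalization, assuming both processed files load as DataFrames, that the first column is named countyFIPS, and that the count columns share the #Cases_/#Deaths_ prefixes (the column names are assumptions based on the issue text):

import pandas as pd

usafacts = pd.read_csv("usafacts_infections.csv", dtype={"countyFIPS": str})
nytimes = pd.read_csv("nytimes_infections.csv", dtype={"countyFIPS": str})

# cast the float counts in nytimes_infections to nullable integers,
# which drops the trailing ".0" when the file is written back out
count_cols = [c for c in nytimes.columns if c.startswith(("#Cases_", "#Deaths_"))]
nytimes[count_cols] = nytimes[count_cols].astype("Int64")

# reorder the nytimes columns to match the usafacts column order
shared = [c for c in usafacts.columns if c in set(nytimes.columns)]
nytimes = nytimes[shared]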

add_preds

Following quickstart.ipynb, the add_preds function hits a "date out of range" issue. The delta in the function keeps increasing with no stopping criterion other than the existence of the cached frame; is that what causes the overflow?

FileNotFoundError: [Errno 2] No such file or directory: 'data/hrsa/data_AHRF_2018-2019/processed/df_renamed.pkl'

Looks like there's a problem with the hrsa data, as shown below.


FileNotFoundError Traceback (most recent call last)
in
15 import load_data
16
---> 17 df = load_data.load_county_level()
18 df = df.sort_values('tot_deaths', ascending=False)
19 important_vars = load_data.important_keys(df)

~/load_data.py in load_county_level(data_dir, cached_file, cached_file_abridged, ahrf_data, diabetes, voting, icu, heart_disease_data, stroke_data, dir_mod)
50 heart_disease_data=heart_disease_data,
51 stroke_data=stroke_data,
---> 52 diabetes=diabetes) # also cleans usafacts data
53
54 # basic preprocessing

~/functions/merge_data.py in merge_data(ahrf_data, diabetes, voting, icu, heart_disease_data, stroke_data, medicare_group, resp_group)
18
19 # read in data
---> 20 facts = pd.read_pickle(ahrf_data)
21 facts = facts.rename(columns={'Blank': 'id'})
22

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/pickle.py in read_pickle(filepath_or_buffer, compression)
168 if not isinstance(fp_or_buf, str) and compression == "infer":
169 compression = None
--> 170 f, fh = get_handle(fp_or_buf, "rb", compression=compression, is_text=False)
171
172 # 1) try standard library Pickle

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
432 else:
433 # Binary mode
--> 434 f = open(path_or_buf, mode)
435 handles.append(f)
436

FileNotFoundError: [Errno 2] No such file or directory: 'data/hrsa/data_AHRF_2018-2019/processed/df_renamed.pkl'

Maximum of array under 'deaths' does not match the value in 'tot_deaths'

The maximum value of the array under the deaths column does not match the total number of deaths (tot_deaths) for some counties. In total, I found 16 such instances; a sketch of the check I ran is below, after the list of FIPS codes.

The FIPS codes of the problematic counties are as follows:
['01031', '01077', '02110', '05031', '05061', '08069', '08097', '13085', '13269', '28005', '39027', '39113', '45023', '49005', '53037', '54055']

I am using the 'abridged' version of the dataset.
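
A minimal sketch of a check that reproduces this comparison, assuming df is the abridged county-level DataFrame with countyFIPS, deaths (a per-county array of cumulative counts), and tot_deaths columns; these names follow the issue text and may differ:

import numpy as np

# flag counties where the maximum of the cumulative 'deaths' series
# disagrees with the scalar 'tot_deaths' column
bad = df[df.apply(lambda r: int(np.max(r["deaths"])) != int(r["tot_deaths"]),
                  axis=1)]
print(len(bad), sorted(bad["countyFIPS"].astype(str).tolist()))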

Where is all_deaths_preds_6_21.pkl?

When I try to run predict_all_death.py, it fails with No such file or directory: 'all_deaths_preds_6_21.pkl'.
I am not sure where this file is.

NYTimes has "City1" and "City2" as countyFIPS codes

The last 2 lines of the processed nytimes_infections file begin with "City1" and "City2" in the first field. I believe City1 corresponds to the "New York City" line in the raw file (with no FIPS code) and City2 to the "Kansas City, Missouri" line (also no FIPS code).

$ tail -2 nytimes_infections.csv |less -SX
City1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, ...
City2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,  ...

https://github.com/Yu-Group/covid19-severity-prediction/blob/master/data/county_level/processed/nytimes_infections/nytimes_infections.csv
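
A minimal sketch of one way to drop such placeholder rows before county-level analysis, assuming the first field should always be a five-digit FIPS code (an assumption based on the rest of the file):

import pandas as pd

nyt = pd.read_csv("nytimes_infections.csv", dtype=str)

# keep only rows whose first field is a five-digit numeric FIPS code,
# dropping placeholders like "City1" and "City2"
nyt = nyt[nyt.iloc[:, 0].str.fullmatch(r"\d{5}", na=False)]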

hospital-level dataset

Excuse me, I had trouble while running the script and got an error similar to #16, so I couldn't generate the hospital-level dataset.
Could you please upload the processed, clean hospital-level dataset? Thank you!

Failure to download the raw dataset

Hi, thanks for sharing this dataset. I'm trying to load the safegraph_socialdistancing data from this git repository. However, it appears (as shown below) that the dataset is stored in a separate "covid-19-private-data" repository. Is there any way I could get access to this safegraph_socialdistancing data?

def load_safegraph_socialdistancing(data_dir='../../../../../covid-19-private-data'):
    '''Load in SafeGraph Social Distancing data (automatically updated)

    Parameters
    ----------
    data_dir : str; path to the data directory to find safegraph_socialdistancing.gz (private data)

    Returns
    -------
    data frame
    '''

    orig_dir = os.getcwd()
    os.chdir(data_dir)

    # refresh and load in data
    os.system("git pull")
    raw = pd.read_pickle("safegraph_socialdistancing.gz", compression="gzip")
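
For anyone who does have access to the private repository, calling the loader then looks like the sketch below; the clone path is illustrative, and the quoted function is truncated above, so its return value isn't shown:

# point data_dir at your local clone of covid-19-private-data (path illustrative)
load_safegraph_socialdistancing(data_dir='/path/to/covid-19-private-data')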

FIPS code 06000 in usafacts_infections

The line in usafacts_infections with FIPS code 06000 corresponds to this line in the raw file:
6000,Grand Princess Cruise Ship,CA
Do you really want it in the processed county-level file?
