
covid-data-model's Introduction

COVID-19 Data Pipeline

COVID data pipeline / API supporting https://covidactnow.org/.

It ingests data scraped via https://github.com/covid-projections/can-scrapers, combines it, calculates metrics, and generates data files for the Covid Act Now API and website.

Development

Setup

Detailed setup instructions can be found here.

Local development

Normally the pipeline is run via GitHub Actions on a beefy cloud VM and still takes 2+ hours. When developing locally it is often useful to run the pipeline on a subset of locations and/or to skip pipeline steps.

To run the pipeline end-to-end but only generate data for Connecticut state / counties, you can run:

# Fetches latest scraped data from can-scrapers and combines all data sources
# into a combined dataset, runs all filters, etc.  Adding --no-refresh-datasets
# will make this much faster but skips fetching / combining latest datasets.
./run.py data update --states CT

# Runs the pyseir code to generate the infection rate, i.e. r(t) metric data for locations.
python ./pyseir/cli.py build-all --states=CT

# Runs the API generation
./run.py api generate-api-v2 --state CT output -o output/api

# API files are generated to output/api.

Downloading Model Run Data

If you just want to run the API generation, you can skip the first two steps above by downloading the pyseir model results from a previous snapshot. You can download the pyseir model output from a recent GitHub Actions run with:

export GITHUB_TOKEN=<YOUR PERSONAL GITHUB TOKEN>
./run.py utils download-model-artifact

By default it downloads the artifact from the most recent run, but you can choose a specific run with --run-number, e.g. ./run.py utils download-model-artifact --run-number <RUN NUMBER>.

Running PySEIR

PySEIR provides a command line interface in the activated environment. You can access the model with pyseir --help; pyseir <subcommand> --help provides more information on each subcommand.

Example: pyseir build-all --states="NY" will run state and county models for New York. States can be specified by full name or two-letter code: --states="New York" and --states=NY are equivalent.

pyseir build-all --states=NY --fips=36061 will run the New York state model and the model for the specified FIPS code (in this case New York County).

Check the output/ folder for results.

Model Output

A variety of output artifacts are written to the paths described in pyseir/utils.py. The main artifact is the ensemble_result, which contains the output for each suppression policy and model compartment as well as capacity information.

API Documentation

We host an API documentation site available in api/docs. It is a static site built using Docusaurus 2.

Additionally, we define the API output using pydantic schemas and generate OpenAPI specs (default output: api/docs/open_api_schema.json) and JSON Schema outputs (default output: api/schemas_v2/).

When modifying the API schema, run ./run.py api update-schemas to update the schemas.
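
For reference, here is a minimal sketch of how a pydantic model (v1-style) emits JSON Schema; the RegionSummary model and its fields are purely illustrative, not the actual API schema:

import pydantic


class RegionSummary(pydantic.BaseModel):
    # Illustrative fields only; the real schemas live under the api/ package.
    fips: str
    state: str
    cases: int
    deaths: int


# Emits a JSON Schema document, similar in spirit to the files that
# ./run.py api update-schemas regenerates under api/schemas_v2/.
print(RegionSummary.schema_json(indent=2))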

Simple setup

Build environment:

$ cd api/docs
$ yarn

Start server locally:

$ cd api/docs
$ yarn start

Deploy update to apidocs.covidactnow.org:

$ tools/deploy-docs.sh


covid-data-model's Issues

Current method of copying covid-data-public locally doesn't support Git LFS

Downloading a zip of the repository downloads the pointers to large files tracked with git lfs, but not the files themselves. Since the zip is also not itself a repository, it is not immediately possible to fetch the large files. We should consider either what needs to be done to fetch the large files after the zip is downloaded, or the possibility of cloning enough of the repo to also download the large files as we need them.

This is currently breaking support for the COVID_DATA_PUBLIC environment variable on the dod branch, since it depends on shapefiles stored via git lfs in covid-data-public.
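
One possible workaround (a sketch, not a decided approach) is to clone shallowly with LFS smudging disabled and then fetch the LFS objects explicitly, instead of downloading a zip:

import os
import subprocess


def clone_with_lfs(repo_url: str, dest: str) -> None:
    """Shallow-clone without downloading LFS objects, then fetch them explicitly."""
    env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
    subprocess.run(["git", "clone", "--depth", "1", repo_url, dest], env=env, check=True)
    subprocess.run(["git", "lfs", "pull"], cwd=dest, check=True)


# e.g. clone_with_lfs("https://github.com/covid-projections/covid-data-public.git", "covid-data-public")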

Port Legacy Model to Updated Data Sources

  • Re-implement backfill on new JHU/CDS/NYT sources
  • Retarget run_old_model.py to the new source library
  • Retarget notebook_runtime.ipynb to the new source library

How backfill works:

  1. Identify the first day with >0 cases that is either 2020-03-03 or a multiple of four days after (2020-03-07, 2020-03-11, ...).
  2. Ensure prior dates with zero cases get synthesized to have half of the next interval's amount.
  3. Backfill until reaching 2020-03-03.

Empirical data for cases:

2020-03-03: 0
2020-03-07: 0
2020-03-11: 2

Backfilled:

2020-03-03: 0.5
2020-03-07: 1
2020-03-11: 2
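
A minimal sketch of the backfill rule above (not the repo's actual implementation; the start date and 4-day interval follow the description):

from datetime import date, timedelta


def backfill(series, init_date=date(2020, 3, 3), interval=timedelta(days=4)):
    """Synthesize zero-case intervals as half of the following interval's count,
    and extend the series back to init_date the same way."""
    filled = dict(series)
    dates = sorted(filled)
    # Walk backwards: any zero-case interval becomes half of the next one.
    for i in range(len(dates) - 2, -1, -1):
        if filled[dates[i]] == 0:
            filled[dates[i]] = filled[dates[i + 1]] / 2
    # Extend earlier intervals back to init_date, halving at each step.
    earliest = dates[0]
    while earliest > init_date:
        earliest -= interval
        filled[earliest] = filled[earliest + interval] / 2
    return sorted(filled.items())


# backfill([(date(2020, 3, 3), 0), (date(2020, 3, 7), 0), (date(2020, 3, 11), 2)])
# -> [(2020-03-03, 0.5), (2020-03-07, 1.0), (2020-03-11, 2)]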

Document API schema.

Per https://docs.google.com/document/d/1VhQqMxtVYHQMIj9zFeb1DUQZ1aq1k4xIc_bYoTIqUb4 we want (at least lightweight) documentation on all of the data artifacts we currently produce.

This documentation should live in api/README.md in this repo and should at minimum describe the API URL endpoints (e.g. /county/WA.53033.json) and the JSON fields or CSV columns expected in the file / endpoint.

This is important to document the API interface between our model code and the website / other consumers of our data.

Write scheduled GitHub Action to run run.sh and deploy to S3.

The goal is to create a scheduled GitHub Action that re-runs our models to generate new API artifacts, and publishes them on data.covidactnow.org ~once daily.

It will need to:

  1. clone the covid-data-public repo (will git lfs be an issue?)
  2. run run.sh (see #123).
  3. Publish to S3 according to URL format / version schema specified in https://docs.google.com/document/d/1VhQqMxtVYHQMIj9zFeb1DUQZ1aq1k4xIc_bYoTIqUb4/edit

Notes:

  • There's some s3 python code here that might be useful. Or maybe there's a command-line tool that would be easier / better.
  • We need to generate a snapshot-id to stick in the ultimate S3 destination. This could be a precise timestamp or maybe github actions have a unique job ID we could use, or something.
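
A rough sketch of the publish step under those assumptions (the bucket name, key layout, and use of GITHUB_RUN_ID as the snapshot id are all illustrative):

import datetime
import os
import pathlib

import boto3


def publish_to_s3(output_dir: str, bucket: str = "data.covidactnow.org") -> str:
    """Upload every file in output_dir under a unique snapshot prefix."""
    # GitHub Actions exposes GITHUB_RUN_ID; fall back to a timestamp when run locally.
    snapshot_id = os.environ.get(
        "GITHUB_RUN_ID", datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    )
    s3 = boto3.client("s3")
    for path in pathlib.Path(output_dir).rglob("*"):
        if path.is_file():
            key = f"snapshot/{snapshot_id}/{path.relative_to(output_dir)}"
            s3.upload_file(str(path), bucket, key)
    return snapshot_id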

Combine mobility data?

Hi all,

Fantastic work!! This model is awesome.

I was wondering if you have considered using mobility data to inform whether a region is strictly or poorly complying with shelter-at-home orders?

The data used in this NYT article: https://www.nytimes.com/interactive/2020/04/02/us/coronavirus-social-distancing.html is free and can be found here: https://www.cuebiq.com/visitation-insights-covid19/

It looks like they provide a county-by-county estimate of the change in mobility over time. I presume a small change in mobility since the order was enacted means there is poor compliance.

Cheers,
Miles

Create run.sh script that generates all published data artifacts.

The goal is to write a single script that runs "everything" we need to run in order to produce all artifacts that we want to publish on data.covidactnow.org. This includes our website models and our DoD artifacts. Per @fridiculous we prefer this to be a bash script to keep it simple and discourage any kind of data processing, etc. in it. The resulting files in the output directory should match our API schema (which will be more formally defined as part of #124).

Syntax should be something like:

run.sh <location of covid-data-public repo> <output location>

As we clean up our internal plumbing, what the script does may change, but for now it probably needs to:

  • run run.py
  • run deploy_dod_dataset.py pointed at output directory.
  • Generate a version.json file that contains the current date/time, the git hash of covid-data-public, and the git hash of covid-data-model.
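
A sketch of the version.json step (shown in Python for illustration even though the wrapper itself would be bash; the field names are assumptions):

import datetime
import json
import subprocess


def git_hash(repo_dir: str) -> str:
    """Return the current commit hash of a local git checkout."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def write_version_json(output_path, data_public_dir, data_model_dir):
    version = {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "covid-data-public": git_hash(data_public_dir),
        "covid-data-model": git_hash(data_model_dir),
    }
    with open(output_path, "w") as f:
        json.dump(version, f, indent=2)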

Context: https://docs.google.com/document/d/1VhQqMxtVYHQMIj9zFeb1DUQZ1aq1k4xIc_bYoTIqUb4

JHUDataset county aggregation bug.

There's an issue where dates that have county data but not state data don't aggregate up correctly.

As a simple side-effect of this, if you run:

model_state('USA', 'CA', INTERVENTIONS[0])

then the results start with 2020-03-10 instead of 2020-03-03 because the prior dates only have county data.

Jupyter notebooks contain outputs

The jupyter notebooks on master right now have some saved output, but we're trying to keep all notebooks free of output to help prevent merge conflicts and git bloat.

@vincentwoo reported this because the githook to clean model output was always modifying those files.

Expose Public Policy Implementations Data

Example below. This is critical this week in order to run proper inferences.

import io
import logging
import os
import re

import pandas as pd
import requests

# Note: DATA_DIR is defined elsewhere in the original module.


def cache_public_implementations_data():
    """
    Pulled from https://github.com/JieYingWu/COVID-19_US_County-level_Summaries
    """
    logging.info('Downloading public implementations data')
    url = 'https://raw.githubusercontent.com/JieYingWu/COVID-19_US_County-level_Summaries/master/raw_data/national/public_implementations_fips.csv'

    data = requests.get(url, verify=False).content.decode('utf-8')
    data = re.sub(r',(\d+)-(\w+)', r',\1-\2-2020', data)  # NOTE: This assumes the year 2020

    date_cols = [
        'stay at home',
        '>50 gatherings',
        '>500 gatherings',
        'public schools',
        'restaurant dine-in',
        'entertainment/gym',
        'Federal guidelines',
        'foreign travel ban']
    df = pd.read_csv(io.StringIO(data), parse_dates=date_cols, dtype='str').drop(['Unnamed: 1', 'Unnamed: 2'], axis=1)
    df.columns = [col.replace('>', '').replace(' ', '_').replace('/', '_').lower() for col in df.columns]
    df.fips = df.fips.apply(lambda x: x.zfill(5))
    df.to_pickle(os.path.join(DATA_DIR, 'public_implementations_data.pkl'))

Expose County Level Age Distributions

I cannot recall where this method was, but it contains code to pull the age distribution by county and format a dataframe containing an array of this distribution plus the bins. This is needed for demographic mapping and our upcoming age-structured modeling.

#### Deprecated
# def cache_county_metadata():
#     """
#     Cache 2019 census data including age distribution by state/county FIPS.
#
#     # TODO Add pop density
#     """
#     print('Downloading county level population data')
#     county_summary = pd.read_csv(
#         'https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata.csv',
#         sep=',', encoding="ISO-8859-1", dtype='str', low_memory=False)
#
#     df = county_summary[county_summary.YEAR == '11'][['STATE', 'COUNTY', 'CTYNAME', 'AGEGRP', 'TOT_POP']]
#     df[['AGEGRP', 'TOT_POP']] = df[['AGEGRP', 'TOT_POP']].astype(int)
#     list_agg = df.sort_values(['STATE', 'COUNTY', 'CTYNAME', 'AGEGRP']) \
#         .groupby(['STATE', 'COUNTY', 'CTYNAME'])['TOT_POP'] \
#         .apply(np.array) \
#         .reset_index()
#     list_agg['TOTAL'] = list_agg['TOT_POP'].apply(lambda x: x[0])
#     list_agg['AGE_DISTRIBUTION'] = list_agg['TOT_POP'].apply(lambda x: x[1:])
#     list_agg.drop('TOT_POP', axis=1)
#
#     age_bins = list(range(0, 86, 5))
#     age_bins += [120]
#     list_agg['AGE_BIN_EDGES'] = [np.array(age_bins) for _ in
#                                  range(len(list_agg))]
#
#     list_agg.insert(0, 'fips', list_agg['STATE'] + list_agg['COUNTY'])
#     list_agg = list_agg.drop(['COUNTY', 'TOT_POP'], axis=1)
#     list_agg.columns = [col.lower() for col in list_agg.columns]
#     list_agg = list_agg.rename(
#         mapper={'ctyname': 'county_name', 'total': 'total_population'}, axis=1)
#     list_agg.to_pickle(os.path.join(DATA_DIR, 'covid_county_metadata.pkl'))

Expose ICU Beds and Utilization in DHBeds

Name says it all! We should just pull from the ESRI endpoint directly. Method here:

import json
import logging
import os
import urllib.request

import pandas as pd

# Note: DATA_DIR is defined elsewhere in the original module.


def cache_hospital_beds():
    """
    Pulled from "Definitive"
    See: https://services7.arcgis.com/LXCny1HyhQCUSueu/arcgis/rest/services/Definitive_Healthcare_Hospitals_Beds_Hospitals_Only/FeatureServer/0
    """
    logging.info('Downloading ICU capacity data.')
    url = 'http://opendata.arcgis.com/datasets/f3f76281647f4fbb8a0d20ef13b650ca_0.geojson'
    tmp_file = urllib.request.urlretrieve(url)[0]

    with open(tmp_file) as f:
        vals = json.load(f)
    df = pd.DataFrame([val['properties'] for val in vals['features']])
    df.columns = [col.lower() for col in df.columns]
    df = df.drop(['objectid', 'state_fips', 'cnty_fips'], axis=1)
    df.to_pickle(os.path.join(DATA_DIR, 'icu_capacity.pkl'))

Simplify/Standardize County -> Fips parsing

Right now the county-to-FIPS matching is pretty ugly and doesn't handle all of the cases.

When loading, some datasets will report when they were unable to match a county with a FIPS code:

Could not match ('VI', 'Saint Croix')
Could not match ('GU', 'Guam')
Could not match ('VI', 'Saint Thomas')
Could not match ('AS', 'American Samoa')
Could not match ('MP', 'Saipan')

One of the ugly parts is the diacritic parsing (thanks @davidstrauss for the vocab).
It looks like there is a library that deals with this pretty well: https://stackoverflow.com/a/2633310.
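
One way to strip diacritics with only the standard library (a sketch; the linked answer may recommend a dedicated library instead):

import unicodedata


def strip_diacritics(name: str) -> str:
    """Normalize to NFKD and drop combining marks, e.g. 'Doña Ana' -> 'Dona Ana'."""
    normalized = unicodedata.normalize("NFKD", name)
    return "".join(c for c in normalized if not unicodedata.combining(c))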

We should make sure that all counties are properly converted to FIPS codes or otherwise dealt with, potentially by assigning an unknown FIPS code to missing counties.

turn run.py into a notebook

Let's run the notebook via Google Colab.

The notebook should use the static files and pull in the latest data from coronadatascraper for time series data.

Defs should be in a cell by themselves to abstract/hide them,

then the model in another cell that's easy to manipulate.

The goal is easy manipulation of models and a notebook that can run the model to output .json files for ingestion into the website.

Add county-level support to timeseries data library

We currently support querying a timeseries by state, but it would be good to get the same in place for counties. The JHU data probably lacks this from March 10-20-ish, but the CDS data has counties for all dates. We should still be able to make a best-effort with the JHU data by getting what it has.

Enable CORS on data.covidactnow.org

Test page to demonstrate it fails: https://mike-stuff.web.app/test-data.html

Result (in JS Console):

Access to fetch at 'https://data.covidactnow.org/states.csv' from origin 'https://mike-stuff.web.app' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

We need to get CORS working, which probably means adding a Access-Control-Allow-Origin: * header.

Note that in a past lifetime I recall * had limitations and it ended up being better to instead set the header to whatever the incoming request's Origin was set to. If we do that, we also probably need to set Vary: Origin to avoid unwanted caching by proxies, etc. We should probably start with just setting it to *.
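
Assuming the files are served directly from an S3 bucket, a sketch of setting that policy via boto3 (bucket name is illustrative):

import boto3

s3 = boto3.client("s3")
# Start simple: allow any origin to GET/HEAD objects, as discussed above.
s3.put_bucket_cors(
    Bucket="data.covidactnow.org",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedMethods": ["GET", "HEAD"],
                "AllowedOrigins": ["*"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)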

Use Beds from Definitive Health rather than beds from legacy dataset

CovidDatasets is the legacy dataset library; the reasons we were keeping it around are to support the legacy model (which I'm pretty sure we're not using any more) and to avoid changing the beds number for state-level outputs.

Right now the beds data comes from an estimate that multiplies "bed density" * population. When aggregating the DH beds county-level data to the state level, we generally get a higher number per state:

VT - old: 1310, new: 1646
WA - old: 12945, new: 14107
WI - old: 12227, new: 18657
UT - old: 5771, new: 6839
TX - old: 66691, new: 85035
TN - old: 19816, new: 23673
KS - old: 9614, new: 12117
KY - old: 14297, new: 18818
MA - old: 15984, new: 21480
MD - old: 11487, new: 12231
MN - old: 14099, new: 16837
MO - old: 19026, new: 23482
NC - old: 22025, new: 30886
NE - old: 6964, new: 6804
NJ - old: 21317, new: 28814
NV - old: 6468, new: 8684
NY - old: 52525, new: 56175
OK - old: 11080, new: 14417
OR - old: 6748, new: 8904
PA - old: 37126, new: 44250
RI - old: 2225, new: 4129
SC - old: 12357, new: 14823
NH - old: 2855, new: 3419
IA - old: 9465, new: 11795
LA - old: 15341, new: 21129
OH - old: 32729, new: 40183
WY - old: 2026, new: 1661
WV - old: 6810, new: 8788
SD - old: 4246, new: 3365
ND - old: 3277, new: 3739
NM - old: 3774, new: 5728
MT - old: 3527, new: 3431
MS - old: 11905, new: 14030
MI - old: 24967, new: 27499
ME - old: 3361, new: 3884
ID - old: 3396, new: 3859
DE - old: 2142, new: 2877
AR - old: 9657, new: 12399
AK - old: 1609, new: 1695
DC - old: 3105, new: 4143
AL - old: 15200, new: 19028
PR - old: 8623, new: 8620

I'd guess that the DH numbers are probably more accurate, since the "beds per milli" figure has a lot of potential to change depending on what population statistics you are using.

Generate county JSON forecasts

Generate county JSON forecast files, but only when we have data to forecast in a county. Files should have the same format as state-level JSON and be named using FIPS codes.

Create Summary CSV of Inputs used on model generation

It's sometimes unclear:
a) why a specific county doesn't have data
b) where the numbers shown come from
c) what initial values are

I think it would be very helpful to have a csv generated (and presumably shared somewhere, could be a google sheet that is updated?) when model generation runs.

Can have individual state and county summary csvs:

State level:

Latest Date, Country, State, population, beds, cases, deaths, model results generated, case data source
2020-04-01, USA, MA, 10003, 1200, 1000, 122, True, JHU

County level:

Latest Date, Country, State, fips, name, population, beds, cases, deaths, model results generated, case data source
2020-04-01, USA, MA, 12345, Suffolk, 10003, 1200, 1000, 122, True, JHU

I think this can be done by modifying run.py and keeping track of the results as it runs.
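
A sketch of what that tracking could look like (column names follow the tables above; everything else, including the output path, is an assumption):

import pandas as pd

summary_rows = []


def record_state_summary(latest_date, country, state, population, beds, cases,
                         deaths, model_results_generated, case_data_source):
    """Collect one summary row per state while the model run progresses."""
    summary_rows.append({
        "latest_date": latest_date,
        "country": country,
        "state": state,
        "population": population,
        "beds": beds,
        "cases": cases,
        "deaths": deaths,
        "model_results_generated": model_results_generated,
        "case_data_source": case_data_source,
    })


# After the run finishes:
# pd.DataFrame(summary_rows).to_csv("output/state_inputs_summary.csv", index=False)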

"3 months of Stay at home (strict compliance)" model doesn't seem to match description

From the model reference, the shelter in place model makes these R assumptions: "1.3 for 4 weeks, 1.1 for 4 weeks, 0.8 for 4 weeks."

However the results for the "3 months of Stay at home (strict compliance)" model appear to temporarily accelerate after each assumed decrease in R, rather than start declining. This appears to be true in each county and state level model (while all examples screenshotted are in CA, this appears true across every state I've checked).

[Screenshots taken 2020-04-05 showing the "3 months of Stay at home (strict compliance)" curves for several CA state and county models]

Add county-level hospital capacity support

Our current hospital bed capacity interface only supports states (and queries a CSV that only has states). We have better data now, but it isn't exposed in any APIs yet.

So, this issue covers adding an API to get county-level hospital data from the linked source.

Add population count checksum to model.

The total number of people should be constant as individuals move between susceptible, infected, recovered, etc. from cycle to cycle. We should validate this in the model code and throw an exception if the invariant doesn't hold.
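
A sketch of such a check (compartment names and the tolerance are illustrative; the actual compartments live in the pyseir model code):

import numpy as np


def check_population_conserved(compartments: dict, total_population: float, rtol: float = 1e-6):
    """Raise if the sum over all compartments drifts from the total population."""
    total = np.sum([np.asarray(series) for series in compartments.values()], axis=0)
    if not np.allclose(total, total_population, rtol=rtol):
        raise ValueError("Population is not conserved across model compartments")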

CDS data update broke our pipeline yesterday.

It looks like this was a transient issue. If I download the latest CDS data, then our pipeline works. But we may want to look into what went wrong and whether we should have better error handling in our own code.

  1. git checkout 18ff6a9 in your covid-data-public repo.
  2. Run python validate.py -d ~/src/covid-data-public in covid-data-model repo.

Result:

run.py failed with code 1
INFO:libs.CovidDatasets:NO COUNTY DATA: 2020-03-12 00:00:00
INFO:libs.CovidDatasets:NO COUNTY DATA: 2020-03-04 00:00:00
...
INFO:libs.CovidDatasets:NO COUNTY DATA: 2020-03-04 00:00:00
INFO:libs.CovidDatasets:NO COUNTY DATA: 2020-03-03 00:00:00
Traceback (most recent call last):
  File "/Users/mikelehen/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'date'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 84, in <module>
    model_state('USA', state, intervention),
  File "run.py", line 56, in model_state
    'timeseries': Dataset.get_timeseries_by_country_state(country, state, MODEL_INTERVAL),
  File "/Users/mikelehen/tmp/covid-data-model/libs/CovidDatasets.py", line 205, in get_timeseries_by_country_state
    return self.prep_data(self.combine_state_county_data(country, state), model_interval)
  File "/Users/mikelehen/tmp/covid-data-model/libs/CovidDatasets.py", line 148, in prep_data
    return self.backfill(self.cutoff(series), model_interval)
  File "/Users/mikelehen/tmp/covid-data-model/libs/CovidDatasets.py", line 137, in backfill
    self.backfill_to_init_date(series, model_interval),
  File "/Users/mikelehen/tmp/covid-data-model/libs/CovidDatasets.py", line 93, in backfill_to_init_date
    min_interval_row = interval_rows[interval_rows[self.DATE_FIELD] == interval_rows[self.DATE_FIELD].min()].iloc[0]
  File "/Users/mikelehen/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/mikelehen/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'date'

Age demographics for states or counties

https://covidactnow.org/faq says that a current limitation of the model is "Demographics for the U.S. as a whole are used, rather than specific to each state."

State- and county-level population counts by age are available from the US Census Bureau. For example, at https://www.census.gov/data/datasets/time-series/demo/popest/2010s-counties-detail.html, one could grab the 2018 population estimates by county for the following 18 age brackets: Under 5, 5 to 9, 10 to 14, 15 to 19, 20 to 24, 25 to 29, 30 to 34, 35 to 39, 40 to 44, 45 to 49, 50 to 54, 55 to 59, 60 to 64, 65 to 69, 70 to 74, 75 to 79, 80 to 84, 85 and over

To prepare the data to be used by your model, in what format would you want the population count for each county-by-agebracket observation?

Testing - automated and smoke tests

We need to think about basic testing within our scripts to ensure we are creating everything correctly. A few items we can test:

  • Integrity of input files in the dataset
  • State count of output .json file
  • size of the .json files (i.e. not null)
  • date span (not dropping dates)

Let's start with some basic smoke tests (e.g. ensuring 1+1 per model .json for each state).
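
A possible starting point (a sketch; the output location and per-state expectations are assumptions):

import json
import pathlib


def test_state_outputs_exist_and_are_nonempty():
    output_dir = pathlib.Path("results/test")  # assumed output location
    files = sorted(output_dir.glob("*.json"))
    assert files, "no model output .json files were generated"
    for path in files:
        data = json.loads(path.read_text())
        assert data, f"{path} is empty"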

Expose Descartes Labs Mobility Data

Probably needed next week.

import logging
import os

import numpy as np
import pandas as pd

# Note: DATA_DIR is defined elsewhere in the original module.


def cache_mobility_data():
    """
    Pulled from https://github.com/descarteslabs/DL-COVID-19
    """
    logging.info('Downloading mobility data.')
    url = 'https://raw.githubusercontent.com/descarteslabs/DL-COVID-19/master/DL-us-mobility-daterow.csv'

    dtypes_mapping = {
        'country_code': str,
        'admin_level': int,
        'admin1': str,
        'admin2': str,
        'fips': str,
        'samples': int,
        'm50': float,
        'm50_index': float}

    df = pd.read_csv(filepath_or_buffer=url, parse_dates=['date'], dtype=dtypes_mapping)
    df__m50 = df.query('admin_level == 2')[['fips', 'date', 'm50']]
    df__m50_index = df.query('admin_level == 2')[['fips', 'date', 'm50_index']]
    df__m50__final = df__m50.groupby('fips').agg(list).reset_index()
    df__m50_index__final = df__m50_index.groupby('fips').agg(list).reset_index()
    df__m50__final['m50'] = df__m50__final['m50'].apply(lambda x: np.array(x))
    df__m50_index__final['m50_index'] = df__m50_index__final['m50_index'].apply(lambda x: np.array(x))

    df__m50__final.to_pickle(os.path.join(DATA_DIR, 'mobility_data__m50.pkl'))
    df__m50_index__final.to_pickle(os.path.join(DATA_DIR, 'mobility_data__m50_index.pkl'))

validate.py fails if results/test doesn't already exist.

$ python validate.py -d ~/src/covid-data-public
Traceback (most recent call last):
  File "validate.py", line 64, in <module>
    clear_result_dir(build_params.OUTPUT_DIR)
  File "validate.py", line 25, in clear_result_dir
    for f in os.listdir(result_dir):
FileNotFoundError: [Errno 2] No such file or directory: 'results/test'

Add ability to source data from a specific snapshot of the data repository

Currently, it can only source from HEAD on master, but the current system is designed to support "pinning" a forecast run on a specific combination (single commit) of source data.

Also, we should probably drop the ability to filter data after a certain date once this is added. Pinning is better than trimming because even old numbers can get fixed, which is great for data quality but awful for reproducibility.

Extract run_dod_dataset.py script from deploy_dod_dataset.py

It looks like right now deploy_dod_dataset.py both generates the DoD dataset (leveraging libs.build_dod_dataset) and deploys it directly to S3.

We want to separate this out so that we can generate the dataset into local files (so that another script can then push it to the right location in s3). Context: https://docs.google.com/document/d/1VhQqMxtVYHQMIj9zFeb1DUQZ1aq1k4xIc_bYoTIqUb4

It should somehow be possible to configure it to use a specified local location for covid-data-public repo as well as a specified output directory.

Decide on Staffed vs. Licensed Beds

Most counties seem to have more licensed beds than staffed ones, but quite a few have the inverse. We should decide on which we rely on, or some combination thereof.

We're currently using licensed beds. Theoretically, that should include all beds that are allowed to operate (should the staffing materialize), but the data don't work that way in practice.
