
covid-data-model's Introduction

COVID-19 Data Pipeline

COVID data pipeline / API supporting https://covidactnow.org/.

It ingests data scraped via https://github.com/covid-projections/can-scrapers, combines it, calculates metrics, and generates data files for the Covid Act Now API and website.

Development

Setup

Detailed setup instructions can be found here.

Local development

Normally the pipeline is run via GitHub Actions on a beefy cloud VM and still takes 2+ hours. When developing locally it is often useful to run the pipeline on a subset of locations and/or to skip pipeline steps.

To run the pipeline end-to-end but only generate data for Connecticut state / counties, you can run:

# Fetches latest scraped data from can-scrapers and combines all data sources
# into a combined dataset, runs all filters, etc.  Adding --no-refresh-datasets
# will make this much faster but skips fetching / combining latest datasets.
./run.py data update --states CT

# Runs the pyseir code to generate the infection rate, i.e. r(t) metric data for locations.
python ./pyseir/cli.py build-all --states=CT

# Runs the API generation
./run.py api generate-api-v2 --state CT output -o output/api

# API files are generated to output/api.

Downloading Model Run Data

If you just want to run the API generation, you can skip the first two steps above by downloading the pyseir model results from a previous snapshot. You can download the pyseir model output from a recent GitHub Actions run with:

export GITHUB_TOKEN=<YOUR PERSONAL GITHUB TOKEN>
./run.py utils download-model-artifact

By default it downloads the artifact from the most recent run, but you can choose a specific run with --run-number, e.g. ./run.py utils download-model-artifact --run-number <RUN NUMBER>.

Running PySEIR

PySEIR provides a command line interface in the activated environment. You can access the model with pyseir --help; pyseir <subcommand> --help provides more information on each subcommand.

Example: pyseir build-all --states="NY" will run state and county models for New York. States can be specified by full name or two-letter code: --states="New York" and --states=NY are equivalent.

pyseir build-all --states=NY --fips=36061 will run the New York state model and the model for the specified FIPS code (in this case New York County).

Check the output/ folder for results.

Model Output

A variety of output artifacts are written to the paths described in pyseir/utils.py. The main artifact is the ensemble_result, which contains the output for each suppression policy and model compartment as well as capacity information.

API Documentation

We host an API documentation site available in api/docs. It is a static site built using Docusaurus 2.

Additionally, we define the API output using pydantic schemas and generate OpenAPI specs (default output: api/docs/open_api_schema.json) and JSON Schema outputs (default output: api/schemas_v2/).

When modifying the API schema, run ./run.py api update-schemas to update the schemas.
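
For reference, here is a minimal sketch of how a pydantic model (v1-style) emits JSON Schema; the RegionSummary model and its fields are purely illustrative, not the actual API schema:

import pydantic


class RegionSummary(pydantic.BaseModel):
    # Illustrative fields only; the real schemas live under the api/ package.
    fips: str
    state: str
    cases: int
    deaths: int


# Emits a JSON Schema document, similar in spirit to the files that
# ./run.py api update-schemas regenerates under api/schemas_v2/.
print(RegionSummary.schema_json(indent=2))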

Simple setup

Build environment:

$ cd api/docs
$ yarn

Start server locally:

$ cd api/docs
$ yarn start

Deploy update to apidocs.covidactnow.org:

$ tools/deploy-docs.sh


covid-data-model's Issues

Current method of copying covid-data-public locally doesn't support Git LFS

Downloading a zip of the repository downloads the pointers to large files tracked with git lfs, but not the files themselves. Since the zip is also not itself a repository, it is not immediately possible to fetch the large files. We should consider either what needs to be done to fetch the large files after the zip is downloaded, or the possibility of cloning enough of the repo to also download the large files as we need them.

This is currently breaking support for the COVID_DATA_PUBLIC environment variable on the dod branch, since it depends on shapefiles stored via git lfs in covid-data-public.
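
One possible workaround (a sketch, not a decided approach) is to clone shallowly with LFS smudging disabled and then fetch the LFS objects explicitly, instead of downloading a zip:

import os
import subprocess


def clone_with_lfs(repo_url: str, dest: str) -> None:
    """Shallow-clone without downloading LFS objects, then fetch them explicitly."""
    env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
    subprocess.run(["git", "clone", "--depth", "1", repo_url, dest], env=env, check=True)
    subprocess.run(["git", "lfs", "pull"], cwd=dest, check=True)


# e.g. clone_with_lfs("https://github.com/covid-projections/covid-data-public.git", "covid-data-public")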

Port Legacy Model to Updated Data Sources

  • Re-implement backfill on new JHU/CDS/NYT sources
  • Retarget run_old_model.py to the new source library
  • Retarget notebook_runtime.ipynb to the new source library

How backfill works:

  1. Identify the first day with >0 cases that is either 2020-03-03 or a multiple of four days after (2020-03-07, 2020-03-11, ...).
  2. Ensure prior dates with zero cases get synthesized to have half of the next interval's amount.
  3. Backfill until reaching 2020-03-03.

Empirical data for cases:

2020-03-03: 0
2020-03-07: 0
2020-03-11: 2

Backfilled:

2020-03-03: 0.5
2020-03-07: 1
2020-03-11: 2
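
A minimal sketch of the backfill rule above (not the repo's actual implementation; the start date and 4-day interval follow the description):

from datetime import date, timedelta


def backfill(series, init_date=date(2020, 3, 3), interval=timedelta(days=4)):
    """Synthesize zero-case intervals as half of the following interval's count,
    and extend the series back to init_date the same way."""
    filled = dict(series)
    dates = sorted(filled)
    # Walk backwards: any zero-case interval becomes half of the next one.
    for i in range(len(dates) - 2, -1, -1):
        if filled[dates[i]] == 0:
            filled[dates[i]] = filled[dates[i + 1]] / 2
    # Extend earlier intervals back to init_date, halving at each step.
    earliest = dates[0]
    while earliest > init_date:
        earliest -= interval
        filled[earliest] = filled[earliest + interval] / 2
    return sorted(filled.items())


# backfill([(date(2020, 3, 3), 0), (date(2020, 3, 7), 0), (date(2020, 3, 11), 2)])
# -> [(2020-03-03, 0.5), (2020-03-07, 1.0), (2020-03-11, 2)]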

Document API schema.

Per https://docs.google.com/document/d/1VhQqMxtVYHQMIj9zFeb1DUQZ1aq1k4xIc_bYoTIqUb4 we want (at least lightweight) documentation on all of the data artifacts we currently produce.

This documentation should live in api/README.md in this repo and should at minimum describe the API URL endpoints (e.g. /county/WA.53033.json) and the JSON fields or CSV columns expected in the file / endpoint.

This is important to document the API interface between our model code and the website / other consumers of our data.

Write scheduled GitHub Action to run run.sh and deploy to S3.

The goal is to create a scheduled GitHub Action that re-runs our models to generate new API artifacts, and publishes them on data.covidactnow.org ~once daily.

It will need to:

  1. clone the covid-data-public repo (will git lfs be an issue?)
  2. run run.sh (see #123).
  3. Publish to S3 according to URL format / version schema specified in https://docs.google.com/document/d/1VhQqMxtVYHQMIj9zFeb1DUQZ1aq1k4xIc_bYoTIqUb4/edit

Notes:

  • There's some s3 python code here that might be useful. Or maybe there's a command-line tool that would be easier / better.
  • We need to generate a snapshot-id to stick in the ultimate S3 destination. This could be a precise timestamp or maybe github actions have a unique job ID we could use, or something.
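
A rough sketch of the publish step under those assumptions (the bucket name, key layout, and use of GITHUB_RUN_ID as the snapshot id are all illustrative):

import datetime
import os
import pathlib

import boto3


def publish_to_s3(output_dir: str, bucket: str = "data.covidactnow.org") -> str:
    """Upload every file in output_dir under a unique snapshot prefix."""
    # GitHub Actions exposes GITHUB_RUN_ID; fall back to a timestamp when run locally.
    snapshot_id = os.environ.get(
        "GITHUB_RUN_ID", datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    )
    s3 = boto3.client("s3")
    for path in pathlib.Path(output_dir).rglob("*"):
        if path.is_file():
            key = f"snapshot/{snapshot_id}/{path.relative_to(output_dir)}"
            s3.upload_file(str(path), bucket, key)
    return snapshot_id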

Combine mobility data?

Hi all,

Fantastic work!! This model is awesome.

I was wondering if you have considered using mobility data to inform whether a region is strictly or poorly complying with shelter-at-home orders?

The data used in this NYT article: https://www.nytimes.com/interactive/2020/04/02/us/coronavirus-social-distancing.html is free and can be found here: https://www.cuebiq.com/visitation-insights-covid19/

It looks like they provide a county-by-county estimate of the change in mobility over time. I presume a small change in mobility since the order was enacted means there is poor compliance.

Cheers,
Miles

Create run.sh script that generates all published data artifacts.

The goal is to write a single script that runs "everything" we need to run in order to produce all artifacts that we want to publish on data.covidactnow.org. This includes our website models and our DoD artifacts. Per @fridiculous we prefer this to be a bash script to keep it simple and discourage any kind of data processing, etc. in it. The resulting files in the output directory should match our API schema (which will be more formally defined as part of #124).

Syntax should be something like:

run.sh <location of covid-data-public repo> <output location>

As we clean up our internal plumbing, what the script does may change, but for now it probably needs to:

  • run run.py
  • run deploy_dod_dataset.py pointed at output directory.
  • Generate a version.json file that contains the current date/time, the git hash of covid-data-public, and the git hash of covid-data-model.
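
A sketch of the version.json step (shown in Python for illustration even though the wrapper itself would be bash; the field names are assumptions):

import datetime
import json
import subprocess


def git_hash(repo_dir: str) -> str:
    """Return the current commit hash of a local git checkout."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def write_version_json(output_path, data_public_dir, data_model_dir):
    version = {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "covid-data-public": git_hash(data_public_dir),
        "covid-data-model": git_hash(data_model_dir),
    }
    with open(output_path, "w") as f:
        json.dump(version, f, indent=2)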

Context: https://docs.google.com/document/d/1VhQqMxtVYHQMIj9zFeb1DUQZ1aq1k4xIc_bYoTIqUb4

JHUDataset county aggregation bug.

There's an issue where dates that have county data but not state data don't aggregate up correctly.

As a simple side-effect of this, if you run:

model_state('USA', 'CA', INTERVENTIONS[0])

then the results start with 2020-03-10 instead of 2020-03-03 because the prior dates only have county data.

Jupyter notebooks contain outputs

The jupyter notebooks on master right now have some saved output, but we're trying to keep all notebooks free of output to help prevent merge conflicts and git bloat.

@vincentwoo reported this because the githook to clean model output was always modifying those files.

Expose Public Policy Implementations Data

Example below. This is critical this week in order to run proper inferences.

import io
import logging
import os
import re

import pandas as pd
import requests

# Note: DATA_DIR is defined elsewhere in the original module.


def cache_public_implementations_data():
    """
    Pulled from https://github.com/JieYingWu/COVID-19_US_County-level_Summaries
    """
    logging.info('Downloading public implementations data')
    url = 'https://raw.githubusercontent.com/JieYingWu/COVID-19_US_County-level_Summaries/master/raw_data/national/public_implementations_fips.csv'

    data = requests.get(url, verify=False).content.decode('utf-8')
    data = re.sub(r',(\d+)-(\w+)', r',\1-\2-2020', data)  # NOTE: This assumes the year 2020

    date_cols = [
        'stay at home',
        '>50 gatherings',
        '>500 gatherings',
        'public schools',
        'restaurant dine-in',
        'entertainment/gym',
        'Federal guidelines',
        'foreign travel ban']
    df = pd.read_csv(io.StringIO(data), parse_dates=date_cols, dtype='str').drop(['Unnamed: 1', 'Unnamed: 2'], axis=1)
    df.columns = [col.replace('>', '').replace(' ', '_').replace('/', '_').lower() for col in df.columns]
    df.fips = df.fips.apply(lambda x: x.zfill(5))
    df.to_pickle(os.path.join(DATA_DIR, 'public_implementations_data.pkl'))

Expose County Level Age Distributions

I cannot recall where this method was, but it contains code to pull the age distribution by county and format a dataframe containing an array of this distribution plus the bins. This is needed for demographic mapping and our upcoming age-structured modeling.

#### Deprecated
# def cache_county_metadata():
#     """
#     Cache 2019 census data including age distribution by state/county FIPS.
#
#     # TODO Add pop density
#     """
#     print('Downloading county level population data')
#     county_summary = pd.read_csv(
#         'https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata.csv',
#         sep=',', encoding="ISO-8859-1", dtype='str', low_memory=False)
#
#     df = county_summary[county_summary.YEAR == '11'][['STATE', 'COUNTY', 'CTYNAME', 'AGEGRP', 'TOT_POP']]
#     df[['AGEGRP', 'TOT_POP']] = df[['AGEGRP', 'TOT_POP']].astype(int)
#     list_agg = df.sort_values(['STATE', 'COUNTY', 'CTYNAME', 'AGEGRP']) \
#         .groupby(['STATE', 'COUNTY', 'CTYNAME'])['TOT_POP'] \
#         .apply(np.array) \
#         .reset_index()
#     list_agg['TOTAL'] = list_agg['TOT_POP'].apply(lambda x: x[0])
#     list_agg['AGE_DISTRIBUTION'] = list_agg['TOT_POP'].apply(lambda x: x[1:])
#     list_agg.drop('TOT_POP', axis=1)
#
#     age_bins = list(range(0, 86, 5))
#     age_bins += [120]
#     list_agg['AGE_BIN_EDGES'] = [np.array(age_bins) for _ in
#                                  range(len(list_agg))]
#
#     list_agg.insert(0, 'fips', list_agg['STATE'] + list_agg['COUNTY'])
#     list_agg = list_agg.drop(['COUNTY', 'TOT_POP'], axis=1)
#     list_agg.columns = [col.lower() for col in list_agg.columns]
#     list_agg = list_agg.rename(
#         mapper={'ctyname': 'county_name', 'total': 'total_population'}, axis=1)
#     list_agg.to_pickle(os.path.join(DATA_DIR, 'covid_county_metadata.pkl'))

Expose ICU Beds and Utilization in DHBeds

Name says it all! We should just pull from the ESRI endpoint directly. Method here:

import json
import logging
import os
import urllib.request

import pandas as pd

# Note: DATA_DIR is defined elsewhere in the original module.


def cache_hospital_beds():
    """
    Pulled from "Definitive"
    See: https://services7.arcgis.com/LXCny1HyhQCUSueu/arcgis/rest/services/Definitive_Healthcare_Hospitals_Beds_Hospitals_Only/FeatureServer/0
    """
    logging.info('Downloading ICU capacity data.')
    url = 'http://opendata.arcgis.com/datasets/f3f76281647f4fbb8a0d20ef13b650ca_0.geojson'
    tmp_file = urllib.request.urlretrieve(url)[0]

    with open(tmp_file) as f:
        vals = json.load(f)
    df = pd.DataFrame([val['properties'] for val in vals['features']])
    df.columns = [col.lower() for col in df.columns]
    df = df.drop(['objectid', 'state_fips', 'cnty_fips'], axis=1)
    df.to_pickle(os.path.join(DATA_DIR, 'icu_capacity.pkl'))

Simplify/Standardize County -> Fips parsing

Right now the county-to-FIPS matching is pretty ugly and doesn't handle all of the cases.

When loading, some datasets will report when they were unable to match a county with a FIPS code:

Could not match ('VI', 'Saint Croix')
Could not match ('GU', 'Guam')
Could not match ('VI', 'Saint Thomas')
Could not match ('AS', 'American Samoa')
Could not match ('MP', 'Saipan')

One of the ugly parts is the diacritic parsing (thanks @davidstrauss for the vocab).
It looks like there is a library that deals with this pretty well: https://stackoverflow.com/a/2633310.
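
One way to strip diacritics with only the standard library (a sketch; the linked answer may recommend a dedicated library instead):

import unicodedata


def strip_diacritics(name: str) -> str:
    """Normalize to NFKD and drop combining marks, e.g. 'Doña Ana' -> 'Dona Ana'."""
    normalized = unicodedata.normalize("NFKD", name)
    return "".join(c for c in normalized if not unicodedata.combining(c))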

We should make sure that all counties are properly converted to FIPS codes or otherwise dealt with, potentially by assigning an unknown FIPS code to missing counties.

turn run.py into a notebook

Let's run the notebook via Google Colab.

The notebook should use the static files and pull in the latest data from coronadatascraper for time series data.

Defs should be in a cell by themselves to abstract/hide them,

then the model in another cell that's easy to manipulate.

The goal is easy manipulation of models and a notebook that can run the model to output .json files for ingestion into the website.

Add county-level support to timeseries data library

We currently support querying a timeseries by state, but it would be good to get the same in place for counties. The JHU data probably lacks this from March 10-20-ish, but the CDS data has counties for all dates. We should still be able to make a best-effort with the JHU data by getting what it has.

Enable CORS on data.covidactnow.org

Test page to demonstrate it fails: https://mike-stuff.web.app/test-data.html

Result (in JS Console):

Access to fetch at 'https://data.covidactnow.org/states.csv' from origin 'https://mike-stuff.web.app' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

We need to get CORS working, which probably means adding a Access-Control-Allow-Origin: * header.

Note that in a past lifetime I recall * had limitations and it ended up being better to instead set the header to whatever the incoming request's Origin was set to. If we do that, we also probably need to set Vary: Origin to avoid unwanted caching by proxies, etc. We should probably start with just setting it to *.
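
Assuming the files are served directly from an S3 bucket, a sketch of setting that policy via boto3 (bucket name is illustrative):

import boto3

s3 = boto3.client("s3")
# Start simple: allow any origin to GET/HEAD objects, as discussed above.
s3.put_bucket_cors(
    Bucket="data.covidactnow.org",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedMethods": ["GET", "HEAD"],
                "AllowedOrigins": ["*"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)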

Use Beds from Definitive Health rather than beds from legacy dataset

CovidDatasets is the legacy dataset library; the reasons we were keeping it around are to support the legacy model (which I'm pretty sure we're not using any more) and to avoid changing the beds number for state-level outputs.

Right now the beds data comes from an estimate that multiplies "bed density" * population. When aggregating the DH beds county-level data to the state level, we generally get a higher number per state:

VT - old: 1310, new: 1646
WA - old: 12945, new: 14107
WI - old: 12227, new: 18657
UT - old: 5771, new: 6839
TX - old: 66691, new: 85035
TN - old: 19816, new: 23673
KS - old: 9614, new: 12117
KY - old: 14297, new: 18818
MA - old: 15984, new: 21480
MD - old: 11487, new: 12231
MN - old: 14099, new: 16837
MO - old: 19026, new: 23482
NC - old: 22025, new: 30886
NE - old: 6964, new: 6804
NJ - old: 21317, new: 28814
NV - old: 6468, new: 8684
NY - old: 52525, new: 56175
OK - old: 11080, new: 14417
OR - old: 6748, new: 8904
PA - old: 37126, new: 44250
RI - old: 2225, new: 4129
SC - old: 12357, new: 14823
NH - old: 2855, new: 3419
IA - old: 9465, new: 11795
LA - old: 15341, new: 21129
OH - old: 32729, new: 40183
WY - old: 2026, new: 1661
WV - old: 6810, new: 8788
SD - old: 4246, new: 3365
ND - old: 3277, new: 3739
NM - old: 3774, new: 5728
MT - old: 3527, new: 3431
MS - old: 11905, new: 14030
MI - old: 24967, new: 27499
ME - old: 3361, new: 3884
ID - old: 3396, new: 3859
DE - old: 2142, new: 2877
AR - old: 9657, new: 12399
AK - old: 1609, new: 1695
DC - old: 3105, new: 4143
AL - old: 15200, new: 19028
PR - old: 8623, new: 8620

I'd guess that the DH numbers are probably more accurate, since the "beds per milli" figure has a lot of potential to change depending on what population statistics you are using.

Generate county JSON forecasts

Generate county JSON forecast files, but only when we have data to forecast in a county. Files should have the same format as state-level JSON and be named using FIPS codes.

Create Summary CSV of Inputs used on model generation

It's sometimes unclear:
a) why a specific county doesn't have data
b) where the numbers shown come from
c) what initial values are

I think it would be very helpful to have a csv generated (and presumably shared somewhere, could be a google sheet that is updated?) when model generation runs.

Can have individual state and county summary csvs:

State level:

Latest Date, Country, State, population, beds, cases, deaths, model results generated, case data source
2020-04-01, USA, MA, 10003, 1200, 1000, 122, True, JHU

County level:

Latest Date, Country, State, fips, name, population, beds, cases, deaths, model results generated, case data source
2020-04-01, USA, MA, 12345, Suffolk, 10003, 1200, 1000, 122, True, JHU

I think this can be done by modifying run.py and keeping track of the results as it runs.
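
A sketch of what that tracking could look like (column names follow the tables above; everything else, including the output path, is an assumption):

import pandas as pd

summary_rows = []


def record_state_summary(latest_date, country, state, population, beds, cases,
                         deaths, model_results_generated, case_data_source):
    """Collect one summary row per state while the model run progresses."""
    summary_rows.append({
        "latest_date": latest_date,
        "country": country,
        "state": state,
        "population": population,
        "beds": beds,
        "cases": cases,
        "deaths": deaths,
        "model_results_generated": model_results_generated,
        "case_data_source": case_data_source,
    })


# After the run finishes:
# pd.DataFrame(summary_rows).to_csv("output/state_inputs_summary.csv", index=False)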

"3 months of Stay at home (strict compliance)" model doesn't seem to match description

From the model reference, the shelter in place model makes these R assumptions: "1.3 for 4 weeks, 1.1 for 4 weeks, 0.8 for 4 weeks."

However the results for the "3 months of Stay at home (strict compliance)" model appear to temporarily accelerate after each assumed decrease in R, rather than start declining. This appears to be true in each county and state level model (while all examples screenshotted are in CA, this appears true across every state I've checked).

[Screenshots taken 2020-04-05 showing the "3 months of Stay at home (strict compliance)" curves for several CA state and county models]

Add county-level hospital capacity support

Our current hospital bed capacity interface only supports states (and queries a CSV that only has states). We have better data now, but it isn't exposed in any APIs yet.

So, this issue covers adding an API to get county-level hospital data from the linked source.

Add population count checksum to model.

The total number of people should be constant as individuals move between susceptible, infected, recovered, etc. from cycle to cycle. We should validate this in the model code and throw an exception if the invariant doesn't hold.
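
A sketch of such a check (compartment names and the tolerance are illustrative; the actual compartments live in the pyseir model code):

import numpy as np


def check_population_conserved(compartments: dict, total_population: float, rtol: float = 1e-6):
    """Raise if the sum over all compartments drifts from the total population."""
    total = np.sum([np.asarray(series) for series in compartments.values()], axis=0)
    if not np.allclose(total, total_population, rtol=rtol):
        raise ValueError("Population is not conserved across model compartments")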

CDS data update broke our pipeline yesterday.

It looks like this was a transient issue. If I download the latest CDS data, then our pipeline works. But we may want to look into what went wrong and whether we should have better error handling in our own code.

  1. git checkout 18ff6a9 in your covid-data-public repo.
  2. Run python validate.py -d ~/src/covid-data-public in covid-data-model repo.

Result:

run.py failed with code 1
INFO:libs.CovidDatasets:NO COUNTY DATA: 2020-03-12 00:00:00
INFO:libs.CovidDatasets:NO COUNTY DATA: 2020-03-04 00:00:00
...
INFO:libs.CovidDatasets:NO COUNTY DATA: 2020-03-04 00:00:00
INFO:libs.CovidDatasets:NO COUNTY DATA: 2020-03-03 00:00:00
Traceback (most recent call last):
  File "/Users/mikelehen/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'date'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 84, in <module>
    model_state('USA', state, intervention),
  File "run.py", line 56, in model_state
    'timeseries': Dataset.get_timeseries_by_country_state(country, state, MODEL_INTERVAL),
  File "/Users/mikelehen/tmp/covid-data-model/libs/CovidDatasets.py", line 205, in get_timeseries_by_country_state
    return self.prep_data(self.combine_state_county_data(country, state), model_interval)
  File "/Users/mikelehen/tmp/covid-data-model/libs/CovidDatasets.py", line 148, in prep_data
    return self.backfill(self.cutoff(series), model_interval)
  File "/Users/mikelehen/tmp/covid-data-model/libs/CovidDatasets.py", line 137, in backfill
    self.backfill_to_init_date(series, model_interval),
  File "/Users/mikelehen/tmp/covid-data-model/libs/CovidDatasets.py", line 93, in backfill_to_init_date
    min_interval_row = interval_rows[interval_rows[self.DATE_FIELD] == interval_rows[self.DATE_FIELD].min()].iloc[0]
  File "/Users/mikelehen/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/mikelehen/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'date'

Age demographics for states or counties

https://covidactnow.org/faq says that a current limitation of the model is "Demographics for the U.S. as a whole are used, rather than specific to each state."

State- and county-level population counts by age are available from the US Census Bureau. For example, at https://www.census.gov/data/datasets/time-series/demo/popest/2010s-counties-detail.html, one could grab the 2018 population estimates by county for the following 18 age brackets: Under 5, 5 to 9, 10 to 14, 15 to 19, 20 to 24, 25 to 29, 30 to 34, 35 to 39, 40 to 44, 45 to 49, 50 to 54, 55 to 59, 60 to 64, 65 to 69, 70 to 74, 75 to 79, 80 to 84, 85 and over

To prepare the data to be used by your model, in what format would you want the population count for each county-by-agebracket observation?

Testing - automated and smoke tests

We need to think about basic testing within our scripts to ensure we are creating everything correctly. A few items we can test:

  • Integrity of input files in the dataset
  • State count of output .json file
  • size of the .json files (i.e. not null)
  • date span (not dropping dates)

Let's start with some basic smoke tests (e.g. ensuring 1+1 per model .json for each state).
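
A possible starting point (a sketch; the output location and per-state expectations are assumptions):

import json
import pathlib


def test_state_outputs_exist_and_are_nonempty():
    output_dir = pathlib.Path("results/test")  # assumed output location
    files = sorted(output_dir.glob("*.json"))
    assert files, "no model output .json files were generated"
    for path in files:
        data = json.loads(path.read_text())
        assert data, f"{path} is empty"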

Expose Descartes Labs Mobility Data

Probably needed next week.

import logging
import os

import numpy as np
import pandas as pd

# Note: DATA_DIR is defined elsewhere in the original module.


def cache_mobility_data():
    """
    Pulled from https://github.com/descarteslabs/DL-COVID-19
    """
    logging.info('Downloading mobility data.')
    url = 'https://raw.githubusercontent.com/descarteslabs/DL-COVID-19/master/DL-us-mobility-daterow.csv'

    dtypes_mapping = {
        'country_code': str,
        'admin_level': int,
        'admin1': str,
        'admin2': str,
        'fips': str,
        'samples': int,
        'm50': float,
        'm50_index': float}

    df = pd.read_csv(filepath_or_buffer=url, parse_dates=['date'], dtype=dtypes_mapping)
    df__m50 = df.query('admin_level == 2')[['fips', 'date', 'm50']]
    df__m50_index = df.query('admin_level == 2')[['fips', 'date', 'm50_index']]
    df__m50__final = df__m50.groupby('fips').agg(list).reset_index()
    df__m50_index__final = df__m50_index.groupby('fips').agg(list).reset_index()
    df__m50__final['m50'] = df__m50__final['m50'].apply(lambda x: np.array(x))
    df__m50_index__final['m50_index'] = df__m50_index__final['m50_index'].apply(lambda x: np.array(x))

    df__m50__final.to_pickle(os.path.join(DATA_DIR, 'mobility_data__m50.pkl'))
    df__m50_index__final.to_pickle(os.path.join(DATA_DIR, 'mobility_data__m50_index.pkl'))

validate.py fails if results/test doesn't already exist.

$ python validate.py -d ~/src/covid-data-public
Traceback (most recent call last):
  File "validate.py", line 64, in <module>
    clear_result_dir(build_params.OUTPUT_DIR)
  File "validate.py", line 25, in clear_result_dir
    for f in os.listdir(result_dir):
FileNotFoundError: [Errno 2] No such file or directory: 'results/test'

Add ability to source data from a specific snapshot of the data repository

Currently, it can only source from HEAD on master, but the current system is designed to support "pinning" a forecast run on a specific combination (single commit) of source data.

Also, we should probably drop the ability to filter data after a certain date once this is added. Pinning is better than trimming because even old numbers can get fixed, which is great for data quality but awful for reproducibility.

Extract run_dod_dataset.py script from deploy_dod_dataset.py

It looks like right now deploy_dod_dataset.py both generates the DoD dataset (leveraging libs.build_dod_dataset) and deploys it directly to S3.

We want to separate this out so that we can generate the dataset into local files (so that another script can then push it to the right location in s3). Context: https://docs.google.com/document/d/1VhQqMxtVYHQMIj9zFeb1DUQZ1aq1k4xIc_bYoTIqUb4

It should somehow be possible to configure it to use a specified local location for covid-data-public repo as well as a specified output directory.

Decide on Staffed vs. Licensed Beds

Most counties seem to have more licensed beds than staffed ones, but quite a few have the inverse. We should decide on which we rely on, or some combination thereof.

We're currently using licensed beds. Theoretically, that should include all beds that are allowed to operate (should the staffing materialize), but the data don't work that way in practice.
