
covid-model's Introduction

Model powering rt.live

This repository contains the code of the data processing and modeling behind https://rt.live.

Because this code is running in production, the maintainers of this repository are very conservative about merging any PRs.

Application to Other Countries

We have learned that it takes continuous attention to keep running the model. This is mostly due to data quality issues that are best solved with local domain knowledge.

In other words, the maintainers behind this repo and http://rt.live don't currently have the resources to ensure high-quality analyses for other countries.

However, we encourage you to apply and improve the model for your country!

Contributing

We are open to PRs that address aspects of the code or model that generalize across borders, for example:

  • docstrings (NumPy-style)
  • testing
  • robustness against data outliers
  • computational performance
  • model insight

Citing

To reference this project in a scientific article:

Kevin Systrom, Thomas Vladek and Mike Krieger. Rt.live (2020). GitHub repository, https://github.com/rtcovidlive/covid-model

or with the respective BibTeX entry:

@misc{rtlive2020,
  author = {Systrom, Kevin and Vladek, Thomas and Krieger, Mike},
  title = {Rt.live},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/rtcovidlive/covid-model}},
  commit = {...}
}

covid-model's People

Contributors

aminnj, michaelosthege, mikeyk, twiecki, tymick


covid-model's Issues

Clarify the addition of cases for LA

# From CT: On June 19th, LDH removed 1666 duplicate and non resident cases
# after implementing a new de-duplication process.
data.loc[idx["LA", pd.Timestamp("2020-06-19") :], :] += 1666

Can you clarify why you are adding cases to the LA total? Is this just to standardize with the pre-deduplication data?

Potential caveats to explanation of test exposure factor

I appreciate that y'all wrote a tutorial with the basic epidemiology needed to estimate R(t). I was with the explanation until this:

Intuitively, if we test twice as much, we expect twice as many positive tests to show up.

This is not intuitive to me. My intuition is that most tests have been done on a biased sample of the symptomatic and the exposed, for example those who know they've been exposed because they were near a symptomatic person. The bias implies that testing twice the sample does not necessarily lead to twice as many positives. Throw in the possibility of asymptomatic cases and different transmissibility of the asymptomatic, and I'm no longer sure what an unbiased sample is. Is it a random subset of the population? Probably not.

If these are significant caveats in mapping model positives to data positives they should be mentioned.
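For reference, the proportionality assumption being questioned can be written down directly (hypothetical helper, not from the repo): under it, positives scale linearly with testing volume, so dividing by a test-exposure factor leaves the adjusted count invariant.

```python
def adjust_for_exposure(positives, tests, reference_tests):
    """Rescale observed positives to a common testing volume, under the
    assumption that positives are proportional to tests administered."""
    exposure = tests / reference_tests
    return positives / exposure

# Doubling the tests (same underlying prevalence) doubles the raw
# positives, but leaves the adjusted value unchanged.
print(adjust_for_exposure(50, 1000, 1000))   # 50.0
print(adjust_for_exposure(100, 2000, 1000))  # 50.0
```

The caveat raised above is precisely that this linearity breaks down when the tested sample is biased toward the symptomatic and the exposed.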

process_covidtracking_data can't handle run_date < outlier corrections

When run_date is set to a date earlier than any of the outlier-corrected dates, process_covidtracking_data crashes with a pandas slicing error.

This happened on the master CI pipeline, because the PR #5 branch did not contain the most recent outlier-corrections, hence did not raise the error. https://github.com/rtcovidlive/covid-model/runs/833006303#step:5:89

I already have a fix for this and will open a PR ASAP.
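A sketch of the kind of guard that avoids the crash (hypothetical helper, assuming the (region, date) MultiIndex used in the LA correction snippet above):

```python
import pandas as pd

def apply_correction(data, region, correction_date, amount, run_date):
    """Add `amount` to all rows of `region` from `correction_date` onward,
    but only if `run_date` has reached the correction date."""
    if pd.Timestamp(run_date) < pd.Timestamp(correction_date):
        return data  # correction does not apply to this run_date yet
    idx = pd.IndexSlice
    data = data.copy()
    data.loc[idx[region, pd.Timestamp(correction_date):], :] += amount
    return data
```

Skipping corrections dated after run_date means old runs reproduce without touching the hard-coded adjustments.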

Use sane defaults for paths

A few issues.

  1. I installed the package in a conda environment. For every line in patients.py with

     os.path.join(os.path.dirname(__file__), "../data/" + string)

     the path has the prefix /Users/kpenner/miniconda3/envs/rtlive/lib/python3.9/site-packages/covid/../data/

     A location like ~/.local/share/covid/data/ would be more logical.

  2. The tutorial should mention that you need to call download_patient_data() before doing anything else.

  3. I have to untar latestdata.tar.gz manually. If I don't, and pass the .tar.gz here:

     patients = pd.read_csv(
         file_path,
         parse_dates=False,
         usecols=["country", "date_onset_symptoms", "date_confirmation"],
         low_memory=False,
     )

     pandas throws ValueError: Passed header names mismatches usecols. If I drop usecols, the output of patients.columns is Index(['././@PaxHeader'], dtype='object').
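A workaround sketch for point 3 (hypothetical helper; the inner CSV member name is an assumption): extract the member from the tar archive before handing it to read_csv, since read_csv otherwise sees the raw tar stream (hence the '././@PaxHeader' column).

```python
import tarfile

import pandas as pd

def read_patients(archive_path, csv_name="latestdata.csv"):
    """Read the patient CSV out of a .tar.gz archive without untarring
    it to disk. `csv_name` is the assumed member name inside the tar."""
    with tarfile.open(archive_path, "r:gz") as tar:
        member = tar.extractfile(csv_name)
        return pd.read_csv(
            member,
            parse_dates=False,
            usecols=["country", "date_onset_symptoms", "date_confirmation"],
            low_memory=False,
        )
```

This keeps the tutorial's read_csv arguments intact while removing the manual untar step.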

Repo needs NB that more clearly explains how the model works

Currently the NB is pretty simple, but not very helpful in understanding the model. A good NB would:

  • Explain how we pull data and any modifications we do
  • Explain the logical steps of the model (e.g. seed -> rt -> infections -> ... -> observed)
  • Explain how to run the model yourself
  • Explain how to interpret the results

Broken Pipe Error and Fixes?

Upon running the tutorial code I consistently hit a roadblock with the multiprocessing package this code requires (running on Windows with an Anaconda install of Python). I had to make the following changes to run the code successfully.

Error:
    if __name__ == "__main__":  # added to code to fix pipe closed error https://discourse.pymc.io/t/multiprocessing-windows-10-brokenpipeerror-errno-32-broken-pipe/2259
        gm = GenerativeModel(region, model_data)
  File "", line 1
SyntaxError: invalid syntax

I changed the following in tutorial.py:

    if __name__ == "__main__":  # added to code to fix pipe closed error https://discourse.pymc.io/t/multiprocessing-windows-10-brokenpipeerror-errno-32-broken-pipe/2259
        gm = GenerativeModel(region, model_data)
        gm.sample()

With this I was able to run the model on the unaltered Covid Tracking source data.

Upon still receiving errors after adjusting this to fit local state data instead of Covid Tracking's CSV, I came across a similar error. After playing around with it for a few days, I changed the following in the generative.py file:

def sample(
    self,
    cores=1,    #changed was 4
    chains=2,  #changed was 4

But I am concerned because I don't know what these values affect. I assume cores is the number of processing cores used; I am less sure about chains. Can I change these values without changing the model? Or do you have other suggestions to fix the Broken Pipe (Errno 32) errors?
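The `__main__` guard is the standard Windows multiprocessing pattern. A minimal, self-contained sketch (hypothetical `square` worker, not from the repo) of why it is needed: on Windows, child processes are spawned by re-importing the script, so any code that launches workers must be guarded or every child re-launches workers, producing the broken-pipe error.

```python
import multiprocessing as mp

def square(x):
    # Trivial stand-in for the real per-chain work.
    return x * x

# Without this guard, each spawned child on Windows would re-execute the
# Pool creation on import, recursing until the pipe breaks.
if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

Lowering `cores` to 1 sidesteps the issue entirely by avoiding worker processes, at the cost of sampling chains sequentially.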

Missing data points for Maine

The Maine chart is missing several days of data, and this may be causing Rt to be estimated lower than it really is.

Our Governor is addressing the state today due to a sudden rise in cases.

Proposal for better variable names

Proposal:

  • "infections" -> "transmissions", i.e. how many people transmitted the disease to someone else
  • "test_adjusted_positive" -> "infection_onset", i.e. how many people got infected

Computation took so long

Hello, I am Ega from Indonesia and I would like to compute Rt for my country. However, it takes me around 13 hours to run the model for one region, while the Python notebook shows it taking only around 7 minutes per region. Do you use a virtual machine or cloud computing to speed up the computation? Sorry for the dumb question, I am a newbie. Thanks in advance.

Best,
Ega

Add Rt estimates for foreign countries

Would it be a good idea to see how we stack up against other countries?

Of course, for the few countries that have eliminated the virus, make sure the website portion doesn't divide by zero.

Prior in Gaussian random walk

Can you explain the choice of prior for the Gaussian random walk? How did you choose this value for sigma?

    log_r_t = pm.GaussianRandomWalk(
        "log_r_t",
        sigma=0.035,
        dims=["date"],
    )
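For intuition, the prior can be simulated outside the model. This numpy sketch (hypothetical, not from the repo) shows what sigma=0.035 implies: each day, log(R_t) takes a Normal(0, 0.035) step, so R_t drifts by roughly ±3.5% per day at one standard deviation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, n_days = 1000, 90

# Gaussian random walk on log(R_t), starting from log(R) = 0, i.e. R = 1.
steps = rng.normal(0.0, 0.035, size=(n_draws, n_days))
log_r_t = np.cumsum(steps, axis=1)
r_t = np.exp(log_r_t)

# One-sigma multiplicative change in R_t from one day to the next.
daily_change = np.exp(0.035) - 1
print(f"one-sigma daily change in R_t: {daily_change:.1%}")  # ~3.6%
```

A smaller sigma would force R_t to evolve more smoothly; a larger one would let it react faster to the data but also chase reporting noise.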

Specify numpy version >=1.19.0

Installation with numpy<=1.18.5 (the latest version on Anaconda), pymc3==3.9.2, and theano==1.0.4 caused compilation errors. numpy 1.19.0 solved this. The numpy version should be specified in requirements.txt to avoid installation errors.
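A minimal sketch of the proposed pin, assuming requirements.txt also carries the pymc3 and theano versions mentioned above:

```
numpy>=1.19.0
pymc3==3.9.2
theano==1.0.4
```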

Create production branch

We should have a separate "prod" branch from which the model runs so that master can be a bit more stable.

Alternatively, we could have releases and production only ever uses the most recent version which is known to be stable.

I'm curious where the shelter policy data points are coming from

I thought that was a nice feature of the visualizations, but I couldn't see where the data was being sourced from as the COVID tracking project doesn't seem to have that available in their API.

Thanks and thanks for making these tools open and available! Great work.

Tag if states aren't reporting their data daily

I've noticed Mississippi has large gaps in their data, but if you drill into the state you can see they just aren't reporting ANY data for most days. Is there a way these states can be tagged in some way (different color, a simple * with a NOTE)?
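One way such a tag could be driven (hypothetical helper, not from the repo): compute, per state, the fraction of days with zero reported results, and flag states above some threshold.

```python
import pandas as pd

def reporting_gap_fraction(daily_totals):
    """Fraction of days with no new results reported.

    daily_totals: Series of new test results per day, indexed by date;
    missing days count as gaps.
    """
    return float((daily_totals.fillna(0) == 0).mean())

# States above, say, 20% gap days could get an asterisk on the chart.
```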

Model needs to correct for anomalies in testing reporting

Increasingly, states are reporting 100% positive tests on a given day (e.g. 215 of 215 tests came back positive). This throws the model off because it assumes the number of positive tests is roughly proportional to the total number of tests. If a state reports 100% positive tests, Rt increases too quickly because of the faulty data point.

For instance, Ohio has a handful of days when clearly total tests have not been reported correctly and positive % shoots up to 100%:

[screenshot: Ohio's daily positive rate spiking to 100% on several days]

And in some cases, tests are withheld one day, only to be reported together with the next day's results:

[screenshot: a day with no reported tests followed by a doubled count the next day]

In this case, drops in data are often followed by 2x the number of tests the following day.

In either case, an unstable positive % confuses the model significantly, so we need to figure out a solution to either:

  • Remove these anomalies and let the model infer the true hidden value
  • Correct these anomalies using some kind of algorithm

Currently @tvladeck and I have looked at Gaussian Processes and Kalman Filters as ways of detecting and perhaps correcting these issues. Other ideas are welcome too.
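The first option above can be sketched as a simple masking step (hypothetical helper and column names, not from the repo): flag days where positives equal or exceed total tests, or where totals drop to zero, and blank them out so the model infers the hidden values.

```python
import pandas as pd

def mask_reporting_anomalies(df):
    """Mask days with implausible testing data.

    df has columns 'positive' and 'total' (daily counts). Days with a
    100% positive rate or zero reported tests are set to NaN so the
    model treats them as missing rather than as real observations.
    """
    out = df.copy()
    out[["positive", "total"]] = out[["positive", "total"]].astype(float)
    anomalous = (out["total"] <= 0) | (out["positive"] >= out["total"])
    out.loc[anomalous, ["positive", "total"]] = float("nan")
    return out
```

A Gaussian process or Kalman filter, as mentioned above, would go one step further and impute corrected values instead of just masking them.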

Illinois testing backlog de-noising

I'm concerned that the data from Illinois' September 4th backlog "catch-up" might be de-noised out, as if the cases never happened. If it's a simple average, it should work out fine and be counted. But if the algorithm is tossing values it determines are outside the norm and thus "noise," the information will be lost, which I feel would be in error.

I noticed there was a recent code change to mitigate a South Dakota reporting error. Has someone analyzed how the algorithm works when data reporting is backlogged and then later caught up? i.e., a number of days' worth of positive tests were bundled up and then reported (incorrectly) as if they occurred on a single later date, rather than the various dates when the samples were taken?

Can not install mkl-service

I have cloned the repository and tried to run "python setup.py install"

It failed while trying to install the mkl-service module. I am using Python 3.6 running in Docker. What am I missing?
