
covid-model's Introduction

Model powering rt.live

This repository contains the code of the data processing and modeling behind https://rt.live.

Because this code is running in production, the maintainers of this repository are very conservative about merging any PRs.

Application to Other Countries

We have learned that it takes continuous attention to keep running the model. This is mostly due to data quality issues that are best solved with local domain knowledge.

In other words, the maintainers behind this repo and http://rt.live don't currently have the resources to ensure high-quality analyses for other countries.

However, we encourage you to apply and improve the model for your country!

Contributing

We are open to PRs that address aspects of the code or model that generalize across borders, for example:

  • docstrings (NumPy-style)
  • testing
  • robustness against data outliers
  • computational performance
  • model insight

Citing

To reference this project in a scientific article:

Kevin Systrom, Thomas Vladek and Mike Krieger. Rt.live (2020). GitHub repository, https://github.com/rtcovidlive/covid-model

or with the respective BibTeX entry:

@misc{rtlive2020,
  author = {Systrom, Kevin and Vladek, Thomas and Krieger, Mike},
  title = {Rt.live},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/rtcovidlive/covid-model}},
  commit = {...}
}

covid-model's People

Contributors

aminnj, michaelosthege, mikeyk, twiecki, tymick


covid-model's Issues

Clarify the addition of cases for LA

# From CT: On June 19th, LDH removed 1666 duplicate and non resident cases
# after implementing a new de-duplication process.
data.loc[idx["LA", pd.Timestamp("2020-06-19") :], :] += 1666

Can you clarify why you are adding cases to the LA total? Is this just to standardize with the pre-deduplication data?

Potential caveats to explanation of test exposure factor

I appreciate that y'all wrote a tutorial with the basic epidemiology needed to estimate R(t). I was with the explanation until this:

Intuitively, if we test twice as much, we expect twice as many positive tests to show up.

This is not intuitive to me. My intuition is that most tests have been done on a biased sample of the symptomatic and the exposed, for example those who know they've been exposed because they were near a symptomatic person. The bias implies that testing twice the sample does not necessarily lead to twice as many positives. Throw in the possibility of asymptomatic cases and different transmissibility of the asymptomatic, and I'm no longer sure what an unbiased sample is. Is it a random subset of the population? Probably not.

If these are significant caveats in mapping model positives to data positives they should be mentioned.
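For reference, the proportionality assumption being questioned can be written down directly (hypothetical helper, not from the repo): under it, positives scale linearly with testing volume, so dividing by a test-exposure factor leaves the adjusted count invariant.

```python
def adjust_for_exposure(positives, tests, reference_tests):
    """Rescale observed positives to a common testing volume, under the
    assumption that positives are proportional to tests administered."""
    exposure = tests / reference_tests
    return positives / exposure

# Doubling the tests (same underlying prevalence) doubles the raw
# positives, but leaves the adjusted value unchanged.
print(adjust_for_exposure(50, 1000, 1000))   # 50.0
print(adjust_for_exposure(100, 2000, 1000))  # 50.0
```

The caveat raised above is precisely that this linearity breaks down when the tested sample is biased toward the symptomatic and the exposed.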

process_covidtracking_data can't handle run_date < outlier corrections

When run_date is set to a date earlier than any of the outlier-corrected dates, process_covidtracking_data crashes with a pandas slicing error.

This happened on the master CI pipeline, because the PR #5 branch did not contain the most recent outlier-corrections, hence did not raise the error. https://github.com/rtcovidlive/covid-model/runs/833006303#step:5:89

I already have a fix for this and will open a PR ASAP.
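A sketch of the kind of guard that avoids the crash (hypothetical helper, assuming the (region, date) MultiIndex used in the LA correction snippet above):

```python
import pandas as pd

def apply_correction(data, region, correction_date, amount, run_date):
    """Add `amount` to all rows of `region` from `correction_date` onward,
    but only if `run_date` has reached the correction date."""
    if pd.Timestamp(run_date) < pd.Timestamp(correction_date):
        return data  # correction does not apply to this run_date yet
    idx = pd.IndexSlice
    data = data.copy()
    data.loc[idx[region, pd.Timestamp(correction_date):], :] += amount
    return data
```

Skipping corrections dated after run_date means old runs reproduce without touching the hard-coded adjustments.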

Use sane defaults for paths

A few issues.

  1. I installed the package in a conda environment. For every line in patients.py with

     os.path.join(os.path.dirname(__file__), "../data/" + string)

     the path has the prefix /Users/kpenner/miniconda3/envs/rtlive/lib/python3.9/site-packages/covid/../data/

     A location like ~/.local/share/covid/data/ would be more logical.

  2. The tutorial should mention that you need to call download_patient_data() before doing anything else.

  3. I have to untar latestdata.tar.gz manually. If I don't, and pass the .tar.gz here:

     patients = pd.read_csv(
         file_path,
         parse_dates=False,
         usecols=["country", "date_onset_symptoms", "date_confirmation"],
         low_memory=False,
     )

     pandas throws ValueError: Passed header names mismatches usecols. If I drop usecols, the output of patients.columns is Index(['././@PaxHeader'], dtype='object').
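A workaround sketch for point 3 (hypothetical helper; the inner CSV member name is an assumption): extract the member from the tar archive before handing it to read_csv, since read_csv otherwise sees the raw tar stream (hence the '././@PaxHeader' column).

```python
import tarfile

import pandas as pd

def read_patients(archive_path, csv_name="latestdata.csv"):
    """Read the patient CSV out of a .tar.gz archive without untarring
    it to disk. `csv_name` is the assumed member name inside the tar."""
    with tarfile.open(archive_path, "r:gz") as tar:
        member = tar.extractfile(csv_name)
        return pd.read_csv(
            member,
            parse_dates=False,
            usecols=["country", "date_onset_symptoms", "date_confirmation"],
            low_memory=False,
        )
```

This keeps the tutorial's read_csv arguments intact while removing the manual untar step.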

Repo needs NB that more clearly explains how the model works

Currently the NB is pretty simple, but not very helpful in understanding the model. A good NB would:

  • Explain how we pull data and any modifications we do
  • Explain the logical steps of the model (e.g. seed -> rt -> infections -> ... -> observed)
  • Explain how to run the model yourself
  • Explain how to interpret the results

Broken Pipe Error and Fixes?

Upon running the tutorial code I consistently hit a roadblock with the multiprocessing package this code requires (running on Windows with an Anaconda install of Python). I had to make the following changes to run the code successfully.

Error:
    if __name__ == "__main__":  # added to code to fix pipe closed error https://discourse.pymc.io/t/multiprocessing-windows-10-brokenpipeerror-errno-32-broken-pipe/2259
        gm = GenerativeModel(region, model_data)
  File "", line 1
SyntaxError: invalid syntax

I changed the following in tutorial.py:

    if __name__ == "__main__":  # added to code to fix pipe closed error https://discourse.pymc.io/t/multiprocessing-windows-10-brokenpipeerror-errno-32-broken-pipe/2259
        gm = GenerativeModel(region, model_data)
        gm.sample()

With this I was able to run the model on the unaltered Covid Tracking source data.

Upon still receiving errors after adjusting this to fit local state data instead of Covid Tracking's CSV, I came across a similar error. After playing around with it for a few days, I changed the following in the generative.py file:

def sample(
    self,
    cores=1,    #changed was 4
    chains=2,  #changed was 4

But I am concerned because I don't know what these values affect. I assume cores is the number of processing cores used; I am less sure about chains. Can I change these values without changing the model? Or do you have other suggestions to fix the Broken Pipe (Errno 32) errors?
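The `__main__` guard is the standard Windows multiprocessing pattern. A minimal, self-contained sketch (hypothetical `square` worker, not from the repo) of why it is needed: on Windows, child processes are spawned by re-importing the script, so any code that launches workers must be guarded or every child re-launches workers, producing the broken-pipe error.

```python
import multiprocessing as mp

def square(x):
    # Trivial stand-in for the real per-chain work.
    return x * x

# Without this guard, each spawned child on Windows would re-execute the
# Pool creation on import, recursing until the pipe breaks.
if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

Lowering `cores` to 1 sidesteps the issue entirely by avoiding worker processes, at the cost of sampling chains sequentially.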

Missing data points for Maine

The Maine chart is missing several days of data, and this may be causing Rt to be estimated lower than it really is.

Our Governor is addressing the state today due to a sudden rise in cases.

Proposal for better variable names

Proposal:

  • "infections" -> "transmissions", i.e. how many people transmitted the disease to someone else
  • "test_adjusted_positive" -> "infection_onset", i.e. how many people got infected

Computation took so long

Hello, I am Ega from Indonesia and I would like to compute Rt for my country. However, it takes me around 13 hours to run the model for one region, while the Python notebook shows it taking only around 7 minutes per region. Do you use a virtual machine or cloud computing to speed up the computation? Sorry for the dumb question, I am a newbie. Thanks in advance.

Best,
Ega

Add Rt estimates for foreign countries

Would it be a good idea to see how we stack up against other countries?

Of course, for the few countries that have eliminated the virus, make sure the website portion doesn't divide by zero.

Prior in Gaussian random walk

Can you explain the choice of prior for the Gaussian random walk? How did you choose this value for sigma?

    log_r_t = pm.GaussianRandomWalk(
        "log_r_t",
        sigma=0.035,
        dims=["date"],
    )
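For intuition, the prior can be simulated outside the model. This numpy sketch (hypothetical, not from the repo) shows what sigma=0.035 implies: each day, log(R_t) takes a Normal(0, 0.035) step, so R_t drifts by roughly ±3.5% per day at one standard deviation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, n_days = 1000, 90

# Gaussian random walk on log(R_t), starting from log(R) = 0, i.e. R = 1.
steps = rng.normal(0.0, 0.035, size=(n_draws, n_days))
log_r_t = np.cumsum(steps, axis=1)
r_t = np.exp(log_r_t)

# One-sigma multiplicative change in R_t from one day to the next.
daily_change = np.exp(0.035) - 1
print(f"one-sigma daily change in R_t: {daily_change:.1%}")  # ~3.6%
```

A smaller sigma would force R_t to evolve more smoothly; a larger one would let it react faster to the data but also chase reporting noise.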

Specify numpy version >=1.19.0

Installation with numpy<=1.18.5 (the latest version on Anaconda), pymc3==3.9.2, and theano==1.0.4 caused compilation errors. numpy 1.19.0 solved this. The numpy version should be specified in requirements.txt to avoid installation errors.
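A minimal sketch of the proposed pin, assuming requirements.txt also carries the pymc3 and theano versions mentioned above:

```
numpy>=1.19.0
pymc3==3.9.2
theano==1.0.4
```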

Create production branch

We should have a separate "prod" branch from which the model runs so that master can be a bit more stable.

Alternatively, we could have releases and production only ever uses the most recent version which is known to be stable.

I'm curious where the shelter policy data points are coming from

I thought that was a nice feature of the visualizations, but I couldn't see where the data was being sourced from as the COVID tracking project doesn't seem to have that available in their API.

Thanks and thanks for making these tools open and available! Great work.

Tag if states aren't reporting their data daily

I've noticed Mississippi has large gaps in their data, but if you drill into the state you can see they just aren't reporting ANY data for most days. Is there a way these states can be tagged in some way (different color, a simple * with a NOTE)?
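One way such a tag could be driven (hypothetical helper, not from the repo): compute, per state, the fraction of days with zero reported results, and flag states above some threshold.

```python
import pandas as pd

def reporting_gap_fraction(daily_totals):
    """Fraction of days with no new results reported.

    daily_totals: Series of new test results per day, indexed by date;
    missing days count as gaps.
    """
    return float((daily_totals.fillna(0) == 0).mean())

# States above, say, 20% gap days could get an asterisk on the chart.
```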

Model needs to correct for anomalies in testing reporting

Increasingly, states are reporting 100% positive tests on a given day (e.g. 215 of 215 tests came back positive). This throws the model off because it assumes the number of positive tests is roughly proportional to the total number of tests. If a state reports 100% positive tests, Rt increases too quickly because of the faulty data point.

For instance, Ohio has a handful of days when clearly total tests have not been reported correctly and positive % shoots up to 100%:

[screenshot: Ohio's daily positive rate spiking to 100% on several days]

And in some cases, tests are withheld one day, only to be reported together with the next day's results:

[screenshot: a day with no reported tests followed by a doubled count the next day]

In this case, drops in data are often followed by 2x the number of tests the following day.

In either case, an unstable positive % confuses the model significantly, so we need to figure out a solution to either:

  • Remove these anomalies and let the model infer the true hidden value
  • Correct these anomalies using some kind of algorithm

Currently @tvladeck and I have looked at Gaussian Processes and Kalman Filters as ways of detecting and perhaps correcting these issues. Other ideas are welcome too.
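The first option above can be sketched as a simple masking step (hypothetical helper and column names, not from the repo): flag days where positives equal or exceed total tests, or where totals drop to zero, and blank them out so the model infers the hidden values.

```python
import pandas as pd

def mask_reporting_anomalies(df):
    """Mask days with implausible testing data.

    df has columns 'positive' and 'total' (daily counts). Days with a
    100% positive rate or zero reported tests are set to NaN so the
    model treats them as missing rather than as real observations.
    """
    out = df.copy()
    out[["positive", "total"]] = out[["positive", "total"]].astype(float)
    anomalous = (out["total"] <= 0) | (out["positive"] >= out["total"])
    out.loc[anomalous, ["positive", "total"]] = float("nan")
    return out
```

A Gaussian process or Kalman filter, as mentioned above, would go one step further and impute corrected values instead of just masking them.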

Illinois testing backlog de-noising

I'm concerned that the data from Illinois' September 4th backlog "catch-up" might be de-noised out, as if the cases never happened. If it's a simple average, it should work out fine and be counted. But if the algorithm is tossing values it determines are outside the norm and thus "noise," the information will be lost, which I feel would be in error.

I noticed there was a recent code change to mitigate a South Dakota reporting error. Has someone analyzed how the algorithm works when data reporting is backlogged and then later caught up? i.e., a number of days' worth of positive tests were bundled up and then reported (incorrectly) as if they occurred on a single later date, rather than the various dates when the samples were taken?

Can not install mkl-service

I have cloned the repository and tried to run "python setup.py install"

It failed while trying to install the mkl-service module. I am using Python 3.6 running in Docker. What am I missing?
