outbreak-info / r-outbreak-info

R package to access the genomics and epidemiology data and Research Library metadata compiled and standardized on outbreak.info.

Home Page: https://outbreak-info.github.io/R-outbreak-info/

License: MIT License

r-outbreak-info's Introduction

outbreakinfo

R package for outbreak.info

outbreak.info is a platform to discover and explore COVID-19 data and variants. Our Variant Reports allow researchers to track any emerging or known variant using customizable visualizations, enabling near real-time genomic surveillance. Our Epidemiology tools allow users to explore how COVID-19 cases and deaths are changing across locations.

The outbreakinfo R package provides access to the underlying genomic and epidemiology data on outbreak.info. This includes SARS-CoV-2 variant prevalence data, calculated with the Bjorn package from data provided by GISAID. We also standardize COVID-19 case and death data from Johns Hopkins University and the New York Times and calculate derived statistics.

Installation

# Install development version from GitHub
devtools::install_github("outbreak-info/R-outbreak-info")

Getting Started

If you're getting started using outbreakinfo, we recommend starting with our tutorial vignettes.

To access the genomic data (SARS-CoV-2 variant prevalences), you first need to create an account on GISAID. It may take a few days for the registration to become active. Before calling the genomics functions, register your GISAID credentials:

outbreakinfo::authenticateUser()

By using this R package you reaffirm your understanding of the terms of use and the DAA you agreed to while registering with GISAID. Please see this section for more details.

Please view our vignettes for examples of how to use the R package.

Related Projects

API access for outbreak.info's Research Library, which provides metadata on COVID-19 publications, pre-prints, clinical trials, datasets, protocols, and more, is available on our API. API access for the cases and deaths data is also available on our API. In addition to this R package, the Research Library and Cases & Deaths API endpoints can be accessed through the httr R package or the Python requests package.

Examples

Genomic data

Lineage | Mutation Tracker

Provides access to the prevalence of a lineage, a mutation or set of mutations, or a lineage with additional mutations: the data underlying the outbreak.info Variant Tracker. This example uses mutation S:P681R (https://outbreak.info/situation-reports?muts=S%3AP681R). View the Variant Tracker Vignette to explore more options.

library(outbreakinfo)
#  Provide GISAID credentials using authenticateUser()
# Get the prevalence of mutation P681R in the Spike protein in Kansas over time.
P681R = getPrevalence(mutations = c("S:P681R"), location = "Kansas", logInfo = FALSE)
plotPrevalenceOverTime(P681R, title = "Prevalence of S:P681R in Kansas")

Location Tracker

Provides access to the prevalence of all lineages and variants in a country, state/province, or U.S. county: the data underlying the outbreak.info Location Tracker. View the Location Tracker Vignette to explore more options.

library(outbreakinfo)
#  Provide GISAID credentials using authenticateUser()
# Get the prevalence of all circulating lineages in California over the past 90 days
ca_lineages = getAllLineagesByLocation(location = "California", ndays = 90)
#> Retrieving data...

# Plot the prevalence of the dominant lineages in California
plotAllLineagesByLocation(location = "California", ndays = 90)
#> Retrieving data... 
#> Plotting data...

Lineage Comparison Tool

Provides access to the mutations per lineage: the data underlying the outbreak.info Lineage Comparison Tool.

library(outbreakinfo)
#  Provide GISAID credentials using authenticateUser()

lineages_of_interest <- c("BA.2", "BA.2.12.1", "BA.4", "BA.5")

# Get all mutations in the lineages of interest with at least 75% prevalence in at least one of the lineages.
mutations = getMutationsByLineage(pangolin_lineage=lineages_of_interest, frequency=0.75, logInfo = FALSE)

# Plot the mutations as a heatmap
plotMutationHeatmap(mutations, title = "S-gene mutations in lineages")

Research Library

Provides access to the metadata on COVID-19 research, including publications, clinical trials, datasets, protocols, and more.

library(outbreakinfo)
library(dplyr)
library(ggplot2)
library(lubridate)

resources_by_date = getResourcesData(query = "date:[2020-01-01 TO *]", types=c("Publication", "ClinicalTrial", "Protocol", "Dataset"), fields = c("date", "@type"), fetchAll = TRUE)

# roll up the number of resources by week
resources_by_date = resources_by_date %>%
  mutate(year = lubridate::year(date),
         iso_week = lubridate::isoweek(date))

# count the number of new resources per week.
resources_per_week = resources_by_date %>%
  count(`@type`, iso_week, year) %>%
  # convert from iso week back to a date
  mutate(iso_date = lubridate::parse_date_time(paste(year,iso_week, "Mon", sep="-"), "Y-W-a"))

# Make it a bit prettier, by sorting by the relative proportion of resource types
type_frequency = resources_by_date %>%
  count(`@type`) %>%
  arrange(desc(n)) %>%
  pull(`@type`)

resources_per_week$`@type` = factor(resources_per_week$`@type`, type_frequency)

ggplot(resources_per_week, aes(x = iso_date, y = n, fill = `@type`)) +
  geom_bar(stat="identity") +
  ggtitle("COVID-19 resources have rapidly proliferated", subtitle="Number of publications, datasets, clinical trials, and more added each week to outbreak.info's Research Library") +
  theme_minimal() +
  theme(
    text = element_text(family="DM Sans"),
    axis.title = element_blank(),
    axis.text = element_text(size = 16),
    plot.title = element_text(size = 20),
    plot.subtitle = element_text(colour="#777777", size=9)
  ) +
  scale_x_datetime(limits = c(min(resources_per_week$iso_date, na.rm = T), max(resources_per_week$iso_date, na.rm = T)), date_labels = "%b %Y") +
  scale_y_continuous(label=scales::comma) +
  scale_fill_manual(values = c(Publication = "#e15759", ClinicalTrial = "#b475a3", Dataset = "#126b93", Protocol = "#59a14f")) +
  facet_wrap(~`@type`, scales = "free_y", ncol = 1) +
  theme(legend.position = "none")
#> Error in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : polygon edge not found

Cases & Deaths

Replicates the daily confirmed cases visualization on outbreak.info - in this example, the United States and Mexico.

# Plots the daily confirmed cases per capita for the United States and Mexico.
library(outbreakinfo)
plotEpiData(locations = c("United States of America", "Mexico"), variable = "confirmed_rolling_per_100k")
#> Mexico (metropolitan area)
#> 
  downloading [==============================] 100% eta:  0s

Data Sources

SARS-CoV-2 virus sequences

We would like to thank the GISAID Initiative and are grateful to all of the data contributors, i.e., the Authors, the Originating laboratories responsible for obtaining the specimens, and the Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based.

TERMS OF USE for R Package and

Reminder of GISAID's Database Access Agreement

Your ability to access and use Data in GISAID, including your access and
use of same via R Package, is subject to the terms and conditions of
GISAID's Database Access Agreement (“DAA”) (which you agreed to
when you requested access credentials to GISAID), as well as the
following terms:

1. You will treat all data contained in the R Package consistent with
other Data in GISAID and in accordance with GISAID's Database Access
Agreement;

2. You will not distribute, or re-distribute Data made available through
GISAID to any third party other than Authorized Users as contemplated by
the DAA;

3. USE OF R PACKAGE: Any visualizations, charts, graphs,
graphics, pictographs, plots, or other displays you create via the R
Package may be exclusively used for academic and research purposes.
No other types of uses are allowed;

4. Any use of visualizations, charts, graphs, graphics, pictographs,
plots, or other displays created via the R Package in an academic or
research publication, including in a paper, manuscript, preprint, website,
web service, or any other media material must be in conformity with the
GISAID Publishing Guidelines, available at https://www.gisaid.org/publish,
and the DAA, available at https://www.gisaid.org/daa; and

5. By using the R Package you reaffirm your understanding of these
terms and the DAA.

When using this data, please state, "This data was obtained from GISAID via the outbreak.info API". WE DO NOT SUPPORT THIRD PARTY APPLICATIONS. THIS PACKAGE IS MEANT FOR RESEARCH AND VISUALIZATION PURPOSES ONLY. If you want to build third party applications, please contact GISAID via https://www.gisaid.org/help/contact/.

Cases & deaths

Confirmed cases, recovered cases, and deaths over time for countries outside the United States, and provinces in Australia, Canada, and China are provided by Johns Hopkins University Center for Systems Science and Engineering. See data FAQ.

Confirmed cases and deaths over time for the United States, U.S. States, U.S. Metropolitan Areas, U.S. cities and U.S. counties are provided by the New York Times. Note that "New York City" refers to the combined totals for New York, Kings, Queens, Bronx and Richmond Counties; "Kansas City" refers to cases within the Missouri portion of the Kansas City Metropolitan area and values for Jackson, Cass, Clay, and Platte counties are the totals excluding the KCMO data; cities like St. Louis that are administered separately from their containing county are reported separately. See other geographic exceptions.

r-outbreak-info's People

sabahzero emilyhaag gtsueng ningning-c finlaycampbell qianqli srandall02 kristiyslee tkmarkcheng skandel2

r-outbreak-info's Issues

Add a nicer error message for slow internet connections

I think my internet flaked out, so I got a connection time out message. Might be nice to catch timeout errors, esp. if they happen when the API is down for maintenance / data update.
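
One way to approach this is to wrap the request in `tryCatch` and translate timeout errors into a friendlier message. The sketch below is only an illustration: `withFriendlyTimeout` and `flaky` are hypothetical names, and the pattern used to recognize a timeout is an assumption about how the underlying HTTP library words its errors.

```r
# Hypothetical wrapper: `fetch` stands in for whatever function performs the API request.
withFriendlyTimeout <- function(fetch, ...) {
  tryCatch(
    fetch(...),
    error = function(e) {
      msg <- conditionMessage(e)
      # Assumption: timeout errors mention "timeout" or "timed out" in their message.
      if (grepl("timeout|timed out", msg, ignore.case = TRUE)) {
        stop("The outbreak.info API did not respond in time. ",
             "Check your internet connection, or try again later in case ",
             "the API is down for maintenance.", call. = FALSE)
      }
      stop(e)  # re-raise anything that is not a timeout
    }
  )
}

# Simulated request that always times out, for illustration only.
flaky <- function() stop("Timeout was reached: Connection timed out")
result <- tryCatch(withFriendlyTimeout(flaky), error = function(e) conditionMessage(e))
print(result)
```

Errors that are not timeouts are passed through unchanged, so existing error handling keeps working.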

Investigate rpy2 package error

@watronfire noticed a potential connection issue when trying to run the package through rpy2

Function call: countries <- getAdmn0()

Allow user to use browseVignettes(package = "outbreakinfo")

Vignette .Rmd would have to be moved into the vignettes/ directory.
In addition, would be nice to edit the DESCRIPTION file to include a link to the vignette.

Add ggtitle to plotPrevalenceByLocation

Add a title to plotPrevalenceByLocation(pangolin_lineage = "P.1", location = "Brazil"):

Personally, I also prefer percentage over proportion, but it's not a big deal either way.

plotPrevalenceByLocation(pangolin_lineage = "P.1", location = "Brazil") +
  ggtitle("Prevalence of P.1 in Brazil") +
  scale_y_continuous(labels = scales::percent, name="percentage")

Add introduction to README and vignette.

from @flaneuse : Add introduction to the Github Readme and the vignette to explain what the data is, where it comes from, etc. More context about what the package can do. Should also link to the data sources (outbreak.info/sources) and the api documentation (api.outbreak.info)

If `N` selected for all options in getAdmn1ByCounty(), show better error message

Check the length of hits to avoid this error:
Error in rbind_pages(results) :
  all(vapply(pages, is.data.frame, logical(1))) is not TRUE
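
A minimal sketch of such a check, assuming the results arrive as a list with one element per page. `bindPages` is a hypothetical helper, and it uses base `rbind` rather than `rbind_pages` so that the example runs without extra packages.

```r
# Hypothetical helper: `pages` is a list with one element per API page.
bindPages <- function(pages) {
  # Keep only pages that are non-empty data frames.
  ok <- Filter(function(p) is.data.frame(p) && nrow(p) > 0, pages)
  if (length(ok) == 0) {
    stop("No results were returned for this query. ",
         "Check the location name or selection and try again.", call. = FALSE)
  }
  do.call(rbind, ok)
}

# One page with data and one empty page: the empty page is skipped.
pages <- list(data.frame(name = "Kansas", confirmed = 10), list())
combined <- bindPages(pages)

# Only empty pages: the user sees a readable message instead of the vapply error.
empty_msg <- tryCatch(bindPages(list(list())), error = function(e) conditionMessage(e))
```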

Show progress bar for queries

Functions like getAdmn2ByCounty take a while to load the data. Showing a progress bar with the number of items retrieved would improve user experience.

Fix / suppress the `muffleWarning` output when user isn't authenticated

Warning: Please authenticate by calling authenticateUser() to access the API.
#> Error in invokeRestart("muffleWarning"): no 'restart' 'muffleWarning' found

Improve helper text displayed to user

Call: county_df=getAdmn2ByCountry()

The counter says things like 144 records retrieved; do you actually mean 14,400 records? Would suggest changing language to either something like "query" or multiplying by 1000.
It'd be helpful to grab the number of results so you have an idea of how far you are into the function call (e.g. 14,400 out of 68,000)
Would be nice to add a newline after "X records are retrieved". When there's an error, they get pasted together.
Might also be useful to display the estimated time remaining in the function. see progress package: https://dplyr.tidyverse.org/reference/progress_estimated.html
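
As a rough illustration of these suggestions, base R's `utils::txtProgressBar` can report progress against a known total. The loop below is a sketch only: `total` and `page_size` are assumed to be available from the API response, and the request itself is elided.

```r
# Hypothetical paging loop: `total` and `page_size` would come from the API response.
total <- 68000
page_size <- 1000
n_pages <- ceiling(total / page_size)

pb <- utils::txtProgressBar(min = 0, max = total, style = 3)
retrieved <- 0
for (i in seq_len(n_pages)) {
  # ... request page i here ...
  retrieved <- min(retrieved + page_size, total)
  utils::setTxtProgressBar(pb, retrieved)
}
close(pb)  # finishes the progress line so later messages start on a new line

message(format(retrieved, big.mark = ","), " of ",
        format(total, big.mark = ","), " records retrieved")
```

Counting records rather than pages avoids the "144 records" ambiguity, and closing the bar before printing keeps error messages from being pasted onto the counter.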

Geographic synonym lookup in searchLocations()

From @flaneuse : searchLocations would be MUCH more useful if there was a geographic synonym lookup function. If I try to lookup “USA” right now, I get no results. We could do this either by a gazetteer approach (store a list of synonyms for each location), or we could use a package that calls Google’s Geocode API to find the lat/lon of the location and let the user select what they mean.

We would need an api_key for google geocoding api.
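
The gazetteer option could start as small as a named vector. In this sketch the synonym table and `resolveLocation` are hypothetical, and the canonical names are assumed to match what the API expects.

```r
# Hypothetical synonym table; a real gazetteer would be far larger.
location_synonyms <- c(
  "usa" = "United States of America",
  "us" = "United States of America",
  "united states" = "United States of America"
)

# Map user-supplied names to their canonical form, falling back to the input.
resolveLocation <- function(name) {
  key <- tolower(trimws(name))
  canonical <- unname(location_synonyms[key])
  ifelse(is.na(canonical), name, canonical)
}

resolved <- resolveLocation(c("USA", "Mexico"))
print(resolved)
```

Unlike the geocoding route, this needs no API key, at the cost of maintaining the synonym list by hand.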

getSeqCounts broken (?)

Write and expose a generic getEpiData function and structure other functions as wrappers

Feedback from @flaneuse : It would be something like getEpiData(location_id, location_name, date, most_recent, admin_level, fields, sort, size), where all those fields are optional. getData() would return ALL the data. The advantage of this approach is that you now have one function which all the others can rely on, so if anything ever changes, like the API URL, you only have to change it in one place. You can also then have one place to do all the transformations you need. This function would also expose a lot more functionality to the user like sorting, specifying size, fields to return, etc. It’d be really cool too if you could specify date ranges (let us know if you need help constructing that ES query).
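
To make the idea concrete, here is a sketch of the "one function, many wrappers" structure, reduced to building the query string. `buildEpiQuery` and `getCountryQuery` are hypothetical names, only three of the proposed arguments are shown, and the query syntax is an assumption rather than the API's documented format.

```r
# Hypothetical helper: builds the query string for a generic getEpiData()-style call.
# All arguments are optional; NULL means "do not filter on this field".
buildEpiQuery <- function(name = NULL, admin_level = NULL, wb_region = NULL) {
  args <- list(name = name, admin_level = admin_level, wb_region = wb_region)
  args <- Filter(Negate(is.null), args)
  if (length(args) == 0) return("*")  # no filters: match everything
  clauses <- vapply(names(args), function(field) {
    values <- sprintf('"%s"', as.character(args[[field]]))
    sprintf("%s:(%s)", field, paste(values, collapse = " OR "))
  }, character(1))
  paste(unname(clauses), collapse = " AND ")
}

# A wrapper then only fixes some of the arguments.
getCountryQuery <- function(region) buildEpiQuery(wb_region = region, admin_level = 0)

q <- getCountryQuery("North America")
print(q)
```

Because every wrapper goes through the same builder, a change to the URL or query format happens in one place, and vectors of names are handled uniformly.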

Vignette `getAdmn2ByState` function is broken

getAdmn2ByState("California")
-->

[1] "Retrieving data..."
 1 records retrieved

Error in rbind_pages(results) :
  all(vapply(pages, is.data.frame, logical(1))) is not TRUE

Change the color scheme on the lineage prevalence plots

p <- plotAllLineagesByLocation(location = "United States", other_threshold = 0.03, nday_threshold = 5, ndays = 60)
p <- p + scale_fill_brewer(palette="Paired")

The ColorBrewer "Paired" scheme wouldn't be my first choice for this plot. The paired-ness of the colors mean that users subconsciously associate the two similar colors together, when B.1.1.519 and B.1.1.7 are completely unrelated.

Add other vignette links to README

Only the Epi vignette is listed.

Add authentication to all examples in documentation

e.g. https://outbreak-info.github.io/R-outbreak-info/reference/getMutationsByLineage.html

README

Be explicit that installation occurs in R script and not command line (this seems intuitive, but doesn't hurt to state)
Be more descriptive with purpose and function of R package directly, rather than relying on the user to click on the vignette (which does a good job of explaining functions themselves, but isn't entirely descriptive itself). For example, provide a few of the package uses directly.
License? Credits? I know we did an overview on how to approach a README (Mar 3, 2020) that Ginger facilitated, and there was a Google Doc breakdown

getMutationDetails - would be nice to return total number of observations

@gkarthik : maybe add to API the total worldwide instances of S:N501Y and percentage of all sequences for a given mutation? could be useful in cases where there are multiple nucleotide substitutions for a given amino acid change to figure out which is the more frequent change.

searchLocations needs a better example

searchLocations(c("California", "Florida", "Texas"), admin_level=1) which returns "California" "Florida" "Texas" seems not particularly useful.

Simplify ggplot code in Lineage Report Rmd

ditch_the_axes <- theme(
  axis.text = element_blank(),
  axis.line = element_blank(),
  axis.ticks = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.title = element_blank()
)

p <- ggplot() + geom_polygon(data = us_df, aes(x, y, group=group, fill=proportion), color="gray25", size=0.3) + scale_fill_gradient(low = "lemonchiffon", high = "lightseagreen", name = "Cumulative prevalence \nof B.1.1.7", labels = scales::percent) + ditch_the_axes + coord_fixed(1.05)
show(p)

Suggest replacing ditch_the_axes with the built-in ggplot2::theme_void() function

Add intro to Lineage Report vignette

Add context, link to the outbreak.info report, etc.: https://outbreak-info.github.io/R-outbreak-info/lineagevignette.html

getCountryByRegion is broken

getCountryByRegion("North America")

Error in getEpiData(wb_region = location, admin_level = 0) : 
  object 'location' not found

Mods to characteristic mutations heatmap

Suggestions for "Characteristic S-gene mutations in common lineages" in https://outbreak-info.github.io/R-outbreak-info/locationvignette.html:

The heatmap is wrong; you need to grab all the mutations which have at least 75% prevalence in any of the lineages. I do this by getting all mutations (getMutationsByLineage(.x, frequency=0)) within the lineage_list, then finding all unique mutations where they're > 0.75 in any lineage, and then filtering the mutations in df_ll to only be those mutations.
If you want a dplyr version (I'm a big advocate because I find the code easier to read): lineage_list <- lineage_df %>% filter(lineage != "other") %>% pull(lineage) %>% unique()
df_ll <- subset(df_ll, gene == "S") can also be dplyr-ized with filter
add coord_equal to the ggplot so the chiclets are square rather than rectangular.
add a border around the tiles
add a ggtitle
To mimic the outbreak palette, I'm using RdPu from ColorBrewer
I'd probably change the theme to remove the distracting gridlines, etc.
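
The filtering step in the first suggestion can be sketched in base R as follows. The data frame is toy data with placeholder lineage and mutation names, the column names (`lineage`, `mutation`, `gene`, `prevalence`) are assumptions about the shape of `df_ll`, and the threshold is treated as "at least 0.75".

```r
# Toy data with placeholder names; not real prevalence values.
df_ll <- data.frame(
  lineage = rep(c("lineage_A", "lineage_B"), each = 3),
  mutation = rep(c("S:m1", "S:m2", "S:m3"), times = 2),
  gene = "S",
  prevalence = c(0.90, 0.10, 0.40, 0.05, 0.80, 0.30)
)

# Highest prevalence of each mutation across all lineages.
max_prev <- tapply(df_ll$prevalence, df_ll$mutation, max)
keep <- names(max_prev)[max_prev >= 0.75]

# Keep every row for those mutations, including lineages where they are rare.
heatmap_df <- df_ll[df_ll$gene == "S" & df_ll$mutation %in% keep, ]
print(heatmap_df)
```

The point is that a mutation stays in the heatmap for all lineages once it is characteristic of any one of them, so low-prevalence cells are shown rather than dropped.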

In last table, use more meaningful variable name(s)

In "Tracked variants of concern in the United States" (https://outbreak-info.github.io/R-outbreak-info/locationvignette.html), would suggest using a more descriptive variable name than out. I use df all the time, so I'm not one to talk, but out seems just a little too divorced from what it's trying to represent.

Add intro to Location Report vignette

Add context, link to the outbreak.info report, etc.: https://outbreak-info.github.io/R-outbreak-info/locationvignette.html

`getByAdmnLevel(-1)` is broken

getByAdmnLevel(-1)
Error in open.connection(con, "rb") : HTTP error 400.

Probably need to enquote admin_level:"-1"?
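
A sketch of that fix, assuming the URL is assembled with `paste0` and that `%22` (a URL-encoded double quote) is what the API expects around the value. The base URL is a placeholder and `buildAdminLevelUrl` is a hypothetical helper.

```r
# Placeholder base URL; the real package stores its own.
buildAdminLevelUrl <- function(admin_level, api.url = "https://api.example.org/") {
  # Wrapping the value in %22 quotes it, so a leading "-" is presumably
  # treated as part of the value rather than as a query operator.
  paste0(api.url, "query?q=admin_level:%22", admin_level, "%22")
}

url <- buildAdminLevelUrl(-1)
print(url)
```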

Add function to get all countries/regions

Create a generic function to get all regions with the specified admin_level. This will be useful for plotting a world map.

getByAdminLevel(admin_level = 0/1/2/1.5)

`plotCovid` is broken

plotCovid("Florida", "confirmed_per_100k")
[1] "Retrieving data..."
 1 records retrieved
Error in key %in% colnames(df) : object 'key' not found

Add percent complete feedback for API calls that require lots of time

For instance: plotAllLineagesByLocation(location = "United States", other_threshold = 0.03, nday_threshold = 5, ndays = 60) takes a few seconds to load. Add feedback on how much longer the function will take to return data.

Create function to print all available fields in api

printAPIFields()

Add this function to Rmarkdown as well.

Better error message if admin_level not specified for getLocations()

Error in paste0(api.url, "query?q=", location.ids, "%20AND%20", "admin_level:%22",  : 
  argument "admin_level" is missing, with no default
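
A small sketch of an explicit check using `missing()`. The function name and signature are hypothetical, and the values in the message are only examples of admin levels mentioned in other issues.

```r
# Hypothetical signature; the real getLocations() may take other arguments.
getLocationsChecked <- function(location_names, admin_level) {
  if (missing(admin_level)) {
    stop("`admin_level` is required, for example 0, 1, 1.5, or 2.", call. = FALSE)
  }
  # ... build and send the query here ...
  list(location_names = location_names, admin_level = admin_level)
}

msg <- tryCatch(getLocationsChecked("Kansas"), error = function(e) conditionMessage(e))
print(msg)
```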

Merge genomics branch into main branch

Update 'master' branch name to 'main'

Reasoning and execution:
https://stevenmortimer.com/5-steps-to-change-github-default-branch-from-master-to-main/

getEpiData function is broken

The getEpiData(name="United States of America", date=2020-07-01, mostRecent=T) example from the help documentation is broken.

Need to translate mostRecent argument TRUE / FALSE into ES-compatible representations of booleans (true, false)
Fixing that string leads to an rbind err:

getEpiData(name="United States of America", date=2020-07-01, mostRecent='true')

[1] "Retrieving data..."
 1 records retrieved

Error in rbind_pages(results) : 
  all(vapply(pages, is.data.frame, logical(1))) is not TRUE
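
The boolean translation could look like the sketch below. `toQueryValue` is a hypothetical helper. The example also shows a separate pitfall in the documented call: an unquoted `2020-07-01` is ordinary subtraction in R.

```r
# Hypothetical helper: convert R values to the strings a query expects.
toQueryValue <- function(x) {
  if (is.logical(x)) return(tolower(as.character(x)))    # TRUE -> "true"
  if (inherits(x, "Date")) return(format(x, "%Y-%m-%d"))  # Date -> "YYYY-MM-DD"
  as.character(x)
}

most_recent <- toQueryValue(TRUE)
date_value <- toQueryValue(as.Date("2020-07-01"))

# Unquoted, the date is evaluated as 2020 - 7 - 1.
unquoted <- 2020-07-01
print(c(most_recent, date_value, unquoted))
```

Passing the date as a string or a `Date` object avoids the silent arithmetic.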

Document use of list of locations names in vignette

For getAdmn2ByState, getCountryByRegion, getAdmn1ByCountry, etc. provide another example where you supply the function with a list of location names rather than a string.

Improve function call perf and/or offer users more options to limit their results

When you call functions like getAdmn0() or getAdmn2ByCountry() the function is slooooooooow since it has to make a bunch of calls and returns a lot of data.

Would be nice to offer the users the ability to limit the query to a smaller subset of the data to help make it easier to return data faster. Potential options could include narrowing by date range, mostRecent, returning only a subset of fields...

Might also be worth double-checking that the function is as fast as possible. You're probably API-limited, but if it could be faster, that'd be nice. (low priority)

Create a homepage

https://outbreak-info.github.io/R-outbreak-info/

and/or

https://outbreak-info.github.io/

should point to some sort of basic homepage, ideally listing all the vignettes and maybe the functions. the tidyverse format is nice: installation instructions, usage, features, authors, etc: https://dplyr.tidyverse.org/

Should also link to outbreak.info and the API.

Full testing of R client with authentication

Pre-release testing

Vignette `getLocationData` function is broken

From https://outbreak-info.github.io/R-outbreak-info/outbreakvignette.html, using version 0.2.0 built on 3 November 2020:

df=getLocationData(c("Texas", "Brazil", "San Diego County"))

-->

Error in getLocationData(c("Texas", "Brazil", "San Diego County")) : 
  could not find function "getLocationData"

Better error message if api field not found

Error in get(key) : object ‘comfirmed_per_100k’ not found
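
One possible shape for a better message is to validate the field name up front and suggest the closest match by edit distance. `checkField` is a hypothetical helper, and the list of valid fields here contains only the two names that appear elsewhere on this page.

```r
# Hypothetical validator: `fields` would be the field names the API returns.
checkField <- function(key, fields) {
  if (key %in% fields) return(invisible(key))
  # Suggest the closest valid name by edit distance.
  distances <- utils::adist(key, fields)[1, ]
  suggestion <- fields[which.min(distances)]
  stop(sprintf("Field '%s' not found. Did you mean '%s'?", key, suggestion), call. = FALSE)
}

fields <- c("confirmed_per_100k", "confirmed_rolling_per_100k")
msg <- tryCatch(checkField("comfirmed_per_100k", fields), error = function(e) conditionMessage(e))
print(msg)
```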

Lineage Report vignette mapping

I'm not in love with the call to coord_fixed(1.05) in the Lineage Report vignette, since that's a hardcoded value of the aspect ratio of the lat/lon in a proper projection of the U.S. map.

I would think about converting the mapping data into an sf object, where you will need to:

convert the lat/lon of the polygons into a geometry object
specify the projection using st_transform and a proj4 string -- can grab from projection wizard
change the ggplot geom_polygon call into geom_sf

There's a lot more cool things the sf package can do for manipulating geographic data...

eu = getCountryByRegion("Europe & Central Asia")
Error in open.connection(con, "rb") : HTTP error 400.