hubverse-org / hubvis

Plotting methods for hub model outputs

Home Page: https://hubverse-org.github.io/hubVis/

License: Other

R 100.00%

hubverse r-package

hubvis's Introduction

hubVis

The goal of hubVis is to provide plotting methods for hub model outputs that follow the hubverse format. The hubverse is a collection of open-source software and data tools, developed by the Consortium of Infectious Disease Modeling Hubs. For more information, please consult the hubDocs website.

Installation

You can install the development version of hubVis like so:

remotes::install_github("hubverse-org/hubVis")

Usage

The R package currently contains one function, plot_step_ahead_model_output(), which plots 50%, 80%, and 95% quantile intervals, with a specific color per "model_id".

The function can output 2 types of plots:

interactive (Plotly object)
static (ggplot2 object)

library(hubVis)

library(hubExamples)
head(scenario_outputs)
head(scenario_target_ts)
projection_data <- dplyr::mutate(scenario_outputs,
     target_date = as.Date(origin_date) + (horizon * 7) - 1)

target_data_us <- dplyr::filter(scenario_target_ts, location == "US",
                                date < min(projection_data$target_date) + 21,
                                date > "2020-10-01")

projection_data_us <- dplyr::filter(projection_data,
                                    scenario_id == "A-2021-03-05",
                                    location == "US")
plot_step_ahead_model_output(projection_data_us, target_data_us)

Faceted plots can be created for multiple scenarios, locations, targets, models, etc.

projection_data_us <- dplyr::filter(projection_data,
                                    location == "US")
plot_step_ahead_model_output(projection_data_us, target_data_us, 
                             use_median_as_point = TRUE,
                             facet = "scenario_id", facet_scales = "free_x", 
                             facet_nrow = 2, facet_title = "bottom left")

Code of Conduct

Please note that the hubVis package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Contributing

Interested in contributing back to the open-source Hubverse project? Learn more about how to get involved in the Hubverse Community or how to contribute to the hubVis package.

hubvis's Issues

[ORG NAME CHANGE]: Update repo to hubverse-org organisation name

❗ Please do not merge anything into main until the 19th June

Create new branch called to-hubverse from latest dev branch (most likely related to the v3 schema PRs). (Make sure to pull first). If none exist, branch off from main.
Find and replace Infectious-Disease-Modeling-Hubs with hubverse-org throughout repo.

Replace origin remote to point to the new organisation

git remote set-url origin https://github.com/hubverse-org/<REPO_NAME>.git
# Check remotes
git remote -v

url <- paste0('https://github.com/hubverse-org/', basename(getwd()), '.git')
usethis::use_git_remote(name = "origin", url, overwrite = TRUE)
# Check remotes
usethis::git_remotes()

Push and open PR to original branch.

Consistent argument name for a model output tbl

It was decided in a dev meeting to standardize the argument name for model output tbls to model_out_tbl. hubVis::plot_step_ahead_model_output() needs to be updated accordingly.

Update authors

Use definition below
Definitions of roles (Author, maintainer, contributor, etc…)

Going forward, ensure that PRs from new contributors include an update to authorship roles (for R packages, in the DESCRIPTION file).

facet_ncol parameter not used

facet_nrow seems to be passed on to output_plot here, but not facet_ncol

`plot_step_ahead_model_output` returns error for model outputs that include PMF output type with `chr` `output_type_id`

The example hub used in the hubEnsembles vignette/manuscript includes mean, median, quantile, and PMF output types. The PMF output types have output_type_id = c("large decrease", "decrease", "stable", "increase", "large increase"), which forces the entire output_type_id column to be chr.
Example hub data: https://github.com/Infectious-Disease-Modeling-Hubs/hubEnsembles/tree/software-manuscript/inst/example-data/example-simple-forecast-hub

I tried to use this model output (filtered to only quantile output type) in plot_step_ahead_model_output(), and I received the following error: Error in x - y : non-numeric argument to binary operator. I traced this back to the step in the function where the interval ribbons are set up (I think the issue was because the function is trying to do something with "0.25" instead of 0.25).

Reproducible example

hub_path <- system.file("example-data/example-simple-forecast-hub",
                        package = "hubEnsembles")
model_outputs <- hubUtils::connect_hub(hub_path) |>
  dplyr::collect()
target_data_path <- file.path(hub_path, "target-data", "covid-hospitalizations.csv")
target_data <- read.csv(target_data_path) |>
    dplyr::mutate(time_idx = as.Date(time_idx))

This code gives the error.

hubVis::plot_step_ahead_model_output(model_output_data = model_outputs |>
                                                  dplyr::filter(location == "US",
                                                                output_type %in% c("quantile"),
                                                                origin_date == "2022-12-12") |>
                                                  dplyr::mutate(target_date =  origin_date + horizon),
                                              truth_data = target_data |>
                                                  dplyr::filter(location == "US",
                                                                time_idx >= "2022-11-01",
                                                                time_idx <= "2023-03-01"),
                                              facet = "model_id", 
                                              facet_nrow = 1, 
                                              interactive = FALSE,
                                              intervals = 0.5,
                                              one_color = "black",
                                              pal_color = NULL, 
                                              show_legend = FALSE, 
                                              use_median_as_point = TRUE)

Adding dplyr::mutate(output_type_id = as.double(output_type_id)) solves the problem.

hubVis::plot_step_ahead_model_output(model_output_data = model_outputs |>
                                                  dplyr::filter(location == "US",
                                                                output_type %in% c("quantile"),
                                                                origin_date == "2022-12-12") |>
                                                  dplyr::mutate(target_date =  origin_date + horizon, 
                                                       output_type_id = as.double(output_type_id)),
                                              truth_data = target_data |>
                                                  dplyr::filter(location == "US",
                                                                time_idx >= "2022-11-01",
                                                                time_idx <= "2023-03-01"),
                                              facet = "model_id", 
                                              facet_nrow = 1, 
                                              interactive = FALSE,
                                              intervals = 0.5,
                                              one_color = "black",
                                              pal_color = NULL, 
                                              show_legend = FALSE, 
                                              use_median_as_point = TRUE)
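
The underlying failure can be reproduced without the hub data. The following is a minimal base R sketch, not the package's internal code: it assumes the ribbon step does arithmetic on the quantile levels, and the variable names are made up for illustration.

```r
# Quantile levels stored as character, as happens when a PMF output type
# shares the output_type_id column.
output_type_id <- c("0.25", "0.75")

# Arithmetic on character values fails; keep the error message for inspection.
err <- tryCatch(
  output_type_id[2] - output_type_id[1],
  error = function(e) conditionMessage(e)
)
print(err)

# After coercion the same arithmetic works.
levels_num <- as.double(output_type_id)
interval_width <- levels_num[2] - levels_num[1]
print(interval_width)
```

This is why filtering to the quantile output type alone is not enough: the column keeps its character class until it is explicitly converted.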

Input data - Date information

From Issue #1

Arguments specifying input data
forecast_data required data.frame with forecasts in the hub_mdl_out_df format. Noting that maybe we want to start by supporting only certain output types (e.g. quantiles, means, medians, maybe samples once we implement
All forecasts in forecast_data will be plotted, i.e. all filtering needs to happen outside this function. We note per validation specified below that this data.frame must have either a target_date column or both an origin_date and a horizon column.

Currently, the input data is required to have a target_date column. It would be useful to also accept input with origin_date and horizon columns, together with the associated formula, and to calculate the target_date column when it is missing.
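
A rough sketch of what such a fallback could look like is given here. The helper name add_target_date() and its days_per_horizon argument are hypothetical and not part of hubVis; the formula (origin date plus horizon times a fixed number of days) is only one possible convention, and a hub may need a different offset.

```r
# Hypothetical helper: derive target_date when it is missing.
add_target_date <- function(model_out_tbl, days_per_horizon = 7) {
  # Nothing to do if the column is already present.
  if ("target_date" %in% names(model_out_tbl)) {
    return(model_out_tbl)
  }
  missing_cols <- setdiff(c("origin_date", "horizon"), names(model_out_tbl))
  if (length(missing_cols) > 0) {
    stop(
      "Need either `target_date` or both `origin_date` and `horizon`; missing: ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  # Assumed formula: horizons are counted in blocks of `days_per_horizon` days.
  model_out_tbl$target_date <-
    as.Date(model_out_tbl$origin_date) + model_out_tbl$horizon * days_per_horizon
  model_out_tbl
}

example_tbl <- data.frame(origin_date = "2021-03-05", horizon = 1:2)
with_dates <- add_target_date(example_tbl)
print(with_dates)
```

Passing the formula as an argument (or a function) rather than hard-coding it would keep the plotting function independent of any single hub's date convention.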

create plot_step_ahead_forecasts()

We would like to have a plot_step_ahead_forecasts() function as part of the hubUtils package. Ideally, this would have similar functionality to the "original" covidHubUtils::plot_forecasts() function which can be found here. The scope of this function is limited to plotting what we are calling step-ahead forecasts, that is forecasts for a single "target variable" at different horizons in the future.

Based on the original function, and integrating knowledge of the new hubverse toolkit, here is a proposal for what the new function would do and look like. This depends in some ways on hubverse-org/hubUtils#33 which relates to standardized structures for collections of predictions

General functionality

This function will plot forecasts and optional truth data for only one selected step-ahead target variable (which might be represented by one or more task_id variables). Optionally, faceted plots for multiple models, locations and forecast dates can be created with a specified facet formula.

Input parameters

(I've copied some text from the original plot_forecasts() function)

Arguments specifying input data

forecast_data required data.frame with forecasts in the hub_mdl_out_df format. Noting that maybe we want to start by supporting only certain output types (e.g. quantiles, means, medians, maybe samples once we implement hubverse-org/hubData#8). All forecasts in forecast_data will be plotted, i.e. all filtering needs to happen outside this function. We note per validation specified below that this data.frame must have either a target_date column or both an origin_date and a horizon column.
truth_data optional data.frame with required columns as follows:
- time_idx
- value
- [collection of columns that are task_id variables, but not the ones that define the target date]

Arguments about plotting details

facet interpretable facet option for ggplot. Function will error if multiple values of some task_id variables are passed in without the corresponding column in the facet formula.
facet_scales argument for scales in [ggplot2::facet_wrap]. Default to "fixed".
facet_nrow number of rows for facetting; optional.
facet_ncol number of columns for facetting; optional.
intervals values indicating which central prediction interval levels to plot. NULL means only plotting point forecasts. If not provided, it will default to c(.5, .8, .95). When plotting 6 models or more, the plot will be reduced to show .95 interval only.
use_median_as_point logical for using median quantiles as point forecasts in plot. Default to FALSE.
plot_truth logical for showing truth data in plot. Default to TRUE. Data used in the plot is either truth_data or data loaded from truth_source.
plot logical for showing the plot. Default to TRUE.
fill_by_model logical for specifying colors in plot. If TRUE, separate colors will be used for each model. If FALSE, only blues will be used for all models. Default to FALSE.
fill_transparency numeric value used to set transparency of intervals. 0 means fully transparent, 1 means opaque.
top_layer character vector, where the first element indicates the top layer of the resulting plot. Possible options are "forecast" and "truth".
title optional text for the title of the plot. If left as "default", the title will be automatically generated. If "none", no title will be plotted.
subtitle optional text for the subtitle of the plot. If left as "default", the subtitle will be automatically generated. If "none", no subtitle will be plotted.
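
To make the intervals argument concrete: a central prediction interval at level p is bounded by the quantiles at (1 - p) / 2 and 1 - (1 - p) / 2. The sketch below is illustrative only; interval_to_quantiles() is a hypothetical helper, not a function in the proposal.

```r
# Hypothetical helper: map central interval levels to the quantile levels
# that bound them.
interval_to_quantiles <- function(intervals = c(0.5, 0.8, 0.95)) {
  data.frame(
    interval = intervals,
    lower = (1 - intervals) / 2,
    upper = 1 - (1 - intervals) / 2
  )
}

bounds <- interval_to_quantiles()
print(bounds)
```

With the default levels, the model output therefore needs quantiles at 0.025, 0.1, 0.25, 0.75, 0.9 and 0.975 for all three ribbons to be drawn.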

Input validations

at least one target_metadata entry for the hub must have is_step_ahead set to TRUE.
hub target ids include "temporal ID variables" which are either (a) target_date or (b) origin_date and horizon
truth data has
- time_idx
- value
- if specified by hub, columns for all task_id variables that are not target_date, origin_date, horizon
forecast data is in the hub_mdl_out_df format
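
The truth data checks listed above could be sketched roughly as follows. validate_truth_data() and its task_id_cols argument are hypothetical names used for illustration; the real validation may be organised differently.

```r
# Hypothetical validator for the truth data columns listed above.
validate_truth_data <- function(truth_data, task_id_cols = character()) {
  required <- c("time_idx", "value", task_id_cols)
  missing_cols <- setdiff(required, names(truth_data))
  if (length(missing_cols) > 0) {
    stop(
      "`truth_data` is missing required column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(truth_data)
}

truth <- data.frame(
  time_idx = as.Date("2022-12-10"),
  location = "US",
  value = 100
)

# Passes: all required columns are present.
validate_truth_data(truth, task_id_cols = "location")

# Fails with a message naming the missing column.
msg <- tryCatch(
  validate_truth_data(truth, task_id_cols = "age_group"),
  error = function(e) conditionMessage(e)
)
print(msg)
```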

change plot_step_ahead_model_output references from "truth" to "target"

I think we've discussed before trying to standardize on referring to "target data" rather than "truth data" in the hub. I noticed today that we have "truth" references in several arguments in the plot_step_ahead_model_output() function.

interactive vis does not respect the `group` argument

Working with data from hubExamples, the following code produces a static plot that looks as expected:

library(dplyr)
library(hubExamples)
library(hubVis)

plot_step_ahead_model_output(
    model_output_data = forecast_outputs |> filter(output_type %in% c("quantile", "median")),
    target_data = forecast_target_ts |>
        filter(location %in% c("25", "48"),
               date >= "2022-10-01", date <= "2023-04-01"),
    use_median_as_point = TRUE,
    x_col_name = "target_end_date",
    intervals = c(0.5, 0.8, 0.9),
    facet = "location",
    group = "reference_date",
   interactive = FALSE
)

However, if we set interactive = TRUE, forecasts from different reference dates are connected:

plot_step_ahead_model_output(
    model_output_data = forecast_outputs |> filter(output_type %in% c("quantile", "median")),
    target_data = forecast_target_ts |>
        filter(location %in% c("25", "48"),
               date >= "2022-10-01", date <= "2023-04-01"),
    use_median_as_point = TRUE,
    x_col_name = "target_end_date",
    intervals = c(0.5, 0.8, 0.9),
    facet = "location",
    group = "reference_date",
   interactive = TRUE
)

Legend for `model_id` facet plot

Found this generally very clear and easy to follow.

I find the attached plot (faceted by model) a bit confusing, with the legend only showing the one model (the first panel). In the interactive plotly, if you then click to not display the model, only the first panel projection disappears (pictured). I would propose to have the default of this plot to not have a legend, but not sure exactly how to set this up practically (tricky with the option default being true for other plots). Another option might be to allow each of the panels to still be a different colour so the legend corresponds to something meaningful?

Originally posted by @saraloo in #2 (comment)

In this case the legend should either be:

if fill_by = "model_id": each model_id has a different color and its own legend item. For plotly, each one can be shown or hidden individually.
if fill_by is not set to "model_id": all model_id values share the same color and the legend refers to the fill_by variable. For plotly, all of them can be shown or hidden at once.

Documentation

It would be nice to have more documentation on the package:

examples with the function
more complete README with small example

Unit testing

It would be nice to add tests to the package and, once they exist, to report test coverage information.

add example data objects to package directly

See, e.g., https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals/tree/main/data-raw

Update DESCRIPTION file - add contributors

Add contributors to the DESCRIPTION file.

Upgrade Docs to hubStyle

Update package config

Create enhancement/hubstyle branch
Run hubDevs::use_hubdev_pkgdown(add_logo = TRUE) to update pkgdown
Add logo to your README
Append standard footer to README with hubDevs::append_hubdev_readme_footer(). Render
Run hubDevs::use_hubdev_community() to update community docs
Run hubDevs:::hubdev_ignore() to ignore std files
Check authors
Add hubverse & r-package topics to repos

Set up Netlify PR Previews

Log into Hubverse Netlify account and drag in pkg docs directory to create new site.
Edit site name with [pkg_name]-pr-previews and copy NETLIFY_SITE_ID
Create a NETLIFY_AUTH_TOKEN if required.
Add NETLIFY_AUTH_TOKEN & NETLIFY_SITE_ID to repository secrets

Push to Github, open a PR and check Preview 🎉

Error message unclear when number of facets too high.

The current error message is not very informative if the number of facets requested with facet_nrow and facet_ncol is too high and causes issues:

Error in (function (obj, domains)  : 
  'list' object cannot be coerced to type 'double'
In addition: Warning messages:
1: In mapply(FUN = f, ..., SIMPLIFY = FALSE) :
  longer argument not a multiple of length of shorter
2: In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
  data length is not a multiple of split variable
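
One possible shape for a clearer check is sketched below. It assumes that "too high" means a requested row or column count larger than the number of facet panels, which the issue does not state explicitly; check_facet_layout() is a hypothetical helper.

```r
# Hypothetical pre-check, assuming the failure occurs when facet_nrow or
# facet_ncol exceeds the number of panels to draw.
check_facet_layout <- function(n_panels, facet_nrow = NULL, facet_ncol = NULL) {
  dims <- list(facet_nrow = facet_nrow, facet_ncol = facet_ncol)
  for (name in names(dims)) {
    value <- dims[[name]]
    if (!is.null(value) && value > n_panels) {
      stop(
        sprintf(
          "`%s` (%s) is larger than the number of facet panels (%s).",
          name, value, n_panels
        ),
        call. = FALSE
      )
    }
  }
  invisible(TRUE)
}

# Three panels laid out in two rows: accepted.
check_facet_layout(n_panels = 3, facet_nrow = 2)

# Five rows requested for three panels: rejected with a readable message.
facet_msg <- tryCatch(
  check_facet_layout(n_panels = 3, facet_nrow = 5),
  error = function(e) conditionMessage(e)
)
print(facet_msg)
```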

pkgdown site

It would be nice to build a pkgdown site for this package, with examples and documentation.

create ability to add a plot layer where prior "seasons" of data are overlaid

See the http://flusightnetwork.io/ visualization as an example where prior seasons of "target data" are plotted in the background of a "current season" of truth data and a current forecast. These kinds of plots can be really useful for comparing the current trends of data and forecast to what has been observed at similar times in prior years.

y-axis label

On the facet plots, the y-axis label is not consistent across all subplots

Change action message in numeric coercion warning

I was running through an example and I got a confusing warning message:

! `output_type_id` column must be a numeric. Class applied by default.

I wasn't sure what "Class applied by default" meant in this context. Should it be "Converting to numeric?"

hubVis/R/validate_format.R

Lines 142 to 146 in 7b2bd6c

model_output_data$output_type_id <-
  as.numeric(model_output_data$output_type_id)
cli::cli_warn(c("!" = "{.arg output_type_id} column must be a numeric.
  Class applied by default."))
}
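
A sketch of the suggested wording is shown below. It uses base warning() so that it runs without extra packages, whereas the package itself uses cli::cli_warn(); coerce_output_type_id() is a hypothetical wrapper, not the code in validate_format.R.

```r
# Hypothetical wrapper showing the proposed, more explicit message.
coerce_output_type_id <- function(model_output_data) {
  if (!is.numeric(model_output_data$output_type_id)) {
    warning(
      "`output_type_id` column must be numeric. Converting to numeric.",
      call. = FALSE
    )
    model_output_data$output_type_id <-
      as.numeric(model_output_data$output_type_id)
  }
  model_output_data
}

tbl <- data.frame(output_type_id = c("0.25", "0.75"), value = c(10, 20))

# Capture the warning so the new wording can be inspected.
fixed <- withCallingHandlers(
  coerce_output_type_id(tbl),
  warning = function(w) {
    message("Caught warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
str(fixed)
```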

Update tests

Just to note that snapshots of graphical output can be tested using the vdiffr package (https://vdiffr.r-lib.org/).

Originally posted by @annakrystalli in #2 (comment)

GitHub Actions - R CMD check

It would be nice to add a GitHub Action to run R CMD check on PRs.

add `stat_`/`geom_` functions to create a layer plotting step ahead forecasts

The idea is that this would make it easier to build up a plot layer by layer, allowing for easier customization by users who are familiar with ggplot, rather than calling a function that "does it all".

It might be possible to base these on functionality in the ggdist package here.

Point and interval forecasts are not clearly separated by forecast date when plotted

When forecasts made on multiple forecast dates are plotted at the same time, all of the forecasts are connected even though they should not be. See the attached images at the bottom for the desired behavior (achievable using covidHubUtils::plot_forecasts()) vs the current behavior demonstrated by hubVis::plot_step_ahead_model_output().

This behavior is particularly noticeable when there is a large gap between forecast dates, like during the off season for flu forecasting from around June to the end of September, when there should be no plotted forecasts.

A simple, reproducible example is given below using the data in the attached zip file:

library(tidyverse)
library(lubridate)
library(hubUtils)
library(hubVis)

flu_truth_all <- readr::read_rds("flu_truth_all.rds")
flu_forecasts_baseline <- readr::read_rds("flu_baseline_hub.rds")
flu_dates_21_22 <- as.Date("2022-01-24") + weeks(0:21)
flu_dates_22_23 <- as.Date("2022-10-17") + weeks(0:30)
all_flu_dates <- c(flu_dates_21_22, flu_dates_22_23)

select_dates <- all_flu_dates[seq(1, 53, 4)]

baseline_ca <- flu_forecasts_baseline |> 
  dplyr::filter(location == "06", forecast_date %in% select_dates)

truth_ca <- flu_truth_all |> 
  dplyr::filter(location == "06")
  
plot_step_ahead_model_output(
  baseline_ca,
  truth_ca,
  use_median_as_point=TRUE,
  show_plot=TRUE,
  x_col_name="target_end_date",
  x_truth_col_name = "target_end_date",
  show_legend=FALSE,
  facet="model_id",
  facet_nrow=5,
  interactive=FALSE,
  fill_transparency = 0.45,
  intervals=c(0.5, 0.8, 0.95),
  title="Weekly Incident Hospitalizations for Influenza in California"
)

flu_baseline_truth.zip
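
In plain ggplot2, the separation described in this issue is usually obtained through the group aesthetic. The sketch below uses made-up toy data and is not how hubVis builds its plot; it only illustrates grouping lines by both model and forecast date so that no segment is drawn across the off-season gap.

```r
library(ggplot2)

# Toy data: one model, two forecast dates separated by an off-season gap.
forecasts <- data.frame(
  model_id = "baseline",
  forecast_date = as.Date(rep(c("2022-05-02", "2022-10-17"), each = 3)),
  target_end_date = as.Date(c(
    "2022-05-07", "2022-05-14", "2022-05-21",
    "2022-10-22", "2022-10-29", "2022-11-05"
  )),
  value = c(120, 110, 100, 40, 55, 70)
)

# Grouping by model and forecast date draws one line per forecast,
# instead of one line per model spanning every forecast date.
p <- ggplot(
  forecasts,
  aes(
    x = target_end_date, y = value, colour = model_id,
    group = interaction(model_id, forecast_date)
  )
) +
  geom_line() +
  geom_point()
```

The same grouping would apply to interval ribbons, since geom_ribbon() honours the group aesthetic as well.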

add ability to plot sample output_type data

Currently, a limitation of plot_step_ahead_model_output() is that it only accepts median and quantile output_types.

We should support the plotting of sample output_type data by inferring quantiles from it. Currently, a user would have to manually transform the data to quantiles. Possible approach: add sample output_type to the list of valid types and make the appropriate transformation to quantile output_type data.

Acceptance criteria:

a user passes sample output_type data to plot_step_ahead_model_output() and gets a plot returned, with intervals if specified
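
The transformation from samples to quantiles could look roughly like the base R sketch below. samples_to_quantiles() and its arguments are hypothetical, the grouping columns are assumptions about the data layout, and stats::quantile() is used with its default algorithm; the eventual implementation may differ.

```r
# Hypothetical helper: summarise sample rows into quantile rows per group.
samples_to_quantiles <- function(sample_tbl,
                                 probs = c(0.025, 0.25, 0.5, 0.75, 0.975),
                                 group_cols = c("model_id", "target_date")) {
  groups <- split(sample_tbl, sample_tbl[group_cols], drop = TRUE)
  out <- lapply(groups, function(g) {
    # Repeat the group keys once per requested quantile level.
    res <- g[rep(1, length(probs)), group_cols, drop = FALSE]
    res$output_type <- "quantile"
    res$output_type_id <- probs
    res$value <- stats::quantile(g$value, probs = probs, names = FALSE)
    res
  })
  result <- do.call(rbind, out)
  rownames(result) <- NULL
  result
}

# Toy sample data: two models, five samples each, one target date.
samples <- data.frame(
  model_id = rep(c("model-a", "model-b"), each = 5),
  target_date = as.Date("2023-01-07"),
  output_type = "sample",
  output_type_id = rep(1:5, times = 2),
  value = c(10, 20, 30, 40, 50, 1, 2, 3, 4, 5)
)

quantiles <- samples_to_quantiles(samples, probs = c(0.25, 0.5, 0.75))
print(quantiles)
```

The result has the same shape as quantile output_type data, so it could be passed on to the existing interval plotting code.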

Interactive plotly - hovertext

The hovertext of the plot should use one consistent numeric format (for example, 2 digits) and have a space after ":".
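
A small formatting helper along these lines would enforce both rules. hover_label() is a hypothetical name, and two decimal places is only the example value mentioned above.

```r
# Hypothetical helper: fixed number of decimals and a space after the colon.
hover_label <- function(label, value, digits = 2) {
  sprintf("%s: %s", label, formatC(value, format = "f", digits = digits))
}

labels <- hover_label(c("median", "95% upper"), c(1234.5678, 3))
print(labels)
```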
