
hubUtils's Introduction

hubUtils


The goal of hubUtils is to provide a set of utility functions for downloading, plotting, and scoring forecast and truth data from Infectious Disease Modeling Hubs.

Installation

You can install the development version of hubUtils like so:

# install.packages("remotes")
remotes::install_github("hubverse-org/hubUtils")

Code of Conduct

Please note that the hubUtils package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Contributing

Interested in contributing back to the open-source Hubverse project? Learn more about how to get involved in the Hubverse Community or how to contribute to the hubUtils package.


hubUtils's Issues

`get_task_id_params` likely needs to know about a round id

In scenario modeling hubs, the task id parameters may differ between rounds. To retrieve information about a particular task id variable, this function therefore potentially needs to know a round id. See this example for roughly the kind of setup we might have in mind.
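For illustration, here is a minimal sketch of what a round-aware signature might look like; the argument names and the metadata structure accessed here are assumptions based on the hubmeta examples elsewhere in these issues, not the actual implementation:

get_task_id_params <- function(hubmeta, task_id, round_id = NULL) {
    # In round varying hubs, task id definitions live under the requested round;
    # in round_id_from_variable hubs, round_id can be left NULL.
    if (!is.null(round_id)) {
        hubmeta <- hubmeta[[round_id]]
    }
    hubmeta$model_tasks[[1]]$task_ids[[task_id]]
}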

Use magrittr `%>%` or base R `|>` pipe?

Was wondering if you had a preference for importing the magrittr pipe or using the base R |> pipe.

I've been using the base R |> pipe so far, but that immediately sets R 4.1.0 or greater as a dependency. On second thought, asking folks to upgrade R to use the package might be a big ask.
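For reference, a sketch of what each choice implies in the package DESCRIPTION (illustrative, not the package's current state). With the base |> pipe, DESCRIPTION would need:

Depends: R (>= 4.1.0)

whereas importing the magrittr pipe instead needs:

Imports: magrittr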

Compile example `hubmeta.json` files and associated mini-hubs

To validate that metadata in hubmeta.json files can be used effectively to support loading data from a variety of hubs, I will need:

  • representative hubmeta.json files
  • associated example mini hubs to validate functions for loading data against. These will be useful for testing and demos in vignettes. Options:
    • Internal: if small enough, they would ideally live in inst/ and be accessible through system.file() calls (see the sketch after this list).
    • External: if the total size of files needed to capture enough diversity of hub characteristics is too large, we could host demo data as separate demo hubs. This would, however, make testing slower and require an internet connection to run tests.
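As a sketch of the internal option, an example hub shipped in inst/ would be reachable via system.file(); the "testhubs/simple" path here is hypothetical:

hub_path <- system.file("testhubs", "simple", package = "hubUtils")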

[Input validation - Submit] Scenario #2 Invalid task_id value

A team wants to submit data, but one of the task_id columns in the dataset contains a value that is not valid for that task id in the given round.

Options for action

  • Throw an error identifying the invalid task_id value (and perhaps the location of affected cells in the file)? See the sketch below.
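For concreteness, a rough sketch of what such a check could look like; tbl is the submission data frame and valid_values a named list of valid values per task id for the round (both hypothetical):

check_task_id_values <- function(tbl, task_id, valid_values) {
    bad <- which(!tbl[[task_id]] %in% valid_values[[task_id]])
    if (length(bad) > 0) {
        stop(
            "Invalid ", task_id, " value(s) ",
            paste(unique(tbl[[task_id]][bad]), collapse = ", "),
            " in row(s) ", paste(bad, collapse = ", ")
        )
    }
    invisible(TRUE)
}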

don't throw error when no errors exist from validation

Once I got the config working, I thought I still had an error since this was throwing an error. Actually, the error was that I had no errors! Maybe just a more graceful message, to make that clear to a novice user?

library(hubUtils)
view_config_val_errors(validate_config())
#> Loading required namespace: jsonvalidate
#> Error in validate_config(): Assertion on 'config_path' failed: File does not exist: './hub-config/tasks.json'.

Created on 2023-03-27 by the reprex package (v2.0.1)

add standardize_column_names() function

This would be designed to live in the workflow after collect() and before something like plotting or ensembling forecasts. Currently, when data are read in, a model column is inferred from the folder name where the data originate. We would like the recommended setup to be that model_abbr and team_abbr columns are inferred from a nested folder structure. However, to support legacy code, we'd also like a function that can split a non-nested folder structure based on a "[team_name]-[model_name]" style folder name, as sketched below.
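For the legacy case, a minimal sketch of the splitting step, assuming a model column holding "[team_name]-[model_name]" folder names (the function and column names are assumptions):

standardize_column_names <- function(tbl) {
    # Split at the first hyphen; any further hyphens are merged into model_abbr
    tidyr::separate(
        tbl,
        col = "model",
        into = c("team_abbr", "model_abbr"),
        sep = "-",
        extra = "merge"
    )
}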

Be able to download subsets of data from a remote hub

Ability to load forecasts for a particular team, set of targets, etc., without having a local clone of the full repository and without relying on Zoltar integration.

GitHub

One option is to query via the GitHub API.
Limitations:

Zoltar

Somewhat longer term, set up action definitions as part of the template hub to send data to Zoltar.
How well can the Zoltar infrastructure capture the needs of SMHs?

standard format(s) for a collection of predictions (forecasts or projections)

The goal of this issue is to arrive at a definition for a standard format for representing a collection of predictions when they are loaded into a working environment such as an R or python session. Discussion here will focus on aspects about the general data format that are agnostic to the programming environment (i.e., they are relevant to multiple languages, e.g. R, python, ...). There are also some questions that are specific to the R implementation in the hubUtils package -- I've filed a separate issue (#32) to discuss R-specific points.

As a first pass at a proposal, I suggest that predictions could be stored in a long format data frame that has the same columns as in the model output submission files, augmented with the team and model names as discussed in this issue. In more detail, here is a take at the expected columns that will be in a dataframe-like object that is returned by a hubUtils::load_predictions function call:

  • team_abbr <string>: The abbreviation of the team name that produced the predictions
  • model_abbr <string>: The abbreviation of the model that produced the predictions
  • One column for each task_id variable in all model output files that were loaded. The data type is a <string> (unless the data type is otherwise specified somehow in the hub's tasks.json file? @annakrystalli do you have insights? If we try to get fancy here, we might have to worry about edge cases where the same-named task id variable is specified to have a different data type in different rounds...? But it would clearly be preferable to be able to immediately convert date fields to dates...).
  • output_type <string>: The output type
  • output_type_id <string>: The output type id
  • value <numeric>: The value of the model prediction

If predictions are loaded from multiple rounds that have different task id variables, it is expected that task id columns missing for some rounds will be filled in with NA.
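For concreteness, a hypothetical example of what a hubUtils::load_predictions() return value might look like under this proposal (all values made up):

tibble::tibble(
    team_abbr      = "tm",
    model_abbr     = "baseline",
    origin_date    = "2022-09-03",  # one <string> column per task id variable
    location       = "US",
    horizon        = "1",
    output_type    = c("quantile", "quantile", "point"),
    output_type_id = c("0.25", "0.75", NA),
    value          = c(120, 250, 180)
)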

Here are some alternative ideas that have come up in various discussions:

  • The predx package made use of roughly similar tidy data frames, but used objects specific to each forecast output type to represent forecasts across multiple type id values. With this representation, there would be one row for each combination of values for the team, model, task id variables, and output type. Said another way, as an example: we would not have multiple rows for quantile forecasts by the same team/model for the same target at different quantile probability levels, but if a team/model submitted both quantile and point forecasts, there would be one row for the quantile forecasts and a second row for the point forecasts. For example, see the discussion in this vignette.
  • Perhaps in combination with the above, it has been suggested that we could apply a pivot_wider type action to one or more of these columns. In particular, if we used a separate column for each possible output_type, then there would be only one row for each combination of team_abbr, model_abbr, and the task id variables. Thus, each row would correspond to forecasts (possibly of multiple types) for one forecasting problem and one team.

Although I see value in these alternatives I would not prefer to use them as the foundational/core data format for two (closely related) reasons:

  1. A long data format is familiar and easily transportable across languages, and will be easy to work with in the common tools of any language
  2. Implementing functionality along the lines of predx would take more work; for R, we could build on the existing functionality in that package, but we'd have to start from scratch in other languages

`round-id` label in `round_id_from_variable` hubs

Currently in the hub_connection class object I'm attaching 3 relevant metadata summary attributes:

  • round_ids:
    • in round varying hubs -> a vector of round id names
    • in round_id_from_variable hubs -> the round_id_variable, e.g. "origin_date"
  • task_id_names:
    • in round varying hubs -> a named list of task id names per round
    • in round_id_from_variable hubs -> a character vector of task id names, indicating their consistency across rounds
  • task_ids_by_round: a logical value, TRUE for a round varying hub, FALSE for a round_id_from_variable hub, making it easier to distinguish the two types of hubs.

See examples:

library(hubUtils)
# round_id_from_variable hub
connect_hub(system.file("hub_1", package = "hubUtils")) |>
    attributes()
#> $names
#> [1] "round_id_from_variable"
#> 
#> $hubmeta_path
#> [1] "/Users/Anna/Library/R/arm64/4.2/library/hubUtils/hub_1/hubmeta.json"
#> 
#> $hub_path
#> [1] "/Users/Anna/Library/R/arm64/4.2/library/hubUtils/hub_1"
#> 
#> $task_id_names
#> [1] "origin_date" "location"    "horizon"    
#> 
#> $round_ids
#> [1] "origin_date"
#> 
#> $task_ids_by_round
#> [1] FALSE
#> 
#> $class
#> [1] "list"           "hub_connection"
# round varying hub
connect_hub(system.file("scnr_hub_1", package = "hubUtils")) |>
    attributes()
#> $names
#> [1] "round-1" "round-2"
#> 
#> $hubmeta_path
#> [1] "/Users/Anna/Library/R/arm64/4.2/library/hubUtils/scnr_hub_1/hubmeta.json"
#> 
#> $hub_path
#> [1] "/Users/Anna/Library/R/arm64/4.2/library/hubUtils/scnr_hub_1"
#> 
#> $task_id_names
#> $task_id_names$`round-1`
#> [1] "origin_date" "scenario_id" "location"    "target"      "horizon"    
#> 
#> $task_id_names$`round-2`
#> [1] "origin_date" "scenario_id" "location"    "target"      "age_group"  
#> [6] "horizon"    
#> 
#> 
#> $round_ids
#> [1] "round-1" "round-2"
#> 
#> $task_ids_by_round
#> [1] TRUE
#> 
#> $class
#> [1] "list"           "hub_connection"

Created on 2022-10-18 with reprex v2.0.2

This, however, leads to inconsistency in the form in which task id metadata are held. There was a question about simplifying to a single vector where task ids are consistent across rounds in round varying hubs (#14 (comment)), and Evan seemed to like the idea. When it comes to implementation, however, given that such simplification cannot always be guaranteed, the inconsistent output ends up requiring checks and extra conditional logic, according to the object type returned (list or vector), to make use of the info downstream.

On the other hand, having a named list of task ids in all cases actually simplifies the back-end significantly, as we can just use the required round id in round varying hubs and a consistent round id name in non-varying hubs, so I'm leaning towards that.

If you agree with this approach, this leads to the question of what to assign as the consistent round-id name of the single task_id_names element in the case of a round_id_from_variable metadata structure. Here are some options:

  • Leave it as round_id_from_variable. This is simple but not very informative.
  • Change it to the value of the actual variable acting as the round_id identifier (e.g. origin_date), which feels much more meaningful. In this case I also suggest that, when reading in hubmeta, we replace round_id_from_variable (the name of the single top-level element) with the variable name acting as the round_id identifier. This would make validating round_id argument inputs against the round_ids attribute of hub_connection work the same as in round varying hubs. It would also give us a meaningful value to assign, in the case of round_id = NULL inputs (in the round aware functions discussed in #14), for indexing into the nested metadata structure, consistent between hubs with round varying task ids and round_id_from_variable hubs.

I'm personally leaning towards the second option, but thoughts welcome. Also, while internal validation etc. functions can make use of the consistent structure, we could have user-facing functions where metadata are simplified if possible.
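To illustrate the second option using the task id names from the reprex above (purely illustrative):

# task_id_names as a named list in both hub types.
# Round varying hub: one element per round.
list(
    `round-1` = c("origin_date", "scenario_id", "location", "target", "horizon"),
    `round-2` = c("origin_date", "scenario_id", "location", "target", "age_group", "horizon")
)

# round_id_from_variable hub: a single element named after the variable
# acting as the round_id identifier.
list(
    origin_date = c("origin_date", "location", "horizon")
)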

Print method for `<hub_connection>` class object

While the <hub_connection> S3 class object itself is a nested list containing hubmeta metadata, it would be nice for those details to be hidden when the object is printed, exposing a few key hub details instead, in the spirit of printing a dbConnection object (a rough sketch follows the list below).

Include:

  • what type of connection/data file sources are available?
    Local repo, Zoltar
  • connection details (e.g. path to hub or Zoltar project id)
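For illustration, a rough sketch of what such a print method might look like, using the hub_path and round_ids attributes shown in the reprexes above (the fields displayed and their labels are assumptions):

print.hub_connection <- function(x, ...) {
    cat("<hub_connection>\n")
    cat("  source: local repo\n")  # or Zoltar, once supported
    cat("  hub_path:", attr(x, "hub_path"), "\n")
    cat("  round_ids:", paste(attr(x, "round_ids"), collapse = ", "), "\n")
    invisible(x)
}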

Create function for checking integrity of `hub-config` directory

Currently we can check individual config files. It would be good to be able to check the integrity of both config files, and the consistency of versioning between them, with one function, e.g. validate_hub_config().

Such a function could be used in CI to routinely check consistency of hub configuration.
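A rough sketch of the shape such a function might take; the config argument to validate_config() and the exact attribute names here are assumptions based on outputs shown elsewhere in these issues:

validate_hub_config <- function(hub_path = ".") {
    results <- lapply(c("tasks", "admin"), function(config) {
        validate_config(hub_path = hub_path, config = config)  # assumed signature
    })
    # Check that the schema versions used by the config files agree
    versions <- vapply(results, attr, character(1), which = "schema_version")
    if (length(unique(versions)) > 1) {
        warning(
            "Config files validate against different schema versions: ",
            paste(unique(versions), collapse = ", ")
        )
    }
    all(vapply(results, isTRUE, logical(1)))
}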

Explore whether schema containing partitioning variables can safely be used now with csv datasets

From experimentation by @LucieContamin, it seems that, although earlier experiments by arrow devs showed that providing non-hive-style partitions along with a custom schema produced errors when opening a csv dataset, it does in fact seem to work now!

If it indeed works consistently, it means the create_hub_schema() function, which currently returns a different schema depending on the file format, could be simplified.

It also feels more robust to be supplying a schema for the partitioning variable in csv too.
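If it does work, usage might look roughly like this (column names and the model-output layout are illustrative; whether this behaves reliably is exactly what needs verifying):

library(arrow)

hub_schema <- schema(
    origin_date = date32(),
    location = utf8(),
    horizon = int32(),
    value = float64(),
    model_id = utf8()  # the partitioning variable, included in the schema
)

ds <- open_dataset(
    "model-output",
    format = "csv",
    partitioning = "model_id",  # non-hive, directory-based partitioning
    schema = hub_schema,
    skip = 1  # the supplied schema replaces the header row, so skip it
)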

Explore potential for storing testhubs in `inst/` folder as submodules

A suggestion was made by @elray1 to move individual test hubs stored in inst/testhub/ to separate repositories and include them in the package as submodules.

Pros

  • standalone and visible as examples
  • can use continuous integration design to test config and hub data validity to ensure they remain up to date and maintain integrity

Cons

Cons are still unknown and would require research

  • More complicated
  • Unsure at this point whether install_github and eventually installing from CRAN would work as expected. Needs more research.

Here's an example in any case to explore: https://github.com/tnagler/wdm-r

Extracting metadata for Input validation

I've opened this issue to be able to discuss at a higher level and aggregate more specific discussions I'm organising in specific issues linked below.

These questions have arisen from trying to work on #13 & #14. I've had difficulty thinking through the most obviously ideal behaviour of functions extracting specific metadata, in particular valid task_ids and individual task_id valid filtering values, with the downstream consideration that they will ultimately feed into validation workflows when loading and submitting data. I feel the issue needs some more detailed upfront consideration before proceeding.

What I'm having most difficulty with is what to do in cases where metadata for more than one round are required and what the behaviour of validations should be when task_ids or valid task_id values differ between rounds.

To make the issues I'm having difficulty with more concrete, I'm creating separate issues with specific scenarios that we can think through and discuss, but we can use this issue for higher-level discussion.

Scenarios

Input validation during loading

Data validation during submission

Note that I'm also thinking ahead towards validation during data submission. This is perhaps premature, but as we are aiming for efficiency, at least having some high-level consideration of the required behaviour could help ensure low-level functionality can be efficiently reused in both cases.

support yaml format for hub config files

At some point later, we may like to support a yaml format for the hub config files. Due to technical issues, we only support json in a first pass at this functionality.

object class to represent a collection of forecasts

Which of the options in R for dataframe-like objects do we want to build on for a collection of forecasts that have been loaded into an R working session: data.frame, tibble, or data.table?

Here are some points that have been raised in previous discussion:

  • data.frames are the only one of these options that is native to R.
  • tibbles and the tidyverse are a de facto standard that is in common use by a large swath of the R community, and they allow for features that are not supported in data frames such as nesting
  • data.tables are substantially faster than tibbles
  • We could support multiple options... but ultimately, even if we go this route we are probably going to want to pick one that most of our tooling works with, right?

`read_hubmeta`

It might be helpful to have a convenience function to load hub metadata files. Some requirements/desired behaviors:

  • support both json and yaml
  • support this standard for references in json documents: https://datatracker.ietf.org/doc/html/draft-pbryan-zyp-json-ref-03 Briefly, the idea is that a reference has the form { "$ref": "path/to/json/object" }, and when that reference is resolved, it is replaced by the object that was referenced. For example, suppose we have the following input document:
{
  "round-1": {
    "model_tasks": [
      {
        "task_ids": {
          "origin_date": {
            "required": ["2022-09-03"],
            "optional": null
          },
          "location": {
            "required": {
              "$ref": "#/$defs/task_ids/location/us_states"
            },
            "optional": ["US"]
          }
        },
        "output_types": {
          "mean": {
            "type_id": {
              "required": null,
              "optional": ["NA"]
            },
            "value": {
              "type": "integer",
              "minimum": 0
            }
          }
        }
      }
    ],
    "submissions_due": {
        "start": "2022-09-01",
        "end": "2022-09-05"
    }
  },
  "$defs": {
    "task_ids": {
      "location": {
        "us_states": ["01", "02", "04", "05", "06", "08", "09", "10", "11", "12", "13", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "44", "45", "46", "47", "48", "49", "50", "51", "53", "54", "55", "56"]
      }
    }
  }
}

After resolving the reference, we would obtain the following document:


{
  "round-1": {
    "model_tasks": [
      {
        "task_ids": {
          "origin_date": {
            "required": ["2022-09-03"],
            "optional": null
          },
          "location": {
            "required": ["01", "02", "04", "05", "06", "08", "09", "10", "11", "12", "13", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "44", "45", "46", "47", "48", "49", "50", "51", "53", "54", "55", "56"],
            "optional": ["US"]
          }
        },
        "output_types": {
          "mean": {
            "type_id": {
              "required": null,
              "optional": ["NA"]
            },
            "value": {
              "type": "integer",
              "minimum": 0
            }
          }
        }
      }
    ],
    "submissions_due": {
        "start": "2022-09-01",
        "end": "2022-09-05"
    }
  },
  "$defs": {
    "task_ids": {
      "location": {
        "us_states": ["01", "02", "04", "05", "06", "08", "09", "10", "11", "12", "13", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "44", "45", "46", "47", "48", "49", "50", "51", "53", "54", "55", "56"]
      }
    }
  }
}

Question: should the function also delete a "$defs" block, if there is one? Maybe as an option that defaults to TRUE?
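A rough sketch of resolving local "$ref" pointers in base R with jsonlite; read_hubmeta() itself and its exact behaviour are not implemented here:

library(jsonlite)

resolve_refs <- function(node, root) {
    if (is.list(node)) {
        nms <- names(node)
        if (!is.null(nms) && "$ref" %in% nms) {
            # "#/$defs/task_ids/location/us_states" -> keys into the document root
            path <- strsplit(sub("^#/", "", node[["$ref"]]), "/")[[1]]
            return(Reduce(function(x, key) x[[key]], path, root))
        }
        return(lapply(node, resolve_refs, root = root))
    }
    node
}

hubmeta <- fromJSON("hubmeta.json", simplifyVector = FALSE)
resolved <- resolve_refs(hubmeta, root = hubmeta)
resolved[["$defs"]] <- NULL  # optionally drop the definitions block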

support operations on informal hubs

Often, part of my workflow involves producing forecasts that are not part of a formal hub setup. Here is one example: https://github.com/elray1/rclp-example. Note that often these are forecasts that could in principle be submitted to an existing hub, in that they match the targets and output types of those hubs -- but not necessarily. For example, I recently developed daily influenza hospitalization forecasts; this is not a target that any hub is collecting, but it was useful to be able to work with those forecasts.

It would be ideal if we could support these kinds of use cases to the extent possible without requiring users to create the hub config files. Specifically, would it be possible to provide some minimal support for common operations like reading in and plotting forecasts? Some notes:

  • it seems safe to assume the files are stored using the kind of directory layout we have described for the model-outputs folder, with team/model-specific folders within it and submission files inside those
  • it would be OK not to do validation in these instances, or to only do some minimal validation

[Input validation - LOAD] Scenario #1 Invalid task_id

In this scenario, the user requests data for a task_id that does not exist. There are two scenarios where this could happen:

Incorrectly input task_id

Part of these checks is to ensure task_ids are requested correctly. A user might request specific values of a task_id but there might be a typo, e.g. requesting a loction instead of location.

Options for action

  • Throw an error and direct user to the invalid task id
  • Throw a warning, ignore the invalid task id, and return data matching all other valid filters (could make the returned data significantly larger?)

task_id not available in one or more rounds but available in others

Let's assume that task id age_group was not included in round-1 of a hub but was from round-2 and thereafter. A user requests data from round-1 and round-2 and includes an age group filter, let's say 6-18. What do we want the behaviour of input validation checks to be?

Options for action

  • Throw an error (I think no, but including for completeness)
  • Throw a warning about age_group not being a valid task_id and return ALL data from round-1 (as it cannot be filtered by age group)
  • Throw a warning about age_group not being a valid task_id and return NO data from round-1 (as it cannot be filtered by age group)
  • Throw a warning and control, through an argument, whether ALL or NO data is returned for rounds where the task_id is invalid.

Distinguishing between the two situations

We may want different behaviour between the two aforementioned situations. For example, if we want an error in the first situation and a warning in the second, we need to be able to distinguish between the two. One option could be: if a given task_id is not found as a valid task_id in any round, assume we are dealing with the first situation; otherwise assume we are dealing with the second.
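A rough sketch of that distinguishing logic; valid_task_ids_by_round is an assumed named list of valid task id names per requested round:

check_task_id_filter <- function(task_id, valid_task_ids_by_round) {
    valid_in_round <- vapply(
        valid_task_ids_by_round,
        function(ids) task_id %in% ids,
        logical(1)
    )
    if (!any(valid_in_round)) {
        # Situation 1: not valid in any round, most likely a typo
        stop("`", task_id, "` is not a valid task_id in any requested round.")
    }
    if (!all(valid_in_round)) {
        # Situation 2: valid in some requested rounds only
        warning(
            "`", task_id, "` is not a valid task_id in round(s): ",
            paste(names(valid_in_round)[!valid_in_round], collapse = ", ")
        )
    }
    invisible(valid_in_round)
}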

Additional considerations

Related to my question here #14 (comment), I'm assuming task_ids might only change in hubs with round-specific task ids. Is this strictly the case, however? What happens in a round_id_from_variable hub? Could they ever, for example, add task_ids at some point, and how would that be handled in the hubmeta file? Would they just add the additional task_id to their metadata, or shift to a round varying metadata format with different task ids for the pre and post task id periods?

Explore using arrow datasets for accessing hub model outputs

Apache Arrow datasets provide a convenient way to open a connection to data contained within a structured directory and run queries against it, much like querying a database, which makes interacting with the data simple and elegant.

It also supports data in a variety of relevant formats, including csv and parquet.

I've opened this issue as a place to collect findings from experimenting with this approach, list pros and cons, and determine suitability.
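As a starting point, a minimal sketch of the kind of workflow this would enable, assuming parquet model outputs under "model-output/" partitioned by model folder (paths and column names are illustrative):

library(arrow)
library(dplyr)

ds <- open_dataset("model-output", partitioning = "model_id")

ds |>
    filter(location == "US", output_type == "quantile") |>
    collect()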

Create dummy data of epiweek cdf forecasts

To expand testing, I need to generate some slightly more elaborate test hub cases and want to include examples of an epiweek cdf output type.

Firstly, does this seem like reasonable model output data?

epi_tab <- tibble::tibble(
    output_type = "cdf",
    type_id = glue::glue("EW2023{sprintf('%02d', 1:20)}"),
    value = ppois(1:20, lambda = 7)
)

which gives:

output_type  type_id   value
cdf          EW202301  0.007295056
cdf          EW202302  0.029636164
cdf          EW202303  0.081765416
cdf          EW202304  0.172991608
cdf          EW202305  0.300708276
cdf          EW202306  0.449711056
cdf          EW202307  0.598713836
cdf          EW202308  0.729091268
cdf          EW202309  0.830495937
cdf          EW202310  0.901479206
cdf          EW202311  0.946650377
cdf          EW202312  0.973000227
cdf          EW202313  0.987188607
cdf          EW202314  0.994282798
cdf          EW202315  0.997593420
cdf          EW202316  0.999041817
cdf          EW202317  0.999638216
cdf          EW202318  0.999870149
cdf          EW202319  0.999955598
cdf          EW202320  0.999985505

library(ggplot2)
ggplot(epi_tab,
       aes(x = type_id, y = value, group = output_type)) +
    geom_line() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Created on 2023-04-20 with reprex v2.0.2

I have a few follow up questions to ensure I understand what's being modelled here correctly:

  • Am I right in assuming that the target associated with such an output type is "peak week" of some outcome?
  • How does a cdf of peak week relate to other task_ids like "origin_date" or "horizon"?
    • Does "origin_date" limit the start of the time frame over which the cdf of peak week is calculated?
    • And what is the meaning of horizon when peak week is being forecasted? Would that just be NA?

Any pointers on how to ensure I get this dummy data right are welcome, as well as links to hubs already collecting such data (I did a quick scavenge of a few hubs I know of but couldn't locate any).

more generic default `params` argument in `get_task_id_params`

Currently, the default value of the params argument to get_task_id_params is c("location", "horizon", "origin_date"), but those specific model task id variables are not relevant to all hubs. It may make sense for this function to be a method on a hub_connection object, and for it to default to something like returning the first task id variable in the hub metadata file.
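A hypothetical sketch of such a default, using the task_id_names attribute attached to hub_connection objects elsewhere in these issues (none of this reflects the actual implementation):

get_task_id_params.hub_connection <- function(hub, params = NULL, ...) {
    if (is.null(params)) {
        task_ids <- attr(hub, "task_id_names")
        if (is.list(task_ids)) task_ids <- task_ids[[1]]  # round varying hubs
        params <- task_ids[1]  # default to the first task id variable
    }
    params
}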

[Input validation - LOAD] Scenario #2 Invalid task_id filtering value

In this scenario, the user requests data using an invalid filtering value for a task_id. There are again two scenarios where this could happen:

Incorrectly input task_id filtering value

Part of these checks is to ensure task_id filters are implemented correctly. A user might request specific values of a task_id but there might be a typo, e.g. requesting location = USA instead of location = US.

Options for action

  • Throw an error and direct user to the invalid task id filtering value
  • Throw a warning, ignore the invalid task id filtering value, and return data matching all other valid filters (could make the returned data significantly larger?)

task_id filtering value not available in one or more rounds but available in others

Let's assume that the valid values for task id age_group included ["0-5", "6-18", "19-24", "25-64", "65+"] in round-1 of a hub, but in round-2 and thereafter became more fine-grained, e.g. ["0-5", "6-18", "19-24", "25-34", "35-44", "45-54", "55-64", "65+"]. A user requests data from round-1 and round-2 and includes an age group filter, let's say 35-44. What do we want the behaviour of input validation checks to be?

Options for action

  • Throw an error (I think no, but including for completeness)
  • Throw a warning about 35-44 not being a valid filtering value for age_group in round-1 and return ALL data from round-1
  • Throw a warning about 35-44 not being a valid filtering value for age_group in round-1 and return NO data from round-1
  • Throw a warning and control, through an argument, whether ALL or NO data is returned for rounds where the filtering value is invalid.

Distinguishing between the two situations

Again, we may want different behaviour between the two aforementioned situations. For example, if we want an error in the first situation and a warning in the second, we need to be able to distinguish between the two. One option could be: if a given task_id filtering value is not found in any round, assume we are dealing with the first situation; otherwise assume we are dealing with the second.

Unexpected successful validation of malformed tasks.json file

The tasks.json file in the example-simple-forecast-hub currently has a target_metadata block that is in the wrong location, here. That block is nested within the "output_type" block, but I think it should not be there. Note that the output_type specification in the schema file does not include target_metadata as a valid output type.

However, running hubUtils::validate_config yields a passing result:

> library(hubUtils)
> hubUtils::validate_config(".")
Loading required namespace: jsonvalidate
Successfully validated config file ./hub-config/tasks.json against schema <https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/v1.0.0/tasks-schema.json>
[1] TRUE
attr(,"config_path")
[1] "./hub-config/tasks.json"
attr(,"schema_version")
[1] "v1.0.0"
attr(,"schema_url")
[1] "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/v1.0.0/tasks-schema.json"

Perform manual validation checking

As json schema validators cannot validate dynamically, some checks need to be performed manually.

Checks need to be performed on:

  • each element of the rounds array
  • each element of the model_tasks array within each round
  • each element in the target_metadata array

Target validation

  • 1: Check that both task_ids & target_metadata exist. If either is missing or cannot be read properly, this will be caught by standard validation. Abort further tests until they are fixed.
  • 2: Validate target_key names of each target_metadata element to ensure they all correspond to properties in the corresponding task_ids (see the sketch after this list).
  • 3: Validate target_key values of all target_metadata elements against the corresponding task_id property values to ensure they exist in the task ids.
  • 4: Validate values of task_id properties specified as target keys to ensure they have been defined as corresponding target key values.
  • 5: Ensure the combinations of values of task_id properties specified as target keys correspond to valid combinations across optional and required. Unsure about the implementation of this; see questions and example below:
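A rough sketch of checks 2 and 3, assuming model_task is one parsed element of a round's model_tasks array (a nested list as returned by jsonlite):

check_target_metadata <- function(model_task) {
    for (tm in model_task$target_metadata) {
        keys <- tm$target_keys
        # Check 2: every target_key name must be a task_ids property
        bad_names <- setdiff(names(keys), names(model_task$task_ids))
        if (length(bad_names) > 0) {
            stop("Unknown target_key name(s): ", paste(bad_names, collapse = ", "))
        }
        # Check 3: every target_key value must exist among the task id's valid values
        for (nm in names(keys)) {
            valid <- unlist(model_task$task_ids[[nm]][c("required", "optional")])
            if (!keys[[nm]] %in% valid) {
                stop("Invalid value `", keys[[nm]], "` for target_key `", nm, "`")
            }
        }
    }
    invisible(TRUE)
}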

Questions:

Q1

The introduction of targets being split across 2 or more task ids has confused me a little in terms of how required and optional should be specified, in that it's now specific combinations of the values of such task ids that become required or optional.

Consider the following stripped-down example focusing on task ids and associated target metadata. Here we have two possible targets ("inc hosp", "inc case") split across two task ids ("target_variable" & "target_outcome"). Let's say also that we want only one of the targets to be required ("inc hosp") and one optional ("inc case").

In my initial attempt to represent this, "inc" ends up as both a required and optional value. Is that reasonable?

{
    "task_ids": {
        "target_variable" : {
            "required" : ["hosp"],
            "optional": ["case"]
        },
        "target_outcome" : {
            "required" : ["inc"],
            "optional": ["inc"]
        }
    },
    "target_metadata": [{
        "target_keys": {
            "target_variable": "hosp",
            "target_outcome": "inc"
        }
    }, {
        "target_keys": {
            "target_variable": "case",
            "target_outcome": "inc"
        }
    }
    ]
}

In addition, do we agree that check 5 listed above becomes necessary to check the validity of combinations of values of task id target key properties across required and optional?

Q2

target_metadata is an array. Is it possible that the target key names might vary across different elements of the target_metadata array for the same model task element, or is it just the values of the target key properties that will vary?

Error message not clear with hubUtils::validate_config

I was trying to validate the config files in the megaround-scenariohub repository and I got an error but the message is confusing.

I first got an Error in loadNamespace(x) : there is no package called ‘jsonvalidate’ error that I easily fixed by installing jsonvalidate, but it might be good to change the DESCRIPTION file to have it in Imports instead of Suggests.

The second error I have, I am not certain how to fix it:

> hubUtils::validate_config(hub_path = ".")
Error in `purrr::map_chr()` at hubUtils/R/validate-config-file.R:275:2:
ℹ In index: 1.
Caused by error in `rep()`:
! invalid 'times' argument
Run `rlang::last_error()` to see where the error occurred.
Warning message:
Schema errors detected in config file ./hub-config/tasks.json validated against schema <https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/v0.0.0.9/tasks-schema.json> 

> rlang::last_error()
<error/purrr_error_indexed>
Error in `purrr::map_chr()` at hubUtils/R/validate-config-file.R:275:2:
ℹ In index: 1.
Caused by error in `rep()`:
! invalid 'times' argument
---
Backtrace:
  1. hubUtils::validate_config(hub_path = ".")
  3. hubUtils::launch_pretty_errors_report(validation)
       at hubUtils/R/validate-config-file.R:105:6
  4. purrr::map_chr(error_df[["instancePath"]], path_to_tree)
       at hubUtils/R/validate-config-file.R:275:2
  5. purrr:::map_("character", .x, .f, ..., .progress = .progress)
  9. hubUtils (local) .f(.x[[i]], ...)
 11. base::paste(rep("─", times = i - 2), collapse = "")
       at hubUtils/R/validate-config-file.R:381:4
Run `rlang::last_trace()` to see the full context.

It seems to be linked to the "schema_version" parameter that has a length of 1, if I understand correctly. But I am not sure how to fix it.

Evaluate the need for accommodating multiple model output file formats in a single hub

In a previous meeting, a number of you mentioned mixing model output files of different formats (e.g. CSV and parquet) in the same hub. I'm trying to figure out whether this is something we might need to support and, if so, what the best approach would be, so could you let me know:

  1. If this is something you have implemented or think you might need to implement at some point in one of your hubs.

  2. Whether you have different file formats intermixed within individual directories or whether you separate them out into separate directories for each format.

[Input validation - Submit] Scenario #1 Invalid task_id

A team wants to submit data but one of the columns in the dataset does not correspond to a valid task id for the given round.

Options for action

  • Throw an error identifying the column that does not correspond to a valid task_id

consider refactoring basic validations of forecast data

  • It seems like many functions will want to do the kinds of checks that are being done here -- should we extract them into a function in hubUtils or hubValidations?
  • Additionally/alternatively, should that function just accept an object that has the information about task ids, output types and type ids, and value columns as attributes (as outlined in this comment) and assume they've already been validated? Or call a function that's designed to do these validation checks based on that data structure?
