hubverse-org / schemas Goto Github PK

JSON schemas for modeling hubs

License: Creative Commons Zero v1.0 Universal

schemas's Introduction

Modeling Hub Schemas

This repository houses JSON schemas for the Consortium of Infectious Disease Modeling Hubs. These schemas define the specifications for the configuration files that are required to be present in a modeling hub. Full documentation about modeling hubs can be found at the Modeling Hub documentation site, with some specific documentation about the schema files.

Versioning

Schemas will be directly versioned, with different versions living in different folders in the root directory of the repo with a name as vx.x.x (for example, v0.0.1). Any finalized change to any of the three schema files that is added to the main branch will result in the addition of a new set of all schema files. To determine an appropriate version number for the next version, follow semantic versioning principles.

Wen creating new versions and making changes to the schema file, make sure to record important user facing changes in NEWS.md.

Schema documentation

The HubDocs documentation site is the primary location for documenting schema usage. It is also versioned by using releases and should track releases in this repository.

After making a new release to the schema repository, ensure hubDocs are also appropriately updated and an associated new release in the hubDocs repository also created.

New schema version development process

New schema versions should be developed in a separate branch. Name the branch v{version-number}-branch to avoid creating release tags which share the same name as a branch later on.
New version branches should be merged into main when ready to released.
Merging into main should be accompanied by creating an associated formal release in the repository.
Update HubDocs site with any additional relevant information associated with the new schema release.
Create a new release on hubDocs using the same version number but without the v (e.g. v0.0.1 would be released as 0.0.1 on hubDocs).
Update the hubTemplate config to reflect the most up to date schema. Create a new release using the same version.

Highlighting changes to schema in PRs

To bring attention to the changes in new schema versions, it's useful to include in any PR, a print out of the diffs in the tasks-schema.json and admin-schema.json files compared to the previous version.

Automated Process (via GitHub)

After you create a new Pull Request, if you create a new comment with /diff, GitHub will automatically generate the diffs of the tasks-schema.json and admin-schema.json and comment on the pull request.

If you need to update the schema after review, you can update the diffs by creating another /diff comment.

If this does not work for any reason, you can follow the manual process below.

Manual Process

To print the diffs in each file you can use the following commands in the terminal:

`admin-schema.json`

diff -u --color=always $(ls -d */ | sort | tail -n 2 | head -n 1)admin-schema.json $(ls -d */ | sort | tail -n 1)admin-schema.json

`tasks-schema.json`

diff -u --color=always $(ls -d */ | sort | tail -n 2 | head -n 1)tasks-schema.json $(ls -d */ | sort | tail -n 1)tasks-schema.json

💡 Tips

Show diff colours in PR

To show the colour of the diffs in the PR, wrap the output of the commands in a diff code block, e.g.

```diff
- old line
+ new line
```
is rendered in the PR renders as:
- old line
+ new line
Send output directly to clipboard

Depending on your system (macOS or Linux), you can pipe the output of the above commands directly to the clipboard. See examples below:

macOS:
diff $(ls -d */ | sort | tail -n 2 | head -n 1)tasks-schema.json $(ls -d */ | sort | tail -n 1)tasks-schema.json | pbcopy
Linux:

Make sure xclip is installed. You can install it using your package manager, e.g., sudo apt-get install xclip on Debian-based systems.
diff $(ls -d */ | sort | tail -n 2 | head -n 1)tasks-schema.json $(ls -d */ | sort | tail -n 1)tasks-schema.json | xclip -selection clipboard

schemas's People

Contributors

Stargazers

Watchers

Forkers

kjsato testorg-rename

schemas's Issues

more convenient representations of required/optional task id values

Is there a way we could simplify a situation where there are multiple overlapping sets of values for task ids?

Examples:

Location:

us: “US”
us_states”: [“US”, “01”, … ]
us_states_counties: above plus all the counties

Horizon:

horizon_1: [1]
horizon_2: [1,2]
horizon_3: [1,2,3]
…
horizon_52: [1, 2, 3, …, 52]

Ideas for a better way

Idea 1: Can we simplify by specifying data type and range rather than providing lists?

Ex. 1: horizon

Rather than horizon: [1, 2, 3, …, 52]
Something like horizon: {type:integer, min:1, max:52]

Ex. 2: origin_date
Rather than origin_date: [“2020-04-07”, “2020-04-14”, …, “2022-12-05”]
Something like origin_date: {type:date, min: “2020-04-07”}, perhaps with some way to specify weekday?

Idea 2: Some way to specify concatenation of arrays within the json format?

update schema to reflect changes in sample output_type

Details here. New schema will incorporate "output_type_id_params" with different parameters than other output_types.

License for Hub repository

Should we add a license field for the Hub repository, maybe in the admin.json file?

change target_type "categorical" to "nominal"

For more clear distinction from "binary" and "ordinal" target types.

Add `target_variable` and `target_outcome` to task ids as potential task ids

This reflects the fact we include target as a potential task_id and target_variable and target_outcome are an option that has been discussed.

One question I had is that while using a single target task id, optional and required arrays of multiple unique targets are explicit, what happens when you have multiple targets but spread across target_variable and target_outcome. How do you ensure they specify unique and correct combinations? Does the order they are specified in the array become important?

e.g. if two valid targets where inc hosp & inc case, how would they be specified?

"target_variable": {
  "required": null,
  "optional": ["hosp", "case"]
},
"target_outcome": {
  "required": null,
  "optional": ["inc", "inc"]
},

@elray1 your thoughts would be appreciated!

Clarify thinking about required and optional model tasks and output types

Currently, the required and optional values of output type ids can in effect also specify whether the corresponding output types as a whole are required: namely, a particular output type is required if it has at least one required type_id, and is optional otherwise. This may be confusing. Is there another way? See also the related discussion under issue #9.

Current proposed system

To explain the situation, we consider a series examples of hubs with varying modeling task specifications.

Example 1:

      "model_tasks": [
        {
          "task_ids": {
            "location": {
              "required": ["a", "b"],
              "optional": ["c", "d"]
            },
            "horizon": {
              "required": [1, 2],
              "optional": [3, 4]
            }
          },
          "output_types": {
            "median": {
              "type_id": {
                "required": ["NA"],
                "optional": null
              },
              "value" : {
                "type": "integer",
                "minimum": 0
              }
            },
            "quantile" : {
              "type_id": {
                "required": [0.25, 0.5, 0.75],
                "optional": [0.1, 0.9]
              },
              "value": {
                "type": "integer",
                "minimum": 0
              }
            }
          }
        }
      ]

For a hub with this specification, a valid submission must include at least the following rows, obtained via a kind of expand_grid action across the different combinations of required values for the task id variables and required type_ids within each output type. Note that in this process, you could imagine first concatenating the output_types with the options for type_id values within each output_type, so that they are treated as a "unit" when the expand_grid happens. Then split them back into two columns. This is necessary to track the nesting of type_id values withing the specific output types.

 location horizon output_type type_id value
        a       1     median      NA   ...
        b       1     median      NA   ...
        a       2     median      NA   ...
        b       2     median      NA   ...
        a       1    quantile    0.25   ...
        b       1    quantile    0.25   ...
        a       2    quantile    0.25   ...
        b       2    quantile    0.25   ...
        a       1    quantile     0.5   ...
        b       1    quantile     0.5   ...
        a       2    quantile     0.5   ...
        b       2    quantile     0.5   ...
        a       1    quantile    0.75   ...
        b       1    quantile    0.75   ...
        a       2    quantile    0.75   ...
        b       2    quantile    0.75   ...

Example 2

Example 2 is the same as example 1, but it has only one required quantile level:

      "model_tasks": [
        {
            "location": {
              "required": ["a", "b"],
              "optional": ["c", "d"]
            },
            "horizon": {
              "required": [1, 2],
              "optional": [3, 4]
            }
          },
          "output_types": {
            "median": {
              "type_id": {
                "required": ["NA"],
                "optional": null
              },
              "value" : {
                "type": "integer",
                "minimum": 0
              }
            },
            "quantile" : {
              "type_id": {
                "required": [0.5],
                "optional": [0.1, 0.25, 0.75, 0.9]
              },
              "value": {
                "type": "integer",
                "minimum": 0
              }
            }
          }
        }
      ]

For a hub with this specification, a valid submission must include at least the following rows:

 location horizon output_type type_id value
        a       1        median      NA   ...
        b       1        median      NA   ...
        a       2        median      NA   ...
        b       2        median      NA   ...
        a       1    quantile     0.5   ...
        b       1    quantile     0.5   ...
        a       2    quantile     0.5   ...
        b       2    quantile     0.5   ...

Example 3

Example 3 is similar to examples 1 and 2, but now all of the quantile levels are specified as optional.

      "model_tasks": [
        {
            "location": {
              "required": ["a", "b"],
              "optional": ["c", "d"]
            },
            "horizon": {
              "required": [1, 2],
              "optional": [3, 4]
            }
          },
          "output_types": {
            "median": {
              "type_id": {
                "required": ["NA"],
                "optional": null
              },
              "value" : {
                "type": "integer",
                "minimum": 0
              }
            },
            "quantile" : {
              "type_id": {
                "required": null,
                "optional": [0.1, 0.25, 0.5, 0.75, 0.9]
              },
              "value": {
                "type": "integer",
                "minimum": 0
              }
            }
          }
        }
      ]

For a hub with this specification, a valid submission must include at least the following rows:

 location horizon output_type type_id value
        a       1        median      NA   ...
        b       1        median      NA   ...
        a       2        median      NA   ...
        b       2        median      NA   ...

Example 4

Our final example is similar to example 1, but swaps the specification of ["NA"] and null values in the required and optional fields for the mean output type:

      "model_tasks": [
        {
            "location": {
              "required": ["a", "b"],
              "optional": ["c", "d"]
            },
            "horizon": {
              "required": [1, 2],
              "optional": [3, 4]
            }
          },
          "output_types": {
            "median": {
              "type_id": {
                "required": null,
                "optional": ["NA"]
              },
              "value" : {
                "type": "integer",
                "minimum": 0
              }
            },
            "quantile" : {
              "type_id": {
                "required": [0.25, 0.5, 0.75],
                "optional": [0.1, 0.9]
              },
              "value": {
                "type": "integer",
                "minimum": 0
              }
            }
          }
        }
      ]

For a hub with this specification, a valid submission must include at least the following rows:

 location horizon output_type type_id value
        a       1    quantile    0.25   ...
        b       1    quantile    0.25   ...
        a       2    quantile    0.25   ...
        b       2    quantile    0.25   ...
        a       1    quantile     0.5   ...
        b       1    quantile     0.5   ...
        a       2    quantile     0.5   ...
        b       2    quantile     0.5   ...
        a       1    quantile    0.75   ...
        b       1    quantile    0.75   ...
        a       2    quantile    0.75   ...
        b       2    quantile    0.75   ...

Summary and question for discussion

Summary: Under the current system, the required rows that a submission must minimally obtain are obtained by applying an expand_grid type of action to the task id variables and combinations of output types and type ids. This means that if there are no required values under the type_ids for a particular output type, a minimal submission does not need to include any rows with that output type. Effectively, this means that that output type is optional. Saying this again in different words: in this set up, a particular output type is required only if there is at least one value specified as required in the type_ids under that output type. This is illustrated in examples 3 and 4 above.

Every time this has come up, this use of required/optional values of a type_id to implicitly set the status of an output type has been non-intuitive. How can we resolve this? Three ideas:

Change the representation of the output column so that it has the required and optional properties similar to the other columns. We would then perhaps check that the names of any additional properties currently under "output_types" match the values that were specified as required or optional for the output column. We would need to think through and document how this interacts with the "implicit requirement" for output types that comes out of the current procedure as illustrated above.
Some other higher level field indicating which output types are required and optional. We would need to think through and document how this interacts with the "implicit requirement" for output types that comes out of the current procedure as illustrated above.
Somehow more directly specify the concatenated/nested values of columns output_type and type_id (and any restrictions on value) as being required or optional.
Lots of documentation.

Use`uniqueItems: true` to ensure arrays contain unique values

Does `required` and `optional` apply to categorical output types

Given probabilities across all possible categories sum to 1, it doesn't make much sense to me to have some categories optional and some required?

If anything feels like they should either all be required or all optional (with any missing assumed to be 0 probability) so long as what is provided sums to one?

add model-schema.json

Add schema for model metadata as per https://hubdocs.readthedocs.io/en/latest/format/model-metadata.html

Add Tests!!

Version v0.0.9 had a bug in the admin.json file that effectively made the file unusable :(

Need a basic GitHub Action that at the very least checks that schema are valid json schema files

Handle mean and median type_id specifications more efficiently?

At the moment, for mean and median type_id, we are asking for both required and optional value specification. In type_id which will eventually be NA in R, we are also allowing for either NA or NULL to be supplied.

Given the type_id of mean and median must always be NA and required and optional has no meaning in the context of either type_id or value, should we simplify the schema structure to:

"mean": {
                      "type": "object",
                      "description": "the mean of the predictive distribution",
                      "properties": {
                        "type_id": {
                          "description": "Not used for mean output type. Must be NA or null.",
                              "type": "array",
                              "items": {
                                "enum": ["NA"],
                                "maxItems": 1
                              }
                            },
                        "value": {
                          "type": "object",
                          "properties": {
                            "type": {
                              "type": "string",
                              "enum": ["numeric", "double", "integer"]
                            },
                            "minimum": {
                              "type": "integer"
                            },
                            "maximum": {
                              "type": "integer"
                            }
                          },
                          "required": [
                            "type"
                          ]
                        }
                      },
                      "required": [
                        "type_id",
                        "value"
                      ]
                    }

Alternatively, we can make type_id optional all together for mean and median and ignore it by default in R. That would also get round the awkwardness of having to specify "NA" within a one element array so that R will automatically convert it to NA when reading in.

A final option is to require type_id for mean & median to always be null (again getting rid of the awkward array) in tasks.json and replace it with NA in R once it's read in?

beef up documentation about relative submission dates

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/b95a91cb9953315bc27ddea6278a28eea30a250c/tasks-schema.json#L747

Can we give some example calculations, so it's really specific about what date would be accepted or rejected in a particular situation with particular relative start/end dates?

`type` often omitted when specifying property values with `enum` or `const`

There are some locations in the tasks.json schema where enum or const keywords are used to effectively dictate what property values should be but type is not specified for the property. See examples:

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/19f40344655064dff9fc8e78f7a10eacb5894cea/v0.0.1/tasks-schema.json#L759-L767

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/19f40344655064dff9fc8e78f7a10eacb5894cea/v0.0.1/tasks-schema.json#L769-L772

Although the values provided in enum or const somewhat dictate the data type of the field, the omission of the type property feels somewhat inconsistent.

In other places type is included:

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/19f40344655064dff9fc8e78f7a10eacb5894cea/v0.0.1/tasks-schema.json#L646-L657

Should I make sure all properties require that type is provided?

Thoughts @nickreich @LucieContamin ?

Change `output_types` property to `output_type`

Change output_types property to output_type to match column names in data files.

question about data types for task id values

Our current schema suggests that some task id values might accept multiple data types, e.g. scenario_id lists both "integer" and "string" here.

I had some questions about this:

How are we using these data types?
- It seems possible/likely that this is only being used to validate that the hub has specified its tasks.json file correctly.
- If I set up a hub where the scenario_id has an integer data type, will that column of model outputs be converted to an integer when I read in some model outputs? Would we want to do this?
- If I set up a hub where the scenario_id has integer data type, do we expect validations to throw an error if a model submission has encoded the value as "1" instead of 1, either via a data type specification in a parquet file or (possibly?) via quoting values in a csv file? Would we want to do this? (I think the quoting values in csv files is pretty iffy, that's not really a data type specification so much as a csv formatting thing...)

I am basically wondering if the data type that shows up in a hub's tasks.json file for values of task id variables matters or should matter for downstream processing?

"value" object in quantile "output_type" should always have specific values

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/b95a91cb9953315bc27ddea6278a28eea30a250c/tasks-schema.json#L486

Should always be this:

                            "value": {
                                "type": "numeric",
                                "minimum": 0,
                                "maximum": 1
                            }

maybe this could be documented clearly in the documentation?

Add schema support for strings as sample indices in `output_type_id`

Per discussion related to #48, in a first pass we will only support integer sample indices for output_type_id, but we would eventually like to support strings as well. To do this, we will need to add two properties to the output_type_id field in a hub's tasks.json config file:

“type”: the data type of the sample index. "string" or "integer"
“max_length”: if "type" is "string", the maximum number of characters in a sample index

"max_length" should be required if "type" is "string", but not otherwise.

add capacity to link integers with category numbers for categorical targets

In a categorical target definition, could we use what is currently stored in
model_tasks > output_type > categorical > type_id
to encode a mapping between integers and categories, so that we could have sample-based representations of categories? E.g. a table like

output_type | type id | value 
----------- | ------- | ------
"sample"    | 1       | 4
"sample"    | 2       | 3
"sample"    | 3       | 7
"sample"    | 4       | 4

where type_id corresponds to the index of the sample and value corresponds to the number corresponding to the category. Noting that categorical targets are weird because the categories can show up as the type_id for pmf output_type but also as the value for the sample output_type.

Introduce `"mode"` output_type

Opening this issue to move discussions on this topic to the repo.

From slack:

@nickreich :

[5 days ago]
How would people feel about adding an output_type of "mode" to the other existing types? This came up today in a conversation with @annakrystalli as it seems like a possibly natural form of a point estimate for a categorical target. E.g. a "mean" or "median" wouldn’t make sense. I will note that the mode could be extracted from the representation of a probability mass function for a categorical outcome, but that would require a probabilistic forecast. If you like the idea, please just add a ✅ . If you have questions or comments or objections, please add a note here. Thanks!

One comment on this after discussing briefly with Evan is that the tabular data representation would maybe be kind of ugly, e.g. since we can only have numeric objects in the “value” column, maybe it would look something like this?

output_type	type_id	value
"mode"	["cat1", "cat2", "cat3"]	[0,1,1]

where the type-id is an array of the possible values of the categorical variable and the array in value would be indicating which value(s) are the mode? Or maybe this would need to be spread over two rows, to keep value purely numeric?

Add `forecast_date` / `target_end_date` properties to schema?

In our schema we currently have a origin_date:

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/c1d8fc0e993ed07c2d783becbe91ea10e31c7e85/v1.0.0/tasks-schema.json#L51C42-L60

and a target_date property:

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/c1d8fc0e993ed07c2d783becbe91ea10e31c7e85/v1.0.0/tasks-schema.json#L268-L306

Have we settled on whether forecast_date (or other) / target_end_date represent standard concepts and should therefore have schema properties for validating them when included in tasks.json files?

update default value for model_output_directory

I think it should be "model-output", not "model_output".

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/de580d56b8fc5c24dd36a32994182e37b8b0ac95/v2.0.0/admin-schema.json#L635

add optional "model_output_folder" object in admin.json

The idea here is to allow for compatibility with repos that have a model_output folder that is not called model_output. Desired behavior is if this field is not specified then the validation code assumes there is a directory called "model_output". If it is specified then the expectation is that the model_output folder would be redirected to the specified folder in this field.

timezone information

Per suggestions from Koji, should we add a "timezone" information field in the admin.json schema? As different Hubs are on different timezome, it might be interesting to store that information especially if, for example, the submission date start/end date of the Hub need to be precise.

Is limiting specification of maximum and minimum values for mean and median output types to integers to restrictive?

I noticed that in the schema, specifying output_type mean minimum and maximum values in a config file is restricted to "integer":

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/19f40344655064dff9fc8e78f7a10eacb5894cea/v0.0.1/tasks-schema.json#L474-L480

Same in median:

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/19f40344655064dff9fc8e78f7a10eacb5894cea/v0.0.1/tasks-schema.json#L568-L575

Although these are optional properties, Is this unnecessarily restrictive? I can imagine someone perhaps wanting to use a decimal number to specify ranges. Should I change the schema to ["integer", "number"].

Update authors

Use definition below
Definitions of roles (Author, maintainer, contributor, etc…)

Going forward, ensure that PRs from new contributors include an update to authorship roles (for R packages, in description file)

[MEETING - 2022-11-23] Resolve open discussion issues

Topics for discussion & ideally resolution at 2022-11-23 Meeting

Ensure `$schema` properties start with `http` not `https`

URLs which start with https are not recognised by ajv causing issues when using other validators like ajv-cli. So all URLs in $schema properties specified the json schema version should start with http.

See issues brought up in #42

versioning for schemas, suggest that instances of the schema reference the versioned schema

I think we would like the schemas that are defined in this repository to be versioned -- that way, a configuration file like admin.json in a hub repository can reference the version of the admin-schema.json file that it is based on, and if we ever make changes to the schema definition, downstream tools can still process the hub's admin config settings correctly.

I'm not sure if there is a standard way to do this, but maybe we can use naming conventions like admin-schema-v0.0.1?

Then, we might expect the file like admin.json in a hub to look like this:

{
    "$schema": "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/admin-schema-v0.0.1.json",
    "name": "Simple Forecast Hub",
    "maintainer": "Consortium of Infectious Disease Modeling Hubs",
    "contact": {
        "name": "Joe Bloggs",
        "email": "[email protected]"
    },
    "repository_url": "https://github.com/Infectious-Disease-Modeling-Hubs/example-simple-forecast-hub",
    "hub_models": [{
        "team_abbr": "simple_hub",
        "model_abbr": "baseline",
        "model_type": "baseline"
    }]
}

And in turn, admin-schema-v0.0.1.json would need to specify that such a $schema property is expected:

{
    "$schema": "http://json-schema.org/draft-07/schema",
    "title": "Hub administrative settings",
    "description": "This JSON file provides a schema for modeling hub administrative information.",
    "type": "object",
    "properties": {
        "$schema": {
            "description": "The version of the schema file used to define the administrative settings config file",
            "type": "string",
            "example": "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/admin-schema-v0.0.1.json"
        },
        "name": {
            "description": "The name of the hub.",
            "type": "string",
            "example": "US COVID-19 Forecast Hub"
        },
    ...
}

Define data types for sample `value` minimum and maximum properties

Working and v1.0.0 I noticed that the schema for sample value minimum and maximum properties only has a description with no type or other specification.

https://github.com/Infectious-Disease-Modeling-Hubs/schemas/blob/c73b02de6dc44af4251c90a239e47505e7a40f91/v1.0.0/tasks-schema.json#L946-L966

Could I get some clarity on what the value column for sample output types will hold? Is it target dependant?

"submissions_due.relative_to" property should be optional

The relative_to property of submissions_due is listed as required here, but I think it should be optional -- a hub may want to specify an exact range of submission due dates as in the example scenario hub here.

Add support for a distribution output type

We might need to have something like

output_type = distribution
output_type_id = param1, param2, param3, ...

where for each distribution, we have a map between a parameter and an output type id, e.g.

distribution = gaussian
param1 = mean
param2 = sd

etc...

Does `required` and `optional` apply to sample output types

Similar to #9 & #10 , Does it make sense to have required and optional for sample output types? If so would be great to have an example of what a reasonable:

"example": [{
           "required": [],
           "optional": []
           }]

example would be?

admin-schema.json: split repository_url property into org name and repository name

As we onboard hubs to the cloud, splitting the admin-schema.json repository_url property into its atomic components of GitHub organization and repository name would be a helpful change. Or, alternately, combine these items and repository_host into a single repository group.

Reason: cloud-enabled hubs need the GitHub org name and repo name as separate strings for setting up AWS permissions.

Because we know that all hubs are currently hosted on GitHub, it's not worth introducing a breaking schema change right now (we can parse out org and repo name as needed). But logging this suggestion to consider the next time we do a breaking upgrade.

Unclear where horizon units are specified

In the Docs, Horizon is defined as such:

Horizons define the difference between the target_date and the origin_date in time units specified by the hub (e.g., may be days, weeks, or months)

However, I'm unclear where the units for the horizon are being specified in the metadata.

Add schema for `task_id` additional properties

Set a schema for additional task id properties. See example here: https://json-schema.org/understanding-json-schema/reference/object.html#additional-properties

Add round_id_from_variable property to tasks-schema.json

For details of discussion see: hubverse-org/example-complex-scenario-hub#1

allow for more than 20 chars in target_id

The FluSight project wanted to use "wk flu hosp rate change" which is 23 characters long. The limit appears to be 20. Can we increase this to 30, and maybe throw a warning if >20?

rename `type_id` to `output_type_id`

e.g. see this line

Elsewhere, we have decided to name this output_type_id rather than type_id, but it seems like the schemas here have not been updated yet. Is this intentional, or just an oversight?

Add CC-BY license to the schema repo

change valid output_type to "pmf" from "categorical"

Is this worth doing? It feels semantically a bit better. As the output is a category, not a categorical.

consider formalizing dependence structure for sample outputs

There is some related discussion here: https://hubdocs.readthedocs.io/en/latest/format/model-output.html#formats-of-model-output

Reproducing the relevant part:

We emphasize that the mean, median, quantile, cdf, and pmf representations all summarize the marginal predictive distribution for a single combination of model task id variables. On the other hand, the sample representation may capture dependence across combinations of multiple model task id variables by recording samples from a joint predictive distribution. For example, suppose that the model task id variables are “forecast date”, “location” and “horizon”. A predictive mean will summarize the predictive distribution for a single combination of forecast date, location and horizon. On the other hand, there are several options for the distribution from which a sample might be drawn, capturing dependence across different levels of the task id variables, including:

the joint predictive distribution across all locations and horizons within each forecast date

the joint predictive distribution across all horizons within each forecast date and location

the joint predictive distribution across all locations within each forecast date and horizon

the marginal predictive distribution for each combination of forecast date, location, and horizon

Hubs should specify the collection of task id variables for which samples are expected to capture dependence; e.g., the first option listed above might specify that samples should be drawn from distributions that are “joint across” locations and horizons.

Should we provide a way for hubs to specify any desired dependence structure for sample outputs in their metadata?

Create JSON schema for hub tasks.json

Re-opening this issue here as seems the best place to track conversations
Originally opened in reichlab/hub-infrastructure-experiments#3
Original pull request tracking developement of hubmeta schema (which is actually the tasks-schema.json and will be shortly submitted here: reichlab/hub-infrastructure-experiments#4

The purpose of the schema is two-fold:

Act as documentation of expectations of vallid hubmeta.json.
Be used to validate json hubmeta files against.

PROs

simple one step validation which can be performed using many languages/tools. This means validation is not encode in language specific code.
human readability of file makes standards inspectable by users

CONs (within the hubUtils) context

package jsonvalidate which can be used to perform validations depends on package V8 and underlying V8 javascript and webassembly engine. Since 2020, it's much easier to install V8 (work out of the box with install.packages("V8") on maOS & Windows) it does need separate installation of the V8 C++ engine on Linux. This used to be problematic on systems e.g. without sudo rights but an alternative option to automatically download a static build of libv8 during package installation is now available.
While it appears json schema can be used to validate yaml, I can't find a tool for it in R. We can always however convert yaml to json prior to validating.

Initial resources

jsonvalidate R package
JSON Schema: A Media Type for Describing JSON Documents
Understanding JSON Schema ebook.
Online JSON to schema converter. Useful for getting started.
Frictionless data: Validating tabular data with a specialised JSON schema
Stencila use of JSON schema for validating notebook elements. Useful example of generating schema for custom objects.

Correct `example` to valid `examples` keywords in all schema

nomenclature about `type`, `output_type`, and `target_type`

Currently in the hub tasks.json, we have:

model_tasks > output_type, which describes a representation or summary of a probability distribution and lines up with the type column in a model output submission file
model_tasks > target_metadata > target_type, which describes a statistical variable type

Maybe we should have a more consistent naming for the two things in point 1? e.g., might we want to change the column name in submission files to output_type rather than just type?

Validation issues with admin-schema.json

Trying to use ajv-cli to validate against admin-schema.json, there are a few problems.

All of the "examples" should be lists and not single strings
error: unknown format "uri" ignored in schema at path "#/properties/schema_version"

Add round name information

Per suggestion from @shauntruelove , the tasks.json file has a round_id field that can be a string.
In the scenario complex example repo, the tasks.json has multiple date as round_id.
The dates are used to identify each round in the filename, content of the files, etc. However, we also identify the name by round, for example, round 1 is the date "2020-01-01".

Should we either add a round_name field in the tasks.json file or should we have a specific place in the repo to store that information in a different format? or both?

Add a property for S3 bucket information?

Once we are implementing links to S3 buckets, it might be good to document details of buckets in hub configs too?

Broken link

specific documentation about the schema files

reconsider the name `output_type_id`

Dylan noted that this is a potentially confusing name for this column, as someone who is familiar with ideas about relational databases who is coming to the hubverse for the first time would interpret this as being the unique identifier for the output type (e.g. "quantile" = 1, "sample" = 2, etc). He suggests perhaps something like "output_value_metadata".

hubverse-org / schemas Goto Github PK

schemas's Introduction

Modeling Hub Schemas

Versioning

Schema documentation

New schema version development process

Highlighting changes to schema in PRs

Automated Process (via GitHub)

Manual Process

admin-schema.json

tasks-schema.json

💡 Tips

Show diff colours in PR

Send output directly to clipboard

macOS:

Linux:

schemas's People

Contributors

Stargazers

Watchers

Forkers

schemas's Issues

Examples:

Location:

Horizon:

Ideas for a better way

Idea 1: Can we simplify by specifying data type and range rather than providing lists?

Idea 2: Some way to specify concatenation of arrays within the json format?

Current proposed system

Example 1:

Example 2

Example 3

Example 4

Summary and question for discussion

@nickreich :

Initial resources

Recommend Projects

Recommend Topics

Recommend Org

`admin-schema.json`

`tasks-schema.json`