hubverse-org / schemas
JSON schemas for modeling hubs
License: Creative Commons Zero v1.0 Universal
In the docs, horizon is defined as follows:
Horizons define the difference between the target_date and the origin_date in time units specified by the hub (e.g., may be days, weeks, or months)
However, I'm unclear where the units for the horizon are being specified in the metadata.
I think we would like the schemas that are defined in this repository to be versioned -- that way, a configuration file like admin.json in a hub repository can reference the version of the admin-schema.json file that it is based on, and if we ever make changes to the schema definition, downstream tools can still process the hub's admin config settings correctly. I'm not sure if there is a standard way to do this, but maybe we can use naming conventions like admin-schema-v0.0.1?
Then, we might expect a file like admin.json in a hub to look like this:
{
    "$schema": "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/admin-schema-v0.0.1.json",
    "name": "Simple Forecast Hub",
    "maintainer": "Consortium of Infectious Disease Modeling Hubs",
    "contact": {
        "name": "Joe Bloggs",
        "email": "[email protected]"
    },
    "repository_url": "https://github.com/Infectious-Disease-Modeling-Hubs/example-simple-forecast-hub",
    "hub_models": [{
        "team_abbr": "simple_hub",
        "model_abbr": "baseline",
        "model_type": "baseline"
    }]
}
And in turn, admin-schema-v0.0.1.json would need to specify that such a $schema property is expected:
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "title": "Hub administrative settings",
    "description": "This JSON file provides a schema for modeling hub administrative information.",
    "type": "object",
    "properties": {
        "$schema": {
            "description": "The version of the schema file used to define the administrative settings config file",
            "type": "string",
            "example": "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/admin-schema-v0.0.1.json"
        },
        "name": {
            "description": "The name of the hub.",
            "type": "string",
            "example": "US COVID-19 Forecast Hub"
        },
        ...
    }
}
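Downstream tooling could then read the $schema property to decide which schema version to validate against. A minimal sketch of that dispatch in Python (the regex and the version-in-filename convention follow the proposal above; none of this is an existing hubverse API):

```python
import json
import re

# Hypothetical sketch: parse admin.json and extract the schema version
# from its $schema URL, so tooling can pick the matching validator.
admin = json.loads("""{
    "$schema": "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/admin-schema-v0.0.1.json",
    "name": "Simple Forecast Hub"
}""")

match = re.search(r"admin-schema-v(\d+\.\d+\.\d+)\.json$", admin["$schema"])
version = match.group(1) if match else None  # "0.0.1" for the example above
```
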
Trying to use ajv-cli to validate against admin-schema.json surfaces a few problems.
Change the output_types property to output_type to match column names in data files.
Once we implement links to S3 buckets, it might be good to document details of the buckets in hub configs too?
There are some locations in the tasks.json schema where enum or const keywords are used to effectively dictate what property values should be, but type is not specified for the property. See examples:
Although the values provided in enum or const somewhat dictate the data type of the field, the omission of the type property feels somewhat inconsistent. In other places type is included:
Should I make sure all properties require that type is provided?
Thoughts @nickreich @LucieContamin ?
Per a suggestion from @shauntruelove, the tasks.json file has a round_id field that can be a string.
In the complex scenario example repo, the tasks.json has multiple dates as round_id. The dates are used to identify each round in the filename, the content of the files, etc. However, we also identify each round by name; for example, round 1 is the date "2020-01-01".
Should we add a round_name field in the tasks.json file, have a specific place in the repo to store that information in a different format, or both?
There is some related discussion here: https://hubdocs.readthedocs.io/en/latest/format/model-output.html#formats-of-model-output
Reproducing the relevant part:
We emphasize that the mean, median, quantile, cdf, and pmf representations all summarize the marginal predictive distribution for a single combination of model task id variables. On the other hand, the sample representation may capture dependence across combinations of multiple model task id variables by recording samples from a joint predictive distribution. For example, suppose that the model task id variables are “forecast date”, “location” and “horizon”. A predictive mean will summarize the predictive distribution for a single combination of forecast date, location and horizon. On the other hand, there are several options for the distribution from which a sample might be drawn, capturing dependence across different levels of the task id variables, including:
the joint predictive distribution across all locations and horizons within each forecast date
the joint predictive distribution across all horizons within each forecast date and location
the joint predictive distribution across all locations within each forecast date and horizon
the marginal predictive distribution for each combination of forecast date, location, and horizon
Hubs should specify the collection of task id variables for which samples are expected to capture dependence; e.g., the first option listed above might specify that samples should be drawn from distributions that are “joint across” locations and horizons.
Should we provide a way for hubs to specify any desired dependence structure for sample outputs in their metadata?
Per suggestions from Koji, should we add a "timezone" information field in the admin.json schema? As different hubs are in different timezones, it might be useful to store that information, especially if, for example, the submission start/end dates of the hub need to be precise.
The idea here is to allow compatibility with repos whose model output folder is not called model_output. The desired behavior is that if this field is not specified, the validation code assumes there is a directory called "model_output"; if it is specified, then the model_output folder is redirected to the folder named in this field.
I think it should be "model-output", not "model_output".
We might need to have something like
output_type = distribution
output_type_id = param1, param2, param3, ...
where for each distribution, we have a map between a parameter and an output type id, e.g.
distribution = gaussian
param1 = mean
param2 = sd
etc...
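A hypothetical lookup table for such a parameter map might look like the following (the distribution names, parameter names, and structure are all illustrative, not part of the current schema):

```python
# Illustrative map from output_type_id values to distribution parameters,
# following the sketch above. Nothing here is part of the current schema.
PARAM_MAPS = {
    "gaussian": {"param1": "mean", "param2": "sd"},
    "gamma": {"param1": "shape", "param2": "rate"},
}

def param_name(distribution: str, output_type_id: str) -> str:
    """Resolve an output_type_id to the named parameter of a distribution."""
    return PARAM_MAPS[distribution][output_type_id]

# e.g. param_name("gaussian", "param2") resolves to "sd"
```
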
Re-opening this issue here as it seems the best place to track conversations.
Originally opened in reichlab/hub-infrastructure-experiments#3
Original pull request tracking development of the hubmeta schema (which is actually the tasks-schema.json and will shortly be submitted here): reichlab/hub-infrastructure-experiments#4
The purpose of the schema is two-fold:
PROs
CONs (within the hubUtils context)
The jsonvalidate R package, which can be used to perform validations, depends on the V8 package and the underlying V8 JavaScript and WebAssembly engine. Since 2020 it has been much easier to install V8 (it works out of the box with install.packages("V8") on macOS and Windows), but it does need a separate installation of the V8 C++ engine on Linux. This used to be problematic on systems without sudo rights, for example, but an option to automatically download a static build of libv8 during package installation is now available.
For a clearer distinction from "binary" and "ordinal" target types.
Currently, the required and optional values of output type ids can in effect also specify whether the corresponding output types as a whole are required: namely, a particular output type is required if it has at least one required type_id, and is optional otherwise. This may be confusing. Is there another way? See also the related discussion under issue #9.
To explain the situation, we consider a series of examples of hubs with varying modeling task specifications.
"model_tasks": [
    {
        "task_ids": {
            "location": {
                "required": ["a", "b"],
                "optional": ["c", "d"]
            },
            "horizon": {
                "required": [1, 2],
                "optional": [3, 4]
            }
        },
        "output_types": {
            "median": {
                "type_id": {
                    "required": ["NA"],
                    "optional": null
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            },
            "quantile": {
                "type_id": {
                    "required": [0.25, 0.5, 0.75],
                    "optional": [0.1, 0.9]
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            }
        }
    }
]
For a hub with this specification, a valid submission must include at least the following rows, obtained via a kind of expand_grid action across the different combinations of required values for the task id variables and required type_ids within each output type. Note that in this process, you could imagine first concatenating the output_types with the options for type_id values within each output_type, so that they are treated as a "unit" when the expand_grid happens, then splitting them back into two columns. This is necessary to track the nesting of type_id values within the specific output types.
location horizon output_type type_id value
a 1 median NA ...
b 1 median NA ...
a 2 median NA ...
b 2 median NA ...
a 1 quantile 0.25 ...
b 1 quantile 0.25 ...
a 2 quantile 0.25 ...
b 2 quantile 0.25 ...
a 1 quantile 0.5 ...
b 1 quantile 0.5 ...
a 2 quantile 0.5 ...
b 2 quantile 0.5 ...
a 1 quantile 0.75 ...
b 1 quantile 0.75 ...
a 2 quantile 0.75 ...
b 2 quantile 0.75 ...
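The expand_grid logic described above can be sketched in Python. This is an illustration of the procedure, not hubUtils code; the config shapes mirror the example 1 excerpt:

```python
from itertools import product

def required_rows(task_ids, output_types):
    """Expand required task id values and required (output_type, type_id)
    pairs into the minimal set of submission rows."""
    # Concatenate each output type with its required type_ids so the pair
    # is treated as a unit during the grid expansion, then split back out.
    type_pairs = [
        (name, tid)
        for name, spec in output_types.items()
        for tid in (spec["type_id"]["required"] or [])
    ]
    grids = [spec["required"] for spec in task_ids.values()]
    return [(*combo, out, tid) for combo in product(*grids) for out, tid in type_pairs]

task_ids = {
    "location": {"required": ["a", "b"], "optional": ["c", "d"]},
    "horizon": {"required": [1, 2], "optional": [3, 4]},
}
output_types = {
    "median": {"type_id": {"required": ["NA"], "optional": None}},
    "quantile": {"type_id": {"required": [0.25, 0.5, 0.75], "optional": [0.1, 0.9]}},
}
rows = required_rows(task_ids, output_types)
# 2 locations x 2 horizons x (1 median + 3 quantile type_ids) = 16 rows
```

Note that because quantile has no required type_ids in example 3, this same function would emit only the four median rows there, which is exactly the "implicit optionality" discussed below.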
Example 2 is the same as example 1, but it has only one required quantile level:
"model_tasks": [
    {
        "task_ids": {
            "location": {
                "required": ["a", "b"],
                "optional": ["c", "d"]
            },
            "horizon": {
                "required": [1, 2],
                "optional": [3, 4]
            }
        },
        "output_types": {
            "median": {
                "type_id": {
                    "required": ["NA"],
                    "optional": null
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            },
            "quantile": {
                "type_id": {
                    "required": [0.5],
                    "optional": [0.1, 0.25, 0.75, 0.9]
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            }
        }
    }
]
For a hub with this specification, a valid submission must include at least the following rows:
location horizon output_type type_id value
a 1 median NA ...
b 1 median NA ...
a 2 median NA ...
b 2 median NA ...
a 1 quantile 0.5 ...
b 1 quantile 0.5 ...
a 2 quantile 0.5 ...
b 2 quantile 0.5 ...
Example 3 is similar to examples 1 and 2, but now all of the quantile levels are specified as optional.
"model_tasks": [
    {
        "task_ids": {
            "location": {
                "required": ["a", "b"],
                "optional": ["c", "d"]
            },
            "horizon": {
                "required": [1, 2],
                "optional": [3, 4]
            }
        },
        "output_types": {
            "median": {
                "type_id": {
                    "required": ["NA"],
                    "optional": null
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            },
            "quantile": {
                "type_id": {
                    "required": null,
                    "optional": [0.1, 0.25, 0.5, 0.75, 0.9]
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            }
        }
    }
]
For a hub with this specification, a valid submission must include at least the following rows:
location horizon output_type type_id value
a 1 median NA ...
b 1 median NA ...
a 2 median NA ...
b 2 median NA ...
Our final example is similar to example 1, but swaps the ["NA"] and null values in the required and optional fields for the median output type:
"model_tasks": [
    {
        "task_ids": {
            "location": {
                "required": ["a", "b"],
                "optional": ["c", "d"]
            },
            "horizon": {
                "required": [1, 2],
                "optional": [3, 4]
            }
        },
        "output_types": {
            "median": {
                "type_id": {
                    "required": null,
                    "optional": ["NA"]
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            },
            "quantile": {
                "type_id": {
                    "required": [0.25, 0.5, 0.75],
                    "optional": [0.1, 0.9]
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            }
        }
    }
]
For a hub with this specification, a valid submission must include at least the following rows:
location horizon output_type type_id value
a 1 quantile 0.25 ...
b 1 quantile 0.25 ...
a 2 quantile 0.25 ...
b 2 quantile 0.25 ...
a 1 quantile 0.5 ...
b 1 quantile 0.5 ...
a 2 quantile 0.5 ...
b 2 quantile 0.5 ...
a 1 quantile 0.75 ...
b 1 quantile 0.75 ...
a 2 quantile 0.75 ...
b 2 quantile 0.75 ...
Summary: Under the current system, the required rows that a submission must minimally contain are obtained by applying an expand_grid type of action to the task id variables and combinations of output types and type ids. This means that if there are no required values under the type_ids for a particular output type, a minimal submission does not need to include any rows with that output type. Effectively, that output type is optional. Saying this again in different words: in this setup, a particular output type is required only if there is at least one value specified as required in the type_ids under that output type. This is illustrated in examples 3 and 4 above.
Every time this has come up, this use of required/optional values of a type_id to implicitly set the status of an output type has been non-intuitive. How can we resolve this? Three ideas:
Extend the output column specification so that it has the required and optional properties similar to the other columns. We would then perhaps check that the names of any additional properties currently under "output_types" match the values that were specified as required or optional for the output column. We would need to think through and document how this interacts with the "implicit requirement" for output types that comes out of the current procedure, as illustrated above.
Explicitly label each combination of output_type and type_id (and any restrictions on value) as being required or optional.
I noticed that in the schema, specifying the mean output type's minimum and maximum values in a config file is restricted to "integer":
Same in median:
Although these are optional properties, is this unnecessarily restrictive? I can imagine someone wanting to use a decimal number to specify ranges. Should I change the schema to ["integer", "number"]?
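The relaxed property might look like this (a sketch against the current draft-07 style; the description text is an assumption):

```json
"minimum": {
    "description": "The minimum inclusive valid value",
    "type": ["integer", "number"]
}
```

Note that in JSON Schema, "number" already accepts integer values, so "type": "number" alone would have the same effect as the union.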
As we onboard hubs to the cloud, splitting the admin-schema.json repository_url property into its atomic components of GitHub organization and repository name would be a helpful change. Or, alternately, combine these items and repository_host into a single repository group.
Reason: cloud-enabled hubs need the GitHub org name and repo name as separate strings for setting up AWS permissions.
Because we know that all hubs are currently hosted on GitHub, it's not worth introducing a breaking schema change right now (we can parse out the org and repo name as needed), but logging this suggestion to consider the next time we do a breaking upgrade.
Add schema for model metadata as per https://hubdocs.readthedocs.io/en/latest/format/model-metadata.html
In our schema we currently have an origin_date property:
and a target_date property:
Have we settled on whether forecast_date (or other) / target_end_date represent standard concepts and should therefore have schema properties for validating them when included in tasks.json files?
The FluSight project wanted to use "wk flu hosp rate change" which is 23 characters long. The limit appears to be 20. Can we increase this to 30, and maybe throw a warning if >20?
In a categorical target definition, could we use what is currently stored in model_tasks > output_type > categorical > type_id to encode a mapping between integers and categories, so that we could have sample-based representations of categories? E.g. a table like
output_type | type id | value
----------- | ------- | ------
"sample" | 1 | 4
"sample" | 2 | 3
"sample" | 3 | 7
"sample" | 4 | 4
where type_id corresponds to the index of the sample and value is the integer code of the corresponding category. Note that categorical targets are odd because the categories can show up as the type_id for the pmf output_type but also as the value for the sample output_type.
Possibly related to the question about whether output_type should be a property of a target specifically or not.
Topics for discussion & ideally resolution at 2022-11-23 Meeting
Per discussion related to #48, in a first pass we will only support integer sample indices for output_type_id, but we would eventually like to support strings as well. To do this, we will need to add two properties to the output_type_id field in a hub's tasks.json config file:
"type": either "string" or "integer"
"max_length": if "type" is "string", the maximum number of characters in a sample index
"max_length" should be required if "type" is "string", but not otherwise.
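In draft-07 JSON Schema, that conditional requirement can be expressed with the if/then keywords. A sketch, with property names as proposed above (the surrounding structure is an assumption about how this would sit in the schema):

```json
"output_type_id": {
    "type": "object",
    "properties": {
        "type": { "enum": ["integer", "string"] },
        "max_length": { "type": "integer", "minimum": 1 }
    },
    "required": ["type"],
    "if": {
        "properties": { "type": { "const": "string" } }
    },
    "then": { "required": ["max_length"] }
}
```

When "type" is "integer", the if subschema fails and "max_length" stays optional; when "type" is "string", the then branch makes it required.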
Opening this issue to move discussions on this topic to the repo.
From slack:
How would people feel about adding an output_type of "mode" to the other existing types? This came up today in a conversation with @annakrystalli as it seems like a possibly natural form of a point estimate for a categorical target. E.g. a "mean" or "median" wouldn’t make sense. I will note that the mode could be extracted from the representation of a probability mass function for a categorical outcome, but that would require a probabilistic forecast. If you like the idea, please just add a ✅ . If you have questions or comments or objections, please add a note here. Thanks!
One comment on this after discussing briefly with Evan is that the tabular data representation would maybe be kind of ugly, e.g. since we can only have numeric objects in the “value” column, maybe it would look something like this?
output_type | type_id | value |
---|---|---|
"mode" | ["cat1", "cat2", "cat3"] | [0,1,1] |
where the type-id is an array of the possible values of the categorical variable and the array in value would be indicating which value(s) are the mode? Or maybe this would need to be spread over two rows, to keep value purely numeric?
Is this worth doing? It feels semantically a bit better, as the output is a category, not a categorical.
Should we add a license field for the Hub repository, maybe in the admin.json file?
This reflects the fact that we include target as a potential task_id, and target_variable and target_outcome are an option that has been discussed.
One question I had: while with a single target task id, optional and required arrays of multiple unique targets are explicit, what happens when you have multiple targets spread across target_variable and target_outcome? How do you ensure they specify unique and correct combinations? Does the order in which they are specified in the arrays become important?
E.g. if the two valid targets were inc hosp & inc case, how would they be specified?
"target_variable": {
    "required": null,
    "optional": ["hosp", "case"]
},
"target_outcome": {
    "required": null,
    "optional": ["inc", "inc"]
},
@elray1 your thoughts would be appreciated!
Can we give some example calculations, so it's really specific about what date would be accepted or rejected in a particular situation with particular relative start/end dates?
For details of discussion see: hubverse-org/example-complex-scenario-hub#1
Currently in the hub tasks.json, we have:
model_tasks > output_type, which describes a representation or summary of a probability distribution and lines up with the type column in a model output submission file
model_tasks > target_metadata > target_type, which describes a statistical variable type
Maybe we should have more consistent naming for the two things in point 1? E.g., might we want to change the column name in submission files to output_type rather than just type?
Given that probabilities across all possible categories sum to 1, it doesn't make much sense to me to have some categories optional and some required.
If anything, it feels like they should either all be required or all optional (with any missing categories assumed to have probability 0), so long as what is provided sums to one.
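The "must sum to one" rule is easy to check at validation time. A sketch, assuming pmf rows arrive as (category, probability) pairs (that row shape is an assumption for illustration):

```python
import math

def pmf_sums_to_one(rows, tol=1e-8):
    """Check that the probabilities in a set of pmf rows sum to 1.
    Categories missing from `rows` are treated as probability 0, per the
    proposal above. `rows` is assumed to be (category, probability) pairs."""
    return math.isclose(sum(p for _, p in rows), 1.0, abs_tol=tol)

# A submission covering only some categories is still valid as long as the
# provided probabilities sum to one:
ok = pmf_sums_to_one([("low", 0.2), ("high", 0.8)])    # True
bad = pmf_sums_to_one([("low", 0.2), ("high", 0.7)])   # False
```
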
Is there a way we could simplify a situation where there are multiple overlapping sets of values for task ids?
us: ["US"]
us_states: ["US", "01", … ]
us_states_counties: above plus all the counties
horizon_1: [1]
horizon_2: [1, 2]
horizon_3: [1, 2, 3]
…
horizon_52: [1, 2, 3, …, 52]
Ex. 1: horizon
Rather than horizon: [1, 2, 3, …, 52]
Something like horizon: {type: integer, min: 1, max: 52}
Ex. 2: origin_date
Rather than origin_date: ["2020-04-07", "2020-04-14", …, "2022-12-05"]
Something like origin_date: {type: date, min: "2020-04-07"}, perhaps with some way to specify weekday?
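A sketch of how tooling might expand such range specifications back into explicit value lists (the key names follow the examples above and are not part of the current schema; the weekday filter implements the "specify weekday" idea):

```python
from datetime import date, timedelta

def expand_horizon(spec):
    """Expand e.g. {"type": "integer", "min": 1, "max": 52} into [1, ..., 52]."""
    return list(range(spec["min"], spec["max"] + 1))

def expand_origin_dates(spec, weekday=None):
    """Expand a date-range spec into explicit dates, optionally keeping
    only one weekday (0 = Monday)."""
    start, end = date.fromisoformat(spec["min"]), date.fromisoformat(spec["max"])
    days = [start + timedelta(days=d) for d in range((end - start).days + 1)]
    return [d for d in days if weekday is None or d.weekday() == weekday]

horizons = expand_horizon({"type": "integer", "min": 1, "max": 52})
# Tuesdays (weekday 1) between 2020-04-07 and 2020-04-21: three dates
tuesdays = expand_origin_dates(
    {"type": "date", "min": "2020-04-07", "max": "2020-04-21"}, weekday=1
)
```
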
Details here. New schema will incorporate "output_type_id_params" with different parameters than other output_types.
URLs which start with https are not recognised by ajv, causing issues when using other validators like ajv-cli. So all URLs in $schema properties specifying the JSON schema version should start with http.
See issues brought up in #42
Should always be this:
"value": {
    "type": "numeric",
    "minimum": 0,
    "maximum": 1
}
Maybe this could be documented clearly in the documentation?
Working on v1.0.0, I noticed that the schema for the sample value minimum and maximum properties only has a description, with no type or other specification.
Could I get some clarity on what the value column for sample output types will hold? Is it target dependent?
Dylan noted that this is a potentially confusing name for this column, as someone who is familiar with ideas about relational databases who is coming to the hubverse for the first time would interpret this as being the unique identifier for the output type (e.g. "quantile" = 1, "sample" = 2, etc). He suggests perhaps something like "output_value_metadata".
Our current schema suggests that some task id values might accept multiple data types, e.g. scenario_id lists both "integer" and "string" here.
I had some questions about this:
If scenario_id has an integer data type, will that column of model outputs be converted to an integer when I read in some model outputs? Would we want to do this?
If scenario_id has an integer data type, do we expect validations to throw an error if a model submission has encoded the value as "1" instead of 1, either via a data type specification in a parquet file or (possibly?) via quoting values in a csv file? Would we want to do this? (I think quoting values in csv files is pretty iffy; that's not really a data type specification so much as a csv formatting thing...)
I am basically wondering whether the data type that shows up in a hub's tasks.json file for values of task id variables matters, or should matter, for downstream processing.
Use definition below
Definitions of roles (Author, maintainer, contributor, etc…)
Going forward, ensure that PRs from new contributors include an update to authorship roles (for R packages, in the DESCRIPTION file)
Version v0.0.9 had a bug in the admin.json file that effectively made the file unusable :(
We need a basic GitHub Action that, at the very least, checks that the schemas are valid JSON schema files.
At the moment, for the mean and median type_id, we are asking for both required and optional value specifications. For type_id, which will eventually be NA in R, we are also allowing either NA or NULL to be supplied.
Given that the type_id of mean and median must always be NA, and required/optional have no meaning in the context of either type_id or value, should we simplify the schema structure to:
"mean": {
    "type": "object",
    "description": "the mean of the predictive distribution",
    "properties": {
        "type_id": {
            "description": "Not used for mean output type. Must be NA or null.",
            "type": "array",
            "items": {
                "enum": ["NA"]
            },
            "maxItems": 1
        },
        "value": {
            "type": "object",
            "properties": {
                "type": {
                    "type": "string",
                    "enum": ["numeric", "double", "integer"]
                },
                "minimum": {
                    "type": "integer"
                },
                "maximum": {
                    "type": "integer"
                }
            },
            "required": ["type"]
        }
    },
    "required": ["type_id", "value"]
}
Alternatively, we could make type_id optional altogether for mean and median and ignore it by default in R. That would also get around the awkwardness of having to specify "NA" within a one-element array so that R will automatically convert it to NA when reading in.
A final option is to require type_id for mean & median to always be null (again getting rid of the awkward array) in tasks.json and replace it with NA in R once it's read in?
Set a schema for additional task id properties. See example here: https://json-schema.org/understanding-json-schema/reference/object.html#additional-properties