hubverse-org / schemas
JSON schemas for modeling hubs
License: Creative Commons Zero v1.0 Universal
In the docs, horizon is defined as follows:
Horizons define the difference between the target_date and the origin_date in time units specified by the hub (e.g., may be days, weeks, or months)
However, I'm unclear where the units for the horizon are being specified in the metadata.
I think we would like the schemas that are defined in this repository to be versioned -- that way, a configuration file like admin.json in a hub repository can reference the version of the admin-schema.json file that it is based on, and if we ever make changes to the schema definition, downstream tools can still process the hub's admin config settings correctly. I'm not sure if there is a standard way to do this, but maybe we can use naming conventions like admin-schema-v0.0.1?
Then, we might expect a file like admin.json in a hub to look like this:
{
    "$schema": "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/admin-schema-v0.0.1.json",
    "name": "Simple Forecast Hub",
    "maintainer": "Consortium of Infectious Disease Modeling Hubs",
    "contact": {
        "name": "Joe Bloggs",
        "email": "[email protected]"
    },
    "repository_url": "https://github.com/Infectious-Disease-Modeling-Hubs/example-simple-forecast-hub",
    "hub_models": [{
        "team_abbr": "simple_hub",
        "model_abbr": "baseline",
        "model_type": "baseline"
    }]
}
And in turn, admin-schema-v0.0.1.json would need to specify that such a $schema property is expected:
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "title": "Hub administrative settings",
    "description": "This JSON file provides a schema for modeling hub administrative information.",
    "type": "object",
    "properties": {
        "$schema": {
            "description": "The version of the schema file used to define the administrative settings config file",
            "type": "string",
            "example": "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/admin-schema-v0.0.1.json"
        },
        "name": {
            "description": "The name of the hub.",
            "type": "string",
            "example": "US COVID-19 Forecast Hub"
        },
        ...
    }
}
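Downstream tooling could then read the $schema property to decide which schema version to validate against. A minimal sketch of that dispatch in Python (the regex and the version-in-filename convention follow the proposal above; none of this is an existing hubverse API):

```python
import json
import re

# Hypothetical sketch: parse admin.json and extract the schema version
# from its $schema URL, so tooling can pick the matching validator.
admin = json.loads("""{
    "$schema": "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/main/admin-schema-v0.0.1.json",
    "name": "Simple Forecast Hub"
}""")

match = re.search(r"admin-schema-v(\d+\.\d+\.\d+)\.json$", admin["$schema"])
version = match.group(1) if match else None  # "0.0.1" for the example above
```
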
Trying to use ajv-cli to validate against admin-schema.json surfaces a few problems.
Change the output_types property to output_type to match column names in data files.
Once we implement links to S3 buckets, it might be good to document details of the buckets in hub configs too?
There are some locations in the tasks.json schema where enum or const keywords are used to effectively dictate what property values should be, but type is not specified for the property. See examples:
Although the values provided in enum or const somewhat dictate the data type of the field, the omission of the type property feels somewhat inconsistent. In other places type is included:
Should I make sure all properties require that type is provided?
Thoughts @nickreich @LucieContamin ?
Per a suggestion from @shauntruelove, the tasks.json file has a round_id field that can be a string.
In the complex scenario example repo, the tasks.json has multiple dates as round_id. The dates are used to identify each round in the filename, the content of the files, etc. However, we also identify each round by name; for example, round 1 is the date "2020-01-01".
Should we add a round_name field in the tasks.json file, have a specific place in the repo to store that information in a different format, or both?
There is some related discussion here: https://hubdocs.readthedocs.io/en/latest/format/model-output.html#formats-of-model-output
Reproducing the relevant part:
We emphasize that the mean, median, quantile, cdf, and pmf representations all summarize the marginal predictive distribution for a single combination of model task id variables. On the other hand, the sample representation may capture dependence across combinations of multiple model task id variables by recording samples from a joint predictive distribution. For example, suppose that the model task id variables are “forecast date”, “location” and “horizon”. A predictive mean will summarize the predictive distribution for a single combination of forecast date, location and horizon. On the other hand, there are several options for the distribution from which a sample might be drawn, capturing dependence across different levels of the task id variables, including:
the joint predictive distribution across all locations and horizons within each forecast date
the joint predictive distribution across all horizons within each forecast date and location
the joint predictive distribution across all locations within each forecast date and horizon
the marginal predictive distribution for each combination of forecast date, location, and horizon
Hubs should specify the collection of task id variables for which samples are expected to capture dependence; e.g., the first option listed above might specify that samples should be drawn from distributions that are “joint across” locations and horizons.
Should we provide a way for hubs to specify any desired dependence structure for sample outputs in their metadata?
Per suggestions from Koji, should we add a "timezone" information field in the admin.json schema? As different hubs are in different timezones, it might be useful to store that information, especially if, for example, the submission start/end dates of the hub need to be precise.
The idea here is to allow compatibility with repos whose model output folder is not called model_output. The desired behavior is that if this field is not specified, the validation code assumes there is a directory called "model_output"; if it is specified, then the model_output folder is redirected to the folder named in this field.
I think it should be "model-output", not "model_output".
We might need to have something like
output_type = distribution
output_type_id = param1, param2, param3, ...
where for each distribution, we have a map between a parameter and an output type id, e.g.
distribution = gaussian
param1 = mean
param2 = sd
etc...
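A hypothetical lookup table for such a parameter map might look like the following (the distribution names, parameter names, and structure are all illustrative, not part of the current schema):

```python
# Illustrative map from output_type_id values to distribution parameters,
# following the sketch above. Nothing here is part of the current schema.
PARAM_MAPS = {
    "gaussian": {"param1": "mean", "param2": "sd"},
    "gamma": {"param1": "shape", "param2": "rate"},
}

def param_name(distribution: str, output_type_id: str) -> str:
    """Resolve an output_type_id to the named parameter of a distribution."""
    return PARAM_MAPS[distribution][output_type_id]

# e.g. param_name("gaussian", "param2") resolves to "sd"
```
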
Re-opening this issue here as it seems the best place to track conversations.
Originally opened in reichlab/hub-infrastructure-experiments#3
Original pull request tracking development of the hubmeta schema (which is actually the tasks-schema.json and will shortly be submitted here): reichlab/hub-infrastructure-experiments#4
The purpose of the schema is two-fold:
PROs
CONs (within the hubUtils context)
The jsonvalidate R package, which can be used to perform validations, depends on the V8 package and the underlying V8 JavaScript and WebAssembly engine. Since 2020 it has been much easier to install V8 (it works out of the box with install.packages("V8") on macOS and Windows), but it does need a separate installation of the V8 C++ engine on Linux. This used to be problematic on systems without sudo rights, for example, but an option to automatically download a static build of libv8 during package installation is now available.
For a clearer distinction from "binary" and "ordinal" target types.
Currently, the required and optional values of output type ids can in effect also specify whether the corresponding output types as a whole are required: namely, a particular output type is required if it has at least one required type_id, and is optional otherwise. This may be confusing. Is there another way? See also the related discussion under issue #9.
To explain the situation, we consider a series of examples of hubs with varying modeling task specifications.
"model_tasks": [
    {
        "task_ids": {
            "location": {
                "required": ["a", "b"],
                "optional": ["c", "d"]
            },
            "horizon": {
                "required": [1, 2],
                "optional": [3, 4]
            }
        },
        "output_types": {
            "median": {
                "type_id": {
                    "required": ["NA"],
                    "optional": null
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            },
            "quantile": {
                "type_id": {
                    "required": [0.25, 0.5, 0.75],
                    "optional": [0.1, 0.9]
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            }
        }
    }
]
For a hub with this specification, a valid submission must include at least the following rows, obtained via a kind of expand_grid action across the different combinations of required values for the task id variables and required type_ids within each output type. Note that in this process, you could imagine first concatenating the output_types with the options for type_id values within each output_type, so that they are treated as a "unit" when the expand_grid happens, then splitting them back into two columns. This is necessary to track the nesting of type_id values within the specific output types.
location horizon output_type type_id value
a 1 median NA ...
b 1 median NA ...
a 2 median NA ...
b 2 median NA ...
a 1 quantile 0.25 ...
b 1 quantile 0.25 ...
a 2 quantile 0.25 ...
b 2 quantile 0.25 ...
a 1 quantile 0.5 ...
b 1 quantile 0.5 ...
a 2 quantile 0.5 ...
b 2 quantile 0.5 ...
a 1 quantile 0.75 ...
b 1 quantile 0.75 ...
a 2 quantile 0.75 ...
b 2 quantile 0.75 ...
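The expand_grid logic described above can be sketched in Python. This is an illustration of the procedure, not hubUtils code; the config shapes mirror the example 1 excerpt:

```python
from itertools import product

def required_rows(task_ids, output_types):
    """Expand required task id values and required (output_type, type_id)
    pairs into the minimal set of submission rows."""
    # Concatenate each output type with its required type_ids so the pair
    # is treated as a unit during the grid expansion, then split back out.
    type_pairs = [
        (name, tid)
        for name, spec in output_types.items()
        for tid in (spec["type_id"]["required"] or [])
    ]
    grids = [spec["required"] for spec in task_ids.values()]
    return [(*combo, out, tid) for combo in product(*grids) for out, tid in type_pairs]

task_ids = {
    "location": {"required": ["a", "b"], "optional": ["c", "d"]},
    "horizon": {"required": [1, 2], "optional": [3, 4]},
}
output_types = {
    "median": {"type_id": {"required": ["NA"], "optional": None}},
    "quantile": {"type_id": {"required": [0.25, 0.5, 0.75], "optional": [0.1, 0.9]}},
}
rows = required_rows(task_ids, output_types)
# 2 locations x 2 horizons x (1 median + 3 quantile type_ids) = 16 rows
```

Note that because quantile has no required type_ids in example 3, this same function would emit only the four median rows there, which is exactly the "implicit optionality" discussed below.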
Example 2 is the same as example 1, but it has only one required quantile level:
"model_tasks": [
    {
        "task_ids": {
            "location": {
                "required": ["a", "b"],
                "optional": ["c", "d"]
            },
            "horizon": {
                "required": [1, 2],
                "optional": [3, 4]
            }
        },
        "output_types": {
            "median": {
                "type_id": {
                    "required": ["NA"],
                    "optional": null
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            },
            "quantile": {
                "type_id": {
                    "required": [0.5],
                    "optional": [0.1, 0.25, 0.75, 0.9]
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            }
        }
    }
]
For a hub with this specification, a valid submission must include at least the following rows:
location horizon output_type type_id value
a 1 median NA ...
b 1 median NA ...
a 2 median NA ...
b 2 median NA ...
a 1 quantile 0.5 ...
b 1 quantile 0.5 ...
a 2 quantile 0.5 ...
b 2 quantile 0.5 ...
Example 3 is similar to examples 1 and 2, but now all of the quantile levels are specified as optional.
"model_tasks": [
    {
        "task_ids": {
            "location": {
                "required": ["a", "b"],
                "optional": ["c", "d"]
            },
            "horizon": {
                "required": [1, 2],
                "optional": [3, 4]
            }
        },
        "output_types": {
            "median": {
                "type_id": {
                    "required": ["NA"],
                    "optional": null
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            },
            "quantile": {
                "type_id": {
                    "required": null,
                    "optional": [0.1, 0.25, 0.5, 0.75, 0.9]
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            }
        }
    }
]
For a hub with this specification, a valid submission must include at least the following rows:
location horizon output_type type_id value
a 1 median NA ...
b 1 median NA ...
a 2 median NA ...
b 2 median NA ...
Our final example is similar to example 1, but swaps the ["NA"] and null values in the required and optional fields for the median output type:
"model_tasks": [
    {
        "task_ids": {
            "location": {
                "required": ["a", "b"],
                "optional": ["c", "d"]
            },
            "horizon": {
                "required": [1, 2],
                "optional": [3, 4]
            }
        },
        "output_types": {
            "median": {
                "type_id": {
                    "required": null,
                    "optional": ["NA"]
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            },
            "quantile": {
                "type_id": {
                    "required": [0.25, 0.5, 0.75],
                    "optional": [0.1, 0.9]
                },
                "value": {
                    "type": "integer",
                    "minimum": 0
                }
            }
        }
    }
]
For a hub with this specification, a valid submission must include at least the following rows:
location horizon output_type type_id value
a 1 quantile 0.25 ...
b 1 quantile 0.25 ...
a 2 quantile 0.25 ...
b 2 quantile 0.25 ...
a 1 quantile 0.5 ...
b 1 quantile 0.5 ...
a 2 quantile 0.5 ...
b 2 quantile 0.5 ...
a 1 quantile 0.75 ...
b 1 quantile 0.75 ...
a 2 quantile 0.75 ...
b 2 quantile 0.75 ...
Summary: Under the current system, the required rows that a submission must minimally contain are obtained by applying an expand_grid type of action to the task id variables and combinations of output types and type ids. This means that if there are no required values under the type_ids for a particular output type, a minimal submission does not need to include any rows with that output type. Effectively, that output type is optional. Saying this again in different words: in this setup, a particular output type is required only if there is at least one value specified as required in the type_ids under that output type. This is illustrated in examples 3 and 4 above.
Every time this has come up, this use of required/optional values of a type_id to implicitly set the status of an output type has been non-intuitive. How can we resolve this? Three ideas:
Extend the output column specification so that it has the required and optional properties similar to the other columns. We would then perhaps check that the names of any additional properties currently under "output_types" match the values that were specified as required or optional for the output column. We would need to think through and document how this interacts with the "implicit requirement" for output types that comes out of the current procedure, as illustrated above.
Explicitly label each combination of output_type and type_id (and any restrictions on value) as being required or optional.
I noticed that in the schema, specifying the mean output type's minimum and maximum values in a config file is restricted to "integer":
Same in median:
Although these are optional properties, is this unnecessarily restrictive? I can imagine someone wanting to use a decimal number to specify ranges. Should I change the schema to ["integer", "number"]?
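The relaxed property might look like this (a sketch against the current draft-07 style; the description text is an assumption):

```json
"minimum": {
    "description": "The minimum inclusive valid value",
    "type": ["integer", "number"]
}
```

Note that in JSON Schema, "number" already accepts integer values, so "type": "number" alone would have the same effect as the union.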
As we onboard hubs to the cloud, splitting the admin-schema.json repository_url property into its atomic components of GitHub organization and repository name would be a helpful change. Or, alternately, combine these items and repository_host into a single repository group.
Reason: cloud-enabled hubs need the GitHub org name and repo name as separate strings for setting up AWS permissions.
Because we know that all hubs are currently hosted on GitHub, it's not worth introducing a breaking schema change right now (we can parse out the org and repo name as needed), but logging this suggestion to consider the next time we do a breaking upgrade.
Add schema for model metadata as per https://hubdocs.readthedocs.io/en/latest/format/model-metadata.html
In our schema we currently have an origin_date property:
and a target_date property:
Have we settled on whether forecast_date (or other) / target_end_date represent standard concepts and should therefore have schema properties for validating them when included in tasks.json files?
The FluSight project wanted to use "wk flu hosp rate change" which is 23 characters long. The limit appears to be 20. Can we increase this to 30, and maybe throw a warning if >20?
In a categorical target definition, could we use what is currently stored in model_tasks > output_type > categorical > type_id to encode a mapping between integers and categories, so that we could have sample-based representations of categories? E.g. a table like
output_type | type id | value
----------- | ------- | ------
"sample" | 1 | 4
"sample" | 2 | 3
"sample" | 3 | 7
"sample" | 4 | 4
where type_id corresponds to the index of the sample and value is the integer code of the corresponding category. Note that categorical targets are odd because the categories can show up as the type_id for the pmf output_type but also as the value for the sample output_type.
Possibly related to the question about whether output_type should be a property of a target specifically or not.
Topics for discussion & ideally resolution at 2022-11-23 Meeting
Per discussion related to #48, in a first pass we will only support integer sample indices for output_type_id, but we would eventually like to support strings as well. To do this, we will need to add two properties to the output_type_id field in a hub's tasks.json config file:
"type": either "string" or "integer"
"max_length": if "type" is "string", the maximum number of characters in a sample index
"max_length" should be required if "type" is "string", but not otherwise.
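In draft-07 JSON Schema, that conditional requirement can be expressed with the if/then keywords. A sketch, with property names as proposed above (the surrounding structure is an assumption about how this would sit in the schema):

```json
"output_type_id": {
    "type": "object",
    "properties": {
        "type": { "enum": ["integer", "string"] },
        "max_length": { "type": "integer", "minimum": 1 }
    },
    "required": ["type"],
    "if": {
        "properties": { "type": { "const": "string" } }
    },
    "then": { "required": ["max_length"] }
}
```

When "type" is "integer", the if subschema fails and "max_length" stays optional; when "type" is "string", the then branch makes it required.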
Opening this issue to move discussions on this topic to the repo.
From slack:
How would people feel about adding an output_type of "mode" to the other existing types? This came up today in a conversation with @annakrystalli as it seems like a possibly natural form of a point estimate for a categorical target. E.g. a "mean" or "median" wouldn’t make sense. I will note that the mode could be extracted from the representation of a probability mass function for a categorical outcome, but that would require a probabilistic forecast. If you like the idea, please just add a ✅ . If you have questions or comments or objections, please add a note here. Thanks!
One comment on this after discussing briefly with Evan is that the tabular data representation would maybe be kind of ugly, e.g. since we can only have numeric objects in the “value” column, maybe it would look something like this?
output_type | type_id | value |
---|---|---|
"mode" | ["cat1", "cat2", "cat3"] | [0,1,1] |
where the type-id is an array of the possible values of the categorical variable and the array in value would be indicating which value(s) are the mode? Or maybe this would need to be spread over two rows, to keep value purely numeric?
Is this worth doing? It feels semantically a bit better, as the output is a category, not a categorical.
Should we add a license field for the Hub repository, maybe in the admin.json file?
This reflects the fact that we include target as a potential task_id, and target_variable and target_outcome are an option that has been discussed.
One question I had: while with a single target task id, optional and required arrays of multiple unique targets are explicit, what happens when you have multiple targets spread across target_variable and target_outcome? How do you ensure they specify unique and correct combinations? Does the order in which they are specified in the arrays become important?
E.g. if the two valid targets were inc hosp & inc case, how would they be specified?
"target_variable": {
    "required": null,
    "optional": ["hosp", "case"]
},
"target_outcome": {
    "required": null,
    "optional": ["inc", "inc"]
},
@elray1 your thoughts would be appreciated!
Can we give some example calculations, so it's really specific about what date would be accepted or rejected in a particular situation with particular relative start/end dates?
For details of discussion see: hubverse-org/example-complex-scenario-hub#1
Currently in the hub tasks.json, we have:
model_tasks > output_type, which describes a representation or summary of a probability distribution and lines up with the type column in a model output submission file
model_tasks > target_metadata > target_type, which describes a statistical variable type
Maybe we should have more consistent naming for the two things in point 1? E.g., might we want to change the column name in submission files to output_type rather than just type?
Given that probabilities across all possible categories sum to 1, it doesn't make much sense to me to have some categories optional and some required.
If anything, it feels like they should either all be required or all optional (with any missing categories assumed to have probability 0), so long as what is provided sums to one.
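The "must sum to one" rule is easy to check at validation time. A sketch, assuming pmf rows arrive as (category, probability) pairs (that row shape is an assumption for illustration):

```python
import math

def pmf_sums_to_one(rows, tol=1e-8):
    """Check that the probabilities in a set of pmf rows sum to 1.
    Categories missing from `rows` are treated as probability 0, per the
    proposal above. `rows` is assumed to be (category, probability) pairs."""
    return math.isclose(sum(p for _, p in rows), 1.0, abs_tol=tol)

# A submission covering only some categories is still valid as long as the
# provided probabilities sum to one:
ok = pmf_sums_to_one([("low", 0.2), ("high", 0.8)])    # True
bad = pmf_sums_to_one([("low", 0.2), ("high", 0.7)])   # False
```
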
Is there a way we could simplify a situation where there are multiple overlapping sets of values for task ids?
us: ["US"]
us_states: ["US", "01", … ]
us_states_counties: above plus all the counties
horizon_1: [1]
horizon_2: [1, 2]
horizon_3: [1, 2, 3]
…
horizon_52: [1, 2, 3, …, 52]
Ex. 1: horizon
Rather than horizon: [1, 2, 3, …, 52]
Something like horizon: {type: integer, min: 1, max: 52}
Ex. 2: origin_date
Rather than origin_date: ["2020-04-07", "2020-04-14", …, "2022-12-05"]
Something like origin_date: {type: date, min: "2020-04-07"}, perhaps with some way to specify weekday?
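A sketch of how tooling might expand such range specifications back into explicit value lists (the key names follow the examples above and are not part of the current schema; the weekday filter implements the "specify weekday" idea):

```python
from datetime import date, timedelta

def expand_horizon(spec):
    """Expand e.g. {"type": "integer", "min": 1, "max": 52} into [1, ..., 52]."""
    return list(range(spec["min"], spec["max"] + 1))

def expand_origin_dates(spec, weekday=None):
    """Expand a date-range spec into explicit dates, optionally keeping
    only one weekday (0 = Monday)."""
    start, end = date.fromisoformat(spec["min"]), date.fromisoformat(spec["max"])
    days = [start + timedelta(days=d) for d in range((end - start).days + 1)]
    return [d for d in days if weekday is None or d.weekday() == weekday]

horizons = expand_horizon({"type": "integer", "min": 1, "max": 52})
# Tuesdays (weekday 1) between 2020-04-07 and 2020-04-21: three dates
tuesdays = expand_origin_dates(
    {"type": "date", "min": "2020-04-07", "max": "2020-04-21"}, weekday=1
)
```
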
Details here. New schema will incorporate "output_type_id_params" with different parameters than other output_types.
URLs which start with https are not recognised by ajv, causing issues when using other validators like ajv-cli. So all URLs in $schema properties specifying the JSON schema version should start with http.
See issues brought up in #42
Should always be this:
"value": {
    "type": "numeric",
    "minimum": 0,
    "maximum": 1
}
Maybe this could be documented clearly in the documentation?
Working on v1.0.0, I noticed that the schema for the sample value minimum and maximum properties only has a description, with no type or other specification.
Could I get some clarity on what the value column for sample output types will hold? Is it target dependent?
Dylan noted that this is a potentially confusing name for this column, as someone who is familiar with ideas about relational databases who is coming to the hubverse for the first time would interpret this as being the unique identifier for the output type (e.g. "quantile" = 1, "sample" = 2, etc). He suggests perhaps something like "output_value_metadata".
Our current schema suggests that some task id values might accept multiple data types, e.g. scenario_id lists both "integer" and "string" here.
I had some questions about this:
If scenario_id has an integer data type, will that column of model outputs be converted to an integer when I read in some model outputs? Would we want to do this?
If scenario_id has an integer data type, do we expect validations to throw an error if a model submission has encoded the value as "1" instead of 1, either via a data type specification in a parquet file or (possibly?) via quoting values in a csv file? Would we want to do this? (I think quoting values in csv files is pretty iffy; that's not really a data type specification so much as a csv formatting thing...)
I am basically wondering whether the data type that shows up in a hub's tasks.json file for values of task id variables matters, or should matter, for downstream processing.
Use definition below
Definitions of roles (Author, maintainer, contributor, etc…)
Going forward, ensure that PRs from new contributors include an update to authorship roles (for R packages, in the DESCRIPTION file)
Version v0.0.9 had a bug in the admin.json file that effectively made the file unusable :(
We need a basic GitHub Action that, at the very least, checks that the schemas are valid JSON schema files.
At the moment, for the mean and median type_id, we are asking for both required and optional value specifications. For type_id, which will eventually be NA in R, we are also allowing either NA or NULL to be supplied.
Given that the type_id of mean and median must always be NA, and required/optional have no meaning in the context of either type_id or value, should we simplify the schema structure to:
"mean": {
    "type": "object",
    "description": "the mean of the predictive distribution",
    "properties": {
        "type_id": {
            "description": "Not used for mean output type. Must be NA or null.",
            "type": "array",
            "items": {
                "enum": ["NA"]
            },
            "maxItems": 1
        },
        "value": {
            "type": "object",
            "properties": {
                "type": {
                    "type": "string",
                    "enum": ["numeric", "double", "integer"]
                },
                "minimum": {
                    "type": "integer"
                },
                "maximum": {
                    "type": "integer"
                }
            },
            "required": ["type"]
        }
    },
    "required": ["type_id", "value"]
}
Alternatively, we could make type_id optional altogether for mean and median and ignore it by default in R. That would also get around the awkwardness of having to specify "NA" within a one-element array so that R will automatically convert it to NA when reading in.
A final option is to require type_id for mean & median to always be null (again getting rid of the awkward array) in tasks.json and replace it with NA in R once it's read in?
Set a schema for additional task id properties. See example here: https://json-schema.org/understanding-json-schema/reference/object.html#additional-properties