Results of experimenting with validating $ref pointers in our JSON documents
While jsonvalidate::json_validate() offers a succinct way of accomplishing basic hubmeta.json validation, it is hard to validate $ref pointers in the JSON document being validated using functionality available through R.
From researching how to encode in the schema the possibility of a $ref object in lieu of any other property type, it seems that the standard way of handling this situation is to first resolve the pointers in the document being validated and then perform the validation using the schema. This makes sense: in the schema we want to encode the type and other criteria a property should match, and these can only be validated once any pointers are resolved.
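For example (a hypothetical fragment, loosely based on the complex example), a property written as a pointer:
"location": { "$ref": "#/$defs/location" }
would first be rewritten to the referenced definition stored under $defs, e.g.:
"location": { "required": ["US"], "optional": null }
and only then validated against the schema.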
The problem is that we have not found functionality in R to resolve pointers (e.g. functionality equivalent to the Python solution @evan proposed here: https://github.com/reichlab/hub-infrastructure-experiments/blob/9d4889de34e5bc1e7df1b32a577caa6657c3384d/metadata-format-examples/complex-example/process-metadata-proposal.py#L1-L7).
The solution we currently have works on the R list after the file has been read into R, using the custom hubUtils function substitute_refs(): https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/blob/ab81ae6b8afac11a52950f4272830ccd2e84a5e3/R/hubmeta-read.R#L80-L107. However, the jsonvalidate package functions work on JSON, so we would ideally want $ref pointer resolution to be carried out on the JSON-formatted data prior to validation and reading into R.
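To make the idea concrete, here is a minimal sketch of list-based ref resolution. This is not the actual hubUtils:::substitute_refs() (see the linked source for the real implementation), and it only handles refs of the form "#/$defs/<name>":
# Minimal sketch: recursively replace list elements of the form
# list("$ref" = "#/$defs/<name>") with the matching entry in defs
resolve_refs <- function(x, defs) {
  if (is.list(x)) {
    if (!is.null(x[["$ref"]])) {
      # take the last path element, e.g. "#/$defs/location" -> "location"
      return(defs[[basename(x[["$ref"]])]])
    }
    return(lapply(x, resolve_refs, defs = defs))
  }
  x
}
# e.g. resolve_refs(config, defs = config[["$defs"]])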
Current options identified
Hacky workaround to allow for $ref options in schema.json docs
One option would be to hard-code within the schema that any property could match either the defined expectation OR be a ref object. I managed to encode this successfully for the three $refs to location values in the modified complex example in the json-schema-refs branch of my fork. You can see the diff between my original proposal and the workaround to handle refs here: https://github.com/annakrystalli/hub-infrastructure-experiments/pull/1/files
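For reference, the wrapper for each affected property ends up roughly of this shape (paraphrased from the linked diff, not copied verbatim):
"required": {
  "oneOf": [
    { "type": ["array", "null"], "items": { "type": "string" } },
    {
      "type": "object",
      "patternProperties": { "^\\$ref$": { "type": "string" } },
      "additionalProperties": false
    }
  ]
}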
Cons (many!)
- Really verbose and hacky workaround. The change shown in the diff was to accommodate the validation of a single $ref pointer in just a single property (the required property of the location task_id): https://github.com/annakrystalli/hub-infrastructure-experiments/blob/af6132871089884648bb20b3e05b350a91ad948c/json-schema/hubmeta-schema.json#L80-L102. We would effectively have to wrap every single property specification in these oneOf expressions throughout the whole nested structure of the schema (unless I am missing something, but at the moment I can't see a more succinct way to do it).
- It is also hacky how I'm matching the $ref property name. Using the bare $ref name with the ajv validation engine was interpreted as an actual pointer by jsonvalidate::json_validator(), causing it to look for a non-existent string subschema file and throw an error. This can be avoided with the hacky "patternProperties" method I've gone for, where the validation is applied to a property whose name regex-matches $ref.
- Finally (and likely most importantly!), the contents of the $ref pointers in $defs are NOT validated against the schema. The schema only validates that the ref points to a valid address within the schema.
Pros
Probably the single pro is that it would all be handled in the schema.json file, but there is not much else good about this approach.
Resolving refs in R and re-converting to JSON.
The suggested workflow is to read the JSON config file into a list in R, resolve any refs with hubUtils:::substitute_refs(), convert back to JSON, and then perform validation using jsonvalidate.
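In code, the workflow would look roughly like this (file paths hypothetical; substitute_refs() is internal to hubUtils at the linked commit):
# Read the JSON config into an R list
config <- jsonlite::read_json("hubmeta.json",
                              simplifyVector = TRUE,
                              simplifyDataFrame = FALSE)
# Resolve any $ref pointers on the R list
config <- hubUtils:::substitute_refs(config)
# Convert back to JSON (serialisation caveats discussed below)
json <- jsonlite::toJSON(config)
# Validate against the schema
jsonvalidate::json_validate(json, "hubmeta-schema.json", engine = "ajv")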
This also feels really hacky and wasteful up front. The only saving grace is that such a workflow (i.e. reading into R and converting to JSON) may be required anyway to validate YAML config files, as I've not found (so far) any YAML functionality equivalent to jsonvalidate.
However, experimentation with this is also throwing up issues to do with how R serialises back to JSON.
In particular, the problem is with serialisation of vectors of length 1. The standard behaviour of toJSON is to serialise all vectors of length one to arrays with one element. As a result, fields that were originally encoded as simple "key": "value" pairs in JSON are re-serialised as "key": ["value"] arrays and fail validation, because the schema is not expecting an array but rather a single value of a specific type.
A way to switch off this behaviour is to use auto_unbox = TRUE in toJSON, which means all vectors of length 1 are converted to "key": "value" pairs. This creates the opposite problem: properties defined as single-element arrays would now be encoded as "key": "value" pairs, and these would fail validation instead.
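The two failure modes can be illustrated with jsonlite alone (field names invented; both fields land in R as length-1 character vectors after reading, so jsonlite cannot tell them apart):
x <- list(scalar_field = "a", array_field = "b")
jsonlite::toJSON(x)
#> {"scalar_field":["a"],"array_field":["b"]}  (scalar_field now fails validation)
jsonlite::toJSON(x, auto_unbox = TRUE)
#> {"scalar_field":"a","array_field":"b"}      (array_field now fails validation)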
One way to get around this is offered by jsonvalidate, which allows serialisation to be informed by our schema using the following code:
# Read JSON into an R list
complex_mod_path <- here::here(
  "json-schema", "modified-hubmeta-examples", "complex-hubmeta-mod.json"
)
json_list <- jsonlite::read_json(
  complex_mod_path,
  simplifyVector = TRUE,
  simplifyDataFrame = FALSE
)
# Create a new schema instance
schema <- jsonvalidate::json_schema$new(
  schema = here::here("json-schema", "hubmeta-schema.json"),
  engine = "ajv"
)
# Use the schema to serialise the list to JSON
json <- schema$serialise(json_list)
# Use the schema to validate the JSON
schema$validate(json)
Nice! BUT!...
When trying to run this, it stumbles on the fact that schema$serialise(json_list) cannot handle null elements and returns:
Error in context_eval(join(src), private$context, serialize, await) :
TypeError: Cannot convert undefined or null to object
😩
Wrap a Python or other method for resolving JSON references
This would add an instant, massive additional dependency overhead. I'm not a big fan of this approach unless it can be super lean, but happy to hear others' thoughts on it!
Conclusion
Sorry for the huge length of this investigation report, but I wanted to capture everything I tried to inform next steps.
At this stage, I'm leaning most towards opening an issue or two in jsonvalidate and seeing how amenable the authors would be to:
- handling conversion of null objects, especially if the schema itself allows them (if possible, of course)
- adding functionality to resolve refs prior to validating. Their functions currently resolve ref pointers in the schema prior to validating, so I'm wondering how much work it would be to also perform this on the JSON being validated.
Beyond that, happy to hear other folks' thoughts!
Obviously, the fallback is to write code to do all the validation against schemas within hubUtils ourselves. This would remove the jsonvalidate/V8 dependency, but it also feels like a big task given that a one-liner using jsonvalidate functions would (almost!) do the job!
I have updated my notebook with some code and output from my experimentation: https://annakrystalli.me/hub-infrastructure-experiments/json-schema/jsonvalidate.html
Note it was all run in this repo and branch: https://github.com/annakrystalli/hub-infrastructure-experiments/tree/json-schema-refs
It seems like the second approach, resolving refs in R and converting back to JSON, is probably the way to go. The drawbacks you noted for the other ones (not actually validating the contents of referenced objects, introducing dependencies on other languages like Python, and then having the whole problem over again if we do want to support YAML) are pretty severe.
Your suggestion about adding a feature to jsonvalidate to handle conversion of null objects makes sense. I wonder if another option might be to:
- modify our schema so that anything that's currently allowed to be null can just be omitted instead (such omitted items would be implicitly null)
- then, drop any null objects from the processed R list before outputting it (see the sketch after this list)
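A minimal sketch of that second step, assuming the processed config is a plain nested R list (not tested against the real hub configs):
# Recursively drop NULL elements from a nested list before re-serialising
drop_nulls <- function(x) {
  if (!is.list(x)) return(x)
  x <- x[!vapply(x, is.null, logical(1))]
  lapply(x, drop_nulls)
}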
The other request for jsonvalidate, about resolving references before validating, also makes sense. It seems like it would be a good feature for that package to have, but it feels like it might be a larger request (?) and also might not help if we want to validate YAML (?).
Nice idea about just removing NULL fields! That should work nicely.
Re the validation of YAML: to use the jsonvalidate functions we would have to convert YAML files to JSON anyway, so that shouldn't add any extra steps beyond what we would already have to perform.
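For example, something like this (hypothetical file names, assuming the yaml package; the same auto_unbox caveats discussed above would apply):
# Convert a YAML config to JSON, then validate as usual
cfg <- yaml::read_yaml("hubmeta.yml")
json <- jsonlite::toJSON(cfg, auto_unbox = TRUE)
jsonvalidate::json_validate(json, "hubmeta-schema.json", engine = "ajv")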
FYI, I've opened a couple of issues in jsonvalidate regarding the problems we've been having:
I've been looking at workarounds for the outstanding issues with jsonvalidate and have encountered an additional problem. Even when using the schema to re-serialise, if a property type has multiple options that include "array" (of specific interest to us is where the type can be null or array, specified as "type": ["null", "array"]), the array option seems to suppress unboxing, so a NULL value ends up being converted to [null] instead of just null. 😭