
Comments (6)

annakrystalli commented on July 30, 2024

Results of experimenting with validating $ref pointers in our JSON documents.

While jsonvalidate::json_validate() offers a succinct way of accomplishing basic hubmeta.json validation, it is hard to validate $ref pointers in the JSON document using functionality currently available in R.

From researching how to encode in the schema the possibility of a $ref object in lieu of any other property type, it seems that the standard way of handling this situation is to first resolve the pointers in the document being validated and then perform the validation using the schema. This makes sense: the schema encodes the type and other criteria a property should match, which can only be validated once any pointers are resolved.

The problem is that we have not found functionality in R to resolve pointers (e.g. functionality equivalent to the Python solution @evan proposed here: https://github.com/reichlab/hub-infrastructure-experiments/blob/9d4889de34e5bc1e7df1b32a577caa6657c3384d/metadata-format-examples/complex-example/process-metadata-proposal.py#L1-L7).

The solution we currently have works on the R list after the file has been read into R, using the custom hubUtils function substitute_refs: https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/blob/ab81ae6b8afac11a52950f4272830ccd2e84a5e3/R/hubmeta-read.R#L80-L107

However, the jsonvalidate package functions work on JSON, so we would ideally want $ref pointer resolution to be carried out on the JSON-formatted data prior to validation and reading into R.
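
For reference, a minimal sketch of what such a resolver can look like on the parsed R list (this is a simplified stand-in, not the actual hubUtils implementation; the example structure and function name are made up):

```r
# Sketch of resolving "$ref" pointers in a parsed hubmeta list.
# Assumes refs only point into "$defs", i.e. look like "#/$defs/<name>".
resolve_refs <- function(x, defs) {
  if (!is.list(x)) return(x)
  if (!is.null(x[["$ref"]])) {
    # "#/$defs/location" -> "location"
    key <- sub("^#/\\$defs/", "", x[["$ref"]])
    return(defs[[key]])
  }
  lapply(x, resolve_refs, defs = defs)
}

# Made-up example structure
hubmeta <- list(
  "$defs" = list(location = list(type = "character", required = c("US", "MA"))),
  rounds = list(task_id = list(location = list("$ref" = "#/$defs/location")))
)
resolved <- resolve_refs(hubmeta, defs = hubmeta[["$defs"]])
```

The recursion handles refs at any depth, but this sketch assumes no ref points at another ref and resolves against $defs only.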

Current options identified

Hacky workaround to allow for ref options in schema.json docs

One option would be to hard-code within the schema that any property could match either the defined expectation OR be a ref object.

I managed to encode this successfully for the 3 $refs to location values in the modified complex example in the json-schema-refs branch of my fork. You can see the diffs between my original proposal and the workaround to handle refs here: https://github.com/annakrystalli/hub-infrastructure-experiments/pull/1/files
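
A heavily simplified sketch of what this wrapping looks like (illustrative only, not the actual schema; the property shapes are made up):

```json
{
  "required": {
    "oneOf": [
      { "type": "array", "items": { "type": "string" } },
      {
        "type": "object",
        "patternProperties": { "^\\$ref$": { "type": "string" } },
        "additionalProperties": false
      }
    ]
  }
}
```

Every property that may hold a $ref needs its own copy of this oneOf wrapper, which is what makes the approach so verbose.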

Cons (many!)
  • Really verbose and hacky workaround. The change shown in the diff was needed to accommodate the validation of a single $ref pointer in just a single property (the required property of the location task_id): https://github.com/annakrystalli/hub-infrastructure-experiments/blob/af6132871089884648bb20b3e05b350a91ad948c/json-schema/hubmeta-schema.json#L80-L102

    We would have to effectively wrap every single property specification in these oneOf expressions throughout the whole nested structure of the schema! (Unless I am missing something, but at the moment I can't see a more succinct way to do it.)

  • The way I'm matching the $ref property name is also hacky. Using the bare $ref name with the ajv validation engine meant it was interpreted as an actual pointer by jsonvalidate::json_validator(), causing it to look for a non-existent subschema file and throw an error. This is avoided by the hacky "patternProperties" method I've gone for, where validation is applied to any property whose name matches the regex $ref.

  • Finally (and likely most importantly!), the contents of the $ref pointers in $defs are NOT validated against the schema. The schema only validates that a ref points to a valid address within the schema.

Pros

Probably the single pro is that it will all be handled in the schema.json file, but there is not much else good about this approach.

Resolving refs in R and re-converting to JSON.

The suggested workflow is to read the JSON config file into a list in R, resolve any refs with hubUtils:::substitute_refs(), convert back to JSON, and then perform validation using jsonvalidate.

This also feels really hacky and wasteful up front. The only saving grace is that such a workflow (i.e. reading into R and converting back to JSON) may be required anyway to validate yml config files, as I've not found (so far) any yml functionality equivalent to jsonvalidate.

However, experimentation with this is also throwing up issues to do with how R serialises back to JSON.

In particular, it has to do with serialisation of vectors of length 1. The standard behaviour of toJSON is to serialise all vectors of length one to arrays with one element. This means fields that were originally encoded as simple "key": "value" pairs in the JSON are re-serialised as "key": ["value"] arrays, and they fail validation because the schema expects a single value of a specific type rather than an array.

A way to switch off this behaviour is to use auto_unbox = TRUE in toJSON, which converts all vectors of length 1 to "key": "value" pairs. This creates the opposite problem: properties defined as single-element arrays are now encoded as "key": "value" pairs, and these fail validation instead.
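
To make the boxing issue concrete, here is a minimal jsonlite illustration (the field names are made up):

```r
library(jsonlite)

x <- list(team_name = "hub", locations = "US")

# Default behaviour: every length-1 vector becomes a one-element array
toJSON(x)
# {"team_name":["hub"],"locations":["US"]}

# auto_unbox = TRUE: every length-1 vector becomes a scalar, including
# fields the schema may expect to remain arrays
toJSON(x, auto_unbox = TRUE)
# {"team_name":"hub","locations":"US"}

# Wrapping a value in I() keeps it boxed even with auto_unbox = TRUE,
# but this requires knowing per field what the schema expects
toJSON(list(locations = I("US")), auto_unbox = TRUE)
# {"locations":["US"]}
```

Neither global setting is right for every field, which is why schema-informed serialisation is attractive.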

One way to get around this is offered by jsonvalidate, which allows serialisation to be informed by our schema using the following code:

# Read the JSON config into an R list
complex_mod_path <- here::here(
  "json-schema",
  "modified-hubmeta-examples",
  "complex-hubmeta-mod.json"
)
json_list <- jsonlite::read_json(
  complex_mod_path,
  simplifyVector = TRUE,
  simplifyDataFrame = FALSE
)

# Create a new schema instance
schema <- jsonvalidate::json_schema$new(
  schema = here::here("json-schema", "hubmeta-schema.json"),
  engine = "ajv"
)

# Use the schema to serialise the list to JSON
json <- schema$serialise(json_list)

# Use the schema to validate the JSON
schema$validate(json)

Nice! BUT!...

When trying to run this, it stumbles on the fact that schema$serialise(json_list) cannot handle null elements and returns:

Error in context_eval(join(src), private$context, serialize, await) : 
  TypeError: Cannot convert undefined or null to object

😩

Wrap a python or other method for resolving JSON references

Instant massive additional dependency overhead. I'm not a big fan of this approach unless it can be super lean, but happy to hear others' thoughts about it!

Conclusion

Sorry for the huge length of this investigation report but I wanted to capture everything I tried to inform next steps.

At this stage, I'm actually leaning most towards opening an issue or two in jsonvalidate to see how amenable the authors would be to:

  • handle conversion of null objects, especially if the schema itself allows them (if possible, of course)

  • see how they feel about adding functionality to resolve refs prior to validating. Their functions currently resolve $ref pointers in the schema prior to validating, so I'm wondering how much work it would be to also do this for the JSON being validated.

Beyond that, I'm happy to hear other folks' thoughts!

Obviously, the fallback is to write code ourselves to do all the validation against the schema within hubUtils. This would remove the V8-based jsonvalidate dependency, but it also feels like a big task given that a one-liner using functions in jsonvalidate would (almost!) do the job!

from schemas.

annakrystalli commented on July 30, 2024

Have updated my notebook with some code and output of my experimentations: https://annakrystalli.me/hub-infrastructure-experiments/json-schema/jsonvalidate.html

Note all was run in this repo and branch: https://github.com/annakrystalli/hub-infrastructure-experiments/tree/json-schema-refs


elray1 commented on July 30, 2024

It seems like the second approach, resolving refs in R and converting back to json, is probably the way to go. The drawbacks you noted for the other ones (not actually validating the contents of referenced objects, introducing dependencies on other languages like Python, and then having the whole problem over again if we do want to support yaml) are pretty severe.

Your suggestion about adding a feature to jsonvalidate to handle conversion of null objects makes sense. I wonder if another option might be to:

  • modify our schema so that anything that's currently allowed to be null can just be omitted instead (such omitted items would be implicitly null)
  • then, drop any null objects from the processed R list before outputting it
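
The second step can be sketched in a few lines (illustrative only; it assumes no field legitimately needs an explicit null once the schema allows omission instead):

```r
# Recursively drop NULL elements from a parsed config list
drop_nulls <- function(x) {
  if (!is.list(x)) return(x)
  x <- x[!vapply(x, is.null, logical(1))]
  lapply(x, drop_nulls)
}

# Made-up example
cfg <- list(a = 1, b = NULL, c = list(d = NULL, e = "keep"))
drop_nulls(cfg)
# returns the equivalent of list(a = 1, c = list(e = "keep"))
```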

The other request for jsonvalidate, about resolving references before validating, also makes sense. It seems like it would be a good feature for that package to have, but it feels like it might be a larger request (?) and also might not help if we want to validate yaml (?).


annakrystalli commented on July 30, 2024

Nice idea about just removing NULL fields! That should work nicely.

Re the validation of yaml: to use the jsonvalidate functions we would have to convert yaml files to JSON anyway, so that shouldn't add any extra steps beyond what we'd already have to perform.
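
That conversion can be sketched as follows (assumes the yaml package; the config content is made up, and the same boxing caveats discussed above apply):

```r
library(yaml)
library(jsonlite)

# Parse a yaml config; homogeneous sequences are read in as vectors
cfg <- yaml.load("
team_name: hub
locations:
  - US
  - MA
")

# Convert to JSON: the length-2 sequence stays a JSON array,
# while scalars become plain values under auto_unbox = TRUE
json <- toJSON(cfg, auto_unbox = TRUE)
json
# {"team_name":"hub","locations":["US","MA"]}

# The resulting JSON could then go through the same jsonvalidate workflow, e.g.
# jsonvalidate::json_validate(json, "hubmeta-schema.json", engine = "ajv")
```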


annakrystalli commented on July 30, 2024

FYI, I've opened a couple of issues in jsonvalidate regarding some of the issues we've been having:


annakrystalli commented on July 30, 2024

Been looking at workarounds for outstanding issues with jsonvalidate and have encountered an additional issue. Even when using the schema to re-serialise, if a property type has multiple options that include "array" (of specific interest to us where a type can be null or array, specified as "type": ["null", "array"]), "array" seems to suppress unboxing, and a NULL value ends up converted to [null] instead of just null. 😭
