Results of experimenting with validating $ref pointers in our JSON documents
While jsonvalidate::json_validate() offers a succinct way of accomplishing basic hubmeta.json validation, it is hard to validate $ref pointers in the JSON document being validated using functionality available through R.
From researching how to encode in the schema the possibility of a $ref object in lieu of any other property type, it seems that the standard way of handling this situation is to first resolve the pointers in the document being validated and then perform the validation using the schema. This makes sense: in the schema we want to encode the type and other criteria a property should match, and these can only be validated once any pointers are resolved.
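For example (a hypothetical fragment, loosely based on the complex example), a property written as a pointer:
"location": { "$ref": "#/$defs/location" }
would first be rewritten to the referenced definition stored under $defs, e.g.:
"location": { "required": ["US"], "optional": null }
and only then validated against the schema.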
The problem is that we have not found functionality in R to resolve pointers (e.g. functionality equivalent to the Python solution @evan proposed here: https://github.com/reichlab/hub-infrastructure-experiments/blob/9d4889de34e5bc1e7df1b32a577caa6657c3384d/metadata-format-examples/complex-example/process-metadata-proposal.py#L1-L7).
The solution we currently have works on the R list after the file has been read into R, using the custom hubUtils function substitute_refs(): https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/blob/ab81ae6b8afac11a52950f4272830ccd2e84a5e3/R/hubmeta-read.R#L80-L107. However, the jsonvalidate package functions work on JSON, so we would ideally want $ref pointer resolution to be carried out on the JSON-formatted data prior to validation and reading into R.
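To make the idea concrete, here is a minimal sketch of list-based ref resolution. This is not the actual hubUtils:::substitute_refs() (see the linked source for the real implementation), and it only handles refs of the form "#/$defs/<name>":
# Minimal sketch: recursively replace list elements of the form
# list("$ref" = "#/$defs/<name>") with the matching entry in defs
resolve_refs <- function(x, defs) {
  if (is.list(x)) {
    if (!is.null(x[["$ref"]])) {
      # take the last path element, e.g. "#/$defs/location" -> "location"
      return(defs[[basename(x[["$ref"]])]])
    }
    return(lapply(x, resolve_refs, defs = defs))
  }
  x
}
# e.g. resolve_refs(config, defs = config[["$defs"]])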
Current options identified
Hacky workaround to allow for $ref options in schema.json docs
One option would be to hard-code within the schema that any property could match either the defined expectation OR be a ref object. I managed to encode this successfully for the three $refs to location values in the modified complex example in the json-schema-refs branch of my fork. You can see the diff between my original proposal and the workaround to handle refs here: https://github.com/annakrystalli/hub-infrastructure-experiments/pull/1/files
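For reference, the wrapper for each affected property ends up roughly of this shape (paraphrased from the linked diff, not copied verbatim):
"required": {
  "oneOf": [
    { "type": ["array", "null"], "items": { "type": "string" } },
    {
      "type": "object",
      "patternProperties": { "^\\$ref$": { "type": "string" } },
      "additionalProperties": false
    }
  ]
}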
Cons (many!)
- Really verbose and hacky workaround. The change shown in the diff was to accommodate the validation of a single $ref pointer in just a single property (the required property of the location task_id): https://github.com/annakrystalli/hub-infrastructure-experiments/blob/af6132871089884648bb20b3e05b350a91ad948c/json-schema/hubmeta-schema.json#L80-L102. We would effectively have to wrap every single property specification in these oneOf expressions throughout the whole nested structure of the schema (unless I am missing something, but at the moment I can't see a more succinct way to do it).
- It is also hacky how I'm matching the $ref property name. Using the bare $ref name with the ajv validation engine was interpreted as an actual pointer by jsonvalidate::json_validator(), causing it to look for a non-existent string subschema file and throw an error. This can be avoided with the hacky "patternProperties" method I've gone for, where the validation is applied to a property whose name regex-matches $ref.
- Finally (and likely most importantly!), the contents of the $ref pointers in $defs are NOT validated against the schema. The schema only validates that the ref points to a valid address within the schema.
Pros
Probably the single pro is that it would all be handled in the schema.json file, but there is not much else good about this approach.
Resolving refs in R and re-converting to JSON.
The suggested workflow is to read the JSON config file into a list in R, resolve any refs with hubUtils:::substitute_refs(), convert back to JSON, and then perform validation using jsonvalidate.
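In code, the workflow would look roughly like this (file paths hypothetical; substitute_refs() is internal to hubUtils at the linked commit):
# Read the JSON config into an R list
config <- jsonlite::read_json("hubmeta.json",
                              simplifyVector = TRUE,
                              simplifyDataFrame = FALSE)
# Resolve any $ref pointers on the R list
config <- hubUtils:::substitute_refs(config)
# Convert back to JSON (serialisation caveats discussed below)
json <- jsonlite::toJSON(config)
# Validate against the schema
jsonvalidate::json_validate(json, "hubmeta-schema.json", engine = "ajv")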
This also feels really hacky and wasteful up front. The only saving grace is that such a workflow (i.e. reading into R and converting to JSON) may be required anyway to validate YAML config files, as I've not found (so far) any YAML functionality equivalent to jsonvalidate.
However, experimentation with this is also throwing up issues to do with how R serialises back to JSON.
In particular, the problem is with serialisation of vectors of length 1. The standard behaviour of toJSON is to serialise all vectors of length one to arrays with one element. As a result, fields that were originally encoded as simple "key": "value" pairs in JSON are re-serialised as "key": ["value"] arrays and fail validation, because the schema is not expecting an array but rather a single value of a specific type.
A way to switch off this behaviour is to use auto_unbox = TRUE in toJSON, which means all vectors of length 1 are converted to "key": "value" pairs. This creates the opposite problem: properties defined as single-element arrays would now be encoded as "key": "value" pairs, and these would fail validation instead.
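The two failure modes can be illustrated with jsonlite alone (field names invented; both fields land in R as length-1 character vectors after reading, so jsonlite cannot tell them apart):
x <- list(scalar_field = "a", array_field = "b")
jsonlite::toJSON(x)
#> {"scalar_field":["a"],"array_field":["b"]}  (scalar_field now fails validation)
jsonlite::toJSON(x, auto_unbox = TRUE)
#> {"scalar_field":"a","array_field":"b"}      (array_field now fails validation)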
One way to get around this is offered by jsonvalidate, which allows serialisation to be informed by our schema using the following code:
# Read JSON into an R list
complex_mod_path <- here::here(
  "json-schema", "modified-hubmeta-examples", "complex-hubmeta-mod.json"
)
json_list <- jsonlite::read_json(
  complex_mod_path,
  simplifyVector = TRUE,
  simplifyDataFrame = FALSE
)
# Create a new schema instance
schema <- jsonvalidate::json_schema$new(
  schema = here::here("json-schema", "hubmeta-schema.json"),
  engine = "ajv"
)
# Use the schema to serialise the list to JSON
json <- schema$serialise(json_list)
# Use the schema to validate the JSON
schema$validate(json)
Nice! BUT!...
When trying to run this, it stumbles on the fact that schema$serialise(json_list) cannot handle null elements and returns:
Error in context_eval(join(src), private$context, serialize, await) :
TypeError: Cannot convert undefined or null to object
😩
Wrap a Python or other method for resolving JSON references
This would add an instant, massive additional dependency overhead. I'm not a big fan of this approach unless it can be super lean, but happy to hear others' thoughts on it!
Conclusion
Sorry for the huge length of this investigation report, but I wanted to capture everything I tried to inform next steps.
At this stage, I'm leaning most towards opening an issue or two in jsonvalidate and seeing how amenable the authors would be to:
- handling conversion of null objects, especially if the schema itself allows them (if possible, of course)
- adding functionality to resolve refs prior to validating. Their functions currently resolve ref pointers in the schema prior to validating, so I'm wondering how much work it would be to also perform this on the JSON being validated.
Beyond that, happy to hear other folks' thoughts!
Obviously, the fallback is to write code to do all the validation against schemas within hubUtils ourselves. This would remove the jsonvalidate/V8 dependency, but it also feels like a big task given that a one-liner using jsonvalidate functions would (almost!) do the job!
I have updated my notebook with some code and output from my experimentation: https://annakrystalli.me/hub-infrastructure-experiments/json-schema/jsonvalidate.html
Note it was all run in this repo and branch: https://github.com/annakrystalli/hub-infrastructure-experiments/tree/json-schema-refs
It seems like the second approach, resolving refs in R and converting back to JSON, is probably the way to go. The drawbacks you noted for the other ones (not actually validating the contents of referenced objects, introducing dependencies on other languages like Python, and then having the whole problem over again if we do want to support YAML) are pretty severe.
Your suggestion about adding a feature to jsonvalidate to handle conversion of null objects makes sense. I wonder if another option might be to:
- modify our schema so that anything that's currently allowed to be null can just be omitted instead (such omitted items would be implicitly null)
- then, drop any null objects from the processed R list before outputting it (see the sketch after this list)
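A minimal sketch of that second step, assuming the processed config is a plain nested R list (not tested against the real hub configs):
# Recursively drop NULL elements from a nested list before re-serialising
drop_nulls <- function(x) {
  if (!is.list(x)) return(x)
  x <- x[!vapply(x, is.null, logical(1))]
  lapply(x, drop_nulls)
}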
The other request for jsonvalidate, about resolving references before validating, also makes sense. It seems like it would be a good feature for that package to have, but it feels like it might be a larger request (?) and also might not help if we want to validate YAML (?).
Nice idea about just removing NULL fields! That should work nicely.
Re the validation of YAML: to use the jsonvalidate functions we would have to convert YAML files to JSON anyway, so that shouldn't add any extra steps beyond what we would already have to perform.
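For example, something like this (hypothetical file names, assuming the yaml package; the same auto_unbox caveats discussed above would apply):
# Convert a YAML config to JSON, then validate as usual
cfg <- yaml::read_yaml("hubmeta.yml")
json <- jsonlite::toJSON(cfg, auto_unbox = TRUE)
jsonvalidate::json_validate(json, "hubmeta-schema.json", engine = "ajv")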
FYI, I've opened a couple of issues in jsonvalidate regarding the problems we've been having:
I've been looking at workarounds for the outstanding issues with jsonvalidate and have encountered an additional problem. Even when using the schema to re-serialise, if a property type has multiple options that include "array" (of specific interest to us is where the type can be null or array, specified as "type": ["null", "array"]), the array option seems to suppress unboxing, so a NULL value ends up being converted to [null] instead of just null. 😭