leocalm / avro_validator
A pure python avro schema validator
License: MIT License
I think that if the JSON being validated has a number that is an integer, it should still pass validation against a float field.
If others agree, I can create a pull request for this change.
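For what it's worth, the Avro spec's schema-resolution rules allow an int to be promoted to float or double. A minimal sketch of the proposed check (the function name is illustrative, not the library's actual API):

```python
def is_valid_float(value):
    """Accept real floats, plus ints promoted to float per the Avro
    schema-resolution rules. bool is excluded explicitly because it
    is a subclass of int in Python."""
    if isinstance(value, bool):
        return False
    return isinstance(value, (int, float))

# An integer in the JSON being validated then passes for a float field.
print(is_valid_float(3))     # True
print(is_valid_float(3.5))   # True
print(is_valid_float(True))  # False
```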
When I use fields in my .avsc schema other than the standard ones:
name: a JSON string providing the name of the record (required);
namespace: a JSON string that qualifies the name (optional);
doc: a JSON string providing documentation to the user of this schema (optional);
aliases: a JSON array of strings, providing alternate names for this record (optional);
fields: a JSON array, listing fields (required), where each field is a JSON object with its own attributes;
I get an error that these are the only fields allowed in RecordType.
Here is the line of code: https://github.com/leocalm/avro_validator/blob/master/avro_validator/avro_types.py#L67
But in the avro documentation, it states this is allowed: "...permitted as metadata..."
https://avro.apache.org/docs/1.10.2/spec.html#schemas
A Schema is represented in JSON by one of:
A JSON string, naming a defined type.
A JSON object, of the form:
{"type": "typeName" ...attributes...}
where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
Could you either fix this, or give me write access so I can fix it myself in a branch?
Thank you.
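Per the spec text quoted above, one way to honor extra attributes as metadata is to require only the mandatory keys instead of rejecting unknown ones. A rough sketch under that assumption (the function and set names are illustrative, loosely modeled on avro_types.py, not the library's actual code):

```python
def validate_json_repr(json_repr, required_attributes, optional_attributes):
    """Require the mandatory keys, but tolerate unknown keys as
    metadata, per the spec ('permitted as metadata, but must not
    affect the format of serialized data')."""
    missing = required_attributes - json_repr.keys()
    if missing:
        raise ValueError(f'Missing required attributes: {missing}')
    # Unknown keys are returned (e.g. for logging) instead of rejected.
    return json_repr.keys() - required_attributes - optional_attributes

schema = {'type': 'record', 'name': 'X', 'fields': [], 'my.custom.attr': 1}
extras = validate_json_repr(schema, {'type', 'name', 'fields'},
                            {'doc', 'aliases', 'namespace'})
print(extras)  # {'my.custom.attr'}
```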
Hi all! It looks like we're running into a problem when we have a schema as JSON in a variable that we pass into the Schema class constructor.
> schema = Schema(json.dumps(local_schema))
test/test_sdp_producer.py:451:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
lib/python3.7/site-packages/avro_validator/schema.py:17: in __init__
    if file_path.exists():
/usr/local/lib/python3.7/pathlib.py:1329: in exists
    self.stat()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = PosixPath('{"type": "record", "name": "Invoice", "fields": [{"name": "enterpriseEventEnvelope", "type": {"type": "reco...e": ["null", {"type": "record", "name": "DomainPayLoadRecord", "fields": [{"name": "eventId", "type": "string"}]}]}]}')

    def stat(self):
        """
        Return the result of the stat() system call on this path, like
        os.stat() does.
        """
>       return self._accessor.stat(self)
E       OSError: [Errno 36] File name too long: '{"type": "record", "name": "Invoice", "fields": [{"name": "enterpriseEventEnvelope", "type": {"type": "record", "name": "EnterpriseEventEnvelopeRecord", "fields": [{"name": "eventId", "type": "string"}]}}, {"name": "domainPayload", "type": ["null", {"type": "record", "name": "DomainPayLoadRecord", "fields": [{"name": "eventId", "type": "string"}]}]}]}'

/usr/local/lib/python3.7/pathlib.py:1151: OSError
As far as I can tell, this has only become a problem in the last day or so, with the switch to pathlib. Perhaps we need to wrap the file_path.exists() check in a try block and fall back to the else clause when we catch an OSError?
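A minimal sketch of that suggestion, assuming the constructor builds a pathlib.Path from its argument and branches on exists() (the function name and structure are illustrative, not the library's actual code):

```python
from pathlib import Path

def load_schema(schema: str) -> str:
    """Return the schema text, whether the argument is a file path or
    the schema itself. Path.exists() can raise OSError (e.g. Errno 36,
    'File name too long') when handed a large inline JSON string, so
    that case falls back to treating the argument as schema text."""
    file_path = Path(schema)
    try:
        is_file = file_path.exists()
    except OSError:
        is_file = False
    if is_file:
        return file_path.read_text()
    return schema

# An inline schema comes back unchanged instead of crashing in stat().
print(load_schema('{"type": "record", "name": "Invoice", "fields": []}'))
```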
I have the following field declared in my schema, but missing in the data:
{
"name": "myField",
"type": [
"null",
{
"type": "map",
"values": {
"type": "string",
"avro.java.string": "String"
},
"avro.java.string": "String"
}
],
"default": null
}
I would expect the schema validator to ignore it, given that the field has "null" both as part of the union type and as its default. However, the validator throws the following error:
ValueError: Error parsing the field [surfaceIds]: The MapType can only contains {'values', 'type'} keys
Hi!
Thanks for your awesome, developer-friendly library.
I just have a question. I've begun migrating my NiFi workflow to Airflow, and I'd like to know whether there is a way to skip the check for extra fields. I read a CSV with the pandas library, and I only want to validate certain columns against my Avro schema, not all fields.
Kind regards
Hello,
I am running the following on the command line:
avro_validator union_schema.avsc producing_message.json
My union_schema.avsc is a JSON array containing several interdependent record definitions; an example is below.
[
{
"type": "record",
"namespace": "com.company.model",
"name": "AddressRecord",
"fields": [
{
"name": "streetaddress",
"type": "string"
},
{
"name": "city",
"type": "string"
}
]
},
{
"namespace": "com.company.model",
"type": "record",
"name": "person",
"fields": [
{
"name": "firstname",
"type": "string"
},
{
"name": "lastname",
"type": "string"
},
{
"name": "address",
"type": {
"type": "array",
"items": "com.company.model.AddressRecord"
}
}
]
}
]
When I was trying to validate through command line, I got an error
Traceback (most recent call last):
File "/.local/bin/avro_validator", line 8, in <module>
sys.exit(main())
File "/.local/lib/python3.6/site-packages/avro_validator/cli.py", line 28, in main
parsed_schema = schema.parse()
File "/.local/lib/python3.6/site-packages/avro_validator/schema.py", line 28, in parse
return RecordType.build(schema)
File "/.local/lib/python3.6/site-packages/avro_validator/avro_types.py", line 647, in build
cls._validate_json_repr(json_repr)
File "/.local/lib/python3.6/site-packages/avro_validator/avro_types.py", line 63, in _validate_json_repr
if cls.required_attributes.intersection(json_repr.keys()) != cls.required_attributes:
AttributeError: 'list' object has no attribute 'keys'
I couldn't find any info about Avro unions in this repo's README. What should I do to make this work? Thanks.
The following schema fails with error: ValueError: The RecordType must have {'name', 'fields'} defined.
{
"name": "test",
"type": "array",
"items": {
"type": "string"
}
}
It appears that the schema parser only accepts schemas with the top level being a record.
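Until non-record top-level schemas are supported, one possible workaround is to wrap the schema in a single-field record before parsing; the data must then be wrapped the same way. A sketch (the Wrapper record and value field names are arbitrary):

```python
import json

def wrap_in_record(schema: dict) -> str:
    """Wrap a non-record top-level schema (e.g. an array) in a
    single-field record so a record-only parser can handle it. Data
    must then be wrapped the same way: {"value": <original data>}."""
    return json.dumps({
        'name': 'Wrapper',
        'type': 'record',
        'fields': [{'name': 'value', 'type': schema}],
    })

print(wrap_in_record({'type': 'array', 'items': 'string'}))
```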
A schema with a complex structure of nested record types fails to parse.
Just found this project today and it has been very helpful, but I ran into a problem with nested arrays.
The following avro schema:
{
"namespace": "com.company.code",
"type": "record",
"name": "MyAvroSchema",
"doc" : "...",
"fields": [
{
"name": "MyArray",
"type": {
"type": "array",
"items": {
"type": {"type": "array","items": "int"}
}
}
}
]
}
generates the following error:
Traceback (most recent call last):
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "/Users/user/Library/Application Support/JetBrains/Toolbox/apps/PyCharm-P/ch-0/222.4459.20/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/Users/user/Library/Application Support/JetBrains/Toolbox/apps/PyCharm-P/ch-0/222.4459.20/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/user/arrhythmia/file_schemas/validate_schema.py", line 5, in <module>
parsed_schema = schema.parse()
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/schema.py", line 28, in parse
return RecordType.build(schema, skip_extra_keys=skip_extra_keys)
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 812, in build
record_type.__fields = {
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 813, in <dictcomp>
field['name']: RecordTypeField.build(
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 585, in build
field.__type = cls.__build_field_type(json_repr, custom_fields)
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 545, in __build_field_type
return cls._get_field_from_json(json_repr['type'], custom_fields)
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 210, in _get_field_from_json
return getattr(sys.modules[__name__], FIELD_MAPPING[field_type['type']]).build(field_type, custom_fields)
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 1016, in build
array_type.__items = ArrayType._get_field_from_json(json_repr['items'], custom_fields)
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 210, in _get_field_from_json
return getattr(sys.modules[__name__], FIELD_MAPPING[field_type['type']]).build(field_type, custom_fields)
TypeError: unhashable type: 'dict'
If I remove the inner array it parses correctly.
In avro_validator/avro_types.py (line 543 at ef7fd09), the validate function raises an exception on invalid data and returns True otherwise. Would it be better to add an optional parameter indicating whether to raise the exception? Something like this:
def validate(self, value: Any, raise_errors=True) -> bool:
Without this, users have to use the workaround below if they only care about whether the data is valid (not which exact error occurred). I think this scenario is quite common; e.g. users may only want to process data that matches the schema and simply discard the rest.
valid = False
try:
valid = schema.validate(data)
except ValueError:
pass
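Until such a parameter exists, that boilerplate can at least be factored into a helper. A minimal sketch, with a stand-in for the library's raising validate():

```python
from typing import Any, Callable

def is_valid(validate: Callable[[Any], bool], value: Any) -> bool:
    """Turn a validate() that raises ValueError on invalid data into a
    plain boolean check."""
    try:
        return validate(value)
    except ValueError:
        return False

# Stand-in for schema.validate: raises on anything but a string.
def validate(value: Any) -> bool:
    if not isinstance(value, str):
        raise ValueError('not a string')
    return True

print(is_valid(validate, 'ok'))  # True
print(is_valid(validate, 42))    # False
```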
Avro now accepts Python datetime objects for int/long fields with a logicalType in the date/timestamp family. It'd be great if this validator could do the same.
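For reference, accepting a datetime for a long/timestamp-millis field amounts to normalizing it to milliseconds since the epoch before the usual integer check. A sketch, not the library's API:

```python
from datetime import datetime, timezone

def to_timestamp_millis(value):
    """Accept either a raw int or a datetime for a long field with
    logicalType timestamp-millis, normalizing to millis since epoch."""
    if isinstance(value, datetime):
        return int(value.timestamp() * 1000)
    if isinstance(value, int):
        return value
    raise ValueError(f'cannot use {value!r} for timestamp-millis')

print(to_timestamp_millis(datetime(2024, 1, 1, tzinfo=timezone.utc)))
# 1704067200000
```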
ValueError: The type [Actor] is not recognized by Avro
import json
from avro_validator.schema import Schema
SCHEMA = {
"name": "Actor",
"type": "record",
"fields": [
{
"name": "actedBy",
"type": ["null", "Actor"],
}
]
}
Schema(json.dumps(SCHEMA)).parse()
Avro records with nullable fields (based on union types) encoded as JSON are not correctly parsed by the avro_validator CLI. The error is due to an inconsistency between how this tool interprets union types and how they are encoded in JSON (link to docs). Specifically:
For example, the union schema ["null","string","Foo"], where Foo is a
record name, would encode:
- null as null;
- string "a" as {"string": "a"}; and
- a Foo instance as {"Foo": {...}}, where {...} indicates the JSON encoding of a Foo instance.
The following schema includes some nullable fields, which can be used to
generate some random data.
{
"type" : "record",
"name" : "test",
"namespace" : "com.example",
"fields" : [ {
"name" : "name",
"type" : "string"
}, {
"name" : "null_name1",
"type" : [ "null", "string" ]
}, {
"name" : "null_name2",
"type" : [ "string", "null" ]
}, {
"name" : "num",
"type" : "int"
}, {
"name" : "null_num1",
"type" : [ "null", "int" ]
}, {
"name" : "null_num2",
"type" : [ "int", "null" ]
} ]
}
The following record fails validation against the above schema:
{
"name": "snhepdirqromqkgllhgljumtuj",
"null_name1": null,
"null_name2": null,
"num": 186374858,
"null_num1": {
"int": -1433093325
},
"null_num2": {
"int": -1728851584
}
}
$ avro_validator test.avsc test.json
Error validating value for field [null_num1]: The value [{'int': -1433093325}] is not from one of the following types: [[NullType, IntType]]
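A sketch of unwrapping that JSON union encoding before validation. In a real implementation this would have to be schema-driven (applied only to union-typed fields, and extended to record-name branches), since a plain map value could look identical:

```python
PRIMITIVES = {'null', 'boolean', 'int', 'long', 'float',
              'double', 'bytes', 'string'}

def unwrap_union(value):
    """Undo Avro's JSON encoding of union values: null stays null, and
    a non-null branch arrives as a single-key object such as
    {"int": -1433093325}, whose key names the branch type."""
    if isinstance(value, dict) and len(value) == 1:
        (branch, inner), = value.items()
        if branch in PRIMITIVES:
            return inner
    return value

print(unwrap_union({'int': -1433093325}))  # -1433093325
print(unwrap_union(None))                  # None
```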
The Schema constructor currently leaves the schema file open until the file object is garbage-collected (unless I misunderstand the semantics of open()).
Current code:
self._schema = open(schema, 'r').read()
This would be more reliable:
with open(schema, 'r') as f:
self._schema = f.read()
Hey Leo!
Is it possible to use multiple .avsc files to validate JSON? I have some custom schemas that are reused in the main schema.
I'm calling avro_validator.schema.Schema(s), where s is a JSON string and len(s) == 33335. The constructor throws the following exception:
{ValueError}stat: path too long for Windows
It's probably not correct to use os.path.isfile() to distinguish between file paths and JSON strings. The simplest fix would be to wrap it in a try/except and treat an exception as another indicator that the argument is not a file.
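One sketch of that suggestion: try to parse the argument as JSON first, and fall back to a guarded os.path.isfile() check only when parsing fails. All names here are illustrative:

```python
import json
import os

def looks_like_file(schema: str) -> bool:
    """Treat the argument as a file path only if it does not parse as
    JSON and os.path.isfile() succeeds. The filesystem check is guarded
    because stat() can raise ValueError on Windows ('stat: path too
    long for Windows') when handed a long inline schema string."""
    try:
        json.loads(schema)
        return False  # valid JSON text: it is an inline schema
    except json.JSONDecodeError:
        pass
    try:
        return os.path.isfile(schema)
    except (OSError, ValueError):
        return False

print(looks_like_file('{"type": "string"}'))  # False
```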
Hi there,
I'm getting the error The RecordTypeField can only contains {'order', 'aliases', 'type', 'doc', 'name', 'default'} keys, even though it looks like this is supported according to this.
They seem to have some other fields listed in the docs there, but I haven't tried them myself. This issue could resolve that as well; it would be nice to have some checking and to be able to support the new fields.
Thanks for this project!
schema = json.dumps({
'name': 'test schema',
'type': 'record',
'doc': 'schema for testing avro_validator',
'fields': [
{
"name": "event_time",
"type": "long",
"logicalType": "timestamp-millis"
}
]
})
schema = Schema(schema)
parsed_schema = schema.parse()
I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/schema.py", line 28, in parse
return RecordType.build(schema, skip_extra_keys=skip_extra_keys)
File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 673, in build
record_type.__fields = {
File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 674, in <dictcomp>
field['name']: RecordTypeField.build(
File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 440, in build
cls._validate_json_repr(json_repr, skip_extra_keys=skip_extra_keys)
File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 72, in _validate_json_repr
raise ValueError(f'The {cls.__name__} can only contains '
ValueError: The RecordTypeField can only contains {'doc', 'default', 'name', 'order', 'aliases', 'type'} keys, but does contain also {'logicalType'}