leocalm / avro_validator
A pure python avro schema validator
License: MIT License
I think that if the JSON being validated has a number that is an integer, it should still pass validation against a float field.
If others agree, I can create a pull request for this change.
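For what it's worth, the Avro spec's schema-resolution rules allow an int to be promoted to float or double. A minimal sketch of the proposed check (the function name is illustrative, not the library's actual API):

```python
def is_valid_float(value):
    """Accept real floats, plus ints promoted to float per the Avro
    schema-resolution rules. bool is excluded explicitly because it
    is a subclass of int in Python."""
    if isinstance(value, bool):
        return False
    return isinstance(value, (int, float))

# An integer in the JSON being validated then passes for a float field.
print(is_valid_float(3))     # True
print(is_valid_float(3.5))   # True
print(is_valid_float(True))  # False
```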
When I use fields in my .avsc schema other than the standard ones:
name: a JSON string providing the name of the record (required);
namespace: a JSON string that qualifies the name (optional);
doc: a JSON string providing documentation to the user of this schema (optional);
aliases: a JSON array of strings, providing alternate names for this record (optional);
fields: a JSON array, listing fields (required), where each field is a JSON object with its own attributes;
I get an error that these are the only fields allowed in RecordType.
Here is the line of code: https://github.com/leocalm/avro_validator/blob/master/avro_validator/avro_types.py#L67
But in the avro documentation, it states this is allowed: "...permitted as metadata..."
https://avro.apache.org/docs/1.10.2/spec.html#schemas
A Schema is represented in JSON by one of:
A JSON string, naming a defined type.
A JSON object, of the form:
{"type": "typeName" ...attributes...}
where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
Could you either fix this, or give me write access so I can fix it myself in a branch?
Thank you.
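Per the spec text quoted above, one way to honor extra attributes as metadata is to require only the mandatory keys instead of rejecting unknown ones. A rough sketch under that assumption (the function and set names are illustrative, loosely modeled on avro_types.py, not the library's actual code):

```python
def validate_json_repr(json_repr, required_attributes, optional_attributes):
    """Require the mandatory keys, but tolerate unknown keys as
    metadata, per the spec ('permitted as metadata, but must not
    affect the format of serialized data')."""
    missing = required_attributes - json_repr.keys()
    if missing:
        raise ValueError(f'Missing required attributes: {missing}')
    # Unknown keys are returned (e.g. for logging) instead of rejected.
    return json_repr.keys() - required_attributes - optional_attributes

schema = {'type': 'record', 'name': 'X', 'fields': [], 'my.custom.attr': 1}
extras = validate_json_repr(schema, {'type', 'name', 'fields'},
                            {'doc', 'aliases', 'namespace'})
print(extras)  # {'my.custom.attr'}
```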
Hi all! It looks like we're running into a problem when we have a schema as JSON in a variable that we pass into the Schema class constructor.
> schema = Schema(json.dumps(local_schema))
test/test_sdp_producer.py:451:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
lib/python3.7/site-packages/avro_validator/schema.py:17: in __init__
    if file_path.exists():
/usr/local/lib/python3.7/pathlib.py:1329: in exists
    self.stat()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = PosixPath('{"type": "record", "name": "Invoice", "fields": [{"name": "enterpriseEventEnvelope", "type": {"type": "reco...e": ["null", {"type": "record", "name": "DomainPayLoadRecord", "fields": [{"name": "eventId", "type": "string"}]}]}]}')

    def stat(self):
        """
        Return the result of the stat() system call on this path, like
        os.stat() does.
        """
>       return self._accessor.stat(self)
E       OSError: [Errno 36] File name too long: '{"type": "record", "name": "Invoice", "fields": [{"name": "enterpriseEventEnvelope", "type": {"type": "record", "name": "EnterpriseEventEnvelopeRecord", "fields": [{"name": "eventId", "type": "string"}]}}, {"name": "domainPayload", "type": ["null", {"type": "record", "name": "DomainPayLoadRecord", "fields": [{"name": "eventId", "type": "string"}]}]}]}'

/usr/local/lib/python3.7/pathlib.py:1151: OSError
As far as I can tell, this has only become a problem in the last day or so, with the switch to pathlib. Perhaps we need to wrap the file_path.exists() check in a try block and fall back to the else clause when we catch an OSError?
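A minimal sketch of that suggestion, assuming the constructor builds a pathlib.Path from its argument and branches on exists() (the function name and structure are illustrative, not the library's actual code):

```python
from pathlib import Path

def load_schema(schema: str) -> str:
    """Return the schema text, whether the argument is a file path or
    the schema itself. Path.exists() can raise OSError (e.g. Errno 36,
    'File name too long') when handed a large inline JSON string, so
    that case falls back to treating the argument as schema text."""
    file_path = Path(schema)
    try:
        is_file = file_path.exists()
    except OSError:
        is_file = False
    if is_file:
        return file_path.read_text()
    return schema

# An inline schema comes back unchanged instead of crashing in stat().
print(load_schema('{"type": "record", "name": "Invoice", "fields": []}'))
```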
I have the following field declared in my schema, but missing in the data:
{
"name": "myField",
"type": [
"null",
{
"type": "map",
"values": {
"type": "string",
"avro.java.string": "String"
},
"avro.java.string": "String"
}
],
"default": null
}
I would expect the schema validator to ignore it, given that the field has "null" both as part of the union type and as its default. However, the validator throws the following error:
ValueError: Error parsing the field [surfaceIds]: The MapType can only contains {'values', 'type'} keys
Hi!
Thanks for your awesome, developer-friendly library.
I just have a question. I've begun migrating my NiFi workflow to Airflow, and I'd like to know whether there is a way to skip the check for extra fields. I read a CSV with the pandas library, and I only want to validate certain columns against my Avro schema, not all fields.
Kind regards
Hello,
I am running the following on the command line:
avro_validator union_schema.avsc producing_message.json
My union_schema.avsc is a JSON array containing several interdependent record definitions; an example is below.
[
{
"type": "record",
"namespace": "com.company.model",
"name": "AddressRecord",
"fields": [
{
"name": "streetaddress",
"type": "string"
},
{
"name": "city",
"type": "string"
}
]
},
{
"namespace": "com.company.model",
"type": "record",
"name": "person",
"fields": [
{
"name": "firstname",
"type": "string"
},
{
"name": "lastname",
"type": "string"
},
{
"name": "address",
"type": {
"type": "array",
"items": "com.company.model.AddressRecord"
}
}
]
}
]
When I was trying to validate through command line, I got an error
Traceback (most recent call last):
File "/.local/bin/avro_validator", line 8, in <module>
sys.exit(main())
File "/.local/lib/python3.6/site-packages/avro_validator/cli.py", line 28, in main
parsed_schema = schema.parse()
File "/.local/lib/python3.6/site-packages/avro_validator/schema.py", line 28, in parse
return RecordType.build(schema)
File "/.local/lib/python3.6/site-packages/avro_validator/avro_types.py", line 647, in build
cls._validate_json_repr(json_repr)
File "/.local/lib/python3.6/site-packages/avro_validator/avro_types.py", line 63, in _validate_json_repr
if cls.required_attributes.intersection(json_repr.keys()) != cls.required_attributes:
AttributeError: 'list' object has no attribute 'keys'
I couldn't find any info about Avro unions in this repo's README. What should I do to make this work? Thanks.
The following schema fails with error: ValueError: The RecordType must have {'name', 'fields'} defined.
{
"name": "test",
"type": "array",
"items": {
"type": "string"
}
}
It appears that the schema parser only accepts schemas with the top level being a record.
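Until non-record top-level schemas are supported, one possible workaround is to wrap the schema in a single-field record before parsing; the data must then be wrapped the same way. A sketch (the Wrapper record and value field names are arbitrary):

```python
import json

def wrap_in_record(schema: dict) -> str:
    """Wrap a non-record top-level schema (e.g. an array) in a
    single-field record so a record-only parser can handle it. Data
    must then be wrapped the same way: {"value": <original data>}."""
    return json.dumps({
        'name': 'Wrapper',
        'type': 'record',
        'fields': [{'name': 'value', 'type': schema}],
    })

print(wrap_in_record({'type': 'array', 'items': 'string'}))
```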
A schema with a complex structure of nested record types fails to parse.
Just found this project today and it has been very helpful, but I ran into a problem with nested arrays.
The following avro schema:
{
"namespace": "com.company.code",
"type": "record",
"name": "MyAvroSchema",
"doc" : "...",
"fields": [
{
"name": "MyArray",
"type": {
"type": "array",
"items": {
"type": {"type": "array","items": "int"}
}
}
}
]
}
generates the following error:
Traceback (most recent call last):
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "/Users/user/Library/Application Support/JetBrains/Toolbox/apps/PyCharm-P/ch-0/222.4459.20/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/Users/user/Library/Application Support/JetBrains/Toolbox/apps/PyCharm-P/ch-0/222.4459.20/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/user/arrhythmia/file_schemas/validate_schema.py", line 5, in <module>
parsed_schema = schema.parse()
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/schema.py", line 28, in parse
return RecordType.build(schema, skip_extra_keys=skip_extra_keys)
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 812, in build
record_type.__fields = {
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 813, in <dictcomp>
field['name']: RecordTypeField.build(
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 585, in build
field.__type = cls.__build_field_type(json_repr, custom_fields)
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 545, in __build_field_type
return cls._get_field_from_json(json_repr['type'], custom_fields)
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 210, in _get_field_from_json
return getattr(sys.modules[__name__], FIELD_MAPPING[field_type['type']]).build(field_type, custom_fields)
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 1016, in build
array_type.__items = ArrayType._get_field_from_json(json_repr['items'], custom_fields)
File "/Users/user/miniconda3/envs/arrhythmia/lib/python3.9/site-packages/avro_validator/avro_types.py", line 210, in _get_field_from_json
return getattr(sys.modules[__name__], FIELD_MAPPING[field_type['type']]).build(field_type, custom_fields)
TypeError: unhashable type: 'dict'
If I remove the inner array it parses correctly.
In avro_validator/avro_types.py (line 543 at ef7fd09), the validate function raises an exception on invalid data and returns True otherwise. Would it be better to add an optional parameter indicating whether to raise the exception? Something like this:
def validate(self, value: Any, raise_errors=True) -> bool:
Without this, users have to use the workaround below if they only care about whether the data is valid (not which exact error occurred). I think this scenario is quite common; e.g. users may only want to process data that matches the schema and simply discard the rest.
valid = False
try:
valid = schema.validate(data)
except ValueError:
pass
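Until such a parameter exists, that boilerplate can at least be factored into a helper. A minimal sketch, with a stand-in for the library's raising validate():

```python
from typing import Any, Callable

def is_valid(validate: Callable[[Any], bool], value: Any) -> bool:
    """Turn a validate() that raises ValueError on invalid data into a
    plain boolean check."""
    try:
        return validate(value)
    except ValueError:
        return False

# Stand-in for schema.validate: raises on anything but a string.
def validate(value: Any) -> bool:
    if not isinstance(value, str):
        raise ValueError('not a string')
    return True

print(is_valid(validate, 'ok'))  # True
print(is_valid(validate, 42))    # False
```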
Avro now accepts Python datetime objects for int/long fields with a logicalType in the date/timestamp family. It'd be great if this validator could do the same.
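For reference, accepting a datetime for a long/timestamp-millis field amounts to normalizing it to milliseconds since the epoch before the usual integer check. A sketch, not the library's API:

```python
from datetime import datetime, timezone

def to_timestamp_millis(value):
    """Accept either a raw int or a datetime for a long field with
    logicalType timestamp-millis, normalizing to millis since epoch."""
    if isinstance(value, datetime):
        return int(value.timestamp() * 1000)
    if isinstance(value, int):
        return value
    raise ValueError(f'cannot use {value!r} for timestamp-millis')

print(to_timestamp_millis(datetime(2024, 1, 1, tzinfo=timezone.utc)))
# 1704067200000
```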
ValueError: The type [Actor] is not recognized by Avro
import json
from avro_validator.schema import Schema
SCHEMA = {
"name": "Actor",
"type": "record",
"fields": [
{
"name": "actedBy",
"type": ["null", "Actor"],
}
]
}
Schema(json.dumps(SCHEMA)).parse()
Avro records with nullable fields (based on union types) encoded as JSON are not correctly parsed by the avro_validator CLI. The error is due to an inconsistency between how this tool interprets union types and how they are encoded in JSON (link to docs). Specifically:
For example, the union schema ["null","string","Foo"], where Foo is a
record name, would encode:
- null as null;
- string "a" as {"string": "a"}; and
- a Foo instance as {"Foo": {...}}, where {...} indicates the JSON encoding of a Foo instance.
The following schema includes some nullable fields, which can be used to
generate some random data.
{
"type" : "record",
"name" : "test",
"namespace" : "com.example",
"fields" : [ {
"name" : "name",
"type" : "string"
}, {
"name" : "null_name1",
"type" : [ "null", "string" ]
}, {
"name" : "null_name2",
"type" : [ "string", "null" ]
}, {
"name" : "num",
"type" : "int"
}, {
"name" : "null_num1",
"type" : [ "null", "int" ]
}, {
"name" : "null_num2",
"type" : [ "int", "null" ]
} ]
}
The following record fails validation against the above schema:
{
"name": "snhepdirqromqkgllhgljumtuj",
"null_name1": null,
"null_name2": null,
"num": 186374858,
"null_num1": {
"int": -1433093325
},
"null_num2": {
"int": -1728851584
}
}
$ avro_validator test.avsc test.json
Error validating value for field [null_num1]: The value [{'int': -1433093325}] is not from one of the following types: [[NullType, IntType]]
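A sketch of unwrapping that JSON union encoding before validation. In a real implementation this would have to be schema-driven (applied only to union-typed fields, and extended to record-name branches), since a plain map value could look identical:

```python
PRIMITIVES = {'null', 'boolean', 'int', 'long', 'float',
              'double', 'bytes', 'string'}

def unwrap_union(value):
    """Undo Avro's JSON encoding of union values: null stays null, and
    a non-null branch arrives as a single-key object such as
    {"int": -1433093325}, whose key names the branch type."""
    if isinstance(value, dict) and len(value) == 1:
        (branch, inner), = value.items()
        if branch in PRIMITIVES:
            return inner
    return value

print(unwrap_union({'int': -1433093325}))  # -1433093325
print(unwrap_union(None))                  # None
```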
The Schema constructor currently leaves the schema file open until the file object is garbage-collected (unless I misunderstand the semantics of open()).
Current code:
self._schema = open(schema, 'r').read()
This would be more reliable:
with open(schema, 'r') as f:
self._schema = f.read()
Hey Leo!
Is it possible to use multiple .avsc files to validate JSON? I have some custom schemas that are reused in the main schema.
I'm calling avro_validator.schema.Schema(s), where s is a JSON string and len(s) == 33335. The constructor throws the following exception:
{ValueError}stat: path too long for Windows
It's probably not correct to use os.path.isfile() to distinguish between file paths and JSON strings. The simplest fix would be to wrap it in a try/except and treat an exception as another indicator that the argument is not a file.
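One sketch of that suggestion: try to parse the argument as JSON first, and fall back to a guarded os.path.isfile() check only when parsing fails. All names here are illustrative:

```python
import json
import os

def looks_like_file(schema: str) -> bool:
    """Treat the argument as a file path only if it does not parse as
    JSON and os.path.isfile() succeeds. The filesystem check is guarded
    because stat() can raise ValueError on Windows ('stat: path too
    long for Windows') when handed a long inline schema string."""
    try:
        json.loads(schema)
        return False  # valid JSON text: it is an inline schema
    except json.JSONDecodeError:
        pass
    try:
        return os.path.isfile(schema)
    except (OSError, ValueError):
        return False

print(looks_like_file('{"type": "string"}'))  # False
```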
Hi there,
I'm getting the error The RecordTypeField can only contains {'order', 'aliases', 'type', 'doc', 'name', 'default'} keys, even though it looks like this is supported according to this.
They seem to have some other fields listed in the docs there, but I haven't tried them myself. This issue could resolve that as well; it would be nice to have some checking and to be able to support the new fields.
Thanks for this project!
schema = json.dumps({
'name': 'test schema',
'type': 'record',
'doc': 'schema for testing avro_validator',
'fields': [
{
"name": "event_time",
"type": "long",
"logicalType": "timestamp-millis"
}
]
})
schema = Schema(schema)
parsed_schema = schema.parse()
I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/schema.py", line 28, in parse
return RecordType.build(schema, skip_extra_keys=skip_extra_keys)
File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 673, in build
record_type.__fields = {
File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 674, in <dictcomp>
field['name']: RecordTypeField.build(
File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 440, in build
cls._validate_json_repr(json_repr, skip_extra_keys=skip_extra_keys)
File "/Users/amihayg/.virtualenvs/data-entity-schema/lib/python3.9/site-packages/avro_validator/avro_types.py", line 72, in _validate_json_repr
raise ValueError(f'The {cls.__name__} can only contains '
ValueError: The RecordTypeField can only contains {'doc', 'default', 'name', 'order', 'aliases', 'type'} keys, but does contain also {'logicalType'}