ynqa / pandavro

Apache Avro <-> pandas DataFrame

License: MIT License

Python 97.58% Makefile 2.42%

apache-avro pandas python

pandavro's Issues

Make a new release to allow versions of fastavro>1.6.0

The latest release (1.7.2) restricts fastavro to ~=1.5.1, which translates to >=1.5.1, <1.6.0

Currently in trunk this dependency is loosened to 'fastavro>=1.5.1,<2.0.0'

Could we please have a new release with this change? No version of fastavro<1.8.2 can be used on M-series Macs, and this is causing some pain for local development as I constantly need to work around it.

Thanks!
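
The effect of the `~=` pin described above can be checked by hand. The following is a minimal sketch that assumes plain numeric versions (no pre-release or local suffixes); `parse`, `compatible_release_bounds` and `allows` are hypothetical helper names, not part of pip or pandavro.

```python
def parse(version):
    """Turn '1.5.1' into (1, 5, 1). Assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))


def compatible_release_bounds(version):
    """Bounds implied by '~=version': at least `version`, below the next
    value of the second-to-last release component."""
    release = parse(version)
    if len(release) < 2:
        raise ValueError("~= needs at least two release components")
    prefix = release[:-1]
    upper = prefix[:-1] + (prefix[-1] + 1,)
    return release, upper


def allows(pin, candidate):
    """True if `candidate` satisfies '~=pin'."""
    lower, upper = compatible_release_bounds(pin)
    return lower <= parse(candidate) < upper


print(allows("1.5.1", "1.5.4"))  # inside the pinned range
print(allows("1.5.1", "1.8.2"))  # the version needed on M-series Macs
```

For real dependency resolution the `packaging` library is the better tool; the sketch only illustrates why 1.8.2 falls outside the pinned range.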

Unable to infer timestamp type

I have my Avro Schema as below:

schema = { "namespace": "example.avro", "type": "record", "name": "IoTData", "fields": [ {"name": "nodeId", "type": ["null", "string"], "default": None}, {"name": "displayName", "type": ["null", "string"], "default": None}, {"name": "dataType", "type": ["null", "string"], "default": None}, {"name": "statusCode", "type": ["null", "string"], "default": None}, {"name": "timestamp", "type": ["null", {"type": "string", "logicalType": "timestamp-micros"}], "default": None}, {"name": "sourceTimestamp", "type": ["null", {"type": "string", "logicalType": "timestamp-micros"}], "default": None}, {"name": "value", "type": ["null", "double"], "default": None} ] }

I had to set the default values to None because these values may sometimes be blank.

File content is as below:

[{'nodeId': 'ns=2;s=SCD30_CO2', 'displayName': 'SCD30_CO2', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:21.622480', 'sourceTimestamp': '2024-02-26T03:56:20.224859', 'value': 61.83}, {'nodeId': 'ns=2;s=SCD30_TEMPERATURE', 'displayName': 'SCD30_TEMPERATURE', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:21.622480', 'sourceTimestamp': '2024-02-26T03:56:20.224859', 'value': 27.35}, {'nodeId': 'ns=2;s=SCD30_HUMIDITY', 'displayName': 'SCD30_HUMIDITY', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:21.622480', 'sourceTimestamp': '2024-02-26T03:56:20.224859', 'value': 41.49}, {'nodeId': 'ns=2;s=SCD30_CO2', 'displayName': 'SCD30_CO2', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:22.704777', 'sourceTimestamp': '2024-02-26T03:56:22.250094', 'value': 63.27}]

When I convert this to pandas, the data types are as below:

0  nodeId           60 non-null  string
1  displayName      60 non-null  string
2  dataType         60 non-null  string
3  statusCode       60 non-null  string
4  timestamp        60 non-null  object
5  sourceTimestamp  60 non-null  object
6  value            60 non-null  float64

It's unable to infer the timestamp type.
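
In the Avro specification, `timestamp-micros` annotates a `long` holding microseconds since the Unix epoch, not a `string`, which may be why the two columns come back as plain objects. The sketch below shows one way to turn the ISO strings from the file content into such integers. It assumes the strings are in UTC, which the issue does not state, and `iso_to_micros` and `micros_to_datetime` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_micros(value, tz=timezone.utc):
    """Convert an ISO-8601 string to microseconds since the Unix epoch.

    None passes through. Strings without an offset are assumed to be in `tz`.
    """
    if value is None:
        return None
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=tz)
    return (parsed - EPOCH) // timedelta(microseconds=1)


def micros_to_datetime(micros):
    """Inverse of iso_to_micros, returning an aware UTC datetime."""
    return EPOCH + timedelta(microseconds=micros)


micros = iso_to_micros("2024-02-25T22:56:21.622480")
print(micros, micros_to_datetime(micros))
```

With values like these, the field type would be `{"type": "long", "logicalType": "timestamp-micros"}`. On the pandas side, calling `pd.to_datetime` on the two columns after reading is another option if the schema cannot be changed.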

add support of snappy codec (ValueError: snappy codec is supported but you need to install python-snappy)

ValueError: snappy codec is supported but you need to install python-snappy

from_records() got an unexpected keyword argument 'na_dtypes'

Hi,

First of all: many thanks for pandavro! It's incredibly useful in day-to-day data operations.

When using read_avro() with na_dtypes=True, I get the following TypeError, using Pandas 1.3.5:

from_records() got an unexpected keyword argument 'na_dtypes'

Will post full trace below.

I'd like to ask whether this is a known issue and whether there is a workaround. If it is a new issue, I'm willing to help fix it. Any pointers to get started are deeply appreciated.

Full command:

df = pdx.read_avro('./test.avro', na_dtypes=True)

Full trace:

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_4020/1232288353.py in <module>
----> 1 df = pdx.read_avro('./test.avro', na_dtypes=True)

/opt/conda/lib/python3.7/site-packages/pandavro/__init__.py in read_avro(file_path_or_buffer, schema, **kwargs)
    194     if isinstance(file_path_or_buffer, six.string_types):
    195         with open(file_path_or_buffer, 'rb') as f:
--> 196             return __file_to_dataframe(f, schema, **kwargs)
    197     else:
    198         return __file_to_dataframe(file_path_or_buffer, schema, **kwargs)

/opt/conda/lib/python3.7/site-packages/pandavro/__init__.py in __file_to_dataframe(f, schema, **kwargs)
    177 def __file_to_dataframe(f, schema, **kwargs):
    178     reader = fastavro.reader(f, reader_schema=schema)
--> 179     return pd.DataFrame.from_records(list(reader), **kwargs)
    180 
    181 

TypeError: from_records() got an unexpected keyword argument 'na_dtypes'
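
The `read_avro` shown in the trace has no `na_dtypes` parameter of its own, so the option travels through `**kwargs` into `DataFrame.from_records`, which rejects it. The sketch below reproduces that pattern and the usual fix with stand-in functions; `from_records_stub`, `read_forwarding_everything` and `read_with_named_option` are hypothetical names, not pandavro or pandas code.

```python
def from_records_stub(records, columns=None):
    """Stand-in for a library function that accepts only its own keywords."""
    return {"records": list(records), "columns": columns}


def read_forwarding_everything(records, **kwargs):
    # Mirrors the failing pattern: unknown options reach the inner call.
    return from_records_stub(records, **kwargs)


def read_with_named_option(records, na_dtypes=False, **kwargs):
    # The wrapper-level option is a named parameter, so it never
    # ends up in **kwargs and is not forwarded.
    result = from_records_stub(records, **kwargs)
    result["na_dtypes"] = na_dtypes
    return result


try:
    read_forwarding_everything([{"a": 1}], na_dtypes=True)
except TypeError as exc:
    print("forwarded option rejected:", exc)

print(read_with_named_option([{"a": 1}], na_dtypes=True))
```

As a possible workaround on the installed version, reading without `na_dtypes` and then calling `df.convert_dtypes()` (available since pandas 1.0) may give similar nullable dtypes; upgrading pandavro is the other route.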

Release 1.5.x

Overview

Use git tag to upload for PyPI #16.

Release 1.5.x:

Fix incorrect type inference, Python compatibility issues, support compression #12 @dargueta
Feature/add support append file #13 @AlanTaranti
Add section to README that explains how pandavro infers schema #15 @The-Fonz
Use tags for deploy #16 @ynqa

decimal logical type : TypeError: can only concatenate str (not "int") to str

I am trying to convert an existing CSV to Avro using pandavro.

I am not able to resolve the error below:
File "fastavro/_logical_writers.pyx", line 130, in fastavro._logical_writers.prepare_bytes_decimal
File "fastavro/_logical_writers.pyx", line 143, in fastavro._logical_writers.prepare_bytes_decimal
TypeError: can only concatenate str (not "int") to str

I checked my CSV, avsc, and pandavro code multiple times and cannot find the problem. I am not savvy enough to call it a bug.
Can anyone give me some pointers?

Data in the CSV: 999.879
The column p_cost in the avsc:
{ "name": "p_cost", "type": {"name": "decimalEntry", "type": "bytes", "logicalType": "decimal", "precision": 15, "scale": 3} },
The lines of code:

from decimal import Decimal, getcontext

import pandas as pd
import pandavro as pdx

def convert_to_decimal(val):
    """
    Convert the string number value to a Decimal
    - Must set precision and scale beforehand
    """
    return Decimal(val)

schema_promotion = load_schema("promotion.avsc")
df_promotion = pd.read_csv('/scratch/tpcds_1/promotion/promotion.dat', delimiter='|', header=None,
    usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18],
    names=['p_promo_sk',....,'p_cost',...,'p_discount_active'],
    dtype={'p_cost': 'str'})

getcontext().prec = 15 # set precision of all future decimals
type(df_promotion['p_cost'])

df_promotion['p_cost'] = df_promotion['p_cost'].apply(convert_to_decimal)
pdx.to_avro('test_promotion.avro', df_promotion, schema=schema_promotion)

This throws the error below:

Traceback (most recent call last):
File "perfectlyrandom.py", line 313, in <module>
promotion()
File "perfectlyrandom.py", line 262, in promotion
pdx.to_avro('test_promotion.avro', df_promotion, schema=schema_promotion )
File "/home/opc/.local/lib/python3.8/site-packages/pandavro/__init__.py", line 322, in to_avro
fastavro.writer(f, schema=schema,
File "fastavro/_write.pyx", line 727, in fastavro._write.writer
File "fastavro/_write.pyx", line 680, in fastavro._write.Writer.write
File "fastavro/_write.pyx", line 432, in fastavro._write.write_data
File "fastavro/_write.pyx", line 422, in fastavro._write.write_data
File "fastavro/_write.pyx", line 366, in fastavro._write.write_record
File "fastavro/_write.pyx", line 387, in fastavro._write.write_data
File "fastavro/_logical_writers.pyx", line 130, in fastavro._logical_writers.prepare_bytes_decimal
File "fastavro/_logical_writers.pyx", line 143, in fastavro._logical_writers.prepare_bytes_decimal
TypeError: can only concatenate str (not "int") to str

If the full schema definition and pandas DataFrame definition are needed, I can provide them.
pip list:
avro-python3 1.10.2
fastavro 1.5.1
numpy 1.23.3
pandas 1.5.0
pandavro 1.7.1
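
The issue does not establish the root cause of the `TypeError`. As a way to check the inputs, the sketch below shows what a decimal logical type with precision 15 and scale 3 expects according to the Avro specification: a `Decimal` with a fixed number of fractional digits, stored as the two's-complement big-endian bytes of its unscaled integer. `to_fixed_scale` and `decimal_to_avro_bytes` are hypothetical helper names, not fastavro or pandavro functions.

```python
from decimal import Decimal


def to_fixed_scale(value, precision=15, scale=3):
    """Parse a string into a Decimal with exactly `scale` fractional digits."""
    quantum = Decimal(1).scaleb(-scale)  # 0.001 for scale=3
    result = Decimal(value).quantize(quantum)
    if len(result.as_tuple().digits) > precision:
        raise ValueError(f"{value!r} needs more than {precision} digits")
    return result


def decimal_to_avro_bytes(value, scale=3):
    """Two's-complement big-endian bytes of the unscaled integer."""
    unscaled = int(value.scaleb(scale))
    length = (unscaled.bit_length() + 8) // 8  # leaves room for the sign bit
    return unscaled.to_bytes(length, "big", signed=True)


cost = to_fixed_scale("999.879")
print(cost, decimal_to_avro_bytes(cost).hex())
```

If every value in `p_cost` survives `to_fixed_scale`, the data side is probably fine, and the schema as loaded (for example the types of `precision` and `scale`) would be the next thing to inspect.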

Add compatibility with Python 3.11

The current version pandavro==1.7.1 is not compatible with Python 3.11 because of the pinned dependency fastavro==1.5.1. Could this be fixed, so that any app using pandavro can be upgraded to the latest version of Python?

I understand that this issue is closely related to #39, but I still decided to open it for raising awareness.

Support for bytes

Hi! first of all, thanks for this very useful package. We use it in our ETL and it's really convenient.

I wonder whether there is support for bytes and I missed it. When I add a column to the dataframe being bytes, I get the error

TypeError: argument of type 'NoneType' is not iterable

which I'm getting with other complex types. This doesn't seem a very complex type, so I wonder if it'd be very difficult to add.

At the moment what I'm doing is this:

schema = pdx.schema_infer(df)
bytes_field_idx = next(idx for idx, field in enumerate(schema["fields"]) if field["name"] == "bytes_field")
schema["fields"][bytes_field_idx]["type"] = ["null", "bytes"]

pdx.to_avro(
        str(path),
        df,
        schema=schema,
)

but of course it would be great if I could delegate everything to schema_infer. Am I missing something? It'd be great to support pathlib.Path as well, but that's not such a big deal :)
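
The workaround above can be wrapped in a small reusable helper. This is a minimal sketch that works on a plain dict shaped like the schema in the snippet (a `fields` list of `name`/`type` entries); `override_field_types` is a hypothetical name, and the `inferred` dict is hand-written here rather than produced by `schema_infer`.

```python
import copy
from pathlib import Path


def override_field_types(schema, overrides):
    """Return a copy of `schema` with the given field types replaced.

    `overrides` maps field name -> Avro type. Unknown names raise KeyError.
    """
    patched = copy.deepcopy(schema)
    by_name = {field["name"]: field for field in patched["fields"]}
    for name, avro_type in overrides.items():
        by_name[name]["type"] = avro_type
    return patched


# Hand-written stand-in for the dict that schema_infer would return.
inferred = {
    "type": "record",
    "name": "Root",
    "fields": [
        {"name": "id", "type": ["null", "long"]},
        {"name": "bytes_field", "type": ["null", "string"]},
    ],
}

schema = override_field_types(inferred, {"bytes_field": ["null", "bytes"]})
target = str(Path("out") / "data.avro")  # plain string path, as in the issue
print(schema["fields"][1], target)
```

Copying the schema first keeps the inferred dict untouched, which helps when the same inferred schema is reused for several writes.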

Problem with datatype datetime64[ns]

I'm having an issue trying to convert an Apache Parquet file into an Apache Avro file.

This is the code:

import pyarrow.parquet as pq
import pandavro as pdx

table = pq.read_table('/media/sf_AWS/kafka/acciones_postcorte.parq')
pdx.to_avro('opers.avro', table.to_pandas())

This is the schema of the file:

divpol: string
division: string
poliza: string
asignacion: string
num_asignacion: string
f_asignacion: timestamp[ms]
campana: string
campanacontable: string
despacho_empresa: string
municipio: string
deudagestionar: double
deudavencida: double
d_oap: double
fresultado: timestamp[ms]
resultado: string
gestor: string
captura: timestamp[ms]
gpo1: string
toap: string
visitado: int64
visitado_aus: int64
ps_ano: timestamp[ms]
dg_filepath: string
dg_date: timestamp[ms]
dg_schema_version: int64
__index_level_0__: int64
metadata

{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "divpol", "field_name": "divpol", "pandas_type": "uni'
b'code", "numpy_type": "object", "metadata": null}, {"name": "divi'
b'sion", "field_name": "division", "pandas_type": "unicode", "nump'
b'y_type": "object", "metadata": null}, {"name": "poliza", "field_'
b'name": "poliza", "pandas_type": "unicode", "numpy_type": "object'
b'", "metadata": null}, {"name": "asignacion", "field_name": "asig'
b'nacion", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "num_asignacion", "field_name": "num_asig'
b'nacion", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "f_asignacion", "field_name": "f_asignaci'
b'on", "pandas_type": "datetime", "numpy_type": "datetime64[ns]", '
b'"metadata": null}, {"name": "campana", "field_name": "campana", '
b'"pandas_type": "unicode", "numpy_type": "object", "metadata": nu'
b'll}, {"name": "campanacontable", "field_name": "campanacontable"'
b', "pandas_type": "unicode", "numpy_type": "object", "metadata": '
b'null}, {"name": "despacho_empresa", "field_name": "despacho_empr'
b'esa", "pandas_type": "unicode", "numpy_type": "object", "metadat'
b'a": null}, {"name": "municipio", "field_name": "municipio", "pan'
b'das_type": "unicode", "numpy_type": "object", "metadata": null},'
b' {"name": "deudagestionar", "field_name": "deudagestionar", "pan'
b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
b', {"name": "deudavencida", "field_name": "deudavencida", "pandas'
b'_type": "float64", "numpy_type": "float64", "metadata": null}, {'
b'"name": "d_oap", "field_name": "d_oap", "pandas_type": "float64"'
b', "numpy_type": "float64", "metadata": null}, {"name": "fresulta'
b'do", "field_name": "fresultado", "pandas_type": "datetime", "num'
b'py_type": "datetime64[ns]", "metadata": null}, {"name": "resulta'
b'do", "field_name": "resultado", "pandas_type": "unicode", "numpy'
b'_type": "object", "metadata": null}, {"name": "gestor", "field_n'
b'ame": "gestor", "pandas_type": "unicode", "numpy_type": "object"'
b', "metadata": null}, {"name": "captura", "field_name": "captura"'
b', "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "me'
b'tadata": null}, {"name": "gpo1", "field_name": "gpo1", "pandas_t'
b'ype": "unicode", "numpy_type": "object", "metadata": null}, {"na'
b'me": "toap", "field_name": "toap", "pandas_type": "unicode", "nu'
b'mpy_type": "object", "metadata": null}, {"name": "visitado", "fi'
b'eld_name": "visitado", "pandas_type": "int64", "numpy_type": "in'
b't64", "metadata": null}, {"name": "visitado_aus", "field_name": '
b'"visitado_aus", "pandas_type": "int64", "numpy_type": "int64", "'
b'metadata": null}, {"name": "ps_ano", "field_name": "ps_ano", "pa'
b'ndas_type": "datetime", "numpy_type": "datetime64[ns]", "metadat'
b'a": null}, {"name": "dg_filepath", "field_name": "dg_filepath", '
b'"pandas_type": "unicode", "numpy_type": "object", "metadata": nu'
b'll}, {"name": "dg_date", "field_name": "dg_date", "pandas_type":'
b' "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, '
b'{"name": "dg_schema_version", "field_name": "dg_schema_version",'
b' "pandas_type": "int64", "numpy_type": "int64", "metadata": null'
b'}, {"name": null, "field_name": "__index_level_0__", "pandas_typ'
b'e": "int64", "numpy_type": "int64", "metadata": null}], "pandas_'
b'version": "0.22.0"}'}

This is the error:

Traceback (most recent call last):
  File "p2a.py", line 10, in <module>
    pdx.to_avro('opers.avro', table.to_pandas())
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 77, in to_avro
    schema = __schema_infer(df)
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 33, in __schema_infer
    fields = __fields_infer(df)
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 27, in __fields_infer
    type_avro = __type_infer(type_np)
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 21, in __type_infer
    raise TypeError('Invalid type: {}'.format(t))
TypeError: Invalid type: datetime64[ns]
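
The traceback shows the type inference raising for a dtype it has no mapping for. The sketch below illustrates the general shape of such a lookup with a datetime entry added. It is not pandavro's actual code: `AVRO_TYPES` and `infer_field` are hypothetical, the table is keyed by dtype name to stay dependency-free, and mapping `datetime64[ns]` to `timestamp-micros` is an assumption that drops sub-microsecond precision.

```python
# Hypothetical lookup table keyed by dtype name; not pandavro's actual code.
AVRO_TYPES = {
    "bool": "boolean",
    "int64": "long",
    "float64": "double",
    "object": "string",
    # Assumption: nanosecond timestamps are stored with microsecond precision.
    "datetime64[ns]": {"type": "long", "logicalType": "timestamp-micros"},
}


def infer_field(name, dtype_name):
    """Build a nullable Avro field for a column, or raise on unknown dtypes."""
    try:
        avro_type = AVRO_TYPES[dtype_name]
    except KeyError:
        raise TypeError("Invalid type: {}".format(dtype_name)) from None
    return {"name": name, "type": ["null", avro_type]}


print(infer_field("f_asignacion", "datetime64[ns]"))
print(infer_field("visitado", "int64"))
```

On a version without such a mapping, converting the datetime columns to strings or integer epoch values before calling `to_avro` would be one way around the error.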

pypi homepage points to the wrong link

The "Home Page" link in https://pypi.python.org/pypi/pandavro/1.0.0 points to https://github.com/ynqa/pandabro instead of https://github.com/ynqa/pandavro (notice the typo pandabro vs. pandavro) . I am wondering if this can be corrected?

PS: Not sure if this is the correct place to report this issue, as the typo is in the PyPI listing page. I am reporting it here since I do not have a PyPI account.

Latest pandas does not have DatetimeTZDtypeType

There was a recent change that added a check for type pd.core.dtypes.dtypes.DatetimeTZDtypeType. This does not exist any more in the latest version of Pandas, unfortunately, throwing an error.

AttributeError: module 'pandas.core.dtypes.dtypes' has no attribute 'DatetimeTZDtypeType'

Auto-deploy on merge to master?

I see that the config is

deploy:
  provider: pypi
  user: pyncha
  password:
    secure: ****
  on:
    tags: true
    python: 3.9

so it doesn't auto-release https://pypi.org/project/pandavro/#history. Not being an expert in Travis, it looks to me like it will only release a new version once you manually add a tag to a branch, as explained in https://docs.travis-ci.com/user/deployment/pypi/#deploying-tags. Wouldn't it be easier to release every time you merge to master? i.e.:

  on:
    branch: master

as explained here: https://docs.travis-ci.com/user/deployment/pypi/#deploying-specific-branches

WDYT?

Datetime-like values errors

I ran into some problems with datetime-like values.
Tried with pandas 1.0.3 and 0.25.3; neither works.
fastavro 0.23.4.

date

Traceback (most recent call last):

  File "<ipython-input-180-724a28b4d15a>", line 1, in <module>
    pdx.to_avro('test.avro', df.drop(columns=['event_timestamp']))

  File "/opt/anaconda3/lib/python3.7/site-packages/pandavro/__init__.py", line 151, in to_avro
    records=df.to_dict('records'), codec=codec)

  File "fastavro/_write.pyx", line 628, in fastavro._write.writer

  File "fastavro/_write.pyx", line 581, in fastavro._write.Writer.write

  File "fastavro/_write.pyx", line 335, in fastavro._write.write_data

  File "fastavro/_write.pyx", line 285, in fastavro._write.write_record

  File "fastavro/_write.pyx", line 333, in fastavro._write.write_data

  File "fastavro/_write.pyx", line 249, in fastavro._write.write_union

ValueError: datetime.date(2020, 6, 10) (type <class 'datetime.date'>) do not match ['null', 'string']

NaTType

Traceback (most recent call last):

  File "<ipython-input-182-991911d54074>", line 1, in <module>
    pdx.to_avro('test.avro', df.drop(columns=['event_date','items']))

  File "/opt/anaconda3/lib/python3.7/site-packages/pandavro/__init__.py", line 151, in to_avro
    records=df.to_dict('records'), codec=codec)

  File "fastavro/_write.pyx", line 628, in fastavro._write.writer

  File "fastavro/_write.pyx", line 581, in fastavro._write.Writer.write

  File "fastavro/_write.pyx", line 335, in fastavro._write.write_data

  File "fastavro/_write.pyx", line 285, in fastavro._write.write_record

  File "fastavro/_write.pyx", line 333, in fastavro._write.write_data

  File "fastavro/_write.pyx", line 234, in fastavro._write.write_union

  File "fastavro/_validation.pyx", line 169, in fastavro._validation._validate

  File "fastavro/_validation.pyx", line 178, in fastavro._validation._validate

  File "fastavro/_logical_writers.pyx", line 72, in fastavro._logical_writers.prepare_timestamp_micros

  File "fastavro/_logical_writers.pyx", line 105, in fastavro._logical_writers.prepare_timestamp_micros

  File "pandas/_libs/tslibs/nattype.pyx", line 58, in pandas._libs.tslibs.nattype._make_error_func.f

ValueError: NaTType does not support timestamp
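
In the first traceback the column's schema is `['null', 'string']` while the values are `datetime.date` objects; in the second, a `NaT` reaches fastavro's timestamp writer. One way to sidestep both is to normalise the records before writing. The sketch below is an illustration under assumptions: `is_missing`, `clean_value` and `clean_record` are hypothetical helpers, dates are rendered as ISO strings to match the string schema in the error, and the self-inequality check is assumed to catch `pandas.NaT` in the same way it catches `NaN`.

```python
from datetime import date, datetime


def is_missing(value):
    """True for None and for NaN-like scalars that are not equal to themselves."""
    return value is None or value != value


def clean_value(value):
    """Map missing values to None and plain dates to ISO strings."""
    if is_missing(value):
        return None
    # datetime is a subclass of date, so leave full timestamps alone.
    if isinstance(value, date) and not isinstance(value, datetime):
        return value.isoformat()
    return value


def clean_record(record):
    return {key: clean_value(value) for key, value in record.items()}


row = {"event_date": date(2020, 6, 10), "event_timestamp": float("nan"), "n": 1}
print(clean_record(row))
```

If the column should be a real Avro `date` instead of a string, the alternative is to keep the `datetime.date` values and declare the field as an `int` with `logicalType` `date` in an explicit schema.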

Add Python 3.12 support

Hi!
I use pandavro in a couple projects and would love to see Python 3.12 support soon!

Currently, when trying to install pandavro 1.7.2 under Python 3.12, the build of fastavro 1.5.4 fails:

Compiler crash traceback from this point on:
  File "/tmp/tmprnf1_8vp/.venv/lib64/python3.12/site-packages/Cython/Compiler/Nodes.py", line 2786, in call_self_node
    type_entry = self.type.args[0].type.entry
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PyObjectType' object has no attribute 'entry'

I know that fastavro 1.9.0 has Python 3.12 support, so this should be feasible.
Thank you in advance and keep up the good work!

Tests failing in master

Hi, I was thinking about contributing to #27, but I just ran the tests in master and they fail for me:

FAILED tests/pandavro_test.py::test_buffer_e2e - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_file_path_e2e - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_delegation - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_append - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_dataframe_kwargs - AssertionError: numpy array are different
========================= 5 failed, 4 passed, 5 warnings in 0.60s ===========================

I can see

(Pdb) expect
   Boolean          DateTime64   Float64  Int64 String
0     True 2018-12-31 23:00:00 -0.579613      8    foo
1    False 2019-01-01 23:00:00 -0.922827      3    bar
2     True 2019-01-02 23:00:00 -1.070658      8    foo
3    False 2019-01-03 23:00:00 -0.072218      2    bar
4     True 2019-01-04 23:00:00 -1.604049      3    foo
5    False 2019-01-05 23:00:00 -0.822774      0    bar
6     True 2019-01-06 23:00:00 -0.504930      4    foo
7    False 2019-01-07 23:00:00  1.357435      0    bar
(Pdb) dataframe
   Boolean DateTime64   Float64  Int64 String
0     True 2019-01-01 -0.579613      8    foo
1    False 2019-01-02 -0.922827      3    bar
2     True 2019-01-03 -1.070658      8    foo
3    False 2019-01-04 -0.072218      2    bar
4     True 2019-01-05 -1.604049      3    foo
5    False 2019-01-06 -0.822774      0    bar
6     True 2019-01-07 -0.504930      4    foo
7    False 2019-01-08  1.357435      0    bar

in test_append. Any ideas? There's a mismatch of 1h, maybe some rounding issue? @ynqa
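
A one-hour shift like this is often a timezone conversion rather than rounding, although the issue does not confirm the cause. The sketch below shows how a naive midnight value moves to 23:00 on the previous day when it is interpreted as local time and converted to UTC. It assumes a machine at UTC+1, which is a guess based on the size of the offset.

```python
from datetime import datetime, timedelta, timezone

utc_plus_one = timezone(timedelta(hours=1))  # assumed local zone

naive = datetime(2019, 1, 1)  # the midnight value shown in the pdb output

# Interpreting the naive value as local time and converting to UTC
# moves it back by the zone offset.
as_utc = naive.replace(tzinfo=utc_plus_one).astimezone(timezone.utc)
shifted = as_utc.replace(tzinfo=None)

print(naive, "->", shifted)
```

If this is what happens, running the tests with the `TZ` environment variable set to `UTC` would be a quick way to confirm it.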

allow for process_record() while reading in avro

Feature request:
Could you allow a process_record function to be applied while reading Avro? Here is a suggestion.

def __file_to_dataframe(f, schema, process_record=None, **kwargs):
    reader = fastavro.reader(f, reader_schema=schema)
    if process_record:
        # Apply the caller's hook to each record before building the frame.
        records = [process_record(r) for r in reader]
    else:
        records = list(reader)
    return pd.DataFrame.from_records(records, **kwargs)
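
To make the request concrete, here is a sketch of what a `process_record` callable could look like and how it would be applied. `flatten` and `load_records` are hypothetical names, and a plain list of dicts stands in for `fastavro.reader(...)`, which also yields one dict per record.

```python
def flatten(record, sep="_"):
    """Example process_record callable: lift nested dict fields to the top level."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for inner_key, inner_value in flatten(value, sep).items():
                flat[f"{key}{sep}{inner_key}"] = inner_value
        else:
            flat[key] = value
    return flat


def load_records(reader, process_record=None):
    """Materialise records, optionally transforming each one."""
    if process_record is None:
        return list(reader)
    return [process_record(record) for record in reader]


# A list of dicts stands in for the Avro reader.
fake_reader = [{"id": 1, "pos": {"lat": 1.5, "lon": 2.5}}]
rows = load_records(fake_reader, process_record=flatten)
print(rows)
```

Flattening at read time avoids object columns holding nested dicts in the resulting DataFrame.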

Does the fastavro dependency version need to be pinned?

setup.py has:

    install_requires=[
        # fixed versions.
        'fastavro==1.5.1',
        'pandas>=1.1',
        # https://pandas.pydata.org/pandas-docs/version/1.1/getting_started/install.html#dependencies
        'numpy>=1.15.4',
    ],

This causes a dependency resolution failure for me because I'm using another package that requires fastavro>=1.5.4.

Would it be possible to relax that requirement to be 'fastavro>=1.5.1