ynqa / pandavro
Apache Avro <-> pandas DataFrame
License: MIT License
The latest release (1.7.2) is restricting fastavro~=1.5.1, which translates to >=1.5.1,<1.6.0.
Currently in trunk this dependency is loosened to 'fastavro>=1.5.1,<2.0.0'.
Could we please have a new release with this change? Any version of fastavro<1.8.2 cannot be used on M-series Macs, and this is causing some pain for local development as I need to constantly work around it.
Thanks!
I have my Avro Schema as below:
schema = {
    "namespace": "example.avro",
    "type": "record",
    "name": "IoTData",
    "fields": [
        {"name": "nodeId", "type": ["null", "string"], "default": None},
        {"name": "displayName", "type": ["null", "string"], "default": None},
        {"name": "dataType", "type": ["null", "string"], "default": None},
        {"name": "statusCode", "type": ["null", "string"], "default": None},
        {"name": "timestamp", "type": ["null", {"type": "string", "logicalType": "timestamp-micros"}], "default": None},
        {"name": "sourceTimestamp", "type": ["null", {"type": "string", "logicalType": "timestamp-micros"}], "default": None},
        {"name": "value", "type": ["null", "double"], "default": None},
    ]
}
I had to set the default values to None because these fields may sometimes be blank.
File content is as below:
[{'nodeId': 'ns=2;s=SCD30_CO2', 'displayName': 'SCD30_CO2', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:21.622480', 'sourceTimestamp': '2024-02-26T03:56:20.224859', 'value': 61.83}, {'nodeId': 'ns=2;s=SCD30_TEMPERATURE', 'displayName': 'SCD30_TEMPERATURE', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:21.622480', 'sourceTimestamp': '2024-02-26T03:56:20.224859', 'value': 27.35}, {'nodeId': 'ns=2;s=SCD30_HUMIDITY', 'displayName': 'SCD30_HUMIDITY', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:21.622480', 'sourceTimestamp': '2024-02-26T03:56:20.224859', 'value': 41.49}, {'nodeId': 'ns=2;s=SCD30_CO2', 'displayName': 'SCD30_CO2', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:22.704777', 'sourceTimestamp': '2024-02-26T03:56:22.250094', 'value': 63.27}]
When I convert this to pandas, the data types are as below:
0 nodeId 60 non-null string
1 displayName 60 non-null string
2 dataType 60 non-null string
3 statusCode 60 non-null string
4 timestamp 60 non-null object
5 sourceTimestamp 60 non-null object
6 value 60 non-null float64
It's unable to infer the timestamp type.
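One thing worth noting: in the Avro specification, the timestamp-micros logical type is defined on long, not string, so a ["null", "string"] union decodes as plain text and pandas sees object columns. As a sketch of a workaround (not pandavro's own behavior), the ISO-8601 strings can be parsed after reading; the DataFrame below is a stand-in for the records shown above:

```python
import pandas as pd

# Stand-in records shaped like the file content above (illustrative subset).
df = pd.DataFrame([
    {"timestamp": "2024-02-25T22:56:21.622480",
     "sourceTimestamp": "2024-02-26T03:56:20.224859",
     "value": 61.83},
    {"timestamp": "2024-02-25T22:56:22.704777",
     "sourceTimestamp": "2024-02-26T03:56:22.250094",
     "value": 63.27},
])

# Parse the ISO-8601 strings into proper datetime64[ns] columns after reading;
# errors="coerce" turns unparseable or missing values into NaT.
for col in ("timestamp", "sourceTimestamp"):
    df[col] = pd.to_datetime(df[col], errors="coerce")

print(df.dtypes)
```

Alternatively, changing the schema to store a long with logicalType timestamp-micros would let fastavro hand back datetime objects directly.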
ValueError: snappy codec is supported but you need to install python-snappy
Hi,
First of all: many thanks for pandavro! It's incredibly useful in day-to-day data operations.
When using read_avro() with na_dtypes=True, I get the following TypeError, using Pandas 1.3.5:
from_records() got an unexpected keyword argument 'na_dtypes'
Will post full trace below.
I'd like to humbly ask whether this is a known issue and whether there is a workaround. If it is a new issue, I'm willing to help fix it. Any pointers to get started are deeply appreciated.
Full command:
df = pdx.read_avro('./test.avro', na_dtypes=True)
Full trace:
TypeError Traceback (most recent call last)
/tmp/ipykernel_4020/1232288353.py in <module>
----> 1 df = pdx.read_avro('./test.avro', na_dtypes=True)
/opt/conda/lib/python3.7/site-packages/pandavro/__init__.py in read_avro(file_path_or_buffer, schema, **kwargs)
194 if isinstance(file_path_or_buffer, six.string_types):
195 with open(file_path_or_buffer, 'rb') as f:
--> 196 return __file_to_dataframe(f, schema, **kwargs)
197 else:
198 return __file_to_dataframe(file_path_or_buffer, schema, **kwargs)
/opt/conda/lib/python3.7/site-packages/pandavro/__init__.py in __file_to_dataframe(f, schema, **kwargs)
177 def __file_to_dataframe(f, schema, **kwargs):
178 reader = fastavro.reader(f, reader_schema=schema)
--> 179 return pd.DataFrame.from_records(list(reader), **kwargs)
180
181
TypeError: from_records() got an unexpected keyword argument 'na_dtypes'
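From the trace, the kwarg is forwarded straight to pd.DataFrame.from_records, which has no na_dtypes parameter in pandas 1.3.5. Until a release handles the flag itself, one possible workaround (a sketch, using a stand-in frame instead of a real .avro file) is to read without the flag and call DataFrame.convert_dtypes() afterwards:

```python
import pandas as pd

# In practice: df = pdx.read_avro('./test.avro')  # read without na_dtypes
# Stand-in frame for illustration:
df = pd.DataFrame({"a": [1, 2, None], "b": ["x", None, "z"]})

# Convert columns to pandas' nullable extension dtypes manually.
df = df.convert_dtypes()
print(df.dtypes)
```

convert_dtypes() is what na_dtypes=True is meant to achieve: integer columns with missing values become Int64 instead of float64, and text columns become string.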
I am trying to convert an existing CSV to Avro using pandavro.
I am not able to resolve the below error:
File "fastavro/_logical_writers.pyx", line 130, in fastavro._logical_writers.prepare_bytes_decimal
File "fastavro/_logical_writers.pyx", line 143, in fastavro._logical_writers.prepare_bytes_decimal
TypeError: can only concatenate str (not "int") to str
I checked my CSV, .avsc, and pandavro code multiple times, but I am unable to find the problem. I am not savvy enough to call it a bug.
Can anyone provide me with some pointers?
Data in the CSV: 999.879
The column p_cost in the .avsc:
{ "name": "p_cost", "type": {"name": "decimalEntry", "type": "bytes", "logicalType": "decimal", "precision": 15, "scale": 3} },
The lines of code:

def convert_to_decimal(val):
    """
    Convert the string number value to a Decimal
    - Must set precision and scale beforehand
    """
    return Decimal(val)

schema_promotion = load_schema("promotion.avsc")
df_promotion = pd.read_csv(
    '/scratch/tpcds_1/promotion/promotion.dat',
    delimiter='|', header=None,
    usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18],
    names=['p_promo_sk', ..., 'p_cost', ..., 'p_discount_active'],
    dtype={'p_cost': 'str'})
getcontext().prec = 15  # set precision of all future decimals
type(df_promotion['p_cost'])
df_promotion['p_cost'] = df_promotion['p_cost'].apply(convert_to_decimal)
pdx.to_avro('test_promotion.avro', df_promotion, schema=schema_promotion)
throws below error:
Traceback (most recent call last):
File "perfectlyrandom.py", line 313, in
promotion()
File "perfectlyrandom.py", line 262, in promotion
pdx.to_avro('test_promotion.avro', df_promotion, schema=schema_promotion )
File "/home/opc/.local/lib/python3.8/site-packages/pandavro/__init__.py", line 322, in to_avro
fastavro.writer(f, schema=schema,
File "fastavro/_write.pyx", line 727, in fastavro._write.writer
File "fastavro/_write.pyx", line 680, in fastavro._write.Writer.write
File "fastavro/_write.pyx", line 432, in fastavro._write.write_data
File "fastavro/_write.pyx", line 422, in fastavro._write.write_data
File "fastavro/_write.pyx", line 366, in fastavro._write.write_record
File "fastavro/_write.pyx", line 387, in fastavro._write.write_data
File "fastavro/_logical_writers.pyx", line 130, in fastavro._logical_writers.prepare_bytes_decimal
File "fastavro/_logical_writers.pyx", line 143, in fastavro._logical_writers.prepare_bytes_decimal
TypeError: can only concatenate str (not "int") to str
If the full schema definition and pandas DataFrame definition are needed, I can provide them.
pip list:
avro-python3 1.10.2
fastavro 1.5.1
numpy 1.23.3
pandas 1.5.0
pandavro 1.7.1
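For what it's worth, fastavro's decimal writer expects Decimal values whose exponent matches the schema's scale, and any NaN that pandas introduced for missing cells is a float, not a Decimal. A hedged sketch (to_scaled_decimal is a hypothetical helper, not part of pandavro) that quantizes to scale 3 and maps missing values to None; note that getcontext().prec only affects Decimal arithmetic, not construction or quantize with an explicit exponent:

```python
from decimal import Decimal

def to_scaled_decimal(val, scale=3):
    """Convert a value like '999.879' to a Decimal quantized to the schema's scale.

    Returns None for missing values so a ["null", ...] union could absorb them.
    """
    if val is None or val != val:  # val != val catches float('nan') from pandas
        return None
    # Decimal(1).scaleb(-3) == Decimal('0.001'), i.e. three fractional digits.
    return Decimal(str(val)).quantize(Decimal(1).scaleb(-scale))

print(to_scaled_decimal("999.879"))  # -> 999.879
```

Applying this instead of a bare Decimal(val) in the apply call above would at least rule out stray NaN or mismatched-scale values as the cause.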
The current version pandavro==1.7.1 is not compatible with Python 3.11 because of the pinned dependency fastavro==1.5.1. I would like to request a fix for this, so any app using pandavro can be upgraded to the latest version of Python.
I understand that this issue is closely related to #39, but I still decided to open it to raise awareness.
Hi! first of all, thanks for this very useful package. We use it in our ETL and it's really convenient.
I wonder whether there is support for bytes and I missed it. When I add a bytes column to the dataframe, I get the error
TypeError: argument of type 'NoneType' is not iterable
which is the same error I get with other complex types. bytes doesn't seem like a very complex type, though, so I wonder if it'd be difficult to add.
At the moment what I'm doing is this:
schema = pdx.schema_infer(df)
bytes_field_idx = next(idx for idx, field in enumerate(schema["fields"]) if field["name"] == "bytes_field")
schema["fields"][bytes_field_idx]["type"] = ["null", "bytes"]
pdx.to_avro(
str(path),
df,
schema=schema,
)
but of course it would be great if I could delegate everything to schema_infer. Am I missing something? It'd be great to support pathlib.Path as well, but that's not such a big deal :)
I'm having an issue trying to convert an Apache Parquet file into an Apache Avro file.
This is the code:
import pyarrow.parquet as pq
import pandavro as pdx

table = pq.read_table('/media/sf_AWS/kafka/acciones_postcorte.parq')
pdx.to_avro('opers.avro', table.to_pandas())
This is the schema of the file:
divpol: string
division: string
poliza: string
asignacion: string
num_asignacion: string
f_asignacion: timestamp[ms]
campana: string
campanacontable: string
despacho_empresa: string
municipio: string
deudagestionar: double
deudavencida: double
d_oap: double
fresultado: timestamp[ms]
resultado: string
gestor: string
captura: timestamp[ms]
gpo1: string
toap: string
visitado: int64
visitado_aus: int64
ps_ano: timestamp[ms]
dg_filepath: string
dg_date: timestamp[ms]
dg_schema_version: int64
index_level_0: int64
metadata
{b'pandas': b'{"index_columns": ["index_level_0"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "divpol", "field_name": "divpol", "pandas_type": "uni'
b'code", "numpy_type": "object", "metadata": null}, {"name": "divi'
b'sion", "field_name": "division", "pandas_type": "unicode", "nump'
b'y_type": "object", "metadata": null}, {"name": "poliza", "field_'
b'name": "poliza", "pandas_type": "unicode", "numpy_type": "object'
b'", "metadata": null}, {"name": "asignacion", "field_name": "asig'
b'nacion", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "num_asignacion", "field_name": "num_asig'
b'nacion", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "f_asignacion", "field_name": "f_asignaci'
b'on", "pandas_type": "datetime", "numpy_type": "datetime64[ns]", '
b'"metadata": null}, {"name": "campana", "field_name": "campana", '
b'"pandas_type": "unicode", "numpy_type": "object", "metadata": nu'
b'll}, {"name": "campanacontable", "field_name": "campanacontable"'
b', "pandas_type": "unicode", "numpy_type": "object", "metadata": '
b'null}, {"name": "despacho_empresa", "field_name": "despacho_empr'
b'esa", "pandas_type": "unicode", "numpy_type": "object", "metadat'
b'a": null}, {"name": "municipio", "field_name": "municipio", "pan'
b'das_type": "unicode", "numpy_type": "object", "metadata": null},'
b' {"name": "deudagestionar", "field_name": "deudagestionar", "pan'
b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
b', {"name": "deudavencida", "field_name": "deudavencida", "pandas'
b'_type": "float64", "numpy_type": "float64", "metadata": null}, {'
b'"name": "d_oap", "field_name": "d_oap", "pandas_type": "float64"'
b', "numpy_type": "float64", "metadata": null}, {"name": "fresulta'
b'do", "field_name": "fresultado", "pandas_type": "datetime", "num'
b'py_type": "datetime64[ns]", "metadata": null}, {"name": "resulta'
b'do", "field_name": "resultado", "pandas_type": "unicode", "numpy'
b'type": "object", "metadata": null}, {"name": "gestor", "field_n'
b'ame": "gestor", "pandas_type": "unicode", "numpy_type": "object"'
b', "metadata": null}, {"name": "captura", "field_name": "captura"'
b', "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "me'
b'tadata": null}, {"name": "gpo1", "field_name": "gpo1", "pandas_t'
b'ype": "unicode", "numpy_type": "object", "metadata": null}, {"na'
b'me": "toap", "field_name": "toap", "pandas_type": "unicode", "nu'
b'mpy_type": "object", "metadata": null}, {"name": "visitado", "fi'
b'eld_name": "visitado", "pandas_type": "int64", "numpy_type": "in'
b't64", "metadata": null}, {"name": "visitado_aus", "field_name": '
b'"visitado_aus", "pandas_type": "int64", "numpy_type": "int64", "'
b'metadata": null}, {"name": "ps_ano", "field_name": "ps_ano", "pa'
b'ndas_type": "datetime", "numpy_type": "datetime64[ns]", "metadat'
b'a": null}, {"name": "dg_filepath", "field_name": "dg_filepath", '
b'"pandas_type": "unicode", "numpy_type": "object", "metadata": nu'
b'll}, {"name": "dg_date", "field_name": "dg_date", "pandas_type":'
b' "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, '
b'{"name": "dg_schema_version", "field_name": "dg_schema_version",'
b' "pandas_type": "int64", "numpy_type": "int64", "metadata": null'
b'}, {"name": null, "field_name": "index_level_0", "pandas_typ'
b'e": "int64", "numpy_type": "int64", "metadata": null}], "pandas'
b'version": "0.22.0"}'}
This is the error:
Traceback (most recent call last):
  File "p2a.py", line 10, in <module>
    pdx.to_avro('opers.avro', table.to_pandas())
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 77, in to_avro
    schema = __schema_infer(df)
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 33, in __schema_infer
    fields = __fields_infer(df)
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 27, in __fields_infer
    type_avro = __type_infer(type_np)
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 21, in __type_infer
    raise TypeError('Invalid type: {}'.format(t))
TypeError: Invalid type: datetime64[ns]
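That pandavro version's schema inference evidently has no mapping for datetime64[ns]. One hedged workaround (a sketch, not pandavro's own fix) is to render datetime columns as ISO-8601 strings before writing, so inference sees plain strings; the frame below stands in for the Parquet data:

```python
import pandas as pd

# Stand-in for a datetime column like f_asignacion in the schema above.
df = pd.DataFrame({"f_asignacion": pd.to_datetime(["2018-01-02", "2018-03-04"])})

# Render datetime64[ns] columns as ISO-8601 strings so schema inference
# sees ordinary string columns instead of an unsupported dtype.
for col in df.select_dtypes(include=["datetime64[ns]"]).columns:
    df[col] = df[col].dt.strftime("%Y-%m-%dT%H:%M:%S")

print(df.dtypes)
```

This loses the native timestamp type in the output Avro, of course; upgrading pandavro is the cleaner fix if a newer version is available.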
The "Home Page" link in https://pypi.python.org/pypi/pandavro/1.0.0 points to https://github.com/ynqa/pandabro instead of https://github.com/ynqa/pandavro (notice the typo: pandabro vs. pandavro). Could this be corrected?
PS: Not sure if this is the correct place to report this, as the typo is on the PyPI listing page. I am reporting it here since I do not have a PyPI account.
There was a recent change that added a check for type pd.core.dtypes.dtypes.DatetimeTZDtypeType
. This does not exist any more in the latest version of Pandas, unfortunately, throwing an error.
AttributeError: module 'pandas.core.dtypes.dtypes' has no attribute 'DatetimeTZDtypeType'
I see that the config is
deploy:
provider: pypi
user: pyncha
password:
secure: ****
on:
tags: true
python: 3.9
so it doesn't auto-release https://pypi.org/project/pandavro/#history. Not being an expert in Travis, it looks to me like it will only release a new version once you manually add a tag to a branch, as explained in https://docs.travis-ci.com/user/deployment/pypi/#deploying-tags. Wouldn't it be easier to just release every time you merge to master? i.e.:
on:
branch: master
as explained here: https://docs.travis-ci.com/user/deployment/pypi/#deploying-specific-branches
WDYT?
Got some problems with datetime-like values.
Tried with pandas 1.0.3 and 0.25.3; neither works. fastavro 0.23.4.
Traceback (most recent call last):
File "<ipython-input-180-724a28b4d15a>", line 1, in <module>
pdx.to_avro('test.avro', df.drop(columns=['event_timestamp']))
File "/opt/anaconda3/lib/python3.7/site-packages/pandavro/__init__.py", line 151, in to_avro
records=df.to_dict('records'), codec=codec)
File "fastavro/_write.pyx", line 628, in fastavro._write.writer
File "fastavro/_write.pyx", line 581, in fastavro._write.Writer.write
File "fastavro/_write.pyx", line 335, in fastavro._write.write_data
File "fastavro/_write.pyx", line 285, in fastavro._write.write_record
File "fastavro/_write.pyx", line 333, in fastavro._write.write_data
File "fastavro/_write.pyx", line 249, in fastavro._write.write_union
ValueError: datetime.date(2020, 6, 10) (type <class 'datetime.date'>) do not match ['null', 'string']
Traceback (most recent call last):
File "<ipython-input-182-991911d54074>", line 1, in <module>
pdx.to_avro('test.avro', df.drop(columns=['event_date','items']))
File "/opt/anaconda3/lib/python3.7/site-packages/pandavro/__init__.py", line 151, in to_avro
records=df.to_dict('records'), codec=codec)
File "fastavro/_write.pyx", line 628, in fastavro._write.writer
File "fastavro/_write.pyx", line 581, in fastavro._write.Writer.write
File "fastavro/_write.pyx", line 335, in fastavro._write.write_data
File "fastavro/_write.pyx", line 285, in fastavro._write.write_record
File "fastavro/_write.pyx", line 333, in fastavro._write.write_data
File "fastavro/_write.pyx", line 234, in fastavro._write.write_union
File "fastavro/_validation.pyx", line 169, in fastavro._validation._validate
File "fastavro/_validation.pyx", line 178, in fastavro._validation._validate
File "fastavro/_logical_writers.pyx", line 72, in fastavro._logical_writers.prepare_timestamp_micros
File "fastavro/_logical_writers.pyx", line 105, in fastavro._logical_writers.prepare_timestamp_micros
File "pandas/_libs/tslibs/nattype.pyx", line 58, in pandas._libs.tslibs.nattype._make_error_func.f
ValueError: NaTType does not support timestamp
Hi!
I use pandavro in a couple of projects and would love to see Python 3.12 support soon!
Currently, when trying to install pandavro 1.7.2 under Python 3.12, the build of fastavro 1.5.4 fails:
Compiler crash traceback from this point on:
File "/tmp/tmprnf1_8vp/.venv/lib64/python3.12/site-packages/Cython/Compiler/Nodes.py", line 2786, in call_self_node
type_entry = self.type.args[0].type.entry
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PyObjectType' object has no attribute 'entry'
I know that fastavro 1.9.0 has Python 3.12 support so this should be feasible.
Thank you in advance, and keep up the good work!
Hi, I was thinking about contributing to #27, but I just ran the tests on master and they fail for me:
FAILED tests/pandavro_test.py::test_buffer_e2e - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_file_path_e2e - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_delegation - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_append - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_dataframe_kwargs - AssertionError: numpy array are different
========================= 5 failed, 4 passed, 5 warnings in 0.60s ===========================
I can see
(Pdb) expect
Boolean DateTime64 Float64 Int64 String
0 True 2018-12-31 23:00:00 -0.579613 8 foo
1 False 2019-01-01 23:00:00 -0.922827 3 bar
2 True 2019-01-02 23:00:00 -1.070658 8 foo
3 False 2019-01-03 23:00:00 -0.072218 2 bar
4 True 2019-01-04 23:00:00 -1.604049 3 foo
5 False 2019-01-05 23:00:00 -0.822774 0 bar
6 True 2019-01-06 23:00:00 -0.504930 4 foo
7 False 2019-01-07 23:00:00 1.357435 0 bar
(Pdb) dataframe
Boolean DateTime64 Float64 Int64 String
0 True 2019-01-01 -0.579613 8 foo
1 False 2019-01-02 -0.922827 3 bar
2 True 2019-01-03 -1.070658 8 foo
3 False 2019-01-04 -0.072218 2 bar
4 True 2019-01-05 -1.604049 3 foo
5 False 2019-01-06 -0.822774 0 bar
6 True 2019-01-07 -0.504930 4 foo
7 False 2019-01-08 1.357435 0 bar
in test_append. Any ideas? There's a mismatch of about one hour; maybe a timezone or rounding issue? @ynqa
Feature request:
Could you allow a process_record function to be applied while reading Avro? Here is a suggestion:
def __file_to_dataframe(f, schema, process_record=None, **kwargs):
    reader = fastavro.reader(f, reader_schema=schema)
    if process_record:
        records = [process_record(r) for r in reader]
    else:
        records = list(reader)
    return pd.DataFrame.from_records(records, **kwargs)
setup.py
has:
install_requires=[
# fixed versions.
'fastavro==1.5.1',
'pandas>=1.1',
# https://pandas.pydata.org/pandas-docs/version/1.1/getting_started/install.html#dependencies
'numpy>=1.15.4',
],
This causes a dependency resolution failure for me because I'm using another package that requires fastavro>=1.5.4.
Would it be possible to relax that requirement to 'fastavro>=1.5.1'?