Comments (3)
For reference, the ENUM logical type is described as:
ENUM annotates the binary primitive type and indicates that the value was converted from an enumerated type in another data model (e.g. Thrift, Avro, Protobuf). Applications using a data model lacking a native enum type should interpret ENUM annotated field as a UTF-8 encoded string.
So to start with, I don't think it would be a good match for our categorical dtype. ENUM seems to annotate a column whose actual values are stored as variable-length binary data. The categorical dtype in pandas is represented under the hood as an array of integer indices (pointing to a set of unique categories). Such integers are much more efficient to store than the materialized binary data.
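To make the storage difference concrete, here is a minimal sketch of how a pandas categorical splits into unique categories plus integer codes. The column values are made up for illustration.

```python
import pandas as pd

# Hypothetical example values; any repeated strings would do.
colors = pd.Series(["red", "green", "red", "blue", "red"], dtype="category")

# The unique categories are stored once...
categories = colors.cat.categories.tolist()
# ...and each row holds only a small integer index into them.
codes = colors.cat.codes.tolist()

print(categories)             # ['blue', 'green', 'red']
print(codes)                  # [2, 1, 2, 0, 2]
print(colors.cat.codes.dtype) # int8
```

An ENUM-annotated Parquet column, by contrast, is described as holding the binary value itself for each row.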
Secondly, pandas uses PyArrow (and the Parquet C++ implementation that pyarrow provides bindings for) to read/write parquet files, but AFAIK Parquet C++ does not really support the ENUM logical type (on read it accepts it, but just reads it as normal binary data; and I think you can't actually write it from Python).
So certainly for writing I wouldn't use ENUM, and I think with pyarrow it's also not actually possible. On the reading side, read_parquet will read it as general binary data. But if you directly want to read it as categorical dtype in pandas, pyarrow does support reading binary data as "dictionary encoded" in arrow, which will then translate to categorical dtype in pandas. See the read_dictionary keyword in https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html. You can pass this keyword to pd.read_parquet and it will be passed through to pyarrow.
from pandas.
cc @jorisvandenbossche for any thoughts.
Thanks @jorisvandenbossche - closing.