Comments (7)
I think PDEP-13 is going to be important for this. We have so many new string dtypes. While they all have merits in their own right, I don't think this makes for a good end-user experience, and it is confusing how to produce and control them throughout their lifecycle in our codebase.
from pandas.
Also, it might be able to supplant the pyarrow_numpy dtype.
Whether to use the new numpy 2.0 string dtype in pandas is IMO unrelated to the string[pyarrow_numpy] dtype. We can use numpy to eventually replace the currently existing numpy-backed string dtypes (string[python]), but not the pyarrow-backed ones (pyarrow still has a performance benefit compared to numpy). string[pyarrow_numpy] was only introduced to have a pyarrow-based string dtype suitable to make the default in pandas 3.0, on the aspect of missing value semantics. The same consideration will have to be made for a np.StringDType based dtype.
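To make the "missing value semantics" point concrete, here is a minimal sketch of the two behaviours being weighed. It uses the nullable "string" dtype for NA-style semantics and, as an assumption made purely for illustration, a plain object column as a stand-in for NaN-style semantics (so it runs without pyarrow); it is not the string[pyarrow_numpy] implementation itself.

```python
import numpy as np
import pandas as pd

# NA-style semantics: the nullable "string" dtype propagates pd.NA
# through comparisons, so the result is a nullable boolean array.
na_style = pd.array(["a", None], dtype="string")
na_result = na_style == "a"

# NaN-style semantics, illustrated with an object column as a stand-in:
# a comparison against a missing value simply evaluates to False.
nan_style = pd.Series(["a", np.nan], dtype=object)
nan_result = nan_style == "a"

print(na_result)   # second element is <NA>
print(nan_result)  # second element is False
```

Any np.StringDType-based dtype would need to pick one of these behaviours (or expose both), which is the consideration referred to above.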
As long as numpy is a required dependency of ours, I think it should make sense that we support these strings natively and not force conversion to object/Arrow.
If we would require pyarrow as a hard dependency, then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype, and to converge on a single string dtype implementation in pandas.
As long as we keep pyarrow optional and have a "fallback" string dtype using numpy under the hood, then of course we can use newer numpy features to improve our existing numpy-backed string dtype.
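For context, this is roughly what the explicit conversion route looks like today. It is a sketch under stated assumptions: it looks up np.dtypes.StringDType defensively (it only exists in numpy 2.0 and later) and goes through an object array on purpose, because how a given pandas version treats a StringDType array passed in directly is not assumed here.

```python
import numpy as np
import pandas as pd

# np.dtypes.StringDType only exists in numpy >= 2.0, so look it up defensively.
StringDType = getattr(getattr(np, "dtypes", None), "StringDType", None)

words = ["pandas", "numpy", "arrow"]
if StringDType is not None:
    np_strings = np.array(words, dtype=StringDType())
else:
    # Older numpy: fall back to an object array so the sketch still runs.
    np_strings = np.array(words, dtype=object)

# Explicit round trip through object, i.e. the forced conversion that
# native support for the numpy string dtype would make unnecessary.
as_object = np_strings.astype(object)
ser = pd.Series(as_object, dtype="string")

print(ser.dtype)
print(ser.str.upper().tolist())
```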
from pandas.
then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype
reasonable but not obvious. e.g. if the user expects to be doing __setitem__s there is likely to be a performance difference. But the more important point is one on which you (joris) and I very much agree: we don't need to decide on that right now, and so shouldn't.
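One way to check the __setitem__ concern on a given machine is a small timing sketch like the one below. The helper name repeated_setitem and the array sizes are hypothetical choices for illustration; the pyarrow-backed case only runs if pyarrow is installed, and no particular timing result is claimed.

```python
import importlib.util
import timeit

import pandas as pd


def repeated_setitem(storage, n=1000):
    """Build a string array with the given storage and overwrite every 10th element."""
    arr = pd.array(["x"] * n, dtype=f"string[{storage}]")
    for i in range(0, n, 10):
        arr[i] = "y"
    return arr


storages = ["python"]
if importlib.util.find_spec("pyarrow") is not None:
    storages.append("pyarrow")

for storage in storages:
    # Wall-clock time for three runs; numbers vary by machine and version.
    seconds = timeit.timeit(lambda: repeated_setitem(storage), number=3)
    print(f"string[{storage}]: {seconds:.4f}s for 3 runs")
```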
from pandas.
and not force conversion to object/Arrow
100%
from pandas.
Maybe we can (deprecate) and rename this to something like string[pyarrow_nplike], or just string[nplike] if we want to replace the pyarrow_numpy strings altogether
what about numpy_numpy? 😉
(where nplike will default to numpy 2.0 if you have that installed, and fall back to Arrow if not installed, if the pyarrow_numpy dtype will go away in the future).
Mixing the dtype systems is a concern to others as well as myself.
Once numpy 2.0 becomes commonplace, users will probably try to pass StringDType strings into pandas.
I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the 3 benefits cited in PDEP-10.
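To illustrate why mixing the dtype systems is a concern, here is a sketch of the fallback logic the quoted nplike proposal implies. The function resolve_nplike_backend is hypothetical and does not exist in pandas, and the final "python" fallback is an added assumption for the case where neither option is available.

```python
import importlib.util

import numpy as np


def resolve_nplike_backend():
    """Hypothetical helper: choose a backend for an imagined string[nplike] dtype."""
    # Prefer the numpy 2.0 string dtype when the installed numpy provides it.
    np_dtypes = getattr(np, "dtypes", None)
    if getattr(np_dtypes, "StringDType", None) is not None:
        return "numpy"
    # Otherwise fall back to Arrow, as described in the quoted proposal.
    if importlib.util.find_spec("pyarrow") is not None:
        return "pyarrow"
    # Assumption: object-backed Python strings as the last resort.
    return "python"


print(resolve_nplike_backend())
```

The same code can return a different backend depending on what happens to be installed, which is exactly the kind of environment-dependent behaviour the thread is worried about.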
from pandas.
Once numpy 2.0 becomes commonplace, users will probably try to pass StringDType strings into pandas.
I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the 3 benefits cited in PDEP-10.
Ah, I see that you did mention this in #57073 (comment), but there has been no direct response to that comment to date.
from pandas.
Thanks @simonjayhawkins for providing all of this input. I would support what I think you are asking for, with either a new PDEP or a revote/reclarification on PDEP 10, before investing a lot of effort into these.
I do agree that we have more or less worked around what we agreed to in a lot of smaller PRs and are not in an ideal state with our string dtypes. Between the different string implementations, nullability semantics, infer_strings settings, dtype_backend arguments, requiring versus not requiring pyarrow, etc., I personally find it challenging to navigate where we stand now. At the very least, having this discussed and communicated in one central location should be beneficial.
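As a small illustration of how many knobs are involved, the sketch below reads the same CSV text with different dtype_backend arguments and prints the resulting dtypes. It assumes pandas 2.0 or later (where dtype_backend exists) and only tries the pyarrow backend if pyarrow is installed; the default result is deliberately left unannotated because it depends on the pandas version and options in effect.

```python
import importlib.util
import io

import pandas as pd

csv_text = "name,n\nalice,1\nbob,2\n"

# Same data, three ways of asking for it.
default = pd.read_csv(io.StringIO(csv_text))
nullable = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

print("default:       ", default.dtypes.to_dict())
print("numpy_nullable:", nullable.dtypes.to_dict())

if importlib.util.find_spec("pyarrow") is not None:
    arrow = pd.read_csv(io.StringIO(csv_text), dtype_backend="pyarrow")
    print("pyarrow:       ", arrow.dtypes.to_dict())
```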
from pandas.