DISC: Supporting numpy StringDType in Pandas

lithomas1 commented on May 28, 2024

DISC: Supporting numpy StringDType in Pandas

Comments (7)

WillAyd commented on May 28, 2024

I think PDEP-13 is going to be important for this. We have so many new string dtypes. While they all have merits in their own right, I don't think this makes for a good end-user experience, and it is confusing how to produce and control them throughout their lifecycle in our codebase.

jorisvandenbossche commented on May 28, 2024

Also, it might be able to supplant the pyarrow_numpy dtype.

Whether to use the new numpy 2.0 string dtype in pandas is IMO unrelated to the string[pyarrow_numpy] dtype. We can use numpy to eventually replace the currently existing numpy-backed string dtypes (string[python]), but not the pyarrow-backed ones (pyarrow still has a performance benefit compared to numpy).

string[pyarrow_numpy] was only introduced to have a pyarrow-based string dtype suitable to become the default in pandas 3.0 with respect to missing value semantics. The same consideration will have to be made for a np.StringDType-based dtype.

As long as numpy is a required dependency of ours, I think it should make sense that we support these strings natively and not force conversion to object/Arrow.

If we would require pyarrow as a hard dependency, then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype, and to converge on a single string dtype implementation in pandas.
As long as we keep pyarrow optional and have a "fallback" string dtype using numpy under the hood, then of course we can use newer numpy features to improve our existing numpy-backed string dtype.

jbrockmendel commented on May 28, 2024

then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype

Reasonable, but not obvious. For example, if the user expects to be doing many __setitem__ calls, there is likely to be a performance difference. But the more important point is one on which you (joris) and I very much agree: we don't need to decide on that right now, and so shouldn't.

jbrockmendel commented on May 28, 2024

and not force conversion to object/Arrow

100%

simonjayhawkins commented on May 28, 2024

Maybe we can (deprecate) and rename this to something string[pyarrow_nplike], or just string[nplike] if we want to replace the pyarrow_numpy strings altogether

what about numpy_numpy? 😉

(where nplike will default to numpy 2.0 if you have that installed, and fallback to Arrow if not installed, if the pyarrow_numpy dtype will go away in the future).

Mixing the dtype systems is a concern to others as well as myself.

Once numpy 2.0 becomes commonplace, users will probably try to pass in StringDType strings into pandas.

I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the three benefits cited in PDEP-10.

simonjayhawkins commented on May 28, 2024

Once numpy 2.0 becomes commonplace, users will probably try to pass in StringDType strings into pandas.

I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the three benefits cited in PDEP-10.

Ah, I see that you did mention this in #57073 (comment), but there has been no direct response to that comment to date.

WillAyd commented on May 28, 2024

Thanks @simonjayhawkins for providing all of this input. I would support what I think you are asking for, with either a new PDEP or a revote/reclarification on PDEP-10, before investing a lot of effort into these.

I do agree that we have quasi worked around what we agreed to in a lot of smaller PRs and are not in an ideal state with our string dtypes. Between the different string implementations, nullability semantics, infer_strings settings, dtype_backend arguments, requiring versus not requiring pyarrow, etc., I personally find it challenging to navigate where we stand now. At the very least, having this discussed and communicated in one central location should be beneficial.
