
Comments (7)

WillAyd commented on May 28, 2024

I think PDEP-13 is going to be important for this. We have so many new string dtypes...while they all have merits in their own right I don't think this makes for a good end user experience and it is confusing how to produce and control them throughout their lifecycle in our codebase

from pandas.

jorisvandenbossche commented on May 28, 2024

Also, it might be able to supplant the pyarrow_numpy dtype.

Whether to use the new numpy 2.0 string dtype in pandas is IMO unrelated to the string[pyarrow_numpy] dtype. We can use numpy to eventually replace the currently existing numpy-backed string dtypes (string[python]), but not the pyarrow-backed ones (pyarrow still has a performance benefit compared to numpy).

string[pyarrow_numpy] was only introduced to have pyarrow-based string dtype suitable to make the default in pandas 3.0, on the aspect of missing value semantics. The same consideration will have to be made for a np.StringDType based dtype.
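As a rough illustration of what "missing value semantics" means here, the sketch below contrasts the two sentinels involved: `pd.NA` propagates through comparisons, while `np.nan` compares as `False`. It assumes only that pandas and NumPy are installed; the variable names are illustrative, and it deliberately uses the long-standing `"string"` and `object` dtypes rather than `string[pyarrow_numpy]`, whose availability depends on the pandas version and on pyarrow.

```python
import numpy as np
import pandas as pd

# Scalar level: pd.NA propagates through a comparison, np.nan does not.
na_result = pd.NA == "a"     # stays "unknown", i.e. pd.NA
nan_result = np.nan == "a"   # plain Python comparison, evaluates to False

# Series level: the NA-based "string" dtype keeps the missing entry missing
# in the comparison result, while an object column holding NaN yields False.
na_strings = pd.Series(["a", None], dtype="string")
nan_strings = pd.Series(["a", np.nan], dtype=object)

na_mask = na_strings == "a"    # second entry is missing in the mask
nan_mask = nan_strings == "a"  # second entry is False in the mask

print(na_mask.isna().tolist())
print(nan_mask.tolist())
```

A NaN-based string dtype follows the second pattern, which is why the choice of sentinel matters for which dtype can become the default.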

As long as numpy is a required dependency of ours, I think it should make sense that we support these strings natively and not force conversion to object/Arrow.

If we would require pyarrow as a hard dependency, then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype, and to converge on a single string dtype implementation in pandas.
As long as we keep pyarrow optional and have a "fallback" string dtype using numpy under the hood, then of course we can use newer numpy features to improve our existing numpy-backed string dtype.


jbrockmendel commented on May 28, 2024

then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype

reasonable but not obvious. e.g. if the user expects to be doing __setitem__s there is likely to be a performance difference. But the more important point is one on which you (joris) and I very much agree: we don't need to decide on that right now, and so shouldn't.


jbrockmendel commented on May 28, 2024

and not force conversion to object/Arrow

100%


simonjayhawkins commented on May 28, 2024

Maybe we can (deprecate) and rename this to something like string[pyarrow_nplike], or just string[nplike] if we want to replace the pyarrow_numpy strings altogether

what about numpy_numpy? 😉

(where nplike will default to numpy 2.0 if you have that installed, and fall back to Arrow if not, in case the pyarrow_numpy dtype goes away in the future).

Mixing the dtype systems is a concern to others as well as myself.

Once numpy 2.0 becomes commonplace, users will probably try to pass in StringDType strings into pandas.

I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the 3 benefits cited in PDEP-10
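To make the interoperability scenario concrete, here is a minimal sketch of a user handing NumPy 2.0 `StringDType` data to pandas. The helper `to_pandas_strings` is hypothetical and not a pandas API. It routes the data through `object` dtype explicitly, which is the conservative conversion the thread would prefer to avoid in the long run; it is used here only so the result does not depend on how a given pandas version treats `StringDType` input. On NumPy older than 2.0 the sketch falls back to an `object` array.

```python
import numpy as np
import pandas as pd


def to_pandas_strings(values):
    """Hypothetical helper: turn a NumPy string array into a pandas string Series."""
    arr = np.asarray(values)
    # Convert via object dtype so the outcome is the same regardless of
    # whether the installed pandas understands StringDType natively.
    return pd.Series(arr.astype(object), dtype="string")


# StringDType only exists in NumPy >= 2.0; fall back to object otherwise.
string_dtype_cls = getattr(getattr(np, "dtypes", None), "StringDType", None)
if string_dtype_cls is not None:
    source = np.array(["apple", "kiwi"], dtype=string_dtype_cls())
else:
    source = np.array(["apple", "kiwi"], dtype=object)

result = to_pandas_strings(source)
print(result.dtype, result.tolist())
```

Whether pandas should instead accept such arrays without the `object` round trip is exactly the open question raised in this comment.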


simonjayhawkins commented on May 28, 2024

Once numpy 2.0 becomes commonplace, users will probably try to pass in StringDType strings into pandas.

I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the 3 benefits cited in PDEP-10

Ah, I see that you did mention this in #57073 (comment), but there has been no direct response to that comment to date.


WillAyd commented on May 28, 2024

Thanks @simonjayhawkins for providing all of this input. I would support what I think you are asking for with either a new PDEP or a revote/reclarification on PDEP 10 before investing a lot of effort into these.

I do agree that we have quasi worked around what we agreed to in a lot of smaller PRs and are not in an ideal state with our string dtypes. Between the different string implementations, nullability semantics, infer_strings settings, dtype_backend arguments, requiring versus not requiring pyarrow, etc., I personally find it challenging to navigate where we stand now. At the very least, having this discussed and communicated in one central location should be beneficial.
