Comments (27)

jorisvandenbossche commented on May 27, 2024

FWIW, there is also a fourth option: not have any keyword for this, and don't give users a way to control this through StringDtype().
Calling the default pd.StringDtype() will under the hood still create a dtype instance either backed by numpy object or pyarrow depending on whether pyarrow is installed, but that choice would then always be done automatically. And we can still have a private API to create the dtype with one of the specific backends for testing.
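
A minimal sketch of what that automatic choice could look like (the helper name _default_string_storage is hypothetical, not an existing pandas function):

# Hypothetical sketch of automatic backend selection, not actual pandas internals.
def _default_string_storage() -> str:
    try:
        import pyarrow  # noqa: F401
        return "pyarrow"
    except ImportError:
        return "python"

# pd.StringDtype() would then resolve to the pyarrow-backed variant when pyarrow is
# installed and fall back to the numpy-object-backed variant otherwise.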

Dr-Irv commented on May 27, 2024

It might be useful to separate the idea of the backend / provider (e.g. "pyarrow") from how that backend / provider is actually implementing things (e.g. "string", "large_string", "string_view"). @Dr-Irv I think you are going down that path with your "nature" suggestion but rather than trying to cram all that metadata into a single string having two properties might be more future proof

Yes, the fact that a pyarrow backend could use a different physical type adds another aspect. So backend and physical_type might make sense to have instead of "nature"

WillAyd commented on May 27, 2024

Also, for StringDtype, is the first argument required? Could you just do StringDtype(na_value=pd.NA), and then it will use pyarrow if installed, otherwise python?

I think this is going to be really problematic. While I like the spirit of having one StringDtype with potentially different underlying implementations depending on what is installed, I think this is just going to open another can of worms:

>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype(na_value=pd.NA)
>>> ser.str.len()

What data type gets returned here if arrow is installed? "int64[pyarrow]" maybe makes sense, but then if the user for whatever reason uninstalls pyarrow, does the same code now return "Int64"? What if they changed the na_value to:

>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype()

Whether or not we have pyarrow installed I assume this returns a float (?)

I am definitely not a fan of our current "string[pyarrow_numpy]", but to its credit it is at least consistent in the data types it returns.

With PDEP-13 my expectation would be that regardless of what is installed:

>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype(na_value=<na_value>)
>>> ser.str.len()
0 1
1 1
2 <na_value>
dtype: pd.Int64()

so I think that abstracts a lot of these issues. But I don't think only partially implementing that for StringDtype is going to get us to a better place.

simonjayhawkins commented on May 27, 2024

I guess I was hinting at something similar for the semantics keyword in #58551 (comment), i.e. we don't have that public on the dtype (but keep it as a dtype property) and perhaps control that at the DataFrame (and maybe Series) level, so that a nullable array assigned to a DataFrame would coerce to NumPy semantics.

Dr-Irv commented on May 27, 2024

It seems to me that there are 2 choices that a user could make, along with options for those choices:

  • How strings are stored and manipulated:
    • pyarrow
    • numpy objects (current 2.x behavior)
    • numpy strings (requires numpy 2.0)
  • How missing values are handled:
    • Use np.nan
    • Use pd.NA

Aside from the numpy 2.0 option, I think we have implementations of all combinations of the storage and missing values available, and it seems to me that when we implement support for numpy strings that depends on numpy 2.0, we'd want to support both np.nan and pd.NA for missing values.

There are then a few questions to address:

  1. How should a user choose the storage/manipulation/backend and what the keyword should be
  2. How should a user choose the missing value behavior
  3. What should the defaults of each be
  4. What do we do to handle compatibility from version 2.2 to 3.0
  5. What is the class name for specifying these dtypes
  6. What is the string representation of these dtypes

I'd like to suggest the word nature as the keyword to represent the storage/backend and a keyword missing for the missing value. We deprecate the word storage in pd.StringDtype(), and document how the current storage argument for pd.StringDtype() maps to nature and missing. That will address questions 1, 2 and 4. The typed signature in the future (without the storage keyword) would be:

class StringDtype(StorageExtensionDtype):
    def __init__(self, nature: Literal["pyarrow", "numpy.object", "numpy.string"] = "pyarrow",
                 missing: np.nan | pd.NA = pd.NA) -> None: ...

Why nature? When you choose one of the 3 options, you are describing not just the storage backend, but also the implementation of the string methods like str.len(), str.split, etc. So it is the "nature" of the storage and the behavior that is being specified.
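
A hypothetical usage under this proposal might then look like the following (neither nature nor missing exists as a StringDtype keyword today):

import pandas as pd

# Hypothetical call using the proposed keywords; not an existing pandas API.
dtype = pd.StringDtype(nature="pyarrow", missing=pd.NA)
ser = pd.Series(["a", "b", None], dtype=dtype)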

I'd also like to address question 6: we should use a new nomenclature for the string aliases that represent these dtypes. Let's NOT use the word "numpy" to represent using np.nan for missing values, but be explicit in using the strings "pd.NA" and "np.nan". The resulting strings representing all 6 combinations could then be (with my best guess for the equivalences of today's behavior):

  • "pyarrow_np.nan" (equivalent to "pyarrow_numpy")
  • "pyarrow_pd.NA" (equivalent to storage="pyarrow")
  • "numpy.object_np.nan" (equivalent to dtype="object")
  • "numpy.object_pd.NA" (equivalent to pd.StringDtype() in 2.x ??)
  • "numpy.string_pd.NA" (New)
  • "numpy.string_np.nan" (New)

For the missing value indicators, let's be explicit about whether np.nan or pd.NA is being used whenever we refer to them, because the community at large is going to have to be educated about the difference in the future and using some other "code word" to represent the 2 possibilities is just masking the issue even more (no pun intended).

simonjayhawkins commented on May 27, 2024

After writing this down, I think my current preference would go to StringDtype(backend="python"|"pyarrow"), as that seems the simplest for most users (it's a bit confusing for those who already explicitly used storage, but most users have never done that)

I'd like to suggest the word nature as the keyword to represent the storage/backend

some more suggestions: memory, memory_layout or layout

  • How missing values are handled:

For the missing value indicators, let's be explicit about whether np.nan or pd.NA is being used whenever we refer to them

This is maybe not explicit about the behavior, in the sense that the nullable string dtypes will always return a nullable integer dtype for numeric output, rather than either an int or a float dtype depending on the presence of NA values. So this behavior gives a more consistent return type, and terms such as missing or na_value do not convey this.
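
A small example with today's (2.x) dtypes illustrates the behavior difference such a keyword would need to convey (sketch; the "string[pyarrow_numpy]" alias needs pandas 2.1+ with pyarrow installed):

import pandas as pd

# NA-variant: numeric results are always nullable Int64, with or without missing values.
s_na = pd.Series(["a", None], dtype="string")
s_na.str.len().dtype   # Int64

# NaN-variant: int64 when nothing is missing, float64 as soon as a missing value appears.
s_nan = pd.Series(["a", None], dtype="string[pyarrow_numpy]")
s_nan.str.len().dtype  # float64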

jbrockmendel commented on May 27, 2024

backend="python"|"pyarrow"

is "python" an option for the dtype_backend keyword? The keywords are similar enough that it will cause confusion if they don't have matching behavior.

nullable=False (although we have used "nullable dtypes" in the past to denote the dtypes using NA, it's also confusing here because the False variant does support missing values as well)

Yah I think there is confusion as to what "nullable" means depending on the writer/context.

propagation=... would be more explicit than semantics in describing what it controls. The downside is I can never remember whether to spell it "propa" or "propo".

I lean towards storage+na_value.

jorisvandenbossche commented on May 27, 2024

It seems to me that there are 2 choices that a user could make, along with options for those choices

For the PDEP / pandas 3.0, I personally explicitly do not want typical users to make this choice. In each context (pyarrow installed or not), there is just one default string dtype, and that is all that most users should worry about.

So while we still need some (public or internal) way to create the different variants explicitly (which is what we are discussing here, so thanks for your comments!), in the context of the PDEP I would like to hide that as much as possible. For that reason, I am personally not really a fan of adding an explicit keyword to choose the missing value semantics (like na_value or missing), or of the elaborate string representations that give a lot of details. Users that didn't opt in to any of the experimental options should just see string, and IMO we could even disallow creating the new NaN-variants of the string dtype with an alias other than "string" (like, don't support "string[pyarrow_numpy]", to use the current naming in the main branch)

jorisvandenbossche commented on May 27, 2024

backend="python"|"pyarrow"

is "python" an option for the dtype_backend keyword? The keywords are similar enough that it will cause confusion if they don't have matching behavior.

No, but currently you also can't choose the "default" dtypes through dtype_backend; that keyword is only used for opting in to a set of non-default dtypes.
Now, it is a good point that backend="python" wouldn't translate to other dtypes like numeric or datetime dtypes, where we would never have the option to store such data as python objects (although in theory that could be an option for dates, not that we actually want to add it, I think)

Dr-Irv commented on May 27, 2024

For the PDEP / pandas 3.0, I personally explicitly do not want typical users to make this choice. In each context (pyarrow installed or not), there is just one default string dtype, and that is all that most users should worry about.

But at the top of this issue, you wrote:

one of the discussion points is how to name the different variants of the StringDtype that will exist with the PDEP (whether using pyarrow or numpy object dtype for the data under the hood, and whether using NA or NaN as missing value sentinel).

I agree that typical users should not make that choice, but if we use the nature and missing concept, and have default values for those parameters, then they don't have to make the choice, unless they choose to do so.

lithomas1 commented on May 27, 2024

FWIW, there is also a fourth option: not have any keyword for this, and don't give users a way to control this through StringDtype(). Calling the default pd.StringDtype() will under the hood still create a dtype instance either backed by numpy object or pyarrow depending on whether pyarrow is installed, but that choice would then always be done automatically. And we can still have a private API to create the dtype with one of the specific backends for testing.

Big +1 on this.

I don't think there's a good way to resolve the ambiguity in a name like "pyarrow_numpy", and since pyarrow_numpy and the python fallback are going to be the default string dtype anyway, it's not going to be a common use case to convert to it.

So, I would be fine making users manually specify whatever keywords we decide on to create the pyarrow/python-backed string array with np.nan as the missing value, and having no way to create e.g. a pyarrow_numpy array with a string alias.

I lean towards storage+na_value.

+1 on this.

There's a precedent for storage for string arrays, and na_value has history in things like read_csv.

I would be against adding a new keyword that clashes with "storage" (or changing "storage"), since it makes for messy handling internally. I also don't think it's worth the churn.

I'm less opinionated about na_value.

WillAyd commented on May 27, 2024

I realize this is for PDEP 14 which we want as a fast mover, but PDEP 13 proposed the following structure for a data type:

from __future__ import annotations

from typing import Literal

import numpy as np
import pandas as pd


class BaseType:

    @property
    def dtype_backend(self) -> Literal["pandas", "numpy", "pyarrow"]:
        """
        Library responsible for the array implementation
        """
        ...

    @property
    def physical_type(self):
        """
        How does the backend physically implement this logical type? i.e. our
        logical type may be a "string" and we are using pyarrow underneath -
        is it a pa.string(), pa.large_string(), pa.string_view() or something else?
        """
        ...

    @property
    def missing_value_marker(self) -> pd.NA | np.nan:
        """
        Sentinel used to denote missing values
        """
        ...

Which may be of interest here too (though feedback so far is that missing_value_marker is better called na_value).

It might be useful to separate the idea of the backend / provider (e.g. "pyarrow") from how that backend / provider is actually implementing things (e.g. "string", "large_string", "string_view"). @Dr-Irv I think you are going down that path with your "nature" suggestion but rather than trying to cram all that metadata into a single string having two properties might be more future proof
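
As a rough illustration of that separation, a pyarrow-backed string type could report the two properties independently (a sketch only, reusing the draft property names from above; PyArrowStringType is not an existing pandas class):

import pyarrow as pa

class PyArrowStringType:
    @property
    def dtype_backend(self) -> str:
        return "pyarrow"          # which library provides the array implementation

    @property
    def physical_type(self) -> pa.DataType:
        return pa.large_string()  # could equally be pa.string() or pa.string_view()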

jorisvandenbossche commented on May 27, 2024

Thanks for the discussion here. I would like to do a next iteration of my proposal, based on:

  • I would like to keep the changes as minimal as possible for 3.0. For example, we already have a storage keyword, so let's just stick to that for 3.0.
    In the logical dtypes discussion, we can still decide to generalize that for all dtypes, potentially with a different name like backend or nature, and at that point we can just alias or eventually deprecate storage for StringDtype (something we would otherwise have to do now as well, so let's just leave that for the logical dtypes discussion)
  • The easiest way to have a minimal distinction between the NA and NaN variants is to add one keyword for that (I agree now that my suggestion of having both storage and backend keywords, the one for the NA variants and the other for NaN, would be quite confusing).
    And given we already use na_value as an attribute on the dtypes right now, that seems the most obvious choice as others have also argued.
  • We don't yet have different physical types for one backend (with StringDtype("pyarrow"), you always get pyarrow's large_string), so again that is something we can leave for the logical dtypes discussion, and there is no need to already add a keyword for that right now.

That leads me to the following table (the first column is how the user can create a dtype, the second column is the concrete dtype instance they would get, and the third column is the string alias they see in displays and can use as a dtype specification in addition to the first column):

| User specification | Concrete dtype | String alias | Note |
| --- | --- | --- | --- |
| Unspecified (inference) | StringDtype("pyarrow"\|"python", na_value=np.nan) | "string" | (1) |
| StringDtype() or "string" | StringDtype("pyarrow"\|"python", na_value=np.nan) | "string" | (1), (2) |
| StringDtype("pyarrow") | StringDtype("pyarrow", na_value=np.nan) | "string" | (2) |
| StringDtype("python") | StringDtype("python", na_value=np.nan) | "string" | (2) |
| StringDtype("pyarrow", na_value=pd.NA) | StringDtype("pyarrow", na_value=pd.NA) | "string[pyarrow]" | |
| StringDtype("python", na_value=pd.NA) | StringDtype("python", na_value=pd.NA) | "string[python]" | |
| StringDtype("pyarrow_numpy") | StringDtype("pyarrow", na_value=np.nan) | "string[pyarrow_numpy]" | (3) |

(1) You get "pyarrow" or "python" depending on pyarrow being installed.
(2) Those three rows are backwards incompatible (i.e. they work now but give you the NA-variant), but we could still do a deprecation warning about that in advance of changing it.
(3) Keep "pyarrow_numpy" temporarily because this is in main, but deprecate in 2.2.x and have removed for 3.0

Additional notes on the string aliases:

  • I would explicitly not allow using "string[pyarrow]" and "string[python]" string aliases to create the new default NaN-variants, but only allow "string".
    Reasons:
    • 1) Users should never specify it but rely on the default selection of the backend (if you do specify it, that could make your code non-portable to an environment without pyarrow), and there is still the StringDtype(..) way to be explicit in case you need it.
    • 2) That would also allow keeping the existing string aliases like "string[pyarrow]" backwards compatible (although it could be confusing that those give the NA-variant, so we might still want to deprecate them regardless).
    • 3) Finally, this also avoids the need to encode the NA/NaN value in the string alias in order to have a distinct alias for every variant.
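
Putting the table and notes together, the intended end state would be roughly the following (a sketch of the proposed 3.0 defaults, not of current pandas behaviour):

import pandas as pd

# Sketch of the proposed defaults; current pandas versions behave differently.
pd.Series(["a", None]).dtype                  # "string" (NaN-variant, backend auto-selected)
pd.Series(["a", None], dtype="string").dtype  # same NaN-variant
pd.Series(["a", None], dtype=pd.StringDtype("pyarrow", na_value=pd.NA)).dtype  # "string[pyarrow]"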

Dr-Irv commented on May 27, 2024

That leads me to the following table (the first column is how the user can create a dtype, the second column is the concrete dtype instance they would get, and the third column is the string alias they see in displays and can use as a dtype specification in addition to the first column):

I think the rows annotated with footnote (2) are problematic, because they are a change in behavior for people currently using StringDtype(), StringDtype("pyarrow"), or StringDtype("python"), i.e., you would now be using np.nan as the default missing value rather than pd.NA, which is what you get today.

Here's another idea. What if we deprecate the top-level namespace specifications for dtype, and move all of them to a pandas.dtype package, i.e., pandas.dtype.string, pandas.dtype.int64, etc. If you then use StringDtype(), you get the current behavior (pd.NA for missing values). But if you use pd.dtype.string, you get what you propose above. Then the only people affected by behavior changes are ones who specified dtype="string", because that would now use np.nan rather than pd.NA.
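
Illustratively (the pandas.dtype namespace does not exist; this only sketches the suggestion):

import pandas as pd

# Hypothetical: the dtype namespace opts into the new defaults while StringDtype() stays as-is.
ser_old = pd.Series(["a", None], dtype=pd.StringDtype())  # keeps 2.x behaviour: pd.NA missing
ser_new = pd.Series(["a", None], dtype=pd.dtype.string)   # new behaviour: np.nan missing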

jorisvandenbossche commented on May 27, 2024

I think the rows annotated with footnote (2) are problematic, because it is a change in behavior for people currently using StringDtype(), or StringDtype("pyarrow") or StringDtype("python")

Yes, but as discussed on the PDEP, we could still add a deprecation warning for it. I know that doesn't change that it still is a behaviour change (and users will only see the deprecation warning for a short time), but at least we could do it with some warning in advance.

Further, my guess is that "string" will be used more often than StringDtype (that guess is based on the usage in our own documentation, and on a quick search of StackOverflow questions labeled with pandas for "StringDtype"), and certainly more often than the variants with an explicit keyword.

So if we are eventually on board with changing the behaviour for "string", I think changing StringDtype() as well is OK.

Further, I think quite a few users that now use StringDtype don't necessarily need the NA-variant, but would be perfectly fine with the NaN-variant. And so that would also avoid them having to change their code.
(i.e. only those who explicitly want to work with the nullable NA dtypes will have to make their dtype specification more explicit)

jorisvandenbossche commented on May 27, 2024

Here's another idea

And to be clear, I think that is a good idea long-term, but I would personally keep that for when we are ready to do that for all dtypes, instead of now only having a string dtype in such a namespace (while other default dtypes like categorical, datetimetz, etc still live top-level)

Dr-Irv commented on May 27, 2024

Further, I think quite a few users that now use StringDtype don't necessarily need the NA-variant, but would be perfectly fine with the NaN-variant. And so that would also avoid them having to change their code.
(i.e. only those who explicitly want to work with the nullable NA dtypes will have to make their dtype specification more explicit)

Given what @WillAyd wrote here: #58551 (comment) I think we need to be careful about this.

Dr-Irv commented on May 27, 2024

And to be clear, I think that is a good idea long-term, but I would personally keep that for when we are ready to do that for all dtypes, instead of now only having a string dtype in such a namespace (while other default dtypes like categorical, datetimetz, etc still live top-level)

But if we did this for all dtypes as part of the change for strings (and deprecated the top-level dtypes), then we accomplish both goals, which includes a better migration path for strings (maybe).

jorisvandenbossche commented on May 27, 2024

Further, I think quite some users that now do StringDtype don't necessarily need the NA-variant, ....

Given what @WillAyd wrote here: #58551 (comment) I think we need to be careful about this.

I think many of the posts about the StringDtype are not necessarily about how NA is different from NaN, but just about how cool it is to have an actual string dtype instead of the confusing catch-all object dtype, and (in case it's about the pyarrow variant) about the performance improvements.
I can't actually check the Medium blogpost (it's member-only), but at least the SO question is just about converting to strings. And for example https://pythonspeed.com/articles/pandas-string-dtype-memory/ is about memory improvements. https://park.is/notebooks/comparing-pandas-string-dtypes/ does an in-depth comparison but doesn't really mention the difference in missing value semantics (it does mention missing values, but to show how the string dtype no longer converts missing values to their string repr, in contrast to astype(str) with object dtype). (Those were among the first hits for blog posts in a Google search on "pandas string dtype".)

Dr-Irv commented on May 27, 2024

Consider this code (using 2.2):

>>> s = pd.Series(["a", "b", "c"], dtype="string[pyarrow_numpy]")
>>> s
0    a
1    b
2    c
dtype: string
>>> s.str.len()
0    1
1    1
2    1
dtype: int64
>>> s2 = pd.Series(["a", "b", "c"], dtype="string")
>>> s2
0    a
1    b
2    c
dtype: string
>>> s2.str.len()
0    1
1    1
2    1
dtype: Int64
>>> s.shift(1)
0    NaN
1      a
2      b
dtype: string
>>> s2.shift(1)
0    <NA>
1       a
2       b
dtype: string
>>> s2.shift(1).str.len()
0    <NA>
1       1
2       1
dtype: Int64
>>> s.shift(1).str.len()
0    NaN
1    1.0
2    1.0
dtype: float64

If we adopt your proposal, then if you have np.nan in your series of strings, and take the length, you get a float series. I've found this annoying in the past.

I'm not saying this is a reason to not adopt this proposal, but just wanted to point out this behavior.

Dr-Irv commented on May 27, 2024

For StringDtype("pyarrow", na_value=pd.NA) and StringDtype("python", na_value=pd.NA), can we add the alias "String"

Also, for StringDtype, is the first argument required? Could you just do StringDtype(na_value=pd.NA), and then it will use pyarrow if installed, otherwise python?
