Comments (9)
Yet another possible solution to the 3.0 string problems that @simonjayhawkins @jorisvandenbossche @phofl @MarcoGorelli @lithomas1 have been actively discussing. I think the main point of this discussion is that there would be no new string dtypes - we just manage pyarrow installation as an implementation detail.
from pandas.
I did not open a PR with everything else going on, but you can see an initial diff of this here:
If you check out that branch, some cherry-picked methods / accessors "work" (the return types are native nanopandas types; wrapping them properly still needs effort):
>>> import pandas as pd
>>> ser = pd.Series(["x", "aaa", "fooooo"], dtype="string[pyarrow]")
>>> ser.str.upper()
StringArray
["X", "AAA", "FOOOOO"]
>>> ser.str.len()
Int64Array
[1, 3, 6]
>>> ser.iloc[0]
'x'
>>> ser.iloc[1]
'aaa'
>>> ser.iloc[2]
'fooooo'
>>> ser.iloc[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/willayd/clones/pandas/pandas/core/indexing.py", line 1193, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/willayd/clones/pandas/pandas/core/indexing.py", line 1754, in _getitem_axis
    self._validate_integer(key, axis)
  File "/home/willayd/clones/pandas/pandas/core/indexing.py", line 1687, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
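The dispatch shown above can be sketched in plain Python. This is a minimal, hypothetical illustration of a `str` accessor forwarding to a vendored backend array and handing back the raw result (the "needs effort to wrap properly" step); `VendoredStringArray` and `StrAccessor` are made-up names, not nanopandas or pandas API:

```python
# Hypothetical sketch: a str accessor dispatching to a vendored backend.
class VendoredStringArray:
    """Stand-in for a nanopandas StringArray."""
    def __init__(self, values):
        self.values = values

    def upper(self):
        return VendoredStringArray([v.upper() for v in self.values])

    def len(self):
        return [len(v) for v in self.values]


class StrAccessor:
    def __init__(self, backend_array):
        self._arr = backend_array

    def upper(self):
        # the "wrap properly" step: real code would rebuild a pandas
        # Series here instead of returning the raw backend values
        return self._arr.upper().values

    def len(self):
        return self._arr.len()


acc = StrAccessor(VendoredStringArray(["x", "aaa", "fooooo"]))
print(acc.upper())  # ['X', 'AAA', 'FOOOOO']
print(acc.len())    # [1, 3, 6]
```

The point is only that the accessor layer, not the user-visible dtype, decides which backend does the work.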
I have not put any effort into optimizing nanopandas, but its out-of-the-box performance is already better than our status quo:
In [1]: import pandas as pd
In [2]: ser = pd.Series(["a", "bbbbb", "cc"] * 100_000)
In [3]: %timeit ser.str.len()
82.4 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit ser.str.upper()
52.8 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: ser = pd.Series(["a", "bbbbb", "cc"] * 100_000, dtype="string[pyarrow]")
In [6]: %timeit ser.str.len()
3.95 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit ser.str.upper()
26.7 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: import pyarrow
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[8], line 1
----> 1 import pyarrow
ModuleNotFoundError: No module named 'pyarrow'
...and just now I fixed things so that a Series can be built from nanopandas arrays. The bool / integer conversions are inefficient because they go nanopandas -> Python -> NumPy, but a faster path could be implemented upstream in nanopandas without a lot of effort:
In [1]: import pandas as pd
In [2]: ser = pd.Series(["a", "bbbbb", "cc", None], dtype="string[pyarrow]")
In [3]: ser.str.len()
Out[3]:
0       1
1       5
2       2
3    <NA>
dtype: Int64
In [4]: ser.str.isalnum()
Out[4]:
0     True
1     True
2     True
3     <NA>
dtype: boolean
In [5]: import pyarrow as pa
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[5], line 1
----> 1 import pyarrow as pa
ModuleNotFoundError: No module named 'pyarrow'
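The nanopandas -> Python -> NumPy inefficiency mentioned above can be sketched with stdlib types. `FakeInt64Array` is a hypothetical stand-in for a nanopandas Int64Array backed by a contiguous buffer; the real upstream fix would presumably expose that buffer directly (e.g. via the buffer protocol) rather than materializing Python objects:

```python
# Hypothetical sketch of the slow vs. direct conversion paths.
import array

class FakeInt64Array:
    """Stand-in for a nanopandas Int64Array over a raw int64 buffer."""
    def __init__(self, values):
        self._buf = array.array("q", values)  # contiguous int64 storage

    def to_pylist(self):
        # current path: every element becomes a Python int object
        return list(self._buf)


arr = FakeInt64Array([1, 5, 2])

# nanopandas -> Python -> NumPy: materializes n Python objects first
slow_path = arr.to_pylist()

# an upstream fix could hand the buffer over zero-copy instead,
# letting NumPy wrap it without iterating Python objects
fast_path = memoryview(arr._buf)

print(slow_path)           # [1, 5, 2]
print(fast_path.tolist())  # [1, 5, 2]
```

Both paths yield the same values; the difference is whether n intermediate Python objects get created along the way.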
Yet another possible solution to the 3.0 string problems
I personally agree with you that this is an interesting option to explore, but I would put it as "yet another possible solution for after 3.0" (that's how I also described it in the current PDEP text). At least for the current timeline of 3.0 (even if this is "a couple of months"), I don't think such a big change is realistic on such short notice.
I think the main point of this discussion is that there would be no new string dtypes - we just manage pyarrow installation as an implementation detail.
FWIW, this is not necessarily unique to this solution. Also for the object-dtype vs pyarrow backends, we could do this without having two dtype variants, but by just making this choice behind the scenes automatically. It is our choice how we implement this.
(of course for object dtype vs pyarrow, the difference is bigger because the stored memory is also different, not only the functions being called on it, but nothing technically prevents us from following the same approach)
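A minimal sketch of "making this choice behind the scenes": one public string dtype whose backend is picked automatically based on what is importable. The helper name `_select_string_backend` is hypothetical; pandas has no such function:

```python
# Hypothetical sketch: automatic backend selection at import time,
# so users see a single string dtype regardless of what is installed.
import importlib.util

def _select_string_backend() -> str:
    """Prefer pyarrow when importable, otherwise fall back to a
    vendored implementation (e.g. nanoarrow-based or object dtype)."""
    if importlib.util.find_spec("pyarrow") is not None:
        return "pyarrow"
    return "fallback"


BACKEND = _select_string_backend()
print(BACKEND)
```

With this approach the backend becomes an implementation detail rather than a user-facing dtype variant, which is the crux of the discussion above.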
Cool, thanks for clarifying. Just to be clear on my expectations: if we were interested in this, I would propose that we spend however much time we feel we need on it before releasing 3.0. One of the major points is to avoid yet another string dtype, but if we've already released 3.0 with one, then I don't know that this would be worth it.
Also for the object-dtype vs pyarrow backends, we could do this without having two dtype variants, but by just making this choice behind the scenes automatically
As PDEP 14 is written now, I don't think this is true, or it is at least misleading. Yes, we do have "string" today (and have for quite some time), so we could avoid adding a new data type by repurposing the old one and changing the NA sentinel, but I think that can easily yield more problems.
Maybe it is more accurate to say no new string dtypes and no repurposing / breakage of existing types
but if we've already released 3.0 with one then I don't know this would be worth it
AFAIU the main benefit of using nanoarrow would be to avoid falling back to an object-dtype based implementation (and so also give some performance and memory improvements in case pyarrow is not installed). That benefit is equally true before or after 3.0?
Maybe it is more accurate to say no new string dtypes and no repurposing / breakage of existing types
That is only true if it would use NA semantics, which I assume you are assuming here? But that is one of the main points being proposed by the PDEP: if we introduce a string dtype for 3.0, it will use NaN. And so if we were to introduce a nanoarrow-backed version for 3.0, as you propose here, it should IMO also use NaN. And at that point you have all the same issues / discussions about dtype variants.
AFAIU the main benefit of using nanoarrow would be to avoid falling back to an object-dtype based implementation (and so also give some performance and memory improvements in case pyarrow is not installed). That benefit is equally true before or after 3.0?
Yea, for sure. I think it's just the issue of there being so many possible string implementations, each with their own merits. I don't think it's worth adding one just because it offers some incremental value after 3.0; I think it being a solution to the 3.0 problem, without requiring any other changes, is the main draw.
That is only true if it would use NA semantics, which I assume you are assuming here?
Yea that's correct. This is totally separate from any NA / np.nan discussions.
if we introduce a string dtype for 3.0, it will use NaN.
That is a breaking change for dtype="string", so with that proposal we either just do that or go through a deprecation cycle. This has the advantage of requiring neither.
I think it being a solution to the 3.0 problem ...
...
This is totally separate from any NA / np.nan discussions.
In that sense, for me it's entirely not a solution for the 3.0 problem, because in my mind the 3.0 problem is that we need to live with NaN being the default sentinel ;)
Ah OK. Well, let's continue that conversation on the PDEP itself so we don't get too fragmented. But I appreciate the responses here so far.