Comments (3)
I believe the code you are complaining about is in these lines (but it's worth double-checking):
pandas/pandas/core/_numba/kernels/min_max_.py
Line 106 in bef88ef
pandas/pandas/core/_numba/kernels/min_max_.py
Lines 49 to 52 in bef88ef
Let me explain how I think your issue report could be modified and why.
Scratch that
From my perspective, a nan must generate other nan, an aggregation of nan, must again generate nan
semantically: "An invalid value, cannot be computed, so a transformation of it should result again into an invalid value"
an aggregation (via groupby) of nan, should result into nan
We need to be precise in the quantification all vs any when dealing with aggregation or reduction.
NumPy follows "any nan implies nan":
np.array([0, 1, 2]).max() # 2 (no nan => no nan)
np.array([0, 1, np.inf]).max() # inf (no nan => no nan)
np.array([np.nan, 0, 1, np.inf]).max() # nan (some nan => nan)
(np.array([0, 1, np.inf]) / np.array([0, 1, np.inf])).max() # nan (some nan => nan)
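For contrast, NumPy also ships an explicit "skip nan" reduction, np.nanmax. The sketch below is not taken from the issue; it only puts the propagating and the skipping variants side by side on the same array, and the variable names are mine.

```python
import warnings

import numpy as np

values = np.array([np.nan, 0.0, 1.0, np.inf])

# Default reduction: a single nan in the input propagates to the result.
propagating_max = values.max()

# np.nanmax ignores nan entries instead of propagating them.
skipping_max = np.nanmax(values)

# With only nan inputs nothing is left to reduce: np.nanmax emits a
# RuntimeWarning and returns nan, so the warning is silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    all_nan_max = np.nanmax(np.array([np.nan, np.nan]))

print(propagating_max, skipping_max, all_nan_max)
```

So "any nan implies nan" is the default in NumPy, and skipping is something the caller has to opt into.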
Pandas follows "all NA implies NA":
pd.Series([0, 1, 2], dtype='Float64').max() # 2 (no NA => no NA)
pd.Series([0, 1, np.inf], dtype='Float64').max() # inf (no NA => no NA)
pd.Series([np.nan, 0, 1, np.inf], dtype='Float64').max() # inf (some NA =/=> NA)
pd.Series([np.nan, np.nan, np.nan], dtype='Float64').max() # <NA> (all NA => NA)
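The "all NA implies NA" rule is what the default skipna=True gives; passing skipna=False switches the same reduction to "any NA implies NA". A minimal sketch, assuming a nullable Float64 Series built with pd.NA directly so that it does not depend on how np.nan is converted at construction time:

```python
import numpy as np
import pandas as pd

# pd.NA is used directly, so no np.nan -> <NA> conversion is involved.
s = pd.Series([pd.NA, 0, 1, np.inf], dtype="Float64")

skipping_max = s.max()              # skipna=True is the default
strict_max = s.max(skipna=False)    # any <NA> makes the result <NA>

print(skipping_max, strict_max)
```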
To complicate the matter, Pandas sometimes treats NA and np.nan differently and sometimes does not. The maintainers are still deciding in #32265 (which you referenced) what exactly the semantics of NA and np.nan should be in Pandas. The consensus tends to be that NA is a missing value, while np.nan is a bad value. In most cases, missing values can simply be ignored, unlike bad values. This explains why a single bad value np.nan ruins the computation in NumPy, while a single missing value pd.NA does not do the same in Pandas.
Now, to complicate the matter even further, Pandas transforms np.nan into pd.NA:
s = pd.Series([np.nan, 0, 1], dtype="Float64")
s.max() # 1.0, because max([<NA>, 0, 1]) is 1
(s / s).max() # <NA>, because max([<NA>, np.nan, 1]) is np.nan which becomes <NA>
- In the first line, Pandas transforms np.nan into pd.NA (np.nan historically denoted a missing value in Pandas, before nullable arrays were introduced). So the Series is [<NA>, 0, 1].
- In the second line, as expected and as we saw before, a missing value is simply ignored: max([<NA>, 0, 1]) gives 1.
- In the third line, s / s becomes [<NA>, np.nan, 1], where 0 / 0, or np.nan, is a bad value, which must derail the aggregation, so max([<NA>, np.nan, 1]) gives np.nan. But this is not the end. For some reason, Pandas converts np.nan again to <NA>.
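To see which elements of s / s are missing and which are bad, one can iterate over the result and classify each value. This is only a rough sketch: it builds the Series with pd.NA directly instead of np.nan, and the classification printed for the 0 / 0 element may differ between Pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([pd.NA, 0, 1], dtype="Float64")
ratio = s / s

# Classify every element as missing (<NA>), bad (nan) or valid.
for position, value in enumerate(ratio):
    if value is pd.NA:
        kind = "missing"
    elif np.isnan(value):
        kind = "bad"
    else:
        kind = "valid"
    print(position, value, kind)
```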
The expected behaviour you propose, @glaucouri, would equate pd.NA to np.nan. I don't think the council of maintainers would support this. Therefore I suggest reframing your issue differently:
I misunderstood your suggestion initially. You indeed insist on treating np.nan as an invalid value consistently in aggregation functions. I personally care more about consistency, so here is another example of the supposed bug:
s = pd.Series([np.nan, 0, 1], dtype="Float64")
(s / s).max() # <NA>
(s / s).groupby([9, 9, 9]).max().iat[0] # 1.0
The last two lines were expected to give the same result (whatever it should be).
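A small helper makes this kind of comparison repeatable on other inputs. compare_max is a hypothetical name of mine, not a Pandas function; it puts every row into a single group so that both code paths reduce exactly the same values.

```python
import pandas as pd


def compare_max(series: pd.Series) -> tuple:
    """Return (Series.max(), groupby max) computed over the same values."""
    direct = series.max()
    single_group = [0] * len(series)  # every row falls into the same group
    grouped = series.groupby(single_group).max().iat[0]
    return direct, grouped


s = pd.Series([pd.NA, 0, 1], dtype="Float64")

# No 0 / 0 is involved here, so both paths are expected to agree.
direct, grouped = compare_max(s)
print(direct, grouped)

# The case discussed above; the output depends on the Pandas version.
print(compare_max(s / s))
```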
from pandas.
Thanks for the report, it looks like Series.max is not adhering to the default value skipna=True. Further investigations and PRs to fix are welcome!
take