
Comments (9)

davidagold commented on September 25, 2024

I don't know of any public discussion about it, but this seems as good a place as any to ask the question -- let's just make sure to cross-reference it from more visible threads in the future.

Behind the decision to set the default to false is the sense that eliding null values ought to be undertaken explicitly and with care. Without further information about whether the distribution of nulls is associated with patterns in the covariates, there is no reason to think that operations over just the present values will give valid results. Arguably, the default setting for skipnull ought to reflect that "default" skepticism about the validity of such operations.

But we're happy to hear arguments for the other default, too.


matthieugomez commented on September 25, 2024

I tried to lay out some arguments:

  • I don't really buy the skepticism argument. If we took it entirely seriously, why implement a skipnull = true option at all? And since a skipnull = true option does exist, no function is truly agnostic about missing values; making that stance the default behavior makes it explicit, rather than hiding it one level deeper.

  • When working with dataframes, NAs are the rule, not the exception, and skipnull = true is just more convenient. As a simple example, the output of describe on a dataframe should return something helpful. If skipnull = false is the default, it will return something like NA, NA, NA, NA. In contrast, the "describe" commands in SQL, SAS, pandas, and Stata all return a useful set of summary statistics over the non-missing values (along with a count of missing values). As much as I like coming up with theoretical arguments, the truth is that, as an applied researcher, I spend a lot of time with datasets full of missing values, and getting simple summary statistics by default really makes life easier.

    In particular, we are talking about methods for a type defined to contain missing values. Calling a method on this type should return something useful by default, not another missing value. R is the only language for data analysis that defaults to na.rm = false, and that is probably because R is the only one that makes no distinction between vectors that may contain missing values and those that may not.

  • I don't even see why the skipnull = false behavior is needed to begin with. Why would anyone ever want it? If someone did need it, wouldn't it be cleaner to write something like if any(isnull, x) return NA else mean(x) (see the sketch after this list)? skipnull = false strikes me as needlessly paternalistic: if it's a reminder that missing values should be handled with care, it's a bad one. In fact, SQL, pandas, SAS, and Stata don't even have a skipna = false option. Some functions accept an NA-related option where there are two sensible ways to handle NAs (for instance, should NAs be treated as distinct values?), hence the missing option in Stata and NOMISS in SAS.

  • Humans are lazy:
    -- If skipnull = false is the default behavior, some developers may never implement a skipnull = true option. For instance, there is no way to compute the correlation between two vectors with missing values in R, but there is one in Stata, pandas, and SAS.
    -- In R, functions that implement na.rm = TRUE tend to do so inefficiently: roughly 90% of R functions (including in the core language) handle missing values through the line if (na.rm) x <- x[!is.na(x)]. Looking at the code in your repository, I think you'd agree that functions on NullableArray should be written with the skipnull = true case in mind first.
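
For concreteness, here is the explicit-check pattern from the third bullet as a minimal sketch, written with Julia's missing for brevity rather than this package's Nullable machinery; mean_or_missing and mean_skipping are made-up names for illustration, not API from any package:

    using Statistics

    # Opt-in propagation: return missing if any value is missing, else the mean.
    # This is the behavior skipnull = false provides implicitly.
    mean_or_missing(x) = any(ismissing, x) ? missing : mean(x)

    # The skip-by-default behavior argued for above:
    mean_skipping(x) = mean(skipmissing(x))

    x = [1.0, missing, 3.0]
    mean_or_missing(x)   # missing
    mean_skipping(x)     # 2.0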

The way Julia handles weights is already vastly superior to R's. I really want Julia to make it easy to work with missing values, and I think defaulting to skipnull = true is the right step in that direction.

I'm also adding pointers to the default behavior of mean in each language:
pandas: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
SQL: http://www.ats.ucla.edu/stat/sas/library/nesug98/p004.pdf
SAS: http://www.ats.ucla.edu/stat/sas/modules/missing.htm
Stata: http://www.ats.ucla.edu/stat/stata/modules/missing.html

Thanks for welcoming the discussion!


davidagold commented on September 25, 2024

@johnmyleswhite @quinnj @nalimilan any thoughts on this discussion?

@matthieugomez I hear your concerns about user-friendly default behavior, and I think sufficient evidence of a broad consensus could sway us on this point.

Another reason for the present default behavior is that, prima facie, it is more in line with the attitude taken towards lifted functions in general (i.e., not just functions whose primary purpose is stats-related).


alyst commented on September 25, 2024

As noted by @matthieugomez, skipnull=false is rarely of any practical use; it's rather a sort of "diagnostic" for missing values that have crawled into unexpected places. In R, though, there's no distinction between nullable and non-nullable arrays, so na.rm=false might make some sense there.
In Julia, however, an attempt to set an Array element to null throws an exception, and that happens much earlier than any statistic is attempted (illustrated below). If the user has chosen NullableArray over Array, they have already explicitly asserted the presence of NAs; there is no need to acknowledge it a second time by passing skipnull=true.
AFAIU, importing tables from files into data frames is the only place where the choice between NullableArray and Array might happen without the user. Import functions could default to a plain Array when possible; then any transformation (e.g. an outer join) that would generate nulls will throw an exception (and that would be so much nicer than debugging the source of NAs in R!).
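
To make this concrete, a minimal sketch using Julia's missing to stand in for null (not code from this package):

    x = [1, 2, 3]                      # a plain Vector{Int}; no room for missing
    x[2] = missing                     # throws a MethodError at assignment time,
                                       # long before any statistic is computed
    y = Union{Int, Missing}[1, 2, 3]   # the element type opted in explicitly
    y[2] = missing                     # fine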


johnmyleswhite commented on September 25, 2024

FWIW, I think R's strategy (effectively skipnull = false) is the right default for doing rigorous applied statistics. skipnull = true is generally equivalent to assuming that nulls reflect values that are missing completely at random, which is almost never true in applied work.
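
A toy sketch of why this matters (illustrative only, using Julia's missing-era stdlib): if high values are more likely to be missing, the mean over the observed values is biased downward, and silently skipping nulls hides the problem.

    using Statistics

    x = randn(100_000) .+ 5.0          # true mean ≈ 5
    # Missing-not-at-random: drop 80% of the values above 5.
    observed = [v for v in x if v <= 5 || rand() > 0.8]
    mean(x)          # ≈ 5.0
    mean(observed)   # ≈ 4.5, biased, and nothing warns you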


alyst commented on September 25, 2024

I think that when the user calls the standard mean(), std(), etc., he or she is already asserting the naive "missing at random" model. If more rigorous statistics are required (including, but not limited to, missing-data modeling), other functions should be called.

Of course, one very frequent case of "completely not at random" NAs is when a null simply means 0, and the user might have that in mind when calling mean(). Defaulting to skipnull=false would help them find that bug, but again, it would not help them compute what they actually want.


alyst commented on September 25, 2024

Just an idea for how null checking could be made more useful (a sketch follows the list): replace skipnull=true/false with ifnull=:skip/:null/:error/:zero:

  • :skip is equivalent to skipnull=true
  • :null is skipnull=false
  • :error throws an error if null is encountered
  • :zero treats null as 0
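
A hypothetical sketch of that interface, written with Julia's missing for brevity; none of these names exist in NullableArrays.jl, they only illustrate the four modes:

    using Statistics

    function mean_ifnull(x; ifnull::Symbol = :null)
        if ifnull == :skip                    # skipnull = true
            mean(skipmissing(x))
        elseif ifnull == :null                # skipnull = false
            any(ismissing, x) ? missing : mean(x)
        elseif ifnull == :error               # fail fast on any null
            any(ismissing, x) && error("null value encountered")
            mean(x)
        elseif ifnull == :zero                # treat null as 0
            mean(coalesce.(x, 0))
        else
            throw(ArgumentError("unknown ifnull mode: $ifnull"))
        end
    end

    mean_ifnull([1.0, missing, 3.0], ifnull = :skip)   # 2.0
    mean_ifnull([1.0, missing, 3.0], ifnull = :zero)   # ≈ 1.33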


nalimilan commented on September 25, 2024

FWIW, I think this has been discussed in various places. One of them is JuliaStats/DataArrays.jl#39 (though the discussion about indexing should probably be kept separate as it opens yet another series of questions).

I fully agree with @matthieugomez's and @alyst's points above. People wouldn't use NullableArray if they didn't expect missing values in their data. The fact that Julia distinguishes Array from NullableArray, while introducing a bit more complexity than R, is also an opportunity to make Julia more user-friendly for common data-analysis work by not requiring those annoying na.rm=TRUE/skipnull=true incantations everywhere just to get things working.

@johnmyleswhite What kind of situation do you envision in which somebody would choose NullableArray, call mean on it, and not expect the result to be computed over the non-missing entries only? Of course, that result is correct only if missingness happens at random, but I don't think forcing people to type skipnull=true explicitly really increases the chances that they will be rigorous. A better strategy, I think, is to show the number of missing values, e.g. in the output of describe(), and to encourage people to explore their data with it instead of with a bare mean().
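
A minimal sketch of that kind of summary (describe_column is a made-up name, not this package's API, and the sketch uses Julia's missing rather than Nullable):

    using Statistics

    function describe_column(x)
        present = collect(skipmissing(x))
        (n = length(x),
         nmissing = count(ismissing, x),
         mean = mean(present),
         min = minimum(present),
         max = maximum(present))
    end

    describe_column([1.0, missing, 3.0])
    # (n = 3, nmissing = 1, mean = 2.0, min = 1.0, max = 3.0)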


johnmyleswhite commented on September 25, 2024

Instead of discussing this issue further, I would prefer that we focus on building a feature-complete, well-tested, and highly performant implementation of the design that David and I have already developed.

When that implementation is finished, it will be much easier to see which design decisions are most worth revisiting. Changing the default keyword argument of the aggregation functions from skipnull = false to skipnull = true will always remain a possibility, and it will require little effort to implement if I change my mind on this topic.

Given the generally lamentable state of our core data infra, I think we should focus our energy on more pressing issues.


