
Comments (9)

davidagold commented on September 25, 2024

I don't know of any public discussion about it, but this seems as good a place as any to ask the question -- let's just make sure to cross-reference it from more visible threads in the future.

Behind the decision to set the default to false is the sense that eliding null values ought to be undertaken explicitly and with care. Without further information about whether the distribution of nulls is associated with patterns in the covariates, there is no reason to think that operations over just the present values will give valid results. Arguably, the default setting for skipnull ought to reflect that "default" skepticism about the validity of such operations.

But we're happy to hear arguments for the other default, too.


matthieugomez commented on September 25, 2024

I tried to lay out some arguments:

  • I don't really buy the skepticism argument. If we took it entirely seriously, why implement a skipnull = true option at all? And since a skipnull = true option does exist, no function is truly agnostic about missing values; making that stance the default behavior makes it explicit, rather than hiding it one level deeper.

  • When working with dataframes, NAs are the rule, not the exception, and skipnull = true is just more convenient. As a simple example, the output of describe on a dataframe should return something helpful. If skipnull = false is the default, it will return something like NA, NA, NA, NA. In contrast, the "describe" commands in SQL, SAS, pandas, and Stata all return a useful set of summary statistics over the non-missing values (along with a count of missing values). As much as I like coming up with theoretical arguments, the truth is that, as an applied researcher, I spend a lot of time with datasets full of missing values, and getting simple summary statistics by default really makes life easier.

    In particular, we are talking about methods for a type defined to contain missing values. Calling a method on this type should return something useful by default, not another missing value. R is the only language for data analysis that defaults to na.rm = false, and that is probably because R is the only one that makes no distinction between vectors that may contain missing values and those that may not.

  • I don't even see why the skipnull = false behavior is needed to begin with. Why would anyone ever want it? If someone did need it, wouldn't it be cleaner to write something like if any(isnull, x) return NA else mean(x) (see the sketch after this list)? skipnull = false strikes me as needlessly paternalistic: if it's a reminder that missing values should be handled with care, it's a bad one. In fact, SQL, pandas, SAS, and Stata don't even have a skipna = false option. Some functions accept an NA-related option where there are two sensible ways to handle NAs (for instance, should NAs be treated as distinct values?), hence the missing option in Stata and NOMISS in SAS.

  • Humans are lazy:
    -- If skipnull = false is the default behavior, some developers may never implement a skipnull = true option. For instance, there is no way to compute the correlation between two vectors with missing values in R, but there is one in Stata, pandas, and SAS.
    -- In R, functions that implement na.rm = TRUE tend to do so inefficiently: roughly 90% of R functions (including in the core language) handle missing values through the line if (na.rm) x <- x[!is.na(x)]. Looking at the code in your repository, I think you'd agree that functions on NullableArray should be written with the skipnull = true case in mind first.
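
For concreteness, here is the explicit-check pattern from the third bullet as a minimal sketch, written with Julia's missing for brevity rather than this package's Nullable machinery; mean_or_missing and mean_skipping are made-up names for illustration, not API from any package:

    using Statistics

    # Opt-in propagation: return missing if any value is missing, else the mean.
    # This is the behavior skipnull = false provides implicitly.
    mean_or_missing(x) = any(ismissing, x) ? missing : mean(x)

    # The skip-by-default behavior argued for above:
    mean_skipping(x) = mean(skipmissing(x))

    x = [1.0, missing, 3.0]
    mean_or_missing(x)   # missing
    mean_skipping(x)     # 2.0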

The way Julia handles weights is already vastly superior to R's. I really want Julia to make it easy to work with missing values, and I think defaulting to skipnull = true is the right step in that direction.

I'm also adding pointers to the default behavior of mean in each language:
pandas: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
SQL: http://www.ats.ucla.edu/stat/sas/library/nesug98/p004.pdf
SAS: http://www.ats.ucla.edu/stat/sas/modules/missing.htm
Stata: http://www.ats.ucla.edu/stat/stata/modules/missing.html

Thanks for welcoming the discussion!


davidagold commented on September 25, 2024

@johnmyleswhite @quinnj @nalimilan any thoughts on this discussion?

@matthieugomez I hear your concerns about user-friendly default behavior, and I think sufficient evidence of a broad consensus could sway us on this point.

Another reason for the present default behavior is that, prima facie, it is more in line with the attitude taken towards lifted functions in general (i.e., not just functions whose primary purpose is stats-related).


alyst commented on September 25, 2024

As noted by @matthieugomez, skipnull=false is rarely of any practical use; it's rather a sort of "diagnostic" for missing values that have crawled into unexpected places. In R, though, there's no distinction between nullable and non-nullable arrays, so na.rm=false might make some sense there.
In Julia, however, an attempt to set an Array element to null throws an exception, and that happens much earlier than any statistic is attempted (illustrated below). If the user has chosen NullableArray over Array, they have already explicitly asserted the presence of NAs; there is no need to acknowledge it a second time by passing skipnull=true.
AFAIU, importing tables from files into data frames is the only place where the choice between NullableArray and Array might happen without the user. Import functions could default to a plain Array when possible; then any transformation (e.g. an outer join) that would generate nulls will throw an exception (and that would be so much nicer than debugging the source of NAs in R!).
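
To make this concrete, a minimal sketch using Julia's missing to stand in for null (not code from this package):

    x = [1, 2, 3]                      # a plain Vector{Int}; no room for missing
    x[2] = missing                     # throws a MethodError at assignment time,
                                       # long before any statistic is computed
    y = Union{Int, Missing}[1, 2, 3]   # the element type opted in explicitly
    y[2] = missing                     # fine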


johnmyleswhite commented on September 25, 2024

FWIW, I think R's strategy (effectively skipnull = false) is the right default for doing rigorous applied statistics. skipnull = true is generally equivalent to assuming that nulls reflect values that are missing completely at random, which is almost never true in applied work.
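
A toy sketch of why this matters (illustrative only, using Julia's missing-era stdlib): if high values are more likely to be missing, the mean over the observed values is biased downward, and silently skipping nulls hides the problem.

    using Statistics

    x = randn(100_000) .+ 5.0          # true mean ≈ 5
    # Missing-not-at-random: drop 80% of the values above 5.
    observed = [v for v in x if v <= 5 || rand() > 0.8]
    mean(x)          # ≈ 5.0
    mean(observed)   # ≈ 4.5, biased, and nothing warns you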


alyst commented on September 25, 2024

I think that when the user calls the standard mean(), std(), etc., he or she is already asserting the naive "missing at random" model. If more rigorous statistics are required (including, but not limited to, missing-data modeling), other functions should be called.

Of course, one very frequent case of "completely not at random" NAs is when a null simply means 0, and the user might have that in mind when calling mean(). Defaulting to skipnull=false would help them find that bug, but again, it would not help them compute what they actually want.


alyst commented on September 25, 2024

Just an idea for how null checking could be made more useful (a sketch follows the list): replace skipnull=true/false with ifnull=:skip/:null/:error/:zero:

  • :skip is equivalent to skipnull=true
  • :null is skipnull=false
  • :error throws an error if null is encountered
  • :zero treats null as 0
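
A hypothetical sketch of that interface, written with Julia's missing for brevity; none of these names exist in NullableArrays.jl, they only illustrate the four modes:

    using Statistics

    function mean_ifnull(x; ifnull::Symbol = :null)
        if ifnull == :skip                    # skipnull = true
            mean(skipmissing(x))
        elseif ifnull == :null                # skipnull = false
            any(ismissing, x) ? missing : mean(x)
        elseif ifnull == :error               # fail fast on any null
            any(ismissing, x) && error("null value encountered")
            mean(x)
        elseif ifnull == :zero                # treat null as 0
            mean(coalesce.(x, 0))
        else
            throw(ArgumentError("unknown ifnull mode: $ifnull"))
        end
    end

    mean_ifnull([1.0, missing, 3.0], ifnull = :skip)   # 2.0
    mean_ifnull([1.0, missing, 3.0], ifnull = :zero)   # ≈ 1.33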


nalimilan commented on September 25, 2024

FWIW, I think this has been discussed in various places. One of them is JuliaStats/DataArrays.jl#39 (though the discussion about indexing should probably be kept separate as it opens yet another series of questions).

I fully agree with @matthieugomez's and @alyst's points above. People wouldn't use NullableArray if they didn't expect missing values in their data. The fact that Julia distinguishes Array from NullableArray, while introducing a bit more complexity than R, is also an opportunity to make Julia more user-friendly for common data-analysis work by not requiring those annoying na.rm=TRUE/skipnull=true incantations everywhere just to get things working.

@johnmyleswhite What kind of situation do you envision in which somebody would choose NullableArray, call mean on it, and not expect the result to be computed over the non-missing entries only? Of course, that result is correct only if missingness happens at random, but I don't think forcing people to type skipnull=true explicitly really increases the chances that they will be rigorous. A better strategy, I think, is to show the number of missing values, e.g. in the output of describe(), and to encourage people to explore their data with it instead of with a bare mean().
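
A minimal sketch of that kind of summary (describe_column is a made-up name, not this package's API, and the sketch uses Julia's missing rather than Nullable):

    using Statistics

    function describe_column(x)
        present = collect(skipmissing(x))
        (n = length(x),
         nmissing = count(ismissing, x),
         mean = mean(present),
         min = minimum(present),
         max = maximum(present))
    end

    describe_column([1.0, missing, 3.0])
    # (n = 3, nmissing = 1, mean = 2.0, min = 1.0, max = 3.0)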


johnmyleswhite commented on September 25, 2024

Instead of discussing this issue further, I would prefer that we focus on building a feature-complete, well-tested, and highly performant implementation of the design that David and I have already developed.

When that implementation is finished, it will be much easier to see which design decisions are most worth revisiting. Changing the default keyword argument of the aggregation functions from skipnull = false to skipnull = true will always remain a possibility, and it will require little effort to implement if I change my mind on this topic.

Given the generally lamentable state of our core data infra, I think we should focus our energy on more pressing issues.


