One aspect of missing data that JuliaData does not support is dense arrays with missin

Regarding <a class="user-mention notranslate" data-hovercard-type="user" data-hovercar

RFC: Dense arrays with missing data,about juliadata/dataframes.jl

Comments (14)

tshort commented on May 16, 2024

I like this idea. Making DataVec more general will make it more useful. I strongly suggest making DataArray inherit from AbstractArray and DataVec inherit from AbstractVector (issue #23). This will go a long way towards making DataVecs and DataArrays usable. I would not keep DataVec separate but make it a DataArray{T,1} as you suggest. Another option to consider is making the mask a BitArray (issue #3).

Also, if you use bitstypes capable of supporting NA's (issue #45), you have arrays with NA support now.

As to float arrays with missing data, I like the idea. Harlan and Stefan are resistant, as you can see from the commentary on issue #45. For floating point values, NaN's don't have completely correct semantics for missing data. The one area of difference is comparisons. Comparisons involving NaN always return false and not NaN (because booleans don't have a concept of NA or NaN). Even so, NaN's are usable as NA's. Pandas (python) uses NaN's as NA's despite this "feature". You just have to be aware and check for NA conditions in comparisons (which you normally have to do anyway). Having a bitstype for floats that supports NA's (as an NaN) gets around this by building the check for NA conditions into comparisons.
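The comparison behaviour described above can be checked directly. This is a minimal sketch in current Julia syntax, not code from the thread; the helper name `safe_less` is hypothetical and only illustrates what "building the NA check into the comparison" could look like.

```julia
# IEEE 754 comparisons involving NaN evaluate to false, not to a "missing" value.
x = NaN

nan_equals_itself    = (x == x)    # false
nan_less_than_one    = (x < 1.0)   # false
nan_greater_than_one = (x > 1.0)   # false

# Hypothetical helper: returns `nothing` when either operand is NaN,
# so the caller has to handle the missing case explicitly.
safe_less(a::Float64, b::Float64) = (isnan(a) || isnan(b)) ? nothing : a < b
```

With a plain `<`, a NaN operand silently falls into the "false" branch; with the helper, the missing case is visible to the caller.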

from dataframes.jl.

HarlanH commented on May 16, 2024

In general, I'm definitely in support of a data type for floating-point matrices (and/or higher-dimensional Float arrays) with NA implemented by NaN payload, presumably by a bits type with appropriate conversions, as Tom suggests. I don't see any reason why that type can't inherit from AbstractArray/Vector. Thanks for your efforts here!

But DataFrames are semantically different from a DataMatrix/DataArray, and I strongly feel that there should be a single globally-useful implementation of NAs for the DataFrame type, and trying to push the round NaN peg into that square hole is not going to end up being easy for users (or package developers) to work with.

For now, let's keep the code in this issue separate from the existing DataVec/DataFrame types. Definitely re-use Indexes and other ideas as you can, but let's treat this as a separate "JuliaData" type for working with separate types of data.

I sure wish I had more time to work on JuliaData! Maybe in a week or two...


nfoti commented on May 16, 2024

Thanks for the comments.

Making DataArray inherit from AbstractArray does seem a natural thing to do. I plan on keeping DataVec separate from DataArray for now. I was just suggesting that if a 1d DataArray is equivalent (both in syntax and semantics) to a current DataVec once everything is implemented, some duplicate code could be removed. I agree that a BitArray is the most compact type for a mask. I am still working out the best way to return a subscripted version (I'm assuming BitArrays cannot be multidimensional).

As for float arrays I don't think the NaN comparison issue is really an issue as long as it's documented that you should only compare non-NaN elements.

I have implemented a few functions for float arrays that allow skipping NaNs via a Bool argument. I have run into problems implementing var because there are so many versions with Bool flags already. Without keyword arguments to functions it is very difficult to add a new flag. I am now seeing why Matlab named their functions nanvar, etc. What are your opinions on the naming? I'm tempted to go with nanmean, nanvar, nansum, etc. rather than adding flags for consistency of the interface. If there are ever named function arguments then we could consider a "skipna" argument.

Thanks.


tshort commented on May 16, 2024

In a quick glance, it looks like you can have multidimensional BitArrays.
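A short sketch of what a multidimensional BitArray mask could look like, written in current Julia syntax. The variable names and the `masked_sum` function are assumptions for illustration, not part of JuliaData.

```julia
# A Bool mask with the same shape as the data; `falses` returns a BitArray.
data = [1.0 2.0 3.0; 4.0 5.0 6.0]
mask = falses(size(data))   # 2x3 BitArray{2}
mask[2, 3] = true           # mark element (2, 3) as missing

# Hypothetical reduction: sum only the entries whose mask bit is not set.
function masked_sum(data::AbstractArray{Float64}, mask::BitArray)
    total = 0.0
    for i in eachindex(data, mask)
        if !mask[i]
            total += data[i]
        end
    end
    return total
end
```

Here `masked_sum(data, mask)` skips the one masked entry and adds up the remaining five.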

As far as NA's or NaN's and functions that work on them, I don't really like nanmean, nanvar, etc. I'd rather see something like:

mean(naFilter(x)) or mean(naReplace(x, -1))

In this case, naFilter doesn't actually filter; it just sets up a type indicating that mean should skip NA's. Then, you just need to define the method to work with that type. For DataVec's, that's done by setting the filter flag. Then, mean(dv::DataVec) will skip over NA's. For arrays with NA's as NaN's, you can set up a type that basically just holds the data (for naReplace, it'd also need to hold the replace value). For examples of this, see issue #40 and:

https://github.com/tshort/JuliaData/blob/floatNA/src/alternate_NA.jl

Some of that is commented out, but at least one of the functions worked at one time.
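The wrapper-type idea can be sketched as follows for plain Float64 arrays that use NaN as the missing marker. This is an illustration in current Julia syntax, not the code from the linked file; the type names `NaNFilter` and `NaNReplace` and the field name `replacement` are hypothetical.

```julia
using Statistics

# Hypothetical wrapper types: they only hold a reference to the array (no copy).
struct NaNFilter{A<:AbstractArray{Float64}}
    x::A
end

struct NaNReplace{A<:AbstractArray{Float64}}
    x::A
    replacement::Float64
end

# Mean over the non-NaN entries, computed in a single pass.
function Statistics.mean(f::NaNFilter)
    total = 0.0
    n = 0
    for v in f.x
        if !isnan(v)
            total += v
            n += 1
        end
    end
    return total / n   # NaN when every entry is NaN (0.0 / 0)
end

# Mean with every NaN entry replaced by a fixed value.
function Statistics.mean(r::NaNReplace)
    total = 0.0
    for v in r.x
        total += isnan(v) ? r.replacement : v
    end
    return total / length(r.x)
end
```

Usage then mirrors the syntax proposed above: `mean(NaNFilter(x))` skips NaN entries and `mean(NaNReplace(x, -1.0))` substitutes -1.0 for them.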


nfoti commented on May 16, 2024

Multidimensional BitArrays should make DataArray straightforward to implement (famous last words).

I'm not a fan of the nanfun family of functions and the mean(naFilter(x)) syntax is not my favorite either. I'm partial to something like mean(x, dim, skipna), however, we run into problems with var(x, dim, skipna) as there is already a version of var that takes an AbstractArray, an Int and a Bool. However, I do think that the "functional" syntax mean(naFilter(x)) is useful and should be implemented if possible. One problem I see with it is computing a function (say the mean) of the rows skipping NaNs. I think the naFilter etc. should return a flattened iterator which makes computing functions along a dimension difficult. I guess computing the mean along the columns could be implemented as

means = zeros(size(x, 1))
for i = 1:size(x, 1)
    means[i] = mean(naFilter(x[i, :]))
end

but this is ugly. Also, it seems like implementing naFilter/naReplace for Float arrays requires a new type as those functions set up a new DataVec that references the same data with different flags and then the DataVec iterator functionality is used. I would like to avoid introducing a new type and messing with the functionality of Arrays.

We could use the options module to pass in a skipna option to allow syntax like mean(x, 2, skipna) (and switch to a keyword argument if they are ever implemented) as opposed to nanfun type functions. I will try this out at some point soon. If there are any ideas for making naFilter(x) work for a Float array or the mean(naFilter(x), 2) syntax work I'd love to hear them.
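Julia has since gained keyword arguments, so the skipna interface discussed above can be sketched directly. This is a minimal illustration under assumptions: the function names `nanaware_mean` and `slice_mean` are hypothetical, the result is returned as a plain vector, and only matrices are handled.

```julia
# Hypothetical helper: mean of one slice, optionally skipping NaN entries.
function slice_mean(s, skipna::Bool)
    total = 0.0
    n = 0
    for v in s
        if skipna && isnan(v)
            continue
        end
        total += v
        n += 1
    end
    return total / n
end

# Hypothetical function: `dims` and `skipna` are keyword arguments.
# dims = 1 gives one value per column, dims = 2 gives one value per row.
function nanaware_mean(x::AbstractMatrix{Float64}; dims::Int, skipna::Bool=false)
    dims in (1, 2) || throw(ArgumentError("dims must be 1 or 2"))
    slices = dims == 1 ? eachcol(x) : eachrow(x)
    return [slice_mean(s, skipna) for s in slices]
end
```

A call such as `nanaware_mean(x; dims=2, skipna=true)` then replaces the explicit loop over rows.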

I will push some code on my fork of JuliaData later today so you can see the current state of things.


johnmyleswhite commented on May 16, 2024

I don't see why your concern about var() is a problem: doesn't Julia's multiple dispatch system always select a method based on the maximally specific form when varying levels of generality exist? If DataArray <: AbstractArray, then var(a::DataArray, i::Int, b::Bool) will take precedence over var(a::AbstractArray, i::Int, b::Bool). Am I missing something?
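The dispatch rule in question can be illustrated with a small sketch in current Julia syntax. The type `MaskedVector` and the function `which_method` are hypothetical stand-ins for DataArray and var.

```julia
# Hypothetical array type with a missingness mask; a subtype of AbstractVector.
struct MaskedVector <: AbstractVector{Float64}
    data::Vector{Float64}
    mask::BitVector
end

# Minimal AbstractArray interface.
Base.size(v::MaskedVector) = size(v.data)
Base.getindex(v::MaskedVector, i::Int) = v.data[i]

# Two methods with the same arity: one generic, one specific.
which_method(a::AbstractArray, dim::Int, flag::Bool) = :generic
which_method(a::MaskedVector, dim::Int, flag::Bool) = :specific
```

Calling `which_method` with a `MaskedVector` selects the specific method, while a plain `Vector{Float64}` falls back to the generic one.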


tshort commented on May 16, 2024

Here is the code for sum that uses regular arrays from the link I provided above. mean would be similar. The function works directly on the array (that's what the first line does; A.x is the array). The NAFilter type is really just an indicator and doesn't add overhead (no data copies involved). I would not have naFilter return a flattened iterator; you might want to do rowSums or something on a matrix, and that wouldn't work if it was flattened.

function sum(A::NAFilter)
    A = A.x             # unwrap: work directly on the underlying array
    v = 0.0             # running total
    for x in A
        if !isna(x)     # skip missing entries
            v += x
        end
    end
    v
end

For DataVecs, you could have a DataVec-specific method that could handle both the replace and the filter flags.


nfoti commented on May 16, 2024

Regarding @johnmyleswhite's comment, you're right, this is not a problem for DataArray. It is a problem for the special case of handling NaNs in Array{Float} types (orthogonal to DataArrays). I think the solution right now is that there won't be short versions of var for skipping NaNs. The syntax will be something like var(X, corrected, dim, skipna).

@tshort, I'm going to play around with your idea of sum(A::NAFilter).

Thanks.


HarlanH commented on May 16, 2024

I agree with Tom here. Although the naFilter/naReplace operations need
work, especially with DataFrames, they're very light-weight
performance-wise. And I do quite specifically like the functional syntax.
Also, if you haven't seen the Options module in extras/, take a look -- it
would be a reasonable way to deal with named arguments until such time as
the core language supports them.

On Mon, Aug 27, 2012 at 10:51 AM, Tom Short wrote:

Here is the code for sum that uses regular arrays from the link I provided
above. mean would be similar. The function works directly on the array
(that's what the first line does; A.x is the array). The NAFilter type is
really just an indicator and doesn't add overhead (no data copies
involved). I would not have naFilter return a flattened iterator; you might
want to do rowSums or something on a matrix, and that wouldn't work if it
was flattened.

function sum(A::NAFilter)
    A = A.x
    v = 0.0
    for x in A
        if !isna(x)
            v += x
        end
    end
    v
end

For DataVecs, you could have a DataVec-specific method that could handle
both the replace and the filter flags.

—
Reply to this email directly or view it on GitHub: https://github.com/HarlanH/JuliaData/issues/54#issuecomment-8058052


nfoti commented on May 16, 2024

I have pushed my current code implementing some functions to handle NaNs as missing data for Float arrays to the float-nan branch of my fork of JuliaData. Specifically, sum, prod, max, min and mean are implemented. It's preliminary code, and I'm sure there are obvious improvements. A few different interfaces are implemented, including a boolean flag skipna, functions named nansum, etc. and the interface discussed above, e.g. nansum(naFilter(x), 2). The naFilter approach was actually quite easy to implement and I think it is pretty lightweight. I'm also less opposed to it now that I see how general it is.

Feedback is appreciated.

Thanks.


tshort commented on May 16, 2024

Good stuff, nfoti,

I don't have time for much of a review, and I'll be out for the next week, but here are some quick comments:

Of all of the interfaces you tried, I think I still like mean(naFilter(x)) the best. If we get keyword arguments, then mean(x, skipna = true) is more attractive. nanmean(x), mean(x, true), and mean(x, @options skipna = true) are not as attractive to me.
In nanarray.jl, you define many methods based on StridedArrays. I think those can all be AbstractArrays. That's useful if you want a sparse array or some other array flavor.
In nanstats.jl, you define some of the NAFilter functions in terms of isnan. That will be slower than using a loop like your versions in nanarray.jl.


nfoti commented on May 16, 2024

Thanks for taking a look, there's no need for a thorough review yet.

I agree that of the options that are available now the mean(naFilter(x)) syntax is the nicest. I'll clean the other interfaces out of the code and implement as many of the statistical functions that make sense for missing data. If keyword arguments are ever implemented someone should come back to this and implement the skipna version.

You're right, the functions in nanarray.jl can probably be implemented with AbstractArray rather than StridedArray. I think I just followed what array.jl does.

Good point with isnan, I totally wrote off the fact that isnan(A) has to make a new array. I've been doing a lot of Matlab lately and went on isnan autopilot. The implementations in there right now are just proof-of-concept, now that we've picked an interface I can go through and implement them all as loops.

Thanks again.

Nick


nfoti commented on May 16, 2024

I've pushed some new code (float-nan branch) that only implements the naFilter interface. Operations like sum(naFilter(x)) are only slightly slower than sum(x) (with no missing data) and about 4x faster than sum(x[!isnan(x)]). However, the versions that work on a dimension (or a Dimspec), e.g. mean(naFilter(x), 2), are about 15x slower than sum(x, 2). I'm assuming this is because the nanplus, etc. functions that I use for the reductions implementing those operations have a lot of overhead.
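The two approaches being compared can be sketched side by side. This is an illustration in current Julia syntax, where the expression `x[!isnan(x)]` is written with broadcasting dots; the function names are hypothetical and no timings are claimed.

```julia
# Single pass over the data, no temporary array.
function sum_skipnan(x::AbstractArray{Float64})
    total = 0.0
    for v in x
        if !isnan(v)
            total += v
        end
    end
    return total
end

# Builds a Bool mask and a filtered copy of the data before summing.
sum_masked_copy(x::AbstractArray{Float64}) = sum(x[.!isnan.(x)])

x = [1.0, NaN, 2.5, NaN, 4.0]
```

Both functions return the same value for `x`; the difference is that the second allocates a mask and a filtered copy on every call, which is the overhead the loop version avoids.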


johnmyleswhite commented on May 16, 2024

Closed by b95ee3f

