Comments (14)
I like this idea. Making DataVec more general will help make it more useful. I strongly suggest making DataArray inherit from AbstractArray and DataVec inherit from AbstractVector (issue #23). This will go a long way towards making DataVecs and DataArrays usable. I would not keep DataVec separate but make it a DataArray{T,1} as you suggest. Another option to consider is making the mask be a bitarray (issue #3).
Also, if you use bitstypes capable of supporting NA's (issue #45), you have arrays with NA support now.
As to float arrays with missing data, I like the idea. Harlan and Stefan are resistant, as you can see from the commentary on issue #45. For floating point values, NaN's don't have completely correct semantics for missing data. The one area of difference is comparisons. Comparisons involving NaN return false (except !=, which returns true) and not NaN (because booleans don't have a concept of NA or NaN). Even so, NaN's are usable as NA's. Pandas (python) uses NaN's as NA's despite this "feature". You just have to be aware and check for NA conditions in comparisons (which you normally have to do anyway). Having a bitstype for floats that supports NA's (as an NaN) gets around this by building the check for NA conditions into comparisons.
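To make the comparison point concrete, here is a minimal sketch of how NaN behaves in comparisons, written in current Julia syntax rather than the 2012 syntax used in this thread. The helper `na_less` is hypothetical and only illustrates the "check for NA first" pattern described above.

```julia
# IEEE 754 comparison semantics for NaN, checked with a plain Float64.
x = NaN

@assert !(x == x)    # equality with NaN is false, even against itself
@assert !(x < 1.0)   # ordered comparisons are false
@assert !(x > 1.0)
@assert x != x       # `!=` is the exception: it returns true
@assert isnan(x)     # the explicit check needed before comparing

# Hypothetical helper: a comparison that treats NaN as "missing"
# has to test for it first and return something other than a Bool.
na_less(a, b) = (isnan(a) || isnan(b)) ? nothing : a < b

@assert na_less(1.0, 2.0) === true
@assert na_less(NaN, 2.0) === nothing
```

Returning `nothing` is only one possible choice for the "unknown" result; a dedicated NA-aware float type, as suggested above, would fold this check into the comparison operators themselves.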
from dataframes.jl.
In general, I'm definitely in support of a data type for floating-point matrices (and/or higher-dimensional Float arrays) with NA implemented by NaN payload, presumably by a bits type with appropriate conversions, as Tom suggests. I don't see any reason why that type can't inherit from AbstractArray/Vector. Thanks for your efforts here!
But DataFrames are semantically different from a DataMatrix/DataArray, and I strongly feel that there should be a single globally-useful implementation of NAs for the DataFrame type, and trying to push the round NaN peg into that square hole is not going to end up being easy for users (or package developers) to work with.
For now, let's keep the code in this issue separate from the existing DataVec/DataFrame types. Definitely re-use Indexes and other ideas as you can, but let's treat this as a separate "JuliaData" type for working with separate types of data.
I sure wish I had more time to work on JuliaData! Maybe in a week or two...
from dataframes.jl.
Thanks for the comments.
Making DataArray inherit from AbstractArray does seem a natural thing to do. I plan on keeping DataVec separate from DataArray right now; I was just suggesting that if a 1-d DataArray is equivalent (both in syntax and semantics) to a current DataVec when everything is implemented, some duplicate code could be removed. I agree that a bitarray is the most compact type for a mask; I am still working out the best way to return a subscripted version (I'm assuming bitarrays cannot be multidimensional).
As for float arrays I don't think the NaN comparison issue is really an issue as long as it's documented that you should only compare non-NaN elements.
I have implemented a few functions for float arrays that allow skipping NaNs via a Bool argument. I have run into problems implementing var because there are so many versions with Bool flags already. Without keyword arguments to functions it is very difficult to add a new flag. I am now seeing why Matlab named their functions nanvar, etc. What are your opinions on the naming? I'm tempted to go with nanmean, nanvar, nansum, etc. rather than adding flags for consistency of the interface. If there are ever named function arguments then we could consider a "skipna" argument.
Thanks.
from dataframes.jl.
In a quick glance, it looks like you can have multidimensional BitArrays.
As far as NA's or NaN's and functions that work on them, I don't really like nanmean, nanvar, etc. I'd rather see something like:
mean(naFilter(x))
or mean(naReplace(x, -1))
In this case, naFilter doesn't actually filter; it just sets up a type indicating that mean should skip NA's. Then, you just need to define the method to work with that type. For DataVec's, that's done by setting the filter flag. Then, mean(dv::DataVec) will skip over NA's. For arrays with NA's as NaN's, you can set up a type that basically just holds the data (for naReplace, it'd also need to hold the replace value). For examples of this, see issue #40 and:
https://github.com/tshort/JuliaData/blob/floatNA/src/alternate_NA.jl
Some of that is commented out, but at least one of the functions worked at one time.
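As a rough illustration of the wrapper idea for plain float arrays, here is a sketch in current Julia syntax. The struct layouts, the `replacement` field name, and the locally defined `mean` are assumptions made for this example; they are not the code in the linked file, and a real package would extend the standard library's `mean` instead of defining its own.

```julia
# Hypothetical wrapper types: each holds a reference to the data (no copy).
struct NAFilter{A<:AbstractArray}
    x::A
end

struct NAReplace{A<:AbstractArray,T}
    x::A
    replacement::T   # value substituted for each NaN
end

naFilter(x) = NAFilter(x)
naReplace(x, r) = NAReplace(x, r)

# Local `mean` for this sketch only: average the non-NaN elements.
function mean(f::NAFilter)
    total = 0.0
    n = 0
    for v in f.x
        if !isnan(v)
            total += v
            n += 1
        end
    end
    return total / n   # NaN when every element is missing
end

# Average with each NaN replaced by the stored value.
function mean(r::NAReplace)
    total = 0.0
    for v in r.x
        total += isnan(v) ? r.replacement : v
    end
    return total / length(r.x)
end

x = [1.0, NaN, 3.0]
@assert mean(naFilter(x)) == 2.0        # (1 + 3) / 2
@assert mean(naReplace(x, -1)) == 1.0   # (1 - 1 + 3) / 3
```

The wrappers carry no data of their own, so the only cost of `mean(naFilter(x))` over `mean(x)` is the `isnan` check inside the loop.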
from dataframes.jl.
Multidimensional BitArrays should make DataArray straightforward to implement (famous last words).
I'm not a fan of the nanfun family of functions, and the mean(naFilter(x)) syntax is not my favorite either. I'm partial to something like mean(x, dim, skipna); however, we run into problems with var(x, dim, skipna), as there is already a version of var that takes an AbstractArray, an Int and a Bool. However, I do think that the "functional" syntax mean(naFilter(x)) is useful and should be implemented if possible. One problem I see with it is computing a function (say the mean) of the rows skipping NaNs. I think the naFilter etc. should return a flattened iterator, which makes computing functions along a dimension difficult. I guess computing the mean of each row could be implemented as
means = zeros(size(x, 1))   # one entry per row
for i = 1:size(x, 1)
    means[i] = mean(naFilter(x[i, :]))   # index with [], not ()
end
but this is ugly. Also, it seems like implementing naFilter/naReplace for Float arrays requires a new type, as those functions set up a new DataVec that references the same data with different flags, and then the DataVec iterator functionality is used. I would like to avoid introducing a new type and messing with the functionality of Arrays.
We could use the options module to pass in a skipna option to allow syntax like mean(x, 2, skipna) (and switch to a keyword argument if they are ever implemented) as opposed to nanfun-type functions. I will try this out at some point soon. If there are any ideas for making naFilter(x) work for a Float array or making the mean(naFilter(x), 2) syntax work, I'd love to hear them.
I will push some code on my fork of JuliaData later today so you can see the current state of things.
from dataframes.jl.
I don't see why your concern about var() is a problem: doesn't Julia's multiple dispatch system always select a method based on the maximally specific form when varying levels of generality exist? If DataArray <: AbstractArray, then var(a::DataArray, i::Int, b::Bool) will take precedence over var(a::AbstractArray, i::Int, b::Bool). Am I missing something?
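The dispatch behaviour described here can be checked with a small sketch in current Julia syntax. `ToyDataArray` is a hypothetical stand-in for DataArray, and `which_var` is a made-up function that only reports which method was selected; neither comes from the package.

```julia
# Hypothetical stand-in for DataArray: a thin wrapper that is an AbstractArray.
struct ToyDataArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end

# Minimal AbstractArray interface so the type is usable as an array.
Base.size(a::ToyDataArray) = size(a.data)
Base.getindex(a::ToyDataArray{T,N}, I::Vararg{Int,N}) where {T,N} = a.data[I...]

# Two methods with the same argument shape as the `var` methods discussed above.
which_var(a::AbstractArray, dim::Int, flag::Bool) = "AbstractArray method"
which_var(a::ToyDataArray, dim::Int, flag::Bool) = "ToyDataArray method"

plain = [1.0 2.0; 3.0 4.0]
wrapped = ToyDataArray(plain)

# The more specific signature wins for the wrapper type.
@assert which_var(wrapped, 1, true) == "ToyDataArray method"
@assert which_var(plain, 1, true) == "AbstractArray method"
```

This only helps when there is a distinct array type to dispatch on. For a plain float array holding NaNs there is no such type, so an extra Bool flag would collide with the existing signature.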
from dataframes.jl.
Here is the code for sum that uses regular arrays from the link I provided above. mean would be similar. The function works directly on the array (that's what the first line does; A.x is the array). The NAFilter type is really just an indicator and doesn't add overhead (no data copies involved). I would not have naFilter return a flattened iterator; you might want to do rowSums or something on a matrix, and that wouldn't work if it was flattened.
function sum(A::NAFilter)
    A = A.x          # unwrap: work directly on the underlying array
    v = 0.0
    for x in A
        if !isna(x)  # skip missing values
            v += x
        end
    end
    v
end
For DataVecs, you could have a DataVec-specific method that could handle both the replace and the filter flags.
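Because the wrapper keeps the array's shape, a row-wise reduction can be written directly against it. The following is a minimal sketch in current Julia syntax, assuming NaN marks a missing value; `rowsums` and the struct definition are hypothetical names for this example, not functions from the package.

```julia
# Hypothetical wrapper, as discussed in the thread: holds the array, no copy.
struct NAFilter{A<:AbstractArray}
    x::A
end

# Sum each row of the wrapped matrix, skipping NaN entries.
function rowsums(f::NAFilter{<:AbstractMatrix})
    A = f.x
    out = zeros(size(A, 1))
    # Iterate in column-major order, which matches Julia's memory layout.
    for j in axes(A, 2), i in axes(A, 1)
        v = A[i, j]
        if !isnan(v)
            out[i] += v
        end
    end
    return out
end

X = [1.0 NaN 3.0;
     NaN NaN 2.0]
@assert rowsums(NAFilter(X)) == [4.0, 2.0]
```

A flattened iterator would lose the row index `i`, which is why the reduction here works on the unwrapped matrix instead.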
from dataframes.jl.
Regarding @johnmyleswhite's comment, you're right, this is not a problem for DataArray. It is a problem for the special case of handling NaNs in Array{Float} types (orthogonal to DataArrays). I think the solution right now is that there won't be short versions of var for skipping NaNs. The syntax will be something like var(X, corrected, dim, skipna).
@tshort, I'm going to play around with your idea of sum(A::NAFilter).
Thanks.
from dataframes.jl.
I agree with Tom here. Although the naFilter/naReplace operations need
work, especially with DataFrames, they're very light-weight
performance-wise. And I do quite specifically like the functional syntax.
Also, if you haven't seen the Options module in extras/, take a look -- it
would be a reasonable way to deal with named arguments until such time as
the core language supports them.
from dataframes.jl.
I have pushed my current code implementing some functions to handle NaNs as missing data for Float arrays to the float-nan branch of my fork of JuliaData. Specifically, sum, prod, max, min and mean are implemented. It's preliminary code; I'm sure there are obvious improvements. A few different interfaces are implemented, including a boolean flag skipna, functions named nansum, etc., and the interface discussed above, e.g. nansum(naFilter(x), 2). The naFilter approach was actually quite easy to implement and I think is pretty lightweight. I'm also less opposed to it now that I see how general it is.
Feedback is appreciated.
Thanks.
from dataframes.jl.
Good stuff, nfoti,
I don't have time for much of a review, and I'll be out for the next week, but here are some quick comments:
- Of all of the interfaces you tried, I think I still like mean(naFilter(x)) the best. If we get keyword arguments, then mean(x, skipna = true) is more attractive. nanmean(x), mean(x, true), and mean(x, @options skipna = true) are not as attractive to me.
- In nanarray.jl, you define many methods based on StridedArrays. I think those can all be AbstractArrays. That's useful if you want a sparse array or some other array flavor.
- In nanstats.jl, you define some of the NAFilter functions in terms of isnan. That will be slower than using a loop like your versions in nanarray.jl.
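The difference between the two styles in the last point can be sketched as follows, in current Julia syntax (where the elementwise test is written `isnan.(x)` rather than `isnan(x)`). The function names `sum_masked` and `sum_loop` are made up for this illustration.

```julia
x = [1.0, NaN, 2.5, NaN, 4.0]

# Mask-based: `isnan.(x)` allocates a Bool mask, and the indexing
# expression allocates a second array holding the kept elements.
sum_masked(x) = sum(x[.!isnan.(x)])

# Loop-based: a single pass over the data with no temporary arrays.
function sum_loop(x)
    total = zero(eltype(x))
    for v in x
        if !isnan(v)
            total += v
        end
    end
    return total
end

@assert sum_masked(x) == 7.5
@assert sum_loop(x) == 7.5
```

Both give the same answer; the loop simply avoids the intermediate allocations, which is the likely source of the slowdown mentioned above.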
from dataframes.jl.
Thanks for taking a look, there's no need for a thorough review yet.
I agree that, of the options that are available now, the mean(naFilter(x)) syntax is the nicest. I'll clean the other interfaces out of the code and implement as many of the statistical functions as make sense for missing data. If keyword arguments are ever implemented, someone should come back to this and implement the skipna version.
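Julia releases after this thread did add keyword arguments, so a skipna-style interface can now be sketched as below. The function name `mean_skipna` and the `skipna` keyword are hypothetical, chosen to mirror the discussion; this is not an API of any package mentioned here.

```julia
# Hypothetical keyword-argument interface for skipping NaNs.
function mean_skipna(x; skipna::Bool=false)
    total = 0.0
    n = 0
    for v in x
        if skipna && isnan(v)
            continue   # treat NaN as missing and leave it out of the count
        end
        total += v
        n += 1
    end
    return total / n
end

x = [1.0, NaN, 3.0]
@assert isnan(mean_skipna(x))               # default: NaN propagates
@assert mean_skipna(x, skipna=true) == 2.0  # NaN skipped: (1 + 3) / 2
```

The keyword keeps the positional signature free, which sidesteps the clash with existing Bool flags that made a positional skipna argument awkward for var.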
You're right, the functions in nanarray.jl can probably be implemented with AbstractArray rather than StridedArray. I think I just followed what array.jl does.
Good point with isnan, I totally wrote off the fact that isnan(A) has to make a new array. I've been doing a lot of Matlab lately and went on isnan autopilot. The implementations in there right now are just proof-of-concept; now that we've picked an interface I can go through and implement them all as loops.
Thanks again.
Nick
from dataframes.jl.
I've pushed some new code (float-nan branch) that only implements the naFilter interface. Operations like sum(naFilter(x)) are only slightly slower than sum(x) (with no missing data) and about 4x faster than sum(x[!isnan(x)]). However, the versions that work on a dimension (or a Dimspec), e.g. mean(naFilter(x), 2), are about 15x slower than sum(x, 2). I'm assuming this is because the nanplus, etc. functions that I use for the reductions implementing those operations have a lot of overhead.
from dataframes.jl.
Closed by b95ee3f
from dataframes.jl.