Comments (21)
Overall, this is a great approach. One issue with this is that the result of isna()
now becomes a placeholder type, not an array. That means you can't do df[isna(df[:colA]), :]
, at least without adding another getindex method to DataFrames.
from dataarrays.jl.
Why is isna
a placeholder now? I'm proposing just returning the "true" values of the na
field instead of returning a copy of them.
from dataarrays.jl.
Got it. I misunderstood what you were planning to return.
On Sunday, February 2, 2014, John Myles White [email protected]
wrote:
Why is isna a placeholder now? I'm proposing just returning the "true"
values of the na field instead of returning a copy of them.Reply to this email directly or view it on GitHubhttps://github.com//issues/71#issuecomment-33903565
.
from dataarrays.jl.
To make sure everybody's on the same page here: this change would make DataArrays mutable in a potentially dangerous way, since you could modify one of the underlying values array and/or missingness mask without modifying the other.
from dataarrays.jl.
Could we do:
isna(da::DataArray, inds...) = getindex(da.na, inds...)
values(da::DataArray, inds...) = getindex(da.data, inds...)
Seems like this would be less dangerous, and also more easily adapted to deal with PooledDataArrays.
from dataarrays.jl.
That's a way better idea than mine.
from dataarrays.jl.
One of the things that troubles me about this approach (which I think is still such an improvement over our current interface that we should move forward with it regardless) is it makes getindex and setindex! less symmetric.
from dataarrays.jl.
I see these not as desirable parts of our API, but as stopgap solutions until Julia does a better job of handling union types. We don't have a type inference problem for setindex!
, so it makes sense for it to be asymmetric.
from dataarrays.jl.
Agreed.
from dataarrays.jl.
Spent some time discussing this exact issue (re. compilation of R) with Duncan Temple Lang the other day. If we manage to find a solution to nominal/ordinal variables that allows us to rid of PDA's, it might be nice to introduce a @notna
macro that lets us write things like:
da = @data([1, 2, 3, NA, 5, NA])
s = 0
for i in 1:length(da)
if !isna(da, i)
@notna begin
s += (da[i] * da[i]) / sin(da[i])
end
end
end
This could then get expanded into some trivial function call like getdata
that is exactly getindex
for most types and is getindex(da.data)
for DataArrays.
from dataarrays.jl.
In 80068de, I set up @simonster's generalized definition of isna
. If there's no objection, I can also go ahead and set up a definition for values
, which should maybe get called something less-overloaded like getdata
.
from dataarrays.jl.
Over time, I've come to change my mind about this issue. I now think that we will want, in the future, to implement option types as the representation of scalar values with potential missingness. We'd keep DataArrays around, but have all scalar indexing into them return wrapped option types. This would solve all of the type stability issues, although it requires moving our idioms further from R's.
from dataarrays.jl.
If I'm understanding you correctly, with this change we'd have to give up on passing DataArrays to functions that accept AbstractArrays. This would seem to move our idioms further from Julia's as well, and I'm not sure that's worth it. The generality is often useful despite the performance cost.
An alternative approach might be to create a macro that converts getindex
calls to getindex_wrapped
, which could return a wrapper type.
from dataarrays.jl.
Why would we have to give up on AbstractArray
? Under the hypothetical option type, you could even get rid of DataArray and just use Array{Option{T}}
. In practice, DataArray{T}
would behave like Array{Option{T}}
except that it would store things in a different format.
from dataarrays.jl.
We could have DataArray{T} <: AbstractArray{Option{T}}
, but we'd have to figure out how to delegate methods to the contained element for that to be useful. (This is tractable for arithmetic operators in Base, but maybe not so tractable for package code.) Even if we can make that work, we can't have Option{T<:S} <: S
, so a method that takes a Number
wouldn't take an Option{Number}
, and a method that takes an AbstractArray{T<:Number}
wouldn't take an DataArray{T<:Number}
(although maybe we could define NumberOption{T<:Number} <: Number
and StringOption{T<:String} <: String
and give up on other types?).
I think we could probably make the most common idioms work, but other code that works fine with Arrays and currently works slowly with DataArrays would not work at all. I'm still hoping we'll get better handling of union types at the compiler level...
from dataarrays.jl.
I'd like to have a longer chat with the folks working on the compiler about their opinions on this topic. I feel like working with Union
types isn't the ideal long run solution, because we're putting so much effort into teaching people to avoid Union
types in the parts of their code that aren't interacting with DataArrays.
From my perspective, I think we'd want to maintain our current assumption that DataArray{T} <: AbstractArray{T}
. I don't think we'd want to have Option{T} <: T
. In my mind, the virtue of using Option
is that it makes most usage of a potentially NA
value an error, which I like. If you want to apply foo(x::Number)
to Option{Float64}
, you'd need to resolve the option's uncertainty using a get
function that produces an actual number. Otherwise, you'd immediately get a fatal error -- either a no method error or an error about a missing value. As such, you wouldn't need to define functions to work on Option
.
To elaborate on that: my thinking is that you'd want to start with the simplest possible implementation of Option
, which would just act as a wrapper around values that requires people to decide how NA
should be treated before you could treat the Option
object as a value. Specifically, we'd implement get
, which would allow you to deference an object as long as it's not NA
. If the value was missing, you'd immediately get an NAException
when calling get
.
In cases in which NA
could be replaced by a value like NaN
and produce reasonable results, you'd use get(o, NaN)
as a way of getting the value or replacing a missing value with an acceptable default. In that case, you'd be able to get all the benefits of languages like R and Python, which use NaN
for NA
for speed. We'd still be slower, because of branches, but we'd make it easier to write code that runs quickly.
Stefan, Jeff and the rest are visiting my office on Friday. I'm planning to spend some time chatting with them about this then.
from dataarrays.jl.
This makes much sense as this would make DataArray
s just a special case of a wider feature which would be supported an optimized by Julia everywhere. But there's still the problem that an Option
type would need to store two fields: the actual value, and whether the value is missing or not. Thus, for efficiency, DataArray
s would not be simply an array of Option
objects, but remain stored as they currently are. So I'm not sure how this simplifies the implementation, nor how it allows it to run faster...
The other possibility is to allow the compiler to mark that Option
objects are NULL/NA in the same memory structure as non-NULL values, i.e. like using NaN
payloads for floats, etc. Then DataArray
s could just be arrays of Option
s and still be memory-efficient, and Julia would take care of optimizing code, knowing it can throw an error whenever a NULL/NA is accessed.
from dataarrays.jl.
Yes, we would keep DataArray
as is.
The speedup comes from code like the following:
s, n = 0.0, 0
for i in 1:length(da)
x = da[i]
if !isna(x)
s += get(x)
n += 1
end
end
This code has no type-instability, which makes it much easier for Julia to optimize under the assumption that s
will always be a Float64
. Right now, you can never index into a DataArray without suffering the cost of type-instability because da[i]
always produces Union(NAtype, Float64)
as the output's type.
The sentinel values approach used by R is interesting, but doesn't generalize to arbitrary types. It's much easier to allow people to do something like:
s = 0.0
for i in 1:length(da)
s += get(da[i], NaN)
end
In cases where this is a sensible default, this lets people choose it as appropriate for their problem. Trying to invent a sensible default for all problems doesn't work.
from dataarrays.jl.
@johnmyleswhite But wouldn't x = da[i]
incur an overhead given that the actual value and its missingness would have to be retrieved from the two fields of da
, and an Option
type be constructed from that? If it can be made efficient then indeed it would be a good solution.
(I don't really agree that "sentinel values" do not generalize to arbitrary types: it's mostly a problem for basic types like integers. There are many ways of storing a NULL/NA value in more complex types, and as most of them are just composite types, one just needs to make one of its fields an Option
and make it NULL/NA to mark that the whole type is NULL/NA.)
from dataarrays.jl.
The construction of an immutable type has almost no cost.
from dataarrays.jl.
AFAIK, the reason we teach people to avoid union types in parts of their code not interacting with DataArrays is that they are slow, not that there is anything inherently wrong with them. If we didn't want people to use them at all, I don't think they'd exist at all. There's a difference between what is presently fast and what can be made fast. I would suggest that we design around the latter and provide (hopefully temporary) workarounds for the former.
Optimizing union types is probably a decent amount of work, but it's something that JITs for dynamic languages generally do (see polymorphic inline caching). This particular case should be reasonably simple since the possible types can still be statically inferred. I believe what is needed for good performance is 1) to pass union types of immutables on the stack or in registers instead of allocating them on the heap and 2) to emit two direct calls with a branch instead of calling via jl_apply_generic
.
I'm not sure if we can make DataArray{T} <: AbstractArray{T}
if indexing actually returns Option{T}
. It is already a bit weird that we don't have DataArray{T} <: AbstractArray{Union(T,NAtype)}
, and some code definitely breaks because of this. I'm also not convinced that it's helpful to break generic code on DataArrays because elements could be NA even if no NAs are actually present.
The Option
API seems reasonable if the user knows that they will be operating on a DataArray with NA values, but it seems less reasonable if one wants functions that accept arbitrary AbstactArrays to work on DataArrays as well, for which the current strategy seems best. If we are focusing on what can be made fast and not what is presently fast, I think we should keep the current indexing behavior, but maybe also provide an indexing API that returns Option
types for the time being.
Here is the set of things I think we need for indexing a DataArray to be fast:
- Scalar indexing operations on a DataArray should not allocate on the heap
- Operations on the return value of indexing into a DataArray should not incur dynamic dispatch overhead
- Scalar indexing operations on a DataArray should be fully inlined
- Checks for undefined references for
data
andna
should be removed or hoisted outside of the loop
(1) and (2) can be satisfied by either the Option
API or at the compiler level. (3) and (4) are problems for either approach, although both are on the list at JuliaLang/julia#3440.
from dataarrays.jl.
Related Issues (20)
- #undef/uninitialised values from unique on PooledDataArray
- Get rid of @ngenerate
- Missing DataArray{T}(dims) constructors HOT 1
- MethodError in pooleddataarray.jl in Julia 0.6 HOT 7
- Inference failure/indexing issues HOT 14
- View on DataArray makes copy HOT 1
- broadcast() inference issue with mean() and > HOT 5
- In v 0.7 precompile fails with typeassert HOT 1
- `safe_mapslices` at test/reducedim.jl is broken on 0.7 HOT 2
- Can't load DataArrays HOT 9
- Weighted mean broken for matrices HOT 1
- Avoid copying input Array{Union{T, Null}} in DataArray{T} constructor
- getindex broken when using multidimensional index of DataArray{Bool}
- Division DataArray{Int} / Int results in Array{Any} HOT 2
- doesn't load on julia v0.7 because of `printf` HOT 1
- PooledDataArray depreciated? HOT 1
- Julia v0.7 compatibility HOT 9
- "UndefVarError: centralizedabs2fun not defined" while precompiling HOT 3
- Julia-1.0.0 Compile Error HOT 3
- Package compatibility caps
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dataarrays.jl.