juliastats / dataarrays.jl

DEPRECATED: Data structures that allow missing values

License: Other

Julia 100.00%

deprecated julia

dataarrays.jl's Issues

getindex bug keeping DataFrames from subsetting rows by expression.

#24 fixes this.

Interface for high-performance indexing operations

As mentioned in JuliaData/DataFrames.jl#523, we might want to expose an "unsafe" interface to the underlying values of a DataArray for those trying to do high-performance work:

Check if an entry is NA by making isna return a reference. We could also implement complex indexing for isna(da, inds...), but that seems like a lot of needless work.
Access the underlying values of a DataArray using something like values(da), which will have undefined values for any NA entries.

This would put us in a position to write code like:

dm = @data([1 2; 3 4])
isna(dm)[1, 1]
values(dm)[1, 1]

We could make this code very fast because it would be perfectly type stable. As a (probably too) radical step, we could even change getindex to implement the semantics of values.
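The idea can be sketched in current Julia without DataArrays itself. The `MaskedVector` type and the `namask` and `rawvalues` accessors below are hypothetical stand-ins for a DataArray and for the proposed `isna(da)` and `values(da)`; they are not part of any package.

```julia
# Hypothetical stand-in for a DataArray: raw values plus a parallel NA mask.
struct MaskedVector{T}
    values::Vector{T}   # slots flagged in `na` hold arbitrary filler
    na::BitVector       # true where the entry is NA
end

# Hypothetical accessors mirroring the proposed isna(da) and values(da).
namask(mv::MaskedVector) = mv.na
rawvalues(mv::MaskedVector) = mv.values

# Type-stable reduction: every read yields a Bool or a T, never a Union.
function sumskipna(mv::MaskedVector{T}) where {T}
    mask = namask(mv)
    vals = rawvalues(mv)
    s = zero(T)
    for i in eachindex(vals)
        if !mask[i]
            s += vals[i]
        end
    end
    return s
end

mv = MaskedVector([1, 2, 3, 4], BitVector([false, true, false, false]))
total = sumskipna(mv)   # the second entry is flagged NA and skipped
```

Because the mask and the values are read separately, the loop body never has to handle a value that might be either a number or NA.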

padNA() broken

julia> da = DataArray([1,2,3])
3-element DataArray{Int64,1}:
 1
 2
 3

julia> padNA(da, 1, 1)
5-element DataArray{Int64,1}:
    261993005056
               1
               2
               3
 140720308486144

Bizarre, not sure how this happens.

Remove `DataArray` and `PooledDataArray` constructors with no Array parallels

We have several types of constructors for DataArrays that have no parallels in Base Julia. We should remove them.

This includes things like DataArray(3) and DataArray(3, 3).

Operators for PooledDataArray?

For unary operators, binary operators with scalar arguments, and some others (e.g. transpose, which I'm working on now), we could make specialized versions that operate substantially faster on PooledDataArray than the current implementations for AbstractDataArray. My questions are:

Is it worth doing this? I suspect it is, since it's mostly just a change to a couple macros and such operators could be useful in practice.
If we have implementations for both DataArray and PooledDataArray, should we remove the implementation for AbstractDataArray? AFAICT we don't provide any other subtypes of AbstractDataArray in this package, and if someone wants to implement their own AbstractDataArray, it is highly likely that there would be a more efficient way of performing these operations than the generic implementation.

Be more conservative about recycling values

Right now, we are far too liberal when recycling values. We should only allow the following behaviors:

Allow operations between arrays when their sizes are perfectly matched.
Recycle scalars so that they behave like an array of the same size as the non-scalar operand in an operation like addition or multiplication.

We should definitely not follow the R lead of recycling vectors of short length until they match the length of the longer vector. Only arrays whose sizes exactly match should be allowed to interact.

As an example of what we should not allow going forward, consider the following

julia> using DataArrays

julia> x = @data([1, 2, 3, 4, 5])
5-element DataArray{Int64,1}:
 1
 2
 3
 4
 5

julia> y = @data([6, 7])
2-element DataArray{Int64,1}:
 6
 7

julia> x[:] = y
2-element DataArray{Int64,1}:
 6
 7

julia> x
5-element DataArray{Int64,1}:
 6
 7
 3
 4
 5

This operation does not work on Julia's normal arrays. In general, we should always try to behave exactly like Julia's normal arrays, except with NA's added in.
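A minimal sketch of the two permitted behaviors, written against plain arrays in current Julia. The function name `checkedadd` is hypothetical and only illustrates the rule; it is not a DataArrays function.

```julia
# Hypothetical helper: allow exact size matches or a scalar operand, nothing else.
function checkedadd(a, b)
    if a isa AbstractArray && b isa AbstractArray
        size(a) == size(b) ||
            throw(DimensionMismatch("sizes $(size(a)) and $(size(b)) differ"))
    end
    return a .+ b   # scalars are recycled by broadcasting
end

samesize = checkedadd([1, 2], [3, 4])
withscalar = checkedadd([1, 2, 3], 10)

# R-style recycling of a short vector is rejected.
rejected = try
    checkedadd([1, 2, 3, 4, 5], [6, 7])
    false
catch err
    err isa DimensionMismatch
end
```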

Improve data and pdata macros

The @data and @pdata macros need to be improved to handle more types of expressions. In particular, it would be good if, in an expression like isequal(@data ones(2), DataArray(ones(2))), the macro didn't parse beyond the comma.

Argument against the array function instead of using convert

ERROR: no method array(Array{Float64,1})
 in model_response at /Users/johnmyleswhite/.julia/DataFrames/src/formula.jl:181
 in glm at /Users/johnmyleswhite/.julia/GLM/src/glmfit.jl:117
 in glm at /Users/johnmyleswhite/.julia/GLM/src/glmfit.jl:134

Dividing DataArray by integer gives InexactError()

Dividing DataArray by integer gives InexactError() when it should convert to float:

>DataArray([1:10])./10

InexactError()
at In[8]:1
 in ./ at C:\Users\admin\.julia\DataArrays\src\operators.jl:135
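One plausible way to avoid the error is to derive the output element type from the division itself instead of reusing the input's integer type. The sketch below uses plain vectors in current Julia; the name `divide` is hypothetical, and this is not necessarily how operators.jl was fixed.

```julia
# Hypothetical helper: pick the result eltype from what `/` actually returns.
function divide(xs::AbstractVector{T}, d::S) where {T<:Number, S<:Number}
    R = typeof(one(T) / one(S))          # Int / Int gives Float64
    out = Vector{R}(undef, length(xs))
    for (i, x) in enumerate(xs)
        out[i] = x / d
    end
    return out
end

quotients = divide([1, 2, 5], 10)

# Storing a non-integer quotient in an Int slot is what raises InexactError.
inexact = try
    convert(Int, 1 / 10)
    false
catch err
    err isa InexactError
end
```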

Systematic test coverage

I think our current ad hoc testing practice lets too many things slip through the cracks. I propose that we switch to a simple, systematic testing rule: every file in src needs to have a mirror file in test that contains tests that check every function defined in the src file in the order present in the src file. This kind of test file is pretty boring to write, but is much more systematic and reliable. I've started doing this myself, so I'll upload tests as they get constructed.

One other rule: I'd propose that the tests for a src file x.jl go in a test file also called x.jl, which contains a module called TestX. This ensures parallelization across test files. Ideally the module would contain tests written as functions, which can be easily analyzed by a code checker to confirm systematic test coverage of the src file.

Error defining Stats.table in extras.jl

On the latest Julia 0.3.0-dev:

Version 0.3.0-prerelease+734 (2013-12-29 21:13 UTC)
Commit 974b794* (0 days old master)
x86_64-apple-darwin12.5.0

Loading DataArrays fails on extras.jl attempting to define Stats.table:

julia> using DataArrays
ERROR: table not defined
 in reload_path at loading.jl:146
 in _require at loading.jl:59
 in require at loading.jl:43
while loading ~/.julia/DataArrays/src/extras.jl, in expression starting on line 1
while loading ~/.julia/DataArrays/src/DataArrays.jl, in expression starting on line 85

Commenting out this:

function Stats.table{T}(d::AbstractDataArray{T})
    counts = Dict{Union(T, NAtype), Int}()
    for i = 1:length(d)
        if haskey(counts, d[i])
            counts[d[i]] += 1
        else
            counts[d[i]] = 1
        end
    end
    return counts
end

Works as a workaround.

isequal(@data([2, NA]), @pdata([2, NA])) ?

I was expecting it to be false (same for @data(1:3) == @pdata(1:3)), but I guess either way, you lose the ability to evaluate something easily.

Proposed roadmap for DataArrays revision

Here's a list of what I would consider the most important changes to make to this package:

Minimal constructors

While discussing the recent changes to DA constructors, @simonster made a suggestion that I really like that would have us remove any constructor that isn't strictly necessary.

I'd like to propose the following design principle:

We only offer two constructors for DA's and PDA's:
- Construction using the implementation details
- Construction using an empty DA/PDA of a specific type and size.
For creating objects, one either uses:
- Literals written using @data or @pdata
- Explicit conversions.

Does this sound right to you, @simonster?

New ambiguity warnings

The number of new methods required is really kind of insane and not sustainable if people keep adding things to Base.

promote_type(DataArray, Array) should be DataArray?

... rather than AbstractArray?

Replace PooledDataArray with NominalVariable/CategoricalVariable

In #67, @johnmyleswhite proposed replacing PooledDataArrays with OrdinalVariable/NominalVariable enums. Here are four possible approaches in order of my current preferences regarding them:

1. Keep the concept of the PooledDataArray for efficiently storing values, but make getindex wrap each extracted value as an OrdinalVariable or NominalVariable. ~~So far, this is looking like the best approach to me.~~ Pros: Efficient storage and avoids most of the potential performance pitfalls of the below approaches. Cons: We don't get to get rid of much code.
2. Make OrdinalVariable and NominalVariable have a field for the array of possible values they can take on. Pros: We kill PooledDataArray. Cons: This incurs both memory and performance overhead, since the array reference means an extra 8 bytes and GC root per object. (~~I have no idea how much this matters.~~ This makes garbage collection really slow.)
3. Assign each enum an ID, keep this ID in the objects, and keep the pool in a global array. Pros: We kill PooledDataArray, and we only need an extra few bytes per object and we don't need GC roots. Cons: Pools never get garbage collected.
4. Make OrdinalVariable and NominalVariable immutables parametrized by tuples that represent the possible values, or similarly, dynamically generate types at runtime. Pros: We kill PooledDataArray but store values equally efficiently. Cons: While this seems clever, it is really an abuse of the type system. Julia will compile specialized code for each enum, which has undesirable performance characteristics and is even worse than (3) as far as leaving around data that never gets garbage collected. The tuple approach has the additional disadvantage that type inference probably stops working if there are >8 values in the tuple.

Type-instability in PooledDataArray's?

While trying to clean up the PDA code, I realized that our current approach, which allows PDA's to change the type of their references field, might be introducing type-instability into our code. I don't know if there are specific cases where this is a problem yet, but it seems worth starting to debate.

ambiguous definitions with BinDeps

On 0.2:

julia> using BinDeps

julia> using DataArrays
Warning: New definition
    |(NAtype,Any) at C:\Users\mlubin\.julia\DataArrays\src\operators.jl:502
is ambiguous with:
    |(Any,SynchronousStepCollection) at C:\Users\mlubin\.julia\BinDeps\src\BinDeps.jl:286.
To fix, define
    |(NAtype,SynchronousStepCollection)
before the new definition.
Warning: New definition
    |(Any,NAtype) at C:\Users\mlubin\.julia\DataArrays\src\operators.jl:502
is ambiguous with:
    |(SynchronousStepCollection,Any) at C:\Users\mlubin\.julia\BinDeps\src\BinDeps.jl:283.
To fix, define
    |(SynchronousStepCollection,NAtype)
before the new definition.

Travis CI

I just added a .travis.yml file (copied from DataFrames). @johnmyleswhite, can you flip the switch to enable the service hook?

Implement ==

Right now == on two DataArrays gives an error, but I'd like to implement it. Since NA == NA returns NA, I think the right thing to do is to make == return NA if there are any NAs in either DataArray, and otherwise give the same behavior as == for standard Arrays. Is this reasonable?
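The proposed rule can be written down in a few lines of current Julia, with `missing` standing in for NA. The function name `naequal` is hypothetical and exists only to illustrate the semantics.

```julia
# Hypothetical sketch of the proposed semantics, with `missing` in place of NA:
# any NA on either side makes the whole comparison NA.
function naequal(a::AbstractArray, b::AbstractArray)
    (any(ismissing, a) || any(ismissing, b)) && return missing
    return a == b   # no NAs present: behave exactly like ordinary arrays
end

same = naequal([1, 2], [1, 2])
different = naequal([1, 2], [1, 3])
unknown = naequal([1, missing], [1, 2])
```

One design alternative is to return false as soon as two non-NA entries are known to differ, which is what `==` on arrays containing `missing` does in current Julia.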

Deprecate isna(Scalar), only provide isna(AbstractArray, inds...)

Until we can do something to provide the compiler more information about DataArrays, I'd like to propose that we deprecate isna(x::NAtype) and isna(x::Any). This will encourage people to write loops like

function sum{T}(x::DataArray{T})
    s = 0.0
    for i in 1:length(x)
        if !isna(x, i)
            s += x[i]::T
        end
    end
    return s
end

and to discourage loops like

function sum{T}(x::DataArray{T})
    s = 0.0
    for i in 1:length(x)
        if !isna(x[i])
            s += x[i]::T
        end
    end
    return s
end

Ideally we'd like to get rid of that T annotation, but I'd like to provide and encourage idioms that will perform better than the type-unstable code we currently encourage. The more we push people towards type-stable code, the fewer performance questions we'll have to field.

Broken PooledDataArray constructor

At some point, the PooledDataArray constructors got a little wonky:

julia> PooledDataArray([true], [false])
1-element PooledDataArray{Bool,Uint32,1}:
 NA

julia> PooledDataArray([true], [true])
1-element PooledDataArray{Bool,Uint32,1}:
 true

This is almost the exact opposite of the semantics that we should have.

Remove DataVector[1] hack; use a @data macro

I've been hoping to get rid of the DataVector[1, NA] hack for a long time now. It's really convenient, but doesn't extend to matrices in any clear way.

I think the solution is to create @data and @pdata macros, which will take in literals that could contain NA values and generate DataArray's and PooledDataArray's. You'd end up with:

@data [1, NA, 3] #=> DataArray([1, 0, 3], [false, true, false])
@pdata ["a", "a", "a"] #=> PooledDataArray(["a", "a", "a"], [false, false, false])

@data [1 NA; 3 4] #=> DataArray([1 0; 3 4], [false true; false false])
@pdata ["a" "a"; "a" "a"] #=> PooledDataArray(["a" "a"; "a" "a"], [false false; false false])
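The expansion shown above amounts to splitting a literal into a values array and an NA mask. A rough sketch of that split in current Julia follows, with `missing` standing in for NA; `splitmissing` and its `filler` argument are hypothetical names, and a real macro would do this work on the expression rather than on an evaluated array.

```julia
# Hypothetical helper: separate a literal into raw values and an NA mask.
function splitmissing(xs::AbstractArray, filler)
    mask = map(ismissing, xs)                           # true where NA
    vals = map(x -> ismissing(x) ? filler : x, xs)      # filler in NA slots
    return vals, mask
end

vecvals, vecmask = splitmissing([1, missing, 3], 0)
matvals, matmask = splitmissing([1 missing; 3 4], 0)
```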

WIP: Clean up indexing

Currently, DataArray(Int, 3, 1)[:, 1] = @data [1, 2, NA] fails, so I started cleaning up the setindex! functions. Unfortunately, while this is more concise than the old indexing functions, handles a larger variety of cases, and passes tests, the generated code looks pretty bad (not that it looked particularly good before). I'm also missing setindex!(DataArray, AbstractDataArray, inds...) (the general case of this bug), which should probably use Cartesian, as should setindex!(PooledDataArray, AbstractArray, inds...), where I'm currently doing allocation.

@johnmyleswhite Is it really necessary to document each method of getindex and setindex!? Could we just specify that they will behave the same way for DataArrays as they do for Arrays? Right now I've got about twice as much documentation here as I have methods.

Implement logical indexing with nd arrays

We do not presently have logical (boolean) indexing with anything besides vectors:

julia> using DataArrays

julia> a = @data [1 2; 3 4]
2x2 DataArray{Int64,2}:
 1  2
 3  4

julia> a[a .== 1]
ERROR: BoundsError()
 in getindex at bitarray.jl:363

julia> a[a .== 1] = 1
ERROR: no method setindex!(DataArray{Int64,2}, Int64, DataArray{Bool,2})

When implemented, I can remove the ugly vec in the tests for #68.

Support other statistics functions in Base

Per JuliaLang/julia#4552, users will now be recommended to use DataArrays to handle missing data. For completeness, it would be nice to support the full range of statistical functions defined in Base.

select(:((foo .== "bar") & (cat .== "dog")), df) fails

While trying to subset a df where both foo and cat are PooledDataArrays, I get:

no method &(PooledDataArray{Bool,Uint32,1},PooledDataArray{Bool,Uint32,1})

But,

select(:(DataArray(foo .== "bar") & DataArray(cat .== "dog")), pda)

works as expected.

Ambiguities with subtraction operator

Currently, the tests pass on my machine, but the package produces ambiguity warnings when loaded:

julia> using DataArrays
Warning: New definition 
    -(DataArray{T,N},AbstractArray{T,N}) at /Users/dhlin/.julia/DataArrays/src/operators.jl:324
is ambiguous with: 
    -(AbstractArray{T,2},Diagonal{T}) at linalg/diagonal.jl:27.
To fix, define 
    -(DataArray{T,2},Diagonal{T})
before the new definition.
Warning: New definition 
    -(AbstractArray{T,N},DataArray{T,N}) at /Users/dhlin/.julia/DataArrays/src/operators.jl:324
is ambiguous with: 
    -(Diagonal{T},AbstractArray{T,2}) at linalg/diagonal.jl:26.
To fix, define 
    -(Diagonal{T},DataArray{T,2})
before the new definition.
Warning: New definition 
    -(AbstractDataArray{T,N},AbstractArray{T,N}) at /Users/dhlin/.julia/DataArrays/src/operators.jl:345
is ambiguous with: 
    -(AbstractArray{T,2},Diagonal{T}) at linalg/diagonal.jl:27.
To fix, define 
    -(AbstractDataArray{T,2},Diagonal{T})
before the new definition.
Warning: New definition 
    -(AbstractArray{T,N},AbstractDataArray{T,N}) at /Users/dhlin/.julia/DataArrays/src/operators.jl:345
is ambiguous with: 
    -(Diagonal{T},AbstractArray{T,2}) at linalg/diagonal.jl:26.
To fix, define 
    -(Diagonal{T},AbstractDataArray{T,2})
before the new definition.

The problem is that Base defines:

- (AbstractMatrix, Diagonal)
- (Diagonal, AbstractMatrix)

and this package defines:

- (AbstractArray, DataArray)
- (DataArray, AbstractArray)

So when you write something like subtracting a data matrix from a diagonal matrix, the compiler won't know which method to use.
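The same situation can be reproduced with toy types in current Julia, where an ambiguous call raises a MethodError at call time rather than printing a warning at load time. All type and function names below are hypothetical.

```julia
# Toy stand-ins for AbstractMatrix, Diagonal and a data matrix.
abstract type ToyMatrix end
struct ToyDiagonal <: ToyMatrix end
struct ToyDataMatrix <: ToyMatrix end

# One method is specific in its second argument, the other in its first.
sub(a::ToyMatrix, b::ToyDiagonal) = "diagonal method"
sub(a::ToyDataMatrix, b::ToyMatrix) = "data method"

# Neither method is more specific for (ToyDataMatrix, ToyDiagonal).
ambiguous = try
    sub(ToyDataMatrix(), ToyDiagonal())
    false
catch err
    err isa MethodError
end

# Defining the intersection method resolves the ambiguity.
sub(a::ToyDataMatrix, b::ToyDiagonal) = "specific method"
resolved = sub(ToyDataMatrix(), ToyDiagonal())
```

This is why each warning suggests defining the method for the intersection of the two signatures, and why the number of such methods grows with every new array type in Base.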

Proposed PooledDataArray changes

My thinking on PooledDataArray's is slowly changing, especially with regard to the absence of anything like an official factor type. Here are some proposals for how we might use them going forward.

PDA's shouldn't be used as a workhorse data structure for data that is still likely to be mutated. One should, under most circumstances, only construct a PDA when you know that the data is effectively static.
The value of PDA's is that they cache information about categorical data. For now, PDA's cache only information about the unique values found in the full array. But I'd like to propose adding more information.
In addition to caching the unique values, I would argue for maintaining a count of the number of times that each value occurs. This would make it easier to repeatedly use PDA's in cross tabulations. In addition, we can use the counts of each unique value to ensure that the PDA's pool never contains entries that are no longer present in the PDA's refs.
Right now the ordering of levels in the pool is something one has to guess at. Although we thankfully construct PDA's such that the pool is ordered, we don't maintain this invariant when values are inserted or deleted from the PDA. I would argue that we should guarantee that all operations on PDA's leave the pool ordered.
We shouldn't require that users ever call compact. Instead, we should revert our earlier stance and ensure, on every operation, that PDA's are represented in maximally compact form.
We should implement an ordering option for PDA's by adding two new fields to the data structure: ordered::Bool and order::Vector{Uint64}, which define an ordering over the pool by mapping each item of the pool to an integer that matches the rank of the associated level in the ordering over categories.
We should maybe remove/rename reorder so that we only have a setorder! function, which takes in a new ordering of the pool as a vector. This would let you take something like pda = @pdata(["a", "b", "c"]) and reorder it as setorder!(pda, ["b", "c", "a"]).
We should never leak the numeric representation of the refs to the outside world. This means that get_indices and similar functions should not exist.
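As a rough illustration of the sorted pool and the per-level counts proposed above, here is a sketch in current Julia on plain vectors. The function `buildpool` and its return layout are hypothetical, not the PooledDataArray internals.

```julia
# Hypothetical sketch: build a sorted pool, integer refs into it, and level counts.
function buildpool(xs::AbstractVector)
    pool = sort(unique(xs))                               # pool is always ordered
    index = Dict(v => i for (i, v) in enumerate(pool))    # level => position
    refs = [index[x] for x in xs]
    counts = zeros(Int, length(pool))
    for r in refs
        counts[r] += 1                                    # occurrences per level
    end
    return pool, refs, counts
end

pool, refs, counts = buildpool(["b", "a", "b"])
```

With counts available, a level whose count drops to zero can be removed from the pool right away, so a separate compact step would not be needed.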

Remove values function

The values function in DataArrays doesn't make any sense: it turns PDA's into DA's, but returns copies of DA's and Array's. If people want to convert a PDA into DA, they should use convert, not values.

getindex produces (potentially invalid) ambiguity warnings

julia> using DataArrays
Warning: New definition 
    getindex(DataArray{T,N},Union(Array{T,1},Ranges{T})) at /Users/johnmyleswhite/.julia/DataArrays/src/dataarray.jl:350
is ambiguous with 
    getindex(DataArray{T<:Number,N},Union(Array{T,1},Ranges{T},BitArray{1})) at /Users/johnmyleswhite/.julia/DataArrays/src/dataarray.jl:334.
Make sure 
    getindex(DataArray{T<:Number,N},Union(Array{T,1},Ranges{T}))
is defined first.

Statistics functions that ignore NAs

We should definitely implement these. See also JuliaData/DataFrames.jl#354 and JuliaData/DataFrames.jl#325. I'm beginning to think the most general/performant approach is to replicate some of @lindahua's code from NumericExtensions but add in an if statement to avoid touching values that are NA.

Category to integer mapping: bikeshedding session

After merging #52, we should provide a tool that constructs a mapping from the levels of a PooledDataArray to the integers. This function should make clear that the mapping is ad hoc and not related to the underlying representation of the data.

Should we call it levelsmap?

Lots of warnings on loading

Warning: New definition 
    round(DataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:350
is ambiguous with: 
    round(AbstractArray{T<:Real,1},) at operators.jl:236.
To fix, define 
    round(DataArray{_<:Real,1},)
before the new definition.
Warning: New definition 
    round(DataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:350
is ambiguous with: 
    round(AbstractArray{T<:Real,2},) at operators.jl:237.
To fix, define 
    round(DataArray{_<:Real,2},)
before the new definition.
Warning: New definition 
    round(AbstractDataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:359
is ambiguous with: 
    round(AbstractArray{T<:Real,1},) at operators.jl:236.
To fix, define 
    round(AbstractDataArray{_<:Real,1},)
before the new definition.
Warning: New definition 
    round(AbstractDataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:359
is ambiguous with: 
    round(AbstractArray{T<:Real,2},) at operators.jl:237.
To fix, define 
    round(AbstractDataArray{_<:Real,2},)
before the new definition.
Warning: New definition 
    ceil(DataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:350
is ambiguous with: 
    ceil(AbstractArray{T<:Real,1},) at operators.jl:236.
To fix, define 
    ceil(DataArray{_<:Real,1},)
before the new definition.
Warning: New definition 
    ceil(DataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:350
is ambiguous with: 
    ceil(AbstractArray{T<:Real,2},) at operators.jl:237.
To fix, define 
    ceil(DataArray{_<:Real,2},)
before the new definition.
Warning: New definition 
    ceil(AbstractDataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:359
is ambiguous with: 
    ceil(AbstractArray{T<:Real,1},) at operators.jl:236.
To fix, define 
    ceil(AbstractDataArray{_<:Real,1},)
before the new definition.
Warning: New definition 
    ceil(AbstractDataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:359
is ambiguous with: 
    ceil(AbstractArray{T<:Real,2},) at operators.jl:237.
To fix, define 
    ceil(AbstractDataArray{_<:Real,2},)
before the new definition.
Warning: New definition 
    floor(DataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:350
is ambiguous with: 
    floor(AbstractArray{T<:Real,1},) at operators.jl:236.
To fix, define 
    floor(DataArray{_<:Real,1},)
before the new definition.
Warning: New definition 
    floor(DataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:350
is ambiguous with: 
    floor(AbstractArray{T<:Real,2},) at operators.jl:237.
To fix, define 
    floor(DataArray{_<:Real,2},)
before the new definition.
Warning: New definition 
    floor(AbstractDataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:359
is ambiguous with: 
    floor(AbstractArray{T<:Real,1},) at operators.jl:236.
To fix, define 
    floor(AbstractDataArray{_<:Real,1},)
before the new definition.
Warning: New definition 
    floor(AbstractDataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:359
is ambiguous with: 
    floor(AbstractArray{T<:Real,2},) at operators.jl:237.
To fix, define 
    floor(AbstractDataArray{_<:Real,2},)
before the new definition.
Warning: New definition 
    trunc(DataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:350
is ambiguous with: 
    trunc(AbstractArray{T<:Real,1},) at operators.jl:236.
To fix, define 
    trunc(DataArray{_<:Real,1},)
before the new definition.
Warning: New definition 
    trunc(DataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:350
is ambiguous with: 
    trunc(AbstractArray{T<:Real,2},) at operators.jl:237.
To fix, define 
    trunc(DataArray{_<:Real,2},)
before the new definition.
Warning: New definition 
    trunc(AbstractDataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:359
is ambiguous with: 
    trunc(AbstractArray{T<:Real,1},) at operators.jl:236.
To fix, define 
    trunc(AbstractDataArray{_<:Real,1},)
before the new definition.
Warning: New definition 
    trunc(AbstractDataArray{T<:Number,N},Integer...) at /Users/viral/.julia/DataFrames/src/operators.jl:359
is ambiguous with: 
    trunc(AbstractArray{T<:Real,2},) at operators.jl:237.
To fix, define 
    trunc(AbstractDataArray{_<:Real,2},)
before the new definition.
Warning: New definition 
    formatter(Array{T,N},Date{C<:Calendar}...)
is ambiguous with: 
    formatter(Array{T,N},FloatingPoint...).
To fix, define 
    formatter(Array{T,N},)
before the new definition.

Lightweight wrapper for floating point arrays with NaNs

It'd be nice to have a subtype of AbstractDataArray that wraps a regular floating point array and treats NaN values as NA. I think this should generally be faster than indexing the BitArray that holds the list of NA values, and with #4 and some minor API changes this should eventually give us nansum etc. with equivalent performance to manual loops.
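The manual loop that such a wrapper would need to match looks roughly like this in current Julia. The name `nansum` comes from the text above; this particular definition is only an illustrative sketch.

```julia
# Sketch of a NaN-skipping sum over a plain floating point array.
function nansum(xs::AbstractArray{T}) where {T<:AbstractFloat}
    s = zero(T)
    for x in xs
        if !isnan(x)    # NaN plays the role of NA, so no separate mask is read
            s += x
        end
    end
    return s
end

total = nansum([1.0, NaN, 2.5])
```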

Ambiguous copy!(AbstractDataArray{T,N},Any) warnings

julia> using DataArrays
Warning: New definition 
    copy!(AbstractDataArray{T,N},Any) at /Users/jiahao/.julia/v0.3/DataArrays/src/abstractdataarray.jl:48
is ambiguous with: 
    copy!(AbstractArray{T,2},AbstractArray{T,2}) at multidimensional.jl:142.
To fix, define 
    copy!(AbstractDataArray{T,2},AbstractArray{T,2})
before the new definition.
...(repeats)

Semantics of `similar`

The bug I fixed last night suggested that similar was not setting the NA values in a DataArray correctly. This led me to wonder whether we should have similar do any NA setting or whether it should only do allocation. My inclination is that it should only do allocation, not value setting. But there is obviously an argument for automatically setting everything to NA until a value is known.

Define first and last methods

import Base: first, last  # needed to extend the Base functions

first(d::DataArray) = d[1]
last(d::DataArray) = d[length(d)]  # length, not size(d)[1], so matrices work too

Constructors shouldn't be doing conversion

Right now we have a bunch of constructors that do conversion:

julia> DataArray(1:10)
10-element DataArray{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> Array(1:1)
ERROR: no method Array{T,N}(Range1{Int64})

Standard Array's don't do this and neither should we. At some point I'm just going to open an issue that lists all constructors for Array's so that we can exactly match it.

Conversely, we're missing the relevant conversion methods:

julia> convert(DataVector, 1:10)
ERROR: no method convert(Type{DataArray{T,1}}, Range1{Int64})
 in convert at base.jl:11

julia> convert(Vector, 1:10)
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

Semantics of unique and levels

The semantics of unique and levels are a mess at the moment, which causes the error Andreas sees in DataFrames:

julia> using DataArrays

julia> da = @data([1, 2, NA])
3-element DataArray{Int64,1}:
 1  
 2  
  NA

julia> pda = @pdata([1, 2, NA])
3-element PooledDataArray{Int64,Uint32,1}:
 1  
 2  
  NA

julia> unique(da)
3-element DataArray{Int64,1}:
  NA
 2  
 1  

julia> levels(da)
3-element DataArray{Int64,1}:
  NA
 2  
 1  

julia> unique(pda)
3-element DataArray{Int64,1}:
 1  
 2  
  NA

julia> levels(pda)
2-element Array{Int64,1}:
 1
 2

My preference is that unique should always return a DataArray of the same type containing all the unique values (including NA), whereas levels should always return an Array of the same type containing only the non-NA unique values. We're doing this for PDA's, but not for DA's.
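The preferred split can be illustrated in current Julia, where `missing` plays the role of NA. The name `levels_sketch` is hypothetical and is used here to avoid suggesting this is the package's `levels`.

```julia
xs = [1, 2, missing, 1]

# unique keeps the missing value as one of the distinct entries.
uniq = unique(xs)

# Hypothetical levels: distinct non-missing values in a plain array.
levels_sketch(v) = unique(collect(skipmissing(v)))
lv = levels_sketch(xs)
```

Here `uniq` still carries the missing-aware element type, while `lv` has a plain `Int` element type, which mirrors the DataArray versus Array distinction described above.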

max of integer DataArrays results in Array{Any}. Breaks max of several DataArrays

The max function changes the type when applied to instances of DataArrays. As a side effect, taking the maximum of 3 or more instances of DataArrays results in an error:

julia> u = DataArray([1,2,3]); v = DataArray([0,4,0]); w =[4,0,0];

julia> u = DataArray([1,2,3]); v = DataArray([0,4,0]); w = DataArray([4,0,0]);

julia> typeof(max(u,v))
Array{Any,1}

julia> max(u,v,w)
ERROR: no method isless(DataArray{Int64,1}, Array{Any,1})
 in max at operators.jl:71

I have this issue with Julia 0.3.0 preview (mac binary package). In Julia 0.2.0, max returns an Array{Int64,1}. While perhaps not ideal, at least this allowed max(u,v,w) to execute.

Remove initialized constructors

Now that we have @data and @pdata macros, I'd like to remove the data* constructors. Instead of having a special datazeros() function, it's much more general to use @data zeros().

`reorder` method using `DataFrame` throws error.

Redoing the Julia benchmarks results in an error due to a use of DataFrame here.

rewrite and rename percent_change() -> percentchange()

This method is very common and when it comes to naming it, it's easy to start calling it something different depending on what package you happen to be using. R has this issue and calls this method ROC, Delt, delt, returns and probably a handful of other names. TimeSeries uses percentchange now (I just deprecated simple_return and log_return). That naming is consistent with Julian convention to remove underscores, the bane of Ruby variable names.

percentchange.TimeSeries also offers a kwarg for either simple percent change (which is what the currently implemented percent_change offers), and log percent change.

I can do a PR on this. It will also require 3 changes to files in DataFrames, where the method is referenced as percent_change.

Here is what the code looks like:

function percentchange(dv::DataArray; method="simple")
  if method == "simple" 
    return expm1(log_return(dv))
  elseif method == "log" 
    return log_return(dv)
  else 
    throw(ArgumentError("only simple and log methods supported"))
  end
end

function log_return(dv::DataArray)
  pad(diff(log(dv)), 1, 0, NA)
end

The code is short enough that you can refactor and put the definition of log_return inside the original method.
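For comparison, the refactored single-function form might look like this in current Julia on a plain vector, with `missing` used as the leading pad instead of NA. This is a sketch of the same arithmetic, not the DataArrays or TimeSeries implementation.

```julia
# Sketch: simple or log percent change, padded with one leading missing value.
function percentchange(v::AbstractVector{<:Real}; method = "simple")
    logret = diff(log.(v))              # log returns, one fewer than length(v)
    if method == "simple"
        changes = expm1.(logret)        # exp(r) - 1 recovers the simple return
    elseif method == "log"
        changes = logret
    else
        throw(ArgumentError("only simple and log methods supported"))
    end
    out = Vector{Union{Missing,Float64}}(missing, length(v))
    out[2:end] = changes                # the first entry has no predecessor
    return out
end

simple = percentchange([100.0, 110.0, 99.0])
logged = percentchange([1.0, exp(1.0)]; method = "log")
```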

Consolidate failNA, replaceNA, vector and matrix into one function

Right now we have four functions that could plausibly be merged into a single function that takes keyword args:

failNA
replaceNA
vector
matrix

There's really no need to have vector and matrix be separate functions: they could both be replaced by array. And we could incorporate the ability to (a) fail on encountering NA and (b) replace NA with a chosen value into this new array function.

If we used something like,

array(da::DataArray{T}; outtype = T, fail = true, replace = nothing)

we could do all of this work at once. The only thing that's tricky is getting removeNA into a function, since its output has unpredictable size.
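A reduced sketch of the merged function in current Julia, with `missing` standing in for NA. The name `toarray` is hypothetical, the `outtype` keyword is left out, and failing is simply the behavior when no replacement is supplied.

```julia
# Hypothetical sketch: fail on missing values unless a replacement is given.
function toarray(xs::AbstractVector; replace = nothing)
    if replace === nothing
        any(ismissing, xs) &&
            throw(ArgumentError("missing value found and no replacement given"))
        return collect(skipmissing(xs))   # nothing is skipped; only the eltype narrows
    end
    return coalesce.(xs, replace)         # swap each missing for the replacement
end

replaced = toarray([1, missing, 3]; replace = 0)
clean = toarray(Union{Int,Missing}[1, 2])

failed = try
    toarray([1, missing, 3])
    false
catch err
    err isa ArgumentError
end
```

Both branches return an array of the same length as the input, which is why a removeNA-style option does not fit the same signature cleanly.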

test/runtests.jl fails

Xref #73

$ ./julia -e 'versioninfo()'
Julia Version 0.3.0-prerelease+1692
Commit 736251d* (2014-02-23 06:21 UTC)
Platform Info:
  System: Linux (i686-redhat-linux)
  CPU: Genuine Intel(R) CPU           T2250  @ 1.73GHz
  WORD_SIZE: 32
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm
$ ./julia ~/.julia/DataArrays/test/runtests.jl 
Running tests:
 * abstractarray.jl
 * booleans.jl
 * constructors.jl
 * containers.jl
 * conversions.jl
 * data.jl
 * dataarray.jl
 * datamatrix.jl
 * linalg.jl
 * operators.jl
ERROR: no method -(PooledDataArray{Float64,Uint32,1})
 in anonymous at /home/rick/.julia/DataArrays/test/operators.jl:59
 in anonymous at no file:32
 in include_from_node1 at loading.jl:120
while loading /home/rick/.julia/DataArrays/test/operators.jl, in expression starting on line 56
while loading /home/rick/.julia/DataArrays/test/runtests.jl, in expression starting on line 30

The rabbit hole that is the data macro

Right now, the @data macro handles most things correctly except for variables that are equal to NA.

Consider this example:

a, b, c = 1, 2, NA
@data [a, b, c]

This will fail because the type of c isn't known from the surface analysis that the @data macro does. To fix this, we'd need to write out code that analyzes the values of the inputs, which can't be known at compile-time. So the macro needs to write code that does analysis at run-time.

Don't allow NA inside indices

As discussed in #38, any indexing operation that uses an NA index should fail. Right now, we simply drop NA indices, but it's safer if you know that leaving NA's in your indices will always fail.

Clash with BinDeps in definition of `|`

I haven't looked closely, but this might just need to be changed to |>.
(Of course, the same change might be needed in BinDeps...)

julia> using Winston
Warning: New definition 
    |(SynchronousStepCollection,Any) at /home/kmsquire/.julia/v0.2/BinDeps/src/BinDeps.jl:283
is ambiguous with: 
    |(Any,NAtype) at /home/kmsquire/.julia/v0.2/DataArrays/src/operators.jl:502.
To fix, define 
    |(SynchronousStepCollection,NAtype)
before the new definition.
Warning: New definition 
    |(Any,SynchronousStepCollection) at /home/kmsquire/.julia/v0.2/BinDeps/src/BinDeps.jl:286
is ambiguous with: 
    |(NAtype,Any) at /home/kmsquire/.julia/v0.2/DataArrays/src/operators.jl:502.
To fix, define 
    |(NAtype,SynchronousStepCollection)
before the new definition.
