juliadata / datatables.jl

(DEPRECATED) A rewrite of DataFrames.jl based on Nullable

License: Other

Julia 100.00%

julia datatables tabular-data dataframes

datatables.jl's Introduction

DataTables.jl

Tools for working with tabular data in Julia.

Installation: at the Julia REPL, Pkg.add("DataTables")

Documentation:

Reporting Issues and Contributing: See CONTRIBUTING.md

Maintenance: DataTables is maintained collectively by the JuliaData collaborators. Responsiveness to pull requests and issues can vary, depending on the availability of key collaborators.

datatables.jl's Issues

Overwrites DataFrames describe function

I have a lot of situations where I need both DataFrames and DataTables loaded at the same time, e.g. I start out with:

using DataFrames, DataTables

Right now I always get a warning that DataTables overwrites describe from DataFrames, which is not ideal.

I guess the solution is to move the function definition into some common base package, so that DataFrames and DataTables each add a method to it? Would that be AbstractTables? If so, could we start with a really bare-bones AbstractTables now that holds only that one definition, and add more to it later?
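A minimal sketch of that idea, written in plain Julia 1.x without either package. The module names `TableBase`, `FramesDemo` and `TablesDemo` and the types `Frame` and `Table` are hypothetical stand-ins, not real packages or APIs. The point is only that two modules extending one shared generic function can be loaded together without one overwriting the other.

```julia
# Hypothetical stand-in for a bare-bones shared package (the issue suggests AbstractTables).
module TableBase
export describe
# Generic function with no methods; dependent packages add their own.
function describe end
end

# Hypothetical stand-in for DataFrames.
module FramesDemo
import ..TableBase: describe
export Frame, describe
struct Frame
    ncols::Int
end
describe(f::Frame) = "Frame with $(f.ncols) columns"
end

# Hypothetical stand-in for DataTables.
module TablesDemo
import ..TableBase: describe
export Table, describe
struct Table
    ncols::Int
end
describe(t::Table) = "Table with $(t.ncols) columns"
end

# Both modules export the same function object, so the names do not clash.
using .FramesDemo, .TablesDemo

println(describe(Frame(2)))
println(describe(Table(3)))
```

Because `describe` is owned by the shared module, each package only adds a method for its own type and dispatch picks the right one.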

Throw error for type mismatch using join?

Just wondering if it makes sense to include an error when two dt column types don't match?

Currently, join() lets you perform the join anyway, but the result contains only null values.
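As a rough illustration of what such a check could look like, here is a sketch on plain vectors in Julia 1.x. `check_key_types` is a hypothetical helper, not something DataTables provides, and the rule it applies (disjoint element types are an error unless both are numeric) is an assumption chosen for the example.

```julia
# Hypothetical helper, not part of DataTables: reject join keys whose
# element types can never match each other.
function check_key_types(left::AbstractVector, right::AbstractVector)
    L, R = eltype(left), eltype(right)
    # Disjoint types are an error, except that numeric types may still compare equal.
    if typeintersect(L, R) === Union{} && !(L <: Number && R <: Number)
        throw(ArgumentError("join key types do not match: $L vs $R"))
    end
    return nothing
end

check_key_types([1, 2, 3], [1.0, 2.0])       # passes: both columns are numeric

try
    check_key_types([1, 2, 3], ["1", "2"])   # integer keys against string keys
catch err
    println(err.msg)
end
```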

Backporting to DataFrames

Is there any plan for the best way to accomplish this or any recommended strategies to try first?
Would it be preferable to get Nulls working here before the backport, or to get started now, move all the open PRs here (#66 included) to DataFrames, and finish them there?

describe does not function

The docs define describe.

...
If the column's base type derives from Number, compute the minimum, first quantile, median, mean, third quantile, and maximum. Nulls are filtered and reported separately.

But an MWE with Float64 columns

dt = DataTable(a=rand(10), b=randn(10))
describe(dt)

Outputs:

a
Summary Stats:
Length:         10
Type:           Nullable{Float64}
Number Unique:  10

b
Summary Stats:
Length:         10
Type:           Nullable{Float64}
Number Unique:  10

This is on the stable version of DataTables, with:

Julia Version 0.5.1
Commit 6445c82 (2017-03-05 13:25 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

Contrast this with the expected behaviour from DataFrames:

dt = DataFrame(a=rand(10), b=randn(10))
describe(dt)

a
Summary Stats:
Mean:           0.554118
Minimum:        0.050461
1st Quartile:   0.361887
Median:         0.530056
3rd Quartile:   0.836832
Maximum:        0.933014
Length:         10
Type:           Float64

b
Summary Stats:
Mean:           0.054508
Minimum:        -1.119196
1st Quartile:   -0.792691
Median:         0.165122
3rd Quartile:   0.749462
Maximum:        1.449910
Length:         10
Type:           Float64

Join on columns of different name

The docs say we have to join only on columns that have the same name on both DTs. What about using

join(a, b, on=(:a_name, :b_name))
join(a, b, on=[(:a_name1, :b_name1), (:a_name2, :b_name2)])

for joining on columns of different names? I think it is common to reuse a and b after the merge, so it's not nice to be obliged to rename columns just to merge.
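One way the proposed `on` forms could be normalized internally is sketched below in plain Julia 1.x. `split_on` is a hypothetical helper invented for this illustration; it only separates the left-table and right-table key names and does not perform a join.

```julia
# Hypothetical helper: turn the proposed `on` forms into separate key lists
# for the left and right tables.
split_on(on::Symbol) = ([on], [on])                      # same name on both sides
split_on(on::Tuple{Symbol,Symbol}) = ([on[1]], [on[2]])  # (left_name, right_name)
function split_on(on::AbstractVector)
    # Assumes a non-empty vector of symbols and/or (left, right) tuples.
    parts = map(split_on, on)
    return (reduce(vcat, first.(parts)), reduce(vcat, last.(parts)))
end

println(split_on((:a_name, :b_name)))
println(split_on([(:a_name1, :b_name1), (:a_name2, :b_name2)]))
println(split_on([:id, :fid]))
```

Keeping the existing `on=:id` and `on=[:id, :fid]` forms valid means current code would not need to change.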

complete_cases for selected columns

Arising from discussion here: https://discourse.julialang.org/t/is-there-a-dropna-for-dataframe/1777

I think it could be useful to have a version of complete_cases that only drops rows that have Nulls in specified columns. The use case is if you have a big dataset with thousands of rows and hundreds of columns, some of which have Nulls, and you want to do an operation on a subset of the columns in a statistical function (lm EDIT: I see that lm already deals with Nulls) or plot it. Running complete_cases would remove rows that had NAs in columns unrelated to the function.

In my experience, this scenario is very common: All the data is collected in a single DataTable, even some variables that may only be useful for a few operations and that have lots of Null values.

Also, I think it would be nice and efficient if this version (or all versions) of complete_cases returned a view. This would be much more efficient for on-the-fly computations, and would help distinguish complete_cases from dropna functionality. EDIT: I guess the user can make a view if desired with the current functionality, but using views would align functionality more with complete_cases!.

In the discourse post (https://discourse.julialang.org/t/is-there-a-dropna-for-dataframe/1777/19) , @mwsohn suggested some code that could form a useful basis for a PR. I am posting a modification of this for easy reference, but I take no credit for the code. Except for BitArray and view (which might also be suggested modifications to the existing complete_cases if BitArrays are more efficient), the only change from the current code in DataFrames is the args argument.

function complete_cases(dt::DataTable, args::Vector{Symbol})
    ba = BitArray(trues(size(dt,1)))
    for sym in args
        ba &= ~BitArray(isnull.(dt[sym]))
    end
    view(dt, ba) # or just ba if a view is not desired
end

deprecate readtable & writetable for CSV.read & CSV.write

A project-wide find & replace of DataFrame -> DataTable in each package was sufficient to get them working.

DataStreams.jl
- issue JuliaData/DataStreams.jl#27
- ready branch https://github.com/cjprybol/DataStreams.jl/tree/cjp/DataTables
CSV.jl
- ready branch https://github.com/cjprybol/CSV.jl/tree/cjp/DataTables

After those changes we'll deprecate:

- readtable in favor of CSV.read
- writetable in favor of CSV.write

I've seen this discussed but couldn't find an open issue.

Start replacing Nullable with Null

@quinnj has just created a Null.jl package providing a new Null type to replace DataArrays' NAtype. Even if the Julia compiler doesn't yet include the necessary optimizations to handle Union{T, Null} efficiently (see e.g. discussion at JuliaData/Missings.jl#3), I think we should start moving away from Nullable now, so that at least we can stabilize the API even if performance remains poor for some time.

NullableArray can be replaced with Array{Union{T, Null}}, which Jameson said will eventually use the same memory layout as NullableArray. This should fit well with @cjprybol's PR #53, which is going to stop auto-promoting columns to NullableArray. CategoricalArray and NullableCategoricalArray will have to be adapted, but that shouldn't be too hard.
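For readers on current Julia, the same representation can be tried with the built-in `Missing` type, which plays the role proposed here for `Null`. The sketch below uses only Base Julia 1.x and is not DataTables code; it shows that such a column is an ordinary `Array` with a `Union` element type.

```julia
# A column that can hold missing values is a plain Array with a Union element type.
col = Union{Float64, Missing}[1.0, missing, 3.0]

println(typeof(col))
println(count(ismissing, col))    # number of missing entries
println(sum(skipmissing(col)))    # reduce over the non-missing entries only
```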

Strange error message at REPL with the latest Julia nightly

(rmbpro)-~$ julia-dev
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0-pre.alpha.165 (2017-03-17 15:48 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit a2e714b (0 days old master)
|__/                   |  x86_64-apple-darwin13.4.0

julia> using DataTables
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:244 overwritten in module NullableArrays at /Users/Randy/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> a = 1
Error showing value of type Int64:
ERROR: MethodError: Cannot `convert` an object of type Int64 to an object of type Array{T,1}
This may have arisen from a call to the constructor Array{T,1}(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] (::Type{Array{T,1}})(::Int64) at ./sysimg.jl:24
 [2] vect(::T<:Type{Any}, ::Vararg{T<:Type{Any},N} where N) at ./array.jl:64
 [3] #getdisplay#5(::Void, ::Function, ::Type{T} where T, ::Dict{Any,Any}) at /Users/Randy/.julia/v0.6/Media/src/system.jl:110
 [4] (::Media.#kw##getdisplay)(::Array{Any,1}, ::Media.#getdisplay, ::Type{T} where T, ::Dict{Any,Any}) at ./<missing>:0
 [5] display(::Media.DisplayHook, ::Int64) at /Users/Randy/.julia/v0.6/Media/src/compat.jl:9
 [6] display(::Int64) at ./multimedia.jl:194

I have also tried loading DataFrames and it didn't show any errors.

"correct" join results

We're making progress on removing the automatic promotion of columns to nullable arrays (NullableArrays & NullableCategoricalArrays) in #30 and I would like some feedback and discussion about the best way to handle outer joins (outer, left, right). Outer joins are particularly interesting because they all have the potential to introduce nulls, but do not necessarily need to do so. All of the other joins currently supported create subsets of the tables and shouldn't have the potential to introduce nulls. Please correct me if I'm wrong on this.

For the other joins that do not have the potential to introduce missing data, it seems that the "correct" result is pretty easy to decide on: The result should be in the same order as the input data and the column types should be preserved.

For outer joins, we have two inter-related "correct"-ness questions that both have to do with nulls. The first question is what is the "correct" ordering of the returned values when nulls are introduced. Do we try and retain the original ordering of the input datasets, or do we default to putting the null values at the end, or left then right table ordering? The second is what is the "correct" column type to return to the user. As all outer joins have the potential to introduce missingness, the "correct" result could be the one that is most column-type consistent. In this case, we would retain the column type and only promote to using nullable arrays when nulls are introduced. On the flip side of this is the behavioral consistency approach. Because all outer joins have the potential to introduce missing data, it could be argued that outer joins should always return nullable columns (only for the smaller of the two joined tables or just completely nullify the datatable). In that case, users can always expect to receive nullable columns to be output from outer joins and write nullable-tolerant code accordingly without anything unexpected.

Here is an example set of joins. I'll turn these into test cases to assert whatever behavior pattern we decide on, and if any other examples are put forth here I'll add those too.

the data

julia> small = DataTable(id = 1:2:5, fid = 1.0:2.0:5.0)
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │
│ 3   │ 5  │ 5.0 │

julia> large = DataTable(id = 0:4, fid = 0.0:4.0)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │

Here are the results I would expect from each join. The join call is shown as a comment on the line before the constructor that builds the result I expected.

julia> N = Nullable()
Nullable{Union{}}()

julia> # join(small, large, kind=:cross)
       DataTable(id = repeat(1:2:5, inner=5),
                 fid = repeat(1.0:2.0:5.0, inner=5),
                 id_1 = repeat(0:4, outer=3),
                 fid_1 = repeat(0.0:4.0, outer=3))
15×4 DataTables.DataTable
│ Row │ id │ fid │ id_1 │ fid_1 │
├─────┼────┼─────┼──────┼───────┤
│ 1   │ 1  │ 1.0 │ 0    │ 0.0   │
│ 2   │ 1  │ 1.0 │ 1    │ 1.0   │
│ 3   │ 1  │ 1.0 │ 2    │ 2.0   │
│ 4   │ 1  │ 1.0 │ 3    │ 3.0   │
│ 5   │ 1  │ 1.0 │ 4    │ 4.0   │
│ 6   │ 3  │ 3.0 │ 0    │ 0.0   │
│ 7   │ 3  │ 3.0 │ 1    │ 1.0   │
│ 8   │ 3  │ 3.0 │ 2    │ 2.0   │
│ 9   │ 3  │ 3.0 │ 3    │ 3.0   │
│ 10  │ 3  │ 3.0 │ 4    │ 4.0   │
│ 11  │ 5  │ 5.0 │ 0    │ 0.0   │
│ 12  │ 5  │ 5.0 │ 1    │ 1.0   │
│ 13  │ 5  │ 5.0 │ 2    │ 2.0   │
│ 14  │ 5  │ 5.0 │ 3    │ 3.0   │
│ 15  │ 5  │ 5.0 │ 4    │ 4.0   │

julia> # id
       # join(small, large, on=:id, kind=:inner)
       DataTable(id = [1, 3], fid = [1.0, 3.0], fid_1 = [1.0, 3.0])
2×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼─────┼───────┤
│ 1   │ 1  │ 1.0 │ 1.0   │
│ 2   │ 3  │ 3.0 │ 3.0   │

julia> # join(small, large, on=:id, kind=:left)
       DataTable(id = [1, 3, 5], fid = [1.0, 3.0, 5.0], fid_1 = [1.0, 3.0, N])
3×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼─────┼───────┤
│ 1   │ 1  │ 1.0 │ 1.0   │
│ 2   │ 3  │ 3.0 │ 3.0   │
│ 3   │ 5  │ 5.0 │ #NULL │

julia> # join(small, large, on=:id, kind=:semi)
       DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │

julia> # join(small, large, on=:id, kind=:anti)
       DataTable(id = 5, fid = 5.0)
1×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 5  │ 5.0 │

julia> # join(small, large, on=:id, kind=:outer)
       DataTable(id = 0:5, fid = [N, 1.0, N, 3.0, N, 5.0], fid_1 = [0.0, 1.0, 2.0, 3.0, 4.0, N])
6×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 0  │ #NULL │ 0.0   │
│ 2   │ 1  │ 1.0   │ 1.0   │
│ 3   │ 2  │ #NULL │ 2.0   │
│ 4   │ 3  │ 3.0   │ 3.0   │
│ 5   │ 4  │ #NULL │ 4.0   │
│ 6   │ 5  │ 5.0   │ #NULL │

julia> # join(small, large, on=:id, kind=:right)
       DataTable(id = 0:4, fid = [N, 1.0, N, 3.0, N], fid_1 = 0.0:4.0)
5×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 0  │ #NULL │ 0.0   │
│ 2   │ 1  │ 1.0   │ 1.0   │
│ 3   │ 2  │ #NULL │ 2.0   │
│ 4   │ 3  │ 3.0   │ 3.0   │
│ 5   │ 4  │ #NULL │ 4.0   │

julia> # fid
       # join(small, large, on=:fid, kind=:inner)
       DataTable(id = [1, 3], fid = [1.0, 3.0], id_1 = [1, 3])
2×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼────┼─────┼──────┤
│ 1   │ 1  │ 1.0 │ 1    │
│ 2   │ 3  │ 3.0 │ 3    │

julia> # join(small, large, on=:fid, kind=:left)
       DataTable(id = [1, 3, 5], fid = [1.0, 3.0, 5.0], id_1 = [1, 3, N])
3×3 DataTables.DataTable
│ Row │ id │ fid │ id_1  │
├─────┼────┼─────┼───────┤
│ 1   │ 1  │ 1.0 │ 1     │
│ 2   │ 3  │ 3.0 │ 3     │
│ 3   │ 5  │ 5.0 │ #NULL │

julia> # join(small, large, on=:fid, kind=:semi)
       DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │

julia> # join(small, large, on=:fid, kind=:anti)
       DataTable(id = 5, fid = 5.0)
1×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 5  │ 5.0 │

julia> # join(small, large, on=:fid, kind=:outer)
       DataTable(id = [N, 1, N, 3, N, 5], fid = 0.0:5.0, id_1 = [0, 1, 2, 3, 4, N])
6×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1  │
├─────┼───────┼─────┼───────┤
│ 1   │ #NULL │ 0.0 │ 0     │
│ 2   │ 1     │ 1.0 │ 1     │
│ 3   │ #NULL │ 2.0 │ 2     │
│ 4   │ 3     │ 3.0 │ 3     │
│ 5   │ #NULL │ 4.0 │ 4     │
│ 6   │ 5     │ 5.0 │ #NULL │

julia> # join(small, large, on=:fid, kind=:right)
       DataTable(id = [N, 1, N, 3, N], fid = 0.0:4.0, id_1 = 0:4)
5×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1 │
├─────┼───────┼─────┼──────┤
│ 1   │ #NULL │ 0.0 │ 0    │
│ 2   │ 1     │ 1.0 │ 1    │
│ 3   │ #NULL │ 2.0 │ 2    │
│ 4   │ 3     │ 3.0 │ 3    │
│ 5   │ #NULL │ 4.0 │ 4    │

julia> # both
       # join(small, large, on=[:id, :fid], kind=:inner)
       DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │

julia> # join(small, large, on=[:id, :fid], kind=:left)
       DataTable(id = 1:2:5, fid = 1.0:2.0:5.0)
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │
│ 3   │ 5  │ 5.0 │

julia> # join(small, large, on=[:id, :fid], kind=:semi)
       DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │

julia> # join(small, large, on=[:id, :fid], kind=:anti)
       DataTable(id = 5, fid = 5.0)
1×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 5  │ 5.0 │

julia> # join(small, large, on=[:id, :fid], kind=:outer)
       DataTable(id = 0:5, fid = 0.0:5.0)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │
│ 6   │ 5  │ 5.0 │

julia> # join(small, large, on=[:id, :fid], kind=:right)
       DataTable(id = 0:4, fid = 0.0:4.0)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │

The only ones that don't behave as expected are the right outer join and the full outer join, which both appear to order according to the left data table and then the right. The answer I expected immediately follows the join I expected to produce it.

julia> join(small, large, on=:id, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 1  │ 1.0   │ 1.0   │
│ 2   │ 3  │ 3.0   │ 3.0   │
│ 3   │ 5  │ 5.0   │ #NULL │
│ 4   │ 0  │ #NULL │ 0.0   │
│ 5   │ 2  │ #NULL │ 2.0   │
│ 6   │ 4  │ #NULL │ 4.0   │

julia> DataTable(id = 0:5, fid = [N, 1.0, N, 3.0, N, 5.0], fid_1 = [0.0, 1.0, 2.0, 3.0, 4.0, N])
6×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 0  │ #NULL │ 0.0   │
│ 2   │ 1  │ 1.0   │ 1.0   │
│ 3   │ 2  │ #NULL │ 2.0   │
│ 4   │ 3  │ 3.0   │ 3.0   │
│ 5   │ 4  │ #NULL │ 4.0   │
│ 6   │ 5  │ 5.0   │ #NULL │

julia> join(small, large, on=:id, kind=:right)
5×3 DataTables.DataTable
│ Row │ id    │ fid   │ fid_1 │
├─────┼───────┼───────┼───────┤
│ 1   │ 0     │ 1.0   │ 1.0   │
│ 2   │ 3     │ 3.0   │ 3.0   │
│ 3   │ 2     │ #NULL │ 0.0   │
│ 4   │ #NULL │ #NULL │ 2.0   │
│ 5   │ 4     │ #NULL │ 4.0   │

julia> DataTable(id = 0:4, fid = [N, 1.0, N, 3.0, N], fid_1 = 0.0:4.0)
5×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 0  │ #NULL │ 0.0   │
│ 2   │ 1  │ 1.0   │ 1.0   │
│ 3   │ 2  │ #NULL │ 2.0   │
│ 4   │ 3  │ 3.0   │ 3.0   │
│ 5   │ 4  │ #NULL │ 4.0   │

julia> join(small, large, on=:fid, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1  │
├─────┼───────┼─────┼───────┤
│ 1   │ 1     │ 1.0 │ 1     │
│ 2   │ 3     │ 3.0 │ 3     │
│ 3   │ 5     │ 5.0 │ #NULL │
│ 4   │ #NULL │ 0.0 │ 0     │
│ 5   │ #NULL │ 2.0 │ 2     │
│ 6   │ #NULL │ 4.0 │ 4     │

julia> DataTable(id = [N, 1, N, 3, N, 5], fid = 0.0:5.0, id_1 = [0, 1, 2, 3, 4, N])
6×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1  │
├─────┼───────┼─────┼───────┤
│ 1   │ #NULL │ 0.0 │ 0     │
│ 2   │ 1     │ 1.0 │ 1     │
│ 3   │ #NULL │ 2.0 │ 2     │
│ 4   │ 3     │ 3.0 │ 3     │
│ 5   │ #NULL │ 4.0 │ 4     │
│ 6   │ 5     │ 5.0 │ #NULL │

julia> join(small, large, on=:fid, kind=:right)
5×3 DataTables.DataTable
│ Row │ id    │ fid   │ id_1 │
├─────┼───────┼───────┼──────┤
│ 1   │ 1     │ 0.0   │ 1    │
│ 2   │ 3     │ 3.0   │ 3    │
│ 3   │ #NULL │ 2.0   │ 0    │
│ 4   │ #NULL │ #NULL │ 2    │
│ 5   │ #NULL │ 4.0   │ 4    │

julia> DataTable(id = [N, 1, N, 3, N], fid = 0.0:4.0, id_1 = 0:4)
5×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1 │
├─────┼───────┼─────┼──────┤
│ 1   │ #NULL │ 0.0 │ 0    │
│ 2   │ 1     │ 1.0 │ 1    │
│ 3   │ #NULL │ 2.0 │ 2    │
│ 4   │ 3     │ 3.0 │ 3    │
│ 5   │ #NULL │ 4.0 │ 4    │

julia> join(small, large, on=[:id, :fid], kind=:outer)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │
│ 3   │ 5  │ 5.0 │
│ 4   │ 0  │ 0.0 │
│ 5   │ 2  │ 2.0 │
│ 6   │ 4  │ 4.0 │

julia> DataTable(id = 0:5, fid = 0.0:5.0)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │
│ 6   │ 5  │ 5.0 │

julia> join(small, large, on=[:id, :fid], kind=:right)
5×2 DataTables.DataTable
│ Row │ id    │ fid   │
├─────┼───────┼───────┤
│ 1   │ 0     │ 0.0   │
│ 2   │ 3     │ 3.0   │
│ 3   │ 2     │ 2.0   │
│ 4   │ #NULL │ #NULL │
│ 5   │ 4     │ 4.0   │

julia> DataTable(id = 0:4, fid = 0.0:4.0)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │

We recently merged a PR that modified joining in #17, and at first I thought it was a behavioral bug I had missed, but I ran these tests against DataFrames and got the same results so it seems this is how outer joins have behaved in Julia in the past few years. I personally would prefer that both:

- we never promote to nullable arrays unless required
- we order as shown in the examples rather than what the current behavior yields.
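The two orderings can be compared on the key columns alone. The sketch below uses plain vectors in Julia 1.x, not DataTables. Sorting by key is an assumption: it happens to reproduce the expected tables for this example, but the issue does not state that sorting is the intended rule.

```julia
left_keys  = collect(1:2:5)   # the id column of `small`
right_keys = collect(0:4)     # the id column of `large`

# Ordering reported for the outer join: left-table keys first,
# then the keys that appear only in the right table.
left_then_right = vcat(left_keys, setdiff(right_keys, left_keys))

# One ordering that matches the expected result here: all keys sorted.
sorted_keys = sort(union(left_keys, right_keys))

println(left_then_right)
println(sorted_keys)
```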

I tried to find examples of this but came up with very few that could inform what the "correct" behavior would be here. It turns out everyone on the internet explains joins with Venn diagrams. Here's one example that would suggest our current ordering is wrong.
https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/

If anyone has examples from other languages (R, Python, Matlab, others) or examples from reference material those would be very helpful.

For quick polling, let's do
👍 = change both (REMOVE/CHANGE autopromotion and CHANGE the ordering)
😕 = KEEP autopromotion of outer joins, but the ordering is confusing so CHANGE ordering
👎 = REMOVE/CHANGE autopromotion, but KEEP order as is
❤️ = I love it the way it is, KEEP autopromotion of outer joins and KEEP order as is

Thanks for reading!

Remove DataFrames releases/tags

They're no longer relevant and should be removed

usefulness of a `colwise!` method?

DataTables implements a colwise method, but no colwise!. There aren't a massive number of use cases for this, but I can think of e.g. data centering and normalization. If there are no major technical obstacles to making such a function, I think it'd make a nice addition.
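A minimal sketch of what an in-place variant might look like, using a vector of column vectors in plain Julia 1.x rather than a DataTable. Both `colwise!` as defined here and the helper `center!` are hypothetical; the signature is an assumption, not the package's API.

```julia
using Statistics  # standard library, provides `mean`

# Hypothetical in-place variant: apply `f!` to every column, mutating each one.
function colwise!(f!, columns::AbstractVector{<:AbstractVector})
    foreach(f!, columns)
    return columns
end

# Example mutating function: subtract the column mean from every element.
center!(col) = (col .-= mean(col); col)

cols = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
colwise!(center!, cols)
println(cols)
```

Mutating the existing columns avoids allocating a second copy of the data, which is the main appeal for centering and normalization.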

Future of DataTables

I wrote this:

Are there any plans for what will happen with DataTables? I've been working on a branch of DataTables that uses DataValue instead of Nullable and once that is ready I'll want to make that work available in some form as a package. One option would be to just release it as DataValueTables. On the flip side, if the current plan for DataTables is that it will just go away once things are merged back into DataFrames, we could repurpose DataTables to continue to exist as a table that is based on DataValue, i.e. it would continue to be a table implementation that uses a container-based approach for missing data, but it would be (hopefully) much more usable than the Nullable-based approach we have right now.

I'm not entirely sure where I'm going with that work. In the short term (i.e. julia 0.6 timeframe) I just want to have a table type that is fast and has good usability, so that is the short term goal. Medium term, I mostly see it as a hedge: if the whole Union{T,Null} thing works out for julia 1.0, it will probably just go away. But if not, we would still have something fast and easy to use for julia 1.0.

@nalimilan wrote this:

@davidanthoff Let's discuss this in another issue. One problem for the future of DataTables is that CategoricalArrays are going to switch to Union{T, Null} too, so DataTables will be stuck with the current CategoricalArrays version.

Tag a new version?

Compiling error

With julia version

Julia Version 0.6.0
Commit 903644385b (2017-06-19 13:05 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

And DataTables version

v"0.0.3"

The following error occurs.

julia> using DataTables
INFO: Precompiling module DataTables.
WARNING: Module Compat with uuid 254536935008373 is missing from the cache.
This may mean module Compat does not support precompilation but is imported by a module that does.
ERROR: LoadError: Declaring precompile(false) is not allowed in files that are being precompiled.
Stacktrace:
[1] _require(::Symbol) at ./loading.jl:448
[2] require(::Symbol) at ./loading.jl:398
[3] include_from_node1(::String) at ./loading.jl:569
[4] include(::String) at ./sysimg.jl:14
[5] anonymous at ./:2
while loading /Users/brent/.julia/v0.6/DataTables/src/DataTables.jl, in expression starting on line 11
ERROR: Failed to precompile DataTables to /Users/brent/.julia/lib/v0.6/DataTables.ji.
Stacktrace:
[1] compilecache(::String) at ./loading.jl:703
[2] _require(::Symbol) at ./loading.jl:490
[3] require(::Symbol) at ./loading.jl:398

documentation link broken

Cannot access the docs by clicking the badge in the README

Make DataTables a major release of DataFrames

I'm currently trying to write code that works with both DataFrames and DataTables, but I'm constantly running into name collision issues. It would make my work a lot easier if the DataTables code was simply a major release version of DataFrames.

How do I use CSV.read to return a DataTable instead of a DataFrame?

Unable to find this in the documentation.

duplicate functionality in merge! and hcat!

AFAICT merge! and hcat! do the same thing. Should we export hcat! and deprecate merge!?

Recommended date to make DataTables branches master

When is the recommended date to make DataTables branches of packages that depend on DataFrames the main branch (i.e. the one that goes into METADATA)?

`==` does not compare columns of `ZonedDateTime`s correctly

Comparing two ZonedDateTimes that represent the same "instant" (but in different time zones) with == returns true, but comparing them with isequal returns false.

julia> using TimeZones, DataFrames, DataTables

julia> ZonedDateTime(2016, 1, 1, TimeZone("America/Winnipeg")) == ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC"))
true

julia> isequal(ZonedDateTime(2016, 1, 1, TimeZone("America/Winnipeg")), ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC")))
false

DataFrames.jl maintains this convention:

julia> using TimeZones, DataFrames

julia> df_1 = DataFrame(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 0, TimeZone("America/Winnipeg")), ZonedDateTime(2016, 1, 1, 1, TimeZone("America/Winnipeg"))])
2×2 DataFrames.DataFrame
│ Row │ id │ date                      │
├─────┼────┼───────────────────────────┤
│ 1   │ 1  │ 2016-01-01T00:00:00-06:00 │
│ 2   │ 2  │ 2016-01-01T01:00:00-06:00 │

julia> df_2 = DataFrame(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC")), ZonedDateTime(2016, 1, 1, 7, TimeZone("UTC"))])
2×2 DataFrames.DataFrame
│ Row │ id │ date                      │
├─────┼────┼───────────────────────────┤
│ 1   │ 1  │ 2016-01-01T06:00:00+00:00 │
│ 2   │ 2  │ 2016-01-01T07:00:00+00:00 │

julia> df_1 == df_2
true

julia> isequal(df_1, df_2)
false

...but DataTables.jl doesn't:

julia> using TimeZones, DataTables

julia> dt_1 = DataTable(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 0, TimeZone("America/Winnipeg")), ZonedDateTime(2016, 1, 1, 1, TimeZone("America/Winnipeg"))])
2×2 DataTables.DataTable
│ Row │ id │ date                      │
├─────┼────┼───────────────────────────┤
│ 1   │ 1  │ 2016-01-01T00:00:00-06:00 │
│ 2   │ 2  │ 2016-01-01T01:00:00-06:00 │

julia> dt_2 = DataTable(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC")), ZonedDateTime(2016, 1, 1, 7, TimeZone("UTC"))])
2×2 DataTables.DataTable
│ Row │ id │ date                      │
├─────┼────┼───────────────────────────┤
│ 1   │ 1  │ 2016-01-01T06:00:00+00:00 │
│ 2   │ 2  │ 2016-01-01T07:00:00+00:00 │

julia> dt_1 == dt_2
false

It's no real mystery why, given the fairly terse definition of ==:

@compat(Base.:(==))(dt1::AbstractDataTable, dt2::AbstractDataTable) = isequal(dt1, dt2)

I think that supporting == comparisons (rather than just doing isequal all the way down) would be preferable in this case.
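The difference can be reproduced without TimeZones. In the sketch below, `MiniTable` is a hypothetical stand-in for DataTable, and the pair `0.0` and `-0.0` stands in for the two ZonedDateTime values because it is also `==` but not `isequal`. It is written for Julia 1.x and is not the package's implementation.

```julia
# Hypothetical minimal table type, standing in for DataTable.
struct MiniTable
    columns::Vector{Vector{Float64}}
end

# `isequal` compares elements with `isequal`; `==` compares them with `==`.
Base.isequal(a::MiniTable, b::MiniTable) = isequal(a.columns, b.columns)
Base.:(==)(a::MiniTable, b::MiniTable) = a.columns == b.columns

t1 = MiniTable([[0.0, 1.0]])
t2 = MiniTable([[-0.0, 1.0]])

println(t1 == t2)          # elementwise ==
println(isequal(t1, t2))   # elementwise isequal
```

Defining the two functions separately lets each keep its own element-level semantics instead of `==` forwarding to `isequal`.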

Version information:

julia> Pkg.status("DataTables")
 - DataTables                    0.0.3

julia> versioninfo()
Julia Version 0.6.0-rc3.0
Commit ad290e93e4* (2017-06-07 11:53 UTC)

Don't reexport StatsBase

Would it be possible to not reexport StatsBase? I often use DataTables (and DataFrames) without using StatsBase.

I think the reexport is not intuitive, and having fit and predict exported is annoying when using DataTables with ScikitLearn.jl. Also, DataTables exporting things like autocor is a bit weird.