We're making progress on removing the automatic promotion of columns to nullable arrays (NullableArrays & NullableCategoricalArrays) in #30 and I would like some feedback and discussion about the best way to handle outer joins (outer, left, right). Outer joins are particularly interesting because they all have the potential to introduce nulls, but do not necessarily need to do so. All of the other joins currently supported create subsets of the tables and shouldn't have the potential to introduce nulls. Please correct me if I'm wrong on this.
For the other joins that do not have the potential to introduce missing data, it seems that the "correct" result is pretty easy to decide on: The result should be in the same order as the input data and the column types should be preserved.
For outer joins, we have two inter-related "correct"-ness questions that both have to do with nulls. The first question is what is the "correct" ordering of the returned values when nulls are introduced. Do we try and retain the original ordering of the input datasets, or do we default to putting the null values at the end, or left then right table ordering? The second is what is the "correct" column type to return to the user. As all outer joins have the potential to introduce missingness, the "correct" result could be the one that is most column-type consistent. In this case, we would retain the column type and only promote to using nullable arrays when nulls are introduced. On the flip side of this is the behavioral consistency approach. Because all outer joins have the potential to introduce missing data, it could be argued that outer joins should always return nullable columns (only for the smaller of the two joined tables or just completely nullify the datatable). In that case, users can always expect to receive nullable columns to be output from outer joins and write nullable-tolerant code accordingly without anything unexpected.
Here is an example set of joins. I'll turn these into test cases to assert whatever behavior pattern we decide on, and if any other examples are put forth here I'll add those too.
the data
julia> small = DataTable(id = 1:2:5, fid = 1.0:2.0:5.0)
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
│ 3 │ 5 │ 5.0 │
julia> large = DataTable(id = 0:4, fid = 0.0:4.0)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 0 │ 0.0 │
│ 2 │ 1 │ 1.0 │
│ 3 │ 2 │ 2.0 │
│ 4 │ 3 │ 3.0 │
│ 5 │ 4 │ 4.0 │
and here are the results I would expect from the joins, which are commented out in the line before the result I expected.
julia> N = Nullable()
Nullable{Union{}}()
julia> # join(small, large, kind=:cross)
DataTable(id = repeat(1:2:5, inner=5),
fid = repeat(1.0:2.0:5.0, inner=5),
id_1 = repeat(0:4, outer=3),
fid_1 = repeat(0.0:4.0, outer=3))
15×4 DataTables.DataTable
│ Row │ id │ fid │ id_1 │ fid_1 │
├─────┼────┼─────┼──────┼───────┤
│ 1 │ 1 │ 1.0 │ 0 │ 0.0 │
│ 2 │ 1 │ 1.0 │ 1 │ 1.0 │
│ 3 │ 1 │ 1.0 │ 2 │ 2.0 │
│ 4 │ 1 │ 1.0 │ 3 │ 3.0 │
│ 5 │ 1 │ 1.0 │ 4 │ 4.0 │
│ 6 │ 3 │ 3.0 │ 0 │ 0.0 │
│ 7 │ 3 │ 3.0 │ 1 │ 1.0 │
│ 8 │ 3 │ 3.0 │ 2 │ 2.0 │
│ 9 │ 3 │ 3.0 │ 3 │ 3.0 │
│ 10 │ 3 │ 3.0 │ 4 │ 4.0 │
│ 11 │ 5 │ 5.0 │ 0 │ 0.0 │
│ 12 │ 5 │ 5.0 │ 1 │ 1.0 │
│ 13 │ 5 │ 5.0 │ 2 │ 2.0 │
│ 14 │ 5 │ 5.0 │ 3 │ 3.0 │
│ 15 │ 5 │ 5.0 │ 4 │ 4.0 │
julia> # id
# join(small, large, on=:id, kind=:inner)
DataTable(id = [1, 3], fid = [1.0, 3.0], fid_1 = [1.0, 3.0])
2×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼─────┼───────┤
│ 1 │ 1 │ 1.0 │ 1.0 │
│ 2 │ 3 │ 3.0 │ 3.0 │
julia> # join(small, large, on=:id, kind=:left)
DataTable(id = [1, 3, 5], fid = [1.0, 3.0, 5.0], fid_1 = [1.0, 3.0, N])
3×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼─────┼───────┤
│ 1 │ 1 │ 1.0 │ 1.0 │
│ 2 │ 3 │ 3.0 │ 3.0 │
│ 3 │ 5 │ 5.0 │ #NULL │
julia> # join(small, large, on=:id, kind=:semi)
DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
julia> # join(small, large, on=:id, kind=:anti)
DataTable(id = 5, fid = 5.0)
1×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 5 │ 5.0 │
julia> # join(small, large, on=:id, kind=:outer)
DataTable(id = 0:5, fid = [N, 1.0, N, 3.0, N, 5.0], fid_1 = [0.0, 1.0, 2.0, 3.0, 4.0, N])
6×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1 │ 0 │ #NULL │ 0.0 │
│ 2 │ 1 │ 1.0 │ 1.0 │
│ 3 │ 2 │ #NULL │ 2.0 │
│ 4 │ 3 │ 3.0 │ 3.0 │
│ 5 │ 4 │ #NULL │ 4.0 │
│ 6 │ 5 │ 5.0 │ #NULL │
julia> # join(small, large, on=:id, kind=:right)
DataTable(id = 0:4, fid = [N, 1.0, N, 3.0, N], fid_1 = 0.0:4.0)
5×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1 │ 0 │ #NULL │ 0.0 │
│ 2 │ 1 │ 1.0 │ 1.0 │
│ 3 │ 2 │ #NULL │ 2.0 │
│ 4 │ 3 │ 3.0 │ 3.0 │
│ 5 │ 4 │ #NULL │ 4.0 │
julia> # fid
# join(small, large, on=:fid, kind=:inner)
DataTable(id = [1, 3], fid = [1.0, 3.0], id_1 = [1, 3])
2×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼────┼─────┼──────┤
│ 1 │ 1 │ 1.0 │ 1 │
│ 2 │ 3 │ 3.0 │ 3 │
julia> # join(small, large, on=:fid, kind=:left)
DataTable(id = [1, 3, 5], fid = [1.0, 3.0, 5.0], id_1 = [1, 3, N])
3×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼────┼─────┼───────┤
│ 1 │ 1 │ 1.0 │ 1 │
│ 2 │ 3 │ 3.0 │ 3 │
│ 3 │ 5 │ 5.0 │ #NULL │
julia> # join(small, large, on=:fid, kind=:semi)
DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
julia> # join(small, large, on=:fid, kind=:anti)
DataTable(id = 5, fid = 5.0)
1×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 5 │ 5.0 │
julia> # join(small, large, on=:fid, kind=:outer)
DataTable(id = [N, 1, N, 3, N, 5], fid = 0.0:5.0, id_1 = [0, 1, 2, 3, 4, N])
6×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼───────┼─────┼───────┤
│ 1 │ #NULL │ 0.0 │ 0 │
│ 2 │ 1 │ 1.0 │ 1 │
│ 3 │ #NULL │ 2.0 │ 2 │
│ 4 │ 3 │ 3.0 │ 3 │
│ 5 │ #NULL │ 4.0 │ 4 │
│ 6 │ 5 │ 5.0 │ #NULL │
julia> # join(small, large, on=:fid, kind=:right)
DataTable(id = [N, 1, N, 3, N], fid = 0.0:4.0, id_1 = 0:4)
5×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼───────┼─────┼──────┤
│ 1 │ #NULL │ 0.0 │ 0 │
│ 2 │ 1 │ 1.0 │ 1 │
│ 3 │ #NULL │ 2.0 │ 2 │
│ 4 │ 3 │ 3.0 │ 3 │
│ 5 │ #NULL │ 4.0 │ 4 │
julia> # both
# join(small, large, on=[:id, :fid], kind=:inner)
DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
julia> # join(small, large, on=[:id, :fid], kind=:left)
DataTable(id = 1:2:5, fid = 1.0:2.0:5.0)
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
│ 3 │ 5 │ 5.0 │
julia> # join(small, large, on=[:id, :fid], kind=:semi)
DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
julia> # join(small, large, on=[:id, :fid], kind=:anti)
DataTable(id = 5, fid = 5.0)
1×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 5 │ 5.0 │
julia> # join(small, large, on=[:id, :fid], kind=:outer)
DataTable(id = 0:5, fid = 0.0:5.0)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 0 │ 0.0 │
│ 2 │ 1 │ 1.0 │
│ 3 │ 2 │ 2.0 │
│ 4 │ 3 │ 3.0 │
│ 5 │ 4 │ 4.0 │
│ 6 │ 5 │ 5.0 │
julia> # join(small, large, on=[:id, :fid], kind=:right)
DataTable(id = 0:4, fid = 0.0:4.0)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 0 │ 0.0 │
│ 2 │ 1 │ 1.0 │
│ 3 │ 2 │ 2.0 │
│ 4 │ 3 │ 3.0 │
│ 5 │ 4 │ 4.0 │
The only ones that don't behave as expected are the right outer join and the full outer join, which both appear to order according to the left data table and then the right. The answer I expected immediately follows the join
I expected to produce it.
julia> join(small, large, on=:id, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1 │ 1 │ 1.0 │ 1.0 │
│ 2 │ 3 │ 3.0 │ 3.0 │
│ 3 │ 5 │ 5.0 │ #NULL │
│ 4 │ 0 │ #NULL │ 0.0 │
│ 5 │ 2 │ #NULL │ 2.0 │
│ 6 │ 4 │ #NULL │ 4.0 │
julia> DataTable(id = 0:5, fid = [N, 1.0, N, 3.0, N, 5.0], fid_1 = [0.0, 1.0, 2.0, 3.0, 4.0, N])
6×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1 │ 0 │ #NULL │ 0.0 │
│ 2 │ 1 │ 1.0 │ 1.0 │
│ 3 │ 2 │ #NULL │ 2.0 │
│ 4 │ 3 │ 3.0 │ 3.0 │
│ 5 │ 4 │ #NULL │ 4.0 │
│ 6 │ 5 │ 5.0 │ #NULL │
julia> join(small, large, on=:id, kind=:right)
5×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼───────┼───────┼───────┤
│ 1 │ 0 │ 1.0 │ 1.0 │
│ 2 │ 3 │ 3.0 │ 3.0 │
│ 3 │ 2 │ #NULL │ 0.0 │
│ 4 │ #NULL │ #NULL │ 2.0 │
│ 5 │ 4 │ #NULL │ 4.0 │
julia> DataTable(id = 0:4, fid = [N, 1.0, N, 3.0, N], fid_1 = 0.0:4.0)
5×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1 │ 0 │ #NULL │ 0.0 │
│ 2 │ 1 │ 1.0 │ 1.0 │
│ 3 │ 2 │ #NULL │ 2.0 │
│ 4 │ 3 │ 3.0 │ 3.0 │
│ 5 │ 4 │ #NULL │ 4.0 │
julia> join(small, large, on=:fid, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼───────┼─────┼───────┤
│ 1 │ 1 │ 1.0 │ 1 │
│ 2 │ 3 │ 3.0 │ 3 │
│ 3 │ 5 │ 5.0 │ #NULL │
│ 4 │ #NULL │ 0.0 │ 0 │
│ 5 │ #NULL │ 2.0 │ 2 │
│ 6 │ #NULL │ 4.0 │ 4 │
julia> DataTable(id = [N, 1, N, 3, N, 5], fid = 0.0:5.0, id_1 = [0, 1, 2, 3, 4, N])
6×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼───────┼─────┼───────┤
│ 1 │ #NULL │ 0.0 │ 0 │
│ 2 │ 1 │ 1.0 │ 1 │
│ 3 │ #NULL │ 2.0 │ 2 │
│ 4 │ 3 │ 3.0 │ 3 │
│ 5 │ #NULL │ 4.0 │ 4 │
│ 6 │ 5 │ 5.0 │ #NULL │
julia> join(small, large, on=:fid, kind=:right)
5×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 0.0 │ 1 │
│ 2 │ 3 │ 3.0 │ 3 │
│ 3 │ #NULL │ 2.0 │ 0 │
│ 4 │ #NULL │ #NULL │ 2 │
│ 5 │ #NULL │ 4.0 │ 4 │
julia> DataTable(id = [N, 1, N, 3, N], fid = 0.0:4.0, id_1 = 0:4)
5×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼───────┼─────┼──────┤
│ 1 │ #NULL │ 0.0 │ 0 │
│ 2 │ 1 │ 1.0 │ 1 │
│ 3 │ #NULL │ 2.0 │ 2 │
│ 4 │ 3 │ 3.0 │ 3 │
│ 5 │ #NULL │ 4.0 │ 4 │
julia> join(small, large, on=[:id, :fid], kind=:outer)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
│ 3 │ 5 │ 5.0 │
│ 4 │ 0 │ 0.0 │
│ 5 │ 2 │ 2.0 │
│ 6 │ 4 │ 4.0 │
julia> DataTable(id = 0:5, fid = 0.0:5.0)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 0 │ 0.0 │
│ 2 │ 1 │ 1.0 │
│ 3 │ 2 │ 2.0 │
│ 4 │ 3 │ 3.0 │
│ 5 │ 4 │ 4.0 │
│ 6 │ 5 │ 5.0 │
julia> join(small, large, on=[:id, :fid], kind=:right)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼───────┼───────┤
│ 1 │ 0 │ 0.0 │
│ 2 │ 3 │ 3.0 │
│ 3 │ 2 │ 2.0 │
│ 4 │ #NULL │ #NULL │
│ 5 │ 4 │ 4.0 │
julia> DataTable(id = 0:4, fid = 0.0:4.0)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 0 │ 0.0 │
│ 2 │ 1 │ 1.0 │
│ 3 │ 2 │ 2.0 │
│ 4 │ 3 │ 3.0 │
│ 5 │ 4 │ 4.0 │
We recently merged a PR that modified joining in #17, and at first I thought it was a behavioral bug I had missed, but I ran these tests against DataFrames and got the same results so it seems this is how outer joins have behaved in Julia in the past few years. I personally would prefer that both
- we never promote to nullable arrays unless required
- we order as shown in the examples rather than what the current behavior yields.
I tried to find examples of this but I came up with very few guiding examples to inform what would be the "correct" behavior here. Turns out everyone on the internet explains joins with venn diagrams. Here's one example that would suggest our current ordering is wrong.
https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
If anyone has examples from other languages (R, Python, Matlab, others) or examples from reference material those would be very helpful.
For quick polling, let's do
👍 = change both (REMOVE/CHANGE autopromotion and CHANGE the ordering)
😕 = KEEP autopromotion of outer joins, but the ordering is confusing so CHANGE ordering
👎 = REMOVE/CHANGE autopromotion, but KEEP order as is
❤️ = I love it the way it is, KEEP autopromotion of outer joins and KEEP order as is
Thanks for reading!