Comments (21)
I don't like isunique(dt)
, as the question of whether there are duplicated columns does not make much sense to me. Also, to check whether there are duplicate rows, we should rather override allunique
. Data tables are not exactly like arrays, their dimensions are not really symmetric so we don't always need to support operations for both rows and columns.
iscomplete(dt)
makes more sense, but I'm not sure I like the generalization to rows and columns. I don't think there's any precedent for this in Julia, and in DataTables generally functions do not support choosing the dimension to consider: as I said, they are not symmetric. There's also the function anynull
which does the same thing, but which could be deprecated in favor of any(isnull, x)
; at the very least we need to consider these functions together.
R calls nonunique duplicated - so you can subset by subset(df, !duplicated(name)), which reads quite intuitively.
@mkborregaard Yet that pattern is terribly wasteful since it creates a temporary vector. If we expect people to use the !
often, better define it other other way around. Also the term "duplicated" isn't used in the Julia API, "unique" (i.e. the reverse) is always used.
Would having an argument (in your example 1 or 2, though I'd prefer :rows or :columns/:cols) to specify the direction conflict with the functionality I added/suggested to specify the columns to check for uniques/nulls (because of occupying the argument position)? Or would the column specification only apply to dropnull and unique? That would make unique and isunique quite different I think. Which may not be a bad thing, but I'm not sure.
Yes, dropnull
and unique
currently operates row-wise, and this is very natural. This is another reason for not having iscomplete
and isunique
take the dimension as a second argument: better have consistent pairs of functions.
What is the use case for isunique(dt, 1 (or 2))? I think originally it was to use for sorting the datatable - e.g. df[!nonunique(df[1:3]), :], but that use case could be replaced by unique(dt, 1:3).
That's indeed a good question, since completecases
clearly has its origins in the R inspiration of DataFrames. In R it's idiomatic to do df[complete_cases(df),]
. In Julia you wouldn't never write this, and providing this function could even be misleading for new users.
Yet there may be valid use cases, like storing the vector of complete cases somewhere for later use (e.g. when building a model matrix, to remember where incomplete cases are for predict
, cf. JuliaStats/StatsModels.jl#17). But this is an advanced use case and the docs for completecases
should point to dropnull
/dropnull!
as the recommended solution.
from datatables.jl.
I like the idea, though having it a keyword would violate type stability I think?
from datatables.jl.
Let's comment on the PR for the first part of the proposal.
WRT returning a view, that makes sense, but we should have a clear policy applying to all functions (e.g. to nonunique
unique
too). So we need to identify all functions which would need to be checked and possibly changed.
from datatables.jl.
Otherwise I see and agree with your argument for making completecases (and nonunique) return a Boolean vector and dropnull return (possibly) a view. IMHO view = true
could even be the default, as that should work in most cases and be much faster.
from datatables.jl.
Keyword arguments can be typed, the problem is that the return type of a function cannot depend on the value of an argument if we want inference to figure it out (there are also complications due to the type of keyword arguments not being taken into account by inference to compute the return type, but that's another issue).
I'm not sure about the naming. isunique
sounds too much like it would return a single Bool
. (non)uniqueindices
would be more explicit, but maybe a bit verbose.
from datatables.jl.
e.g.
julia> function test(; tst = true)
tst && return 1
1.
end
test (generic function with 1 method)
julia> function test2()
1
end
test2 (generic function with 1 method)
julia> function test2(x)
1.
end
test2 (generic function with 2 methods)
julia> @code_warntype test(tst = true)
Variables:
#unused#::#kw##test
#temp#@_2::Array{Any,1}
::#test
tst::Any
#temp#@_5::Int64
#temp#@_6::Int64
#temp#@_7::Int64
#temp#@_8::Any
#temp#@_9::Union{Float64,Int64}
Body:
begin
tst::Any = true
SSAValue(2) = (Base.arraylen)(#temp#@_2::Array{Any,1})::Int64
SSAValue(3) = (Base.select_value)((Base.sle_int)(0,1)::Bool,(Base.box)(Int64,(Base.ashr_int)(SSAValue(2),(Base.box)(UInt64,1))),(Base.box)(Int64,(Base.shl_int)(SSAValue(2),(Base.box)(UInt64,(Base.box)(Int64,(Base.neg_int)(1))))))::Int64
SSAValue(4) = (Base.select_value)((Base.sle_int)(1,SSAValue(3))::Bool,SSAValue(3),(Base.box)(Int64,(Base.sub_int)(1,1)))::Int64
#temp#@_5::Int64 = 1
6:
unless (Base.box)(Base.Bool,(Base.not_int)((#temp#@_5::Int64 === (Base.box)(Int64,(Base.add_int)(SSAValue(4),1)))::Bool)) goto 21
SSAValue(5) = #temp#@_5::Int64
SSAValue(6) = (Base.box)(Int64,(Base.add_int)(#temp#@_5::Int64,1))
#temp#@_6::Int64 = SSAValue(5)
#temp#@_5::Int64 = SSAValue(6)
#temp#@_7::Int64 = (Base.box)(Int64,(Base.sub_int)((Base.box)(Int64,(Base.mul_int)(#temp#@_6::Int64,2)),1))
#temp#@_8::Any = (Core.arrayref)(#temp#@_2::Array{Any,1},#temp#@_7::Int64)::Any
unless (#temp#@_8::Any === :tst)::Bool goto 17
tst::Any = (Core.arrayref)(#temp#@_2::Array{Any,1},(Base.box)(Int64,(Base.add_int)(#temp#@_7::Int64,1)))::Any
goto 19
17:
(Base.throw)($(Expr(:new, :(Base.MethodError), :((Core.getfield)((Core.getfield)((Core.getfield)(#test,:name)::TypeName,:mt),:kwsorter)), :((Core.tuple)(#temp#@_2,)::Tuple{Array{Any,1},#test}))))::Union{}
19:
goto 6
21:
# meta: location REPL[7] #test#1 2
unless tst::Any goto 26
#temp#@_9::Union{Float64,Int64} = 1
goto 29
26: # line 3:
#temp#@_9::Union{Float64,Int64} = 1.0
29:
# meta: pop location
return #temp#@_9::Union{Float64,Int64}
end::Union{Float64,Int64}
julia> @code_warntype test2(true)
Variables:
#self#::#test2
x::Bool
Body:
begin
return 1.0
end::Float64
from datatables.jl.
Sorry for the many edits.
from datatables.jl.
A pared down version is here: #39
I guess the suggestion to use views is separate.
from datatables.jl.
I like the views idea, but I'm worried making it part of completecases
could be confusing. Then dropnull
, dropnull!
and completecases
would all return DataTable
s and it wouldn't be immediately clear why completecases
is needed without checking the documentation. The goal of #6 was to introduce dropnull
as a means of making completecases
more distinct, and this would make them more similar again.
Same reservations for nonunique
, which then would behave similarly to unique
except it would return a view rather than a copy? Anyone who desires a boolean array would have to implement new functions themselves.
Could returning a view (rather than a copy or inplace operation) be a keyword argument for dropnull
and unique
dropnull(DataTable, view=false)
dropnull(DataTable, cols, view=false)
unique(DataTable, view=false)
unique(DataTable, cols, view=false)
?
👍 on the main idea of having completecases
for selected columns.
from datatables.jl.
Of course when I said returning a view was a good idea, I was referring to functions which currently return a data table. I just confused unique
and nonunique
: only the former would return a view. Anyway we should rename nonunique
that's too confusing.
from datatables.jl.
This all sounds good to me!
Out of curiosity, why does using keyword arguments violate type stability? I've seen examples where keyword arguments are typed, like here in CSV.jl. Are those not type stable?
nonunique
-> isunique
? isuniquerow
? with inverted true/false output relative to nonunique
from datatables.jl.
Yes, to be 100% honest I also started from the belief that completecases
would return the complete cases, e.g. a DataTable 😊 .
Maybe arenonunique
? (no that's horrible). I think isnonunique
could work. And iscomplete
.
from datatables.jl.
Yes, to be 100% honest I also started from the belief that completecases would return the complete cases, e.g. a DataTable 😊 .
^ coming from DataFrames, that's exactly what you should have expected!
Let's start with what we need, trying to forget what functions currently exist and pretending we have a clean slate.
- boolean vector indicating which rows are complete/nullfree.
- boolean vector indicating which rows are the first unique instance of a row.
Right now, completecases
satisfies the first, and the inverse of nonunique
, .!nonunique
, would satisfy the second. I don't think we should add any function with non
in front because the double-negative of .!non
is kind of silly right from the beginning
-
deprecate the name
nonunique
-
replace it with
isunique
.isunique(dt, 1)
would return a boolean vector indicating if for each row that row is the first unique instance of a given row in a datatable..!isunique(dt, 1) == nonunique(dt)
isunique(dt, 2)
would return a boolean vector indicating if for each column that column is the first unique instance of a given column in a datatable.- unique columns will be based on values without checking names, as columns cannot share names.
isunique(dt) = all(isunique(dt, 1)) && all(isunique(dt, 2))
-
We update
unique(dt)
to unique-ify rows AND columns. I imagine people don't usually have duplicate columns, so even though users currently expectunique(dt)
to only apply to rows, I think this should be ok?unique(dt, 1)
unique-ify the datatable based on rowsunique(dt, 2)
unique-ify the datatable based on columns
-
deprecate the name
completecases
-
rename it to
iscomplete
iscomplete(dt, 1)
would return a boolean vector indicating if for each row that row is complete (nullfree).iscomplete(dt, 1) == completecases(dt)
iscomplete(dt, 2)
would return a boolean vector indicating if for each column that column is complete (nullfree)iscomplete(dt) = all(iscomplete(dt, 2))
- and if we aren't doing this already,
iscomplete(dt, 1)/completecases
should maybe runiscomplete(dt, 2)
first, as in most cases I think nrows >> ncols and if you only have to check thateltypes(cols) .!<: Nullable
on 3 columns to avoid checking 100K rows, that might speed things up.
- and if we aren't doing this already,
With these changes, iscomplete(dt)
and isunique(dt)
will return single boolean values, as @nalimilan suggested he would expect and I agree is probably the most logical based on the naming. I imagine users would begin running @assert iscomplete(dt) && isunique(dt)
as a nice pre-check when working with new datasets. If you specify that you want to check row-wise or col-wise, then you'll break from that "single value" convention and get the boolean indices for all rows and columns as everyone (I think?) desires. Writing all(iscomplete(dt, 1))
isn't all that bad either if people really want single values for a single dimension. I think this will be helpful in that it unifies the behavior patterns of iscomplete
and isunique
and lets us completely start from scratch rather than carry over one of the two different behaviors of complete_cases
and complete_cases!
.
I think that covers all cases and seems to be a decent naming strategy that holds across functions and utilizes the widely used function(input, axis)
convention to be specific about what return type you want. Thoughts?
from datatables.jl.
I gave those changes a shot and opened a PR for them. We may not decide to go that direction, but it was helpful to confirm the approach would be viable.
Others I have talked to said they would expect the function completecases
to return a DataTable, so I think scrapping the name completecases
entirely would be beneficial.
@mkborregaard could you apply the changes (returning a view and only checking the appropriate columns for complete cases) to dropnull
in #39? If you wouldn't mind, could you also try a few benchmarks of the before/after speeds when writing tests? Having a few comparisons would be helpful both for evaluating the PR and to have on hand when we figure out how to automate performance testing.
Thanks to both of you for the explanations on the type insecurity.
from datatables.jl.
A few thoughts:
- R calls
nonunique
duplicated
- so you can subset bysubset(df, !duplicated(name))
, which reads quite intuitively. - Would having an argument (in your example
1
or2
, though I'd prefer:rows
or:columns
/:cols
) to specify the direction conflict with the functionality I added/suggested to specify the columns to check for uniques/nulls (because of occupying the argument position)? Or would the column specification only apply todropnull
andunique
? That would makeunique
andisunique
quite different I think. Which may not be a bad thing, but I'm not sure. - What is the use case for
isunique(dt, 1 (or 2))
? I think originally it was to use for sorting the datatable - e.g.df[!nonunique(df[1:3]), :]
, but that use case could be replaced byunique(dt, 1:3)
.
from datatables.jl.
And yes, I am happy to update dropnull as suggested and do the tests/benchmarks - I will do it at the same time as writing the promised tests for the predict PR :-) So it may be a few days.
from datatables.jl.
All makes very good sense. So what do you suggest? Keeping the current methods, implementing a col-specific version of dropnull!
, returning views rather than copies by default from methods that return a DataTable (the user can always copy
manually), and changing the names of nonunique
and completecases
?
from datatables.jl.
I tried not to make too many assumptions about user patterns when thinking about isunique
and iscomplete
behavior. I think we all have our own assumptions about what kind of patterns are common and justifiable, but one justification I could think of for isunique
on columns is what if someone wanted to do several outer joins and confirm that they hadn't introduced duplicate columns? Googling "remove duplicate column after join" yields > 250K results so I don't think it's an invalid case, even if rare.
from datatables.jl.
If we restrict the column subset to require an array I think all of these behaviors could be supported at once.
e.g.
unique(dt, 1) # unique based on rows
unique(dt, [1]) # unique, restricted to only checking for uniqueness in first column
etc...
@mkborregaard would that be an acceptable compromise to make when specifying specific columns to check?
from datatables.jl.
I'm not super fond of it, sorry. I agree with @nalimilan that it is important to keep the distinction between columns and rows in DataTables. And I have a personal dislike for functions where you cannot do indexin(3, 1:5)
but have to do indexin([3], 1:5)
.
from datatables.jl.
I agree that using brackets/passing arrays where they aren't necessary is pretty ugly. What about recycling the on=
keyword from join
? That wouldn't change the return type as it did for my earlier suggestion, and I think everyone would immediately intuit what completecases(dt, on={Int, Symbol, Vector{Int}, Vector{Symbol}}
means that the function is only checking either the individual (bracket free) column or the series of columns in the array. I'm also open to using :rows/:cols
to indicate dimensions in addition to or instead of 1/2
.
I find the argument that we shouldn't support columnar checks because datatables packages from other languages don't support them (lack of precedent) to be a weak one. There's evidence that people have run into the issue of checking columns for duplicates in the past and found it difficult enough to post to SO and other venues about. I'm in full agreement that the functionality you are asking for should be possible! I would just like that it didn't come at the cost of being unable to support other functionality.
edit: We could also make dim=
be the required keyword and @mkborregaard could go forward with his original proposal in the original form. That too would allow all of these functionalities to coexist.
edit 2: if we use the dim=
keyword we could leave the current system (only check rows, not rows and columns) in place by setting 1 or :rows
as the default. That wouldn't require changing any code, behavior, or syntax but would enable extending these functions to columns.
from datatables.jl.
Related Issues (20)
- documentation link broken HOT 1
- Recommended date to make DataTables branches master HOT 4
- Overwrites DataFrames describe function HOT 11
- "correct" join results HOT 9
- usefulness of a `colwise!` method? HOT 5
- Strange error message at REPL with the latest Julia nightly HOT 5
- How do I use CSV.read to return a DataTable instead of a DataFrame? HOT 3
- Make DataTables a major release of DataFrames HOT 3
- duplicate functionality in merge! and hcat! HOT 2
- Tag a new version? HOT 4
- Don't reexport Statsbase HOT 5
- describe does not function HOT 1
- Start replacing Nullable with Null HOT 3
- Throw error for type mismatch using join? HOT 5
- Join on columns of different name HOT 4
- Compiling error HOT 1
- Backporting to DataFrames HOT 7
- Future of DataTables HOT 13
- `==` does not compare columns of `ZonedDateTime`s correctly HOT 7
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from datatables.jl.