Comments (6)
> At present the default method for levels ends up sorting the pool.

This is what the `DataAPI.levels` API requires:

> Values are returned in the preferred order for the collection, with the result of sort as a default.

As for `PooledArray`, we could write `collect(skipmissing(DataAPI.refpool(pa)))` instead, which should be faster. The question is what meaning `levels` should have for `PooledArray`, i.e. in what situations someone would use `levels` on a `PooledArray` (I do not think it is useful for this type).
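
To make the two orderings concrete, here is a minimal sketch. It assumes the PooledArrays and DataAPI packages are installed, that the pool stores values in order of first appearance, and that PooledArrays defines no `levels` method of its own, so the generic fallback applies. `pool_order_levels` is a hypothetical helper name, not part of either package.

```julia
using PooledArrays, DataAPI

# Hypothetical helper (not part of either package):
# the pool in its stored order, with `missing` dropped.
pool_order_levels(pa) = collect(skipmissing(DataAPI.refpool(pa)))

pa = PooledArray(["b", missing, "a", "b", "c"])

sorted_levels = DataAPI.levels(pa)      # generic fallback: unique values, sorted
pooled_levels = pool_order_levels(pa)   # order in which values entered the pool
```

Under those assumptions the fallback should give a, b, c, while the pool-based version should give b, a, c.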
from pooledarrays.jl.
> At present the default method for levels ends up sorting the pool.
> This is what the `DataAPI.levels` API requires: Values are returned in the preferred order for the collection, with the result of sort as a default.

I would say that for a `PooledArray` the order in the refpool is the preferred order.

My understanding of the API is that `refpool` can return `nothing` if the argument is not a factor-like object such as `PooledArray`, `CategoricalArray`, `Arrow.DictEncoded`, etc. But `levels` should always return a set of potential levels for the argument. If the result of `refpool` is `nothing`, then the levels are determined by a sorting operation that eliminates the duplicates.

So for factor-like arrays I would want the result of `levels` to be the same as the result of `refpool` without any `missing`.

> As for `PooledArray` we could write `collect(skipmissing(DataAPI.refpool(pa)))` instead which should be faster. The question is what meaning `levels` for `PooledArray` should have, i.e. in what situations someone would use `levels` on a `PooledArray` (I do not think it is useful for this type).

I am happy to create a PR with that implementation.
Perhaps I could explain why I want to add this definition. In the `StatsModels` package, when contrasts like `EffectsCoding` are applied to a term in a formula, the order of the levels determines the interpretation of the coefficients. The levels can be specified in the call to the coding specification, but they default to the value of `DataAPI.levels`.

An `Arrow` representation of a `DictEncoded` type can be flagged as ordered. In that case, the order should be preserved all the way to constructing contrasts for a model. Right now, copying an `Arrow.DictEncoded` object creates a `PooledArray`, which is why I want to ensure that `levels` applied to a `PooledArray` returns the `refpool` modulo removal of any missing values.
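
Why the order matters can be illustrated with a hand-rolled effects coding. This is only a sketch in plain Julia, not the StatsModels implementation. It assumes that the base level defaults to the first level, and `effects_coding_sketch` is a hypothetical name.

```julia
# Hand-rolled effects (sum-to-zero) coding; illustration only, not StatsModels code.
# Assumption: the base level defaults to the first entry of `levels`.
function effects_coding_sketch(levels::AbstractVector; base = first(levels))
    others = filter(!=(base), levels)             # levels that get a coefficient
    M = zeros(Int, length(levels), length(others))
    for (i, lev) in enumerate(levels)
        if lev == base
            M[i, :] .= -1                         # base row is -1 in every column
        else
            M[i, findfirst(==(lev), others)] = 1  # indicator for a non-base level
        end
    end
    return M, others
end

# Same values, two different level orders.
M_sorted, coef_sorted = effects_coding_sketch(["a", "b", "c"])
M_pool, coef_pool = effects_coding_sketch(["c", "a", "b"])
```

The two matrices have the same entries, but in the first call the coefficients belong to b and c with a as the base, while in the second they belong to a and b with c as the base. A model fit against one ordering therefore reads differently from a fit against the other.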
from pooledarrays.jl.
> if the argument is not a factor-like object such as `PooledArray`

I would say that `PooledArray` is not a factor-like object. The fact that it returns a `refpool` is only performance and storage related. From the user's perspective, `PooledArray` and `Vector` should be indistinguishable.

I am not sure whether the same holds for `Arrow.DictEncoded`. `CategoricalArray` is certainly factor-like.

The issue with `Arrow.DictEncoded` is that its `DataAPI.refpool` can contain duplicates AFAICT, as opposed to `PooledArray` and `CategoricalArray`. The question then is whether `levels` for `Arrow.DictEncoded` should return duplicates. I think not, as the `DataAPI.levels` contract is:

> a vector of unique values which occur or could occur in collection `x`

and it explicitly requires uniqueness.
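
One way to satisfy the uniqueness requirement while keeping the dictionary order is sketched below in plain Julia. The vector `dictionary` is a made-up stand-in for a pool that repeats an entry and holds a `missing`; it is not an actual `Arrow.DictEncoded` object.

```julia
# Made-up stand-in for a pool that repeats an entry and contains a `missing`.
dictionary = ["low", "high", missing, "low", "mid"]

# Drop `missing`, then keep the first occurrence of each value in pool order.
order_preserving_levels = unique(skipmissing(dictionary))
```

This keeps low, high, mid in their original relative order, with no duplicates and no `missing`.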
> Right now, copying an `Arrow.DictEncoded` object creates a `PooledArray`, which is why I want to ensure that levels applied to a `PooledArray` returns the refpool modulo removal of any missing values.
@nalimilan - so we have a challenge here. Ideally a `CategoricalArray` should be created instead of a `PooledArray` (so that in particular the ordered flag is retained). I understand that the challenge is that `Arrow.DictEncoded` supports a wider array of types than `CategoricalArray` does - is this correct?

If we do not resolve the issue cleanly, then considering `PooledArray` to be a factor-like container is a second-best option (but first I would prefer to investigate whether we can use `CategoricalArray` instead, as conceptually `PooledArray` is just a vector without any notion of being a factor).

@dmbates - note that currently even different `PooledArray`s share the same `DataAPI.refpool` for efficiency (so really the fact that there is a `DataAPI.refpool` for `PooledArray` should be considered an implementation detail). Here is an example:
julia> using PooledArrays, DataAPI
julia> x = PooledArray(1:5);
julia> y = x[1:3];
julia> z = copy(x);
julia> DataAPI.refpool(x) === DataAPI.refpool(y) === DataAPI.refpool(z)
true
from pooledarrays.jl.
I would be tempted to define `DataAPI.levels(::PooledArray)` to just be a very efficient version of `unique(::PooledArray)` (or, if there is already a specialization of `unique`, to just set `const DataAPI.levels(::PooledArray) = unique(::PooledArray)`). This would be useful in many contexts and at the same time provide the functionality that @dmbates wants.
from pooledarrays.jl.
Two things:
- we would still need to drop `missing` (this is easy)
- If I understand things correctly, this is not what @dmbates wants, as he needs `levels` to inherit the order from the source `Arrow.DictEncoded`, which can be arbitrary and can even contain values not present in the array
from pooledarrays.jl.
I agree with @bkamins that `levels(x::Array)` should return the same thing as `levels(PooledArray(x))`. This can indeed be made faster than the fallback method by calling `sort!(unique(x))`, but that's an implementation detail.

(I don't know what's most appropriate for `Arrow.DictEncoded`.)
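
The `sort!(unique(x))` idea, with `missing` dropped as discussed earlier in the thread, might look like the sketch below. It uses only Base Julia. `sorted_unique_levels` is a hypothetical name, not an existing method of DataAPI or PooledArrays.

```julia
# Hypothetical helper: sorted unique non-missing values, independent of storage.
sorted_unique_levels(x) = sort!(collect(skipmissing(unique(x))))

x = ["b", missing, "a", "b", "c"]
lv = sorted_unique_levels(x)
```

Because the result depends only on the values, a plain `Vector` and a pooled copy of it would give the same answer. Any speed-up for the pooled case would have to come from how `unique` is computed for that type.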
from pooledarrays.jl.