Giter Club home page Giter Club logo

Comments (6)

bkamins avatar bkamins commented on June 5, 2024

At present the default method for levels ends up sorting the pool.

This is what the DataAPI.levels API requires:

Values are returned in the preferred order for the collection, with the result of sort as a default.


As for PooledArray we could write collect(skipmissing(DataAPI.refpool(pa))) instead which should be faster. The question is what meaning levels for PooledArray should have, i.e. in what situations someone would use levels on a PooledArray (I do not think it is useful for this type).

from pooledarrays.jl.

dmbates avatar dmbates commented on June 5, 2024

At present the default method for levels ends up sorting the pool.

This is what the DataAPI.levels API requires:

Values are returned in the preferred order for the collection, with the result of sort as a default.

I would say that for a PooledArray the order in the refpool is the preferred order.

My understanding of the API is that refpool can return nothing if the argument is not a factor-like object such as PooledArray, CategoricalArray, Arrow.DictEncoded, etc. But levels should always return a set of potential levels for the argument. If the result of refpool is nothing then the levels are determined by a sorting operation that eliminates the duplicates.

So for factor-like arrays I would want the result of levels to be the same as the result of refpool without any missing.

As for PooledArray we could write collect(skipmissing(DataAPI.refpool(pa))) instead which should be faster. The question is what meaning levels for PooledArray should have, i.e. in what situations someone would use levels on a PooledArray (I do not think it is useful for this type).

I am happy to create a PR with that implementation.

Perhaps I could explain why I want to add this definition. In the StatsModels package when contrasts like EffectsCoding are applied to a term in a formula the order of the levels determines the interpretation of the coefficients. The levels can be specified in the call to the coding specification but they default to the value of DataAPI.levels.

An Arrow representation of a DictEncoded type can be flagged as ordered. In that case, the order should be preserved all the way to constructing contrasts for a model. Right now, copying an Arrow.DictEncoded object creates a PooledArray which is why I want to ensure that levels applied to a PooledArray returns the refpool modulo removal of any missing values.

from pooledarrays.jl.

bkamins avatar bkamins commented on June 5, 2024

if the argument is not a factor-like object such as PooledArray

I would say that PooledArray is not a factor-like object. The fact that it returns refpool is only performance and storage related. From user's perspective PooledArray and Vector should be indistinguishable.

I am not sure how it is the case for Arrow.DictEncoded. For sure CategoricalArray is factor-like.

The case of Arrow.DictEncoded is that DataAPI.refpool can contain duplicates AFAICT, as opposed to PooledArray and CategoricalArray. The question then is if levels for Arrow.DictEncoded should return duplicates. I think not as the DataAPI.levels contract is:

a vector of unique values which occur or could occur in collection x

and it explicitly requires uniqueness

Right now, copying an Arrow.DictEncoded object creates a PooledArray which is why I want to ensure that levels applied to a PooledArray returns the refpool modulo removal of any missing values.

@nalimilan - so we have a challenge here. Ideally a CategoricalArray should be created instead of PooledArray (so that in particular the ordered flag is retained). I understand that the challenge is that Arrow.DictEncoded supports a wider array of types than CategoricalArray does - is this correct?

If we do not resolve the issue cleanly then considering PooledArray to be a factor-like container is a second best option (but first I would prefer to investigate if we can use CategoricalArray instead as conceptually PooledArray is just a vector without any notion of being a factor).

@dmbates - note that currently even different PooledArrays share the same DataAPI.refpool for efficiency (so really the fact that there is a DataAPI.refpool for PooledArray should be considered an implementation detail). Here is an example:

julia> using PooledArrays, DataAPI

julia> x = PooledArray(1:5);

julia> y = x[1:3];

julia> z = copy(x);

julia> DataAPI.refpool(x) === DataAPI.refpool(y) === DataAPI.refpool(z)
true

from pooledarrays.jl.

palday avatar palday commented on June 5, 2024

I would be tempted to define DataAPI.levels(::PooledArray) to just be a very efficient version of unique(::PooledArray) (or if there is already specialization of unique to just set const DataAPI.levels(::PooledArray) = unique(::PooledArray)). This would be useful in many contexts and at the same time provide the functionality that @dmbates wants.

from pooledarrays.jl.

bkamins avatar bkamins commented on June 5, 2024

Two things:

  1. we would still need to drop missing (this is easy)
  2. If I understand things correctly this is not what @dmbates wants as he needs levels to inherit the order from source Arrow.DictEncoded that can be arbitrary and even contain values not present it the array

from pooledarrays.jl.

nalimilan avatar nalimilan commented on June 5, 2024

I agree with @bkamins that levels(x::Array) should return the same thing as levels(PooledArray(x)). This can indeed be made faster than the fallback method by calling sort!(unique(x)), but that's an implementation detail.

(I don't know what's most appropriate for Arrow.DictEncoded.)

from pooledarrays.jl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.