Giter Club home page Giter Club logo

Comments (66)

nalimilan avatar nalimilan commented on June 12, 2024

Are you aware of #50 (comment)?

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

Thanks for laying out these ideas, Simon.

I'm not clear how we handle per-variable orderings with the proposal described in (1). If I do pda[1] < pda[2], I want < to be a function defined uniquely for that specific ordinal variable. So every instance of OrdinalVariable would need to link back to its source PDA or every instance of OrdinalVariable would need to be a custom type.

How are you thinking of handling those kinds of issue?

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

I was thinking that for (1) and (2), OrdinalVariable object would look like:

immutable OrdinalVariable{T<:Integer,S}
    index::T
    pool::Vector{S} # Or maybe a wrapper over the pool?
end

Then we'd have:

function Base.(:<)(x::OrdinalVariable, y::OrdinalVariable)
    x.pool === y.pool || error("OrdinalVariables are not comparable")
    x.index < y.index
end

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

Everything you've said, @simonster, seems completely reasonable. I agree that your first proposal would be the most straightforward way to move forward.

But I can't yet shake my sense that making everything nominal/ordinal variable into its own type is the simplest way to do things when you want to work with categorical data without using DataArrays. (This, of course, assumes that it's actually reasonable to care about ordinal factor without also caring about missingness. I think it is, but may get convinced it's not worth the effort.)

Rather than parametrize each runtime generated type with extra information, I was thinking that we would just define functions like: order(T::MyNewType). As you've already said, this function definition won't get garbage collected. For ordinal factors with >100,000 levels, this could be a problem. But I don't know how common that use case comes up in practice. When I work with categorical data, I'm typically working with a factor that has a few hundred levels at most. In that case, I'm not too worried about garbage-collection.

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

Both (3) and (4) have the problem that the pool itself is getting leaked, but dynamically creating types has the additional problem that every function gets inferred/compiled for each type. The performance and memory characteristics depend on the number of types * the number of different functions called in addition to the size of the data. See the quick benchmark I made here. With the n_data = 1000 and n_vars = 100, the timings look like (in seconds):

option 1: 0.007096643
option 2: 0.000856045
option 3: 0.000178305
option 3.5: 7.3403e-5
option 4: 1.166516775

With n_data = 50_000_000 and n_vars = 1:

option 1: 0.755386073
option 2: 1.2290704
option 3: 0.16911997
option 3.5: 0.034059569
option 4: 0.034355416

So for some but not all cases, dynamically generating types can be far slower, since reinterpret, test_access, and getindex must be compiled for each possible type. I am not sure whether 100 enums is a good reflection of a real data set, since we would probably want to use one type for all enums with the same pool which might only mean a couple of enum types for typical cases, but this is something we should think about.

The option that I didn't consider in my original post, which I call option 3.5 above, is using an enum ID as in (3) in conjunction with a special array type as in (1) to get efficient storage but also avoid a GC root when accessing the values. If the ID is a Uint32, this easily beats all options but (4) for any choice of n_data and n_vars, beats (4) up to n_data = 50_000_000 with n_vars = 1, and is asymptotically about 1/3 slower. Moreover, unlike option (4), there is no need to dynamically generate types and so there is no overhead for compiling functions for each different pool. If we don't care about garbage collecting the pool, then this is an option to consider.

As far as categorical data without missingness with a PooledDataArray-like type, one possibility is to have a separate PooledArray type that shares most of the implementation but defines different getindex and setindex methods that don't accept NA.

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

Those benchmarks aren't entirely fair, since some have to deal with garbage produced by preceding benchmarks and some benchmarks include copy overhead while others don't. The below timings should be fairer. With n_data = 1000; nvars = 100:

option 1: 0.000691289
option 2: 0.000917854
option 3: 0.000205124
option 3.5: 8.7721e-5
option 4: 1.307516131

With ndata = 50_000_000; nvars = 1:

option 1: 0.547448457
option 2: 1.013021837
option 3: 0.173764345
option 3.5: 0.055936094
option 4: 0.056068792

My feeling is that option 3.5 (storing raw values and the enum ID in a PooledDataArray type, and maintaining a global array of enum pools) is the way to go. Options 1 and 2 incur substantial garbage collection-related overhead for large data sets. With option 4, performance drops with the number of enums and the number of functions you plan to call on them, since each function needs to be compiled for each enum type. With option 3.5, performance depends only on the amount of data, and even if the user stores the enums in regular vectors instead of a special structure, it is not hugely inefficient (option 3 is ~3x slower).

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

I'm not very comfortable with the idea of not being able to clear the pool from memory. Even if in most cases it shouldn't be a problem, in some cases the number of levels might be relatively high. When loading data produced by someone else (e.g. survey data), there may be many variables, some of which with many levels, and if they are marked as categorical they will waste memory which cannot be freed without restarting Julia (I need to do this with R quite often and it's painful).

Let's dream a little. In theory, it seems that the best solution would have the following properties:

  • the pool wouldn't be duplicated in each vector or element, but instead stored either via the type system (enum) or in a global pool
  • the enum/pool would be garbage-collected when no longer used anywhere
  • there would be no specialization of functions for each variable/enum, as all share a common structure and compiling several times the function is a waste of time and memory
  • indexing would be efficient (ideally, no need to create an object)
  • construction of the vector would be efficient, but that's secondary as you index a vector much more often than you create it.

Using enums and thus letting the type system handle the global pool sounds like a good solution, which would allow writing simpler code. Also, the type system is (or could be made) smart enough to avoid compiling two specialized functions for two enums which are actually Uint32 under the hood. Maybe there's a chance Julia could allow specifying that some types/enums should be garbage-collected if no object uses them anymore? Could some kind of reference counting be implemented? In any case, asking main developers wouldn't be a bad idea.

@simonster Regarding the benchmarks, I think you shouldn't build the vectors in the loops, as that's a more less common operation than indexing. Does the code in the gist correspond to your last comment?

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

To make the Julia avoid compiling two specialized functions for two enums with the same fields, you need 1) abstract types with fields (JuliaLang/julia#4935) so that the compiler can optimize field access and 2) some way to specify that the compiler should not specialize on the subtypes (potentially related to "direct call methods specialized for non-leaf types" from JuliaLang/julia#3440, although in this case we'd have to be able to specify that on the type and not on the method). This is a little "featurey." There is also JuliaLang/julia#3080, although in that discussion it seems like people would like to be able to specialize/dispatch on the enum, which we probably do not want.

GC of types poses its own problems. Objects cannot simply hold a reference to the type without the same kind of overhead as in option 2 (which makes GC take 0.5 seconds with 50,000,000 values). I don't know if reference counting would really be much better, since we would still have to increment the reference count every time a value is retrieved from the array and decrement it when the value goes out of scope. This idea seems really featurey, since it isn't really intended that types are dynamically generated in the course of executing code in the first place.

In general, hacking this onto the type system continues to seem kind of wrong to me. The main purpose of the type system is dispatch and specialization. We don't want those things, just efficient storage.

Maybe @JeffBezanson or @StefanKarpinski have some thoughts here?

Here are the benchmarks for just array accesses. While option 2 looks better here, it is not really an option because, as noted above, with 50,000,000 data points and thus 50,000,000 references to the pool, it makes every GC take half a second. (This is not a problem with the other approaches; option 1 holds only one reference to the pool and the others hold no references.) Also the benchmarks for option 4 would obviously be worse if I were calling more than two functions that need to be specialized for each type. With n_data = 1000; nvars = 100:

option 1: 0.0007115400000000001
option 2: 0.00013821499999999994
option 3: 0.00011090599999999998
option 3.5: 0.0001345170000000001
option 4: 0.2781348129999999

With ndata = 50_000_000; nvars = 1:

option 1: 0.541569776
option 2: 0.050610905
option 3: 0.023341166
option 3.5: 0.035011226
option 4: 0.026047237

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

Maybe using types is not the best solution after all.

Anyway the big issue with all solutions is how to keep a global pool which is not attached to any individual object, and yet could be garbage-collected when no longer needed. Indeed, for that you (or Julia) need to keep track of all objects which are using the pool, either by adapting the reference count when creating/deleting objects (and thus when indexing), or go over all existing objects at gc time.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

Let me add another alternative to be sure we have weighted all the advantages and drawbacks of using types to handle this.

@johnmyleswhite's comment #73 (comment) seems to point to a fifth solution, which could be more efficient (even though the garbage collection issue isn't fixed): create one type for each variable, which would only wrap a Uint32, have a order(T::MyNewType) function to get the levels, and make PDAs a simple array of these objects. The big win here is that when indexing you don't need to create objects on the fly, and you give the compiler all it needs to know to optimize operations. The handling of levels is moved to the type system.

Obviously all the issues about unwanted specialization and garbage collection do not go away in this scenario.

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

I'm not sure if/how what you're suggesting differs from option 4 as implemented above. (I didn't implement the order method, but that's because I wasn't benchmarking it.)

One compromise between garbage collection and performance would be a variant of 3.5 where we maintain a strong reference to the pool in each PooledDataArray and a weak global reference. Then we'd avoid the overhead of maintaining references in each enum and the pool would get garbage collected after every PooledDataArray that uses it gets garbage collected. For most use cases I think this would work well, but it's certainly possible to write code where the PooledDataArray falls out of scope before the enum, which could be surprising, and resulting bugs would be nondeterministic, which is not great.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

Yeah, that's actually option 4), but not parametrized on a tuple of levels.

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

@johnmyleswhite, can I bother you to get your current thinking on this?

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

Yeah, let me think about it again and get back to you. I'd really like to hear @StefanKarpinski, @JeffBezanson or @ViralBShah's opinions before we make a decision. In retrospect, I think a lot of the design of DataArrays/DataFrames was too close to R's idioms and not optimized for Julia. I'd like our decision here to be more appropriate for what Julia is going to offer when 1.0 comes out.

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

Coming back to this with fresh eyes, Option 3.5 does indeed seem best. Could we expose a mechanism for intentionally freeing the pool? This would only be used by performance buffs. Even though it's a little ugly, it does seem like the best solution.

I'm going to e-mail Jeff, Viral and Stefan offline to see if I can get them to chime in.

Once we resolve this, I'd like to also debate the remaining design issues for PDA's:

  • How do we represent the ordering? Presumably just by labeling each element of the pool with a corresponding integer.
  • How will unique/levels behave? This is #29.
  • Can the pool contain levels that aren't present in the data?
  • What operations on the pool are supported? And what's the interface like?

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

I haven't read all of the commentary on this issue (there's a lot of comments here and in the linked discussions), but I think that part of the confusion here comes from conflation of semantics and implementation. There are data that you want to represent in a DataFrame that are nominal or ordinal and you want them to behave as such. As John observed, the nominal/ordinal nature of a variable must be linked to the element type, not a property of the container – otherwise when you take the elements out of the container, they have different behavior, which is bad. The solution there seems pretty obvious: use appropriate ordinal and nominal types. This isn't really related to the data frame abstraction at all. It makes sense to have types for these kinds of things, possibly something like an enum (although the difference between ordinal and nominal, not to mention flags indicates to me that maybe we need a slightly finer gradation).

Then there's also the issue of efficient storage and operation – which should not be confused with behavior. The fact that nominal and ordinal values can often be stored and operator on particularly efficiently might be the red herring here. I'm unconvinced that they should be related. Shouldn't this be like building an index for a data frame column? The element type is still the same and it doesn't affect the behavior of its values, but it provides a different, possibly more efficient implementation?

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

In other words, I feel like this entire discussion assumes implicitly that nominal/ordinal values and value pooling should be linked – after all, it's often sensible to use a value pool when you have nominal/ordinal values and vice versa. But that's not necessarily the case. You could, for example, have a BigInt column that is truly integral, but because BigInts are huge, you may want to pool their storage. That data is pooled (implementation), but numerical. You might also have nominal data that you don't want to pool because the values are small and easier to simply store inline.

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

This is basically the structure we have for PooledDataArrays right now, but a switch to enums would suggests two pools: one for the enums that contains its levels, and one for the PooledDataArrays that contains enums. In some sense this is more complicated, and I'm not convinced that the use case of storing BigInts/BigFloats in a PooledDataArray is a particularly strong one, but from a performance standpoint it might work better than the approaches I've discussed above, since we can just get the enum out of the pool instead of creating a new object on getindex, and we only need as many enums as there are levels in the PooledDataArray, so we don't have to worry about generating huge numbers of garbage-collected objects. I will try this and see how it performs.

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

If you're using an enum, why do you need the pool? The only cases where I can see needing the pool is a case where you have a lot of strings or something and you want to reduce storage or do faster indexing. What kind of data really benefits from pooled storage?

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

How do we store our enums in an ordinary array? Either we potentially have many thousands of objects that each hold a reference to the enum pool, which makes GC take forever, or we have to make the enum pools un-garbage-collectable. There is an additional concern of storage efficiency, since each enum needs to hold some kind of (redundant) pool identifier in addition to the level, but I don't think that's a big deal.

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

I don't know what you're thinking about when you say "enum" but I'm thinking about a thin immutable types that wraps a Cint. It doesn't get any more efficient than that and there's no GC to do. Why do enums need a pool at all?

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

Let's say you have a set of responses on a Likert scale, where each response is one of "strongly disagree," "disagree," "neutral," "agree," and "strongly agree." Those five levels are what I'm calling the enum pool, but perhaps that is a confusing name for it. We need a way to go from a given value back to the level to which it refers, or else you can't e.g. select all rows of a DataFrame where the response is "strongly agree."

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

I feel like this should be the same as the proposed @enum functionality in base, no? The representation is just an int. The really bare bones implementation is here. The main issue with that is that it doesn't print nicely. That could be fixed by either generating a show method that switches on the enum value and prints its name instead of enum(n).

from dataarrays.jl.

kmsquire avatar kmsquire commented on June 12, 2024

See JuliaLang/julia#2988 for a possible implementation.

The title is misleading, since the PR includes a non-bitmask version of @enum that's more complete. A version of just that functionality is used by Logging.jl here.

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

This has the same problem as the other type-based approaches: functions get specialized for each enum type, which means that if you have 100 different categorical columns you are compiling a lot of functions 100 times. Also both the functions and the lists of levels stay around in memory forever. There are major differences between what we want for doing stats and what @enum is meant for: we may have hundreds of different enum types, the levels are only known at runtime, and we don't want to specialize functions on individual enum types.

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

Ok, these are good examples. Why not use strings or symbols? Do you have examples where pooling the storage is essential?

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

Pooling the storage is essential in most uses; in the example of Likert scales, you don't want to repeat one million times "strongly disagree," "disagree," "neutral," "agree," and "strongly agree" as strings, as it would be terribly inefficient. And you don't want to store integers directly because it would be terribly confusing.

We seem to need a specific type of enum, which would not trigger function specialization, and which would be garbage collected when no longer used.

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

One option would be just autopool strings behind the user's back. This is an option because Julia treats strings as immutable, so any two strings with the same value are just as good as each other. The strings still just act as strings but you'd avoid all the storage overhead of repeating the strings over and over again.

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

By "autopool" I mean that when constructing a data frame, make all instances of equal strings point to a single canonical instance of that string. This can be done using a Set, but you can throw out the Set object once the data frame is constructed. This code shows how you can have distinct string objects that are equal that either share or don't share the same data array:

julia> a = "foo"
"foo"

julia> b = "foo"
"foo"

julia> c = ASCIIString(a.data)
"foo"

julia> a == b == c
true

julia> a === b
false

julia> a === c
true

The idea here is that you construct data frames so that all equal strings also share the same data array. After that, you can test for equality by checking identity, which is also much more efficient. Of course, if you're going to build algorithms that assume that testing string identity is sufficient to check string equality, then you need to really ensure that this is the case or you'll get very confusing behavior.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

@StefanKarpinski Actually the current implementation of PooledDataArrays solves this use case quite well. But there's a problem with how to handle ordinal data, i.e. how we could compare using < two values extracted from vectors. See #50 (comment) and discussion above.

Also, in some cases it helps to know in advance the pool of values the vector can take so that you can allocate the required number of slots (frequency table, model matrix), in particular when the data may be coming from a stream.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

Thinking more about it, maybe there should be different implementations for nominal and ordinal variables. Nominal variables are easier to handle and they work relatively well currently with PDAs; they can be stored as integers, and extracted as strings or symbols; same for testing equality. Ordinal variables are the only case where the value itself is not enough and you need access to the ordered pool; so we may create enums only for ordinal variables, which are likely to be 1) using a small pool, 2) less common, and 3) opt-in. By opt-in I mean that when loading a data set with many variables you won't get loads of enums with tens of hundreds of levels created automatically which will use memory forever.

This doesn't solve the method specialization problem, though.

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

I think the approach of just uniquifying strings on data frame construction is much better than PDAs because it's completely transparent to the user – you can just do it (or not do it) and it looks and feels exactly the same – literally all it does is save you storage. I agree that perhaps enums and ordinals can coincide and have their own type. Regarding code generation, this is a broader problem that @JeffBezanson and I have discussed before – Julia generates a lot of code in general, and we need to start figuring out situations where the same generated code can be used for nominally different types. For example, the proposed enums which are just immutable wrappers around a Cint are really all the same for the purposes of most functions. They print differently, but that's about it. So we should maybe not make this basic decision based on properties of code generation that are likely to change in the relatively near future.

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

One clever thing you could do to help decide whether you need to start pooling a column is to use a bloom filter initially and only start pooling if there are collisions in the bloom filter. You might get false positives, leading to pooling when you shouldn't, but you can control the probability of that easily and in cases where the data values are unique, you would never pay the cost of building a hash table to try to uniquify the string – although you would pay the cost of building and checking the bloom filter.

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

I'm not aware of any way to test for collisions in a Bloom filter. Is there an algorithm for doing so?

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

That's what a bloom filter does: you can see if something definitely hasn't been seen before or if it has probably been seen before – with some probability of a false positive. Thus, if you're looking over a column of strings with no repeats, you will probably see no collisions, with tunable probability.

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

You're totally right: for no good reason, I was thinking that you meant to test for collisions in the filter itself between two items with the same signature.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

Oh, there's another feature of PDAs which is useful even for nominal variables: the ability to save an order for the levels to use when showing tables, graphics, etc. It looks like a detail, but that's really essential. So we need to store a pool together with arrays of nominal values. Anyway, the question of whether strings are stored as strings or as integers is an implementation detail as long as the interface makes it behave like an array of strings. What's hard is getting the correct interface.

Good to know that in the future Julia may stop specializing methods on an enum type. This is indeed something I hoped for.

from dataarrays.jl.

StefanKarpinski avatar StefanKarpinski commented on June 12, 2024

How about just having a single Ordinal type that stores both an order and a label? This basically:

immutable Ordinal
   order::Int
   label::UTF8String 
end

They compare by (order,label) but just display as label. When you convert a string column into these, you'll need to give an ordering of string values but you can then convert a column of strings into these. The strings can, at the same time, be uniqued so that all equal strings share storage. It's not as efficient as having an index into a pool, but it's a single global type.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

@StefanKarpinski Actually we also need a way to check that two Ordinal objects correspond to the same variable, so that you cannot accidentally compare apples and oranges and get a result. That's why the idea of an enum of a specific type makes sense; and once you are using a specific enum type for each variable, the pool doesn't need to be stored in each object anymore, it can be retrieved using a method.

from dataarrays.jl.

garborg avatar garborg commented on June 12, 2024

@nalimilan Same scale, rather than same variable?

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

Can we just extend Stefan's type with an index of which variable is being worked with:

immutable Ordinal
   var::Int
   order::Int
   label::UTF8String 
end

I'm still trying to think about all the cases in which that approach might be tricky: things like trying to create a new scalar object of an ordinal scale stored in some DataFrame could be a little difficult to get right.

from dataarrays.jl.

garborg avatar garborg commented on June 12, 2024

Tying the instance to a variable seems awkward even for use within DataFrames: e.g. if variables before and after share scale [1=>"low", 2=>"medium", 3=>"high"], I'd expect df[:before] .< df[:after] to work (and ideally df[:age] .< df[:after], where age's scale is [1=>"<25", 2=>"25-50", 3=>">50"] , would throw).

from dataarrays.jl.

garborg avatar garborg commented on June 12, 2024

I missed that Ordinal would compare by (order, label). I still think that for a deeper check of whether a comparison is appropriate, it would need to link to a broader scale rather than a column of DataFrame.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

That's why using an enum type with a method giving access to the whole pool would be useful. It would allow checking whether the pools of two different variables are the same (levels and order). I think the enum approach would also be more efficient since you only need to carry an Int (potentially a Uint32 or even shorter), instead of creating an object with three fields (including a 64-bit pointer to the string).

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

After implementing some ideas for working with missing scalar values in the NullableTypes package, I decided to go ahead and implement Simon's 1st proposal at the start of this issue so that we could have some way of working with scalar values of categorical datatypes. This reflects my concession that getting an Enum type to work isn't necessary.

Based on Simon's proposal, I've built up a potential backend for working with scalar values that are instances of either categorical and ordinal variables. You can check out what I've done at https://github.com/johnmyleswhite/CategoricalData.jl.

Basically, I've:

  • Replicated some of the PooledDataArray backend functionality in a way that's not inherently linked to missingness or to any specific array storage format. This makes it easier to talk about categorical variables without any references to an underlying PDA object. As a consequence, it's a little bit easier to talk about scalar objects that are categorical variables.
  • Added some basic tools for manipulating the pools from which values are drawn with methods like levels, levels!, order and order!.
  • Added types that represent scalar objects that implement either a single value of a categorical variable or a single value of an ordinal variable.
  • Defined ordering operations on ordinal variables so that scalars can be compared based on their rank order. These operations also raise a meaningful error message when you attempt to compare the order of categorical factors.

One thing that I've done that's not strictly necessary, but I would argue is helpful, is to remove any notion of missingness from the underlying representation of categorical data. My idea is that we'd replace the current PooledDataArray type with two new types, CategoricalArray and OrdinalArray, that build on top of the functionality provided in CategoricalData.jl. Those new types would handle the problem of missingness for arrays.

At the scalar level, we'd kick things back to NullableTypes.jl. Thus, indexing a scalar element from a CategoricalArray or a OrdinalArray would produce either an object of type Nullable{CategoricalVariable} or Nullable{OrdinalVariable}. That added bit of indirection ensures that the interface for these objects would be consistent with the interface for Nullable{T} for any other type.

from dataarrays.jl.

garborg avatar garborg commented on June 12, 2024

I like the CategoricalArray without missingness -- I've thought about that when dealing with PDAs.

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

Sorry, I was a little unclear. My idea was that CategoricalArray would support missingness just as PDA does now (by using a pointer to an invalid index like 0), but that CategoricalPool would not support a missing value as a level. (Unless, of course, the missing value were explicit like "Did not answer question.")

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

What's the use case for allowing users to say that CategoricalArray not permit NULL? To get something like the non-NULL constraint we're allowing now by inserting an Array into a DataFrame?

from dataarrays.jl.

garborg avatar garborg commented on June 12, 2024

It would definitely be good for that.

I suppose the real goal is, that however people end up representing arrays of categorical data without missing values elsewhere (perhaps an Enum type in Base), the API for the categorical version of DataArrays should mirror the Enum-like type as DataArrays mirrors Arrays (working with either is near identical except for added code/computational complexity to deal with missingness when working with DataArrays), and, for example, a function that builds a ModelMatrix should be be happy to accept either type.

That seems like a consideration to push aside for now in favor of easier experimentation. But I really appreciate the work to make DataArrays and DataFrames consistent with Base -- I'm the kind of person who forgets a library's names, syntax quirks, etc., after a very short time away.

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

I guess the question for me is how Enum relates to CategoricalVariable. OrdinalVariable is, in my mind, clearly distinct from Enum because of the support for a custom order.

So the question is whether CategoricalVariable is redundant in the face of Enum.

I think the answer is probably no, but I'm not super sure.

Here's what I feel comfortable saying:

  • People are going to want to redefine the number of levels that a CategoricalVariable can take on. For example, given a CategoricalVariable with levels like "Far Left", "Left", "Moderate", "Right" and "Far Right", it's very natural to want to recode this variable as "Left, "Moderate" and "Right". As I understand it, this kind of recoding would introduce type-instability if the number of levels were a property of the Enum's type. (Saying this, I realize that I haven't written any code that provides a nice interface for recoding variables. I think that needs to happen at the CategoricalArray level to be really smooth.)
  • It's important that a CategoricalVariable that wraps T not behave like T by default. It should, for example, not be possible to add two CategoricalVariable{Int} objects together. I think Enum might do this, but definitely CategoricalVariable does. Not a strong point, but maybe important.

from dataarrays.jl.

garborg avatar garborg commented on June 12, 2024

Very good points in the bullets. I remember seeing in discussion of Enums for Base that order may be handled there, so it's possible that if CategoricalVariable was redundant OrdinalVariable would be, too, but I think you're right that neither will be redundant.

If neither type is redundant, it seems increasingly desirable to for both to have mirror types that don't take NA, but I'm not ready to take a strong stance one way or the other there.

from dataarrays.jl.

simonster avatar simonster commented on June 12, 2024

I think we actually want the CategoricalPool to hold an array of CategoricalVariables so that, instead of constructing a new CategoricalVariable on every access to a CategoricalArray, we would can just pull the CategoricalVariable out of the CategoricalPool. The reason is that a CategoricalVariable is a non-pointerfree immutable (because it holds a reference to the CategoricalPool, which holds an array), and it appears that Julia presently allocates these on the heap, i.e., they have performance characteristics more like regular objects than pointerfree immutable objects. To avoid allocation on every getindex call, we could just return a CategoricalVariable that's already constructed instead of constructing a new one. The only problem is that then the definitions of CategoricalVariable and CategoricalPool are mutually circular and you'd need to use one of the workarounds from JuliaLang/julia#269 to specify them.

I still think that Enums of the type people want in Base, which are primarily for controlling what code gets executed, are different from CategoricalVariables, which are primarily for holding data, and it's important we not conflate these two.

from dataarrays.jl.

garborg avatar garborg commented on June 12, 2024

I kind of regret referring to Enums now without knowing more about what
they'll look like -- what I really wanted to say is that it seems
reasonable to me that someone dealing with categorical data that didn't
involve missingness would prefer to work with a type that doesn't take NA,
and a shared API (apart from the some functions dealing with missingness)
would be nice. (Not having played with this package, I can't really say yet
whether I think it would work or be worth it.)

On Sun, Aug 17, 2014 at 3:04 PM, Simon Kornblith [email protected]
wrote:

I think we actually want the CategoricalPool to hold an array of
CategoricalVariables so that, instead of constructing a new
CategoricalVariable on every access to a CategoricalArray, we would can
just pull the CategoricalVariable out of the CategoricalPool. The reason is
that a CategoricalVariable is a non-pointerfree immutable (because it holds
a reference to the CategoricalPool, which holds an array), and it appears
that Julia presently allocates these on the heap, i.e., they have
performance characteristics more like regular objects than pointerfree
immutable objects. To avoid allocation on every getindex call, we could
just return a CategoricalVariable that's already constructed instead of
constructing a new one. The only problem is that then the definitions of
CategoricalVariable and CategoricalPool are mutually circular and you'd
need to use one of the workarounds from JuliaLang/julia#269
JuliaLang/julia#269 to specify them.

I still think that Enums of the type people want in Base, which are
primarily for controlling what code gets executed, are different from
CategoricalVariables, which are primarily for holding data, and it's
important we not conflate these two.


Reply to this email directly or view it on GitHub
#73 (comment)
.

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

Ok. It sounds like we're all in agreement that Enum's aren't quite what we want. That makes me feel better about having implemented this stuff. :)

@simonster, your note about being pointer-free is well-taken. I'll try to cycle back on this soon and experiment with an alternative implementation that includes a small set of constant CategoricalVariables in a pool that can returned to users. There's also a question of whether the CategoricalPool should be immutable or not. The levels! function would be a lot simpler if the pool were mutable.

@garborg, the package I wrote doesn't provide much usable functionality: it's more a demonstration of a potential backend than a drop-in replacement for PooledDataArray's. I can demo out an implementation of CategoricalArray's and OrdinalArray's to give a feel how these things would work out in practice.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

@johnmyleswhite I don't get why OrdinalPool() takes two arguments. Wouldn't it be enough to take the levels in the order they appear in the first argument?

I also have a small remark regarding the naming: CategoricalVariable evokes an array to me; maybe it would be better to call it CategoricalValue instead?

Regarding enums, I agree with @garborg that we don't know yet how they will work, so it's hard to tell whether they would make CategoricalVariable redundant or not.

from dataarrays.jl.

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

@nalimilan You're right: there's no reason to have the two-vector constructor for OrdinalPool. I noticed that in my last pass through the code, but was too lazy to fix it.

CategoricalValue works for me.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

With a Ph.D. and a permanent position to start soon, I've finally found the time to take up @johnmyleswhite's work on CategoricalData.jl, and bring it to a point where I think it can constitute a replacement for PooledDataArrays: https://github.com/nalimilan/CategoricalArrays.jl

The features are essentially the same (though with additional support for ordered levels), and the API is very similar, except that instead of suffering from the NA type instability, it comes in two variants: one which does not support missing values, and one which returns Nullable objects. It should thus be a good companion to NullableArrays, but for categorical data.

Please have a look at the README (and if possible at the implementation) and tell me what you think. If you agree, I'd like to move this to JuliaStats soon, and open a branch in DataFrames which no longer uses DataArrays at all.

Cc possibly interested people: @andreasnoack @davidagold @tshort @bkamins

from dataarrays.jl.

andreasnoack avatar andreasnoack commented on June 12, 2024

Looks great. A few comments

  • Carrying around a Pool reference in the values will probably be quite costly but I guess that most of the computationally intensive operations will take place in functions that take an Ordinal/NominalArray as input and, therefore, computations on Ordinal/NominalValues would be relatively rare. You have probably been using these types so how important/useful to have the Pool referenced in the values? You could think of the Pool as a property of the collection only, i.e. when extracting a value then you "loose" the categorical information. I've asked earlier but I don't think I got a reply.
  • Have you thought of having a fixed number levels in the pool such that you could use a tuple instead of a vector for storage?

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

Carrying around a Pool reference in the values will probably be quite costly but I guess that most of the computationally intensive operations will take place in functions that take an Ordinal/NominalArray as input and, therefore, computations on Ordinal/NominalValues would be relatively rare. You have probably been using these types so how important/useful to have the Pool referenced in the values? You could think of the Pool as a property of the collection only, i.e. when extracting a value then you "loose" the categorical information. I've asked earlier but I don't think I got a reply.

I haven't done systematic benchmarks yet. You're right that the most common use case is probably to pass a whole array, not individual values.

The reference to the pool may not be required after all. As I see it, it can be useful as it would allow comparing values from two distinct pools which are !== but are == (by testing that levels and order are equal). But that comparison would be dead slow anyway WRT just testing for equality of two pointers and of two integers. So a better solution could be to store the hash of the levels array, which would allow making NominalValue/OrdinalValue pointer-free immutables. That sounds so nice that I wonder whether I have forgotten a reason why we didn't choose that solution.

Have you thought of having a fixed number levels in the pool such that you could use a tuple instead of a vector for storage?

I don't think that would be very practical, as it would introduce type instability in a lot of places. It's already quite annoying to change the integer type used to store references based on the size of the pool (which isn't done automatically for this reason). Also, do you think it would bring a significant speed improvement? Accessing an array by integer indexing is fast too.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

I remember the problem now: storing (a reference to) the pool is the only way to provide both very fast equality checks between values from the same pool (the most common scenario), while still supporting comparisons between different pools. In the first case (same pool), you can simply check that pools are ===, and then compare the integer indices. In the second case, you can extract the actual levels from the pool and compare them (which is slower for e.g. strings).

If we don't want to keep a reference to the pool, we need to store the actual level, which in the common case will be a string: AFAIK this wouldn't be really more efficient than keeping a reference to the pool, would it? Except if we could find a way to store strings inline, e.g. as a tuple of chars, as an implementation detail.

(While it's currently and error in the package, I think the ability to compare values from different pools is important, as when importing data arrays may have the same set of levels but not the same pool objects, and requiring the user to merge them would be painful.)

Forgot to CC @quinnj.

from dataarrays.jl.

quinnj avatar quinnj commented on June 12, 2024

Thanks for the CC @nalimilan; I'll look more into your fork of CategoricalArrays.

I've read through a good number of the comments here, but I'd like to add one more point on @simonster 's original point 4, which would make a type definition like

# `O` = whether the Category is ordered or not
# `I` internal storage type
# `T` = tuple of symbols representing Category level values
immutable Category{O,I,T}
    value::I
end

I dislike throwing out this option simply because of the current state of code generation in Julia (i.e. over-specializing every method that operates on Category). IMO, this is the simplest, most straightforward implementation and I'd hate to see us get additional code despecialization capabilities in 0.6 and then have an entire package or two become overly complicated. It'd be nice to be able to do something like:

immutable Category{O,I,T::ANY}
    value::I
end

that would help the compiler know that it shouldn't ever specialize methods on the T parameter.

I'd much rather write up an implementation based on this approach and then provide enough whiskey to @JeffBezanson, @vtjnash, and @carnaval during JuliaCon to add the ability to despecialize a type parameter 😄

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

The downside with that approach is that any code modifying the levels would become type unstable if levels are not know statically. I'm also not sure what's the practical advantage: why would statically knowing the possible levels make anything faster?

from dataarrays.jl.

quinnj avatar quinnj commented on June 12, 2024

Is it really a common use case to need to modify an existing Category's levels? I can see a case where you "build up" the levels first, outside the type, but in my mind, once you have them, you then build your Category type and you're done.

The practical advantage of a definition like Category is that we don't have to define yet-another-type-of-array like NominalArray/OrdinalArray. I could just have a NullableArray{Category, 1} and everything works fine. That seems like a huge win based on what I've seen from NullableArrays/DataArrays and the amount of machinery that needs to be defined to get your own Array type working.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

It's quite frequent in my experience to change levels e.g. because you take a subset of the dataset and you want to drop unused levels, or even more frequently to recode data. If the levels are part of the type definition, you need to pass them in advance before assigning them based on various conditions (which violates the DRY principle), and setindex! cannot even create a new level (which is one of the annoying things with R's factors).

The machinery required to implement a custom type is significant, but that's not the end of the world either. CategoricalArrays.jl is mostly functional now, and it's only ~600 lines of code under src/ (+ 2000 for tests...). It's simpler than PDAs since it doesn't need to implement arbitrary functions like arithmetic operators (since the data is categorical). And several constructors for the internal pool aren't actually used and could be removed.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

See JuliaData/DataFrames.jl#994 regarding port of DataFrames to NullableArrays and CategoricaArrays.

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

Closing now that DataFrames has been ported (JuliaData/DataFrames.jl#1008).

from dataarrays.jl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.