
Comments (28)

johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

What this proposal doesn't mention, because I want to debate it in a separate issue, is the idea that we should define factor(x) in a model formula in terms of the output of unique(x). For PDA's, this is cached. For DA's, this is calculated at model matrix construction time. For streaming data, the unique values must be specified upfront.
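The distinction could be sketched as follows (a hedged illustration only; `levels_of` and `CachedPool` are invented names, not DataArrays.jl API): for a plain vector the levels are computed on demand with `unique`, while a pooled array would simply return its cached pool.

```julia
# Illustrative sketch: levels computed on demand vs. read from a cache.
# `CachedPool` and `levels_of` are invented for illustration.
struct CachedPool
    data::Vector{String}
    pool::Vector{String}     # cached unique values (the PDA case)
end

levels_of(x::Vector) = unique(x)     # DA case: computed at use time
levels_of(x::CachedPool) = x.pool    # PDA case: O(1) cached lookup

levels_of(["a", "b", "a"])           # ["a", "b"]
```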

from dataarrays.jl.

nalimilan avatar nalimilan commented on June 12, 2024

Sounds like a plan, though here are a few points I'd like to discuss:

PDA's shouldn't be used as a workhorse data structure for data that is still likely to be mutated. One should, under most circumstances, only construct a PDA when you know that the data is effectively static.

That sounds too restrictive to me. When you have large data consisting of strings, storing it as a vector of strings takes much more memory than pooling them, so it should be possible to load data as a PooledDataVector even if you still need to recode things. Moreover, I don't think the distinction between "in flux" and "stable" is very practical: it would mean that once your data is pooled, you have to reload it in another form to make a few changes whenever you detect problems or find new ideas.

Anyway, I don't understand why you say pooled data should be stable. What's the technical constraint that imposes that?

The value of PDA's is that they cache information about categorical data. For now, PDA's cache only information about the unique values found in the full array. But I'd like to propose adding more information.
In addition to caching the unique values, I would argue for maintaining a count of the number of times each value occurs. This would make it easier to repeatedly use PDA's in cross tabulations. In addition, we can use the counts of each unique value to ensure that the PDA's pool never contains entries that are no longer present in the PDA's refs.

I'm not sure this adds much value. The one-way frequency table does not help you in any way to build a multidimensional table. And it imposes a non-negligible cost every time you take a subset of the vector (and this would be worse with DataFrames with many variables), which may make Julia feel slow. This behavior would also be a problem if DataFrames were backed by mmapped files or by a database: subsetting would mean reloading all of the data into memory.

Being able to automatically drop empty levels is an important feature to me, but it could be done only when taking a subset, and as an option.


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

You're right: the framing of "don't mutate PDA's" was too strong. What I want is to get everyone to agree that PDA's offer a performance trade-off similar to that offered by indexed columns in an RDBMS. You can do lots of cool things faster when you have an index for a database, but you pay a price in the speed of insertions and deletions. PDA's would offer such a trade-off: they perform insertions and deletions a little slower, but give you lots of useful cached information in exchange.

I'm not sure if the subsetting concern is going to be a problem for long, since the long-term plan has always been to make subsetting operations on DataArray's and DataFrame's return SubDataArray and SubDataFrame structures once Base switches to producing array views by default. Since the Sub* structure is just a mapping from indices in the subset to indices in the original, I don't see any need to copy anything about the counts. The costly copy only happens when you want an independent copy of the subset, which is something you generally don't need.


nalimilan avatar nalimilan commented on June 12, 2024

Yeah, but if people mostly work with SubDataArrays, and counts are not updated in that case, they are of very limited use. ;-)


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

You mean when they call countmap? That's the only case I'm seeing right now in which we'd use the counts to do anything other than prune the pool. Seems like it would be no slower than using countmap on a DataArray.


nalimilan avatar nalimilan commented on June 12, 2024

Sorry, I don't get what you mean. ;-)


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

[image: "Not sure if serious" meme]


lindahua avatar lindahua commented on June 12, 2024

Wow, I like the picture 👍


nalimilan avatar nalimilan commented on June 12, 2024

Now I'm totally lost!


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

What specifically are you lost about? Because I'm lost about what you're lost about.


nalimilan avatar nalimilan commented on June 12, 2024

Just joking, but I simply did not understand your question in #50 (comment).


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

I just wanted to ask if there were cases in which we would use the proposed counts field other than maintaining referential integrity of the pool field -- except for the case of someone asking exactly for the counts in a tabulation.


nalimilan avatar nalimilan commented on June 12, 2024

I see. Well, precisely, I don't think it could be used in any other case -- that's why I think it's not worth keeping around.


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

But it is worth keeping around: it lets you maintain referential integrity.


nalimilan avatar nalimilan commented on June 12, 2024

Well, once you've removed empty levels, you no longer need it, do you?


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

Yes, but this is the proposed mechanism by which you know when a level is empty. This is basically the reference counting approach used in many garbage collectors.
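The reference-counting idea could be sketched like this (a hedged illustration, not actual DataArrays.jl code; `counts` and `drop_ref!` are invented names): each pool entry carries a count of how many refs point at it, and a level becomes prunable the moment its count reaches zero.

```julia
# Illustrative sketch: a pool with per-level reference counts,
# mirroring the refcounting used by many garbage collectors.
# `pool[i]` is a level; `counts[i]` is how many refs use it.
pool   = ["A", "B", "C"]
counts = [2, 1, 0]           # level "C" is already unreferenced

# Removing a value decrements its level's count; when the count
# reaches zero we know the level is empty and can be pruned.
function drop_ref!(counts, ref)
    counts[ref] -= 1
    return counts[ref] == 0   # true => level is now prunable
end

prunable = drop_ref!(counts, 2)  # "B" had one reference left => true
```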


garborg avatar garborg commented on June 12, 2024

There's a little uncertainty about how PooledDataArrays with numeric eltypes should be interpreted, here and #46 -- should they be treated like a DataArray or like categorical data?

Will people still use PooledDataArrays to store numbers more compactly if they're Uint32 under the hood (#57)?

If so, it would be nice if neither use required remembering to convert (or functions taking extra arguments) each time the array is used. Since nominal and ordinal pooled data live under the same type, perhaps we could add a categorical::Bool field or replace ordered with a scale field.

If not, maybe treat them as categorical, at least until more compact storage comes back once core functionality and benchmarking are in place. (That fits my preference for the default: using the same clean syntax to pool a string and to pool an integer should work, and it seems more useful to be explicit when pooling purely for performance.)


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

I've just had an epiphany about PDA's. And now I'm resolutely opposed to their existence and think we need something totally different from our current implementation. My proposed changes are also not going to save us from the problem I've discovered.

The problem is that PDA's, as we've thought about them up till now, are subtly corrupted by our familiarity with R's type system. But R's type system is fundamentally different from Julia's type system, because R has no scalar types at all.

And the implications of that statement are actually a disaster waiting to happen for us if we ever implement ordered categorical data.

The problem is that we can never hope to define an ordering on PDA's that works in a reliable way in Julia.

The reason is that R's factors are always vectors and, importantly, a single element of an R factor is itself a length-1 R factor vector.

As such, simple things like factor[index1] > factor[index2] always have access to all of the properties of the factor in R. In contrast, single-element indexing in Julia will always produce a single scalar element that loses all access to the vector from which it was extracted.

As such, every single time you select a scalar element from a PDA, you lose access to all information about the ordering defined by the PDA. This will be a nightmare for us at some point.

To see why, imagine that you have a PooledDataArray{UTF8String} and that you then define an ordering over its levels. As one would hope, the ordering we define can affect computations on the vector as a whole, but will never apply to the single elements of the PDA when they're considered in isolation.

For example, if you defined the ordering over the two levels, ["A", "B"] so that B < A held in the vector, you would get totally broken behavior for comparison operations every time you pulled elements out of this PDA, because "A" < "B" is a property of strings that we can't just override at will.

So what should we do? I now think that we should abandon PDA's completely. The way to provide minimal memory usage, representation of categorical data with unobserved levels, and custom orderings of scalars is to use vectors whose elements are instances of a new kind of Enum type. The Enum type could have all the properties we'd like to define and would not be subtly broken the way ordered PDA's would be.
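One way to read this proposal (purely a sketch; `OrderedLevel` and its fields are invented names, not a proposed API): make each scalar carry its position in the user-defined level ordering, so that comparisons between extracted elements follow the levels rather than the underlying string order.

```julia
# Sketch of a scalar type that bakes the level ordering in, so
# extracted elements compare by level index, not string order.
# `OrderedLevel` is invented for illustration only.
struct OrderedLevel
    value::String
    order::Int       # position in the user-defined level ordering
end

Base.isless(a::OrderedLevel, b::OrderedLevel) = a.order < b.order

# Levels ordered so that "B" < "A", which plain strings can't express:
B = OrderedLevel("B", 1)
A = OrderedLevel("A", 2)

B < A        # true: the comparison follows the levels
"A" < "B"    # true: the string ordering we cannot override
```

Because the ordering travels with the scalar, `pda[i] > pda[j]` would keep working after extraction, which is exactly what breaks with a raw `UTF8String` element.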

That said, I think we could still make PDA's useful. But I think they'd be most useful as a tool for providing O(1) calculations of unique and levels. That might make them of such rare and obscure use that we shouldn't spend our time maintaining them, but it's at least one possible use case for that kind of data structure.

But, if we ever want to get ordered factors right, we need to start baking the levels of ordered categorical data into the Julia type system, which means using something like Enum's and not using something like PDA's.

Just like I've come to realize that emulation of delayed evaluation is hard to get right in Julia, I think working with ordered categorical data in Julia requires new abstractions that are further removed from R's precedent than our current implementation.


lindahua avatar lindahua commented on June 12, 2024

I did not actively contribute to the DataArray/DataFrames package, but I often keep an eye on the discussions here.

John, I agree with your point that an efficient & versatile Enum type is the correct way forward.


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

Thanks, Dahua!


nalimilan avatar nalimilan commented on June 12, 2024

Very interesting considerations. ;-) You're right that the concept of an ordered/unordered set of levels is an enum in the language, so it makes sense to consider it as such. Using strings to represent these values is a habit we took from R (but also from most stats languages I know of).

So how would this work? I guess an enum type would be created on the fly? And then how would you access the individual values of the enum, e.g. how would you write pda .== "A"? With pda .== :A, or pda .== MyEnum.A? I reckon the enum shouldn't be named or accessible independently of the PDA.


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

I'm not sure what the status of Enum types is in Base, but there's been some recent discussion about making them first-class citizens. I think we need to wait to see how they get handled there before we could make a clear plan for moving forward.

Conditional on how the Base implementation works out, we would have to make new Enum types on the fly to represent categorical data. We'd need to pick an interface, but I think something like makefactor(DataArray, levels = [...], order = [...]) would let us convert a DataArray{T} into a DataArray{Enum{T}} pretty easily. That function would then have to do a whole bunch of stuff: define the Enum type (and, arguably, do so in the proper scope), define an ordering on the Enum or define the absence of an ordering, define conversions between types, etc.

Re pda .== "A", I think this shouldn't be too hard if the Enum is derived from UTF8String. You'd just provide a mechanism for going back and forth between UTF8String and the Enum internal encoding, which is going to be something like Uint8.
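That back-and-forth mapping could look something like this sketch (hypothetical names; `pooled_eq` is not a real function, and the real internal encoding was integer refs):

```julia
# Sketch: comparing a pooled vector against a raw string by first
# translating the string into its internal code, then comparing
# cheap integer refs instead of strings. Names are illustrative only.
pool = ["A", "B"]             # unique levels
refs = UInt8[1, 2, 2, 1]      # encoded data, i.e. ["A", "B", "B", "A"]

function pooled_eq(pool, refs, s)
    code = findfirst(==(s), pool)       # string -> internal code
    code === nothing && return falses(length(refs))
    return refs .== UInt8(code)         # one integer comparison per element
end

pooled_eq(pool, refs, "A")    # true, false, false, true
```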


nalimilan avatar nalimilan commented on June 12, 2024

OK. So the only controversial point is actually whether the term factor should be retained or not. ;-)


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

I don't have strong feelings about the best name. But factor already exists in Base Julia and we shouldn't pun against it, so I'd say that's a no-go.


nalimilan avatar nalimilan commented on June 12, 2024

Yeah, let's find a better name. I don't know where the S authors found this term...

Going back to enums: do you still want to support storing integers in PDAs? Julia enums may not support integer labels. This may also be a problem for any type that is not a string, e.g. Datetime objects. We would need either very flexible enums, or a way to map enum values to a pool of levels.


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

I'm pretty sure enums are going to let you use any type, but let me know if that's not right.


nalimilan avatar nalimilan commented on June 12, 2024

Well, apparently the idea with enums is to associate symbols with integers. Then it's up to you to do what you want with them. While you can easily convert symbols to strings, you cannot automatically convert them to an arbitrary type: you'd need to store the objects in a pool and retrieve them using their integer index.


johnmyleswhite avatar johnmyleswhite commented on June 12, 2024

That's a bummer. We could always say that you can only use categorical data with string labels, then do a string -> symbol conversion in the background. Arguably there's no real need to use integers as categorical variables since the categoricalness means that arithmetic on the categories doesn't make any sense. So we can just replace integers with the symbols that represent them.

