
Comments (15)

johnmyleswhite commented on June 12, 2024

I'm on board with using custom implementations of every function for both DA's and PDA's since I don't see any new AbstractDataArray types coming up soon.

from dataarrays.jl.

simonster commented on June 12, 2024

Should e.g. +(PooledDataArray, AbstractArray) and +(PooledDataArray, DataArray) return PooledDataArrays or DataArrays?


johnmyleswhite commented on June 12, 2024

That's a really good question. I'd say those things should raise errors: once you assert your data is categorical, we turn arithmetic off.


simonster commented on June 12, 2024

I can see a legitimate use case for +(PooledDataArray, PooledDataArray) if you have multiple categorical variables but you want to do something with their sum. Not sure about mixed operations like those above, although I suppose they could arise if you have data in different forms.


johnmyleswhite commented on June 12, 2024

What's the legitimate use case for + on categorical variables? I'm pretty uncomfortable with the idea. I'd really like our approach to be consistent with the classical theory of measurement: http://en.wikipedia.org/wiki/Level_of_measurement#Nominal_scale


nalimilan commented on June 12, 2024

Well, for ordinal data in some cases this can make sense (think of a Likert scale). But I think it would be safer to require people to convert to integers explicitly. If you start implementing +, you'll have to support all operations that apply to reals.


simonster commented on June 12, 2024

Yes, I think this boils down to a question of what PooledDataArrays represent. Here are four possible answers:

  • PooledDataArrays are behaviorally identical to DataArrays, but with different storage characteristics. The subtype of AbstractDataArray has no relationship to the interpretation of the data it contains. In this case they should implement the same operators. (Currently they do, although the existing implementations are slow for PooledDataArrays.)
  • PooledDataArrays are behaviorally identical to DataArrays, but not intended for numeric values since those can be stored more efficiently in DataArrays. (Generally true, but there are cases where PooledDataArrays could be more efficient: if you can use Uint8s in your reference array but your values span a larger range, or if you plan to apply scalar operators to your data that only need to operate on the pool in the PooledDataArray case.)
  • PooledDataArrays represent variables on a nominal scale. In this case, no arithmetic or comparison operators should be defined.
  • PooledDataArrays represent variables on nominal or ordinal scales. In this case, comparison operators should be defined. The Wikipedia article @johnmyleswhite links to above suggests that no arithmetic operators should be defined. I have only superficial knowledge of the issues involved in classical theory of measurement, but an apparent counterpoint is Spearman's rho, which involves mathematical operations on ranks.

We should not worry about the cost of supporting operators for PooledDataArrays, since we have to support all operators for DataArrays anyway, and metaprogramming makes supporting all operators for PooledDataArrays almost as easy as supporting one.
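
The pool-only case mentioned in the second answer can be sketched roughly as follows. This is a minimal illustration in Python, not DataArrays.jl code: the `PooledArray` class, its `refs`/`pool` fields, the use of `0` as the missing marker, and the `map_scalar` helper are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PooledArray:
    """Hypothetical pooled container: integer refs into a pool of unique values."""
    refs: List[int]    # assumption: 0 marks a missing value, k > 0 means pool[k - 1]
    pool: List[float]  # each distinct value is stored once

    @classmethod
    def from_values(cls, values: List[Optional[float]]) -> "PooledArray":
        pool: List[float] = []
        index: Dict[float, int] = {}  # value -> 1-based position in pool
        refs: List[int] = []
        for v in values:
            if v is None:
                refs.append(0)
                continue
            if v not in index:
                pool.append(v)
                index[v] = len(pool)
            refs.append(index[v])
        return cls(refs, pool)

    def map_scalar(self, f: Callable[[float], float]) -> "PooledArray":
        # Apply f once per unique value; the refs are copied unchanged.
        # If f is not injective, the new pool may contain duplicates.
        return PooledArray(list(self.refs), [f(v) for v in self.pool])

    def to_list(self) -> List[Optional[float]]:
        return [None if r == 0 else self.pool[r - 1] for r in self.refs]


pda = PooledArray.from_values([1000.0, 2000.0, None, 1000.0, 2000.0, 1000.0])
shifted = pda.map_scalar(lambda x: x + 1.0)  # touches 2 pool entries, not 6 elements
```

With six elements but only two distinct values, the scalar operation runs twice rather than six times, which is the efficiency argument made above.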


nalimilan commented on June 12, 2024

Yeah, we really need to decide what PDAs are.

Regarding ordinal scales, computing Spearman's rank correlation coefficient does not imply you are able to attribute a precise numeric value to a level: just that you know their order. So + does not apply in that case.

Anyway, I was not suggesting the problem was the cost of implementing operators, rather that you need to draw all the logical implications of adding +.


johnmyleswhite commented on June 12, 2024

As I said in another issue, I now quite strongly think that we should use an Enum-like type for processing categorical and ordinal data, then store those values in DataArray's. PDA's are an interesting idea, but not very valuable in the long run because of the existence of true scalars in Julia. The analogies with R that inspired PDA's were helpful, but inexact and shouldn't be part of our long-term strategy.


nalimilan commented on June 12, 2024

So that means PDAs would become an enum for ordinal/nominal variables, and so + wouldn't make sense? And then maybe another type will have to be added one day if it's deemed useful to store numeric values from a small set?


simonster commented on June 12, 2024

Would this mean that what are now PooledDataArrays could become DataArrays of Enums, and we could kill AbstractDataArrays entirely? I'd be pretty happy with that, and I think it could simplify many things. Another question this discussion brings up is whether we want a separate wrapper for ordinal types.

My point about Spearman's rho was that regardless of whether the original data was ordinal or interval scaled, the ranks are ordinal, but we would need to be able to perform + and - on them to calculate rho.


johnmyleswhite commented on June 12, 2024

PDA's will just go away in my ideal world. For any specific categorical variable, there would be a custom type, which could be stored in Array's or DataArray's.

I would oppose implementing + for these variables.

If you want to use integers, but only a small number of them, use Int8. No need for us to reinvent the wheel.
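
As a rough illustration of a custom type per categorical variable with `+` left undefined, here is a sketch using Python's standard `enum` module rather than Julia. The `EyeColor` and `Agreement` types and the `supports_addition` helper are hypothetical names invented for the example; nothing here reflects an actual DataArrays.jl design.

```python
from enum import Enum


class EyeColor(Enum):
    """Hypothetical nominal variable: equality only, no ordering, no arithmetic."""
    BLUE = 1
    BROWN = 2
    GREEN = 3


class Agreement(Enum):
    """Hypothetical ordinal variable: equality plus ordering, still no arithmetic."""
    DISAGREE = 1
    NEUTRAL = 2
    AGREE = 3

    def __lt__(self, other):
        # Only `<` is defined here for brevity; it is enough for sorted().
        if self.__class__ is other.__class__:
            return self.value < other.value
        return NotImplemented


def supports_addition(a, b) -> bool:
    """Return True if `a + b` is defined, False if it raises TypeError."""
    try:
        a + b
    except TypeError:
        return False
    return True


responses = [Agreement.AGREE, Agreement.DISAGREE, Agreement.NEUTRAL]
ordered = sorted(responses)  # ordering works for the ordinal type

# Arithmetic requires an explicit conversion to integers first.
explicit_sum = Agreement.AGREE.value + Agreement.NEUTRAL.value
```

Adding two members directly raises `TypeError`, so anyone who wants a sum has to convert to integers on purpose, as suggested earlier in the thread.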


simonster commented on June 12, 2024

Sounds good to me.


johnmyleswhite commented on June 12, 2024

Yes, we'd only have DataArrays of Enums.

I think the Enum's we'd want would be either descendants of NominalVariable or OrdinalVariable.

For calculating Spearman's rho, you map the ordered elements to the integers and then do arithmetic on the integers. So we don't need to implement anything more than the map from elements to integers.
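
The map-then-compute approach described above might look something like the following Python sketch. The level names, the `CODE` mapping, and the helper functions are assumptions for illustration; ties receive average ranks, and rho is computed as the Pearson correlation of the ranks.

```python
from typing import Dict, List, Sequence

LEVELS = ["low", "medium", "high"]  # hypothetical ordered levels
CODE: Dict[str, int] = {lvl: i + 1 for i, lvl in enumerate(LEVELS)}


def average_ranks(codes: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(codes)), key=lambda i: codes[i])
    ranks = [0.0] * len(codes)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and codes[order[end + 1]] == codes[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x_levels: Sequence[str], y_levels: Sequence[str]) -> float:
    # Step 1: map ordered elements to integers. Step 2: arithmetic on the ranks only.
    rx = average_ranks([CODE[v] for v in x_levels])
    ry = average_ranks([CODE[v] for v in y_levels])
    return pearson(rx, ry)
```

The categorical type itself never needs `+` or `-`; the only operation required of it is the order-preserving map to integers.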


simonster commented on June 12, 2024

If we're getting rid of PooledDataArrays, we obviously don't need operators for them, so I'm going to close this and open a new issue.

