Comments (15)
I'm on board with using custom implementations of every function for both DA's and PDA's since I don't see any new AbstractDataArray types coming up soon.
from dataarrays.jl.
Should e.g. `+(PooledDataArray, AbstractArray)` and `+(PooledDataArray, DataArray)` return PooledDataArrays or DataArrays?
from dataarrays.jl.
That's a really good question. I'd say those things should raise errors: once you assert your data is categorical, we turn arithmetic off.
from dataarrays.jl.
I can see a legitimate use case for `+(PooledDataArray, PooledDataArray)` if you have multiple categorical variables but you want to do something with their sum. Not sure about mixed operations like those above, although I suppose they could arise if you have data in different forms.
from dataarrays.jl.
What's the legitimate use case for `+` on categorical variables? I'm pretty uncomfortable with the idea. I'd really like our approach to be consistent with the classical theory of measurement: http://en.wikipedia.org/wiki/Level_of_measurement#Nominal_scale
from dataarrays.jl.
Well, for ordinal data in some cases this can make sense (think of a Likert scale). But I think it would be safer to require people to convert to integers explicitly. If you start implementing `+`, you'll have to support all operations that apply to reals.
from dataarrays.jl.
Yes, I think this boils down to a question of what PooledDataArrays represent. Here are four possible answers:
- PooledDataArrays are behaviorally identical to DataArrays, but with different storage characteristics. The subtype of AbstractDataArray has no relationship to the interpretation of the data it contains. In this case they should implement the same operators. (Currently they do, although the existing implementations are slow for PooledDataArrays.)
- PooledDataArrays are behaviorally identical to DataArrays, but not intended for numeric values since those can be stored more efficiently in DataArrays. (Generally true, but there are cases where PooledDataArrays could be more efficient: if you can use Uint8s in your reference array but your values span a larger range, or if you plan to apply scalar operators to your data that only need to operate on the pool in the PooledDataArray case.)
- PooledDataArrays represent variables on a nominal scale. In this case, no arithmetic or comparison operators should be defined.
- PooledDataArrays represent variables on nominal or ordinal scales. In this case, comparison operators should be defined. The Wikipedia article @johnmyleswhite links to above suggests that no arithmetic operators should be defined. I have only superficial knowledge of the issues involved in classical theory of measurement, but an apparent counterpoint is Spearman's rho, which involves mathematical operations on ranks.
We should not worry about the cost of supporting operators for PooledDataArrays, since we have to support all operators for DataArrays anyway, and metaprogramming makes supporting all operators for PooledDataArrays almost as easy as supporting one.
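The pool argument in the second answer above can be made concrete with a small sketch. This is not the DataArrays implementation: `TinyPooled`, `expand`, and `mapvalues` are hypothetical names, and the code is written for current Julia (hence `UInt8`). It only illustrates how small references into a pool let a scalar function touch each distinct value once.

```julia
# Hypothetical minimal pooled vector: `pool` holds the distinct values,
# `refs` holds one small integer index into `pool` per element.
struct TinyPooled{T}
    refs::Vector{UInt8}
    pool::Vector{T}
end

function TinyPooled(values::Vector{T}) where {T}
    pool = unique(values)
    length(pool) <= typemax(UInt8) || error("too many distinct values for UInt8 refs")
    index = Dict(v => UInt8(i) for (i, v) in enumerate(pool))
    return TinyPooled{T}([index[v] for v in values], pool)
end

# Expand back to an ordinary Vector by looking every reference up in the pool.
expand(p::TinyPooled) = p.pool[p.refs]

# Apply a scalar function to the pool only; the refs are reused unchanged.
# If `f` maps two pool entries to the same value, the new pool holds duplicates.
mapvalues(f, p::TinyPooled) = TinyPooled(p.refs, map(f, p.pool))

p = TinyPooled([100_000, 200_000, 100_000, 100_000])
q = mapvalues(x -> x + 1, p)   # calls the function twice, not four times
```

Here the values span a range that needs a wide integer, yet each element costs one byte in `refs`, and `mapvalues` does work proportional to the pool size rather than the array length.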
from dataarrays.jl.
Yeah, we really need to decide what PDAs are.
Regarding ordinal scales, computing Spearman's rank correlation coefficient does not imply you are able to attribute a precise numeric value to a level: just that you know their order. So `+` does not apply in that case.
Anyway, I was not suggesting the problem was the cost of implementing operators, rather that you need to draw all the logical implications of adding `+`.
from dataarrays.jl.
As I said in another issue, I now quite strongly think that we should use an Enum-like type for processing categorical and ordinal data, then store those values in DataArray's. PDA's are an interesting idea, but not very valuable in the long run because of the existence of true scalars in Julia. The analogies with R that inspired PDA's were helpful, but inexact and shouldn't be part of our long-term strategy.
from dataarrays.jl.
So that means PDAs would become an enum for ordinal/nominal variables, and so `+` wouldn't make sense? And then maybe another type will have to be added one day if it's deemed useful to store numeric values from a small set?
from dataarrays.jl.
Would this mean that what are now PooledDataArrays could become DataArrays of Enums, and we could kill AbstractDataArrays entirely? I'd be pretty happy with that, and I think it could simplify many things. Another question this discussion brings up is whether we want a separate wrapper for ordinal types.
My point about Spearman's rho was that regardless of whether the original data was ordinal or interval scaled, the ranks are ordinal, but we would need to be able to perform `+` and `-` on them to calculate rho.
from dataarrays.jl.
PDA's will just go away in my ideal world. For any specific categorical variable, there would be a custom type, which could be stored in Array's or DataArray's.
I would oppose implementing `+` for these variables.
If you want to use integers, but only a small number of them, use `Int8`. No need for us to reinvent the wheel.
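As a rough illustration of the "custom type per categorical variable" idea, here is a sketch using the `@enum` macro from current Julia, which postdates this discussion. The `Agreement` type and its levels are made up for the example; this is one possible reading of the proposal, not the design that was eventually adopted.

```julia
# Hypothetical ordinal variable modelled as a plain Julia enum.
@enum Agreement disagree neutral agree

# Enum values are ordinary scalars, so they can live in a normal Vector.
responses = [agree, disagree, neutral, agree]

# Ordering comparisons are defined for values of the same enum type.
ordered = disagree < agree
top = maximum(responses)

# Arithmetic is not defined: `agree + neutral` would throw a MethodError.
plus_defined = applicable(+, agree, neutral)

# Getting integers out requires an explicit conversion.
codes = Int.(responses)
```

Because the conversion to integers is explicit, any arithmetic happens on the integer codes, and the enum itself never has to support `+`.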
from dataarrays.jl.
Sounds good to me.
from dataarrays.jl.
Yes, we'd only have DataArrays of Enums.
I think the Enum's we'd want would be either descendants of NominalVariable or OrdinalVariable.
For calculating Spearman's rho, you map the ordered elements to the integers and then do arithmetic on the integers. So we don't need to implement anything more than the map from elements to integers.
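A minimal sketch of that recipe, assuming the ordered elements have already been mapped to integer codes: rank the codes (ties share an average rank), then compute an ordinary Pearson correlation on the ranks. The helper names `averageranks`, `pearson`, and `spearman` are hypothetical, and the code uses only Base Julia.

```julia
# Average ranks: tied values share the mean of the positions they occupy.
function averageranks(codes::AbstractVector{<:Real})
    order = sortperm(codes)
    ranks = zeros(Float64, length(codes))
    i = 1
    while i <= length(order)
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j < length(order) && codes[order[j + 1]] == codes[order[i]]
            j += 1
        end
        ranks[order[i:j]] .= (i + j) / 2
        i = j + 1
    end
    return ranks
end

# Pearson correlation written out by hand so the sketch needs no packages.
function pearson(x, y)
    mx, my = sum(x) / length(x), sum(y) / length(y)
    dx, dy = x .- mx, y .- my
    return sum(dx .* dy) / sqrt(sum(abs2, dx) * sum(abs2, dy))
end

# Spearman's rho is the Pearson correlation of the ranks.
spearman(x, y) = pearson(averageranks(x), averageranks(y))

# Hypothetical ordered levels and the explicit map from elements to integers.
levels = ["low", "medium", "high"]
code = Dict(l => i for (i, l) in enumerate(levels))
x = [code[v] for v in ["low", "high", "medium", "high"]]
y = [code[v] for v in ["low", "medium", "medium", "high"]]
rho = spearman(x, y)
```

All the `+` and `-` here act on `Float64` ranks, never on the categorical values themselves.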
from dataarrays.jl.
If we're getting rid of PooledDataArrays, we obviously don't need operators for them, so I'm going to close this and open a new issue.
from dataarrays.jl.