Comments (6)

johnmyleswhite commented on June 12, 2024

Looking at this, I think we have too many different constructors for PDA's.

If we're going to start enforcing the referential integrity of PDA's, then I don't see any reason to allow users to specify the pool. The constructors for PDA's should be exactly like the constructors for DA's: you can specify data + missingness or just data.

from dataarrays.jl.

nalimilan commented on June 12, 2024

Specifying the pool can increase performance (both memory and speed), since you know the number of bits you need to store the references. It may also be used as a way to check that no unexpected values appear in the data (though this could of course be checked later).
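To make the memory argument concrete, here is a minimal sketch (in Python for brevity, not the package's Julia code) of picking the narrowest reference width once the number of levels is known up front. The function name `ref_width_bits` is hypothetical, and reserving reference 0 for missing values is an assumption of this sketch.

```python
def ref_width_bits(n_levels: int) -> int:
    """Return the smallest unsigned width (8, 16, 32 or 64 bits) that can
    index n_levels distinct values.

    Assumption of this sketch: reference 0 is reserved for missing values,
    so a width of `bits` can address at most 2**bits - 1 levels.
    """
    if n_levels < 0:
        raise ValueError("n_levels must be non-negative")
    for bits in (8, 16, 32, 64):
        if n_levels <= 2**bits - 1:
            return bits
    raise ValueError("too many levels for a 64-bit reference")
```

With three levels each reference fits in a single byte, whereas a default 32-bit reference would use four times as much memory per element.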

johnmyleswhite commented on June 12, 2024

If you need performance you could always use a RefArray. The argument that it lets you check values seems to be the mirror image of the "it lets you create an invalid PDA" argument.

nalimilan commented on June 12, 2024

I thought RefArrays were not supposed to be used outside the package itself (#13). I don't think providing a list of values should allow creating an invalid PDA. It would simply ensure that your expectations about the levels that are in the data are correct -- the resulting PDA should be correct in all cases.

Anyway, I don't really care, I was just laying out the possible counterarguments. It may be simpler to accept an optional argument giving the expected maximum number of levels -- or to do nothing for now, as it can always be added later easily.

johnmyleswhite commented on June 12, 2024

Yes, we shouldn't export RefArray's. But people who are sufficiently interested in performance will often try digging into internals even when we tell them not to.

I think I might just be missing something about your point, but it seems like we can only ensure that the passed pool is correct if we do a bunch of computations to check correctness, which might be just as costly as creating the pool to begin with. It seems like we'd need some substantial benchmarks to really know.

Offering a constructor that says what the ref size should be seems totally reasonable. I mostly just want us to stop exposing the pool at all, because I think it should be an implementation detail, not a property of PDA's. The property of PDA's is that they make it efficient to work with functions like `unique` and `levels`.

My main argument for removing a lot of these constructors is that they only make sense if you intend to allow PDA's to not guarantee referential integrity. Since R allows that, we ended up implementing a lot of functionality that I think we would be better off without. We have too many features and our code is too buggy. I'd like to offer fewer features and more guarantees that everything really works as claimed.

nalimilan commented on June 12, 2024

In my mind you would create the pool, iterate over the input array and assign elements a reference. If you encounter a value that was not passed in the pool, throw an error. If you find a level in the pool with no value in the data, throw an error at the end of the process too (or drop it, not sure what's best).
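The procedure described above could look roughly like the following sketch, written in Python rather than Julia to keep it self-contained. The name `encode_with_pool` is hypothetical, `None` stands in for a missing value, reference 0 is assumed to mean "missing", and unused levels raise an error (the stricter of the two options mentioned).

```python
def encode_with_pool(data, pool):
    """Encode `data` as integer references into a caller-supplied `pool`.

    Assumptions of this sketch: references are 1-based, 0 marks a missing
    value (None), and both unknown values and unused levels are errors.
    """
    # Map each level to its 1-based reference, rejecting duplicate levels.
    index = {}
    for ref, level in enumerate(pool, start=1):
        if level in index:
            raise ValueError(f"duplicate level {level!r} in pool")
        index[level] = ref

    refs = []
    used = set()
    for value in data:
        if value is None:
            refs.append(0)
            continue
        if value not in index:
            # A value outside the pool violates the caller's expectations.
            raise ValueError(f"value {value!r} is not in the supplied pool")
        ref = index[value]
        refs.append(ref)
        used.add(ref)

    # Checked only after the full pass over the data.
    unused = [level for level in pool if index[level] not in used]
    if unused:
        raise ValueError(f"levels never seen in the data: {unused!r}")
    return refs
```

Each element costs one dictionary lookup, so validating against a supplied pool does roughly the same work as discovering the pool from the data; this is only a sketch, and real numbers would need the benchmarks mentioned earlier in the thread.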

But let's go with only an argument specifying the number of unique values (i.e. size of the pool).
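As an illustration of that alternative, here is a minimal sketch (again in Python, not the package's Julia code) in which the pool stays an internal detail built from the data, and the caller only supplies an upper bound on the number of unique values. The name `encode_with_capacity` and the choice to raise an error when the bound is exceeded are assumptions of this sketch.

```python
def encode_with_capacity(data, max_levels):
    """Build the pool from `data`, allowing at most `max_levels` unique values.

    Assumptions of this sketch: references are 1-based, 0 marks a missing
    value (None), and exceeding `max_levels` raises an error.
    Returns (refs, pool).
    """
    pool = []    # levels in order of first appearance
    index = {}   # level -> 1-based reference
    refs = []
    for value in data:
        if value is None:
            refs.append(0)
            continue
        ref = index.get(value)
        if ref is None:
            if len(pool) >= max_levels:
                raise ValueError(
                    f"more than {max_levels} unique values in the data"
                )
            pool.append(value)
            ref = len(pool)
            index[value] = ref
        refs.append(ref)
    return refs, pool
```

Because every level in the pool comes from the data itself, the result cannot contain unused levels or dangling references, which is the guarantee argued for earlier in the thread.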