Comments (7)

michaelschiff commented on May 20, 2024

I've reproduced this in test.

The issue only occurs if the duplicate dimension has never been seen before.

Each time an InputRow is added, we initialize a 2D array with one top-level cell per dimension, each cell holding an array of the unique values for that dimension. The array's size is the number of dimensions seen so far.
We then loop through the row's dimensions, updating the 2D array (or an overflow buffer) with the values for each dimension.

We will see the duplicate dimension twice in this update loop. The second time, we find an index for this dimension (assigned the first time) and attempt to store the new set of values at that index in the 2D array. However, the array was sized from the number of dimensions seen as of the previous call to add, which is one too small because we had never seen this dimension before, so we hit an index out of bounds.

In the case that the duplicate dimension had been seen before (in a previous call to add), the 2D array will already be properly sized. We will find an index for the duplicate dimension both times we see it in the update loop, and set it to the same set of values in the 2D array.
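The two cases above can be modeled with a toy sketch. This is not Druid's actual code: the class `ToyIndex`, its `add` signature, and the overflow handling are assumptions made only to illustrate the sizing problem described in this comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the add path described above; not Druid's code.
class ToyIndex {
    // Dimension name -> position in the per-row 2D array.
    private final Map<String, Integer> dimIndex = new LinkedHashMap<>();

    String[][] add(List<String> dimensions, Map<String, String[]> values) {
        // Sized from the dimensions known BEFORE this row is processed.
        String[][] dims = new String[dimIndex.size()][];
        List<String[]> overflow = new ArrayList<>();

        for (String dim : dimensions) {
            Integer index = dimIndex.get(dim);
            if (index == null) {
                // First sighting: register the dimension and buffer its values.
                dimIndex.put(dim, dimIndex.size());
                overflow.add(values.get(dim));
            } else {
                // Out of bounds if dim was registered earlier in this same row.
                dims[index] = values.get(dim);
            }
        }

        // Append the buffered values for newly seen dimensions.
        String[][] result = Arrays.copyOf(dims, dims.length + overflow.size());
        for (int i = 0; i < overflow.size(); i++) {
            result[dims.length + i] = overflow.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String[]> values = Map.of("a", new String[] {"x"});

        // Case 2: "a" was seen on a previous row, so the duplicate is silently accepted.
        ToyIndex warmed = new ToyIndex();
        warmed.add(List.of("a"), values);
        warmed.add(List.of("a", "a"), values);

        // Case 1: "a" has never been seen, so the second occurrence overruns the array.
        ToyIndex fresh = new ToyIndex();
        try {
            fresh.add(List.of("a", "a"), values);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("duplicate of a new dimension: " + e);
        }
    }
}
```

In this model the failure depends only on whether the duplicated dimension was registered before the current call, which matches the behavior reported above.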

It seems to me like duplicate dimensions should be an error, but I would like to clarify what the expected behavior is in this case.

80d8eedcf7422479020fa9388cd66f55dc74230d
This commit illustrates the second condition (the duplicate dimension has been seen before, and the add has no issue). By commenting out the first call to add, you can see it fail when it attempts to access the 2D array.

from druid.

nishantmonu51 commented on May 20, 2024

+1 on having an error for duplicate dimensions.
However, the error needs to be informative and meaningful, rather than an ArrayIndexOutOfBoundsException.

michaelschiff commented on May 20, 2024

So, assuming it is agreed that this case is an error, I see two approaches.

  1. Eliminate the possibility of duplicates by having the InputRow return a Set instead of a List (this has implications for dimension ordering).

  2. On each call to add, maintain a set of the dimensions seen in this input row, throwing a descriptive error on duplicate dimensions.

The existing lookup (does this dimension have an index?) can be used to detect duplicates on the first occurrence of a dimension: an index is found but the array is too small, so what is now an index out of bounds could be turned into something more descriptive. However, it is not sufficient to detect a duplicate if the dimension was seen on a previous row, because the index will be found and the array will be sufficiently large.
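A minimal sketch of the second approach, assuming the row's dimensions arrive as a `List<String>`. The class name `DuplicateDimensionCheck`, the method `requireUniqueDimensions`, the choice of `IllegalArgumentException`, and the message text are all hypothetical and are not taken from Druid or from the eventual fix.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating approach 2; not Druid's actual code.
class DuplicateDimensionCheck {
    // Throws a descriptive error if any dimension name repeats within one row.
    static void requireUniqueDimensions(List<String> dimensions) {
        Set<String> seen = new HashSet<>();
        for (String dim : dimensions) {
            // Set.add returns false when the element is already present.
            if (!seen.add(dim)) {
                throw new IllegalArgumentException(
                    "Dimension [" + dim + "] occurred more than once in InputRow");
            }
        }
    }

    public static void main(String[] args) {
        requireUniqueDimensions(Arrays.asList("a", "b")); // passes silently
        try {
            requireUniqueDimensions(Arrays.asList("a", "b", "a"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the set is rebuilt on every call, this check catches the duplicate whether or not the dimension was seen on an earlier row, at the cost of one extra set allocation per row.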

michaelschiff commented on May 20, 2024

#2017

Essentially, this is alternative 2 without the need for the additional set.

drcrallen commented on May 20, 2024

Is this a hypothetical, or is this occurring somewhere in the wild?

michaelschiff commented on May 20, 2024

We can probably close this, right? #2017 is merged.

fjy commented on May 20, 2024

Fixed by #2017
