IncrementalIndex.add() barfs when InputRow.getDimensions() has duplicates about druid HOT 7 CLOSED

apache commented on May 20, 2024

IncrementalIndex.add() barfs when InputRow.getDimensions() has duplicates


Comments (7)

michaelschiff commented on May 20, 2024

I've reproduced this in test.

The issue only occurs if the duplicate dimension has never been seen before.

Each time an InputRow is added, we initialize a 2D array with one top-level cell per dimension, each cell holding an array of the unique values for that dimension. The array's size is the number of dimensions seen so far.
We then loop through the row's dimensions, updating the 2D array (or an overflow buffer) with the values for each dimension.

We see the duplicate dimension twice in this update loop. The second time, we find an index for the dimension (assigned the first time) and attempt to store its values at that index in the 2D array. However, the array was sized from the number of dimensions seen as of the previous call to add, which is one too small because this dimension had never been seen before, so we hit an index-out-of-bounds error.

In the case that the duplicate dimension had been seen before (in a previous call to add), the 2D array will already be properly sized. We will find an index for the duplicate dimension both times we see it in the update loop, and set it to the same set of values in the 2D array.
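The two cases above can be reproduced with a simplified model. This is only a sketch of the behavior as described in this comment, not Druid's actual IncrementalIndex code; the class name `DuplicateDimensionSketch`, its fields, and the `add` signature are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the add path described above.
class DuplicateDimensionSketch {
    // Dimension name -> index in the 2D array; persists across calls to add().
    private final Map<String, Integer> dimensionIndex = new HashMap<>();

    String[][] add(List<String> dimensions, Map<String, String[]> values) {
        // Sized from the dimensions known BEFORE this row is processed.
        String[][] dims = new String[dimensionIndex.size()][];
        // Overflow buffer for dimensions first seen in this row.
        List<String[]> overflow = new ArrayList<>();

        for (String dimension : dimensions) {
            Integer index = dimensionIndex.get(dimension);
            if (index == null) {
                // First sighting: assign the next index, park values in overflow.
                dimensionIndex.put(dimension, dimensionIndex.size());
                overflow.add(values.get(dimension));
            } else {
                // A duplicate of a brand-new dimension lands here with
                // index == dims.length, which is out of bounds.
                dims[index] = values.get(dimension);
            }
        }

        // Append the overflow entries after the pre-sized cells.
        String[][] result = new String[dims.length + overflow.size()][];
        System.arraycopy(dims, 0, result, 0, dims.length);
        for (int i = 0; i < overflow.size(); i++) {
            result[dims.length + i] = overflow.get(i);
        }
        return result;
    }
}
```

In this model, calling `add` with the dimension list `["a", "a"]` on a fresh instance throws an ArrayIndexOutOfBoundsException, while the same call succeeds if a row containing `"a"` was added first.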

It seems to me like duplicate dimensions should be an error, but I would like to clarify what the expected behavior is in this case.

Commit 80d8eedcf7422479020fa9388cd66f55dc74230d illustrates the second condition (the duplicate dimension has been seen before, and the add has no issue). By commenting out the first call to add, you can see it fail when it attempts to access the 2D array.


nishantmonu51 commented on May 20, 2024

+1 on having an error for duplicate dimensions;
however, the error needs to be informative and meaningful instead of an ArrayIndexOutOfBoundsException.


michaelschiff commented on May 20, 2024

So, assuming it is agreed that this case is an error, I see two approaches.

1. Eliminate the possibility of duplicates by having the InputRow return a Set instead of a List (this has implications for dimension ordering).
2. On each call to add, maintain a set of the dimensions seen in this input row, throwing a descriptive error on duplicate dimensions.

The existing lookup (does this dimension have an index?) can be used to detect duplicates on the first occurrence of the dimension: if an index is found but the array is too small, what is now an index-out-of-bounds error can be turned into something more descriptive. However, it is not sufficient to detect duplicate dimensions if the dimension was seen on a previous row, because the index will be found and the array will be sufficiently large.
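A minimal sketch of the second approach, assuming a per-row set is acceptable. The class name `DuplicateDimensionCheck`, the method `validate`, the exception type, and the message wording are all hypothetical choices for illustration and are not taken from Druid or from the eventual fix.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating per-row duplicate detection.
class DuplicateDimensionCheck {
    // Throws a descriptive error if a dimension name appears more than once in the row.
    static void validate(List<String> dimensions) {
        Set<String> seenInThisRow = new HashSet<>();
        for (String dimension : dimensions) {
            // Set.add returns false when the element is already present.
            if (!seenInThisRow.add(dimension)) {
                throw new IllegalArgumentException(
                    "Dimension [" + dimension + "] occurred more than once in InputRow");
            }
        }
    }
}
```

Because the set is rebuilt for every row, this check catches duplicates regardless of whether the dimension was seen on an earlier row, at the cost of one extra allocation per call to add.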


michaelschiff commented on May 20, 2024

#2017

Essentially, this is alternative 2 without the need for the additional set


drcrallen commented on May 20, 2024

Is this a hypothetical, or is this occurring somewhere in the wild?


michaelschiff commented on May 20, 2024

We can probably close, right? #2017 is merged.


fjy commented on May 20, 2024

Fixed by #2017

