Comments (7)
I've reproduced this in a test.
The issue only occurs if the duplicate dimension has never been seen before.
Each time an InputRow is added, we initialize a 2D array whose top level has one cell per dimension, each cell holding an array of the unique values for that dimension. Its size is the number of dimensions seen so far.
We then loop through the dimensions, updating the 2D array (or an overflow buffer) with the values for each dimension.
We will see the duplicate dimension twice in this update loop. The second time, we will find an index for this dimension (set the first time) and attempt to store the new set of values for this dimension at that index in the 2D array. However, this array was sized based on the number of dimensions seen as of the last call to add (which is one too small, since we've never seen this dimension before), and we hit an index out of bounds error.
If the duplicate dimension had been seen before (in a previous call to add), the 2D array will already be properly sized. We will find an index for the duplicate dimension both times we see it in the update loop, and set it to the same set of values in the 2D array.
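To make the two cases above concrete, here is a minimal sketch of the sizing problem. It is a simplified model, not Druid's actual code: the class `DuplicateDimensionSketch`, its `dimensionIndex` map, and the placeholder values are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the add path described above; not Druid's real implementation.
class DuplicateDimensionSketch {
    // Maps each dimension name to its slot in the per-row 2D array.
    private final Map<String, Integer> dimensionIndex = new LinkedHashMap<>();

    // Builds the per-row array; each dimension gets a single placeholder value for brevity.
    String[][] add(List<String> rowDimensions) {
        // Sized by the number of dimensions known BEFORE this row is processed.
        String[][] dims = new String[dimensionIndex.size()][];
        List<String[]> overflow = new ArrayList<>();

        for (String dimension : rowDimensions) {
            String[] values = {dimension + "-value"};
            Integer index = dimensionIndex.get(dimension);
            if (index == null) {
                // First sighting: register the dimension and park its values in the overflow buffer.
                dimensionIndex.put(dimension, dimensionIndex.size());
                overflow.add(values);
            } else {
                // Known dimension: write into the array. This throws if the index
                // was assigned earlier in this same row, because dims is too small.
                dims[index] = values;
            }
        }

        // Append the overflow entries after the pre-sized cells.
        String[][] result = Arrays.copyOf(dims, dims.length + overflow.size());
        for (int i = 0; i < overflow.size(); i++) {
            result[dims.length + i] = overflow.get(i);
        }
        return result;
    }
}
```

In this model, calling `add` with a row listing a new dimension twice throws `ArrayIndexOutOfBoundsException`, while the same row succeeds if an earlier call to `add` already registered that dimension.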
It seems to me like duplicate dimensions should be an error, but I would like to clarify what the expected behavior is in this case.
80d8eedcf7422479020fa9388cd66f55dc74230d
This commit illustrates the second condition (the duplicate dimension has been seen before, and the add succeeds). If you comment out the first call to add, you can see it fail when it attempts to access the 2D array.
from druid.
+1 on having an error for duplicate dimensions; however, the error needs to be informative and meaningful instead of an ArrayIndexOutOfBoundsException.
from druid.
So, assuming it is agreed that this case is an error, I see two approaches:
- Eliminate the possibility of duplicates by having the InputRow return a Set instead of a List (with implications for dimension ordering).
- On each call to add, maintain a set of the dimensions seen in this input row, throwing a descriptive error on duplicate dimensions.
The existing lookup (does this dimension have an index?) can be used to detect duplicates on the first occurrence of the dimension (i.e. an index was found but the array is too small, which turns what is now an index out of bounds into something more descriptive). However, it is not sufficient to detect duplicates if the dimension was seen on a previous row, because the index will be found and the array will be sufficiently large.
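The second approach could look roughly like the sketch below. It is only an illustration: the class name `DuplicateDimensionCheck`, the method `validate`, the choice of `IllegalArgumentException`, and the message wording are assumptions, not what Druid does.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating approach 2; names and exception type are assumptions.
class DuplicateDimensionCheck {
    // Throws a descriptive exception if a dimension name appears more than once in a single row.
    static void validate(List<String> rowDimensions) {
        Set<String> seenInRow = new HashSet<>();
        for (String dimension : rowDimensions) {
            // Set.add returns false when the element is already present.
            if (!seenInRow.add(dimension)) {
                throw new IllegalArgumentException(
                    "Dimension [" + dimension + "] occurred more than once in InputRow");
            }
        }
    }
}
```

Because the set is rebuilt for each row, this catches duplicates whether or not the dimension was seen on an earlier row, at the cost of one extra set per call to add.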
from druid.
Essentially, this is alternative 2 without the need for the additional set
from druid.
Is this a hypothetical, or is this occurring somewhere in the wild?
from druid.
We can probably close this, right? #2017 is merged.
from druid.
Fixed by #2017
from druid.