Hi everyone, I'm currently trying to use ELKI for clustering some ra


No data type found satisfying: NumberVector,field AND NumberVector,variable about elki HOT 7 CLOSED

elki-project commented on May 19, 2024

No data type found satisfying: NumberVector,field AND NumberVector,variable

from elki.

Comments (7)

kno10 commented on May 19, 2024

It fails to parse the following two numbers:
0.11753855483588400432 0.28323801662637815291
The reason is probably that these are too precise for a double. Our parser then gives up and handles them as strings (unfortunately without better error handling; it is supposed to report a precision overflow error).
The closest doubles apparently are the following:
0.117538554835884 0.28323801662637815
How come that your data has 20 decimal digits of precision, when double only provides about 16? Do you need that extra precision? Or is that just used for alignment purposes of the csv file?

Our parser reads the decimal digits into a long. If that overflows, it fails. And 11753855483588400432, the decimal digits of the first number above, exceeds 2^63.
Our parser handles all 18-digit and most 19-digit numbers, and a double only provides about 16 digits, so usually we have some safety margin there...
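To illustrate the overflow described above, here is a minimal Java sketch. It is not ELKI's parser: the class and method names (`DigitOverflowDemo`, `fitsInLong`, `fractionDigits`) are made up for this example, and it only uses the standard `Long.parseLong` to show that the 20 decimal digits from the file do not fit into a signed 64-bit long, while an 18-digit string always does.

```java
public class DigitOverflowDemo {
    /** Returns true if the digit string fits into a signed 64-bit long. */
    public static boolean fitsInLong(String digits) {
        try {
            Long.parseLong(digits);
            return true;
        } catch (NumberFormatException e) {
            return false; // value is outside the range of long
        }
    }

    /** Returns the digits after the decimal point of a plain decimal string. */
    public static String fractionDigits(String text) {
        return text.substring(text.indexOf('.') + 1);
    }

    public static void main(String[] args) {
        String text = "0.11753855483588400432";
        String digits = fractionDigits(text);
        System.out.println(digits + " fits in a long: " + fitsInLong(digits));
        System.out.println("Long.MAX_VALUE = " + Long.MAX_VALUE);
        // A double keeps only about 16 significant digits of the input.
        System.out.println("as double: " + Double.parseDouble(text));
    }
}
```

Writing the values with at most 17 significant digits keeps the digit string well inside the range of a long and loses nothing that a double could represent anyway.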

I would accept a patch to ParseUtil#parseDouble (check for PRECISION_OVERFLOW) if it does not degrade performance. Otherwise, I would prefer a patch to improve error handling in the number vector parser that catches the precision overflow and (at least) outputs a warning. Clearly, treating these numbers as strings can be very confusing. The main motivation is that for integers, such very long integers usually indicate this is some kind of identifier column, and then automatically treating them as strings is actually helpful...


bastian-wur commented on May 19, 2024

aaaaah, okay, thanks, that makes sense.

There's no real reason why I have 20 decimal places. Multiple tools had failed to parse scientific notation, so I set my script to an arbitrary number of digits that I thought would be high enough to represent most of the numbers without rounding them to 0.
I actually did not consider the actual floating point precision lol (now I feel stupid).
I haven't tested yet if changing the precision will fix the issue, but I absolutely believe you -> issue closed.
Thanks for the quick help :).


kno10 commented on May 19, 2024

In 748252b I added a warning "Too many digits in what looked like a double number - treating as string" when a too-long float is interpreted as a string instead.


StatguyUser commented on May 19, 2024

I am trying to cluster sparse data from a doc2vec model using the KMEANS algorithm. It has 60000*300 dimensions, and the values are about 22 characters long on average, for example 0.00000804921828675072
When I cluster this dataset, I get the error below:

Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
    at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
    at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
    at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
    at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
    at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
    at [...]

Is there a data type I should select in parser.vector-type that can handle this data, or should I do something else to fix this and run it successfully?


kno10 commented on May 19, 2024

For such data, exponential formatting is more appropriate, and should work.

I.e., 8.04921828675072e-6 is the common way of storing such data.


StatguyUser commented on May 19, 2024

Exponential formatting in the input CSV file, or is there a setting in ELKI for that?


kno10 commented on May 19, 2024

As regular notation does not contain an "e", the parser autodetects exponential notation. It literally just reads the number and, when it encounters e-6 at the end, multiplies it by 10^-6.
It is very common, everybody uses it, so it cannot be turned off; it is always on.
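For anyone who needs to rewrite an existing file, the following is a minimal Java sketch of the conversion, not something ELKI provides. The names `ExponentialRewrite` and `toShortForm` are hypothetical. It relies on the standard `Double.toString`, which switches to scientific notation for magnitudes below 10^-3 and emits only as many digits as a double needs; the uppercase `E` is lowercased here purely to match the form shown in this thread.

```java
public class ExponentialRewrite {
    /**
     * Re-renders a decimal string using at most the digits a double can hold.
     * Small magnitudes come out in exponential notation, e.g. with "e-6".
     */
    public static String toShortForm(String text) {
        double value = Double.parseDouble(text);
        // Double.toString round-trips: parsing the result yields the same double.
        return Double.toString(value).replace('E', 'e');
    }

    public static void main(String[] args) {
        String[] samples = {
            "0.00000804921828675072",
            "0.11753855483588400432",
            "0.28323801662637815291"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + toShortForm(s));
        }
    }
}
```

Applying this to every numeric field of the CSV before loading it removes the overlong digit strings without changing the double values that would be read from them.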

