Comments (7)

kno10 commented on May 19, 2024

It fails to parse the following two numbers:
0.11753855483588400432 0.28323801662637815291
The reason is probably that these are too precise for a double. Our parser then gives up (unfortunately without better error handling; it is supposed to report a precision overflow error), and the values are handled as strings.
The closest doubles apparently are the following:
0.117538554835884 0.28323801662637815
How come your data has 20 decimal digits of precision when a double only provides about 16? Do you need that extra precision? Or is it just used for alignment purposes in the CSV file?
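To illustrate the point about precision, here is a minimal Python sketch (not ELKI code) showing that a 20-digit decimal is rounded to the nearest double on parsing, and that 17 significant digits are always enough to recover that same double. The variable names are only for illustration.

```python
# Sketch: what happens to a 20-digit decimal when it is stored as a double.
text = "0.11753855483588400432"
x = float(text)  # rounded to the nearest IEEE 754 double

# 17 significant digits always suffice to recover the same double,
# so the extra digits in the file carry no additional information.
seventeen = format(x, ".17g")
print(float(seventeen) == x)

# repr() yields the shortest decimal string that still round-trips.
shortest = repr(x)
print(float(shortest) == x)
print(len(text), "characters in the file vs.", len(shortest), "needed")
```

Writing the file with the shorter representation therefore loses nothing that a double could have stored in the first place.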

Our parser reads the decimal digits into a long. If that overflows, it fails. And 11753855483588400432 and 28323801662637815291, the digits of the numbers above, both exceed 2^63.
Our parser handles all 18-digit and most 19-digit numbers, and a double only provides about 16 digits, so usually we have some safety margin there...
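The overflow described above can be sketched as follows. This is a simplified Python model, not the actual ELKI parser: `accumulate_digits` and `INT64_MAX` are hypothetical names, and the bound check stands in for the overflow of a Java `long`.

```python
INT64_MAX = 2**63 - 1  # largest value of a signed 64-bit long


def accumulate_digits(digits):
    """Accumulate decimal digits one by one, as a long-based parser might.

    Returns the integer value, or None if it would not fit into a
    signed 64-bit long (the "precision overflow" case).
    """
    value = 0
    for ch in digits:
        if ch not in "0123456789":
            raise ValueError("not a digit: %r" % ch)
        value = value * 10 + int(ch)
        if value > INT64_MAX:
            return None  # overflow: a long cannot hold this many digits
    return value


print(accumulate_digits("9" * 18))               # every 18-digit number fits
print(accumulate_digits("9223372036854775807"))  # largest 19-digit number that fits
print(accumulate_digits("11753855483588400432"))  # 20 digits: overflow
```

Any 18-digit number is below 10^18 and thus below 2^63, while 19-digit numbers fit only up to 9223372036854775807.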

I would accept a patch to ParseUtil#parseDouble (check for PRECISION_OVERFLOW) if it does not degrade performance. Otherwise, I would prefer a patch to improve error handling in the number vector parser that catches the precision overflow and (at least) outputs a warning. Clearly, treating these numbers as strings can be very confusing. The main motivation is that for integers, such very long integers usually indicate this is some kind of identifier column, and then automatically treating them as strings is actually helpful...

from elki.

bastian-wur commented on May 19, 2024

aaaaah, okay, thanks, that makes sense.

There's no real reason why I have 20 decimal places. I just had multiple tools that failed to parse scientific notation, so I set my script to an arbitrary number of decimals that I thought should be high enough to capture most of the numbers without rounding them to 0.
I did not consider the actual floating-point precision lol (now I feel stupid).
I haven't tested yet whether changing the precision fixes the issue, but I absolutely believe you -> issue closed.
Thanks for the quick help :).

kno10 commented on May 19, 2024

In 748252b I added a warning, "Too many digits in what looked like a double number - treating as string", which is emitted when a too-long float is interpreted as a string instead.
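The behavior after this change can be pictured roughly like this. The Python sketch below is not the code from the commit: `parse_cell` is a hypothetical helper, and the digit-count check only models the long overflow. It keeps very long integers as strings silently (they are often identifiers) and warns only when the token looked like a decimal number.

```python
import warnings

INT64_MAX = 2**63 - 1  # stand-in for the range of a Java long


def parse_cell(token):
    """Return a float if the token parses, otherwise the token as a string."""
    digits = token.lstrip("+-").replace(".", "", 1)
    looks_numeric = digits.isascii() and digits.isdigit()
    if looks_numeric and int(digits) > INT64_MAX:
        if "." in token:
            # Message text taken from the comment above.
            warnings.warn("Too many digits in what looked like a double "
                          "number - treating as string")
        return token  # kept as a string (label or identifier)
    try:
        return float(token)
    except ValueError:
        return token  # not a number at all


print(parse_cell("0.117538554835884"))       # parsed as a float
print(parse_cell("0.11753855483588400432"))  # warning, returned as a string
```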

StatguyUser commented on May 19, 2024

I am trying to cluster data from a doc2vec model using the k-means algorithm. The data set has 60000 rows and 300 dimensions, and the values are on average 22 characters long, for example 0.00000804921828675072
When I cluster this data set, I get the error below:

Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
    at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
    at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
    at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
    at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
    at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
    at [...]

Is there a data type I should select in parser.vector-type that can handle this data, or should I do something else to fix this and run it successfully?

kno10 commented on May 19, 2024

For such data, exponential formatting is more appropriate, and should work.

I.e., 8.04921828675072e-6 is the common way of storing such data.
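If the file is produced by a script, the values can be rewritten in exponential notation before loading. A minimal Python sketch, assuming each token is a plain decimal number; `to_exponential` is a hypothetical helper that picks the shortest exponential form that still parses back to the same double:

```python
def to_exponential(token):
    """Rewrite a decimal string in exponential notation without changing
    the double it represents."""
    x = float(token)
    # Increase the number of digits until the text round-trips exactly.
    for precision in range(17):
        text = f"{x:.{precision}e}"
        if float(text) == x:
            return text
    return f"{x:.16e}"  # 17 significant digits always round-trip


print(to_exponential("0.00000804921828675072"))
```

Digits beyond what a double can represent are dropped in the process, which is harmless because they could not be stored anyway.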

StatguyUser commented on May 19, 2024

Exponential formatting in the input CSV file, or is there a setting in ELKI for that?

kno10 commented on May 19, 2024

As regular notation does not contain an "e", the parser autodetects exponential notation. It literally just reads the number and, when it encounters e-6 at the end, multiplies it by 10^-6.
It is very common and everybody uses it, so it cannot be turned off; it is always on.
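The idea of "read the number, then scale by the exponent" can be sketched in a few lines of Python. This is only an illustration, not ELKI's parser; `read_exponential` is a hypothetical name, and a production parser would take more care to round only once.

```python
def read_exponential(token):
    """Read a number that may end in an exponent such as e-6."""
    mantissa, sep, exponent = token.lower().partition("e")
    value = float(mantissa)
    if sep:
        # An "e" was found: scale by the given power of ten.
        value *= 10.0 ** int(exponent)
    return value


print(read_exponential("8.04921828675072e-6"))
print(read_exponential("0.5"))  # no "e": read as regular notation
```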
