Comments (7)
It fails to parse the following two numbers:
0.11753855483588400432 0.28323801662637815291
The reason is probably that these have more digits than a double can represent. Our parser then gives up (unfortunately without proper error handling; it is supposed to report a precision overflow error), and the values are handled as strings.
The closest doubles apparently are the following:
0.117538554835884 0.28323801662637815
How come your data has 20 decimal digits of precision when a double only provides about 16? Do you need that extra precision? Or is it just used for alignment purposes in the CSV file?
Our parser reads the decimal digits into a long. If that overflows, it fails. And 11753855483588400432, the decimal digits of the first number above, exceeds 2^63 (about 9.22 × 10^18).
Our parser handles all 18-digit and most 19-digit numbers, and a double only provides about 16 digits, so usually we have some safety margin there...
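The arithmetic behind this can be checked with a short Java sketch. This is not ELKI's parser; the class and method names (`LongDigitsCheck`, `fitsInLong`) are made up for illustration, and the code only tests whether a string of decimal digits fits into a signed 64-bit `long`.

```java
import java.math.BigInteger;

class LongDigitsCheck {
  /** Returns true if the decimal digit string fits into a signed 64-bit long. */
  static boolean fitsInLong(String digits) {
    BigInteger value = new BigInteger(digits);
    return value.compareTo(BigInteger.valueOf(Long.MAX_VALUE)) <= 0;
  }

  public static void main(String[] args) {
    // Long.MAX_VALUE is 2^63 - 1, a 19-digit number.
    System.out.println(Long.MAX_VALUE); // 9223372036854775807
    // Every 18-digit number fits.
    System.out.println(fitsInLong("999999999999999999")); // true
    // 19-digit numbers fit only up to Long.MAX_VALUE.
    System.out.println(fitsInLong("9223372036854775807")); // true
    System.out.println(fitsInLong("9223372036854775808")); // false
    // The 20 decimal digits from the two values reported in this issue do not fit.
    System.out.println(fitsInLong("11753855483588400432")); // false
    System.out.println(fitsInLong("28323801662637815291")); // false
  }
}
```

Since `Long.MAX_VALUE` itself has 19 digits, this matches the statement that all 18-digit and most 19-digit numbers are handled.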
I would accept a patch to ParseUtil#parseDouble (check for PRECISION_OVERFLOW) if it does not degrade performance. Otherwise, I would prefer a patch to improve error handling in the number vector parser that catches the precision overflow and (at least) outputs a warning. Clearly, treating these numbers as strings can be very confusing. The main motivation for the current behavior is that, for integers, such very long values usually indicate some kind of identifier column, and then automatically treating them as strings is actually helpful...
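To illustrate the kind of overflow check discussed here, the following is a minimal sketch. It is not the code of ParseUtil#parseDouble and not a proposed patch; `DigitOverflowSketch`, `wouldOverflow`, and `parseDecimal` are hypothetical names, and the sketch only handles plain unsigned decimals without an exponent. It accumulates the digits in a `long`, and when the next digit would overflow, it prints a warning and falls back to `Double.parseDouble` instead of giving up.

```java
class DigitOverflowSketch {
  /** True if value * 10 + digit would exceed Long.MAX_VALUE. */
  static boolean wouldOverflow(long value, int digit) {
    return value > (Long.MAX_VALUE - digit) / 10;
  }

  /**
   * Parses a plain decimal such as "0.1175" (no sign, no exponent).
   * Fast path: digits are accumulated in a long and scaled afterwards,
   * which may differ from the correctly rounded result in the last bit.
   * On overflow, warns and falls back to Double.parseDouble.
   */
  static double parseDecimal(String s) {
    long mantissa = 0;
    int digitsSeen = 0;
    int fractionDigits = 0;
    boolean seenDot = false;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '.' && !seenDot) {
        seenDot = true;
        continue;
      }
      if (c < '0' || c > '9') {
        throw new NumberFormatException("Not a plain decimal: " + s);
      }
      int digit = c - '0';
      if (wouldOverflow(mantissa, digit)) {
        // Assumption: a warning plus a slower fallback is acceptable here.
        System.err.println("Warning: too many digits in '" + s + "', using fallback parser");
        return Double.parseDouble(s);
      }
      mantissa = mantissa * 10 + digit;
      digitsSeen++;
      if (seenDot) {
        fractionDigits++;
      }
    }
    if (digitsSeen == 0) {
      throw new NumberFormatException("No digits: " + s);
    }
    return mantissa / Math.pow(10, fractionDigits);
  }

  public static void main(String[] args) {
    System.out.println(parseDecimal("12.25"));
    System.out.println(parseDecimal("0.11753855483588400432"));
  }
}
```

Whether such a fallback is fast enough for the real parser is exactly the performance question raised above; the sketch does not attempt to answer it.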
from elki.
aaaaah, okay, thanks, that makes sense.
There's no real reason why I have 20 decimal places. I just had multiple tools which failed to parse the scientific notation for the numbers, so I set my script to an arbitrary number of decimal places that I thought would be high enough to capture most of the numbers without rounding them to 0.
I actually did not consider the actual floating point precision lol (now I feel stupid).
I haven't tested yet if changing the precision will fix the issue, but I absolutely believe you -> issue closed.
Thanks for the quick help :).
from elki.
In 748252b I added a warning, "Too many digits in what looked like a double number - treating as string", for when a too-long float is interpreted as a string instead.
from elki.
I am trying to cluster sparse data from a doc2vec model using the k-means algorithm. It has 60000*300 dimensions, and the values are about 22 characters long on average, for example 0.00000804921828675072
When I cluster this dataset, I get the error below:
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]
Is there a data type I should select in parser.vector-type that can handle this data, or should I do something else to fix this and run it successfully?
from elki.
For such data, exponential formatting is more appropriate, and should work. I.e., 8.04921828675072e-6 is the common way of storing such data.
from elki.
Exponential formatting in the input CSV file, or is there a setting in ELKI for that?
from elki.
As regular notation does not contain an "e", the parser autodetects exponential notation. It literally just reads the number and, when encountering e-6 at the end, multiplies it by 10^-6.
It is very common, everybody uses it, so it cannot be turned off; it is always on.
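If the CSV file is generated by a script, the values can be rewritten before they reach ELKI. The following is a minimal sketch in plain Java, independent of ELKI; `ExponentialNotationSketch` and `toCompact` are hypothetical names. It relies on `Double.toString`, which switches to exponential notation for magnitudes below 10^-3 (or at least 10^7) and whose output is guaranteed to parse back to the same double. Digits beyond double precision are dropped, which is fine here because they could not be stored in a double anyway.

```java
class ExponentialNotationSketch {
  /**
   * Rewrites a decimal string in a compact form that parses back to the same double.
   * Very small or very large values come out in exponential notation;
   * values between 10^-3 and 10^7 stay in plain notation.
   */
  static String toCompact(String decimal) {
    double value = Double.parseDouble(decimal);
    // Double.toString uses an upper-case 'E'; lower-case matches the notation used in this thread.
    return Double.toString(value).replace('E', 'e');
  }

  public static void main(String[] args) {
    // A 22-character value like the one in the question, printed in exponential notation.
    System.out.println(toCompact("0.00000804921828675072"));
    // Both spellings denote the same double.
    System.out.println(Double.parseDouble("8.04921828675072e-6")
        == Double.parseDouble("0.00000804921828675072")); // true
  }
}
```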
from elki.