Comments (11)
Thanks! As you may figure it out yourself, DBSCan use RNNSearch inside. When a Distance or Metric is supplied, we wrap it with LinearSearch or CoverTree. setIdenticalExcluded(true) is important for some uses including DBScan. But for your use case, it is different due to the meaning of JAR_WINKLER_DISTANCE. I will check if I should update the semantical meaning of setIdenticalExcluded, for example it means exactly same object (not just distance 0). I have to double check before making this change. Thanks!
from smile.
BTW, can you please contribute your JARO_WINKLER_DISTANCE implementation to SMILE? Thank you very much in advance! If you agree, please put it in package smile.math.distance. Thanks again!
from smile.
Nearest neighbor search is used in many density estimation and clustering algorithms. In most situations, distance 0 means that two objects are the same in the feature space. Without excluding identical data points, we may get into troubles in some cases. It is why we do setIdenticalExcluded(true) by default. I think that it is reasonable and also more efficient. But in your use case, many objects have 0 distance even if they are different. Your approach with customized LinearSearch is correct and I am glad it works. The problem comes from the distance definition not DBSCan. I won't change the default behavior otherwise many other algorithms will not work correctly. Thanks!
from smile.
Thank a lot you for fast and detailed reply.
I think I was not clear last time with my explanation, please let me rephrase it.
My task is to cluster the set of strings into groups.
In my case, distance 0 is possible only with equal or identical (equal by reference in Java terms) strings.
The problem occurs when I have two identical Java strings in dataset.
They have distance 0 and they are reference-equal. I expect them to be in one cluster, but because of identicalExcluded = false
, objects are labeled as noise.
You can check that with dataset {"123", "123"} strings are labeled as noise because they are the same Java objects (thanks, Java string pool), but with dataset {new String("123"), new String("123")} objects are clustered together because they have different references and distance 0.
In my opinion this example exposes inconsistent behavior because the algorithm depends on reference equality in Java language.
Anyway, I am not a specialist at machine learning, I just want to be sure you understand me right because of my terrible English. Thanks again for reviewing this issue!
About Jaro-Winkler distance implementation
In my case I used implementation provided by Apache commons library, please see code below:
public static final Distance<String> JARO_WINKLER_DISTANCE = new Distance<String>() {
@Override
public double d(String from, String to) {
return 1.0 - StringUtils.getJaroWinklerDistance(from, to);
}
};
Do you think I should add dependency below in SMILE and add distance to package smile.math.distance
?
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.3.2</version>
</dependency>
Many thanks for your time.
from smile.
Thanks for explanation. Value equality and reference equality are treated differently in almost all programming languages. I feel that our behavior is correct in this sense. In DBScan, we need to query every object's neighborhood. If we return the query object in the neighborhood, we will get into a self-reference loop. Although our code avoids getting into dead loop, it is still not a desired behavior to including the query object in the neighborhood. It is the purpose of setIdenticalExcluded.
So it is clear that Java pool string literal, which is the root cause. As you said, new String() is probably the right solution. With other data, I observe that the result is worse if I do setIdenticalExclude(false). Even though it may work for you, be careful with the clustering results.
Since you use apache commons for Jar-Winkler distance, it is not necessary to include it here. Thanks!
from smile.
Thanks for clear and detailed explanation.
With your permission I'd like to add couple of lines to documentation of RNNSearch
class, explaining default behavior.
If you are OK with this, I'll open pull request.
Also, I will follow your advice and rewrite my code to use new String()
approach
from smile.
Sure. Go ahead adding comments and make a pull request. Thanks!
from smile.
Btw, not sure your use case. But usually BKTree and CoverTree are much faster than linear search. We do have a very efficient implementation of edit distance. If your data is large, maybe worth of a try.
from smile.
Thanks for advice, but according to this article, Jaro-Winkler is not a metric because it does not satisfy triangle equality. It seems I can use BKTree
and CoverTree
only with metric.
Do we have better approach than LinearSearch
when distance does not satisfy triangle inequality?
BTW, my dataset should not be more than 10K objects.
from smile.
Since your data is not big, it is okay with linear search. I don't know your application, which may need that string distance. Just suggest to try edit distance if it makes sense in your case. It really depends on the use case. Speed is only a secondary consideration for small data.
from smile.
Many thanks for help.
I am closing the issue.
from smile.
Related Issues (20)
- Can't merge 2 KMeans clustering trained models calculated in different locations in codes HOT 1
- On version 3.0.0 Scala, how to serialise a model to a file or over a network ? HOT 2
- Need ReadMe guide on model training, model merging, model validation and model serialisation HOT 4
- 404 error when accessing the kotlin api documentation in the project website HOT 1
- Add XMeans with float array type HOT 5
- Ability to stop TSNE, possibly other time-heavy computations HOT 1
- rbf kernel? HOT 3
- [FEATURE PROPOSAL] ARPACK wrapper functions HOT 2
- Exception in thread "main" java.lang.UnsatisfiedLinkError: no jniopenblas_nolapack in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib HOT 13
- Tree Representation for Regression Models in Google Earth Engine HOT 1
- Takes more memory for LSH model in NearestNeighborSearch HOT 6
- Gamma random number generator support only integer shape parameter. HOT 1
- java.lang.StackOverflowError HOT 1
- svm HOT 1
- Jitpack builds are failing since 3.x HOT 2
- FR: Compact "how to load dirty data" example HOT 1
- Arff.java writeField can fail when type isn't in the list of handled types HOT 1
- BarPlot.getUpperBound() computes wrong bound. HOT 1
- FR: Warn before trying to train where the label column has any nulls HOT 1
- Dot product Question HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from smile.