Hi, thank you for the really nice and helpful library, I have a

The pointer hierarchy comes from the (fast) SLINK algorithm, but nevertheless it

Hi, thank you for your very fast answer. I w

I check the implementation of <a href="https://fastutil.di.unimi.it/docs/it/unimi/dsi/

Hierarchical Clustering Questions about elki HOT 17 CLOSED

sven-h commented on June 10, 2024

Hierarchical Clustering Questions

from elki.

Comments (17)

kno10 commented on June 10, 2024

The pointer hierarchy comes from the (fast) SLINK algorithm, but nevertheless it is on my wishlist to switch to the scipy-style list structure, which makes some postprocessing easier.
Such a transformation can be found in some of the clustering extraction classes, e.g., ClustersWithNoiseExtraction
Except for single-linkage with SLINK, almost all of these methods need to maintain a pairwise distance matrix of the clusters - initially of all points. Rewriting this to a "ragged" matrix is not too hard - which would make this limited largely by your memory and run time. A rewrite using a java Buffer instead of an array (which is indexed with a long) could be both easier and more efficient. But these methods are simply not very scalable, and I don't think it is the best way to proceed. For special cases - in particular single-link, but also Ward linkage - things can be optimized better. But this also is not included in the release yet. When working with data of this size, using some form of data aggregation (as done, e.g., in BIRCH/BETULA) most likely is desirable to not get quadratic-to-cubic runtime.

What data size are you interested in, what is the current runtime at 2^16 instances - and how much runtime are you willing to invest?

from elki.

sven-h commented on June 10, 2024

Hi,

thank you for your very fast answer.

I will have a look at ClustersWithNoiseExtraction and see how far I can get.
I quickly ran a test with 60,000 instances and it takes 151 seconds. My target is around 150,000 instances but I also have enough time (up to several days) and main memory (up to 1TB). Only the java array size (implementation specific) is currently the limiting factor.
Are you referring to a java DoubleBuffer(because at least the interface looks like it can only handle the same amount as arrays)? I thought the ragged array is also a good data structure especially for such algorithms. On the other hand, a custom Buffer indexed by a long such as this one on stackoverflow would work as well and require less rewriting (as you already mentioned). Do you think a ragged array or a custom buffer is the better way to go?

from elki.

kno10 commented on June 10, 2024

I thought DoubleBuffer would be long indexed, but apparently I am wrong. Maybe I had the Unsafe class in mind, which has getDouble(long address). Either way, if you are using BLAS to compute the distance matrix, it probably already originates off-heap. Then directly using this memory may save you making a copy. But working with ragged arrays isn't too bad here, either.

ELKI already uses fastutil in a number of places, which includes a DoubleBigArrayBigList (when building a bundle jar, remove the exclude lines from addons/bundle/build.gradle, we currently exclude these classes to reduce the jar size). These classes are usually very efficient and can be recommended. This is likely the easiest one to use instead.

151 seconds isn't too bad - which algorithm and which linkage did you use?

from elki.

sven-h commented on June 10, 2024

I check the implementation of DoubleBigArrays as well as DoubleBigArrayBigList. The former just provides static methods to get/set values of a double[][]. The latter one only has the advantage to add elements on the fly which is not needed here (at least from my perspective).

Most of the algorithms directly modify the double[] in MatrixParadigm (so this is more like a data storage). Maybe it is possible to have an interface such that the real implementation (using ragged arrays etc) is hided and only necessary methods like getter/setter etc are provided (but I think this would mean a lot of rewrite in many HAC classes).

For the simple experiments I used AnderbergHierarchicalClustering and single linkage (the linkage is just an example and can be also any other linkage strategy).

The question is how to proceed here. Maybe I'm able to provide an implementation of AGNES (in the beginning) with this new kind of implementation but I'm not sure if I can also change the other implementations at all (I would be interested in the AnderbergHierarchicalClustering because is a lot faster due to the efficient search for the next merge).

from elki.

kno10 commented on June 10, 2024

Anderberg is certainly the better starting point than AGNES, it is surprisingly simple (that is part of why its fast). It will not be sufficient to substitute matrixparadigm, because it writes directly into the underlying matrix (to improve performance, it exploits the triangular layout, and does not go through a get(x,y) getter that repeats certain calculations unnecessarily; so we eventually removed this abstraction).
But it may be sufficient to copy&paste "fork" these two classes. You may want to change the output format to your liking while at it, and produce such a linkage table instead of the pointer hierarchy.

from elki.

sven-h commented on June 10, 2024

I would stick to the pointer hierarchy to fulfill the requirements of your library (and just provide a transformation to the table style).
Would you accept a pull request and have a look at the proposed code such that others also profit from it?
Or do you think the requirement for higher number of examples are out of scope for ELKI?

Maybe the two versions can exist in ELKI side by side?

Still, Anderberg looks more complicated (maybe also due to the triangular layout exploits) than simple AGNES.

from elki.

kno10 commented on June 10, 2024

Our AGNES does the same optimizations wrt. to iterating the linearized diagonal form (which is also used in scipy, btw), so there is no difference there.
I'd certainly appreciate a pull request to add a "AnderbergBig" class.
It is definitely worth benchmarking at what cost it is possible to add back the abstraction that would allow replacing the MatrixParadigm class only (at which point it would be possible to dynamically switch to a BigMatrixParadigm depending on the data set size). It may well be that a "getLinear(long i)" and using long computations for the index in regular Anderberg comes at little enough cost due to Java inlining the getter and then optimizing bounds checks anyway. I'd avoid switching to a "get(int x, int y)" getter.
Although if I'd need to scale this to larger data myself, I'd likely implement Anderberg from scratch in Rust now; this will likely optimize even better and might be 2x-3x faster.

P.S. single-link is a special case, in which certain effects cannot occur, hence the scalability of other linkages could be worse (but probably not by much). Single-link can be implemented with O(n) memory, too, without such a matrix in the first place.

from elki.

kno10 commented on June 10, 2024

Branch https://github.com/elki-project/elki/tree/feature/newhac has a rewrite to a merge history representation that is likely a bit easier to use. It also uses more integer indexes rather than DBIDs (representing cluster numbers 0..2n-2) as we no longer identify clusters with the last object as in the SLINK pointer hierarchy.
This is a preparation for (hopefully) an even faster HAC to come in a year or two (when researched, implemented, and published). That one hopefully will also get rid of any 65k size limit.
I will also rewrite and merge an NNChain variant that does not need a pairwise distance matrix (and hence use only linear memory), which already exists in a separate private branch.
If above link does not work anymore, the branch was merged into main.

from elki.

sven-h commented on June 10, 2024

That sounds cool. Thank you for the information.
I will have a look at it.

from elki.

kno10 commented on June 10, 2024

I have merged the branch into main. It was 20% faster (for AGNES, Anderberg) in a brief test.
I did not remove the 65k limit yet, but it should be easier than before.

from elki.

sven-h commented on June 10, 2024

Great, thanks a lot.
One more question: Do you also plan to release a new version to maven central in the near future?

from elki.

kno10 commented on June 10, 2024

I do want to make a new release because of the many new features in ELKI, but usually I want these releases to come along with a supporting publication to make them easier to cite; this will need some time and preparation.
There are some todos for the next release I did not yet have the time to tackle: use some of the newer Java language features such as "var" types for readability, rewrite the logging layer to allow using, e.g., slf4j, log4j (but we have these progress bars implemented via logging, that will be more difficult to support efficiently when using some logging abstraction). etc.

from elki.

kno10 commented on June 10, 2024

A new release (without the logging change) has been submitted as a demo paper, and I hope to release 0.8.0 end of the summer.

from elki.

sven-h commented on June 10, 2024

ok, great. Thank you for the information.

from elki.

kno10 commented on June 10, 2024

The demo paper has been accepted, it will appear at SISAP 2022 in Bologna, October 5-7.
The ELKI 0.8.0 release will be prior to the conference.
c.f., https://sisap.org/2022/papers.html

from elki.

sven-h commented on June 10, 2024

Congrats.
Good to hear that a new release will be prepared.

from elki.

kno10 commented on June 10, 2024

ELKI 0.8.0 is on maven, but with a new artifact group id, "io.github.elki-project".

from elki.

Hierarchical Clustering Questions about elki HOT 17 CLOSED

Comments (17)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent