martinfleis / clustergram

Clustergram - Visualization and diagnostics for cluster analysis in Python

Home Page: https://clustergram.readthedocs.io

License: MIT License

Python 9.60% Jupyter Notebook 86.55% TeX 3.85%

clustergram's Issues

skip k=1 for K-Means

k=1 does not need to be modelled for K-Means: the cluster centre is simply the mean of the input array. All the other methods still require k=1 to be fitted, e.g. to fit a Gaussian.

Skip k=1 in all K-Means implementations to avoid unnecessary computation.
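
A minimal sketch of the shortcut, assuming the data fits in a NumPy array. The helper name `kmeans_k1_center` is hypothetical and is not part of clustergram:

```python
import numpy as np


def kmeans_k1_center(data):
    """Return the single cluster centre for k=1: the column-wise mean."""
    data = np.asarray(data, dtype=float)
    return data.mean(axis=0, keepdims=True)  # shape (1, n_features)


data = np.array([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]])
center = kmeans_k1_center(data)
# With a single cluster, every observation gets label 0.
labels = np.zeros(len(data), dtype=int)
```

No model is fitted here, so the cost is one pass over the data.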

Plotting breaks on GPU

Clustergram.plot() doesn't seem to work with the GPU-based backend in a recent RAPIDS container. The cause appears to be some changes in cupy and an issue with computing PCA.

Can this work with clusters made by top2vec?

Thanks for your interesting package.

Do you think Clustergram could work with top2vec?
https://github.com/ddangelov/Top2Vec

I saw that there is the option to create a clustergram from a DataFrame.

In top2vec, each "document" to cluster is represented as an embedding of a certain dimension, 256 for example.

So I could indeed generate a data frame, like this:

x0	x1	...	x255	topic
0.5	0.2	....	-0.2	2
0.7	0.2	....	-0.1	2
0.5	0.2	....	-0.2	3

Does Clustergram assume anything about the rows of this data frame?
I saw that the from_data method takes either "mean" or "median" as the method to calculate the cluster centers.

With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?

top2vec also calculates the "topic vectors" as a mean of the "document vectors", I believe.
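
To make the question concrete, here is a rough sketch of how per-topic centres could be computed from such a data frame with plain pandas. It does not call clustergram or top2vec; the tiny 4-dimensional frame is made-up illustration data, and scaling rows to unit length is an assumption on my side, chosen because for unit vectors the squared Euclidean distance equals 2 × (1 − cosine similarity), so both measures rank neighbours the same way:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for top2vec output: short document vectors plus a topic label.
df = pd.DataFrame(
    {
        "x0": [0.5, 0.7, 0.5, -0.1],
        "x1": [0.2, 0.2, 0.2, 0.9],
        "x2": [0.1, 0.0, 0.3, 0.4],
        "x3": [-0.2, -0.1, -0.2, 0.3],
        "topic": [2, 2, 3, 3],
    }
)

features = df.drop(columns="topic")

# Scale each row to unit length so Euclidean geometry mirrors cosine geometry.
norms = np.linalg.norm(features.to_numpy(), axis=1)
unit = features.div(norms, axis=0)

# One centre per topic; swap .mean() for .median() to get the median variant.
centers = unit.groupby(df["topic"]).mean()
```

Each row of `centers` is then the mean of the normalised document vectors of one topic.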

ENH: add bokeh plotting backend

With some larger clustergrams it may be quite useful to have the ability to zoom to certain places interactively. I think that a bokeh plotting backend would be good for that.

Support hierarchical clustering

Supporting hierarchical clustering as in Schonlau's original paper would be nice. Not sure whether to use scipy or sklearn; will have to explore.
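
One possible route with scipy, sketched under assumptions: the tree is built once with `linkage` and then cut at each k with `fcluster`. The Ward method and the synthetic blobs are illustrative choices only, not a decision about the eventual implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Three synthetic blobs along the diagonal, for illustration only.
data = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)]
)

# Build the hierarchy once.
tree = linkage(data, method="ward")

# Cut the same tree into at most k flat clusters for each k (labels start at 1).
labels_by_k = {k: fcluster(tree, t=k, criterion="maxclust") for k in range(1, 6)}
```

Because the tree is computed only once, adding more values of k costs little beyond the extra cuts.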

Support multiple PCAs

The current way of weighting by PCA is hard-coded to use the first principal component. But it could be useful to see clustergrams weighted by other components as well.
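
A rough sketch of what a selectable component could look like, using only NumPy. The function `pca_weighted_centers` and its `component` parameter are hypothetical and do not mirror the package internals; the sketch simply projects cluster centres onto the chosen principal axis of the data:

```python
import numpy as np


def pca_weighted_centers(data, centers, component=0):
    """Project cluster centres onto one principal component of the data."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    mean = data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return (centers - mean) @ vt[component]


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])
centers = np.array([data[:100].mean(axis=0), data[100:].mean(axis=0)])

first = pca_weighted_centers(data, centers, component=0)
second = pca_weighted_centers(data, centers, component=1)
```

The two resulting arrays could serve as the y-values of two separate clustergrams, or as two axes of the 3D variant.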

And it would be super cool to get a 3d version with the first component on one axis and a second one on the other (not sure how useful though :D).

Allow MiniBatchKMeans

For clustergram, MiniBatchKMeans should be good enough whilst providing a significant speedup.
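
For reference, a small sketch of fitting a range of k with scikit-learn's `MiniBatchKMeans`. The synthetic data, the batch size and the range of k are arbitrary illustration values, and how this would be wired into clustergram is left open:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Three synthetic blobs, for illustration only.
data = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0.0, 3.0, 6.0)]
)

labels_by_k = {}
for k in range(2, 6):
    # Each fit updates the centres from small random batches, not the full data.
    model = MiniBatchKMeans(n_clusters=k, batch_size=64, n_init=3, random_state=0)
    labels_by_k[k] = model.fit_predict(data)
```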

Optionally measure silhouette and other metrics

Add an option to measure additional metrics to assess the results of clustering, such as the silhouette score or the Calinski-Harabasz score.
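
As a sketch of what such an option could compute, both metrics are available in `sklearn.metrics`. The loop below is only an illustration on synthetic data and assumes plain `KMeans`; it says nothing about how the option would be exposed in the API:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(0)
# Three synthetic blobs, for illustration only.
data = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0.0, 3.0, 6.0)]
)

scores = {}
for k in range(2, 6):  # both metrics need at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = {
        "silhouette": silhouette_score(data, labels),
        "calinski_harabasz": calinski_harabasz_score(data, labels),
    }

# Higher is better for both metrics.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
```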

As of 0.7.0, test coverage is at 99%, but on GHA we are only able to get to ~80% since the rest requires the rapids.ai stack running on a GPU. I can test locally now thanks to the machine at the university, but I'll likely lose access to it at some point.

It would be ideal to run CI on a GPU automatically on every PR, but I am not aware of any option that does not charge for that. Opening this issue to keep an eye on the options.

A minor point, but since plotting is the core function of this package I think it warrants mentioning: I would recommend ensuring that the x-axis ticks on all clustergram diagrams are natural numbers, since there are no fractional numbers of clusters.
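
One way to get integer ticks in matplotlib is `MaxNLocator(integer=True)`. A minimal sketch on a placeholder line plot (the plotted values are made up and stand in for a clustergram):

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

fig, ax = plt.subplots()
k_values = [1, 2, 3, 4, 5, 6, 7, 8]
placeholder = [0.0, 1.0, 1.5, 1.8, 2.0, 2.1, 2.15, 2.2]  # stand-in for clustergram lines
ax.plot(k_values, placeholder)

# Restrict major ticks on the x-axis to integer positions.
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.set_xlabel("Number of clusters (k)")

xticks = ax.get_xticks()
```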

Follow the sklearn API guide

From openjournals/joss-reviews#5240 (comment)

The author claims to embrace scikit-learn’s API style, however that is unfortunately not entirely achieved in my view. While the main class does provide an estimator-like interface, accepting hyperparameters as constructor arguments and offering a fit() function, other parts of the scikit-learn API guide are ignored, such as avoiding the modification of hyperparameters within the constructor, storing fitted parameters with an underscore suffix, and allowing for easy composability. Other plotting classes within the scikit-learn package typically implement “from_estimator” and “from_predictions” class methods (e.g. ConfusionMatrixDisplay, PredictionErrorDisplay, etc.). In this way it is easy to use the class with any class that would provide clustering data. The current implementation of the class provides from_data() and from_centers() methods, which achieve similar, but not directly compatible behavior.
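
To illustrate the conventions named in the review, here is a toy sketch. `GroupCenters` is a hypothetical class invented for this example and is not clustergram's API; it only shows hyperparameters stored untouched in the constructor, validation deferred to fit(), and fitted state kept in attributes with a trailing underscore:

```python
import numpy as np


class GroupCenters:
    """Hypothetical estimator used only to illustrate the conventions."""

    def __init__(self, method="mean"):
        # Store hyperparameters exactly as given; no validation or mutation here.
        self.method = method

    def fit(self, X, labels):
        # Validation happens at fit time, not in the constructor.
        if self.method not in ("mean", "median"):
            raise ValueError("method must be 'mean' or 'median'")
        X = np.asarray(X, dtype=float)
        labels = np.asarray(labels)
        reducer = np.mean if self.method == "mean" else np.median
        # Fitted state carries a trailing underscore.
        self.classes_ = np.unique(labels)
        self.cluster_centers_ = np.vstack(
            [reducer(X[labels == c], axis=0) for c in self.classes_]
        )
        return self  # returning self allows chaining


est = GroupCenters().fit([[0, 0], [2, 2], [10, 10]], [0, 0, 1])
```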

Allow manual input

Allow manual input of cluster centers, data and labels to generate a clustergram based on unsupported clusterings, like those from spopt.

API: make Clustergram class

Change the API to have a single class with .fit(), which can then give you the plot and the means df.