martinfleis / clustergram
Clustergram - Visualization and diagnostics for cluster analysis in Python
Home Page: https://clustergram.readthedocs.io
License: MIT License
k=1 does not need to be modelled for k-means: the cluster centre is simply the mean of the input array. All the other options still require fitting k=1, e.g. to fit a Gaussian.
Skip k=1 in all k-means implementations to avoid unnecessary computation.
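The shortcut proposed above can be sketched in a few lines. This is a minimal illustration, not the package's implementation; the helper name `single_cluster_center` and the sample data are hypothetical.

```python
from statistics import fmean


def single_cluster_center(rows):
    """Return the k=1 centre: the per-column mean of the input rows."""
    # zip(*rows) transposes the row-major data into columns.
    return [fmean(column) for column in zip(*rows)]


# Hypothetical data: three observations with two features each.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
center = single_cluster_center(data)
print(center)
```

With a single cluster every observation carries the same label, so no iterative fitting is needed to obtain this centre.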
Clustergram.plot() doesn't seem to work with the GPU-based backend in a recent RAPIDS container. There have been some changes in cupy, and there is an issue with computing PCA.
Thanks for your interesting package.
Do you think Clustergram could work with top2vec?
https://github.com/ddangelov/Top2Vec
I saw that there is the option to create a clustergram from a DataFrame.
In top2vec, each "document" to cluster is represented as an embedding of a certain dimension, 256 for example.
So I could indeed generate a data frame, like this:
| x0 | x1 | ... | x255 | topic |
|---|---|---|---|---|
| 0.5 | 0.2 | ... | -0.2 | 2 |
| 0.7 | 0.2 | ... | -0.1 | 2 |
| 0.5 | 0.2 | ... | -0.2 | 3 |
Does Clustergram assume anything about the rows of this data frame?
I saw that the from_data method takes either "mean" or "median" as the method to calculate the cluster centers.
With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?
top2vec also calculates the "topic vectors" as the mean of the "document vectors", I believe.
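One way to think about the cosine-distance question, offered as a sketch rather than a statement about what Clustergram does internally: if every embedding is scaled to unit length first, squared Euclidean distance equals twice the cosine distance, so Euclidean means and cosine-based comparisons rank vectors consistently. The helper names and the three-dimensional sample vectors below are hypothetical, and the code assumes non-zero vectors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Return the squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical shortened "document vectors".
doc_a = [0.5, 0.2, -0.2]
doc_b = [0.7, 0.2, -0.1]

lhs = squared_euclidean(l2_normalize(doc_a), l2_normalize(doc_b))
rhs = 2.0 * cosine_distance(doc_a, doc_b)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Note that the mean of unit vectors is generally shorter than unit length, so a centre computed this way may need re-normalising before it is compared with cosine distance.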
With some larger clustergrams, it may be quite useful to be able to zoom in on certain places interactively. I think the bokeh plotting backend would be good for that.
Supporting hierarchical clustering, as in Schonlau's original paper, would be nice. Not sure whether to use scipy or sklearn; will have to explore.
The current way of weighting by PCA is hard-coded to use the first principal component. But it could be useful to see clustergrams weighted by other components as well.
And it would be super cool to get a 3D version with the first component on one axis and the second on another (not sure how useful though :D).
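A rough sketch of what selecting another component could look like, assuming the weighting amounts to projecting cluster centres onto a principal axis of the data. That reading is an assumption, and the function `project_on_component` and its `component` parameter are hypothetical, not part of the package.

```python
import numpy as np


def project_on_component(data, centers, component=0):
    """Project cluster centres onto one principal axis of the data."""
    mean = data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return (centers - mean) @ vt[component]


# Hypothetical data that varies mostly along the first feature.
data = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [2.0, 1.0], [2.0, -1.0]])
centers = np.array([[0.0, 0.0], [4.0, 0.0]])

first = project_on_component(data, centers, component=0)
second = project_on_component(data, centers, component=1)
print(first, second)
```

The sign of each principal axis is arbitrary, so only the relative positions of the projected centres are meaningful. The 3D idea would use `first` and `second` as two of the plot axes.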
For clustergram, MiniBatchKMeans should be good enough whilst providing a significant speedup.
Add an option to measure additional metrics to assess the results of clustering, like a silhouette score or Calinski-Harabasz.
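For reference, the Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The dependency-free sketch below only illustrates the arithmetic; in practice scikit-learn's `sklearn.metrics.calinski_harabasz_score` and `silhouette_score` would be the natural choice. The function name and sample points are hypothetical.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Return (B / (k - 1)) / (W / (n - k)) for a labelled point set."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    n, k = len(points), len(groups)
    if not 1 < k < n:
        raise ValueError("need more than one cluster and fewer clusters than points")

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    # B: size-weighted squared distance of each cluster mean from the overall mean.
    between = sum(len(rows) * sq_dist(mean(rows), overall) for rows in groups.values())
    # W: squared distance of each point from its own cluster mean.
    # W is zero for perfectly tight clusters, which raises ZeroDivisionError here.
    within = sum(sq_dist(p, mean(rows)) for rows in groups.values() for p in rows)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical, well-separated toy data.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
score = calinski_harabasz(points, labels)
print(score)
```

Computing such a score for every k in the range would give one value per column of the clustergram, which could be reported alongside the plot.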
As of 0.7.0, test coverage is at 99%, but on GHA we can only get to ~80% since the rest requires the rapids.ai stack running on a GPU. I can test locally now, thanks to the machine at the university, but I'll likely lose access to it at some point.
It would be ideal to run CI on a GPU automatically on every PR, but I am not aware of any option that does not charge for that. Opening this issue to keep an eye on it.
Split the docs into multiple Jupyter notebooks properly illustrating the options. Add an introductory notebook to the docs as well.
Detect the size of the data and scale lines and points accordingly.
From openjournals/joss-reviews#5240 (comment)
This is a minor point, but since plotting is the core function of this package I think it warrants mentioning: I would recommend ensuring that the ticks on the x-axis of all clustergram diagrams are natural numbers, since there are no fractional numbers of clusters.
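With the matplotlib backend, one possible way to address this is `MaxNLocator(integer=True)` from `matplotlib.ticker`. The sketch below uses a plain line plot with made-up values as a stand-in for a clustergram; it does not show how the package itself handles ticks.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs headless

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# Hypothetical values standing in for a clustergram over k = 1..7.
ks = [1, 2, 3, 4, 5, 6, 7]
values = [0.0, 1.5, 2.1, 2.4, 2.6, 2.7, 2.75]

fig, ax = plt.subplots()
ax.plot(ks, values, marker="o")
# Restrict the major ticks on the x-axis to integer positions.
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.set_xlabel("Number of clusters (k)")
fig.canvas.draw()

ticks = ax.get_xticks()
print(ticks)
```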
From openjournals/joss-reviews#5240 (comment)
The author claims to embrace scikit-learn's API style, however that is unfortunately not entirely achieved in my view. While the main class does provide an estimator-like interface, accepting hyperparameters as constructor arguments and offering a fit() function, other parts of the scikit-learn API guide are ignored, such as avoiding the modification of hyperparameters within the constructor, storing fitted parameters with an underscore suffix, and allowing for easy composability. Other plotting classes within the scikit-learn package typically implement "from_estimator" and "from_predictions" class methods (e.g. ConfusionMatrixDisplay, PredictionErrorDisplay, etc.). In this way it is easy to use the class with any class that would provide clustering data. The current implementation of the class provides from_data() and from_centers() methods, which achieve similar, but not directly compatible behavior.
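The conventions the reviewer lists can be illustrated with a toy estimator. The class `MeanCenter` below is entirely hypothetical and unrelated to the package's code; it only shows hyperparameters stored unmodified in the constructor, validation deferred to fit(), a fitted attribute with a trailing underscore, and fit() returning self so calls can be chained.

```python
from statistics import fmean, median


class MeanCenter:
    """Toy estimator (hypothetical) following scikit-learn conventions."""

    def __init__(self, method="mean"):
        # Store the hyperparameter exactly as passed; no validation or mutation here.
        self.method = method

    def fit(self, X):
        # Validate hyperparameters at fit time instead of in the constructor.
        if self.method not in {"mean", "median"}:
            raise ValueError(f"unknown method: {self.method!r}")
        reducer = fmean if self.method == "mean" else median
        # Fitted state carries a trailing underscore.
        self.center_ = [reducer(column) for column in zip(*X)]
        # Returning self allows chaining such as MeanCenter().fit(X).center_.
        return self


est = MeanCenter(method="mean").fit([[1.0, 2.0], [3.0, 4.0]])
print(est.center_)
```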
Allow manual input of cluster centers, data and labels to generate a clustergram based on unsupported clusterings, like those from spopt.
Change the API to have a single class with .fit(), which can then give you the plot and the means df.