clustergram's People

Contributors

martinfleis, matthew-law, pre-commit-ci[bot]

clustergram's Issues

skip k=1 for K-Means

k=1 does not need to be modelled for K-Means: the cluster centre is simply the mean of the input array. All the other options still require k=1 to be fitted, e.g. to fit a Gaussian.

Skip k=1 in all k-means implementations to avoid unnecessary computation.
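The shortcut described above can be sketched as follows. This is only an illustration assuming NumPy; `single_cluster_centre` is a hypothetical helper, not a function in clustergram.

```python
import numpy as np


def single_cluster_centre(data):
    """Return the k=1 cluster centre, i.e. the column-wise mean of the data.

    Hypothetical helper for illustration; not part of clustergram.
    """
    data = np.asarray(data, dtype=float)
    # With a single cluster, the point minimising the sum of squared
    # distances to all observations is their mean, so no model is fitted.
    return data.mean(axis=0, keepdims=True)


data = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]])
centre = single_cluster_centre(data)
```

The result has shape `(1, n_features)`, the same layout a fitted K-Means model would report for a single centre.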

Plotting breaks on GPU

Clustergram.plot() doesn't seem to work with the GPU-based backend in a recent RAPIDS container. The likely causes are some changes in cupy and an issue with computing PCA.

Can this work with clusters made by top2vec?

Thanks for your interesting package.

Do you think Clustergram could work with top2vec?
https://github.com/ddangelov/Top2Vec

I saw that there is the option to create a clustergram from a DataFrame.

In top2vec, each "document" to cluster is represented as an embedding of a certain dimension, 256, for example.

So I could indeed generate a data frame, like this:

x0 x1 ... x255 topic
0.5 0.2 .... -0.2 2
0.7 0.2 .... -0.1 2
0.5 0.2 .... -0.2 3

Does Clustergram assume anything about the rows of this data frame?
I saw that the from_data method takes either "mean" or "median" as the method to calculate the cluster centers.

With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?

I believe top2vec also calculates the "topic vectors" as the mean of the "document vectors".
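On the cosine-distance question, one general observation may help: for vectors scaled to unit length, the squared Euclidean distance equals twice the cosine distance, so L2-normalising the embeddings first makes the two measures rank pairs the same way. The sketch below only demonstrates that identity with NumPy. `l2_normalise` is a hypothetical helper, and whether such preprocessing suits Clustergram or top2vec is an assumption, not something confirmed here.

```python
import numpy as np


def l2_normalise(vectors):
    """Scale each row to unit length; all-zero rows are left unchanged.

    Hypothetical helper for illustration only.
    """
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms == 0, 1.0, norms)


a = np.array([[3.0, 4.0]])
b = np.array([[1.0, 0.0]])
a_unit, b_unit = l2_normalise(a), l2_normalise(b)

# Cosine distance between the two (now unit-length) vectors.
cosine_distance = 1.0 - float(np.dot(a_unit[0], b_unit[0]))
# Squared Euclidean distance between the same unit vectors.
squared_euclidean = float(np.sum((a_unit - b_unit) ** 2))
```

Here `squared_euclidean` comes out as twice `cosine_distance`, which is the identity mentioned above.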

ENH: add bokeh plotting backend

With some larger clustergrams, it may be quite useful to be able to zoom in to certain places interactively. I think that a bokeh plotting backend would be good for that.

Support hierarchical clustering

Supporting hierarchical clustering as in Schonlau's original paper would be nice. Not sure whether to use scipy or sklearn; will have to explore.

Support multiple PCAs

The current way of weighting by PCA is hard-coded to use the first principal component. But it could be useful to see clustergrams weighted by other components as well.

And it would be super cool to get a 3d version with the first component on one axis and a second one on the other (not sure how useful though :D).
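A rough idea of what selecting a component could look like, assuming NumPy and a plain SVD-based PCA. `project_on_component` and its `component` argument are hypothetical and do not reflect the current clustergram API.

```python
import numpy as np


def project_on_component(data, centres, component=0):
    """Project cluster centres onto one principal component of the data.

    Hypothetical helper; `component` is a zero-based index.
    """
    data = np.asarray(data, dtype=float)
    centres = np.asarray(centres, dtype=float)
    mean = data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return (centres - mean) @ vt[component]


data = np.array([[-2.0, -1.0], [-2.0, 1.0], [2.0, -1.0], [2.0, 1.0]])
centres = np.array([[2.0, 0.0], [0.0, 1.0]])

first = project_on_component(data, centres, component=0)
second = project_on_component(data, centres, component=1)
```

The sign of each principal axis is arbitrary, so only the magnitudes of `first` and `second` are meaningful when comparing runs. The two arrays could also serve as the two axes of the 3D variant mentioned above.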

Allow MiniBatchKMeans

For clustergram, MiniBatchKMeans should be good enough whilst providing a significant speedup.

GPU CI

As of 0.7.0, we have test coverage at 99%, but on GHA we are only able to get to ~80% since the rest requires the rapids.ai stack running on a GPU. I can test locally now, thanks to the machine at the university, but I'll likely lose access to that one at some point.

It would be ideal to be able to run CI on a GPU automatically on every PR, but I am not aware of any option that does not charge for that. Opening this issue to keep an eye out for options.

DOC: refactor documentation

Split the docs into multiple Jupyter notebooks that properly illustrate the options. Add an introductory notebook to the docs as well.

Follow the sklearn API guide

From openjournals/joss-reviews#5240 (comment)

The author claims to embrace scikit-learn’s API style, however that is unfortunately not entirely achieved in my view. While the main class does provide a estimator-like interface, accepting hyper parameters as constructor arguments and offering a fit() function, other parts of the scikit-learn API guide are ignored, such as avoiding the modification of hyper parameters within the constructor, storing fitted parameters with an underscore suffix, and allowing for easy composability. Other plotting-classes within the scikit-learn package typically implement “from_estimator” and “from_predictions” class methods (e.g. ConfusionMatrixDisplay, PredictionErrorDisplay, etc.). In this way it is easy to use the class with any class that would provide clustering data. The current implementation of the class provides from_data() and from_centers() methods, which achieve similar, but not directly compatible behavior.
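The conventions named in the review can be illustrated with a toy class. This is a sketch only, assuming NumPy; `MeanCentres` is a hypothetical example, not clustergram's `Clustergram` class and not a proposed redesign.

```python
import numpy as np


class MeanCentres:
    """Hypothetical toy estimator illustrating scikit-learn conventions."""

    def __init__(self, method="mean"):
        # Store the hyperparameter untouched; validation happens in fit().
        self.method = method

    def fit(self, X, labels):
        if self.method not in ("mean", "median"):
            raise ValueError(f"Unknown method: {self.method!r}")
        reducer = np.mean if self.method == "mean" else np.median
        X = np.asarray(X, dtype=float)
        labels = np.asarray(labels)
        # Fitted state is stored with a trailing underscore.
        self.classes_ = np.unique(labels)
        self.cluster_centers_ = np.vstack(
            [reducer(X[labels == c], axis=0) for c in self.classes_]
        )
        return self


est = MeanCentres().fit([[0, 0], [2, 2], [10, 10]], [0, 0, 1])
```

After fitting, `est.method` still holds exactly what was passed to the constructor, and everything learned from the data lives in `est.classes_` and `est.cluster_centers_`.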

Allow manual input

Allow manual input of cluster centers, data and labels to generate a clustergram based on unsupported clusterings, such as those from spopt.
