Comments (7)
Hi Graham,
Sorry for the very long delay in getting back to you on this. I got rather invested in building the new version of UMAP (which I was hoping would fix some of these issues) and then this fell off my radar for a while. The new UMAP, using numba, is now in place, and I think it does fix some of your issues, though not all. I believe some of the remaining apparent issues can be corrected by more careful plotting. The end result is that I don't believe we get what you want, but the results look less bad. In particular, the default UMAP on your data gives this:
This is, admittedly, somewhat underwhelming. If we turn down `n_neighbors` to 5 and set `min_dist` to 0.0 we get the following (which shows more structure, but certainly doesn't separate your classes):
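For anyone wanting to try the same settings, here is a minimal sketch of passing them to UMAP. The helper `embed` and the name `TIGHT_LOCAL_PARAMS` are hypothetical, introduced only for illustration; the sketch assumes the `umap-learn` package is installed and that the input is a 2D array of document vectors.

```python
import numpy as np

# Parameter values quoted in the comment above; all other settings keep their defaults.
TIGHT_LOCAL_PARAMS = {"n_neighbors": 5, "min_dist": 0.0}


def embed(vectors, **umap_params):
    """Return a 2D UMAP embedding of `vectors` (hypothetical helper)."""
    import umap  # provided by the umap-learn package; imported lazily

    reducer = umap.UMAP(n_components=2, **umap_params)
    return reducer.fit_transform(np.asarray(vectors))
```

Something like `embed(doc_vectors, **TIGHT_LOCAL_PARAMS)` would then produce the coordinates to plot, where `doc_vectors` stands in for your own data.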
On the other hand, if we plot the PCA result in the same way we get this:
I think in your original iteration the apparent separation was partly due to plotting artifacts, combined with the fact that the light blue class looks to have slightly larger variance (but ultimately they look like two overlaid Gaussian blobs).
Finally, the new version of UMAP does support cosine distance, so we can at least compute with that, which makes more sense for doc2vec vectors. That results in the following:
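One common reason cosine distance suits embedding vectors is that it ignores vector length and compares direction only. The sketch below illustrates that with plain NumPy; the function `cosine_distance` and the toy vectors are illustrative assumptions, not part of the original analysis.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a  # same direction, ten times the length

euclidean = np.linalg.norm(a - b)  # grows with the difference in length
cosine = cosine_distance(a, b)     # essentially zero: the directions agree

# With umap-learn installed, the metric is selected via the `metric` argument:
#   umap.UMAP(metric="cosine").fit_transform(doc_vectors)
```

Here `doc_vectors` in the final comment is a placeholder for your own array of doc2vec vectors.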
Still not much notable separation of classes, but then given the PCA result, and these results, I am not sure there is actually good separation in 2D. I know that's not an ideal answer, or even what you were looking for, but hopefully it helps somewhat.
from umap.
That definitely looks underwhelming. How does t-SNE compare, or PCA? There may be less structure in the data than one might like. It looks more likely, however, that those two outliers are somehow messing everything up. I'll see if I can get some time and look into exactly what is going on internally. I am fairly busy at the moment with other projects, so I can't promise anything immediate. Sorry.
If you have some time, the relevant thing to do is run the internals yourself step by step and see where things are getting swamped. In particular, if you can build the fuzzy simplicial set and look at the result (a sparse matrix), I suspect the distribution of non-zero entries will be suspicious (or, at least, the logs of them, since they are probably power law distributed). You should look especially at the rows (and columns) associated with those two points that seem to end up at extremes.
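A rough sketch of that inspection, assuming you already have the fuzzy simplicial set as a SciPy sparse matrix. The function name `log_nonzero_weights` is hypothetical. In recent umap-learn releases a fitted estimator exposes this matrix as the `graph_` attribute, but treat that as an assumption to check against your installed version.

```python
import numpy as np
import scipy.sparse as sp


def log_nonzero_weights(graph):
    """Return the sorted natural logs of the strictly positive entries of a sparse matrix."""
    csr = sp.csr_matrix(graph)
    weights = csr.data[csr.data > 0]  # drop any explicitly stored zeros
    return np.sort(np.log(weights))


# Toy symmetric matrix standing in for a real fuzzy simplicial set.
toy_graph = sp.csr_matrix(np.array([
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]))
toy_logs = log_nonzero_weights(toy_graph)
```

A histogram of the returned array is one way to see whether a handful of entries dominate.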
Another thing to look at is what happens if you don't use spectral initialisation.
I took a look at the things you suggested. Using a random initialisation still looks underwhelming but there are no huge outliers. There is slightly better separation using PCA but it is still not great (though I haven't messed around with parameters).
I constructed the fuzzy simplicial set and as suspected the distribution of logs of non-zero entries is suspicious. To compare the outlying rows to "normal" rows I calculated the log distributions (and sorted them) for the outlying rows and 10 rows selected at random. What I found was the largest values in the outlying distributions were much bigger than the largest values of the other rows. I'm not sure what this means but it's something. The updated notebook is located here. Let me know if you have any ideas for further tests.
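The comparison described above could be sketched roughly as follows. This is not the notebook's code: the helper names, the toy matrix, and the choice to report only the largest log-weight per row are assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sp


def sorted_row_logs(graph, row):
    """Sorted natural logs of the positive entries in one row of a sparse matrix."""
    csr = sp.csr_matrix(graph)
    values = csr.data[csr.indptr[row]:csr.indptr[row + 1]]
    return np.sort(np.log(values[values > 0]))


def compare_rows(graph, outlier_rows, n_random=10, seed=0):
    """Largest log-weight for each outlying row and for randomly sampled other rows."""
    csr = sp.csr_matrix(graph)
    rng = np.random.default_rng(seed)
    candidates = np.setdiff1d(np.arange(csr.shape[0]), outlier_rows)
    sample = rng.choice(candidates, size=min(n_random, candidates.size), replace=False)

    def largest(row):
        logs = sorted_row_logs(csr, row)
        return float(logs[-1]) if logs.size else float("-inf")

    outlying = {int(r): largest(r) for r in outlier_rows}
    sampled = {int(r): largest(r) for r in sample}
    return outlying, sampled


# Toy matrix: row 0 carries one much larger weight than every other row.
dense = np.full((4, 4), 0.1)
np.fill_diagonal(dense, 0.0)
dense[0, 1] = 1.0
outlying, sampled = compare_rows(sp.csr_matrix(dense), [0], n_random=2)
```

On real data, `outlier_rows` would be the indices of the two points that land at the extremes of the embedding.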
I also played around a bit with language models and UMAP, and obtained some (marginally) more satisfying results, here and here.
Those are some nice results @vb690; would you mind if I referenced them in the example uses section of the documentation?
Hi @lmcinnes, sure thing!