Comments (25)
Can you share the data? This looks like a case worth debugging. I have perhaps overfit to digits ...
I would suggest you try with a significantly smaller n_neighbors; something more like 10 or 20. I do kind of suspect something has gone astray somewhere internally, however.
Edit: Sorry, I see you have already shared the data. I'll try to take a look at this later this evening when I get a chance.
from umap.
The good news is that when running it through on some of my earlier development notebooks I do get decent structure resulting, so the problem is not a fundamental algorithmic one, but rather a coding error somewhere. The bad news is I haven't tracked that down yet, but I'll see if I can figure out what is going astray.
The problem is that we have different data points that have zero distance from each other -- that's a corner case that my code does not handle well. It can certainly be fixed, but it will take a little bit of thought on my part to find the "right" way to handle that. In the meantime, as a hack, you can perturb the data ever so slightly to eliminate duplicate points and then it should work better. Let me know, and in the meantime I'll work to make sure duplicated points are properly handled.
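The suggested workaround can be sketched like this (a minimal sketch using NumPy on an invented toy matrix; `X` is a stand-in for your real data):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy data containing an exact duplicate row (stand-in for the real matrix)
X = np.array([[0.0, 1.0],
              [0.0, 1.0],   # duplicate of row 0
              [2.0, 3.0]])

# Option 1: drop exact duplicate rows before fitting
X_unique = np.unique(X, axis=0)

# Option 2: perturb every point by a tiny amount so no two rows coincide,
# leaving the geometry essentially unchanged
X_jittered = X + rng.normal(scale=1e-9, size=X.shape)
```

Either array can then be passed to the embedding in place of the original.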
Yep, I dropped duplicate rows and it works.
Interesting differences in comparison to t-SNE:
A lot happens when changing n_neighbors, though.
It would be cool to make comparisons against some cluster ground truth and calculate the Fowlkes-Mallows score.
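Such a comparison could be sketched with scikit-learn's fowlkes_mallows_score (a minimal sketch on synthetic stand-in data; the labels and embedding here are invented, not the voting dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

rng = np.random.default_rng(0)

# hypothetical ground-truth labels (e.g. party affiliation) and a fake
# 2-D embedding with one well-separated blob per label
true_labels = np.repeat([0, 1, 2], 50)
embedding = rng.normal(size=(150, 2)) + true_labels[:, None] * 5.0

# cluster the embedding, then score agreement with the ground truth;
# Fowlkes-Mallows is 1.0 for perfect agreement, and is invariant to
# how the predicted clusters happen to be numbered
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
score = fowlkes_mallows_score(true_labels, pred)
```

Running the same scoring on embeddings from UMAP and t-SNE would give a quantitative basis for the comparison.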
I just checked in some code that should fix the issue. I agree that it is an interesting comparison. Given the labelling you have, I would have to say that t-SNE definitely looks better -- it has isolated more of your label groups. I'll have to see if I can figure out what is causing that. Thanks for providing such an interesting example to work with!
Playing with it a little it looks like a small n_neighbors is going to help, but that crashes due to some other issues (too many points at distance 0; clearly I need yet more robust handling of that). Still a work in progress clearly.
No problem, very happy to help/cause headaches. To be clear, the labels (colors) are party affiliation. So the dots (individuals) do not need to vote unanimously, but there should be some clustering.
I also noticed that a small n_neighbors helps, but it also creates weird outliers.
The outliers would be issue #3. I fixed that in the latest round of commits earlier today, but that results in the whole thing breaking for this particular dataset. Still working on figuring out the best way to handle things so it will be suitably robust to such cases.
Keep up the good work :)
Getting close. The latest version should support lower n_neighbors (which is similar to perplexity in t-SNE). We need a lower value for this dataset since there are small distinct groups. The problem previously was some of those groups were all identical points (presumably groups of party members voting identically down the party line?). I believe I have made the algorithm robust to such cases now, and I get better looking results (more akin to the t-SNE results you have). If you want to pull from master and rebuild hopefully it will work better as long as you use a suitably low n_neighbors value. Since you obviously understand the dataset better you can let me know if that provides a result that meets with your intuition of the sort of thing you would expect to get.
I've done a similar comparison with French electoral results from the first round of the 2017 presidential election (11 candidates), and in my case umap looks a bit broken:
https://gist.github.com/Fil/e0ecb37c181b7c960a33316375f0a002
Maybe it's the way I prepare the data, or maybe the fact that I kept the "small" candidates, which adds a lot of noise.
This looks suspiciously like the robustness issues that occurred here and in issue #3. Let me know if you are using the latest master; if so, I clearly have some more work to do to clean this up. Thanks for the feedback -- it is very much appreciated. This is still very much in the experimental stages, so finding all the corner cases (such as these) is very important.
Looking much better now. Here with n_neighbors set to 8.
Or at least it is close to t-SNE. But to be honest, I don't have that good an intuition about the data, so it's not clear that t-SNE is in any way a gold standard. But I'm super glad I could help in pinning down corner cases.
Thank you so much for all the help working through the issues. It has been greatly appreciated. If you would be interested in actually digging into the algorithm and/or contributing to the code I would be more than happy to help ease you into it, and explain the goings on as best I can. Let me know.
I probably had yesterday's version. But now I can't manage to reinstall it, even after starting a new clean virtualenv (sorry!).
The error is
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-a1231009814f> in <module>()
3 df = pd.read_csv('~/Sites/legislatives/cartogramme/circo.csv')
4 table = df[df.columns[3:20]].div(df.Inscrits, axis=0)
----> 5 embedding = umap.UMAP().fit_transform(table.as_matrix())
/Users/fil/Source/umap/build/lib.macosx-10.7-x86_64-3.6/umap/umap_.py in fit_transform(self, X, y)
170 Embedding of the training data in low-dimensional space.
171 """
--> 172 self.fit(X)
173 return self.embedding_
/Users/fil/Source/umap/build/lib.macosx-10.7-x86_64-3.6/umap/umap_.py in fit(self, X, y)
137 X = X.astype(np.float64)
138
--> 139 graph = fuzzy_simplicial_set(X, self.n_neighbors, self.oversampling)
140
141 if self.n_edge_samples is None:
umap/umap_utils.pyx in umap.umap_utils.fuzzy_simplicial_set (umap/umap_utils.c:5932)()
umap/umap_utils.pyx in umap.umap_utils.fuzzy_simplicial_set (umap/umap_utils.c:5054)()
umap/umap_utils.pyx in umap.umap_utils.smooth_knn_dist (umap/umap_utils.c:3101)()
/Users/fil/miniconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py in amin(a, axis, out, keepdims)
2350
2351 return _methods._amin(a, axis=axis,
-> 2352 out=out, **kwargs)
2353
2354
/Users/fil/miniconda3/lib/python3.6/site-packages/numpy/core/_methods.py in _amin(a, axis, out, keepdims)
27
28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29 return umr_minimum(a, axis, None, out, keepdims)
30
31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
ValueError: zero-size array to reduction operation minimum which has no identity
That looks like an error that I thought I fixed. I'll see if I can track that down.
@lmcinnes Would love to dig in, but as a new father of two, that one hour of free time a day (!!!) is not really setting me up for any productive work right now. But I'll be here lurking for sure :)
@maxberggren That's okay, I certainly understand. If and when you eventually have time you're more than welcome.
@Fil I think I found and fixed the problem. Sorry for the delay, I'm at a conference right now and have been a little short of time.
No worries :). The bug is fixed now, thank you. However, I must report that the result still compares unfavorably with t-SNE: apart from a few outliers at x = -25, we have a blob of points grouped together that holds about 95% of the sample.
But if I take only columns [9:20] from my dataset (the votes for each candidate), then I get a very nicely dispersed graph:
If I include column 8, which is the participation in the election, the outliers are back (because the "Français de l'étranger" tend to vote much less than the others), and the graph is crushed.
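The column selection and normalisation described here can be sketched as follows (a minimal sketch on a hypothetical stand-in table; the column names and values are invented, not the real circo.csv):

```python
import pandas as pd

# hypothetical stand-in for the electoral results table
df = pd.DataFrame({
    "Inscrits": [1000, 800],        # registered voters per district
    "participation": [750, 650],    # turnout (the outlier-prone column)
    "cand_A": [400, 100],           # raw vote counts per candidate
    "cand_B": [300, 500],
})

# normalise raw counts to shares of registered voters, keeping only the
# per-candidate vote columns and dropping participation
votes_only = df[["cand_A", "cand_B"]].div(df["Inscrits"], axis=0)
```

Embedding `votes_only` rather than the full table corresponds to the "columns [9:20] only" variant that produced the nicely dispersed graph.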
I suspect that the poor results are due to outstanding bugs ... as evidenced by the original problem fixed in this issue. I have grabbed the data, and I'll see if I can track down what is going astray here.
Having looked at it a little, this may actually be "fine". It appears that there are some very strong outliers (particularly with respect to column 8). Since UMAP preserves more global structure it wants to insert significant distance between those outliers and the main group to properly represent the data. t-SNE doesn't care about "inter-cluster" distances much at all, so it simply packs the outliers in to a small clump near the main clump. In some sense I would claim that t-SNE is misrepresenting the data in this way. I understand that visually this may not be what you want, but integrity of representation is one of my goals.
On the other hand, I think that, as with the other example in this thread, a much lower n_neighbors value will help. Finally, for visual presentation purposes the relevant values are min_dist and spread. The min_dist value is the minimum distance that points can be packed together; increasing this will help ensure clumps aren't drawn quite so tightly. The default is 0.25, but you can probably go as high as 1.0 (or possibly higher). spread is related to how forcefully distant points are spread apart -- think of it as a multiplicative factor. Reducing spread from the default 1.0 will draw the disparate groups closer to one another. You can play with these to get a better aesthetic presentation.
Thanks a lot for the explanation. This is the result including column 8, and with (n_neighbors=12, min_dist=0.4, spread=0.5).
It might still benefit from more tuning, but it definitely looks more interesting.
Glad you got something that is more appealing. I also realised that you can use the init='random' option to avoid the spectral initialisation, which enforces certain global properties. Anyway, thanks for the interest in the library, and your patience working through some of these issues. Please feel free to continue to provide feedback and issues you find!
See dev-0.4 branch for duplicate handling.