Tried it on some Swedish parliament voting data. Did a notebook comparing it to t-SNE t


Weird results on dataset (umap issue, 25 comments, closed)


Comments (25)

lmcinnes commented on April 28, 2024

Can you share the data? This looks like a case worth debugging. I have perhaps overfit to digits ...
I would suggest you try with a significantly smaller n_neighbors; something more like 10 or 20. I do kind of suspect something has gone astray somewhere internally however.

Edit: Sorry, I see you already have the data shared. I'll try and take a look at this later this evening when I get a chance.


lmcinnes commented on April 28, 2024

The good news is that when running it through on some of my earlier development notebooks I do get decent structure resulting, so the problem is not a fundamental algorithmic one, but rather a coding error somewhere. The bad news is I haven't tracked that down yet, but I'll see if I can figure out what is going astray.


lmcinnes commented on April 28, 2024

The problem is that we have different data points that have zero distance from each other -- that's a corner case that my code does not handle well. It can certainly be fixed, but it will take a little bit of thought on my part to find the "right" way to handle that. In the meantime, as a hack, you can perturb the data ever so slightly to eliminate duplicate points, and then it should work better. Let me know how that goes, and I'll work to make sure duplicated points are properly handled.
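For readers who want to try the perturbation hack described above, here is a minimal sketch. The helper name `jitter`, the noise scale, and the fixed seed are illustrative assumptions, not something taken from the umap code base.

```python
import numpy as np


def jitter(X, scale=1e-6, seed=0):
    """Return a float copy of X with tiny Gaussian noise added to every entry.

    `scale` is relative to the overall standard deviation of X, so the
    perturbation stays negligible compared to the real structure in the data.
    (Hypothetical helper; the default values are arbitrary.)
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=np.float64)
    # Fall back to 1.0 if the data has zero spread (all entries identical).
    sigma = scale * (X.std() or 1.0)
    return X + rng.normal(0.0, sigma, size=X.shape)


# Two identical rows become distinct after jittering.
X = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
X_jittered = jitter(X)
```

The jittered array can then be passed to the embedding in place of the original data. Whether such a small perturbation is enough depends on the distance computation, so treat the scale as something to experiment with.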


maxberggren commented on April 28, 2024

Yep, I dropped duplicate rows and it works.
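The notebook code is not shown in the thread, so the following is only a sketch of one way to drop duplicate rows with NumPy while keeping the original row order. The helper name `drop_duplicate_rows` is hypothetical. Returning the kept indices makes it possible to filter the labels (here, party affiliation) the same way.

```python
import numpy as np


def drop_duplicate_rows(X):
    """Keep the first occurrence of every distinct row, in original order.

    Returns the de-duplicated array and the indices of the rows that were kept.
    """
    X = np.asarray(X)
    # np.unique sorts the rows, so recover the first-occurrence indices
    # and sort them to restore the original ordering.
    _, first_idx = np.unique(X, axis=0, return_index=True)
    keep = np.sort(first_idx)
    return X[keep], keep


votes = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 1]])
unique_votes, kept = drop_duplicate_rows(votes)
```

With a pandas DataFrame, `DataFrame.drop_duplicates()` does the same job directly.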

Interesting differences in comparison to tSNE:

tSNE

umap

A lot happens when changing n_neighbors, though.

Would be cool to make comparisons where there's some cluster ground truth and calculate the Fowlkes-Mallows score.
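As a rough illustration of the Fowlkes-Mallows idea, here is a small pure-Python sketch. It assumes you already have ground-truth labels (for example party affiliation) and cluster labels obtained by clustering the embedding; how those clusters are produced is left open. scikit-learn ships the same metric as `sklearn.metrics.fowlkes_mallows_score`, which is the better choice in practice.

```python
from collections import Counter
from math import comb, sqrt


def fowlkes_mallows(labels_true, labels_pred):
    """Fowlkes-Mallows index: TP / sqrt((TP + FP) * (TP + FN)), counted over pairs."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    # Pairs of points that share both the true label and the predicted label.
    tp = sum(comb(n, 2) for n in Counter(zip(labels_true, labels_pred)).values())
    # Pairs sharing the true label (TP + FN) and the predicted label (TP + FP).
    same_true = sum(comb(n, 2) for n in Counter(labels_true).values())
    same_pred = sum(comb(n, 2) for n in Counter(labels_pred).values())
    if same_true == 0 or same_pred == 0:
        return 0.0
    return tp / sqrt(same_true * same_pred)


# Identical partitions score 1.0 even if the label names differ.
perfect = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])
```

A score near 1 means the clusters found in the embedding agree with the ground truth; a score near 0 means they share almost no pairs.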


lmcinnes commented on April 28, 2024

I just checked in some code that should fix the issue. I agree that it is an interesting comparison. Given the labelling you have, I would have to say that t-SNE definitely looks better -- it has isolated more of your label groups. I'll have to see if I can figure out what is causing that. Thanks for providing such an interesting example to work with!


lmcinnes commented on April 28, 2024

Playing with it a little, it looks like a small n_neighbors is going to help, but that crashes due to some other issues (too many points at distance 0; clearly I need yet more robust handling of that). Clearly still a work in progress.


maxberggren commented on April 28, 2024

No problem, very happy to help/cause headache. To be clear, the labels (color) are party affiliation. So dots (individuals) do not need to vote unanimously, but there should be some clustering.

I also noticed a small n_neighbors helping, but also creating weird outliers.


lmcinnes commented on April 28, 2024

The outliers would be issue #3. I fixed that in the latest round of commits earlier today, but that results in the whole thing breaking for this particular dataset. Still working on figuring out the best way to handle things so it will be suitably robust to such cases.


maxberggren commented on April 28, 2024

Keep up the good work :)


lmcinnes commented on April 28, 2024

Getting close. The latest version should support lower n_neighbors (which is similar to perplexity in t-SNE). We need a lower value for this dataset since there are small distinct groups. The problem previously was some of those groups were all identical points (presumably groups of party members voting identically down the party line?). I believe I have made the algorithm robust to such cases now, and I get better looking results (more akin to the t-SNE results you have). If you want to pull from master and rebuild hopefully it will work better as long as you use a suitably low n_neighbors value. Since you obviously understand the dataset better you can let me know if that provides a result that meets with your intuition of the sort of thing you would expect to get.


Fil commented on April 28, 2024

I've done a similar comparison, with French electoral results from the first round of the 2017 presidential election (11 candidates), and in my case umap looks a bit broken:
https://gist.github.com/Fil/e0ecb37c181b7c960a33316375f0a002

Maybe it's the way I prepare the data, or maybe the fact that I kept "small" candidates which add a lot of noise.


lmcinnes commented on April 28, 2024

This looks suspiciously like it is related to the robustness issues that occurred here and in issue #3. Let me know if you are using the latest master; if so, I clearly have some more work to do to clean this up. Thanks for the feedback -- it is very much appreciated. This is still very much in the experimental stages, so finding all the corner cases (such as these) is very important.


maxberggren commented on April 28, 2024

Looking much better now. Here with n_neighbors set to 8.

Or at least close to tSNE. But to be honest, I don't have that good an intuition about the data, so it's not clear that tSNE is in any way a gold standard. But I'm super glad I could help in pinning down corner cases.


lmcinnes commented on April 28, 2024

Thank you so much for all the help working through the issues. It has been greatly appreciated. If you would be interested in actually digging into the algorithm and/or contributing to the code I would be more than happy to help ease you into it, and explain the goings on as best I can. Let me know.


Fil commented on April 28, 2024

I probably had yesterday's version. But now I can't seem to manage to reinstall it, even by starting a new clean virtualenv (sorry!).

The error is

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-a1231009814f> in <module>()
      3 df = pd.read_csv('~/Sites/legislatives/cartogramme/circo.csv')
      4 table = df[df.columns[3:20]].div(df.Inscrits, axis=0)
----> 5 embedding = umap.UMAP().fit_transform(table.as_matrix())

/Users/fil/Source/umap/build/lib.macosx-10.7-x86_64-3.6/umap/umap_.py in fit_transform(self, X, y)
    170             Embedding of the training data in low-dimensional space.
    171         """
--> 172         self.fit(X)
    173         return self.embedding_

/Users/fil/Source/umap/build/lib.macosx-10.7-x86_64-3.6/umap/umap_.py in fit(self, X, y)
    137         X = X.astype(np.float64)
    138 
--> 139         graph = fuzzy_simplicial_set(X, self.n_neighbors, self.oversampling)
    140 
    141         if self.n_edge_samples is None:

umap/umap_utils.pyx in umap.umap_utils.fuzzy_simplicial_set (umap/umap_utils.c:5932)()

umap/umap_utils.pyx in umap.umap_utils.fuzzy_simplicial_set (umap/umap_utils.c:5054)()

umap/umap_utils.pyx in umap.umap_utils.smooth_knn_dist (umap/umap_utils.c:3101)()

/Users/fil/miniconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py in amin(a, axis, out, keepdims)
   2350 
   2351     return _methods._amin(a, axis=axis,
-> 2352                           out=out, **kwargs)
   2353 
   2354 

/Users/fil/miniconda3/lib/python3.6/site-packages/numpy/core/_methods.py in _amin(a, axis, out, keepdims)
     27 
     28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29     return umr_minimum(a, axis, None, out, keepdims)
     30 
     31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):

ValueError: zero-size array to reduction operation minimum which has no identity


lmcinnes commented on April 28, 2024

That looks like an error that I thought I fixed. I'll see if I can track that down.


maxberggren commented on April 28, 2024

@lmcinnes Would love to dig in, but as a new father-of-two, that one hour of free time a day (!!!) is not really setting me up for any productive work right now. But I'll be here lurking for sure :)


lmcinnes commented on April 28, 2024

@maxberggren That's okay, I certainly understand. If and when you eventually have time you're more than welcome.


lmcinnes commented on April 28, 2024

@Fil I think I found and fixed the problem. Sorry for the delay, I'm at a conference right now and have been a little short of time.


Fil commented on April 28, 2024

No worries :) . The bug is fixed now, thank you. However I must report that the result still compares unfavorably with t-SNE: apart from a few outliers at x = -25 we have a blob of points grouped together and holding about 95% of the sample.

But if I take only columns [9:20] from my dataset (the votes for each candidate), then I get a very nicely dispersed graph:

If I include column 8, which is the participation in the election, the outliers are back (because the "Français de l'étranger" tend to vote much less than the others), and the graph is crushed.


lmcinnes commented on April 28, 2024

I suspect that poor results are due to outstanding bugs ... as evidenced in fixing the original problem in this issue. I have grabbed the data, and I'll see if I can track down what is going astray here.


lmcinnes commented on April 28, 2024

Having looked at it a little, this may actually be "fine". It appears that there are some very strong outliers (particularly with respect to column 8). Since UMAP preserves more global structure, it wants to insert significant distance between those outliers and the main group to properly represent the data. t-SNE doesn't care about "inter-cluster" distances much at all, so it simply packs the outliers into a small clump near the main clump. In some sense I would claim that t-SNE is misrepresenting the data in this way. I understand that visually this may not be what you want, but integrity of representation is one of my goals.

On the other hand, I think that, as with the other example in this thread, a much lower n_neighbors value will help. Finally, for visual presentation purposes the relevant values are min_dist and spread. The min_dist value is the minimum distance at which points can be packed together; increasing this will help to ensure clumps aren't drawn quite so tightly together. The default is 0.25, but you can probably go as high as 1.0 (or possibly higher). spread is related to how forcefully distant points are spread apart -- think of it as a multiplicative factor. Reducing spread from the default 1.0 will draw the disparate groups closer to one another. You can potentially play with these to get a better aesthetic presentation.


Fil commented on April 28, 2024

Thanks a lot for the explanation. This is the result including column 8, and with (n_neighbors=12, min_dist = 0.4, spread=0.5).

It might still benefit from more tuning, but it definitely looks more interesting.
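For anyone wanting to reproduce this kind of tuning, here is a hedged sketch of how the settings quoted above (n_neighbors=12, min_dist=0.4, spread=0.5) could be passed to `umap.UMAP`. The helpers `umap_params` and `embed` are hypothetical names introduced for illustration. The sanity checks reflect my understanding of what current umap-learn releases require (min_dist no larger than spread, n_neighbors at least 2) and may not match the version discussed in this thread.

```python
def umap_params(n_neighbors=12, min_dist=0.4, spread=0.5):
    """Collect UMAP keyword arguments and check them for obvious conflicts.

    Hypothetical helper: the defaults are the values quoted in the thread,
    not library defaults.
    """
    if n_neighbors < 2:
        raise ValueError("n_neighbors must be at least 2")
    if min_dist < 0 or min_dist > spread:
        raise ValueError("min_dist must lie between 0 and spread")
    return {"n_neighbors": n_neighbors, "min_dist": min_dist, "spread": spread}


def embed(X, **params):
    """Fit UMAP on X with the given settings and return the embedding."""
    import umap  # umap-learn; imported lazily so the helper above works without it

    return umap.UMAP(**params).fit_transform(X)


params = umap_params()
# embedding = embed(table.values, **params)  # `table` as in the traceback above
```

Lowering n_neighbors emphasises small local groups, while min_dist and spread mostly change how tightly the picture is drawn, so it can be worth varying them one at a time.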


lmcinnes commented on April 28, 2024

Glad you got something that is more appealing. I also realised that you can use the init='random' option to avoid the spectral initialisation, which enforces certain global properties. Anyway, thanks for the interest in the library, and for your patience working through some of these issues. Please feel free to continue to provide feedback and report any issues you find!


sleighsoft commented on April 28, 2024

See dev-0.4 branch for duplicate handling.

