
umap's People

Contributors

adalmia96, ajtritt, bkmgit, chrismbryant, cjweir, gclen, gclendenning, gregdemand, hamelin, hhcho, hndgzkn, jc-healy, jlmelville, josephcourtney, leriomaggio, lmcinnes, markfraney, matthieuheitz, mithaler, parashardhapola, paxtonfitzpatrick, pujaltes, rocketknight1, sg-s, sleighsoft, thomasnickerson, timsainb, tomwhite, usul83, vicramr

umap's Issues

ZeroDivisionError when dataset is 4096 elements or more

Hi, I've narrowed in on a ZeroDivisionError that happens 100% of the time when my dataset has 4096 (2^12) or more elements, and never when it has fewer.
Varying the min_dist, bandwidth, or n_neighbors parameters doesn't avoid it.
I tried with 3 datasets; the two with 300 dimensions failed this way, but one with 600 dimensions had no error.

The trace:

File "C:/W/py_tests/dim_red/umap_reduct.py", line 9, in <module>
  embedding = umap.UMAP(n_neighbors=6, min_dist=0.002, bandwidth=0.6, metric='cosine').fit_transform(df)
File "C:\W\py_tests\venv\lib\site-packages\umap\umap_.py", line 1476, in fit_transform
  self.fit(X)
File "C:\W\py_tests\venv\lib\site-packages\umap\umap_.py", line 1434, in fit
  self.verbose
File "C:\W\py_tests\venv\lib\site-packages\umap\umap_.py", line 761, in fuzzy_simplicial_set
  verbose=verbose)
ZeroDivisionError: division by zero

This is version 0.2.1 on Windows 10.

Thanks

Attribute Error

Hi there, these are my system specs:
macOS Sierra 10.12.3 (16D32)

I have installed umap through pip. When I try to run it, this is the error message that comes up. I'm unsure what the problem is; any ideas?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-68ef34dfa695> in <module>()
     16         umap_mfccs = get_scaled_umap_embeddings(mfcc_features,
     17                                                 neighbours,
---> 18                                                 distances)
     19         umap_embeddings_mfccs.append(umap_mfccs)
     20 

<ipython-input-10-68ef34dfa695> in get_scaled_umap_embeddings(features, neighbour, distance)
      1 def get_scaled_umap_embeddings(features, neighbour, distance):
      2 
----> 3     embedding = umap.UMAP(n_neighbors=neighbour,
      4                           min_dist = distance,
      5                           metric = 'correlation').fit_transform(features)

AttributeError: module 'umap' has no attribute 'UMAP'
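
This error is often caused by a different package shadowing umap-learn: the PyPI package named `umap` is an unrelated project that also imports as `umap`. A quick diagnostic, sketched under that assumption:

import umap

print(umap.__file__)          # the path reveals which package actually loaded
print(hasattr(umap, "UMAP"))  # False suggests the wrong `umap` is installed

If that is the case, uninstalling `umap` and installing `umap-learn` should restore the import.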

Sparse matrix support

In principle a distance function could take sparse vectors and thus allow UMAP to accept sparse matrices as input. This would allow much higher-dimensional data (NLP-related data, for example) to be handled by UMAP.

Intermittent ZeroDivisionError: division by zero

This is a fantastic library, thanks very much for your great work. Periodically though, I get a ZeroDivisionError: division by zero while building a UMAP projection. My data doesn't change, nor does the way I call the UMAP constructor:

model = umap.UMAP(n_neighbors=25, min_dist=0.00001, metric='correlation')
fit_model = model.fit_transform( np.array(image_vectors) )

Once in a while (maybe 5% of runs) this throws the following trace (umap version 0.1.5):

File "imageplot.py", line 278, in <module>
    Imageplot(image_dir=sys.argv[1], output_dir='output')
  File "imageplot.py", line 30, in __init__
    self.create_2d_projection()
  File "imageplot.py", line 148, in create_2d_projection
    model = self.build_model(image_vectors)
  File "imageplot.py", line 175, in build_model
    return model.fit_transform( np.array(image_vectors) )
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 1402, in fit_transform
    self.fit(X)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 1361, in fit
    self.verbose
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 385, in rptree_leaf_array
    angular=angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 301, in make_tree
    rng_state)
ZeroDivisionError: division by zero

I took a quick look at the make_tree function, but that didn't show much; the real problem seems to be swallowed in the stack trace by the recursion. Do you have an idea what might cause this? I'll upgrade to the latest master and see if the problem continues.
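
One workaround worth trying, sketched under the assumption (a guess, not a confirmed diagnosis) that exact-duplicate vectors trigger a degenerate random-projection split:

import numpy as np
import umap

vectors = np.unique(np.array(image_vectors), axis=0)  # drop exact duplicate rows
model = umap.UMAP(n_neighbors=25, min_dist=0.00001, metric='correlation')
fit_model = model.fit_transform(vectors)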

Bad argument for scipy.sparse.coo_matrix

I want to use umap.UMAP().fit_transform(X) but I got an error:
ValueError: negative column index found
from the scipy function scipy.sparse.coo_matrix.

When I investigated, I found that in umap_.py, in the function fuzzy_simplicial_set(), the variable tmp_indices contains values < 0 (-1), but scipy.sparse.coo_matrix needs values >= 0!

Thanks in advance for any answer.
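
For context, a sketch of the pattern the report points at (all names here are illustrative, not the library's actual variables): NN-descent can leave -1 placeholders for neighbor slots it never filled, and those must be masked out before building the COO matrix.

import numpy as np
import scipy.sparse

# hypothetical inputs: knn_indices/knn_dists of shape (n_samples, n_neighbors)
rows = np.repeat(np.arange(knn_indices.shape[0]), knn_indices.shape[1])
cols = knn_indices.ravel()
vals = knn_dists.ravel()
keep = cols >= 0  # discard unfilled (-1) neighbor slots
graph = scipy.sparse.coo_matrix(
    (vals[keep], (rows[keep], cols[keep])),
    shape=(n_samples, n_samples))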

UMAP Roadmap

A rough roadmap of things to be done for UMAP. Some of these tasks are easy, some are hard, and some require deeper knowledge of UMAP. Short and medium term tasks should be approachable for many people. Reply to this issue if you are interested in taking up any of them.

Short term items

  • Support for sparse matrix input
  • Add random seed as a user option
  • Support for cosine distance RP-trees
  • Allow non-RP-tree initialisation of NN-descent
  • Better document (via docstrings) all the support functions
  • "Custom" initialisation with a predefined positioning.

Medium term items

  • Generate notebook for basic usage demonstration
  • Generate notebook explaining parameter options and their effects
  • Set up CI and build a basic test suite
  • Start building basic documentation and integrate with readthedocs

Longer term items

  • Generate notebook for "How UMAP works"
  • Add code (and devise API(?)) for UMAP on general pandas dataframes
  • Add support for semi-supervised dimension reduction via UMAP
  • UMAP as a generative model (code + demo)
  • UMAP for text data (similar to word2vec)
  • A transform function for new previously unseen data (see issue #40)
  • Model persistence for UMAP models

No priority

  • GPU support for UMAP
  • Conda-forge UMAP package
  • Improve numba usage (better numba expertise required)
  • Concurrency via Dask for multicore and distributed support

Mixed-type datasets

Thanks for sharing the great algorithm and library!

What would be the recommended way of feeding mixed-type data with some categorical features to UMAP? Binary encoding (possibly with appropriate distance metrics)?
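
One possible approach, sketched (the column names are hypothetical, and the metric choice is a suggestion rather than an official recommendation): one-hot encode the categorical columns and pick a distance metric that tolerates binary features.

import pandas as pd
import umap

encoded = pd.get_dummies(df, columns=['colour', 'category'])  # hypothetical columns
# assumes the 'dice' named metric is available in the installed version
embedding = umap.UMAP(metric='dice').fit_transform(encoded.values)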

Install problems

I had some trouble getting umap installed on my system. I think the problem was getting numba to work properly without using Anaconda.

Traceback (most recent call last):
  File "visualize.py", line 2, in <module>
    import umap
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/umap/__init__.py", line 1, in <module>
    from .umap_ import UMAP
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/umap/umap_.py", line 7, in <module>
    import numba
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/__init__.py", line 12, in <module>
    from .special import typeof, prange
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/special.py", line 3, in <module>
    from .typing.typeof import typeof
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/typing/__init__.py", line 2, in <module>
    from .context import BaseContext, Context
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/typing/context.py", line 10, in <module>
    from numba.typeconv import Conversion, rules
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/typeconv/rules.py", line 3, in <module>
    from .typeconv import TypeManager, TypeCastingRules
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/typeconv/typeconv.py", line 3, in <module>
    from . import _typeconv, castgraph, Conversion
ImportError: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory

I was able to get everything to work without using Conda by installing the Python dev package:

sudo apt-get install libpython3.6-dev

Though my issue is already solved, other people might run into the same problem. It might be helpful to incorporate this information into the docs somewhere, or just close this issue and direct people here if it comes up again.

Multi CPU / GPU capabilities?

@lmcinnes
As you may have guessed, I have several CPUs and GPUs at hand and I work with high-dimensional data.
Right now I am benchmarking a 500k x 5k => 500k x 2 reduction against PCA (I need a high-level clustering to filter my data before feeding it further down the pipeline).

So a couple of questions:

  1. Any plans on multi-CPU / GPU support?
  2. Does your implementation utilize vectorized operations? (I'm not really sure how embedding methods like t-SNE and UMAP work; I believe they minimize some kind of distance in high-dimensional space?) If so, can I help?
  3. Did you run benchmarks (like HDBSCAN's) on large and huge datasets? If so, is it feasible to expect 500k x 5k => 500k x 2 to finish in reasonable time, or should I do PCA => UMAP (sketched below)?
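
For question 3, a sketch of the PCA => UMAP pipeline (the 50-component intermediate size is an arbitrary illustrative choice):

from sklearn.decomposition import PCA
import umap

X_mid = PCA(n_components=50).fit_transform(X)               # 500k x 5k -> 500k x 50
embedding = umap.UMAP(n_components=2).fit_transform(X_mid)  # -> 500k x 2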

RuntimeWarning: divide by zero encountered in power

Hi there, first of all, thanks a lot for this exciting algorithm. I'm writing a blog post comparing this to a couple of other dimension reduction techniques. I noticed that when I'm using umap on a dataset with entries such as:

array([  0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
        -1.88847315,  11.17503262,  -0.69157058,   5.85993528,
         0.98581624,  -1.14453554,   0.61075902,  -3.21815372,
         4.9411006 ,   5.51712704,  -1.7895503 ,   2.04580665,
         0.22949766,  -6.60904551,   8.11811924,   1.88291252,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ])

I get the following runtime warning

/usr/local/lib/python3.5/dist-packages/umap/umap_.py:592: RuntimeWarning: divide by zero encountered in power
  return 1.0 / (1.0 + a * x ** (2 * b))
/usr/local/lib/python3.5/dist-packages/scipy/optimize/minpack.py:779: OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)

And less than satisfactory results, such as these (plots across many neighbour and distance settings omitted).
Any thoughts?

Preprocessing of the features

Hello Leland,

Congrats on the work, and thanks for the code and examples. Looking forward to the paper!

If we'd like to use UMAP with features output by a CNN, for example on the MNIST dataset, do the features need to be zero-centered? Or in some range?

I get the following error when running fit_transform(data): "ZeroDivisionError: division by zero".

Thanks again,
Amelia.
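
On the preprocessing question, a common sketch (standardization is a guess at good practice here, not a documented requirement of UMAP):

from sklearn.preprocessing import StandardScaler
import umap

scaled = StandardScaler().fit_transform(cnn_features)  # cnn_features: hypothetical name
embedding = umap.UMAP().fit_transform(scaled)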

ValueError: negative column index found

error.zip

A strange issue happens if you try to compute these simple 500 rows (file attached).

The code:

import umap
import pandas as pd

df = pd.read_csv('error.csv', header=None)
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
                      metric='correlation').fit_transform(df.values)

As a result we get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-a14777825bbd> in <module>()
      1 embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
----> 2                       metric='correlation').fit_transform(df.values)

~/venv3/lib/python3.6/site-packages/umap_learn-0.1.3-py3.6.egg/umap/umap_.py in fit_transform(self, X, y)
    790             Embedding of the training data in low-dimensional space.
    791         """
--> 792         self.fit(X)
    793         return self.embedding_

~/venv3/lib/python3.6/site-packages/umap_learn-0.1.3-py3.6.egg/umap/umap_.py in fit(self, X, y)
    757 
    758         graph = fuzzy_simplicial_set(X, self.n_neighbors,
--> 759                                      self._metric, self.metric_kwds)
    760 
    761         if self.n_edge_samples is None:

~/venv3/lib/python3.6/site-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    189             self.data = self.data.astype(dtype, copy=False)
    190 
--> 191         self._check()
    192 
    193     def getnnz(self, axis=None):

~/venv3/lib/python3.6/site-packages/scipy/sparse/coo.py in _check(self)
    241                 raise ValueError('negative row index found')
    242             if self.col.min() < 0:
--> 243                 raise ValueError('negative column index found')
    244 
    245     def transpose(self, axes=None, copy=False):

ValueError: negative column index found

Any help is appreciated.

umap non determinism - intended?

I was testing it out and noticed that setting the random seed doesn't stop the embedding from changing across runs.

Is non-determinism part of the design (like t-SNE)? Is there a way to replicate prior results?
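
The usage being asked about, sketched (this assumes a random_state constructor argument exists and is honored by every code path, which may not hold in the installed version; seeding also appears on the roadmap above):

import umap

embedding = umap.UMAP(random_state=42).fit_transform(X)  # X: your data matrix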

Encountering numba current locale errors?

raise NotImplementedError("cannot convert native %s to Python object" % (typ,))
LoweringError: cannot convert native const('\tnn descent iteration ') to Python object
File "../../../../home/.local/lib/python2.7/site-packages/umap/umap_.py", line 663
[1] During: lowering "print($515.5)" at /home/.local/lib/python2.7/site-packages/umap/umap_.py (663)

Failed at nopython (nopython mode backend)
cannot convert native const('\tnn descent iteration ') to Python object
File "../../../../home/.local/lib/python2.7/site-packages/umap/umap_.py", line 663
[1] During: lowering "print($515.5)" at /home/.local/lib/python2.7/site-packages/umap/umap_.py (663)
Related to 1.12.3.3 here: http://numba.pydata.org/numba-doc/dev/user/faq.html

RuntimeWarning: overflow

I get this warning on a dataset:

umap_.py:154: RuntimeWarning: overflow encountered in int_scalars
  self.init

Is this something you can check without the dataset, or do you need it?

Weird results on dataset

I tried it on some Swedish parliament voting data. I did a notebook comparing it to t-SNE, which works fine, but umap just produces one big blob. I tried some different parameters without any luck, but honestly I have no clue what either of the parameters does :).

See the notebook for more info (if you run it you will have nice interactive plots, but I also added a static plot since the interactive ones are stripped from the gist):

https://gist.github.com/maxberggren/56efa53776f42755b83261c54081496e

Doesn't Work!!

Thank you for providing the umap module!
I installed it by pip3 and tried the following example.

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP().fit_transform(digits.data)

However, it doesn't work, and the error says that

module 'scipy.sparse' has no attribute 'csgraph'

I reinstalled scipy but the error remains.
Could you deal with it?
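
A workaround sketch, assuming the cause is that some scipy versions do not automatically load the csgraph submodule when scipy.sparse is imported:

import scipy.sparse.csgraph  # force the submodule to load before umap uses it
import umap
from sklearn.datasets import load_digits

digits = load_digits()
embedding = umap.UMAP().fit_transform(digits.data)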

Problem with `n_components > 2`

Hi Leland,

Thank you for all the hard work you've put in UMAP. I'm very fond of it.

I'm using UMAP for dimension reduction. I was actually wondering what happens when you stack multiple instances of UMAP with different component counts. In the process I encountered the following error, produced by the code below.

A detail worth pointing out: I'm also having problems with the scipy.sparse.csgraph import, and I believe this is related.

import umap
import scipy.sparse.csgraph
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP(
    n_components=20,
    n_neighbors=5,
    min_dist=0.3,
    metric='correlation'
).fit_transform(digits.data)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-35aa497c09c5> in <module>()
     10     min_dist=0.3,
     11     metric='correlation'
---> 12 ).fit_transform(digits.data)

~/virtualenvs/work/lib/python3.6/site-packages/umap/umap_.py in fit_transform(self, X, y)
   1473             Embedding of the training data in low-dimensional space.
   1474         """
-> 1475         self.fit(X)
   1476         return self.embedding_

~/virtualenvs/work/lib/python3.6/site-packages/umap/umap_.py in fit(self, X, y)
   1453             self.init,
   1454             random_state,
-> 1455             self.verbose
   1456         )
   1457 

~/virtualenvs/work/lib/python3.6/site-packages/umap/umap_.py in simplicial_set_embedding(graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, verbose)
   1115     elif isinstance(init, str) and init == 'spectral':
   1116         # We add a little noise to avoid local minima for optimization to come
-> 1117         initialisation = spectral_layout(graph, n_components, random_state)
   1118         expansion = 10.0 / initialisation.max()
   1119         embedding = (initialisation * expansion) + \

~/virtualenvs/work/lib/python3.6/site-packages/umap/umap_.py in spectral_layout(graph, dim, random_state)
    881             init = random_state.uniform(low=-10.0, high=10.0,
    882                                         size=(n_samples, 2))
--> 883             init[labels == largest_component] = eigenvectors[:, order]
    884             return init
    885     except scipy.sparse.linalg.ArpackError:

ValueError: shape mismatch: value array of shape (1770,20) could not be broadcast to indexing result of shape (1770,2)

Relevant version numbers:

python==3.6.1
scipy==1.0.0
numpy==1.14.0
numba==0.36.2
scikit-learn==0.19.1

Will continue to look into it in the meantime.
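
The shape mismatch suggests the ArpackError fallback in spectral_layout hardcodes two output dimensions. A sketch of the kind of fix implied (not the project's actual patch), with dim being the function's dimension argument:

# in spectral_layout (umap_.py, around line 881 of the installed version):
init = random_state.uniform(low=-10.0, high=10.0,
                            size=(n_samples, dim))  # was size=(n_samples, 2)
init[labels == largest_component] = eigenvectors[:, order]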

ZeroDivisionError with sparse input and metric='jaccard'

Example to reproduce the error:

import numpy as np
from sklearn import manifold
import umap

X = np.random.choice([0, 1], size=(1000, 50), p=[90./100, 10./100])

tsne = manifold.TSNE(metric='jaccard')
y_tsne = tsne.fit_transform(X)

um = umap.UMAP(metric='jaccard')
y_umap = um.fit_transform(X)

p=[85./100, 15./100] works. (Presumably because at 10% density some rows come out all zeros; with 50 columns that is about 0.9^50 ≈ 0.5% of rows, and the Jaccard distance between two all-zero vectors has a zero denominator.)

Support for input neighbor sets

This would allow mutual nearest neighbors, or other approaches to computing nearest neighbors, to be used, providing greater flexibility.

Access to high-dimensional fuzzy simplicial set

I find that for some applications (e.g. clustering) it is good to have access to the high-dimensional fuzzy simplicial set through the UMAP class. This could be implemented easily by storing graph as self.graph in the UMAP.fit() method. If you think this is a useful feature but are too busy with more urgent things, I will be happy to implement it through a pull request. Please let me know.
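
The proposal, sketched as usage (the graph attribute does not exist yet; the name follows the suggestion above):

import umap

reducer = umap.UMAP()
reducer.fit(X)              # X: your data matrix
fuzzy_set = reducer.graph   # the stored high-dimensional fuzzy simplicial set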

[Question] Clustering on UMAP output

Hi,

when using t-SNE, it is usually not recommended to perform clustering on the "reduced space" with algorithms such as k-means or DBSCAN (and HDBSCAN?), because the dimensionality reduction applied by t-SNE doesn't keep properties like relative distance and density (see https://stats.stackexchange.com/questions/263539/k-means-clustering-on-the-output-of-t-sne).

Would it make sense to perform such clustering (with k-means, DBSCAN, HDBSCAN etc.) on the UMAP output?

Thank you very much.

Recursion Error (Different from Previous Post)

EDIT: Is this just a version issue? RecursionError was RuntimeError before python 3.5.

I'm working with a large data set (~1,700,000 x ~400), and I'm getting the following error:

Traceback (most recent call last):
  File "umapping.py", line 25, in <module>
    u = umap.UMAP(metric="correlation").fit_transform(data)
  File "/users/nicolerg/anaconda2/lib/python2.7/site-packages/umap/umap_.py", line 1573, in fit_transform
    self.fit(X)
  File "/users/nicolerg/anaconda2/lib/python2.7/site-packages/umap/umap_.py", line 1534, in fit
    self.verbose
  File "/users/nicolerg/anaconda2/lib/python2.7/site-packages/umap/umap_.py", line 559, in rptree_leaf_array
    except RecursionError:
NameError: global name 'RecursionError' is not defined

This is the relevant bit of code:

data = ps.read_csv(args.input, compression="gzip", header=1, sep=',')
nrows = len(data)
colors = np.random.rand(nrows, 3)  # RGB colors
u = umap.UMAP(metric="correlation", n_neighbors=25).fit_transform(data)

I increased n_neighbors from the default of 15 to 25 to see if that would help, but I got the same error. I do not expect that I have equivalent rows in my data. I am trying to cluster the ~1,700,000 instances in ~400 dimensions. Any suggestions?

Does not work well on trained doc2vec model

I trained a doc2vec model on the large movie review dataset and then tried to use UMAP to reduce the dimensions of the resulting document vectors. I had hoped that it would be possible to separate the documents by sentiment (positive and negative), but unfortunately the embedding is one big blob. A notebook can be seen here and the rest of the files for training the doc2vec model are in that repository as well.

UMAP as a dimensionality reduction (umap.transform())

Hey hi @lmcinnes

First of all, thanks for this method. It works so well!

So I have a general question about using UMAP as a dimensionality reduction step in a prediction pipeline. We have a classification model where using UMAP as a first dimensionality reduction step seems to give really good results. It fixes a lot of regularization issues we have with this specific model. Now I guess my question is more related to manifold learning in general, but I usually fit the dimension reduction model first on the training data and then use the same model for inference/prediction, in order to have a consistent lower-dimensional projection.

Now obviously, like t-SNE, the manifold itself is learned from the data, so it's hard to "transform" new incoming data; that's why there is no umap.transform() method, I guess. There was a closely related discussion on sklearn at some point about a possible parametric t-SNE that would make this projection easier (scikit-learn/scikit-learn#5361), but it looks like that's a non-trivial task for t-SNE. Anyway, long story short: since the documentation mentions that UMAP can be used as a "reduction technique as a preliminary step to other machine learning tasks", I was wondering what a prediction pipeline using UMAP would look like.

The method I found so far is to reduce the dimensionality of the training AND test data at the same time in a single umap.fit_transform() (sketched below), then train the model on the reduced training data and predict with the reduced test data. It works well in a test scenario, but obviously in a real-world environment it means we would have to perform the dimension reduction of the incoming data alongside the entire training dataset every time.
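
That workflow, sketched (names and sizes are placeholders):

import numpy as np
import umap

combined = np.vstack([X_train, X_test])
reduced = umap.UMAP(n_components=10).fit_transform(combined)
Z_train, Z_test = reduced[:len(X_train)], reduced[len(X_train):]
# train the classifier on Z_train, evaluate on Z_test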

Is there a more elegant way of doing this ?

Martin

Performance regression in 0.2

I was pip-updating from 0.1.3 to 0.2. Two of our sample workloads took a significant performance hit: reducing 480x13500 to 80x13500 ran in 2:24 instead of 1:14, and reducing 480x6700 to 80x6700 took 1:49 instead of 0:28.

Alongside updating umap-learn, other libraries got a bump (llvmlite 0.20 to 0.21, numba 0.35.0 to 0.36.2). Neither of those affected running times. After downgrading to 0.1.3, I got the former numbers.

I saw that this commit disabled jitting for fuzzy_simplicial_set. Could this or anything else cause the regression?

Cosine distance RP-Trees

UMAP currently uses RP-trees to initialise the NN-descent algorithm, and the current version uses euclidean-distance RP-trees. In principle, cosine-distance RP-trees are simple to implement and would be more useful for the cosine and correlation distance metrics, so allowing the option would be beneficial.

This requires both implementing the trees (simply writing a new splitting function; see the sketch below) and threading the option through the code so it can be presented to the user at class instantiation time.
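
To make the idea concrete, a sketch of an angular splitting rule (an illustration of the concept, not the library's implementation):

import numpy as np

def angular_split(data, indices, rng):
    """Split `indices` (a 1-D numpy array of row ids) by a hyperplane
    bisecting two randomly chosen points on the unit sphere."""
    i, j = rng.choice(indices, size=2, replace=False)
    u = data[i] / np.linalg.norm(data[i])   # normalise so the split reflects
    v = data[j] / np.linalg.norm(data[j])   # cosine rather than euclidean geometry
    hyperplane = u - v                      # normal of the bisecting plane
    side = data[indices] @ hyperplane >= 0  # which side each point falls on
    return indices[side], indices[~side]

# usage sketch: angular_split(X, np.arange(len(X)), np.random.default_rng(0))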

Converging to a single point

I'm using UMAP to embed a bunch of 128 dimensional face embeddings generated by a neural net.

As I increase the number of embeddings (I have 3M total), the output from UMAP converges to a single point in the center surrounded by a sparse cloud. How can I fix this? Here are some examples, from fewer samples to more: n = 73728, 114688, 172032, 196608, 245760.

[plots for each sample size omitted]

`umap_utils` missing?

Looks like umap_utils is missing?

ImportError                               Traceback (most recent call last)
<ipython-input-2-c9a53c6b2768> in <module>()
----> 1 import umap

/Users/max/Dropbox/Dokument/Python/voteringar/umap/__init__.py in <module>()
----> 1 from .umap_ import UMAP

/Users/max/Dropbox/Dokument/Python/voteringar/umap/umap_.py in <module>()
----> 1 from .umap_utils import fuzzy_simplicial_set, simplicial_set_embedding
      2 from scipy.optimize import curve_fit
      3 from sklearn.base import BaseEstimator
      4 
      5 import numpy as np

ImportError: No module named umap_utils

Adding performance notebook

I created a small notebook to replicate UMAP's performance as dataset size and dimensionality increase.

I cannot create a pull request, though. Is this something on my end, or do I need permission for this?

How to proceed?

[attached: plots of increasing dimensionality and increasing data, plus the notebook]

Evaluating dimensionality reduction?

Hello Leland,

Thank you for sharing this new algorithm.
I have a question regarding evaluation measures for dimensionality reduction methods. I'm aware of trustworthiness and continuity, but I'm looking for measures that can handle large datasets.

I found the paper "Scale-independent quality criteria for dimensionality reduction" which is an alternative quality measure, but it is still for small datasets.

How are you evaluating umap against other approaches at the moment?
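
Not an answer to the scaling concern, but for completeness: recent scikit-learn versions expose a trustworthiness implementation directly (a sketch; it still scales poorly to very large datasets):

from sklearn.manifold import trustworthiness

score = trustworthiness(X, embedding, n_neighbors=5)  # closer to 1.0 is better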

n_neighbors of point i includes i

In fuzzy_simplicial_set, in the small data case where X is the full distance matrix, the following code gets run:

    if metric == 'precomputed':
        # Note that this does not support sparse distance matrices yet ...
        # Compute indices of n nearest neighbors
        knn_indices = np.argsort(X)[:, :n_neighbors]
        # Compute the nearest neighbor distances
        #   (equivalent to np.sort(X)[:,:n_neighbors])
        knn_dists = X[np.arange(X.shape[0])[:, None], knn_indices].copy()

I believe that knn_indices and knn_dists contain the point itself and not just its k neighbors; i.e., it is always true that knn_indices[i][0] == i and knn_dists[i][0] == 0.

This leads to an error in simplicial_set_embedding if umap.UMAP is run with n_neighbors=1, because graph then consists entirely of zeros, with the following error (in PyCharm on Windows, at any rate):

  File "C:\dev\python\umap\umap\umap_.py", line 1152, in simplicial_set_embedding
    graph.data[graph.data < (graph.data.max() / float(n_epochs))] = 0.0
  File "C:\dev\python\py36\lib\site-packages\numpy\core\_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

An additional effect is that you can never use all neighbor distances in fuzzy_simplicial_set, because n_neighbors must be smaller than the dataset size. Admittedly this isn't a very useful thing to do in practice, but it seems like you ought to be able to.

So I think this is a bug. I haven't tried with large data and the metric_nn_descent code block.
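
The change the report seems to imply, sketched (whether self-inclusion is intended behaviour is for the maintainers to say):

import numpy as np

# skip column 0 of the argsort, which is the point itself
knn_indices = np.argsort(X)[:, 1:n_neighbors + 1]
knn_dists = X[np.arange(X.shape[0])[:, None], knn_indices].copy()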

JavaScript implementation?

Are you aware of any JavaScript implementations?

Most probably there are none, so please ping me if you'd be interested as well. There's already a powerful ML JavaScript toolkit, https://github.com/mljs/ml, so I'd love to have UMAP included there.

Custom embedding initialization

Looking at the embedding initialization options, I see 'random' and 'spectral'. Would it be possible to initialize with a custom embedding? And if so, would this embedding be at all preserved?

In trying to compare the effect of different parameter changes, it could be helpful to use the embedding of a previous run as the initialization to a new UMAP instance with slightly different parameters. For example, these two min_dist values result in embeddings with different global orientations but similar local relationships.

[plots omitted]
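
The proposed usage, sketched (this assumes init would accept an (n_samples, n_components) array, which may not be supported in the current version):

import umap

first = umap.UMAP(min_dist=0.1).fit_transform(X)
second = umap.UMAP(min_dist=0.5, init=first).fit_transform(X)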

Sudden outlier

I've quickly tested Multicore-TSNE and umap on the telecom churn dataset.
Here is the notebook; the dataset is available in the very same repository.

Looks like umap suffers from a sudden outlier which didn't affect t-SNE.

I haven't played with umap's hyperparameters, but maybe the example will be useful.

does not support option: 'parallel'

Hi, thanks for the exciting work! I am playing with your algorithm, and I got the following error message when running your demo with digits.data. Do you have a sense of what is going on here?


KeyError Traceback (most recent call last)
/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/targets/options.py in from_dict(self, dic)
17 try:
---> 18 ctor = self.OPTIONS[k]
19 except KeyError:

KeyError: 'parallel'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in <module>()
6 embedding = umap.UMAP(n_neighbors=5,
7 min_dist=0.3,
----> 8 metric='correlation').fit_transform(digits.data)

/Users/Qihong/Dropbox/github/umap/umap/umap_.py in fit_transform(self, X, y)
790 Embedding of the training data in low-dimensional space.
791 """
--> 792 self.fit(X)
793 return self.embedding_

/Users/Qihong/Dropbox/github/umap/umap/umap_.py in fit(self, X, y)
757
758 graph = fuzzy_simplicial_set(X, self.n_neighbors,
--> 759 self._metric, self.metric_kwds)
760
761 if self.n_edge_samples is None:

/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
305 argtypes.append(self.typeof_pyval(a))
306 try:
--> 307 return self.compile(tuple(argtypes))
308 except errors.TypingError as e:
309 # Intercept typing error that may be due to an argument

/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/dispatcher.py in compile(self, sig)
577
578 self._cache_misses[sig] += 1
--> 579 cres = self._compiler.compile(args, return_type)
580 self.add_overload(cres)
581 self._cache.save_overload(sig, cres)

/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/dispatcher.py in compile(self, args, return_type)
70 def compile(self, args, return_type):
71 flags = compiler.Flags()
---> 72 self.targetdescr.options.parse_as_flags(flags, self.targetoptions)
73 flags = self._customize_flags(flags)
74

/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/targets/options.py in parse_as_flags(cls, flags, options)
26 def parse_as_flags(cls, flags, options):
27 opt = cls()
---> 28 opt.from_dict(options)
29 opt.set_flags(flags)
30 return flags

/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/targets/options.py in from_dict(self, dic)
19 except KeyError:
20 fmt = "%r does not support option: '%s'"
---> 21 raise KeyError(fmt % (self.__class__, k))
22 else:
23 self.values[k] = ctor(v)

KeyError: "<class 'numba.targets.cpu.CPUTargetOptions'> does not support option: 'parallel'"

AttributeError: module 'scipy.sparse' has no attribute 'csgraph'

Hello,

Thank you for the great contribution.

I can't seem to get it running. Any help is appreciated.

Here are my versions:

Requirement already satisfied: umap-learn in ./anaconda3/lib/python3.6/site-packages
Requirement already satisfied: numba>=0.34 in ./anaconda3/lib/python3.6/site-packages (from umap-learn)
Requirement already satisfied: scipy>=0.19 in ./anaconda3/lib/python3.6/site-packages (from umap-learn)
Requirement already satisfied: scikit-learn>=0.16 in ./anaconda3/lib/python3.6/site-packages (from umap-learn)
Requirement already satisfied: llvmlite in ./anaconda3/lib/python3.6/site-packages (from numba>=0.34->umap-learn)
Requirement already satisfied: numpy in ./anaconda3/lib/python3.6/site-packages (from numba>=0.34->umap-learn)

Running the example,

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP().fit_transform(digits.data)

outputs:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-e5c7a5ee7150> in <module>()
      4 digits = load_digits()
      5 
----> 6 embedding = umap.UMAP().fit_transform(digits.data)

~/anaconda3/lib/python3.6/site-packages/umap/umap_.py in fit_transform(self, X, y)

~/anaconda3/lib/python3.6/site-packages/umap/umap_.py in fit(self, X, y)

~/anaconda3/lib/python3.6/site-packages/umap/umap_.py in simplicial_set_embedding(graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, verbose)

~/anaconda3/lib/python3.6/site-packages/umap/umap_.py in spectral_layout(graph, dim, random_state)

AttributeError: module 'scipy.sparse' has no attribute 'csgraph'

But I can import csgraph without problems from scipy:
from scipy.sparse import csgraph

Custom losses, coherent embeddings

A nice property of t-SNE that is not exploited in most implementations is that it can be treated as a combination of two orthogonal components: a loss function and an optimization algorithm. For example, one may visualize a set of temporally varying vectors with a sequence of coherent embeddings by adding a loss term that penalizes unnecessary movement of each vector between those embeddings. Would it be possible to provide such flexibility, i.e. additional constraints, with UMAP?

Digits example

569 self._raise_no_convergence()
570 else:
--> 571 raise ArpackError(self.info, infodict=self.iterate_infodict)
572
573 def extract(self, return_eigenvectors):

ArpackError: ARPACK error 3: No shifts could be applied during a cycle of the Implicitly restarted Arnoldi iteration. One possibility is to increase the size of NCV relative to NEV.

Precomputed distances

Would you be interested in, or see any obstacles to, implementing UMAP on a distance matrix? At first glance this seems quite straightforward to include.
I'd be inclined to contribute.
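
The proposed usage, sketched (mirroring scikit-learn's metric='precomputed' convention; an assumption about the eventual interface):

from sklearn.metrics import pairwise_distances
import umap

D = pairwise_distances(X, metric='euclidean')  # X: your data matrix
embedding = umap.UMAP(metric='precomputed').fit_transform(D)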

hdbscan on UMAP subspace

The docs say, "With a little care (documentation on how to be careful is coming) it partners well with the hdbscan clustering library." I wonder, are there any updates on the "little care", or quick answers on how to use hdbscan to perform clustering on the UMAP subspace?
Thanks in advance!
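
A common pairing, sketched (the parameter values are illustrative, and the promised documentation may add caveats):

import umap
import hdbscan

embedding = umap.UMAP(n_neighbors=30, min_dist=0.0).fit_transform(X)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(embedding)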

[Question] What's the scaling complexity?

Looks like a great alternative to t-SNE! The readme mentions how fast it is, but I wonder what the complexity is in big-O terms, as a function of the number of samples and dimensions? Impatiently waiting for the paper!

get a RecursionError

working_uamp = umap.UMAP(n_neighbors=5,
                         n_components=2,
                         min_dist=0.3,
                         metric='euclidean')

input_feat = df.as_matrix([df.columns[1:1001]])
embeddings = working_uamp.fit_transform(input_feat)

---------------------------error message ------------------------------

make_tree(data, indices, leaf_size)
105 rng_state)
106 left_node = make_tree(data, left_indices, leaf_size)
--> 107 right_node = make_tree(data, right_indices, leaf_size)
108 node = RandomProjectionTreeNode(indices, False, left_node, right_node)
109 else:

RecursionError: maximum recursion depth exceeded
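
A stopgap sketch, assuming the failure is deep RP-tree recursion (for example on near-duplicate rows) exceeding Python's default limit:

import sys
sys.setrecursionlimit(10000)  # raise before calling fit_transform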

Transform and Unsupervised Data

Hello,

maybe I'm missing it, but is there a 'transform' function? I.e., after you have trained the UMAP instance with data, can you apply the same instance to an unseen point?
If not, why? And is it foreseen?
Thank you!

Add progress log

It would be nice to report some intermediate calculation progress as log info messages. For large data the process can be relatively long, and progress info would help.
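
What partial support exists today, sketched (the constructor takes a verbose flag; how much of the pipeline it reports on is an assumption):

import umap

embedding = umap.UMAP(verbose=True).fit_transform(X)  # X: your data matrix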
