A rough roadmap of things to be done for UMAP. Some of these tasks are easy, some are


UMAP Roadmap,about lmcinnes/umap

Comments (36)

bccho commented on April 28, 2024 4

Unfortunately I have no control over moving to python 3 (as much as I would like to), but for a workaround, I can try saving individual subobjects to files and re-loading them.
Can you indicate what subobjects and parameters are required for transform to work correctly?

EDIT: After iterating through individual attributes from dir(trans), it looks like _random_init, _search, and _tree_init are the culprits. They are all instances of @numba.njit called on nested functions, but using dill didn't resolve the problem, and it seems they are necessary for transform.
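The attribute-by-attribute search described above can be sketched generically. This is a minimal illustration, not UMAP code: `find_unpicklable_attrs` and the `Model` class are hypothetical names, and a `threading.Lock` stands in for the numba-compiled functions simply because it is a standard-library object that pickle rejects. It walks `vars(obj)` rather than `dir(obj)`, so only instance attributes are tested.

```python
import pickle
import threading


def find_unpicklable_attrs(obj):
    """Return the names of instance attributes that pickle cannot serialize."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value, pickle.HIGHEST_PROTOCOL)
        except Exception:
            # Any failure here means this attribute blocks pickling the object.
            bad.append(name)
    return sorted(bad)


class Model:
    """Hypothetical stand-in for a fitted estimator."""

    def __init__(self):
        self.weights = [0.1, 0.2]      # plain data, pickles fine
        self._lock = threading.Lock()  # locks cannot be pickled


print(find_unpicklable_attrs(Model()))  # -> ['_lock']
```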

EDIT: Here is a functioning workaround for Python 2:

import pickle

def save_umap(umap):
    for attr in ["_tree_init", "_search", "_random_init"]:
        if hasattr(umap, attr):
            delattr(umap, attr)
    return pickle.dumps(umap, pickle.HIGHEST_PROTOCOL)

def load_umap(s):
    umap = pickle.loads(s)
    from umap.nndescent import make_initialisations, make_initialized_nnd_search
    umap._random_init, umap._tree_init = make_initialisations(
        umap._distance_func, umap._dist_args
    )
    umap._search = make_initialized_nnd_search(
        umap._distance_func, umap._dist_args
    )
    return umap

import numpy as np
X = np.random.randn(5000, 16)
X_new = np.random.randn(100, 16)

from umap import UMAP
um = UMAP()
um.fit(X)
emb = um.transform(X_new)

pkl = save_umap(um)
um_new = load_umap(pkl) # no error!

emb_new = um_new.transform(X_new)

from umap.

KeithTheEE commented on April 28, 2024 3

So I had started on a different notebook approach, and decided to see it through to an alpha version. It uses scikit-learn's digits data, so it at least offers a different perspective. Like I said, it's an alpha/early draft version. There's plenty of points that I just got bored of writing instead of coding, but I'm going to go back to them soon. I also blinked and the documentation/code changed so I'll have to update that too. Here it is,
https://nbviewer.jupyter.org/github/CrakeNotSnowman/umapNotebooks/blob/master/UMAP%20Usage.ipynb

That said, I really like the notebook @Fil came up with, and @lmcinnes improved on, I think it offers a better intro to UMAP.


josephcourtney commented on April 28, 2024 3

Unless I am not understanding something, pickling seems to work fine, at least on the current main branch. Here is a simple example that shows pickling and unpickling of a trained model, even with a custom metric. Note: if you unpickle a model with a custom metric, that metric must already be defined in that same file; the pickle only contains a reference to the metric function.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import umap
import pickle


digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    stratify=digits.target,
    random_state=42
)


def mydist(x, y):
    return np.max(np.abs(x - y))


trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric=mydist
).fit(X_train)
plt.scatter(trans.embedding_[:, 0], trans.embedding_[:, 1], s=5, c=y_train, cmap='Spectral')
plt.title('Embedding of the training set by UMAP', fontsize=24)
plt.show()
plt.close()


with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)

with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)


test_embedding = trans.transform(X_test)
plt.scatter(test_embedding[:, 0], test_embedding[:, 1], s=5, c=y_test, cmap='Spectral')
plt.title('Embedding of the test set by UMAP', fontsize=24)
plt.show()
plt.close()
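The note about custom metrics follows from how pickle treats functions: it stores the module and name, not the code. A small sketch of that behaviour, using `statistics.mean` from the standard library as a stand-in for a metric function (the choice of function is only for illustration):

```python
import pickle
import statistics

# Pickling a module-level function records where to find it, not its body.
payload = pickle.dumps(statistics.mean, pickle.HIGHEST_PROTOCOL)

# The payload contains the module and function names as plain bytes.
print(b"statistics" in payload, b"mean" in payload)  # -> True True

# Unpickling imports the module and looks the name up again,
# so the result is the very same function object.
restored = pickle.loads(payload)
print(restored is statistics.mean)  # -> True
```

This is why a model fitted with `metric=mydist` can only be loaded where `mydist` is importable under the same module and name.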


lmcinnes commented on April 28, 2024 2

Go ahead and pull whatever you need. It's helpful if you can explore the parameter effects in a little detail. Kyle McDonald had some nice min_dist comparisons here https://twitter.com/kcimc/status/930180473262919685 . Exploring some of the other effects similarly as well (metric, n_components, spread, n_neighbors) would be beneficial. But certainly any contributions are welcome.


nawafmo commented on April 28, 2024 2

I am thinking about using UMAP as a feature extraction method for an IDS project.

Is it a good idea? Has anybody done this before?


Fil commented on April 28, 2024 1

Wow! I wanted to do the hue and HSL metrics, but didn't think they would turn out that splendid. Thank you! For the credit you should remove "excellent", and can add "for visionscarto.net" after my name for affiliation.

I'm preparing another example, will follow up when it's ready :)


lmcinnes commented on April 28, 2024 1

I have a colleague who is working on that -- there's some underlying theory to be worked through, but I believe the core ideas are now all in place. The essence of the idea is this: word2vec can be viewed as (in the limit) a matrix factorization problem, which is to say similar to PCA. It should be possible to use manifold learning like UMAP to do the embedding rather than something linear like PCA. Ideally this should capture word similarity better, at the cost that word algebra will no longer work.

The details are in what data to embed (something based on a word-word co-occurrence matrix), how to measure distance (negative log likelihoods under a suitable model), and how to interpret the theory around all of that. Progress is being made, but it may be a little while before anything releasable happens.


lmcinnes commented on April 28, 2024 1

I'm forgoing exact knn-graph methods as most are too slow on high dimensional data. I agree that nmslib is impressive but for this project I was hoping to keep the dependencies relatively self-contained. Right now I'm using my own python based implementation of NN-descent (for which kgraph is the reference implementation). The advantages of NN-descent are that it is non-metric space based (just like nmslib), and can be used for direct approximate knn-graph construction rather than building an index and then querying.

If someone else wanted to build an optimized UMAP on top of nmslib I would certainly be interested to see it -- it would likely outperform this version due to the parallelism (presuming a suitably parallelised version of the SGD for layout was paired with it).
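To make "direct approximate knn-graph construction" concrete, here is a deliberately simplified sketch in the spirit of NN-descent. It is not the implementation used in UMAP or kgraph: it only follows forward neighbors-of-neighbors, with no reverse neighbors or sampling, and the function names and parameters are assumptions made for illustration. The only thing it needs from the data is a distance function, which is what makes the approach usable outside metric spaces.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance; any dissimilarity function could be swapped in."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nn_descent(points, k, n_iters=10, seed=0):
    """Approximate k-nearest-neighbor graph via neighbor-of-neighbor refinement."""
    rng = random.Random(seed)
    n = len(points)
    # Start from k random neighbors per point, kept sorted as (distance, index).
    graph = []
    for i in range(n):
        others = rng.sample([j for j in range(n) if j != i], k)
        graph.append(sorted((sq_dist(points[i], points[j]), j) for j in others))
    for _ in range(n_iters):
        updates = 0
        for i in range(n):
            current = {j for _, j in graph[i]}
            # Candidates are the neighbors of i's current neighbors.
            candidates = {c for j in current for _, c in graph[j]} - current - {i}
            for c in candidates:
                d = sq_dist(points[i], points[c])
                if d < graph[i][-1][0]:
                    # Replace the current worst neighbor and restore the ordering.
                    graph[i][-1] = (d, c)
                    graph[i].sort()
                    updates += 1
        if updates == 0:
            break  # no list changed, so further passes would do nothing
    return [[j for _, j in nbrs] for nbrs in graph]


demo_rng = random.Random(1)
demo_points = [(demo_rng.random(), demo_rng.random()) for _ in range(60)]
demo_graph = nn_descent(demo_points, k=5)
print(demo_graph[0])  # indices of the 5 approximate nearest neighbors of point 0
```

Note that the graph is built in place from the data itself; there is no separate index-then-query step.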


ghannum commented on April 28, 2024 1

It would be useful to have a way to save the UMAP model to a file for transforming future data into the same space. What would it take to get save/load functions?


lmcinnes commented on April 28, 2024 1

@ghannum Okay, thanks, I'll try to look into this at some point. At the very least I'll add it to the roadmap.


david4096 commented on April 28, 2024 1

@lmcinnes very interested in that feature!


profwacko commented on April 28, 2024 1

@bccho thanks for this fix, been running into the problem with larger training sets with joblib and pickle for the past week. Needed to use python 2.7 specifically. Hopefully this functionality gets added into UMAP soon.

Same error as above:

[path omitted]/lib/python2.7/site-packages/funcsigs/__init__.pyc in __new__(self, *args, **kwargs)
    199     def __new__(self, *args, **kwargs):
    200         obj = int.__new__(self, *args)
--> 201         obj._name = kwargs['name']
    202         return obj
    203 

KeyError: 'name'


lmcinnes commented on April 28, 2024 1

Unfortunately I can't share details. Sorry.


lefnire commented on April 28, 2024 1

Couldn't get the pickle.dumps/loads workaround to work (python3.8).

    man = self._unserialize_umap(man)
  File "/app/jwtauthtest/autoencoder.py", line 223, in _unserialize_umap
    umap = pickle.loads(s)
  File "/usr/local/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1028, in __setstate__
    self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
  File "/usr/local/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1028, in <listcomp>
    self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
  File "/usr/local/lib/python3.8/site-packages/pynndescent/rp_trees.py", line 1178, in renumbaify_tree
    hyperplanes.extend(tree.hyperplanes)
  File "/usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py", line 366, in extend
    return _extend(self, iterable)
  File "/usr/local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 415, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/usr/local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 358, in error_rewrite
    reraise(type(e), e, None)
  File "/usr/local/lib/python3.8/site-packages/numba/core/utils.py", line 80, in reraise
    raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_extend at 0x7feb5058daf0>) found for signature:

 >>> impl_extend(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C)))

There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'impl_extend': File: numba/typed/listobject.py: Line 1027.
    With argument(s): '(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C)))':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   - Resolution failure for literal arguments:
   No implementation of function Function(<function impl_append at 0x7feb5058d280>) found for signature:

    >>> impl_append(ListType[array(float64, 2d, C)], array(float32, 1d, C))

   There are 2 candidate implementations:
     - Of which 2 did not match due to:
     Overload in function 'impl_append': File: numba/typed/listobject.py: Line 589.
       With argument(s): '(ListType[array(float64, 2d, C)], array(float32, 1d, C))':
      Rejected as the implementation raised a specific error:
        LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)


      File "../usr/local/lib/python3.8/site-packages/numba/typed/listobject.py", line 597:
          def impl(l, item):
              casteditem = _cast(item, itemty)
              ^

      During: lowering "$8call_function.3 = call $2load_global.0(item, $6load_deref.2, func=$2load_global.0, args=[Var(item, listobject.py:597), Var($6load_deref.2, listobject.py:597)], kws=(), vararg=None)" at /usr/local/lib/python3.8/site-packages/numba/typed/listobject.py (597)
     raised from /usr/local/lib/python3.8/site-packages/numba/core/utils.py:81

   - Resolution failure for non-literal arguments:
   None

   During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[array(float64, 2d, C)])
   During: typing of call at /usr/local/lib/python3.8/site-packages/numba/typed/listobject.py (1051)


   File "../usr/local/lib/python3.8/site-packages/numba/typed/listobject.py", line 1051:
               def impl(l, iterable):
                   <source elided>
                   for i in iterable:
                       l.append(i)
                       ^

  raised from /usr/local/lib/python3.8/site-packages/numba/core/typeinfer.py:994

- Resolution failure for non-literal arguments:
None

During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'extend') for ListType[array(float64, 2d, C)])
During: typing of call at /usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py (101)


File "../usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py", line 101:
def _extend(l, iterable):
    return l.extend(iterable)

Also tried dill.dump/load, same error (maybe I need to dump/load_session? Not sure how that might interfere with the rest of the environment, as this is shared with server code). I'll shelve umap for my project and subscribe here in case the roadmap sees some love.


Fil commented on April 28, 2024

Here's the smallest notebook I could think of for basic usage demonstration
https://nbviewer.jupyter.org/gist/Fil/cce232583907035b65686cdec7d4cc92


lmcinnes commented on April 28, 2024

Thanks! I was hoping to have some further description in Markdown in the notebook, but this is an excellent beginning.


KeithTheEE commented on April 28, 2024

Would you mind if we pulled direct quotes from your README for the notebook ('basic usage demonstration' and 'explaining parameter options and their effects')?

I'm also currently wrapping the two ideas as one notebook, with a basic usage section at the top, and more in-depth information after that. Thoughts?


Fil commented on April 28, 2024

Here's another version, exploring some of the parameters
https://nbviewer.jupyter.org/gist/Fil/5c48475e88a0e1a8f56eaadaebff0544


lmcinnes commented on April 28, 2024

I love the metric exploration! The custom metrics nicely show off what can be done, and the effects (e.g. the pure red metric has a clear linear embedding etc.)


lmcinnes commented on April 28, 2024

@Fil I have added in a version of your parameter exploration notebook (with some minor changes and added text commentary and explanation) in the notebooks directory. Have a look and let me know if it looks okay to you. I really appreciate your work on this, so let me know how you would like to be acknowledged within the notebook.


lmcinnes commented on April 28, 2024

@CrakeNotSnowman That looks great! To be honest more intros are good, especially if they come from different perspectives, as this one does. There are some really interesting results in there.

Sorry about the code and documentation changes; I'm a tinkerer and I can't help it.

I definitely look forward to seeing this with any further expository writing.


loretoparisi commented on April 28, 2024

What about UMAP for text data (similar to word2vec)?


gokceneraslan commented on April 28, 2024

Are you also planning to explore other exact and approximate k-nn graph methods? nmslib is a super fast parallelized implementation with a plethora of knn methods.


lmcinnes commented on April 28, 2024

@ghannum I admit that I had been hoping that the standard methods for model persistence in sklearn (pickling etc.) would handle this -- is that not working with UMAP, or are you looking for something a little different than what it would provide? This isn't really my area of expertise, so you'll have to excuse my lack of knowledge here.


ghannum commented on April 28, 2024

@lmcinnes I tried to pickle the model file, but pickling only works for data objects - not classes. I believe the correct approach would be to write one function which puts all of the relevant model data into a list and pickles the list. Then write a load function which loads the pickled data and constructs the model object.


lmcinnes commented on April 28, 2024

Thanks for testing that out. I've been learning a little about potential issues in pickling in pynndescent and it seems that ``pickle.HIGHEST_PROTOCOL`` is the key point here if you are using python2 -- by default pickle in python2 uses a different protocol that may not support pickling UMAP well.
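For reference, a short sketch of how the protocol is chosen. Under Python 2 the default is protocol 0 and the highest is 2; under Python 3 both values depend on the interpreter version, so the snippet prints them rather than assuming particular numbers. The dictionary being pickled is just placeholder data.

```python
import pickle

params = {"n_neighbors": 5, "metric": "euclidean"}  # placeholder data

# Without an explicit protocol, pickle uses pickle.DEFAULT_PROTOCOL.
default_bytes = pickle.dumps(params)
# Passing HIGHEST_PROTOCOL asks for the newest format this interpreter supports.
highest_bytes = pickle.dumps(params, protocol=pickle.HIGHEST_PROTOCOL)

print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# From protocol 2 onwards, the second byte of a pickle records the protocol used.
print(highest_bytes[1] == pickle.HIGHEST_PROTOCOL)  # -> True

# Both payloads load back to the same object; only the encoding differs.
assert pickle.loads(default_bytes) == pickle.loads(highest_bytes) == params
```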

…

On Fri, Jul 20, 2018 at 1:02 PM Joseph Courtney wrote:
Unless I am not understanding something, pickling seems to work fine, at least on the current main branch. Here is a simple example that shows pickling and unpickling of a trained model, even with a custom metric. Note: if you unpickle a model with a custom metric, that metric must already be defined in that same file; the pickle only contains a reference to the metric function.


bccho commented on April 28, 2024

@josephcourtney's example fails when the training data is larger.

No error:

X = np.random.randn(4000, 48)
trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric="euclidean"
).fit(X)
with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)

Error:

X = np.random.randn(5000, 48)
trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric="euclidean"
).fit(X)
with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)

Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-140-1981a7cd4080> in <module>()
      2     pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
      3 with open('trans.pkl', 'rb') as f:
----> 4     trans = pickle.load(f)
      5

/usr/lib/python2.7/pickle.pyc in load(file)
   1382
   1383 def load(file):
-> 1384     return Unpickler(file).load()
   1385
   1386 def loads(str):

/usr/lib/python2.7/pickle.pyc in load(self)
    862             while 1:
    863                 key = read(1)
--> 864                 dispatch[key](self)
    865         except _Stop, stopinst:
    866             return stopinst.value

/usr/lib/python2.7/pickle.pyc in load_newobj(self)
   1087         args = self.stack.pop()
   1088         cls = self.stack[-1]
-> 1089         obj = cls.__new__(cls, *args)
   1090         self.stack[-1] = obj
   1091     dispatch[NEWOBJ] = load_newobj

[path omitted]/lib/python2.7/site-packages/funcsigs/__init__.py in __new__(self, *args, **kwargs)
    199     def __new__(self, *args, **kwargs):
    200         obj = int.__new__(self, *args)
--> 201         obj._name = kwargs['name']
    202         return obj
    203

KeyError: 'name'


lmcinnes commented on April 28, 2024

@bccho : That's a little disconcerting. It seems to be some sort of issue with pickle storing certain objects. At 4096 there is a switch in how knn computation is handled, so that may be responsible, but it is entirely unclear to me where in the whole process this is going astray. It must be in some subobjects of the basic UMAP class, so is likely an issue for those objects in general (scipy sparse matrices perhaps?). I'm away for a few days but I'll try to look into it when I get back. If there is any chance you can switch to python3 that will resolve the issue, but I understand that that is not always an option.


lmcinnes commented on April 28, 2024

Sorry I'm currently on vacation. I stole a little time, but I won't be able to look into this properly until Monday.

…

On Wed, Aug 29, 2018 at 3:08 PM Byung-Cheol Cho wrote:
Unfortunately I have no control over moving to python 3 (as much as I would like to), but for a workaround, I can try saving individual subobjects to files and re-loading them. Can you indicate what subobjects and parameters are required for transform to work correctly?


lmcinnes commented on April 28, 2024

Thanks for finding a workaround! It looks like it was the numba-jitted functions that were not pickling properly, at least under 2.7. I'll have to see if I can figure out a more permanent solution.


bccho commented on April 28, 2024

I think that was the problem too. You could probably put the save_umap and load_umap code in as part of __getstate__ and __setstate__.
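A minimal sketch of that pattern, assuming nothing about UMAP's internals: `Transformer` is a hypothetical class, and a `threading.Lock` plays the role of the attributes that cannot be pickled. `__getstate__` drops them before serialization and `__setstate__` rebuilds them on load, so callers can use plain `pickle.dumps` and `pickle.loads`.

```python
import pickle
import threading


class Transformer:
    # Names of attributes that cannot be pickled but can be rebuilt (assumed).
    _TRANSIENT = ("_lock",)

    def __init__(self, scale):
        self.scale = scale
        self._lock = threading.Lock()

    def __getstate__(self):
        # Copy the instance dict so the live object keeps its attributes.
        state = self.__dict__.copy()
        for name in self._TRANSIENT:
            state.pop(name, None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Recreate whatever was dropped in __getstate__.
        self._lock = threading.Lock()

    def transform(self, x):
        with self._lock:
            return x * self.scale


if __name__ == "__main__":
    payload = pickle.dumps(Transformer(3), pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(payload)
    print(restored.transform(2))  # -> 6
```

Compared with separate save and load helpers, this keeps the workaround inside the class, so tools that pickle internally (joblib, for instance) benefit without any changes on the caller's side.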


lmcinnes commented on April 28, 2024

That makes sense. I'll add it to my todo list. Thanks.

…

On Sat, Sep 8, 2018 at 2:48 PM Byung-Cheol Cho wrote:
I think that was the problem too. You could probably put the save_umap and load_umap code in as part of __getstate__ and __setstate__.


stefan-jansen commented on April 28, 2024

I was able to persist umap objects using the pickle extension dill under Python 3.6.


lmcinnes commented on April 28, 2024

It is worth trying, but a lot will depend on the nature of your data. I have seen UMAP used for IDS projects, though usually more as part of an exploratory tool rather than a production pipeline.


nawafmo commented on April 28, 2024

It is worth trying, but a lot will depend on the nature of your data. I have seen UMAP used for IDS projects, though usually more as part of an exploratory tool rather than a production pipeline.

Can you share some of these projects with me?


loretoparisi commented on April 28, 2024

@lmcinnes my two cents is that the issue with umap is the use case. I see that a lot of people do not know what the advantage of using umap over t-sne or pca is...
