Comments (36)
Unfortunately I have no control over moving to python 3 (as much as I would like to), but for a workaround, I can try saving individual subobjects to files and re-loading them.
Can you indicate what subobjects and parameters are required for transform to work correctly?
EDIT: After iterating through individual attributes from dir(trans), it looks like _random_init, _search, and _tree_init are the culprits. They are all instances of @numba.njit called on nested functions, but using dill didn't resolve the problem, and it seems they are necessary for transform.
EDIT: Here is a functioning workaround for Python 2:
import pickle

def save_umap(umap):
    # Drop the numba-jitted attributes that fail to pickle.
    # Note: this mutates the object that is passed in.
    for attr in ["_tree_init", "_search", "_random_init"]:
        if hasattr(umap, attr):
            delattr(umap, attr)
    return pickle.dumps(umap, pickle.HIGHEST_PROTOCOL)

def load_umap(s):
    umap = pickle.loads(s)
    # Rebuild the jitted functions that were removed before pickling.
    from umap.nndescent import make_initialisations, make_initialized_nnd_search
    umap._random_init, umap._tree_init = make_initialisations(
        umap._distance_func, umap._dist_args
    )
    umap._search = make_initialized_nnd_search(
        umap._distance_func, umap._dist_args
    )
    return umap

import numpy as np
X = np.random.randn(5000, 16)
X_new = np.random.randn(100, 16)

from umap import UMAP
um = UMAP()
um.fit(X)
emb = um.transform(X_new)

pkl = save_umap(um)
um_new = load_umap(pkl)  # no error!
emb_new = um_new.transform(X_new)
So I had started on a different notebook approach, and decided to see it through to an alpha version. It uses scikit-learn's digits data, so it at least offers a different perspective. Like I said, it's an alpha/early draft version. There are plenty of places where I got bored of writing and went back to coding, but I'm going to return to them soon. I also blinked and the documentation/code changed, so I'll have to update that too. Here it is:
https://nbviewer.jupyter.org/github/CrakeNotSnowman/umapNotebooks/blob/master/UMAP%20Usage.ipynb
That said, I really like the notebook @Fil came up with and @lmcinnes improved on; I think it offers a better intro to UMAP.
Unless I am not understanding something, pickling seems to work fine, at least on the current main branch. Here is a simple example that shows pickling and unpickling of a trained model, even with a custom metric. Note: if you unpickle a model with a custom metric, that metric must already be defined in that same file; the pickle only contains a reference to the metric function.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import umap
import pickle

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    stratify=digits.target,
    random_state=42
)

# Custom metric (maximum absolute coordinate difference)
def mydist(x, y):
    return np.max(np.abs(x - y))

trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric=mydist
).fit(X_train)

plt.scatter(trans.embedding_[:, 0], trans.embedding_[:, 1], s=5, c=y_train, cmap='Spectral')
plt.title('Embedding of the training set by UMAP', fontsize=24)
plt.show()
plt.close()

# Round-trip the fitted model through a pickle file
with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)

test_embedding = trans.transform(X_test)
plt.scatter(test_embedding[:, 0], test_embedding[:, 1], s=5, c=y_test, cmap='Spectral')
plt.title('Embedding of the test set by UMAP', fontsize=24)
plt.show()
plt.close()
Go ahead and pull whatever you need. It's helpful if you can explore the parameter effects in a little detail. Kyle McDonald had some nice min_dist comparisons here: https://twitter.com/kcimc/status/930180473262919685 . Exploring the effects of some of the other parameters (metric, n_components, spread, n_neighbors) in a similar way would be beneficial. But certainly any contributions are welcome.
I am thinking about using UMAP as a feature extraction method for an IDS project.
Is it a good idea? Has anybody done this before?
Wow! I wanted to do the hue and HSL metrics, but didn't think they would turn out that splendid. Thank you! For the credit you should remove "excellent", and can add "for visionscarto.net" after my name for affiliation.
I'm preparing another example, will follow up when it's ready :)
I have a colleague who is working on that -- there's some underlying theory to be worked through, but I believe the core ideas are now all in place. The essence of the idea is this: word2vec can be viewed as (in the limit) a matrix factorization problem, which is to say similar to PCA. It should be possible to use manifold learning like UMAP to do the embedding rather than something linear like PCA. Ideally this should capture word similarity better, at the cost that word algebra will no longer work.
The details are in what data to embed (something based on a word-word co-occurrence matrix), how to measure distance (negative log likelihoods under a suitable model), and how to interpret the theory around all of that. Progress is being made, but it may be a little while before anything releasable happens.
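To make the linear baseline mentioned above concrete, here is a minimal sketch, not the method under development. It assumes a made-up toy corpus and a symmetric window of one word, builds a word-word co-occurrence matrix, converts it to positive pointwise mutual information, and factorizes that with a truncated SVD. In the idea described above, a manifold learner such as UMAP would take the place of the SVD step, and the distance would come from a suitable likelihood model rather than from PMI.

```python
import numpy as np

# Toy corpus and window size are illustrative assumptions only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]
window = 1

tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sent in tokens for word in sent})
index = {word: i for i, word in enumerate(vocab)}

# Symmetric word-word co-occurrence counts within the window.
counts = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, word in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[word], index[sent[j]]] += 1

# Positive pointwise mutual information: max(0, log(P(w, c) / (P(w) P(c)))).
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log(counts * total / (row * col))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Linear embedding: keep the top singular directions (the PCA-like step).
n_dims = 2
u, s, _ = np.linalg.svd(ppmi)
word_vectors = u[:, :n_dims] * np.sqrt(s[:n_dims])
```

Each row of word_vectors is a two-dimensional vector for the corresponding entry of vocab.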
I'm forgoing exact knn-graph methods as most are too slow on high dimensional data. I agree that nmslib is impressive but for this project I was hoping to keep the dependencies relatively self-contained. Right now I'm using my own python based implementation of NN-descent (for which kgraph is the reference implementation). The advantages of NN-descent are that it is non-metric space based (just like nmslib), and can be used for direct approximate knn-graph construction rather than building an index and then querying.
If someone else wanted to build an optimized UMAP on top of nmslib I would certainly be interested to see it -- it would likely outperform this version due to the parallelism (presuming a suitably parallelised version of the SGD for layout was paired with it).
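For readers unfamiliar with NN-descent, the following is a heavily simplified sketch of the core idea: start from a random neighbour graph and repeatedly improve it by checking neighbours of neighbours. It is not the implementation used in UMAP or kgraph; it leaves out reverse neighbours, sampling, and all performance work, and the function name and parameters are hypothetical. It does show the two properties mentioned above: any distance callable can be used, and the knn-graph is built directly without a separate index.

```python
import numpy as np

def nn_descent_sketch(data, k, dist, n_iters=5, seed=0):
    """Approximate knn-graph via neighbour-of-neighbour refinement (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Start from k random neighbours for each point (never the point itself).
    neighbors = np.empty((n, k), dtype=np.int64)
    for i in range(n):
        others = np.delete(np.arange(n), i)
        neighbors[i] = rng.choice(others, size=k, replace=False)
    for _ in range(n_iters):
        updated = False
        for i in range(n):
            # Candidates: current neighbours plus their neighbours.
            candidates = set(neighbors[i])
            for j in neighbors[i]:
                candidates.update(neighbors[j])
            candidates.discard(i)
            cand = np.array(sorted(candidates))
            d = np.array([dist(data[i], data[c]) for c in cand])
            # Keep the k closest candidates found so far.
            best = cand[np.argsort(d, kind="stable")[:k]]
            if set(best) != set(neighbors[i]):
                updated = True
            neighbors[i] = best
        if not updated:
            break  # No list changed in a full pass.
    return neighbors

# Usage with a non-Euclidean distance on random data.
points = np.random.default_rng(1).normal(size=(60, 3))
chebyshev = lambda x, y: np.max(np.abs(x - y))
graph = nn_descent_sketch(points, k=4, dist=chebyshev)
```

Because each point keeps its best k candidates out of a set that always includes its current neighbours, its neighbour distances never get worse from one pass to the next.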
It would be useful to have a way to save the UMAP model to a file for transforming future data into the same space. What would it take to get save/load functions?
@ghannum Okay, thanks, I'll try to look into this at some point. At the very least I'll add it to the roadmap.
@lmcinnes very interested in that feature!
@bccho thanks for this fix. I have been running into the problem with larger training sets with joblib and pickle for the past week, and I need to use python 2.7 specifically. Hopefully this functionality gets added into UMAP soon.
Same error as above:
[path omitted]/lib/python2.7/site-packages/funcsigs/__init__.pyc in __new__(self, *args, **kwargs)
199 def __new__(self, *args, **kwargs):
200 obj = int.__new__(self, *args)
--> 201 obj._name = kwargs['name']
202 return obj
203
KeyError: 'name'
Unfortunately I can't share details. Sorry.
Couldn't get the pickle.dumps/loads workaround to work (python3.8).
man = self._unserialize_umap(man)
File "/app/jwtauthtest/autoencoder.py", line 223, in _unserialize_umap
umap = pickle.loads(s)
File "/usr/local/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1028, in __setstate__
self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
File "/usr/local/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1028, in <listcomp>
self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
File "/usr/local/lib/python3.8/site-packages/pynndescent/rp_trees.py", line 1178, in renumbaify_tree
hyperplanes.extend(tree.hyperplanes)
File "/usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py", line 366, in extend
return _extend(self, iterable)
File "/usr/local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 415, in _compile_for_args
error_rewrite(e, 'typing')
File "/usr/local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 358, in error_rewrite
reraise(type(e), e, None)
File "/usr/local/lib/python3.8/site-packages/numba/core/utils.py", line 80, in reraise
raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_extend at 0x7feb5058daf0>) found for signature:
>>> impl_extend(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C)))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'impl_extend': File: numba/typed/listobject.py: Line 1027.
With argument(s): '(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C)))':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_append at 0x7feb5058d280>) found for signature:
>>> impl_append(ListType[array(float64, 2d, C)], array(float32, 1d, C))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'impl_append': File: numba/typed/listobject.py: Line 589.
With argument(s): '(ListType[array(float64, 2d, C)], array(float32, 1d, C))':
Rejected as the implementation raised a specific error:
LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
File "../usr/local/lib/python3.8/site-packages/numba/typed/listobject.py", line 597:
def impl(l, item):
casteditem = _cast(item, itemty)
^
During: lowering "$8call_function.3 = call $2load_global.0(item, $6load_deref.2, func=$2load_global.0, args=[Var(item, listobject.py:597), Var($6load_deref.2, listobject.py:597)], kws=(), vararg=None)" at /usr/local/lib/python3.8/site-packages/numba/typed/listobject.py (597)
raised from /usr/local/lib/python3.8/site-packages/numba/core/utils.py:81
- Resolution failure for non-literal arguments:
None
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[array(float64, 2d, C)])
During: typing of call at /usr/local/lib/python3.8/site-packages/numba/typed/listobject.py (1051)
File "../usr/local/lib/python3.8/site-packages/numba/typed/listobject.py", line 1051:
def impl(l, iterable):
<source elided>
for i in iterable:
l.append(i)
^
raised from /usr/local/lib/python3.8/site-packages/numba/core/typeinfer.py:994
- Resolution failure for non-literal arguments:
None
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'extend') for ListType[array(float64, 2d, C)])
During: typing of call at /usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py (101)
File "../usr/local/lib/python3.8/site-packages/numba/typed/typedlist.py", line 101:
def _extend(l, iterable):
return l.extend(iterable)
Also tried dill.dump/load, same error (maybe I need to dump/load_session? Not sure how that might interfere with the rest of the environment, as this is shared with server code). I'll shelve umap for my project and subscribe here in case the roadmap sees some love.
Here's the smallest notebook I could think of for basic usage demonstration
https://nbviewer.jupyter.org/gist/Fil/cce232583907035b65686cdec7d4cc92
Thanks! I was hoping to have some further description in Markdown in the notebook, but this is an excellent beginning.
Would you mind if we pulled direct quotes from your README for the notebook ('basic usage demonstration' and 'explaining parameter options and their effects')?
I'm also currently wrapping the two ideas as one notebook, with a basic usage section at the top, and more in-depth information after that. Thoughts?
Here's another version, exploring some of the parameters:
https://nbviewer.jupyter.org/gist/Fil/5c48475e88a0e1a8f56eaadaebff0544
I love the metric exploration! The custom metrics nicely show off what can be done, and the effects (e.g. the pure red metric has a clear linear embedding etc.)
@Fil I have added in a version of your parameter exploration notebook (with some minor changes and added text commentary and explanation) in the notebooks directory. Have a look and let me know if it looks okay to you. I really appreciate your work on this, so let me know how you would like to be acknowledged within the notebook.
@CrakeNotSnowman That looks great! To be honest more intros are good, especially if they come from different perspectives, as this one does. There are some really interesting results in there.
Sorry about the code and documentation changes; I'm a tinkerer and I can't help it.
I definitely look forward to seeing this with any further expository writing.
What about UMAP for text data (similar to word2vec)?
Are you also planning to explore other exact and approximate k-nn graph methods? nmslib is a super fast parallelized implementation with a plethora of knn methods.
@ghannum I admit that I had been hoping that the standard methods for model persistence in sklearn (pickling etc.) would handle this -- is that not working with UMAP, or are you looking for something a little different than what it would provide? This isn't really my area of expertise, so you'll have to excuse my lack of knowledge here.
@lmcinnes I tried to pickle the model file, but pickling only works for data objects - not classes. I believe the correct approach would be to write one function which puts all of the relevant model data into a list and pickles the list. Then write a load function which loads the pickled data and constructs the model object.
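A rough sketch of that save/load pattern follows. TinyModel is a hypothetical stand-in, not the UMAP class, and the attribute names are invented for illustration. The sketch uses a dict rather than a list so that each value keeps its name. Only plain data is pickled, and the load function constructs a fresh object and fills it in.

```python
import pickle

class TinyModel:
    """Hypothetical stand-in for a fitted model; not the UMAP class."""

    def __init__(self, n_neighbors=15):
        self.n_neighbors = n_neighbors
        self.embedding_ = None

    def fit(self, rows):
        # Placeholder "training": store the per-row mean as a 1-D embedding.
        self.embedding_ = [sum(r) / len(r) for r in rows]
        return self

def save_model(model):
    """Collect only plain data from the model and pickle that."""
    state = {"n_neighbors": model.n_neighbors, "embedding_": model.embedding_}
    return pickle.dumps(state, pickle.HIGHEST_PROTOCOL)

def load_model(blob):
    """Rebuild a model object from the pickled plain data."""
    state = pickle.loads(blob)
    model = TinyModel(n_neighbors=state["n_neighbors"])
    model.embedding_ = state["embedding_"]
    return model

fitted = TinyModel(n_neighbors=5).fit([[1, 2, 3], [4, 5, 6]])
restored = load_model(save_model(fitted))
```

The trade-off is that every attribute needed later has to be listed by hand, so the save and load functions must be kept in sync with the class.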
@josephcourtney's example fails when the training data is larger.
No error:
import pickle
import numpy as np
import umap

X = np.random.randn(4000, 48)
trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric="euclidean"
).fit(X)
with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)
Error:
X = np.random.randn(5000, 48)
trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric="euclidean"
).fit(X)
with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)
Traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-140-1981a7cd4080> in <module>()
2 pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)
3 with open('trans.pkl', 'rb') as f:
----> 4 trans = pickle.load(f)
5
/usr/lib/python2.7/pickle.pyc in load(file)
1382
1383 def load(file):
-> 1384 return Unpickler(file).load()
1385
1386 def loads(str):
/usr/lib/python2.7/pickle.pyc in load(self)
862 while 1:
863 key = read(1)
--> 864 dispatch[key](self)
865 except _Stop, stopinst:
866 return stopinst.value
/usr/lib/python2.7/pickle.pyc in load_newobj(self)
1087 args = self.stack.pop()
1088 cls = self.stack[-1]
-> 1089 obj = cls.__new__(cls, *args)
1090 self.stack[-1] = obj
1091 dispatch[NEWOBJ] = load_newobj
[path omitted]/lib/python2.7/site-packages/funcsigs/__init__.py in __new__(self, *args, **kwargs)
199 def __new__(self, *args, **kwargs):
200 obj = int.__new__(self, *args)
--> 201 obj._name = kwargs['name']
202 return obj
203
KeyError: 'name'
@bccho : That's a little disconcerting. It seems to be some sort of issue with pickle storing certain objects. At 4096 there is a switch in how knn computation is handled, so that may be responsible, but it is entirely unclear to me where in the whole process this is going astray. It must be in some subobjects of the basic UMAP class, so is likely an issue for those objects in general (scipy sparse matrices perhaps?). I'm away for a few days but I'll try to look into it when I get back. If there is any chance you can switch to python3 that will resolve the issue, but I understand that that is not always an option.
Thanks for finding a workaround! It looks like it was the numba-jitted functions that were not pickling properly, at least under 2.7. I'll have to see if I can figure out a more permanent solution.
I think that was the problem too. You could probably put the save_umap and load_umap code in as part of __getstate__ and __setstate__.
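As a sketch of what that could look like, here is the general pattern on a hypothetical Model class (not the UMAP class; make_search and the attribute names are invented for illustration). A nested function stands in for the numba-jitted helpers: __getstate__ drops it before pickling, and __setstate__ rebuilds it from the stored parameters.

```python
import pickle

def make_search(scale):
    """Build a nested function; plain pickle cannot serialize such closures."""
    def search(x):
        return x * scale
    return search

class Model:
    """Hypothetical class standing in for an estimator that holds compiled helpers."""

    def __init__(self, scale):
        self.scale = scale
        self._search = make_search(scale)

    def transform(self, x):
        return self._search(x)

    def __getstate__(self):
        # Drop the unpicklable helper; everything else is plain data.
        state = self.__dict__.copy()
        state.pop("_search", None)
        return state

    def __setstate__(self, state):
        # Restore the data, then rebuild the helper from it.
        self.__dict__.update(state)
        self._search = make_search(self.scale)

restored = pickle.loads(pickle.dumps(Model(3)))
```

Unlike deleting attributes in a separate save function, this leaves the original object usable after it has been pickled, because only the copied state dict is modified.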
I was able to persist umap objects using the pickle extension dill under Python 3.6.
It is worth trying, but a lot will depend on the nature of your data. I have seen UMAP used for IDS projects, though usually more as part of an exploratory tool rather than a production pipeline.
Can you share some of these projects with me?
@lmcinnes my two cents is that the issue with umap is the use case. I see that a lot of people do not know what the advantage of using umap is over t-SNE or PCA.