jlmelville / uwot
An R package implementing the UMAP dimensionality reduction method.
Home Page: https://jlmelville.github.io/uwot/
License: GNU General Public License v3.0
Hi,
If I run umap_transform after tumap, I get the following error:
16:25:52 Writing NN index file to temp file /tmp/RtmpAAapHI/file3b610733a05
Error in .External(list(name = "CppMethod__invoke_void", address = <pointer: (nil)>, :
NULL value passed as symbol address
whereas everything works if I use umap and then umap_transform, using the same data for both methods of course.
When trying init='agspectral', only 2 components are returned no matter what n_components is specified when calling umap.
Hello,
I am trying to install uwot on our RStudio Server and I am getting an error I cannot decipher (google cannot do that either...).
In file included from gradient.cpp:20:
gradient.h:30: error: ISO C++ forbids declaration of ‘constexpr’ with no type
gradient.h:30: error: expected ‘;’ before ‘double’
gradient.h:56: error: ISO C++ forbids declaration of ‘constexpr’ with no type
gradient.h:56: error: expected ‘;’ before ‘double’
gradient.h:64: error: ISO C++ forbids declaration of ‘constexpr’ with no type
gradient.h:64: error: expected ‘;’ before ‘double’
make: *** [gradient.o] Error 1
ERROR: compilation failed for package ‘uwot’
* removing ‘/home/myuser/R/x86_64-pc-linux-gnu-library/3.5/uwot’
Error in i.p(...) :
(converted from warning) installation of package ‘/tmp/Rtmp9wtvCO/file2accb4660569f/uwot_0.0.0.9008.tar.gz’ had non-zero exit status
Any idea of what can cause this?
RStudio is running R 3.5.1, and I installed Rcpp and RcppParallel without any issue. We might have an old-ish version of GCC (4.4.7), so that might be the issue, but I want to make sure before I go to war against our sysadmin.
Thanks for this implementation, really looking forward to having a native R/Rcpp implementation to use on my big datasets!
The package seems to install fine but then there is a problem loading the shared object. I am running this on R3.5. Do you know what this could be?
Error: package or namespace load failed for ‘uwot’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/uwot/libs/uwot.so':
dlopen(/Library/Frameworks/R.framework/Versions/3.5/Resources/library/uwot/libs/uwot.so, 6): Symbol not found: __ZN13umap_gradient8clip_maxE
Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/uwot/libs/uwot.so
Expected in: flat namespace
in /Library/Frameworks/R.framework/Versions/3.5/Resources/library/uwot/libs/uwot.so
Error: loading failed
Execution halted
ERROR: loading failed
Dear all,
I can't get umap to run twice in an R session without crashing. I initially observed this using RunUMAP from Seurat 3.1.1, but even the very basic code below does not work.
I have tried to resolve this using the hints in cole-trapnell-lab/monocle3#186 and satijalab/seurat#2256, but without any success.
Any suggestions?
Session and info:
library(uwot)
iris_umap <- umap(iris, pca = 50)
# And a second time
iris_umap2 <- umap(iris, pca = 50)
# Crash ....
#
# Bioconductor version [1] ‘3.10’
#
# R Under development (unstable) (2019-11-05 r77375)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
#
# Matrix products: default
#
# locale:
# [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C LC_TIME=English_United States.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] uwot_0.1.5 Matrix_1.2-17
#
# loaded via a namespace (and not attached):
# [1] compiler_4.0.0 tools_4.0.0 yaml_2.2.0 Rcpp_1.0.3 grid_4.0.0 FNN_1.1.3 RcppParallel_4.4.4
# [8] lattice_0.20-38
thanks in advance and with kind regards,
Aldo
see imminent pull request for details
Any plan to add a way to perform an inverse transform (from embedding to data space)?
Thanks for the great work!
Hi,
Thanks for your invaluable work proposing umap in a native R package. Thanks also for the smallvis package.
As you already integrated largeVis and proposed tumap, I am wondering if you planned to integrate HSNE someday. HDI is already integrated in interactive exploration tools such as cytosplore, but no R package is available. If you plan it, let me know.
Best.
I tried to run umap with the init parameter set to a matrix of PCA components generated by other software. But the initialisation failed and it ran with random initialisation instead.
Hi,
I have a matrix of size 174 x 76, and the last column contains 4 NAs. I thought uwot tolerated NAs in X, but I get this error message: "FNN::get.knn(X, k) : Data include NAs". I am using the following command:
umap1<-umap(DT[,28:102], n_neighbors = 10, learning_rate = 0.5, init = "lvrandom", scale = "Z", a=1, b=0.5, min_dist = 1, spread = 4) %>% as.data.table()
Is there any way to get around this NA issue?
Thanks.
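Two common ways to prepare such data before calling umap are sketched below in base R, on a made-up data frame standing in for DT[, 28:102]. The toy data, the variable names and the choice of median imputation are assumptions for illustration only; which option is appropriate depends on the data.

```r
# Toy data frame with 4 NAs in the last column (a stand-in for DT[, 28:102]).
set.seed(1)
X <- as.data.frame(matrix(rnorm(174 * 5), nrow = 174))
X[c(3, 50, 99, 120), 5] <- NA

# Option 1: drop every row that contains at least one NA.
keep <- complete.cases(X)
X_complete <- X[keep, , drop = FALSE]

# Option 2: replace each NA with the median of its column.
X_imputed <- X
for (j in seq_along(X_imputed)) {
  col <- X_imputed[[j]]
  col[is.na(col)] <- median(col, na.rm = TRUE)
  X_imputed[[j]] <- col
}

nrow(X_complete)        # rows left after dropping incomplete cases
sum(is.na(X_imputed))   # number of NAs left after imputation
```

Either X_complete or X_imputed could then be passed to umap in place of the original columns. With option 1, the logical vector keep records which original rows the embedding corresponds to.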
Hello!
I'm having some trouble running UMAP with one dimension due to an error that pops up saying
Error in optimize_layout_umap(head_embedding = embedding, tail_embedding = embedding, :
Not a matrix.
I'm assuming that this is because the data being passed into the function is an atomic vector rather than a matrix due to it being one dimension? Perhaps this would be solved by using something like drop = FALSE or something similar?
Thank you.
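The effect of drop = FALSE mentioned above can be seen in base R on a small matrix. Whether this is the exact place where uwot loses the matrix structure is an assumption; the sketch only shows the general behaviour of single-column subsetting.

```r
m <- matrix(1:6, nrow = 3, ncol = 2)

# Selecting a single column with the default drop = TRUE returns a plain vector.
v <- m[, 1]
is.matrix(v)
dim(v)

# With drop = FALSE the result stays a 3 x 1 matrix.
m1 <- m[, 1, drop = FALSE]
is.matrix(m1)
dim(m1)
```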
Hi @jlmelville, I'm hitting a compilation issue with uwot 0.1.4 on my RStudio Server Pro installation, running on Azure / RHEL 7.
Specifically with RStudio Server Pro, I'm seeing this failure:
install.packages("uwot")
g++ -std=gnu++11 -shared -L/usr/local/lib64/R/lib -L/usr/local/lib64 -o uwot.so RcppExports.o connected_components.o gradient.o nn_parallel.o optimize.o perplexity.o sampler.o smooth_knn.o supervised.o transform.o WARNING: ignoring environment value of R_HOME -L/usr/local/lib64/R/lib -lR
g++: error: WARNING:: No such file or directory
g++: error: ignoring: No such file or directory
g++: error: environment: No such file or directory
g++: error: value: No such file or directory
g++: error: of: No such file or directory
g++: error: R_HOME: No such file or directory
make: *** [uwot.so] Error 1
ERROR: compilation failed for package ‘uwot’
* removing ‘/data00/R/site-library/3.6/uwot’
* restoring previous ‘/data00/R/site-library/3.6/uwot’
Warning in install.packages(pkgs = doing, lib = lib, repos = repos, ...) :
installation of package ‘uwot’ had non-zero exit status
Calls: install ... <Anonymous> -> .install -> .install_repos -> install.packages
However, on another virtual machine running regular RStudio Server (RHEL 7), the package installs fine. The failure is due to the R_HOME variable not being detected as expected. Any tips on how to fix this?
I have training and testing datasets. I reduced the dimension of each one separately, and the SVM accuracy dropped. What should I do? Is it necessary to keep all the training data in a single matrix and reduce its dimension to extract the reduced features? Thanks
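One way to keep both sets in the same embedding space is to fit on the training data with ret_model = TRUE and then project the test data with umap_transform. The sketch below uses iris and an arbitrary 100/50 split as stand-ins; the split and the variable names are assumptions for illustration.

```r
library(uwot)

set.seed(42)
# Hypothetical split of iris into training and test rows.
idx <- sample(nrow(iris), 100)
train <- iris[idx, 1:4]
test <- iris[-idx, 1:4]

# Fit the embedding on the training data only and keep the model.
model <- umap(train, n_neighbors = 15, ret_model = TRUE)
train_embed <- model$embedding

# Project the test data into the embedding learned from the training data.
test_embed <- umap_transform(test, model)

dim(train_embed)
dim(test_embed)
```

The SVM would then be trained on train_embed and evaluated on test_embed, so that both come from the same fitted model rather than from two unrelated embeddings.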
In the space {0, 1}^n, Hamming and Manhattan metrics are equivalent.
However, if I calculate embeddings for such a binary dataset using the 'hamming' and 'manhattan' values of the metric= parameter provided by uwot, I get distinct results.
For example:
library("uwot")
set.seed(42)
frequencies <- c(0.1, 0.2)
size <- c(1000, 1000)
samples <- lapply(frequencies, function(f) matrix(rbinom(prod(size), 1, f), nrow=size[1], ncol=size[2]))
str(samples)
mat <- do.call(rbind, samples)
mat.umap_hamming <- umap(mat, metric='hamming')
mat.umap_manhattan <- umap(mat, metric='manhattan')
par(mfrow=c(2,1))
plot(mat.umap_hamming, main="Hamming metric", xlab="UMAP1", ylab="UMAP2")
plot(mat.umap_manhattan, main="Manhattan metric", xlab="UMAP1", ylab="UMAP2")
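The equivalence claimed above can be checked directly in base R on two binary vectors. This only verifies the mathematical statement (the Manhattan distance equals the number of differing positions, and the normalized Hamming distance is the same value divided by the length). It says nothing about how uwot or Annoy computes either metric internally.

```r
set.seed(42)
x <- rbinom(1000, 1, 0.1)
y <- rbinom(1000, 1, 0.2)

manhattan <- sum(abs(x - y))    # L1 distance
hamming_count <- sum(x != y)    # number of differing positions
hamming_frac <- mean(x != y)    # normalized Hamming distance

manhattan == hamming_count
all.equal(hamming_frac, manhattan / length(x))
```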
And here is an example from real data I was working with:
Hi,
Installation of uwot by install_github() or install() failed with the same error. Even R --vanilla failed to install it with the same error:
clang++ -std=gnu++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/opt/gettext/lib -L/usr/local/opt/readline/lib -L/usr/local/lib -L/usr/local/Cellar/r/3.6.1_1/lib/R/lib -L/usr/local/opt/gettext/lib -L/usr/local/opt/readline/lib -L/usr/local/lib -o uwot.so RcppExports.o connected_components.o gradient.o nn_parallel.o optimize.o perplexity.o sampler.o smooth_knn.o supervised.o transform.o All my packages loaded Tue Aug 27 23:05:12 2019 -L/usr/local/Cellar/r/3.6.1_1/lib/R/lib -lR -lintl -Wl,-framework -Wl,CoreFoundation
clang-8: error: no such file or directory: 'All'
clang-8: error: no such file or directory: 'my'
clang-8: error: no such file or directory: 'packages'
clang-8: error: no such file or directory: 'loaded'
clang-8: error: no such file or directory: 'Tue'
clang-8: error: no such file or directory: 'Aug'
clang-8: error: no such file or directory: '27'
clang-8: error: no such file or directory: '23:05:14'
clang-8: error: no such file or directory: '2019'
make: *** [uwot.so] Error 1
I am using OSX Mojave,
sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G87
gcc version (not apples);
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/9.2.0/libexec/gcc/x86_64-apple-darwin18/9.2.0/lto-wrapper
Target: x86_64-apple-darwin18
Configured with: ../configure --build=x86_64-apple-darwin18 --prefix=/usr/local/Cellar/gcc/9.2.0 --libdir=/usr/local/Cellar/gcc/9.2.0/lib/gcc/9 --disable-nls --enable-checking=release --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-9 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --with-pkgversion='Homebrew GCC 9.2.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-multilib --with-native-system-header-dir=/usr/include --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
Thread model: posix
gcc version 9.2.0 (Homebrew GCC 9.2.0)
clang -v
clang version 8.0.1 (tags/RELEASE_801/final)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /usr/local/Cellar/llvm/8.0.1/bin
Any help or pointer to fix this is really appreciated.
Hi James,
Great package, appreciate the efforts! It looks like the latest version is broken though?
umap(iris[,1:4])
results in the error:
Error in .Call(_uwot_smooth_knn_distances_parallel, nn_dist, nn_idx, : Incorrect number of arguments (10), expecting 9 for '_uwot_smooth_knn_distances_parallel'
I didn't have issues until today. Any idea about what might be causing this?
Cheers
This is a followup to issue #46.
The reproducibility issues described there have been fixed for me in 0.1.8 by using approx_pow = TRUE with a euclidean or manhattan metric, but I still face problems when using cosine.
Here's a result on my laptop (Ubuntu 18.04, R 3.6.3, uwot 0.1.8) :
> set.seed(13); head(uwot::umap(iris, metric = "cosine", init="spca", a=1, b=1, approx_pow=TRUE), 5)
[,1] [,2]
[1,] 2.190465 -14.45460
[2,] 2.153269 -11.64510
[3,] 2.337686 -14.14382
[4,] 1.191009 -12.59075
[5,] 1.472325 -15.06042
And here's the same thing on a server (CentOS 7, R 3.6.1, uwot 0.1.8) :
> set.seed(13); head(uwot::umap(iris, metric = "cosine", init="spca", a=1, b=1, approx_pow=TRUE), 5)
[,1] [,2]
[1,] -15.45597 -4.156313
[2,] -17.59474 -4.357967
[3,] -15.25843 -4.456960
[4,] -17.01195 -2.813276
[5,] -14.92331 -3.548293
The results are the same when run with metric = "euclidean".
I'm trying to install uwot in R 3.6.1 and getting an error message I can't debug. I was wondering if anyone has seen something like this before and can give me a pointer:
install.packages("uwot")
Installing package into ‘/usr/local/lib/R/host-site-library’
(as ‘lib’ is unspecified)
trying URL 'https://mran.microsoft.com/snapshot/2019-11-05/src/contrib/uwot_0.1.4.tar.gz'
Content type 'application/octet-stream' length 81262 bytes (79 KB)
==================================================
downloaded 79 KB
...
/bin/bash: -c: line 1: syntax error near unexpected token `('
/bin/bash: -c: line 1: ` echo g++ -std=gnu++11 -shared -L"/usr/local/lib/R/lib" -L/usr/local/lib -o uwot.so RcppExports.o connected_components.o gradient.o nn_parallel.o optimize.o perplexity.o sampler.o smooth_knn.o supervised.o transform.o > RcppParallel::RcppParallelLibs() > > -L"/usr/local/lib/R/lib" -lR; \'
/usr/local/lib/R/share/make/shlib.mk:6: recipe for target 'uwot.so' failed
make: *** [uwot.so] Error 1
ERROR: compilation failed for package ‘uwot’
* removing ‘/usr/local/lib/R/host-site-library/uwot’
Here's my sessionInfo:
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocManager_1.30.10
loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1
UMAP supports partial labeling (i.e. NA values) of a target array when performing supervised reduction.
And this is what uwot::umap() does when the 'y' argument is a character:
x <- iris
x$Species[sample.int(nrow(iris), 50)] <- NA
iris_umap <- umap(x[,-5], n_neighbors = 50, alpha = 0.5, init = "random", y = x$Species)
However, this fails in the case of a numeric 'y' argument that contains NA values:
x <- mtcars
x$mpg[sample.int(nrow(mtcars), 10)] <- NA
mtcars_umap <- umap(x[,-1], n_neighbors = 10, alpha = 0.5, init = "random", y = x$mpg)
Error in result[n_samples > 0] <- n_epochs/n_samples[n_samples > 0] :
NAs are not allowed in subscripted assignments
Perhaps this is the expected behavior (I admit I am not familiar with the details of the algorithm), but I wanted to confirm if this is the case or not.
Why does writing NN index file to temp take so long? Is it possible to speed it up?
merged is a large numeric matrix.
Input:
markers <- c(19:33,36:51,53,62)
sub <- merged[,]
library(uwot)
threads <- 32
umap_data <- umap(
sub[,markers],
n_neighbors = 15,
n_components = 2,
metric = "euclidean",
n_epochs = 1000,
learning_rate = 1,
scale = "z",
init = "spca",
init_sdev = NULL,
# spread = 5,
min_dist = 0.01,
set_op_mix_ratio = 1,
local_connectivity = 1,
bandwidth = 1,
repulsion_strength = 1,
negative_sample_rate = 5,
nn_method = "annoy",
# n_trees = 50,
approx_pow = FALSE,
pca = NULL,
pca_center = TRUE,
pcg_rand = TRUE,
fast_sgd = FALSE,
ret_model = FALSE,
ret_nn = FALSE,
n_threads = threads,
n_sgd_threads = threads,
grain_size = 1,
verbose = TRUE
)
Output:
19:37:09 Read 13869323 rows and found 33 numeric columns
19:37:09 Scaling to zero mean and unit variance
19:37:16 Kept 33 non-zero-variance columns
19:37:38 Building Annoy index with metric = euclidean, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:29:09 Writing NN index file to temp file C:\Users*\AppData\Local\Temp\Rtmp2D6YGc\file47d421456b98
It has been stuck on the last step for more than 15 hours. File size is about 1.8GB.
With 1e5 rows:
markers <- c(19:33,36:51,53,62)
sub <- merged[1:1e5,]
library(uwot)
threads = 64
umap_data <- umap(
sub[,markers],
n_neighbors = 15,
n_components = 2,
metric = "euclidean",
n_epochs = 500,
learning_rate = 1,
scale = "z",
init = "spca",
init_sdev = NULL,
# spread = 5,
min_dist = 0.01,
set_op_mix_ratio = 1,
local_connectivity = 1,
bandwidth = 1,
repulsion_strength = 1,
negative_sample_rate = 5,
nn_method = "annoy",
# n_trees = 50,
approx_pow = FALSE,
pca = NULL,
pca_center = TRUE,
pcg_rand = TRUE,
fast_sgd = FALSE,
ret_model = FALSE,
ret_nn = FALSE,
n_threads = threads,
n_sgd_threads = threads,
grain_size = 1,
verbose = TRUE
)
11:57:16 Read 100000 rows and found 33 numeric columns
11:57:16 Scaling to zero mean and unit variance
11:57:16 Kept 33 non-zero-variance columns
11:57:17 Building Annoy index with metric = euclidean, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
11:57:30 Writing NN index file to temp file C:\Users*\AppData\Local\Temp\Rtmpcf6MHg\file1c05eb14ee7
11:57:30 Searching Annoy index using 64 threads, search_k = 1500
12:00:22 Annoy recall = 100%
12:00:23 Commencing smooth kNN distance calibration using 64 threads
12:00:25 Initializing from PCA
12:00:25 PCA: 2 components explained 27.42% variance
12:00:25 Commencing optimization for 500 epochs, with 2325534 positive edges using 64 threads
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
12:00:53 Optimization finished
So with 1e5 rows, writing the NN index takes less than a second and the file size is 73MB.
With 1e6 rows:
12:02:21 Read 1000000 rows and found 33 numeric columns
12:02:21 Scaling to zero mean and unit variance
12:02:22 Kept 33 non-zero-variance columns
12:02:23 Building Annoy index with metric = euclidean, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
12:05:16 Writing NN index file to temp file C:\Users*\AppData\Local\Temp\Rtmpcf6MHg\file1c01733238f
12:05:17 Searching Annoy index using 64 threads, search_k = 1500
With 1e6 rows, writing NN index takes 1 second and file size is about 738MB.
Specs:
Samsung EVO SSD 1 TB
128 GB ECC RAM
AMD 2990WX
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] uwot_0.1.3 Matrix_1.2-17 foreach_1.4.4 flowCore_1.50.0
loaded via a namespace (and not attached):
[1] graph_1.62.0 Rcpp_1.0.1 cluster_2.0.8 BiocGenerics_0.30.0
[5] MASS_7.3-51.4 lattice_0.20-38 rrcov_1.4-7 pcaPP_1.9-73
[9] vizier_0.3 tools_3.6.0 parallel_3.6.0 grid_3.6.0
[13] Biobase_2.44.0 snow_0.4-3 corpcor_1.6.9 iterators_1.0.10
[17] matrixStats_0.54.0 RcppParallel_4.4.3 doSNOW_1.0.16 codetools_0.2-16
[21] robustbase_0.93-5 compiler_3.6.0 DEoptimR_1.0-8 stats4_3.6.0
[25] mvtnorm_1.0-10
Things I should fix, but which may need a major version change. To be edited and updated as I discover more hidden horrors.
The min_dist default is 0.01, but should be 0.1 for consistency with Python UMAP. Fortunately, this has no discernible effect on the output.
Should pca be set by default? If users attempt to throw very high dimensional data at uwot at the moment, they are in for a miserable time, because at best Annoy will take hours to complete. At worst, if they are using multi-threading (also a default), Annoy will fail on large datasets due to not being able to read back in an index larger in size than 2GB. I must get back to rnndescent and add rp tree support to provide a replacement/alternative.
Thanks for this great implementation.
To be fair, the species column should be removed from the example using iris, as it is the ground truth.
I've noticed that I'm not getting similar results from uwot as from the Python implementation.
Goal: Translate a pipeline from Python to R.
Problem: uwot behaves differently from umap (Python).
For reproducibility, I'm following this workflow in Python.
Prior to the supervised clustering with umap there are two steps: 1) simulating data with sklearn.datasets.make_classification() and 2) scaling with StandardScaler.fit_transform().
Rather than simulating data and scaling with R functions, let's use Python so that the inputs to uwot and Python's umap are identical.
First we simulate the data and scale it with Python. Let's enter the Python interpreter with repl_python(), then enter the following:
# importing relevant libraries
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification
from tqdm import tqdm
from umap import UMAP
from pynndescent import NNDescent
from fastcluster import single
from scipy.cluster.hierarchy import cut_tree, fcluster, dendrogram
from scipy.spatial.distance import squareform
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
# let us generate some data with 10 clusters per class
X, y = make_classification(n_samples=500000, n_features=200, n_informative=5,
n_redundant=0, n_clusters_per_class=10, weights=[0.80],
flip_y=0.05, class_sep=3.5, random_state=42)
# normalizing to eliminate scaling differences
X = pd.DataFrame(StandardScaler().fit_transform(X))
We're going to run Python's umap first, but we will do the plotting in ggplot2 just to show that it's not an issue with visualization.
# building supervised embedding with UMAP
sup_embed_umap = UMAP().fit_transform(X, y=y)
exit # exit the python interpreter
Now let's plot this in R:
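library(ggplot2)
# Pull the supervised embedding computed in Python (sup_embed_umap) into R
sup_embed_python <- py$sup_embed_umap
sup_embed_python <- as.data.frame(sup_embed_python)
sup_embed_python$labels <- py$y
# Convert labels to character so the manual colour scale is discrete
ggplot(sup_embed_python, aes(V1, V2, color = as.character(labels))) + geom_point() + scale_color_manual(values = c("#0000FF", "#ff0000"))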
Now let's do the same thing using the default for supervised dimension reduction with uwot:
sup_embed_R <- umap(py$X, y = py$y)
sup_embed_R <- as.data.frame(sup_embed_R)
sup_embed_R$labels <- py$y
ggplot(sup_embed_R, aes(V1, V2, color = as.character(labels))) + geom_point() + scale_color_manual(values = c("#0000FF", "#ff0000"))
The resulting image looks identical to calling uwot without supervision (i.e. without y = py$y)?
I read the "Python Comparison" document, which suggests using pca = 100 and min_dist = 0.1 within umap(). So I also tried this, but I don't see a similar result.
sup_embed_R <- umap(py$X, y = py$y, pca = 100, min_dist = 0.1)
sup_embed_R <- as.data.frame(sup_embed_R)
sup_embed_R$labels <- py$y
ggplot(sup_embed_R, aes(V1, V2, color = as.character(labels))) + geom_point() + scale_color_manual(values = c("#0000FF", "#ff0000"))
Maybe the issue is that Python is calling the fit_transform() function from umap. Therefore, I tried using ret_model = TRUE with uwot::umap_transform(), but I don't get the same result as Python either.
sup_embed_model <- uwot::umap(py$X, y = py$y, ret_model = TRUE)
sup_embed_R <- uwot::umap_transform(py$X, sup_embed_model, verbose = TRUE)
sup_embed_R <- as.data.frame(sup_embed_R)
sup_embed_R$labels <- py$y
ggplot(sup_embed_R, aes(V1, V2, color = as.character(labels))) + geom_point() + scale_color_manual(values = c("#0000FF", "#ff0000"))
This looks more like a donut than two separate clusters.
Is there something I'm doing wrong?
Hi,
For supervised dimension reduction, are multiple y variables allowed? I tried y=c(y1, y2) and got an error.
Thanks.
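One likely reason for the error is that c(y1, y2) concatenates the two targets into a single vector twice as long as the number of observations. The base R sketch below, with made-up y1 and y2, contrasts that with a data frame that keeps one row per observation. Whether the installed version of uwot accepts a multi-column y is an assumption to check against its documentation.

```r
set.seed(1)
n <- 10
y1 <- rnorm(n)                                          # hypothetical numeric target
y2 <- factor(sample(c("a", "b"), n, replace = TRUE))    # hypothetical categorical target

# c() concatenates the targets (coercing the factor), giving 2 * n values.
y_vec <- c(y1, y2)
length(y_vec)

# A data frame keeps one row per observation and one column per target.
y_df <- data.frame(y1 = y1, y2 = y2)
dim(y_df)
```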
Hi James,
I can install it without any problem on my workstation (Win10). However, I can't on our Linux server with R 3.6.0. It seems that the LIB path is not defined in the "Makevars" file and the default path "/usr/lib/R/lib" can't be found.
I'd appreciate any feedback!
Yu
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RcppAnnoy_0.0.12 RcppParallel_4.4.3 devtools_2.1.0 usethis_1.5.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 rstudioapi_0.10 magrittr_1.5 pkgload_1.0.2 R6_2.4.0 rlang_0.4.0 tools_3.6.0 pkgbuild_1.0.5 sessioninfo_1.1.1 cli_1.1.0 withr_2.1.2 remotes_2.1.0
[13] assertthat_0.2.1 digest_0.6.20 rprojroot_1.3-2 crayon_1.3.4 processx_3.4.1 callr_3.3.1 codetools_0.2-16 fs_1.3.1 ps_1.3.0 curl_4.0 testthat_2.2.1 memoise_1.1.0
[25] glue_1.3.1 compiler_3.6.0 desc_1.2.0 backports_1.1.4 prettyunits_1.0.2
devtools::install_github("jlmelville/uwot")
Downloading GitHub repo jlmelville/uwot@master
✔ checking for file ‘/tmp/Rtmpr1KisN/remotes81ee6859d9ce/jlmelville-uwot-7418141/DESCRIPTION’ ...
─ preparing ‘uwot’:
✔ checking DESCRIPTION meta-information ...
─ cleaning src
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
─ building ‘uwot_0.1.3.tar.gz’
Warning: invalid uid value replaced by that for user 'nobody'
Installing package into ‘/scratch/TBI/Softwares/R-3.6/Packages’
(as ‘lib’ is unspecified)
[1] "Working in R Studio, setting library path for R 3.6.0"
Thanks for your work on uwot: really love it (including the very clear vignettes explaining umap and tsne).
I was wondering if you came across this paper: https://arxiv.org/abs/1710.00992, which tries to visualize, for each of the original variables, its dependency in the resulting t-SNE/LLE mapping by plotting contour lines. Might this be an idea for uwot (or for a separate package that does that)?
Command used:
I have separated my samples according to the column Combined and would like to run uwot.
Distancemetric='euclidean'
uwot=umap(data,n_neighbors=n_neighbors,metric=list(Distancemetric=1:n_genes,"categorical"="Combined"),min_dist=mindist,init=init,n_epochs=epochs)
Error:
uwot=umap(data,n_neighbors=n_neighbors,metric=list(Distancemetric=1:n_genes,"categorical"="Combined"),min_dist=mindist,init=init,n_epochs=epochs)
Running rhub::check_with_sanitizers() has confirmed that the UBSAN issues reported for RcppAnnoy in #50 are fixed with RcppAnnoy 0.0.15. Unfortunately, there are lots of UBSAN complaints originating with RcppParallel. I don't think this is due to me using the package incorrectly, because the RcppParallel CRAN checks give the same messages (see https://www.stats.ox.ac.uk/pub/bdr/memtests/gcc-UBSAN/RcppParallel/RcppParallel-Ex.Rout).
They seem to originate with the Intel tbb library and are well known by the RcppParallel maintainers (see e.g. RcppCore/RcppParallel#36), but they can't do anything about it. The risk here is that the strategy of saying that the UBSAN issues are harmless and originate from a package uwot is using is exactly the strategy that stopped working with RcppAnnoy.
A possible alternative is to look at RcppThread which has a parallel for construct and is not currently showing any check problems.
@sirusb, @ttriche: as contributors of PRs to this package, would you like to be acknowledged as such in the Authors@R field of the DESCRIPTION? You don't need to provide an email address, just a suitable identifier, e.g. first name and last name. For reference, the field currently looks like:
c(person("James", "Melville", email = "[email protected]", role = c("aut", "cre")),
person("Aaron", "Lun", role="ctb"))
Metric = "precomputed" is not implemented
I would like to run uwot::umap() with metric = 'pearson'. However, 'pearson' is not an option within this package, and I got the following error:
Error in match.arg(metric, c("euclidean", "cosine", "manhattan", "hamming", : 'arg' should be one of “euclidean”, “cosine”, “manhattan”, “hamming”, “precomputed”
This error suggests that I can use a "precomputed" distance matrix. So I tried to run uwot::umap() with metric = 'precomputed' and got the following error:
Error in create_ann(metric, nc) : BUG: unknown Annoy metric 'precomputed'
This error suggests precomputed is not implemented within this package.
PS. The original umap package allows for metric = 'pearson'. It would be nice to see this added to this package!
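A possible workaround rests on a standard identity: the Pearson correlation between two observations equals the cosine similarity of the same observations after each has been centered at zero. The base R sketch below checks that identity on made-up data; the matrix X and the helper cosine_sim are assumptions for illustration.

```r
set.seed(1)
X <- matrix(rnorm(5 * 20), nrow = 5)   # 5 observations (rows), 20 features

# Center each row (observation) at zero.
Xc <- X - rowMeans(X)

# Cosine similarity between two vectors (hypothetical helper).
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

pearson <- cor(X[1, ], X[2, ])
cosine_centered <- cosine_sim(Xc[1, ], Xc[2, ])
all.equal(pearson, cosine_centered)
```

Under that identity, passing the row-centered matrix Xc to umap with metric = "cosine" should behave like a Pearson-based distance, though this has not been checked against the Python implementation.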
Hello,
I'm trying to install in this environment:
CentOS 7.6.1810 - gcc 4.8.5
Microsoft R 3.5.1
but I receive this error
devtools::install_github("jlmelville/uwot")
Downloading GitHub repo jlmelville/uwot@master
from URL https://api.github.com/repos/jlmelville/uwot/zipball/master
Installing uwot
'/opt/microsoft/ropen/3.5.1/lib64/R/bin/R' --no-site-file --no-environ
--no-save --no-restore --quiet CMD INSTALL
'/tmp/RtmpWTJkQy/devtools1bf444ad5fb01/jlmelville-uwot-05e3d4e'
--library='/opt/microsoft/ropen/3.5.1/lib64/R/library' --install-tests
I've googled but couldn't find any solution. I hope you can kindly help with my issue.
regards,
Fabio
I have 0.1.5 installed and it works fine. I tried to upgrade it to the most recent version and I get an error:
> install.packages("uwot")
There is a binary version available but the source version is later:
binary source needs_compilation
uwot 0.1.5 0.1.8 TRUE
Do you want to install from sources the package which needs compilation? (Yes/no/cancel) y
installing the source package ‘uwot’
trying URL 'https://cran.rstudio.com/src/contrib/uwot_0.1.8.tar.gz'
Content type 'application/x-gzip' length 90032 bytes (87 KB)
==================================================
downloaded 87 KB
* installing *source* package ‘uwot’ ...
** package ‘uwot’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppProgress/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dqrng/include" -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -DSTRICT_R_HEADERS -DRCPP_NO_RTTI -fPIC -Wall -g -O2 -c RcppExports.cpp -o RcppExports.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppProgress/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dqrng/include" -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -DSTRICT_R_HEADERS -DRCPP_NO_RTTI -fPIC -Wall -g -O2 -c connected_components.cpp -o connected_components.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppProgress/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dqrng/include" -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -DSTRICT_R_HEADERS -DRCPP_NO_RTTI -fPIC -Wall -g -O2 -c nn_parallel.cpp -o nn_parallel.o
In file included from nn_parallel.cpp:6:
In file included from ./nn_parallel.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include/annoylib.h:22:
In file included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/unistd.h:658:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/gethostuuid.h:39:17: error: C++ requires a type specifier for all declarations
int gethostuuid(uuid_t, const struct timespec *) __OSX_AVAILABLE_STARTING(__MAC_10_5, __IPHONE_NA);
^
In file included from nn_parallel.cpp:6:
In file included from ./nn_parallel.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include/annoylib.h:22:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/unistd.h:665:27: error: unknown type name 'uuid_t'; did you mean 'uid_t'?
int getsgroups_np(int *, uuid_t);
^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/_types/_uid_t.h:31:31: note: 'uid_t' declared here
typedef __darwin_uid_t uid_t;
^
In file included from nn_parallel.cpp:6:
In file included from ./nn_parallel.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include/annoylib.h:22:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/unistd.h:667:27: error: unknown type name 'uuid_t'; did you mean 'uid_t'?
int getwgroups_np(int *, uuid_t);
^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/_types/_uid_t.h:31:31: note: 'uid_t' declared here
typedef __darwin_uid_t uid_t;
^
In file included from nn_parallel.cpp:6:
In file included from ./nn_parallel.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include/annoylib.h:22:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/unistd.h:730:31: error: unknown type name 'uuid_t'; did you mean 'uid_t'?
int setsgroups_np(int, const uuid_t);
^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/_types/_uid_t.h:31:31: note: 'uid_t' declared here
typedef __darwin_uid_t uid_t;
^
In file included from nn_parallel.cpp:6:
In file included from ./nn_parallel.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include/annoylib.h:22:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/unistd.h:732:31: error: unknown type name 'uuid_t'; did you mean 'uid_t'?
int setwgroups_np(int, const uuid_t);
^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/_types/_uid_t.h:31:31: note: 'uid_t' declared here
typedef __darwin_uid_t uid_t;
^
5 errors generated.
make: *** [nn_parallel.o] Error 1
ERROR: compilation failed for package ‘uwot’
* removing ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/uwot’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/uwot’
Warning in install.packages :
installation of package ‘uwot’ had non-zero exit status
The downloaded source packages are in
‘/private/var/folders/l5/b5l0kyfd46780f7qdl5hm9d4cncpc2/T/RtmpLXr669/downloaded_packages’
This is my SessionInfo:
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.2 tools_3.6.2
Hi there,
I'm getting a weird error when I try to run umap() with my data:
Error: index size 79464 is not a multiple of vector size 16
After playing around trying to figure out why (the iris data was working), I discovered that it has something to do with the number of columns in the input data passed to umap().
As an example, when I eliminate one of the numeric columns of the iris dataset (going from 4 to 3), I get a similar error.
iris_train <- data.table(iris[1:100, 1:3])
iris_test <- data.table(iris[101:150, 1:3])
iris_train_umap <- umap(iris_train, n_components = 3, ret_model = T)
save_uwot(iris_train_umap, "~/iris.uwot")
Warning messages:
1: invalid uid value replaced by that for user 'nobody'
2: invalid gid value replaced by that for user 'nobody'
iris_umap = load_uwot("~/iris.uwot")
Error: index size 77420 is not a multiple of vector size 16
iris_test_umap <- umap_transform(iris_test, iris_umap)
RStudio crashes.
Any help would be appreciated!
First of all, thanks for this package. It is very convenient to have a pure R implementation of UMAP which is fast and reliable!
I am running into a small and somewhat strange problem with the reproducibility of results. To be able to get the same UMAP results twice, I use set.seed before calling umap, and locally on my machine it works well:
set.seed(82223)
umap <- uwot::umap(USArrests)
The problem is that sometimes, if I run the same code on another machine, for example during the tests of a package CRAN check, the test fails because the results are different.
I've tried several things: checking that the uwot versions are the same, and even checking that set.seed gives the same sequence of random numbers on every machine. This is true, but when I compare the umap results they are different.
I'm not sure I'm completely clear here... But if you have any idea on why this could happen, I'd be glad to hear it :-)
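A small sketch of how one might check this, assuming single-threaded settings (n_threads = 1, n_sgd_threads = 0) and comparing two runs on the same machine. Across machines, differences in compilers or numerical libraries may still change the coordinates, so a package test might more safely check the shape of the output rather than exact values; that suggestion is an assumption, not something confirmed in this thread.

```r
library(uwot)

# Two runs with the same seed, single-threaded throughout.
set.seed(82223)
emb1 <- umap(USArrests, n_threads = 1, n_sgd_threads = 0)
set.seed(82223)
emb2 <- umap(USArrests, n_threads = 1, n_sgd_threads = 0)

# On one machine the two embeddings are expected to match.
same_machine_match <- isTRUE(all.equal(emb1, emb2))

# A check that is less sensitive to platform differences:
# only the dimensions of the result are compared.
shape_ok <- identical(dim(emb1), c(nrow(USArrests), 2L))
```

The variable names emb1, emb2, same_machine_match and shape_ok are illustrative only.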
Thanks for developing this!
The umap function appears to have a bug when the 'metric = "cosine"' option is invoked. I get the following error:
Error in search_nn_func(index_file, X, k, search_k, grain_size = grain_size, :
vector::_M_range_insert
However, if I use 'metric = "manhattan"' or leave it at the default, it works just fine.
Best.
List of metrics not allowed if X is a matrix.
I tried the option metric=list("cosine"=1:27, "categorical"=28) on my data (matrix with dimnames) and got this error:
Error in match.arg(metric, c("euclidean", "cosine", "manhattan", "hamming", : 'arg' should be one of “euclidean”, “cosine”, “manhattan”, “hamming”, “precomputed”
If I set X=as.data.frame(mydata) the error is gone.
Thanks.
Hello,
I'm getting *** caught segfault *** address 0xfffffffffffffff7, cause 'memory not mapped' from each thread when I use uwot() with the option n_sgd_threads, as soon as the process is "Commencing the optimization epoch".
Here is the command
sentence_umap <- umap(X = corp_sentence_nda, pca=150, n_neighbors = 15, n_components = 3, ret_model = TRUE, verbose = TRUE, n_threads = 40, approx_pow = TRUE, n_sgd_threads=2)
Here is the log
09:58:17 UMAP embedding parameters a = 1.896 b = 0.8006
09:58:27 Read 215213 rows and found 768 numeric columns
09:58:27 Reducing X column dimension to 150 via PCA
10:01:53 PCA: 150 components explained 85.29% variance
10:01:53 Using Annoy for neighbor search, n_neighbors = 15
10:01:54 Building Annoy index with metric = euclidean, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
10:03:06 Writing NN index file to temp file /tmp/RtmpJ0QbYw/file81c79b6196a
10:03:07 Searching Annoy index using 40 threads, search_k = 1500
10:04:42 Annoy recall = 67.79%
10:04:42 Commencing smooth kNN distance calibration using 40 threads
10:04:42 103918 smooth knn distance failures
10:04:50 Found 498 connected components, falling back to 'spca' initialization with init_sdev = 1
10:04:50 Initializing from scaled PCA
10:04:51 Commencing optimization for 200 epochs, with 4880866 positive edges using 2 threads
*** caught segfault ***
address 0xfffffffffffffff7, cause 'memory not mapped'
Traceback:
1: RcppParallel::setThreadOptions(numThreads = n_sgd_threads)
2: uwot(X = X, n_neighbors = n_neighbors, n_components = n_components, metric = metric, n_epochs = n_epochs, alpha = learning_rate, scale = scale, init = init, init_sdev = init_sdev, spread = spread, min_dist = min_dist, set_op_mix_ratio = set_op_mix_ratio, local_connectivity = local_connectivity, bandwidth = bandwidth, gamma = repulsion_strength, negative_sample_rate = negative_sample_rate, a = a, b = b, nn_method = nn_method, n_trees = n_trees, search_k = search_k, method = "umap", approx_pow =approx_pow, n_threads = n_threads, n_sgd_threads = n_sgd_threads, grain_size = grain_size, y = y, target_n_neighbors = target_n_neighbors, target_weight = target_weight, target_metric = target_metric, pca = pca, pca_center = pca_center, pcg_rand = pcg_rand, fast_sgd = fast_sgd, ret_model = ret_model, ret_nn = ret_nn, tmpdir = tempdir(), verbose = verbose)
3: umap(X = corp_sentence_nda, pca = 150, n_neighbors = 15, n_components = 3, ret_model = TRUE, verbose = TRUE, n_threads = 40, approx_pow = TRUE, n_sgd_threads = 2)
4: system.time(sentence_umap <- umap(X = corp_sentence_nda, pca = 150, n_neighbors = 15, n_components = 3, ret_model = TRUE, verbose = TRUE, n_threads = 40, approx_pow= TRUE, n_sgd_threads = 2))
The machine I'm running on provides the following cores to Rcpp :
> RcppParallel::defaultNumThreads()
[1] 48
(that's the reason why I would love to benefit from n_sgd_threads > 1)
Note that changing from approx_pow = TRUE to approx_pow = FALSE has no effect and produces the same segfault.
Here is the sessionInfo() I'm using (note that uwot is the GitHub master version, not the CRAN one, with no difference):
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /opt/conda/lib/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reticulate_1.14 uwot_0.1.6 Matrix_1.2-18
loaded via a namespace (and not attached):
[1] compiler_3.6.1 rappdirs_0.3.1 Rcpp_1.0.3 grid_3.6.1
[5] jsonlite_1.6.1 RcppParallel_4.4.4 lattice_0.20-38
Thanks for your help, and for this fantastic package.
Dirk Eddelbuettel, RcppAnnoy author, has reached out to inform me that a new version of RcppAnnoy is coming, backed by an updated version of Annoy.
Unfortunately, uwot's ability to load previously saved models is broken by these changes. It used to be possible to specify an arbitrary dimensionality when creating an index, and then loading a serialized Annoy model would overwrite that dimensionality to whatever the serialized index was supposed to have, e.g.:
ann <- methods::new(RcppAnnoy::AnnoyEuclidean, 1)
ann$load(index_path)
Annoy author Erik Bernhardsson further pointed out that this should never have worked, which adds a little extra urgency to making a fix.
Fortunately, the information needed is readily to hand at load time, so this isn't a difficult (or backwards-compatibility breaking) fix.
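The fix described above amounts to creating the index with the correct number of columns before loading it. A minimal sketch using RcppAnnoy directly, where ndim stands for a value that would be stored alongside the saved model (how uwot actually stores it is not shown in this thread and is assumed here):

```r
library(RcppAnnoy)

X <- as.matrix(iris[, 1:4])
ndim <- ncol(X)  # dimensionality that must be remembered for loading

# Build and save an index with the true dimensionality.
ann <- methods::new(RcppAnnoy::AnnoyEuclidean, ndim)
for (i in seq_len(nrow(X))) {
  ann$addItem(i - 1, X[i, ])  # Annoy items are 0-indexed
}
ann$build(10)  # 10 trees, an arbitrary choice for this example
index_path <- tempfile(fileext = ".annoy")
ann$save(index_path)

# Load into an index created with the same dimensionality,
# rather than a placeholder value such as 1.
ann_loaded <- methods::new(RcppAnnoy::AnnoyEuclidean, ndim)
ann_loaded$load(index_path)
n_items <- ann_loaded$getNItems()
```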
There is a discussion over at the Python UMAP repo where it is mentioned that the Python equivalent of uwot::umap_transform includes an option to set a random number seed internally:
The transform function should now be consistent in the transformation (via a fixed transform seed which you can pick on instantiation if you wish).
The current behavior of umap_transform is, for example:
iris_umap <- umap(iris, ret_model = TRUE)
iris_umap2 <- umap_transform(iris, model = iris_umap)
Which will yield something like:
head(iris_umap$embedding)
[,1] [,2]
[1,] 4.200771 8.554187
[2,] 3.545428 6.722146
[3,] 3.172449 7.146079
[4,] 3.298905 7.069255
[5,] 4.084401 8.499889
[6,] 5.120863 9.330461
head(iris_umap2)
[,1] [,2]
[1,] 4.069650 8.825538
[2,] 3.723976 6.704192
[3,] 3.110911 7.511822
[4,] 3.286992 6.673110
[5,] 3.893319 8.772491
[6,] 5.126285 9.693930
The results are close but not exactly the same, due to the stochastic nature of the UMAP algorithm.
However, a relevant (and, I think, potentially common) use case is when umap() is used to reduce the dimensionality of a set of X predictor variables and a model is then trained on the embedding. If new observations (X2) become available for which we wish to make predictions, those values need to be deterministically translated into the embedding space (using the same random number generation as the original UMAP calculations). Even relatively small differences in how the "natural" X values are translated to the embedded space could cause identical X and X2 observations to generate different predictions from subsequent models.
How difficult would it be to enable a random seed to be set (and returned) for umap and then later passed to umap_transform?
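Until such an option exists, a possible workaround is to call set.seed immediately before each umap_transform call. The sketch below assumes that umap_transform draws its randomness from R's random number generator and that the optimization runs single-threaded (n_sgd_threads = 0); neither assumption is confirmed in this thread. It makes repeated transforms agree with each other, but it does not make them equal to the original training embedding.

```r
library(uwot)

set.seed(42)
iris_umap <- umap(iris, ret_model = TRUE, n_sgd_threads = 0)

# Reset the seed before every transform so each call sees the same
# random number stream.
set.seed(1337)
t1 <- umap_transform(iris, model = iris_umap, n_sgd_threads = 0)
set.seed(1337)
t2 <- umap_transform(iris, model = iris_umap, n_sgd_threads = 0)

repeatable <- isTRUE(all.equal(t1, t2))
```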
It asks for Rtools while installing and I could not install it.
I reduced the dimension using the supervised method, and I used metric learning to reduce the dimension of the test data. However, the accuracy on the training features is 99%, whereas the accuracy on the testing data is very low. Why? Any ideas? I don't want to use PCA or other tools for dimension reduction. Thanks
Below is the code!
library(uwot)
library(tidyverse)
library(ggplot2)
#perform dimension reduction for object detection features
data_train <- read.table('d://augmented_features/raw_training_features.csv',sep = ',',header = F)
data_test <- read.table('d://augmented_features/raw_testing_features.csv',sep = ',',header = F)
umap_test <- data_test[,-ncol(data_test)]
umap_train <- data_train[,-ncol(data_train)]
train_label <- data_train$V513
#set.seed(1337)
reduced_umap_train <- umap(umap_train,ret_model = TRUE,y=train_label,n_components = 2)
inria_train <- as.data.frame(reduced_umap_train$embedding)
inria_train %>%
mutate(Categories = data_train$V513) %>%
ggplot(aes(V1, V2, color = Categories)) + geom_point(cex=1.5)
#write this file for final feature extraction
#class(inria_umap)
#write.table(inria_umap,'G://testing.csv',sep = ',',row.names = F,col.names = F)
set.seed(1337)
reduced_umap_test <- umap_transform(umap_test,reduced_umap_train)
inria_test <- as.data.frame(reduced_umap_test)
inria_test %>%
mutate(Categories = data_test$V513) %>%
ggplot(aes(V1, V2, color = Categories)) + geom_point(cex=1.5)
#make final reduced datasets that is used for SVM training
category <- data_train$V513
train_data <- cbind(inria_train,category)
category <- data_test$V513
test_data <- cbind(inria_test,category)
#write data into the file
write.table(train_data,'d://augmented_features/reduced_training.csv',sep = ',',row.names = F,col.names = F)
write.table(test_data,'d:/augmented_features/reduced_testing.csv',sep = ',',row.names = F,col.names = F)
Hi,
This R implementation is very useful for me since I only know R. Thank you for making this package.
So I was trying to run a series of UMAP analyses with different parameters. I saved them with saveRDS() for later use, especially with the umap_transform() function on my testing data set. However, when I retrieve one with readRDS(), I can't use the object as the model for umap_transform(). The error message reads:
Error in .External(list(name = "CppMethod__invoke_void", address = <pointer: (nil)>, :
NULL value passed as symbol address
I work on RStudio Server. Not sure if the information helps to solve the problem.
Thanks a lot for making this package again.
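A likely explanation is that the model holds its nearest neighbor index through a pointer to C++ memory, which saveRDS() cannot serialize. A sketch of the alternative, assuming a release of uwot that provides save_uwot and load_uwot (both appear in another report on this page); the file path and the train/test split are illustrative only:

```r
library(uwot)

train <- iris[1:100, 1:4]
test <- iris[101:150, 1:4]

model <- umap(train, ret_model = TRUE)

# Save with the package's own function instead of saveRDS().
model_path <- tempfile(fileext = ".uwot")
save_uwot(model, file = model_path)

# Later, possibly in a new session: restore and transform new data.
restored <- load_uwot(model_path)
test_embedding <- umap_transform(test, restored)
```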
I thought I'd go one by one over each of the C++ files, starting with nn_parallel.h. It seems you're doing a nearest-neighbor search on X in the R annoy_nn function. Here are some observations:
k+1 neighbors and check if the observation is not its own neighbor, see here.
get_nns_by_item method to get the NNs for a particular item in the index. This avoids the need to repass mat into the NNworker, and eliminates the row-by-row accesses that are not cache optimal in NNworker::operator().
idx and dists, which would be more cache optimal. That is, create transposed matrices for NNWorker to dump results in, and untranspose them just before or after you return to R.
Could you point me to something that generally explains how umap_transform does the embedding of the new data, e.g. what information from the embedding of the initial set is used, and what is the nature of the objective function that is being minimized? I have gone through the R code for the function but I am not getting it.
I have read (and generally understand) the "How UMAP works" description at https://umap-learn.readthedocs.io/en/latest/how_umap_works.html, so I have the basic idea of how the embedding of the initial set of data is done.
Any help would be appreciated!
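As a rough illustration of one ingredient that such a transform can use, the sketch below places each new point at a weighted average of the embedded coordinates of its nearest training neighbors. This is an assumption about the general idea, not a description of what umap_transform does; the function name init_new_points, the inverse-distance weights and the brute-force search are all hypothetical simplifications, and any later optimization step is left out.

```r
# train_X: training data, train_emb: its embedding, new_X: new rows.
init_new_points <- function(train_X, train_emb, new_X, k = 3) {
  train_X <- as.matrix(train_X)
  new_X <- as.matrix(new_X)
  out <- matrix(0, nrow(new_X), ncol(train_emb))
  for (r in seq_len(nrow(new_X))) {
    # Euclidean distance from the new row to every training row.
    d <- sqrt(rowSums(sweep(train_X, 2, new_X[r, ])^2))
    nn <- order(d)[seq_len(k)]
    # Closer neighbors get larger weights (small constant avoids 1/0).
    w <- 1 / (d[nn] + 1e-8)
    out[r, ] <- colSums(train_emb[nn, , drop = FALSE] * w) / sum(w)
  }
  out
}

# Tiny made-up example: three training points with a 2D embedding.
toy_X <- rbind(c(0, 0), c(10, 10), c(0, 1))
toy_emb <- rbind(c(-1, -1), c(5, 5), c(-1, 0))
toy_init <- init_new_points(toy_X, toy_emb, rbind(c(0, 0)), k = 1)
```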
The latest submission of uwot to CRAN has been rejected due to the UBSAN issues inherited from RcppAnnoy (the UBSAN check is currently accessible via a link on https://cran.r-project.org/web/checks/check_results_uwot.html):
Thanks, it is your choice to use RcppAnnoy, so you have to work around the issues. The use of undefined behaviour is not compatible with the CRAN policy.
Please fix and resubmit.
If this decision isn't reconsidered, I imagine that this is likely to see uwot being removed from CRAN shortly.
The UBSAN issue is also present in RcppAnnoy itself, not uwot's specific use of the package (as far as I can tell anyway): https://cran.r-project.org/web/checks/check_results_RcppAnnoy.html and is due to how the underlying Annoy library is written. It's not going to get fixed because it's Annoy working as designed. It's not clear to me at the moment if this means RcppAnnoy will also be removed from CRAN or what has changed in policy since the last submission of uwot (or indeed of RcppAnnoy).
At any rate, I grow weary of the ban-hammer lottery uwot enters every time I want to update the package on CRAN. The obvious solution is to stop using Annoy. The upside would be:
uwot.
The obvious downsides are:
RcppHNSW is a possible alternative, but it supports fewer metrics than Annoy and is a lot slower.
I do want to get on with rnndescent, the upsides of which are:
A big downside is:
Other downsides that emerge from the fact that I am writing the package, so inevitably:
I am curious to see whether there is a way to give individual observations different weights in the UMAP objective function. For instance, I have data from 2 conditions, one with 100 observations and one with 1000. I would like to have both conditions contribute equally to the embedding. Perhaps naively, I would expect observations from each condition to take up the same amount of real estate in this balanced analysis. I appreciate any thoughts on how feasible this would be. Thanks in advance!
Hi,
Maybe this is a "no issue".
I'm trying FNN as the method for kNN search and need ret_model = TRUE to do metric learning, but I get an error:
Error in x2nn(X, n_neighbors, metric, nn_method, n_trees, search_k, n_refine_iters, :
nn_method = 'FNN' is incompatible with ret_model = TRUE
Do you think there could be a workaround?
Thanks.
Looking at:
Lines 71 to 75 in 3e359a3
This seems like it could be replaced by a binary search, assuming that the inputs represent slots from a dgCMatrix; entries of i should always be sorted within each column specified by p.
auto left_end=indices1.begin() + indptr1[i + 1];
auto left_it=std::lower_bound(indices1.begin() + indptr1[i], left_end, j);
double left_val = (left_it!=left_end && *left_it==j ? data1[left_it - indices1.begin()] : left_min);
This should be faster for any decently sized input matrix where you're getting >100 non-zero entries in each column (I don't know if this is particularly common?), and saves two lines as well.
library(uwot)
library(Matrix)
X <- rsparsematrix(10000, 10000, 0.1)
Y <- rsparsematrix(10000, 10000, 0.1)
Z <- as(X + Y, 'dgTMatrix')
system.time({
uwot:::general_sset_intersection_cpp(
X@p, X@i, X@x,
Y@p, Y@i, Y@x,
Z@i, Z@j, Z@x)
})
## user system elapsed
## 19.838 0.000 19.843
system.time({
uwot:::general_sset_intersection_cpp2( # modified as above
X@p, X@i, X@x,
Y@p, Y@i, Y@x,
Z@i, Z@j, Z@x)
})
## user system elapsed
## 1.43 0.00 1.43
(Not that I have any concerns about speed; I was trawling through the code for other reasons and just happened to notice this. Just something to consider.)
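For readers more at home in R, the lower_bound logic in the C++ snippet above can be mirrored with findInterval, which searches a sorted vector. This is only an illustration of the lookup, not a proposal for the package: csc_lookup is a hypothetical helper, row indices are 0-based as in the i slot of a dgCMatrix, the column is given 1-based, and because the R version slices the column first it is not expected to show the same speedup.

```r
# Value at (row, col) given compressed sparse column slots p, i, x.
csc_lookup <- function(p, i, x, row, col, default = 0) {
  start <- p[col] + 1L   # first slot of this column (1-based)
  end <- p[col + 1L]     # last slot of this column
  if (end < start) {
    return(default)      # empty column
  }
  rows <- i[start:end]   # sorted row indices within the column
  # Number of stored row indices <= row.
  pos <- findInterval(row, rows)
  if (pos > 0L && rows[pos] == row) x[start + pos - 1L] else default
}

# Made-up 5 x 3 matrix: column 1 has rows 0 and 2, column 2 is empty,
# column 3 has rows 1, 3 and 4.
p <- c(0L, 2L, 2L, 5L)
i <- c(0L, 2L, 1L, 3L, 4L)
x <- c(1.5, 2.5, 3, 4, 5)

hit <- csc_lookup(p, i, x, row = 2L, col = 1L)
miss <- csc_lookup(p, i, x, row = 1L, col = 1L)
```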
Sometimes the spectral initialization takes a long time; on some occasions, it's got so stuck I've had to terminate the calculation (which can sometimes require terminating the R session).
Probably the input matrix is very poorly conditioned, so that finding the smallest eigenvalues is an exercise in numerical futility.
Recent versions of UMAP detect connected components and initialize them separately, see e.g.
https://github.com/lmcinnes/umap/blob/43cf1a820cea8d5b3218627d047dd78e4a152dd4/umap/spectral.py
This might solve the problem.
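The detection step can be sketched in base R as a breadth-first search over a symmetric adjacency matrix. This is a minimal illustration under the assumption of a small dense matrix; the function name connected_components is hypothetical, and the per-component initialization done in the linked Python code is not reproduced here.

```r
# Label each vertex with the index of its connected component.
connected_components <- function(adj) {
  n <- nrow(adj)
  labels <- integer(n)  # 0 means "not visited yet"
  current <- 0L
  for (s in seq_len(n)) {
    if (labels[s] != 0L) next
    current <- current + 1L
    labels[s] <- current
    queue <- s
    while (length(queue) > 0L) {
      v <- queue[1L]
      queue <- queue[-1L]
      # Unvisited neighbors of v join the current component.
      nbrs <- which(adj[v, ] != 0 & labels == 0L)
      labels[nbrs] <- current
      queue <- c(queue, nbrs)
    }
  }
  labels
}

# Made-up graph with edges 1-2, 2-3 and 4-5.
adj <- matrix(0, 5, 5)
adj[cbind(c(1, 2, 4), c(2, 3, 5))] <- 1
adj <- adj + t(adj)
comp <- connected_components(adj)
```

With the labels in hand, each component could then be initialized on its own and the pieces arranged afterwards.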
Hello! Thank you for writing such a useful package. It is great not to have to switch between python and R to use umap :).
I was wondering: is it possible to output the graph (i.e. the fuzzy simplicial set) that is an intermediate step in the UMAP projection?
In the original python implementation, I obtained this using the function:
umap.umap_.fuzzy_simplicial_set
I have found that this graph has several nice properties, and can be used to cluster data directly using graphical clustering methods.
Tom
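A sketch of how this can look, assuming a release of uwot whose umap function accepts the ret_extra argument with the value "fgraph" (this may not have been available when the question was asked). The result is then a list rather than a plain matrix:

```r
library(uwot)

set.seed(42)
res <- umap(iris, ret_extra = c("fgraph"))

embedding <- res$embedding  # the usual low-dimensional coordinates
fgraph <- res$fgraph        # sparse N x N matrix of edge weights
```

The sparse matrix could then be handed to a graph clustering package of your choice.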