lvdmaaten / bhtsne
Barnes-Hut t-SNE
License: Other
Hi, I think I'm seeing the same thing as this issue.
I compiled as in the instructions on Kali Linux using g++. It seemed to compile fine.
When I run it on some test data, I get this error:
Read the 4329 x 10 data matrix successfully!
Using current time as random seed...
Using no_dims = 2, perplexity = 60.000000, and theta = 0.500000
Computing input similarities...
Building tree...
- point 0 of 4329
Segmentation fault
I have attached the binary that caused the problem, together with the data.dat
file. I have also verified that another binary compiled on macOS works fine with this particular data.dat
This seems to be an issue on many different OSes, including on Windows.
(paging @he-zhe)
When I try to follow the Windows instructions in the readme, I get the following error:
cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /openmp tsne.obj sptree.obj -Fewindows\bh_tsne.exe
libcpmt.lib(xthrow.obj) : error LNK2038: mismatch detected for '_MSC_VER': value '1900' doesn't match value '1800' in tsne.obj
libucrt.lib(hypot.obj) : error LNK2005: hypot already defined in tsne.obj
windows\bh_tsne.exe : fatal error LNK1169: one or more multiply defined symbols found
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe"' : return code '0x2'
Stop.
I'm not sure how I can edit the tsne.obj file to adjust this, any advice?
Thanks!
I attempted to run fast_tsne
from MATLAB using the wrapper. After some processing time, I get an app crash.
I'm on Win7 SP1 with MATLAB 2015b.
fast_tsne code from the master branch
The data.dat is being written, the crash is during the process of the binary itself. In the first seconds of operation the bh_tsne process allocates a lot of memory (up to 4.4 GB in my case). This is within limits and even just before the moment of crash there is still 12% of free physical memory.
After compiling, I get the file bhtsne, but it should be bh_tsne, or at least that is what the python wrapper expects in my case. I am on OS X 10.12.3 and compiling with g++ from Xcode.
I'm working to integrate it into a homogeneous framework. @lvdmaaten Do you think it could be relevant? Thanks!
Hey!
Does anyone have any idea why the latest version (master) is slower than the older version available at
https://github.com/danielfrg/tsne/tree/master/tsne/bh_sne_src
I profiled the issue with the mnist2500 data set and the problem lies in the recursive function computeNonEdgeForces.
Is there something fundamentally wrong / bugs in the old version used by that other repository?
First, I compiled bhtsne successfully. Then, I ran the example code, using the data file 'mnist2500_X.txt'.
I run :
python bhtsne.py -i mnist2500_x.txt
I get this error:
bhtsne.py:135: ComplexWarning: Casting complex values to real discards the imaginary part
This warning occurs when writing 'data.dat': complex values are found after PCA.
I don't know how to fix it. Any suggestions will be appreciated.
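If the imaginary parts are numerically zero (which is typical when they arise from eigendecomposing a symmetric covariance matrix), one workaround is to drop them before the data is written out. A sketch with stand-in data, not the wrapper's actual code; the names are mine:

```python
import numpy as np

# Eigendecomposition with np.linalg.eig may return a complex dtype even
# when all imaginary parts are numerically zero. For a symmetric covariance
# matrix, np.linalg.eigh guarantees a real result; alternatively, np.real
# discards the (zero) imaginary parts before writing data.dat.
rng = np.random.default_rng(0)
samples = rng.random((50, 10))          # stand-in for the loaded data
cov = samples.T @ samples               # covariance-style matrix
_, vecs = np.linalg.eig(cov)            # dtype may come back complex128
projected = samples @ np.real(vecs)     # force a real-valued projection
```

Using np.linalg.eigh(cov) instead of eig avoids the complex dtype at the source, since it is specialized for symmetric matrices.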
Hello,
I'm working with the python wrapper bhtsne.py, the iris dataset, and on a ubuntu 14.04.
I'm trying to get the costs for each sample (as specified at the end of bh_tsne()).
I uncommented the last line of the function,
#read_unpack('{}d'.format(sample_count), output_file)
adapted it to
_read_unpack('{}d'.format(len(results)), output_file)
and put it before the yield, inside a simple print().
However, all the costs are equal to zero, even when I set a very low number of iterations and the verbose output tells me the error is still high:
$ ./bhtsne.py -d 2 -p 30 -v -i iris_data.txt -o tsne_test.output --no_pca -m 100
Read the 150 x 4 data matrix successfully!
Using current time as random seed...
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
- point 0 of 150
Input similarities computed in 0.01 seconds (sparsity = 0.706622)!
Learning embedding...
Iteration 50: error is 45.556438 (50 iterations in 0.02 seconds)
Iteration 99: error is 44.807590 (50 iterations in 0.02 seconds)
Fitting performed in 0.04 seconds.
Wrote the 150 x 2 data matrix successfully!
('costs: ', (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, [... all 0.0 ..], 0.0, 0.0))
What did I miss, or how can I get the costs?
Thanks
Hello @lvdmaaten ,
I've read on your t-SNE homepage that you can handle datasets with up to 30 million examples (https://lvdmaaten.github.io/tsne/). I'm currently working in Google Colab.
I currently have a dataset with 2 million examples and each example is a 100-d vector.
Using verbose= False, I get the following:
Using verbose=True as suggested I get:
I'm not sure what this means or how I should proceed. The example with the MNIST dataset works perfectly using verbose=False.
Does this implementation support Ubuntu 14.04 LTS? Would it be possible to add clean documentation for this?
I ran python bh_tsne on a 95 * 745544 matrix, and here is my command ./bhtsne.py -i ~/Dropbox/github/data/lan_uid_matrix.txt -o ~/Dropbox/github/data/lan_uid_coordinate.txt -p 5 -d 2 -t 1 -v
but it shows the error as follows:
Error: could not open data file.
Traceback (most recent call last):
File "./bhtsne.py", line 233, in
exit(main(argv))
File "./bhtsne.py", line 224, in main
verbose=argp.verbose, initial_dims=argp.initial_dims, use_pca=argp.use_pca, max_iter=argp.max_iter):
File "./bhtsne.py", line 211, in run_bh_tsne
for result in bh_tsne(tmp_dir_path, verbose):
File "./bhtsne.py", line 164, in bh_tsne
with open(path_join(workdir, 'result.dat'), 'rb') as output_file:
IOError: [Errno 2] No such file or directory: '/var/folders/92/8ty0c6392m773r5tbp4s9gy80000gp/T/tmpakdFT0/result.dat'
I don't know why it cannot find result.dat. Could you help me solve this?
Thanks in advance
I was trying to use python wrapper in windows and I used your example code. But this error was raised:
AttributeError: module 'os' has no attribute 'fork' which seems reasonable in Windows. Do you have any suggestion to solve this problem? Thanks!
I found that if I set the perplexity smaller than 1, then K is always 0, because you define
int K = (float)perplexity * 3, so K is truncated to 0.
If I set the perplexity > 0, I still get a segmentation fault (because the size of distances != K).
I'd like to know: can this source code still be used, or is it no longer maintained?
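The integer truncation described above is easy to reproduce. A sketch mirroring the C cast, not code from the repository:

```python
# Mirror of the C expression int K = (int) (perplexity * 3): Python's int()
# also truncates toward zero, so any perplexity below 1/3 yields K == 0,
# i.e. zero neighbors are allocated downstream.
perplexity = 0.3
K = int(perplexity * 3)
```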
Hi,
Since t-SNE is increasingly used to visualize neural networks outputs (and their layers), it would be extremely helpful to have an implementation of t-SNE in pytorch, in particular the barnes-hut version that runs in N log N.
Is this something you would be interested in doing?
Thanks!
Hello,
is there a way, similar to tsne_p.m, to run fast_tsne with a precomputed pairwise similarity matrix?
If not, would it be appropriate for me to add an option to computeSquaredEuclideanDistance for computing a Gaussian kernel with a custom DD (as in exp(-beta * DD[nN + m])) and an equivalent vptree distance, or are Euclidean distances the only way for the fast version to work correctly?
Hello Laurens:
I am trying to have full control of results in the sense of getting exactly the same output for a given random seed. Could you please confirm that I got it right: there are only two places where a random generator is used:
Thanks,
Nik Tuzov, PhD
Is there a way to input sparse data? I suspect this is not straightforward, because of the lack of a standard way to store sparse matrices in a text file, i.e. Python probably does it differently than MATLAB (I did not check, though).
OT: I just watched a video of you presenting t-SNE at Google and I want to compliment you on your explanation skills. Very clear and understandable.
0x0000000000403662 in TSNE::computeGaussianPerplexity (this=0x612010, X=0x7ffff6fab010, N=10000, D=30, _row_P=0x7fffffffe3d0, _col_P=0x7fffffffe3d8,
_val_P=0x7fffffffe3e0, perplexity=50, K=150) at tsne.cpp:470
470 cur_P[m] = exp(-beta * distances[m + 1] * distances[m + 1]);
The reason is that sizeof(distances) = 24 while K = 150; how does this work?
Hi, Smile has an implementation of t-SNE for Java/Scala/Kotlin/Clojure. Would you please add a link to it on your t-SNE page? Thanks.
Hello Laurens:
There is a sizeable difference in the output quantities when the difference between the input data sets is virtually zero. The R code attached below provides an illustration, and a similar issue exists with your code as well. I know it's all about relative distances among the points, so as long as the visualizations look similar (which is the case) the user shouldn't care. Still, it would be nice to see more consistent numbers in the output when the input data are virtually the same.
Based on my tests, the divergence occurs in computeNonEdgeForces() which causes computeGradient() to diverge as well.
Regards,
Nik Tuzov, PhD
===============================================
library(Rtsne)
library(rgl)
set.seed(115)
iris_unique <- unique(iris)
Y_zero <- as.matrix(iris_unique[, 1:3])
tsne_out3d <- Rtsne(as.matrix(iris_unique[, 1:4]), dims = 3, Y_init = Y_zero)
plot3d(tsne_out3d$Y[, 1], tsne_out3d$Y[, 2], tsne_out3d$Y[, 3], col = as.numeric(iris_unique$Species))
head(tsne_out3d$Y)
set.seed(115)
iris_unique_butt = iris_unique;
iris_unique_butt[1, 1] = iris_unique_butt[1, 1] + 1e-6;
tsne_out3d_butt <- Rtsne(as.matrix(iris_unique_butt[, 1:4]), dims = 3, Y_init = Y_zero)
plot3d(tsne_out3d_butt$Y[, 1], tsne_out3d_butt$Y[, 2], tsne_out3d_butt$Y[, 3], col = as.numeric(iris_unique$Species))
head(tsne_out3d_butt$Y)
Hey,
I'm trying to generate bhtsne.exe by following your instructions, but I keep getting this message on the cmd:
sptree.cpp
sptree.cpp(111): error C3861: 'fmax': identifier not found
sptree.cpp(335): error C3861: 'fmax': identifier not found
NMAKE: fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\amd64\cl.exe"': return code '0x2'
Any idea how to fix it?
Hi all,
I've experienced weird performance issues when compiling the binary on my Windows 10 home desktop vs. an Ubuntu 18.04 virtual machine. I compiled the binary using the given instructions in this repository, that is
g++ sptree.cpp tsne.cpp tsne_main.cpp -o bh_tsne -O2
on Ubuntu and
nmake -f Makefile.win all
on Windows (using Visual Studio 2019).
Oddly, on Windows, using all 70000 MNIST digits, the .exe runs in only half the time the binary requires on Ubuntu; see the following logs:
Windows:
Computing input similarities...
Building tree...
Ubuntu:
Computing input similarities...
Building tree...
TL;DR: while constructing the nearest-neighbor tree takes almost the same time on both machines, the iterations take twice as long on Ubuntu.
Any ideas on what could be going wrong would be greatly appreciated! Thanks
Dear Dr. van der Maaten:
Could you help me enhance my understanding of how the perplexity parameter works? I have two questions.
Looking at the implementation, do I get it right that a reasonable upper bound on perplexity is 1/3 of the minimal expected cluster size (for simplicity, assume we know what cluster sizes to expect)?
On your home page, there is a question (“I get a strange ‘ball’ with uniformly distributed points”) and your suggestion is to reduce perplexity. Do you think the same “ball” effect can be seen when perplexity is too low? If yes, how do you suggest we define a lower bound for perplexity?
Regarding 2), I have this digit images data set with 40,000 points that is supposed to contain 10 clusters of about the same size. When I subsample 2000 points and run default Rtsne (its implementation is very similar to yours) the embedding looks nice. However, it is far worse on the full data set. I figured it was because the default perplexity of 30 was too low compared to the typical cluster size, 4000, so I reset it to 30*20 = 600 and obtained a very nice embedding.
When the expected result is unknown, I guess one could try to use a similar subsampling approach to figure out how to increase perplexity. I was wondering if you know of a more analytical method or a rule of thumb.
Regards,
Nik Tuzov, PhD
Hello,
I have found some indexing bugs in the methods
TSNE::computeSquaredEuclideanDistance
and
TSNE::run
for the Exact computation of tSNE. (in the section where you symmetrize the input probability matrix)
If this has already been corrected, thank you, and please let me know and I will make sure I obtain the updated version.
In both cases, 2 nested loops are being used with a similar indexing format. I will use the euclidean distance matrix calculation to describe what I see:
nN = 0; nD = 0;
for(int n = 0; n < N; n++) {
    int mD = 0; // always indexes the first data point in X
    DD[nN + n] = 0.0;
    for(int m = n + 1; m < N; m++) {
        DD[nN + m] = 0.0;
        for(int d = 0; d < D; d++) {
            DD[nN + m] += (X[nD + d] - X[mD + d]) * (X[nD + d] - X[mD + d]);
        }
        DD[m * N + n] = DD[nN + m];
        mD += D;
    }
    nN += N;
    nD += D;
}
The problem is that mD always starts at 0, but m always starts at n+1. It seems that what is intended is that mD should contain the index of the m'th data point in X. If we describe DD as a matrix instead of an array, when you process row 0, you end up calculating the following:
DD[0,1] = ||x0 - x0||^2
DD[0,2] = ||x0 - x1||^2
DD[0,3] = ||x0 - x2||^2
where x0 is the first point, x1 is the second point, and x2 is the third point. In processing the second row, you get:
DD[1,2] = ||x1 - x0||^2
DD[1,3] = ||x1 - x1||^2
DD[1,4] = ||x1 - x2|| ^2
This can be fixed by changing the value of mD:
int mD;
nN = 0; nD = 0;
for(int n = 0; n < N; n++) {
    // int mD = 0; // remove this line
    DD[nN + n] = 0.0;
    for(int m = n + 1; m < N; m++) {
        mD = m * D; // this is the added code
        DD[nN + m] = 0.0;
        for(int d = 0; d < D; d++) {
            DD[nN + m] += (X[nD + d] - X[mD + d]) * (X[nD + d] - X[mD + d]);
        }
        DD[m * N + n] = DD[nN + m];
        // mD += D; // remove this line
    }
    nN += N;
    nD += D;
}
The bug fix in TSNE::run is similar:
// Symmetrize input similarities
printf("Symmetrizing...\n");
int nN = 0;
for(int n = 0; n < N; n++) {
    // int mN = 0; // remove this line
    for(int m = n + 1; m < N; m++) {
        int mN = m * N; // this is the added code
        P[nN + m] += P[mN + n];
        P[mN + n] = P[nN + m];
        // mN += N; // remove this line
    }
    nN += N;
}
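The effect of the proposed fix can be checked with a small transcription to Python (mine, not code from the repository): only the fixed loop reproduces brute-force squared distances.

```python
# Transcription of the quoted distance loop; `fixed` toggles the proposed fix.
def squared_distances(X, N, D, fixed):
    DD = [0.0] * (N * N)
    nN = 0
    nD = 0
    for n in range(N):
        mD = 0  # buggy version: always starts at the first row of X
        DD[nN + n] = 0.0
        for m in range(n + 1, N):
            if fixed:
                mD = m * D  # fix: point at the m'th row of X
            DD[nN + m] = sum((X[nD + d] - X[mD + d]) ** 2 for d in range(D))
            DD[m * N + n] = DD[nN + m]
            if not fixed:
                mD += D
        nN += N
        nD += D
    return DD

X = [0.0, 0.0, 1.0, 0.0, 0.0, 2.0]  # three 2-D points, row-major
good = squared_distances(X, 3, 2, fixed=True)
bad = squared_distances(X, 3, 2, fixed=False)
# the fixed loop yields the symmetric matrix [0, 1, 4, 1, 0, 5, 4, 5, 0]
```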
As a last note, thank you for providing this technique and implementation. It has proved to be a great embedding technique for the type of data I work with.
Sincerely,
Allison
Hello!
I compiled the program and tried the matlab example and it works.
But when I try the python wrapper, no bh_tsne.exe process is starting and I get nothing in the output, so I don't know what is the problem. My system is windows 7 x64.
Could you please provide a similar usage example for python like the one for matlab?
Thanks in advance
The computations involving the "gains" in tsne.cpp, line 72 carry the awe-inspiring comment
// Allocate some memory
This is not just "some memory". These are parts of a computation that is critical for the implementation to work properly. Neither the paper nor any of the copycat implementations have any information about what these "gains" are. Maybe it's obvious for those who are more deeply involved. But nevertheless, at some point, it should be explained what these "gains" actually are.
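For what it's worth, my reading of the gradient-update loop, paraphrased in Python, is that the gains implement a per-parameter adaptive step size in the style of Jacobs' delta-bar-delta rule. The constants 0.2, 0.8 and the floor 0.01 are what I see in the source; the function shape and names below are mine, so treat this as an interpretation, not documentation.

```python
# Paraphrase (mine) of the "gains" update: each coordinate keeps its own
# gain multiplying the global learning rate eta.
def step(Y, uY, gains, dY, eta=200.0, momentum=0.5):
    for i in range(len(Y)):
        # When the gradient's sign disagrees with the current velocity's
        # sign, the descent direction agrees with the motion, so the gain
        # grows additively; otherwise it shrinks multiplicatively,
        # floored at 0.01.
        if (dY[i] > 0) != (uY[i] > 0):
            gains[i] += 0.2
        else:
            gains[i] *= 0.8
        gains[i] = max(gains[i], 0.01)
        uY[i] = momentum * uY[i] - eta * gains[i] * dY[i]  # momentum update
        Y[i] += uY[i]

# Toy usage: one coordinate, zero initial velocity, positive gradient.
Y, uY, gains = [0.0], [0.0], [1.0]
step(Y, uY, gains, dY=[0.1])
```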
I can install bhtsne using my vcvars32.bat.
But...
I can't run the example on the front page:
data = np.loadtxt("mnist2500_X.txt", skiprows=1)
embedding_array = bhtsne.run_bh_tsne(data, initial_dims=data.shape[1])
However, I can run it using bhtsne.tsne(data).
The question is:
Is bhtsne.tsne the same as bhtsne.run_bh_tsne above? Also, setting verbose=True in bhtsne.py doesn't produce the usual verbose text in my Spyder (Python 2.7, Anaconda) console.
Note that this works in Python 2.7, but not in Anaconda Python 3.5 on OS X and Linux. Something is wrong with the file handling; I can't figure out what, but this file does not exist.
It looks like the code opens the file in read mode 'rb' and then writes to it? I'm not familiar with that pattern.
Traceback (most recent call last):
File "test/test_tsne.py", line 24, in reduce_dimensions
result = bhtsne.run_bh_tsne(pca_result)
File "/Users/rjurney/Software/pinpointcloud_worker/bhtsne/bhtsne.py", line 214, in run_bh_tsne
for result in bh_tsne(tmp_dir_path, verbose):
File "/Users/rjurney/Software/pinpointcloud_worker/bhtsne/bhtsne.py", line 159, in bh_tsne
with open(path_join(workdir, 'result.dat'), 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/0b/74l_65015_5fcbmbdz1w2xl40000gn/T/tmph9x08ku8/result.dat'
The MATLAB wrapper fails for me after installation with:
'.../bh_tsne/bh_tsne' is not recognized as an internal or external command,
operable program or batch file.
This error is displayed in the stdout of the system call in fast_tsne.m on line 79:
tic, system(fullfile(tsne_path,'./bh_tsne')); toc
The reason seems to be that the python script this call refers to does not contain the underscore. So the very simple workaround is to change the above line to:
tic, system(fullfile(tsne_path,'./bhtsne')); toc
@lvdmaaten Have you tried to use dynamic lying factor instead of static 12.0?
I got a slightly lower error (it dropped from 1.428 to 1.394) by replacing the static lying factor with a dynamic value:
double lying_factor = 12.0;
double lying_decrease = 0;
if(stop_lying_iter > 0)
    lying_decrease = (lying_factor - 1.0) / (double) stop_lying_iter;

if(iter < stop_lying_iter) {
    if(exact) { for(int i = 0; i < N * N; i++) P[i] /= lying_factor; }
    else      { for(int i = 0; i < row_P[N]; i++) val_P[i] /= lying_factor; }
    lying_factor -= lying_decrease;
    if(exact) { for(int i = 0; i < N * N; i++) P[i] *= lying_factor; }
    else      { for(int i = 0; i < row_P[N]; i++) val_P[i] *= lying_factor; }
}
Output for bhtsne with static lying factor for input of 10000 x 200 samples:
Learning embedding...
Iteration 50: error is 83.834001 (50 iterations in 6.29 seconds)
Iteration 100: error is 82.772901 (50 iterations in 5.75 seconds)
Iteration 150: error is 82.683113 (50 iterations in 5.59 seconds)
Iteration 200: error is 82.657414 (50 iterations in 6.91 seconds)
Iteration 250: error is 4.401062 (50 iterations in 5.98 seconds)
Iteration 300: error is 2.621082 (50 iterations in 6.51 seconds)
Iteration 350: error is 2.200671 (50 iterations in 5.28 seconds)
Iteration 400: error is 1.975511 (50 iterations in 5.58 seconds)
Iteration 450: error is 1.834608 (50 iterations in 6.70 seconds)
Iteration 500: error is 1.740066 (50 iterations in 5.66 seconds)
Iteration 550: error is 1.671593 (50 iterations in 7.15 seconds)
Iteration 600: error is 1.619357 (50 iterations in 7.49 seconds)
Iteration 650: error is 1.579096 (50 iterations in 8.11 seconds)
Iteration 700: error is 1.550052 (50 iterations in 6.58 seconds)
Iteration 750: error is 1.528960 (50 iterations in 6.04 seconds)
Iteration 800: error is 1.516330 (50 iterations in 7.10 seconds)
Iteration 850: error is 1.507626 (50 iterations in 5.70 seconds)
Iteration 900: error is 1.500774 (50 iterations in 7.26 seconds)
Iteration 950: error is 1.495621 (50 iterations in 6.24 seconds)
Iteration 1000: error is 1.490954 (50 iterations in 6.61 seconds)
Iteration 1050: error is 1.486083 (50 iterations in 5.91 seconds)
Iteration 1100: error is 1.480977 (50 iterations in 9.09 seconds)
Iteration 1150: error is 1.475732 (50 iterations in 12.08 seconds)
Iteration 1200: error is 1.471433 (50 iterations in 7.36 seconds)
Iteration 1250: error is 1.467418 (50 iterations in 5.78 seconds)
Iteration 1300: error is 1.463404 (50 iterations in 7.68 seconds)
Iteration 1350: error is 1.459785 (50 iterations in 5.97 seconds)
Iteration 1400: error is 1.456251 (50 iterations in 5.53 seconds)
Iteration 1450: error is 1.453166 (50 iterations in 6.60 seconds)
Iteration 1500: error is 1.450100 (50 iterations in 5.54 seconds)
Iteration 1550: error is 1.447593 (50 iterations in 7.94 seconds)
Iteration 1600: error is 1.445209 (50 iterations in 5.48 seconds)
Iteration 1650: error is 1.442866 (50 iterations in 5.72 seconds)
Iteration 1700: error is 1.440184 (50 iterations in 6.16 seconds)
Iteration 1750: error is 1.437438 (50 iterations in 5.08 seconds)
Iteration 1800: error is 1.435075 (50 iterations in 6.89 seconds)
Iteration 1850: error is 1.432948 (50 iterations in 5.19 seconds)
Iteration 1900: error is 1.431293 (50 iterations in 5.38 seconds)
Iteration 1950: error is 1.429766 (50 iterations in 6.69 seconds)
Iteration 1999: error is 1.428257 (50 iterations in 5.49 seconds)
Output for bhtsne with dynamic lying factor for input of 10000 x 200 samples:
Learning embedding...
Iteration 50: error is 65.368017 (50 iterations in 7.20 seconds)
Iteration 100: error is 46.293893 (50 iterations in 6.24 seconds)
Iteration 150: error is 29.246011 (50 iterations in 7.21 seconds)
Iteration 200: error is 13.983286 (50 iterations in 6.51 seconds)
Iteration 250: error is 2.634763 (50 iterations in 7.59 seconds)
Iteration 300: error is 2.010282 (50 iterations in 5.58 seconds)
Iteration 350: error is 1.809022 (50 iterations in 5.98 seconds)
Iteration 400: error is 1.698381 (50 iterations in 6.79 seconds)
Iteration 450: error is 1.626216 (50 iterations in 6.61 seconds)
Iteration 500: error is 1.575453 (50 iterations in 7.00 seconds)
Iteration 550: error is 1.539009 (50 iterations in 5.83 seconds)
Iteration 600: error is 1.511758 (50 iterations in 7.45 seconds)
Iteration 650: error is 1.493206 (50 iterations in 5.83 seconds)
Iteration 700: error is 1.480972 (50 iterations in 6.17 seconds)
Iteration 750: error is 1.473207 (50 iterations in 6.50 seconds)
Iteration 800: error is 1.467411 (50 iterations in 5.75 seconds)
Iteration 850: error is 1.462050 (50 iterations in 6.94 seconds)
Iteration 900: error is 1.457054 (50 iterations in 5.78 seconds)
Iteration 950: error is 1.452213 (50 iterations in 6.93 seconds)
Iteration 1000: error is 1.447649 (50 iterations in 5.43 seconds)
Iteration 1050: error is 1.443852 (50 iterations in 6.00 seconds)
Iteration 1100: error is 1.440698 (50 iterations in 6.88 seconds)
Iteration 1150: error is 1.437592 (50 iterations in 6.08 seconds)
Iteration 1200: error is 1.434110 (50 iterations in 7.13 seconds)
Iteration 1250: error is 1.430698 (50 iterations in 5.43 seconds)
Iteration 1300: error is 1.427438 (50 iterations in 6.43 seconds)
Iteration 1350: error is 1.424260 (50 iterations in 4.72 seconds)
Iteration 1400: error is 1.421095 (50 iterations in 10.57 seconds)
Iteration 1450: error is 1.418043 (50 iterations in 11.95 seconds)
Iteration 1500: error is 1.415134 (50 iterations in 5.66 seconds)
Iteration 1550: error is 1.412521 (50 iterations in 6.57 seconds)
Iteration 1600: error is 1.409757 (50 iterations in 6.42 seconds)
Iteration 1650: error is 1.407288 (50 iterations in 5.95 seconds)
Iteration 1700: error is 1.404948 (50 iterations in 6.25 seconds)
Iteration 1750: error is 1.403020 (50 iterations in 5.08 seconds)
Iteration 1800: error is 1.401276 (50 iterations in 7.00 seconds)
Iteration 1850: error is 1.399376 (50 iterations in 5.49 seconds)
Iteration 1900: error is 1.397579 (50 iterations in 5.88 seconds)
Iteration 1950: error is 1.395958 (50 iterations in 6.58 seconds)
Iteration 1999: error is 1.394586 (50 iterations in 6.41 seconds)
Any opinions on this experiment?
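For reference, the decay in the experiment above can be sketched standalone (my transcription; the function and variable names are mine): the factor ramps linearly from 12 down toward 1 over stop_lying_iter iterations.

```python
# Standalone sketch of the linear lying-factor decay from the experiment.
def lying_schedule(start=12.0, stop_lying_iter=250):
    decrease = (start - 1.0) / stop_lying_iter
    factor = start
    schedule = []
    for _ in range(stop_lying_iter):
        schedule.append(factor)  # factor applied at this iteration
        factor -= decrease       # linear ramp toward 1.0
    return schedule

s = lying_schedule()
```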
Hello,
I'm trying to use the code for an image dataset that is in an h5 file; I changed the load_data function in the python wrapper to this:
def load_data(input_file):
    with h5py.File('data4.h5', 'r') as hf:
        data = hf['data1'][:]
    return data
where data1 is the name of the dataset inside the h5 file.
But I get the following error:
Traceback (most recent call last)
File "bhtsne.py", line 246, in <module>
exit(main(argv))
File "bhtsne.py", line 237, in main
verbose=argp.verbose, initial_dims=argp.initial_dims, use_pca=argp.use_pca, max_iter=argp.max_iter):
File "bhtsne.py", line 208, in run_bh_tsne
init_bh_tsne(data, tmp_dir_path, no_dims=no_dims, perplexity=perplexity, theta=theta, randseed=randseed,verbose=verbose, initial_dims=initial_dims, use_pca=use_pca, max_iter=max_iter)
File "bhtsne.py", line 112, in init_bh_tsne
cov_x = np.dot(np.transpose(samples), samples)
ValueError: shapes (32,32,3,2462) and (2462,3,32,32) not aligned: 2462 (dim 3) != 32 (dim 2)
Traceback (most recent call last):
File "bhtsne.py", line 246, in <module>
exit(main(argv))
File "bhtsne.py", line 237, in main
verbose=argp.verbose, initial_dims=argp.initial_dims, use_pca=argp.use_pca, max_iter=argp.max_iter):
File "bhtsne.py", line 218, in run_bh_tsne
for result in bh_tsne(tmp_dir_path, verbose):
File "bhtsne.py", line 163, in bh_tsne
with open(path_join(workdir, 'result.dat'), 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpx2rn_uow/result.dat'
How can I read my dataset correctly?
If it's not possible, I still have the data in a folder structure where each folder represents a category; sorry for my lack of knowledge, but can I (and should I) convert it to a CSV file?
Thank you very much
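One way the shapes in the traceback above could be reconciled is sketched below. The shapes are taken from the error message, but the axis meaning (height, width, channels, samples) is my assumption; the wrapper expects a 2-D (n_samples, n_features) array.

```python
import numpy as np

# Stand-in for hf['data1'][:], shaped as in the traceback.
data = np.zeros((32, 32, 3, 2462))
data = np.moveaxis(data, -1, 0)         # -> (2462, 32, 32, 3), samples first
data = data.reshape(data.shape[0], -1)  # -> (2462, 3072), one row per image
```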
Hello~
Thank you very much for sharing.
I want to use your code on MATLAB R2015b. Do I need to pre-install any software or toolbox?
It would be awesome if we could choose between several distance measures (e.g. jaccard).
[[email protected]@login-node03 bhtsne]$ echo -e '1.0\t0.0\n0.0\t1.0'
1.0 0.0
0.0 1.0
[[email protected]@login-node03 bhtsne]$ echo -e '1.0\t0.0\n0.0\t1.0' | ./bhtsne.py -d 2 -p 0.1
-2227.32653069 6608.48958328
2227.32653069 -6608.48958328
[[email protected]@login-node03 bhtsne]$ echo -e '1.0\t0.0\n0.0\t1.0' > a_file.txt
[[email protected]@login-node03 bhtsne]$ cat a_file.txt
1.0 0.0
0.0 1.0
[[email protected]@login-node03 bhtsne]$ ./bhtsne.py -d 2 -p 0.1 -i a_file.txt
-6863.21277159 -1236.73732294
6863.21277159 1236.73732294
Changes in 5d347d1 break Windows support, as Windows doesn't support os.fork.
Wish I could propose a fix, but I'm not sure if/how to achieve the same sort of functionality on Windows. As a stopgap, I suppose the use of the forked process could just be conditional on platform...
Hello!
I downloaded this package and installed it on OS X and Windows.
But when I run the demonstration of usage in MATLAB (https://github.com/lvdmaaten/bhtsne/), I get the error message below.
Could you please provide a new version or a method to avoid this problem?
Thank you very much for sharing.
After following the steps in the readme the command window tells me:
sptree.cpp(111): error C3861: 'fmax': identifier not found
Hi
I've built the library for node.js, but it's not obvious how to call it (parameters, callbacks, etc.). Is it possible to add a JavaScript call example to the readme?
Thanks
Ian
After #29 my code doesn't run anymore. I think it is because the first argument of run_bh_tsne now expects an input_file and not a numpy array anymore. The function bh_tsne now makes a numpy array using this input_file.
Is this intended behaviour? In my opinion it would be better to provide both ways: start bh_tsne with input file and with a numpy array.
The error message:
File "/home/hans/wart-detection/bhtsne.py", line 206, in run_bh_tsne
init_bh_tsne(input_file, tmp_dir_path, no_dims=no_dims, perplexity=perplexity, theta=theta, randseed=randseed,verbose=verbose, initial_dims=initial_dims, use_pca=use_pca, max_iter=max_iter)
File "/home/hans/wart-detection/bhtsne.py", line 101, in init_bh_tsne
for l in input_file), start=1):
File "/home/hans/wart-detection/bhtsne.py", line 101, in <genexpr>
for l in input_file), start=1):
AttributeError: 'numpy.ndarray' object has no attribute 'rstrip'
Exception AttributeError: "'NoneType' object has no attribute 'path'" in <function _remove at 0x7fb39c57a6e0> ignored
Does this implementation keep the original input order of samples in the output?
Apologies if this is not an "issue", but rather a question that I have about the implementation (or my lack of understanding thereof).
The paper says in section 3.1 (and in the pseudocode of Algorithm 1)
set p_ij = (p_{j|i} + p_{i|j}) / (2n)
The actual implementation of the symmetrization in tsne.cpp, line 112, seems to be
double sum_P = .0;
for(int i = 0; i < N * N; i++) sum_P += P[i];
for(int i = 0; i < N * N; i++) P[i] /= sum_P;
thus not dividing by 2n, but by the sum of all elements.
Which one is right?
My gut feeling is: It does not matter. Both achieve the same goal. But then, I wonder why the effort of computing the sum is undertaken.
Am I overlooking something here?
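A quick numerical check (mine, not from the repository) supports the gut feeling: each conditional distribution p_{j|i} sums to 1 over j, so after the symmetrization step P[i,j] += P[j,i] the total mass is 2n, and dividing by the computed sum equals dividing by 2n up to the floating-point error left over from the perplexity search.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Random conditional probabilities: zero diagonal, each row sums to 1.
C = rng.random((n, n))
np.fill_diagonal(C, 0.0)
C /= C.sum(axis=1, keepdims=True)
P = C + C.T            # symmetrize as in tsne.cpp
total = P.sum()        # equals 2n up to rounding error
```

Dividing by the measured sum rather than the nominal 2n guarantees the normalized P sums to exactly 1 even when the conditional rows are only approximately normalized.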
Laurens (and anyone else)
I've got a stupid question: we're stuck with some very complex Fortran code for neural recording analysis. Basically we can't rewrite the Fortran suite from scratch, but we would love to try t-SNE somehow.
What do you recommend? I'm willing to run t-SNE through Fortran system calls, but that's quite a hack. Do you know of anyone implementing your work in Fortran?
Thanks for your time.
catubc
Help! I'm having trouble with HLLE.m: when I feed the MNIST dataset into hlle, an error comes up at line 118, "[mappedX, eigenvals] = eigs(G, no_dims + 1, tol, options);":
"""
Error in eigs (line 93)
[A,Amatrix,isrealprob,issymA,n,B,classAB,k,eigs_sigma,whch, ...
Error in hlle (line 118)
[mappedX, eigenvals] = eigs(G , no_dims + 1, tol, options);
Error in test_hlle_minst (line 8)
[mappedX] = hlle(train_X, 2, 12);
"""
I would very much appreciate it if you could help me overcome these errors!
Thank you!
Hi, I was wondering about the speed of exact t-SNE and Barnes-Hut t-SNE in the C++ code.
In my case, exact t-SNE is almost 10 times faster than BH t-SNE, which does not make much sense theoretically. Has anyone encountered similar results?
If exact t-SNE is faster in the C++ version, could anyone explain a bit about why this is the case? I'd really appreciate it!
Hey,
it's more a question than an actual issue: I'm mapping a dataset of 32 dims x 900000 items with t-SNE on a multi-core machine, but as t-SNE is single-threaded I'm only using one core. Do you have any tips or tricks for how I can split the dataset to parallelize the computation?
thanks in advance!
First, thanks a ton for this research and the implementation! Here's one thing I ran into while using it.
So, it's totally my fault that I fed NaNs to bh_tsne but the crash was a little mysterious, so I thought I would post an issue in case someone else ran into the same thing. From the Python wrapper I see:
- point 0 of 30583
Traceback (most recent call last):
File "bh_tsne/bhtsne.py", line 176, in <module>
exit(main(argv))
File "bh_tsne/bhtsne.py", line 167, in main
verbose=argp.verbose):
File "bh_tsne/bhtsne.py", line 125, in bh_tsne
'refer to the bh_tsne output for further details')
AssertionError: ERROR: Call to bh_tsne exited with a non-zero return code exit status, please refer to the bh_tsne output for further details
Then, copying the data.dat and feeding it to the binary with a debugger, I see it crashes here with EXC_BAD_ACCESS (or Segmentation Fault: 11 when you run it without a debugger):
// Compute Gaussian kernel row
for(int m = 0; m < K; m++) cur_P[m] = exp(-beta * distances[m + 1]);
distances is 0-length, so I looked at my data more closely, sorting it and searching for duplicate or weird rows. I noticed some NaNs and verified it with this at the end of TSNE::load_data:
int k = 0;
for(int i = 0; i < *n; i++) {
    for(int j = 0; j < *d; j++) {
        if(isnan((*data)[k++])) {
            printf("Found NaN at %i x %i!\n", i, j);
        }
    }
}
A quick hack to clean the data, grep -v 'nan' vectors > vectors-clean,
and it looks like it's working, but now I need to fix the original cause of the problem :)
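A lighter-weight guard (a sketch, assuming the input is already a NumPy array; the variable names are mine) would be to check for NaNs in the wrapper before data.dat is ever written:

```python
import numpy as np

# Stand-in input: row 1 contains a NaN.
data = np.array([[0.1, 0.2], [float("nan"), 0.4]])
# Indices of rows containing at least one NaN; report them instead of
# letting the binary segfault later.
bad_rows = np.flatnonzero(np.isnan(data).any(axis=1))
```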
Both tsne.h and tsne_main.cpp seem to be almost entirely C code. Would you be opposed to a pull request making them C-compatible? This would make your code usable as both a C and a C++ library, allowing easy implementation of language bindings without intermediate files.
I am interested in this for the C library bindings and for contributing Go bindings.
Is there an out-of-sample (OOS) extension for t-SNE?
Hello,
I found this implementation because sklearn's TSNE doesn't scale well with my 50k x 50k similarity matrix. Is there a simple way to pass this matrix in, the same way it is passed in scikit-learn? Thanks.