mkusner / wmd
Word Mover's Distance from Matthew J Kusner's paper "From Word Embeddings to Document Distances"
I have installed numpy, and when I type "import numpy.core.multiarray" in the shell, it works.
I don't know why this problem appears.
luoyj@luoyj-Lenovo-M490:~/wmd-master$ python wmd.py twitter_vec.pk twitter_wmd_d.pk
Traceback (most recent call last):
File "wmd.py", line 11, in <module>
[X, BOW_X, y, C, words] = pickle.load(f)
File "/usr/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
klass = self.find_class(module, name)
File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
__import__(module)
ImportError: No module named multiarray
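One possible cause, offered as an assumption rather than a confirmed diagnosis: if the .pk file is a text-protocol pickle whose line endings were converted to CRLF (for example by a download or checkout step), the unpickler can end up looking for a module name with a trailing carriage return, which fails even though numpy itself imports fine. A minimal sketch of a loader that normalizes line endings first; load_text_pickle is a hypothetical helper, not part of wmd.py, and it is only safe for protocol-0 (ASCII) pickles:

```python
import pickle


def load_text_pickle(path):
    """Load a protocol-0 (ASCII) pickle, undoing CRLF line-ending conversion.

    Assumption: the file was written with the text protocol. Binary-protocol
    pickles must NOT be rewritten this way, since the byte pair may be real data.
    """
    with open(path, "rb") as f:  # always read pickles in binary mode
        raw = f.read()
    if b"\r\n" in raw:
        raw = raw.replace(b"\r\n", b"\n")
    return pickle.loads(raw)
```

If the file loads this way, the data was fine and only the line endings were damaged; if it still fails, the cause lies elsewhere.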
# git clone https://github.com/mkusner/wmd.git
Cloning into 'wmd'...
remote: Counting objects: 41, done.
remote: Total 41 (delta 0), reused 1 (delta 0), pack-reused 40
Unpacking objects: 100% (41/41), done.
Checking connectivity... done.
# cd wmd/
# pip install gensim numpy scipy
# cd python-emd-master/
# make
>>> Building object file 'emd.o'.
cc -o emd.o -c emd.c -fPIC -I/usr/include/python2.7 -I/usr/include/x86_64-linux-gnu/python2.7
In file included from emd.c:20:0:
emd.h:22:0: warning: "INFINITY" redefined
#define INFINITY 1e20
^
In file included from /usr/include/math.h:41:0,
from emd.c:18:
/usr/include/x86_64-linux-gnu/bits/inf.h:26:0: note: this is the location of the previous definition
# define INFINITY (__builtin_inff())
^
In file included from emd.c:20:0:
emd.h:32:20: warning: extra tokens at end of #include directive
#include "Python.h";
^
>>> Generating C interface
swig -python emd.i
make: swig: Command not found
Makefile:51: recipe for target 'emd_wrap.c' failed
make: *** [emd_wrap.c] Error 127
rm emd.o
This is not really a bug report, but a question about compatibility with the GenSim library.
Using the first twitter corpus texts, i.e.
now all apple has to do is get swype on the iphone and it will be crack iphone that is
and
apple will be adding more carrier support to the iphone 4s just announced,
I get a distance of 0.99 using the GenSim WMD implementation and 2.6625 using this implementation (the original one, from the paper's author).
At first sight, I thought that it was related to your stop words list. That said, debugging your code I see that the first and second texts become:
apple swype iphone iphone crack
apple adding carrier support iphone 4s announced
However, running with the words above, I still get a completely different result. Using GenSim and filtering your stop words (as above) I get 0.96 wmd.
Is there any place where this compatibility is discussed?
Could anybody please confirm if the same numbers are returned for different implementations?
This highly impacts the effectiveness of using GenSim implementation to find semantically close texts.
When I run the example script inside VMWare with Ubuntu installed as a guest OS, I get a matrix with around 100K NaN entries. Could it be a problem with the EMD solver?
How do I change the signature size in the emd module?
emd: Unexpected error in findBasicVariables! This typically happens when the EPSILON defined in emd.h is not right for the scale of the problem.
in emd.i
%typemap(freearg) signature_t * {
if ($1 != NULL) {
PyObject **features_array = (PyObject **) $1->Features;
int weights_count = (int)$1->n;
int i = 0;
for (i = 0; i < weights_count; ++i) {
Py_XDECREF(features_array[i]);
}
free((PyObject **) $1->Features);
free((float *) $1->Weights);
free((signature_t *) $1);
}
}
Dear sir,
Sorry to trouble you.
After I run wmd.py, I get a distance matrix between all documents.
But I am puzzled about the rows and columns of this distance matrix:
Does each row of the distance matrix represent one document, i.e. is it the document vector?
Does each column of the distance matrix represent the same word in each document?
Thank you so much!
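For the row/column question, a minimal sketch under the assumption that the saved matrix is n-by-n for n documents, so that rows and columns both index documents and entry D[i][j] is the distance between document i and document j (no word dimension is involved). The toy_distance function is a hypothetical stand-in for the emd call, and the 2-D points stand in for documents:

```python
import math


def toy_distance(a, b):
    """Hypothetical stand-in for the real document distance (Euclidean here)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(items, dist):
    """Build a symmetric n-by-n matrix with D[i][j] = dist(items[i], items[j])."""
    n = len(items)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = dist(items[i], items[j])
    return D


def nearest_document(D, i):
    """Index of the document closest to document i (excluding i itself)."""
    others = [j for j in range(len(D)) if j != i]
    return min(others, key=lambda j: D[i][j])


docs = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]  # toy "documents"
D = pairwise_distances(docs, toy_distance)
closest_to_first = nearest_document(D, 0)
```

Read this way, row i lists the distances from document i to every other document, which is what a nearest-neighbour lookup needs.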
Wow, great paper! Thank you for making the code OSS.
The documentation says that the Python wrapper is not suitable for parallel execution:
The wrapper is not suited for concurrent execution. It uses a global variable for the distance callback function, so calling emd from concurrent threads will result in undefined behavior.
However, the function get_wmd calls emd concurrently. Can you please explain?
swig is required, but not mentioned.
in emd.h, the include of Python.h has a ; that should be removed
@mkusner I read your paper and want to use your WCD+RWMD method to calculate document similarity in my document recommendation project. I found the MATLAB code for RWMD, but I didn't find the code for WCD. Is it the file named distance.m?
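For reference, both bounds are short to write down from their definitions in the paper: WCD is the Euclidean distance between the weighted centroids of the two documents, and RWMD is the larger of the two one-sided relaxations in which each word moves entirely to its nearest word in the other document. A minimal pure-Python sketch follows; the function names and the list-of-vectors input layout are assumptions made for illustration, not the repository's MATLAB code:

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalize(weights):
    """Turn raw word counts into nBOW weights that sum to 1."""
    total = float(sum(weights))
    return [w / total for w in weights]


def wcd(vecs1, w1, vecs2, w2):
    """Word Centroid Distance: distance between weighted mean word vectors."""
    d1, d2 = normalize(w1), normalize(w2)
    dim = len(vecs1[0])
    c1 = [sum(d * v[k] for d, v in zip(d1, vecs1)) for k in range(dim)]
    c2 = [sum(d * v[k] for d, v in zip(d2, vecs2)) for k in range(dim)]
    return euclidean(c1, c2)


def rwmd(vecs1, w1, vecs2, w2):
    """Relaxed WMD: max of the two nearest-neighbour relaxations."""
    d1, d2 = normalize(w1), normalize(w2)
    one_way = sum(d * min(euclidean(u, v) for v in vecs2)
                  for d, u in zip(d1, vecs1))
    other_way = sum(d * min(euclidean(v, u) for u in vecs1)
                    for d, v in zip(d2, vecs2))
    return max(one_way, other_way)


# Toy example: two words in document 1, one word in document 2.
lower_wcd = wcd([(0, 0), (1, 0)], [1, 1], [(0, 1)], [1])
lower_rwmd = rwmd([(0, 0), (1, 0)], [1, 1], [(0, 1)], [1])
```

Both values are lower bounds on the exact WMD, so they can be used to discard candidate documents before running the more expensive solver.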
macOS High Sierra 10.13
Install error:
Building object file 'emd.o'.
-n
cc -o emd.o -c emd.c -fPIC -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7
In file included from emd.c:20:
./emd.h:22:9: warning: 'INFINITY' macro redefined [-Wmacro-redefined]
#define INFINITY 1e20
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk/usr/include/math.h:68:9: note: previous definition is here
#define INFINITY HUGE_VALF
^
1 warning generated.
Generating C interface
swig -python emd.i
Building object file 'emd_wrap.o'.
-n
cc -o emd_wrap.o -c emd_wrap.c -fPIC -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7
In file included from emd_wrap.c:3020:
./emd.h:22:9: warning: 'INFINITY' macro redefined [-Wmacro-redefined]
#define INFINITY 1e20
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk/usr/include/math.h:68:9: note: previous definition is here
#define INFINITY HUGE_VALF
^
1 warning generated.
Linking wrapper library '_emd.so'.
-n
ld -shared -o _emd.so emd.o emd_wrap.o
ld: unknown option: -shared
make: *** [_emd.so] Error 1
rm emd_wrap.o emd.o emd_wrap.c
Hello Matthew,
Please let me know how I can use the GloVe vectors trained on the 840-billion-token corpus as the embeddings for your code.
Kindly explain how to use them.
Thanks in advance.
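As a starting point, the GloVe text format stores one word per line followed by its vector components, separated by spaces. A minimal sketch of a reader follows; load_glove is a hypothetical helper, and how its output is wired into this repository's preprocessing is not covered here. Splitting from the right guards against tokens that themselves contain spaces:

```python
def load_glove(lines, dim, vocab=None):
    """Parse GloVe-format text lines into a {word: [float, ...]} dict.

    lines: any iterable of strings (an open file works).
    dim:   expected vector dimensionality.
    vocab: optional set of words to keep, to limit memory use.
    """
    vectors = {}
    for line in lines:
        # Split from the right so a token containing spaces stays intact.
        parts = line.rstrip("\r\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed or truncated lines
        word = parts[0]
        if vocab is not None and word not in vocab:
            continue
        vectors[word] = [float(x) for x in parts[1:]]
    return vectors


sample = ["apple 0.1 0.2 0.3\n", ". . . 1 2 3\n", "bad 1 2\n"]
toy_vectors = load_glove(sample, dim=3)
```

With a real file this would be called as load_glove(open(path, encoding="utf-8"), 300, vocab=my_words), where path and my_words are placeholders; restricting to the corpus vocabulary keeps the full embedding table out of memory.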
Thank you for releasing the implementation of your paper.
First, I tried your code with your data (all_twitter_by_line.txt). It worked well.
Second, I tried the 20 Newsgroups data that was used in your paper.
Then I got the error
"emd: Maximum number of iterations has been reached 1013"
because of the limit MAX_SIG_SIZE 100.
So I changed it to exceed the maximum number of unique keywords in the 20 Newsgroups dataset (= 5284).
Now the run blocks after some steps.
I think it's because of multiprocessing.
I checked CPU availability; it was 99% in a multi-CPU, multi-core environment.
Is there any solution for this?
During the WMD distance matrix computation, the message "emd: Signature size is limited to 100" appears several times. What should be done?
root@user-virtual-machine:/home/user/WMD# python wmd.py asd.pk asdwmd.pk
[pool :] <multiprocessing.pool.Pool object at 0x7f327f1cc150>
0 out of 3
1 out of 3
emd: Signature size is limited to 100
2 out of 3
emd: Signature size is limited to 100
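The message suggests that some documents contain more unique words than the solver's compiled-in limit. A minimal sketch for finding those documents before calling emd, and for optionally truncating them to their highest-weight words, is below. The helper names are hypothetical, the limit of 100 is taken from the log above, and truncation changes the resulting distance, so it is an approximation rather than a fix:

```python
MAX_SIG_SIZE = 100  # assumed to mirror the limit reported in the log


def oversized_documents(bow_list, limit=MAX_SIG_SIZE):
    """Return (index, size) for each document with more than `limit` unique words."""
    return [(i, len(bow)) for i, bow in enumerate(bow_list) if len(bow) > limit]


def truncate_signature(vectors, weights, limit=MAX_SIG_SIZE):
    """Keep the `limit` highest-weight words, preserving their original order."""
    keep = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)[:limit]
    keep.sort()
    return [vectors[k] for k in keep], [weights[k] for k in keep]


too_big = oversized_documents([[1] * 3, [1] * 150])
kept_vectors, kept_weights = truncate_signature(["a", "b", "c"], [1, 5, 3], limit=2)
```

The alternative mentioned elsewhere on this page is to raise MAX_SIG_SIZE in emd.h and rebuild the module, which keeps the distances exact at the cost of more memory and time.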
stop.txt and the training data are both in Chinese. How can I solve this problem?
Has anyone (at least partially) evaluated the quality of outputs compared to Sørensen–Dice coefficient?
Please share your findings (ideally from production 😉), thanks!
Does this code run in python 3.x? Thanks!
Hello,
Thank you for the great work and nice implementation. It really helps me!
I know that I can obtain distances through emd( (X[i], BOW_X[i]), (X[j], BOW_X[j]), distance). But how can I get the flow information (the transportation matrix)? I have no idea how to get it through the Python interface.
Zhe Zhao
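One way to inspect the flow without touching the C wrapper is to solve the same transportation problem as a small linear program. The sketch below uses scipy.optimize.linprog rather than this repository's emd module, so it is a swapped-in technique and not the author's solver. wmd_with_flow is a hypothetical name, and the layout (one word vector per row, raw counts as weights) is an assumption:

```python
import numpy as np
from scipy.optimize import linprog


def wmd_with_flow(X1, w1, X2, w2):
    """Solve the WMD transport LP; return (distance, flow matrix T).

    T[i, j] is the mass moved from word i of document 1 to word j of document 2.
    """
    X1, X2 = np.asarray(X1, dtype=float), np.asarray(X2, dtype=float)
    d1 = np.asarray(w1, dtype=float) / np.sum(w1)  # nBOW weights, sum to 1
    d2 = np.asarray(w2, dtype=float) / np.sum(w2)
    n, m = len(d1), len(d2)

    # Cost matrix: Euclidean distance between every pair of word vectors.
    C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)

    # Equality constraints on the flattened flow (index i * m + j).
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # row i sums to d1[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # column j sums to d2[j]
    b_eq = np.concatenate([d1, d2])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(n, m)


dist, flow = wmd_with_flow([[0, 0], [1, 0]], [1, 1], [[0, 1]], [1])
```

A dense LP like this is fine for short texts but grows quickly with vocabulary size, so it is better suited to inspecting individual document pairs than to filling a full distance matrix.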
I was wondering whether MATLAB is required to run or edit the code?
Hi,
first of all thank you for the great work and nice implementation!
The tool works fine for me and I will use it for document comparison in the social media context. Can you please give me some advice on how to work with the resulting "...wmd_d.pk" file? At first I thought the result would be a text file with a readable matrix in it, but now I think I need some additional software?
Thank you very much!
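The .pk extension suggests a Python pickle, so no additional software should be needed beyond Python itself. A minimal sketch that loads such a file and writes it out as CSV, which other languages and spreadsheet tools can read, is below. It assumes the file holds a single matrix (a list of rows or a 2-D array); pickle_matrix_to_csv is a hypothetical helper name:

```python
import csv
import pickle


def pickle_matrix_to_csv(pickle_path, csv_path):
    """Load a pickled 2-D matrix and write it as CSV; return the row count."""
    with open(pickle_path, "rb") as f:
        D = pickle.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in D:  # works for nested lists and for 2-D arrays alike
            writer.writerow([float(x) for x in row])
    return len(D)
```

If the file was written under Python 2 and is read under Python 3, passing encoding="latin1" to pickle.load may be necessary for arrays to load; that depends on how the file was produced.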
I had installation issues similar to the ones mentioned above.
running
sudo apt-get install python-dev # for python2.x installs
or
sudo apt-get install python3-dev # for python3.x installs
and
removing the ";" after the Python.h include in emd.h
solved the problems.
It would be very nice if the output distance matrix file were independent of Python-specific formats, so that we could use it in other languages as well.
Why is it that I keep getting an "ImportError: No module named _emd" error from emd.py? I use Python 2.7.
May I ask what is '_emd' ? I assume it's not the same as pyemd?
Thanks in advance for your time!