
Wasserstein For Documents

Implementation for our ECIR 2018 paper "Fast Cross-lingual Document Retrieval using Regularized Wasserstein Distance". The code in this repository implements the Wass and Entro_Wass models of the paper, building on existing open-source Python libraries.
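The Entro_Wass model relies on the entropic regularization of the Wasserstein distance, which can be computed with Sinkhorn iterations. The following is a minimal, illustrative NumPy sketch of that computation on toy histograms; it is not the repository's implementation, which operates on word embeddings and document word weights:

```python
import numpy as np

def sinkhorn_distance(a, b, M, reg=0.1, n_iter=200):
    """Entropic-regularized Wasserstein distance between histograms a and b
    under cost matrix M, via Sinkhorn fixed-point iterations (illustrative)."""
    K = np.exp(-M / reg)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):               # alternate scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]       # resulting transport plan
    return float(np.sum(P * M))

# Toy example: move all mass from point 0 to point 2 on the line.
x = np.array([0.0, 1.0, 2.0])
M = np.abs(x[:, None] - x[None, :])       # ground cost |x_i - x_j|
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 1.0])
print(sinkhorn_distance(a, b, M))         # close to the exact cost, 2.0
```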

Running the code

To run the code, one first needs to download the Numberbatch embeddings we used in this paper. We provide a script that downloads the embeddings and extracts those for a subset of languages, e.g., English and French. To do that, first clone the repository, move to the directory and execute the script:

git clone https://github.com/balikasg/WassersteinRetrieval
cd WassersteinRetrieval
bash get_embeddings.sh

This will take some time, as the script downloads the embeddings (1.1 GB compressed), uncompresses them, filters the English and French embeddings, and removes the files that are no longer needed, printing informative progress messages along the way. It creates two files, concept_net_1706.300.en and concept_net_1706.300.fr, containing the English and French word embeddings respectively. If need be, the scripts can be improved to read from the compressed file directly using Python's gzip module.
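Reading the compressed file directly with gzip could look like the sketch below. The file format assumed here (a header line followed by "word dim1 dim2 ..." lines with /c/<lang>/<word> identifiers) follows the Numberbatch distribution; the function name and filtering logic are illustrative, not the repository's loader:

```python
import gzip
import numpy as np

def load_embeddings(path, languages=("en", "fr")):
    """Load Numberbatch-style vectors for a subset of languages,
    reading .gz files without extracting them first (illustrative)."""
    opener = gzip.open if path.endswith(".gz") else open
    vectors = {}
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 3:            # skip the "count dim" header line
                continue
            word = parts[0]
            # Numberbatch entries look like /c/en/cat; keep chosen languages.
            if word.startswith("/c/") and word.split("/")[2] not in languages:
                continue
            vectors[word] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
```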

To run the cross-lingual retrieval experiments, run:

python emd.py concept_net_1706.300.en concept_net_1706.300.fr wiki_data/wikicomp.enfr.2k.en wiki_data/wikicomp.enfr.2k.fr 500 french

This runs the emd.py program with several positional arguments: the first two are the embedding files, the next two are the datasets on which retrieval is performed, the fifth is the upper limit on the number of words kept per document (we used 500 for efficiency), and the last is the second language (by default the first language is English).
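The argument order can be summarized as follows; this is a hypothetical helper mirroring the command shown above, not code from emd.py itself:

```python
import sys

def parse_args(argv):
    """Map the positional arguments of emd.py to named fields
    (hypothetical illustration of the expected argument order)."""
    return {
        "source_embeddings": argv[1],   # e.g. concept_net_1706.300.en
        "target_embeddings": argv[2],   # e.g. concept_net_1706.300.fr
        "source_corpus": argv[3],       # documents in the first language
        "target_corpus": argv[4],       # documents in the second language
        "max_words": int(argv[5]),      # keep at most this many words/doc
        "target_language": argv[6],     # first language defaults to English
    }
```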

Citing

In case you use the model or the provided code, please cite our paper:

@InProceedings{balikas2018ecir,
  author    = {Georgios Balikas and Charlotte Laclau and Ievgen Redko and Massih-Reza Amini},
  title     = {Cross-lingual Document Retrieval using Regularized Wasserstein Distance},
  booktitle = {Proceedings of the 40th European Conference on Information Retrieval, {ECIR} 2018, Grenoble, France, March 26-29, 2018},
  year      = {2018}}

Timings

The code can be parallelized easily, as one needs to calculate the distance of each query document to every document in the set of available documents. We have used pathos to parallelize the calculations at the level of queries: having N queries, the work is distributed over the available cores. The figure below illustrates the performance benefits when parallelizing the example of the section "Running the code", using 1, 2, 6, 10, 14, 18 and 22 cores on an Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz machine. Notice that Entro_Wass needs more time, but the difference is small when more than 10 cores are available. Also, Entro_Wass can be implemented on GPUs, but we did not have access to one while writing the paper.

(Figure: timings as a function of the number of cores.)
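The query-level parallelism described above can be sketched with the standard multiprocessing module (the repository itself uses pathos). The function names here are illustrative, and a plain Euclidean distance stands in for the Wasserstein distance:

```python
from multiprocessing import Pool

import numpy as np

def nearest_document(query, documents):
    """Return the index of the document closest to one query.
    Euclidean distance stands in for the Wasserstein distance here."""
    dists = [np.linalg.norm(query - d) for d in documents]
    return int(np.argmin(dists))

def retrieve_all(queries, documents, n_workers=4):
    """Score each query against every document, one query per task,
    mirroring the query-level parallelism described above."""
    with Pool(n_workers) as pool:
        return pool.starmap(nearest_document,
                            [(q, documents) for q in queries])
```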


