
Meemi

This repository contains the code and pre-trained cross-lingual word embeddings from the paper Improving cross-lingual word embeddings by meeting in the middle (EMNLP 2018).

Pre-trained embeddings

We release the 300-dimensional word embeddings used in our experiments (English, Spanish, Italian, German and Finnish) as binary (.bin) files:

  • Monolingual FastText embeddings: Available here
  • Baseline cross-lingual embeddings: Available here
  • Cross-lingual embeddings post-processed with Meemi: Available here

Note 1: All vocabulary words are lowercased.

Note 2: If you would like to convert the binary files to txt, you can use convertvec.
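
Note 3: Since Gensim is listed under the requirements below, the binary files can also be loaded directly in Python. A minimal sketch, assuming the files follow the standard word2vec binary format (the file name is taken from the examples further down):

  # Load one of the released binary embedding files with Gensim.
  # Assumes the standard word2vec binary format; remember that all words are lowercased.
  from gensim.models import KeyedVectors

  vectors = KeyedVectors.load_word2vec_format("EN-ES.english.vecmap.meemi.bin", binary=True)
  print(vectors["house"][:5])                   # first 5 dimensions of the vector for "house"
  print(vectors.most_similar("house", topn=3))  # nearest neighbors within the same space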

Requirements:

  • Python 3
  • NumPy
  • Gensim
  • If you use VecMap or MUSE, please also check their corresponding GitHub pages. Note that we use earlier versions of these tools, copies of which are included in this repository (work in progress).

Usage

get_crossembs.sh SOURCE_EMBEDDINGS TARGET_EMBEDDINGS DICTIONARY_FILE [-vecmap | -muse TRAIN_DICT VALID_DICT]

Apply Meemi to your cross-lingual embeddings

get_crossembs.sh SOURCE_EMBEDDINGS TARGET_EMBEDDINGS DICTIONARY_FILE

Example:

get_crossembs.sh EN-ES.english.vecmap.txt EN-ES.spanish.vecmap.txt en-es.dict.txt
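
Conceptually, the Meemi step averages the two aligned vectors of each translation pair in the dictionary and then learns a final, unconstrained linear transformation from each monolingual space onto those averages, which is applied to the full vocabularies. Below is a minimal NumPy sketch of this idea (illustrative only; the names are ours, and the actual script also handles I/O, normalization, etc.), assuming the aligned dictionary vectors are stacked row-wise in the same order:

  # Sketch of the "meeting in the middle" step: map both aligned spaces onto the
  # averages of the dictionary pairs via least-squares linear transformations.
  import numpy as np

  def meemi_maps(X_dict, Z_dict):
      """X_dict, Z_dict: aligned source/target vectors of the dictionary pairs (same row order)."""
      M = (X_dict + Z_dict) / 2.0                        # average each translation pair
      Wx = np.linalg.lstsq(X_dict, M, rcond=None)[0]     # source space -> averages
      Wz = np.linalg.lstsq(Z_dict, M, rcond=None)[0]     # target space -> averages
      return Wx, Wz                                      # then use X_all @ Wx and Z_all @ Wz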

Use VecMap to align monolingual embeddings and then apply Meemi

get_crossembs.sh SOURCE_EMBEDDINGS TARGET_EMBEDDINGS DICTIONARY_FILE -vecmap

Use MUSE to align monolingual embeddings and then apply Meemi

get_crossembs.sh SOURCE_EMBEDDINGS TARGET_EMBEDDINGS DICTIONARY_FILE -muse TRAIN_SIZE VALID_SIZE

Experiments

Bilingual Dictionary Induction

To test your embeddings on bilingual dictionary induction, run the following:

python test.py SOURCE_EMBEDDINGS TARGET_EMBEDDINGS < DICTIONARY_FILE
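
The evaluation amounts to nearest-neighbor retrieval: for each source word in the dictionary, the closest target word by cosine similarity is retrieved and precision@1 is reported. A minimal sketch of that logic (not the exact test.py implementation; OOV handling and tie-breaking may differ), assuming both spaces are loaded as Gensim KeyedVectors and the dictionary is a list of (source, target) pairs:

  # Sketch of precision@1 for bilingual dictionary induction via cosine nearest neighbors.
  import numpy as np

  def precision_at_1(src_vecs, tgt_vecs, pairs):
      tgt_words = tgt_vecs.index_to_key
      T = tgt_vecs.get_normed_vectors()                # unit-normalized target matrix
      hits = total = 0
      for s, t in pairs:
          if s not in src_vecs or t not in tgt_vecs:
              continue                                 # skip out-of-vocabulary pairs
          q = src_vecs[s] / np.linalg.norm(src_vecs[s])
          nearest = tgt_words[int(np.argmax(T @ q))]   # cosine nearest neighbor
          hits += int(nearest == t)
          total += 1
      return hits / total if total else 0.0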

Word similarity

To test your embeddings on monolingual word similarity, run the following:

python test_similarity_monolingual.py EMBEDDINGS DATASET

You can also test various datasets at the same time:

python test_similarity_monolingual.py EMBEDDINGS DATASET1 [DATASET2] ... [DATASETN]

Likewise, to test your cross-lingual embeddings on cross-lingual word similarity, run the following:

python test_similarity_crosslingual.py SOURCE_EMBEDDINGS TARGET_EMBEDDINGS DATASET

As with monolingual similarity, you can also test various datasets at the same time. Below is an example of how to test your English-Spanish cross-lingual embeddings on all the monolingual and cross-lingual word similarity datasets:

python test_similarity_monolingual.py EN-ES.english.vecmap.meemi.bin data/SimLex/simlex-999_english.txt data/SemEval2018-subtask1-monolingual/english.txt data/rg65-monolingual/rg65_english.txt data/WS353-monolingual/WS353-english-sim.txt
python test_similarity_monolingual.py EN-ES.spanish.vecmap.meemi.bin data/SemEval2018-subtask1-monolingual/spanish.txt data/rg65-monolingual/rg65_spanish.txt
python test_similarity_crosslingual.py EN-ES.english.vecmap.meemi.bin EN-ES.spanish.vecmap.meemi.bin data/SemEval2018-subtask2-crosslingual/en-es.txt data/rg65-crosslingual/rg65_EN-ES.txt

Note: This code assumes that lowercased word embeddings are provided as input. If you would like to maintain the casing, simply remove the .lower() calls in the evaluation scripts.
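
Internally, these evaluations report the Spearman correlation between the cosine similarities produced by the embeddings and the gold scores of each dataset. A minimal sketch of the cross-lingual case (illustrative only; it additionally uses SciPy for the correlation and assumes tab-separated dataset lines of the form word1, word2, gold_score):

  # Sketch of cross-lingual word similarity evaluation: Spearman correlation between
  # cosine similarities and gold scores (dataset format assumed: word1<TAB>word2<TAB>score).
  import numpy as np
  from scipy.stats import spearmanr

  def cosine(u, v):
      return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

  def eval_similarity(src_vecs, tgt_vecs, dataset_path):
      system, gold = [], []
      with open(dataset_path, encoding="utf-8") as f:
          for line in f:
              w1, w2, score = line.strip().split("\t")
              w1, w2 = w1.lower(), w2.lower()          # the released vectors are lowercased
              if w1 in src_vecs and w2 in tgt_vecs:
                  system.append(cosine(src_vecs[w1], tgt_vecs[w2]))
                  gold.append(float(score))
      rho, _ = spearmanr(system, gold)
      return rho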

Cross-lingual Hypernym Discovery

Hypernym discovery is the task of retrieving, for a given term, a ranked list of its valid hypernyms. In this experiment, a hypernym discovery system is trained on English data (and possibly, in a weakly supervised setting, on some target-language data as well) and makes predictions in the target language.
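
As we understand it, the supplied taxoembed.py script follows a TaxoEmbed-style approach: a linear transformation is learned from training (hyponym, hypernym) pairs, and candidate hypernyms from the target-language vocabulary are then ranked by similarity to the transformed query vector, which works across languages because the embeddings are cross-lingually aligned. A minimal sketch of the ranking step only (illustrative; the transformation W, the candidate vocabulary and its embedding matrix are assumed to be prepared elsewhere):

  # Sketch: rank candidate hypernyms by cosine similarity to the linearly transformed query vector.
  import numpy as np

  def rank_hypernyms(query_vec, W, cand_words, cand_matrix, topn=15):
      projected = query_vec @ W                                    # map the query towards "hypernym space"
      projected = projected / np.linalg.norm(projected)
      C = cand_matrix / np.linalg.norm(cand_matrix, axis=1, keepdims=True)
      scores = C @ projected                                       # cosine similarity to every candidate
      top = np.argsort(-scores)[:topn]
      return [(cand_words[i], float(scores[i])) for i in top]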

To run the hypernym discovery experiments, launch the following command:

python3 experiments/hypernym_discovery/taxoembed.py -wvtrain SOURCE_EMBEDDINGS -wvtest TARGET_EMBEDDINGS -vtest TARGET_VOCABULARY -hypotrain SOURCE_HYPONYMS -hypertrain SOURCE_HYPERNYMS -test TARGET_HYPONYMS -newtrain TARGET_LANG_TRAINING_INSTANCES -npairs NUMB_TRAINING_INSTANCES -o OUTPUT_FOLDER 

The predictions of the model are saved in OUTPUT_FOLDER with the name [TARGET_EMBEDDINGS]_[NUMB_TRAINING_INSTANCES]_[TARGET_LANG_TRAINING_INSTANCES]_W.txt.

For example, to evaluate a hypernym discovery model for Spanish trained on VecMap-aligned English vectors plus 500 additional Spanish training instances:

 python3 experiments/hypernym_discovery/taxoembed.py -wvtrain EN-ES.english.vecmap.bin -wvtest EN-ES.spanish.vecmap.bin -vtest experiments/hypernym_discovery/SemEval2018-Task9/vocabulary/1C.spanish.vocabulary.txt -hypotrain experiments/hypernym_discovery/SemEval2018-Task9/training/data/1A.english.training.data.txt -hypertrain experiments/hypernym_discovery/SemEval2018-Task9/training/gold/1A.english.training.gold.txt -test experiments/hypernym_discovery/SemEval2018-Task9/test/data/1C.spanish.test.data.txt -o experiments/hypernym_discovery/ -newtrain experiments/hypernym_discovery/SemEval2018-Task9/utils/spanish_train.tsv -npairs 500

Then call the official SemEval task scorer, passing as arguments the gold file and the predictions file generated in the previous step.

python experiments/hypernym_discovery/SemEval2018-Task9/task9-scorer.py GOLD_FILE PREDICTIONS_FILE

For the previous example, the exact command would be:

python experiments/hypernym_discovery/SemEval2018-Task9/task9-scorer.py experiments/hypernym_discovery/SemEval2018-Task9/test/gold/1C.spanish.test.gold.txt experiments/hypernym_discovery/EN-ES.spanish.vecmap.bin_500_1C.spanish.output_W.txt 

Reference paper

If you use any of these resources, please cite the following paper:

@InProceedings{doval:meemiemnlp2018,
  author    = "Doval, Yerai and Camacho-Collados, Jose and Espinosa-Anke, Luis and Schockaert, Steven",
  title     = "Improving cross-lingual word embeddings by meeting in the middle",
  booktitle = "Proceedings of EMNLP",
  year      = "2018",
  publisher = "Association for Computational Linguistics",
  location  = "Brussels, Belgium"
}

If you use VecMap or MUSE, please also cite their corresponding papers.
