
muse's Introduction

MUSE: Multilingual Unsupervised and Supervised Embeddings

[Figure: Model]

MUSE is a Python library for multilingual word embeddings, whose goal is to provide the community with:

  • state-of-the-art multilingual word embeddings (fastText embeddings aligned in a common space)
  • large-scale high-quality bilingual dictionaries for training and evaluation

We include two methods, one supervised that uses a bilingual dictionary or identical character strings, and one unsupervised that does not use any parallel data (see Word Translation without Parallel Data for more details).

Dependencies

MUSE is available on CPU or GPU, in Python 2 or 3. Faiss is optional for GPU users - though Faiss-GPU will greatly speed up nearest neighbor search - and highly recommended for CPU users. Faiss can be installed using "conda install faiss-cpu -c pytorch" or "conda install faiss-gpu -c pytorch".
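
To give a sense of what Faiss is used for here, the sketch below builds an exact inner-product index over L2-normalized embeddings and retrieves nearest neighbors. The arrays are random placeholders and the snippet is only an illustration, not MUSE code:

import numpy as np
import faiss  # conda install faiss-cpu -c pytorch

# Placeholder data: 200k "target" embeddings and 1k query vectors of dimension 300.
dim = 300
tgt = np.random.rand(200000, dim).astype('float32')
queries = np.random.rand(1000, dim).astype('float32')

# Normalize rows so that the inner product equals the cosine similarity.
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

index = faiss.IndexFlatIP(dim)           # exact inner-product search
index.add(tgt)                           # index the target embeddings
scores, ids = index.search(queries, 10)  # top-10 neighbors for each query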

Get evaluation datasets

To download monolingual and cross-lingual word embeddings evaluation datasets:

  • Our 110 bilingual dictionaries
  • 28 monolingual word similarity tasks for 6 languages, and the English word analogy task
  • Cross-lingual word similarity tasks from SemEval2017
  • Sentence translation retrieval with Europarl corpora

You can simply run:

cd data/
wget https://dl.fbaipublicfiles.com/arrival/vectors.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/dictionaries.tar.gz

Alternatively, you can also download the data with:

cd data/
./get_evaluation.sh

Note: this requires bash 4. The Europarl download is disabled by default (it is slow); you can enable it in get_evaluation.sh.

Get monolingual word embeddings

For pre-trained monolingual word embeddings, we highly recommend fastText Wikipedia embeddings, or using fastText to train your own word embeddings from your corpus.

You can download the English (en) and Spanish (es) embeddings this way:

# English fastText Wikipedia embeddings
curl -Lo data/wiki.en.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
# Spanish fastText Wikipedia embeddings
curl -Lo data/wiki.es.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.vec

Align monolingual word embeddings

This project includes two ways to obtain cross-lingual word embeddings:

  • Supervised: using a training bilingual dictionary (or identical character strings as anchor points), learn a mapping from the source to the target space using (iterative) Procrustes alignment.
  • Unsupervised: without any parallel data or anchor points, learn a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement.

For more details on these approaches, please refer to Word Translation without Parallel Data (see the references below).

The supervised way: iterative Procrustes (CPU|GPU)

To learn a mapping between the source and the target space, simply run:

python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default

By default, dico_train will point to our ground-truth dictionaries (downloaded above); when set to "identical_char" it will use identical character strings between source and target languages to form a vocabulary. Logs and embeddings will be saved in the dumped/ directory.
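
For intuition, the Procrustes step has a closed-form solution that can be sketched in a few lines of NumPy (an illustration only, not the code used by supervised.py): given matrices X and Y whose rows are the paired source and target embeddings from a seed dictionary, the optimal orthogonal map is W = U V^T, where U S V^T is the SVD of Y^T X, and source vectors are then mapped with x -> W x. The names below (procrustes, src_emb, tgt_emb, dico) are hypothetical.

import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||X @ W.T - Y||_F.

    X, Y: (n_pairs, dim) arrays whose rows are paired source/target
    embeddings from a seed bilingual dictionary (hypothetical inputs).
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)   # SVD of Y^T X  ->  W = U V^T
    return U @ Vt

# Usage sketch: map every source embedding into the target space.
# W = procrustes(src_emb[dico[:, 0]], tgt_emb[dico[:, 1]])
# aligned_src = src_emb @ W.T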

The unsupervised way: adversarial training and refinement (CPU|GPU)

To learn a mapping using adversarial training and iterative Procrustes refinement, run:

python unsupervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5

By default, the validation metric is the mean cosine similarity of word pairs from a synthetic dictionary built with CSLS (Cross-domain Similarity Local Scaling). For some language pairs (e.g. En-Zh), we recommend centering the embeddings using --normalize_embeddings center.
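
For intuition, CSLS rescales the plain cosine similarity to penalize "hub" words: CSLS(Wx, y) = 2 cos(Wx, y) - r_T(Wx) - r_S(y), where r_T and r_S are the mean cosine similarities to the K nearest neighbors in the other space. Below is a dense NumPy sketch of that score (an illustration only, not the batched implementation in this repository); it assumes already-aligned, L2-normalized embedding matrices, and the function name csls_scores is made up.

import numpy as np

def csls_scores(src_aligned, tgt, k=10):
    """CSLS scores between aligned, L2-normalized embeddings.

    src_aligned: (n_src, dim), tgt: (n_tgt, dim); returns (n_src, n_tgt).
    Dense version for illustration only; the real code processes batches.
    """
    sims = src_aligned @ tgt.T                          # cosine similarities
    # r_T(Wx): mean similarity of each source word to its k nearest targets.
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # shape (n_src,)
    # r_S(y): mean similarity of each target word to its k nearest sources.
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # shape (n_tgt,)
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

# Usage sketch: the CSLS translation of source word i is
# csls_scores(src_aligned, tgt)[i].argmax().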

Evaluate monolingual or cross-lingual embeddings (CPU|GPU)

We also include a simple script to evaluate the quality of monolingual or cross-lingual word embeddings on several tasks:

Monolingual

python evaluate.py --src_lang en --src_emb data/wiki.en.vec --max_vocab 200000

Cross-lingual

python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000

Word embedding format

By default, the aligned embeddings are exported to a text format at the end of experiments: --export txt. Exporting embeddings to a text file can take a while if you have a lot of embeddings. For a very fast export, you can set --export pth to export the embeddings in a PyTorch binary file, or simply disable the export (--export "").

When loading embeddings, the model can load:

  • PyTorch binary files previously generated by MUSE (.pth files)
  • fastText binary files previously generated by fastText (.bin files)
  • text files (text file with one word embedding per line)

The first two options are very fast and can load 1 million embeddings in a few seconds, while loading text files can take a while.
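
For reference, the text files follow the usual fastText/word2vec .vec layout: a header line with the number of words and the embedding dimension, then one word followed by its vector per line. Below is a minimal loader sketch under that assumption (the helper name load_vec is ours, not part of MUSE).

import io
import numpy as np

def load_vec(path, max_vocab=200000):
    """Load a .vec text file into a (word2id dict, (n, dim) float32 matrix) pair."""
    word2id, vectors = {}, []
    with io.open(path, 'r', encoding='utf-8', newline='\n', errors='ignore') as f:
        n_words, dim = map(int, f.readline().split())  # header: vocab size, dimension
        for line in f:
            word, vec = line.rstrip().split(' ', 1)
            if word in word2id:
                continue                                # skip duplicated words
            word2id[word] = len(word2id)
            vectors.append(np.array(vec.split(), dtype='float32'))
            if len(word2id) == max_vocab:
                break
    return word2id, np.vstack(vectors)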

Download

We provide multilingual embeddings and ground-truth bilingual dictionaries. These embeddings are fastText embeddings that have been aligned in a common space.

Multilingual word Embeddings

We release fastText Wikipedia supervised word embeddings for 30 languages, aligned in a single vector space.

Aligned embeddings in text format are available for download for: Arabic, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian and Vietnamese.

You can visualize crosslingual nearest neighbors using demo.ipynb.
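
If you prefer a plain script to the notebook, the sketch below looks up cross-lingual nearest neighbors between two aligned embedding files, reusing the hypothetical load_vec helper sketched in the word embedding format section above; the file names are only examples.

import numpy as np

# load_vec is the helper sketched above; the file names are examples.
src_w2i, src_emb = load_vec('data/wiki.multi.en.vec')
tgt_w2i, tgt_emb = load_vec('data/wiki.multi.es.vec')
tgt_words = {i: w for w, i in tgt_w2i.items()}

# Normalize rows so that the dot product is the cosine similarity.
src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)
tgt_emb /= np.linalg.norm(tgt_emb, axis=1, keepdims=True)

def nearest_neighbors(word, k=5):
    scores = tgt_emb @ src_emb[src_w2i[word]]
    return [(tgt_words[i], float(scores[i])) for i in np.argsort(-scores)[:k]]

print(nearest_neighbors('cat'))  # e.g. Spanish neighbors of the English word "cat"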

Ground-truth bilingual dictionaries

We created 110 large-scale ground-truth bilingual dictionaries using an internal translation tool. The dictionaries handle the polysemy of words well. We provide train and test splits of 5000 and 1500 unique source words, as well as a larger set of up to 100k pairs. Our goal is to ease the development and evaluation of cross-lingual word embeddings and multilingual NLP.
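
The dictionary files themselves are plain text with one whitespace-separated source/target word pair per line (a source word can appear on several lines because of polysemy). Below is a minimal reading sketch under that assumption, with a hypothetical helper name and path.

from collections import defaultdict

def load_bilingual_dict(path):
    """Read a dictionary file with one 'source target' pair per line.

    Returns a mapping from each source word to the list of its translations
    (a source word may appear on several lines because of polysemy).
    """
    translations = defaultdict(list)
    with open(path, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue
            src, tgt = line.rstrip('\n').split(None, 1)  # split on whitespace
            translations[src].append(tgt)
    return translations

# Usage sketch (hypothetical path):
# en_es = load_bilingual_dict('data/crosslingual/dictionaries/en-es.txt')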

European languages in every direction

For every ordered pair among German, English, Spanish, French, Italian and Portuguese, we provide the full dictionary as well as its train and test splits.

Other languages to English (e.g. {fr,es}-en)

Dictionaries to English (full, train and test splits) are available for: Afrikaans, Albanian, Arabic, Bengali, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian and Vietnamese.

English to other languages (e.g. en-{fr,es})

Dictionaries from English (full, train and test splits) are available for the same set of languages: Afrikaans, Albanian, Arabic, Bengali, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian and Vietnamese.

References

Please cite [1] if you found the resources in this repository useful.

Word Translation Without Parallel Data

[1] A. Conneau*, G. Lample*, L. Denoyer, MA. Ranzato, H. Jégou, Word Translation Without Parallel Data

* Equal contribution. Order has been determined with a coin flip.

@article{conneau2017word,
  title={Word Translation Without Parallel Data},
  author={Conneau, Alexis and Lample, Guillaume and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1710.04087},
  year={2017}
}

MUSE is the project that originated the work on unsupervised machine translation using monolingual data only [2].

Unsupervised Machine Translation Using Monolingual Corpora Only

[2] G. Lample, A. Conneau, L. Denoyer, MA. Ranzato, Unsupervised Machine Translation Using Monolingual Corpora Only

@article{lample2017unsupervised,
  title={Unsupervised Machine Translation Using Monolingual Corpora Only},
  author={Lample, Guillaume and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1711.00043},
  year={2017}
}

Related work

Contact: [email protected] [email protected]

muse's People

Contributors

aconneau, glample, justin1904, louismartin, mimipaskova, prabhakar267, sufuf3


muse's Issues

ValueError: result of slicing is an empty tensor

I was trying to run the unsupervised mapping task and got this error:

INFO - 04/15/18 03:50:23 - 0:00:06 - 9 source words - csls_knn_10 - Precision at k = 10: 0.000000
Traceback (most recent call last):
  File "unsupervised.py", line 136, in <module>
    evaluator.all_eval(to_log)
  File "/home/nghibui/codes/MUSE/src/evaluation/evaluator.py", line 192, in all_eval
    self.dist_mean_cosine(to_log)
  File "/home/nghibui/codes/MUSE/src/evaluation/evaluator.py", line 172, in dist_mean_cosine
    s2t_candidates = get_candidates(src_emb, tgt_emb, _params)
  File "/home/nghibui/codes/MUSE/src/dico_builder.py", line 38, in get_candidates
    scores = emb2.mm(emb1[i:min(n_src, i + bs)].transpose(0, 1)).transpose(0, 1)
ValueError: result of slicing is an empty tensor
 

I have no idea why this happened. Any explanation?
I'm doing the mapping task on a language pair that is not in the available list, so I don't have a dictionary for evaluation. Also, the vocabulary for each language is quite small, around 4,000 words each. I guess that if I can remove the evaluation tasks in evaluator.py, the code will work.

cannot eval on cross-lingual word similarity task

Thanks for this wonderful project!

I found that I cannot evaluate on the cross-lingual word similarity task (i.e., the SemEval-2017 task).

  1. in get_evaluation.sh, the eval data are crosslingual/wordsim/$lg_pair-SEMEVAL17.txt

    paste $fdir/data/$lg_pair.test.data.txt $fdir/keys/$lg_pair.test.gold.txt > crosslingual/wordsim/$lg_pair-SEMEVAL17.txt

  2. in src/evaluation/wordsim.py, the eval files are expected to look like $lg_pair/SEMEVAL17.txt

dirpath = os.path.join(SEMEVAL17_EVAL_PATH, '%s-%s' % (lang1, lang2))
if not os.path.isdir(dirpath):
    return None
scores = {}
separator = "=" * (30 + 1 + 10 + 1 + 13 + 1 + 12)
pattern = "%30s %10s %13s %12s"
logger.info(separator)
logger.info(pattern % ("Dataset", "Found", "Not found", "Rho"))
logger.info(separator)
for filename in list(os.listdir(dirpath)):
    if 'SEMEVAL17' not in filename:
        continue
    filepath = os.path.join(dirpath, filename)

Out-of-Vocabulary words in released models

Hey,

is there any token for OOV words in the released models? If I understand correctly, the projection works with a finite set of word embeddings, not with the fastText model itself. So there is no way to use the original fastText approach to deal with OOV words, right?

How to align word embeddings of 3 languages?

Thanks for releasing supervised word embeddings for 30 languages aligned in a single vector space. My question is: how would you align 3 languages? For example, how would you align French, German and English in a single vector space? Your examples only seem to show how to align two languages. Thanks in advance!

AssertionError - in get_word_translation_accuracy dico = load_dictionary(path, word2id1, word2id2)

I was trying to build cross-lingual word embeddings for Malayalam and Hindi.

Environment : Ubuntu 16, 8CPUs/52GB RAM, Tesla K80, Google Cloud, CUDA 8, Python 3.6, Faiss not installed

This is what I did,

curl -Lo data/wiki.ml.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.ml.vec
curl -Lo data/wiki.hi.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.hi.vec

Then

python unsupervised.py --src_lang ml --tgt_lang hi --src_emb ../data/wiki.ml.vec --tgt_emb ../data/wiki.hi.vec

After running it for around 10 minutes, I got this error:

INFO - 12/27/17 13:10:56 - 0:10:37 - 988000 - Discriminator loss: 0.4106 - 3290 samples/s
INFO - 12/27/17 13:10:58 - 0:10:39 - 992000 - Discriminator loss: 0.4109 - 3339 samples/s
INFO - 12/27/17 13:11:00 - 0:10:42 - 996000 - Discriminator loss: 0.4110 - 3344 samples/s
Traceback (most recent call last):
  File "unsupervised.py", line 135, in <module>
    evaluator.all_eval(to_log)
  File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/evaluator.py", line 190, in all_eval
    self.word_translation(to_log)
  File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/evaluator.py", line 94, in word_translation
    method=method
  File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/word_translation.py", line 88, in get_word_translation_accuracy
    dico = load_dictionary(path, word2id1, word2id2)
  File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/word_translation.py", line 48, in load_dictionary
    assert os.path.isfile(path)
AssertionError

CSLS calculation understanding

Hello, in the paper CSLS is calculated by the following equation:
CSLS(Wx_s, y_t) = 2 cos(Wx_s, y_t) - r_T(Wx_s) - r_S(y_t)
Since the three terms are all pairwise cosine similarities (or mean k-nearest-neighbor similarities), I would expect each tensor to have size [dictionary_size, 1]. However, when I look at the code, I found that cos(Wx, y) has size [128, dictionary_size], and average_dist1[i:min(n_src, i + bs)][:, None] + average_dist2[None, :] also has size [128, dictionary_size]. I wonder what this 128 means, or whether I have misunderstood any details? Thank you!

Is it possible to anchor the source embeddings?

Hi,

After the alignment I can see that the source embeddings have also changed; both spaces have been "moved". Is there any way to anchor the source space and modify only the target?

Running with unknown languages (how to disable eval)

Hi,

I'm running the unsupervised alignment network on two sets of embeddings, one of which is in an undeciphered language.

I don't really care about the part of the training code that does evaluation against built-in dictionaries, since that isn't really well-defined for my application. Thus, I've tried running unsupervised.py with the default values (es-en) even though my embeddings are in Latin and an unknown language.

This works for the first epoch, but then it gives me the following error message after the epoch finishes and it tries to enter the evaluation code:
Traceback (most recent call last):
  File "unsupervised.py", line 137, in <module>
    evaluator.all_eval(to_log)
  File "/gpfs/loomis/home.grace/fas/frank/wcm24/voynich2vec/MUSE/src/evaluation/evaluator.py", line 190, in all_eval
    self.word_translation(to_log)
  File "/gpfs/loomis/home.grace/fas/frank/wcm24/voynich2vec/MUSE/src/evaluation/evaluator.py", line 94, in word_translation
    method=method
  File "/gpfs/loomis/home.grace/fas/frank/wcm24/voynich2vec/MUSE/src/evaluation/word_translation.py", line 92, in get_word_translation_accuracy
    assert dico[:, 0].max() < emb1.size(0)
IndexError: trying to index 2 dimensions of a 0 dimensional tensor

Any idea what could be going wrong? Or how I could just disable the part of the evaluation that is causing these issues? I have tried commenting out some lines in the code, but this always leaves me with other errors. I would prefer a cleaner solution.

Thanks!

DIC_EVAL_PATH in word_translation sticks in the default path

Wow, I was just working with this and you have changed so much of the README and now actually provide embeddings for 30 languages in one space!
Fantastic, Thank you!

I was about to ask you how it would be possible to have more than 2 languages in one space, and how accurate that still is.

One problem I encountered is that DIC_EVAL_PATH in src/evaluation/word_translation.py cannot be changed with an argument, so it always looks in the default path...
Maybe add another argument for that path, or just one path parameter that looks for train.txt and eval.txt or src_lang-tgt_lang-train.txt, ...

How to use the aligned vectors?

Is it possible to provide some examples of how to use the aligned monolingual vectors (unsupervised) to generate translations? I have got the aligned vectors for Malayalam-Hindi; it would be helpful if I could get an example of how to proceed further.

Average time to align monolingual word embeddings: the supervised way?

I am aligning English and Hindi fastText monolingual embeddings using the supervised way on a GPU. Are there any time estimates as to how long it takes? It's been 4 hours, and it is still in the first refinement step.

I ran the following command:

python supervised.py --src_lang en --tgt_lang hi --src_emb wiki.en.vec --tgt_emb wiki.hi.vec --n_iter 5 --dico_train default

Update: it was running for close to 20 hours on a GeForce GTX 1080, constantly hogging 1 CPU core, but no entries were added to the log. I am running it again.

Log:

INFO - 12/27/17 17:57:14 - 0:00:00 - ============ Initialized logger ============
INFO - 12/27/17 17:57:14 - 0:00:00 - cuda: True
                                     dico_build: S2T&T2S
                                     dico_max_rank: 10000
                                     dico_max_size: 0
                                     dico_method: csls_knn_10
                                     dico_min_size: 0
                                     dico_threshold: 0
                                     dico_train: default
                                     emb_dim: 300
                                     exp_path: /MUSE/dumped/hidden
                                     export: True
                                     max_vocab: 200000
                                     n_iters: 5
                                     normalize_embeddings: 
                                     seed: -1
                                     src_emb: wiki.en.vec
                                     src_lang: en
                                     tgt_emb: wiki.hi.vec
                                     tgt_lang: hi
                                     verbose: 2
INFO - 12/27/17 17:57:14 - 0:00:00 - The experiment will be stored in hidden/MUSE/dumped/hidden
INFO - 12/27/17 17:57:25 - 0:00:11 - Loaded 200000 pre-trained word embeddings
INFO - 12/27/17 17:57:45 - 0:00:31 - Loaded 158016 pre-trained word embeddings
INFO - 12/27/17 17:57:49 - 0:00:34 - Found 8704 pairs of words in the dictionary (4998 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 12/27/17 17:57:49 - 0:00:34 - Starting refinement iteration 0...
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 -                        Dataset      Found     Not found          Rho
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_MTurk-771        771             0       0.6689
INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_MTurk-287        286             1       0.6773
INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_SIMLEX-999        998             1       0.3823
INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_WS-353-REL        252             0       0.6820
INFO - 12/27/17 17:57:49 - 0:00:35 -                 EN_RW-STANFORD       1323           711       0.5080
INFO - 12/27/17 17:57:49 - 0:00:35 -                       EN_MC-30         30             0       0.8123
INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_WS-353-ALL        353             0       0.7388
INFO - 12/27/17 17:57:49 - 0:00:35 -                    EN_VERB-143        144             0       0.3973
INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_MEN-TR-3k       3000             0       0.7637
INFO - 12/27/17 17:57:49 - 0:00:35 -                      EN_YP-130        130             0       0.5333
INFO - 12/27/17 17:57:49 - 0:00:35 -                       EN_RG-65         65             0       0.7974
INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_SEMEVAL17        379             9       0.7216
INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_WS-353-SIM        203             0       0.7811
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 - Monolingual source word similarity score average: 0.65108
INFO - 12/27/17 17:57:49 - 0:00:35 - Found 2032 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 12/27/17 17:57:50 - 0:00:36 - 1500 source words - nn - Precision at k = 1: 23.800000
INFO - 12/27/17 17:57:51 - 0:00:36 - 1500 source words - nn - Precision at k = 5: 41.133333
INFO - 12/27/17 17:57:51 - 0:00:37 - 1500 source words - nn - Precision at k = 10: 48.133333
INFO - 12/27/17 17:57:51 - 0:00:37 - Found 2032 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)

The usage of embeddings during evaluation

Hi, I have a quick question about embeddings used during the evaluation.
In README.md, does the cross-lingual evaluation command python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000 use the pretrained embeddings, or the embeddings that were first normalized and then mapped from the source to the target space, as processed in trainer.export()?

It seems that the monolingual evaluation uses the pretrained embeddings in data/wiki.en.vec, which are different from data/wiki.en-es.en.vec. If the cross-lingual evaluation uses the exported embeddings, why does the code src_emb = self.mapping(self.src_emb.weight).data.cpu().numpy() in evaluator.py apply the mapping once again?
Am I getting something wrong?

Using --dico_train identical_char still needs dictionaries

according to the docs

when set to "identical_char" it will use identical character strings between source and target languages to form a vocabulary.

I understood that the dictionary was going to be created from the given corpora.

Assertion Error in unsupervised training

INFO - 03/01/18 16:49:10 - 0:00:00 - The experiment will be stored in /usr2/home/nkvyas/Bayesian/MUSE/dumped/6uymzfdoug
Traceback (most recent call last):
  File "unsupervised.py", line 91, in <module>
    src_emb, tgt_emb, mapping, discriminator = build_model(params, True)
  File "/usr2/home/nkvyas/Bayesian/MUSE/src/models.py", line 46, in build_model
    src_dico, _src_emb = load_external_embeddings(params, source=True)
  File "/usr2/home/nkvyas/Bayesian/MUSE/src/utils.py", line 282, in load_external_embeddings
    assert len(split) == 2
AssertionError

When I tried supervised.py, I got "ValueError: The input must have at least 3 entries!"

Hello, I used Docker to build an environment containing conda, PyTorch and Faiss with Python 3.6, and I was finally able to run this amazing open-source project.

But when I tried the following command to test the supervised method:

python3 supervised.py --src_lang en --tgt_lang es --src_emb ../wiki.en.vec --tgt_emb ../Spanish_wiki.es.vec --n_iter 5 --dico_train identical_char --cuda False

It returned a ValueError saying "The input must have at least 3 entries!".

Here are the logs:

Failed to load GPU Faiss: No module named 'swigfaiss_gpu'
Faiss falling back to CPU-only.
Impossible to import Faiss-GPU. Switching to FAISS-CPU, this will be slower.

INFO - 01/06/18 12:20:38 - 0:00:00 - ============ Initialized logger ============
INFO - 01/06/18 12:20:38 - 0:00:00 - cuda: False
dico_build: S2T&T2S
dico_max_rank: 10000
dico_max_size: 0
dico_method: csls_knn_10
dico_min_size: 0
dico_threshold: 0
dico_train: identical_char
emb_dim: 300
exp_path: /Documents/MUSE-master/dumped/nrthsd26ay
export: True
max_vocab: 200000
n_iters: 5
normalize_embeddings:
seed: -1
src_emb: ../wiki.en.vec
src_lang: en
tgt_emb: ../Spanish_wiki.es.vec
tgt_lang: es
verbose: 2
INFO - 01/06/18 12:20:38 - 0:00:00 - The experiment will be stored in /Documents/MUSE-master/dumped/nrthsd26ay
INFO - 01/06/18 12:20:48 - 0:00:10 - Loaded 200000 pre-trained word embeddings
INFO - 01/06/18 12:21:02 - 0:00:24 - Loaded 200000 pre-trained word embeddings
INFO - 01/06/18 12:21:04 - 0:00:26 - Found 85912 pairs of identical character strings.
INFO - 01/06/18 12:21:05 - 0:00:26 - Starting refinement iteration 0...
INFO - 01/06/18 12:22:14 - 0:01:36 - ====================================================================
INFO - 01/06/18 12:22:14 - 0:01:36 - Dataset Found Not found Rho
INFO - 01/06/18 12:22:14 - 0:01:36 - ====================================================================
here:
n--> 771
m--> 771
Traceback (most recent call last):
  File "supervised.py", line 92, in <module>
    evaluator.all_eval(to_log)
  File "/Documents/MUSE-master/src/evaluation/evaluator.py", line 188, in all_eval
    self.monolingual_wordsim(to_log)
  File "/Documents/MUSE-master/src/evaluation/evaluator.py", line 43, in monolingual_wordsim
    self.mapping(self.src_emb.weight).data.cpu().numpy()
  File "/Documents/MUSE-master/src/evaluation/wordsim.py", line 104, in get_wordsim_scores
    coeff, found, not_found = get_spearman_rho(word2id, embeddings, filepath, lower)
  File "/Documents/MUSE-master/src/evaluation/wordsim.py", line 83, in get_spearman_rho
    return spearmanr(gold, pred).correlation, len(gold), not_found
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/scipy/stats/stats.py", line 3301, in spearmanr
    rho, pval = mstats_basic.spearmanr(a, b, axis)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/scipy/stats/mstats_basic.py", line 461, in spearmanr
    raise ValueError("The input must have at least 3 entries!")
ValueError: The input must have at least 3 entries!

Does anyone have any ideas about this problem? Thanks ^^

Is translation dictionary required in unsupervised alignment?

Hi, I was trying to train an alignment of embeddings from German to Japanese. Since we don't have a translation dictionary from German to Japanese, I was using unsupervised alignment, assuming no German2Japanese dictionary is required. However, during training, "unsupervised.py" calls the functions "evaluator.all_eval" and "self.word_translation" from "evaluator.py" and "word_translation.py" when it tries to do embeddings / discriminator evaluation, and the program stopped because no German2Japanese dictionary is found. I wonder whether in this case a translation dictionary is still required to perform unsupervised alignment, or is there any parameter I need to change to avoid this?

Questions about how to make use of the word embeddings

Hi, @glample
Here is the output during training, and I found these files (best_mapping.t7, params.pkl, train.log, vectors-latin.txt, vectors-zh.txt) in the dumped folder. How can I make use of these word embeddings? Right now, I am just taking a word's vector from the source word embedding file and finding the top 10 words most similar to this vector in the target word embedding file. Is this the right way? The outputs below show there are two methods, namely nn and csls_knn_10; is it necessary to care about these two methods? Thank you!

==============================================
OrderedDict([('n_iter', 149)])
nn
path:data/crosslingual/dictionaries/zh-latin.5000-6500.txt
INFO - 01/24/18 09:18:02 - 1:28:43 - Found 8552 pairs of words in the dictionary (8552 unique). 9297 other pairs contained at least one unknown word (4953 in lang1, 6954 in lang2)
INFO - 01/24/18 09:18:02 - 1:28:43 - 8552 source words - nn - Precision at k = 1: 10.289991
INFO - 01/24/18 09:18:02 - 1:28:43 - 8552 source words - nn - Precision at k = 5: 17.516370
INFO - 01/24/18 09:18:02 - 1:28:44 - 8552 source words - nn - Precision at k = 10: 21.246492
csls_knn_10
path:data/crosslingual/dictionaries/zh-latin.5000-6500.txt
INFO - 01/24/18 09:18:02 - 1:28:44 - Found 8552 pairs of words in the dictionary (8552 unique). 9297 other pairs contained at least one unknown word (4953 in lang1, 6954 in lang2)
INFO - 01/24/18 09:18:05 - 1:28:46 - 8552 source words - csls_knn_10 - Precision at k = 1: 9.985968
INFO - 01/24/18 09:18:05 - 1:28:46 - 8552 source words - csls_knn_10 - Precision at k = 5: 16.300281
INFO - 01/24/18 09:18:05 - 1:28:47 - 8552 source words - csls_knn_10 - Precision at k = 10: 20.053789
INFO - 01/24/18 09:18:10 - 1:28:51 - Building the train dictionary ...
INFO - 01/24/18 09:18:10 - 1:28:51 - New train dictionary of 7721 pairs.
INFO - 01/24/18 09:18:10 - 1:28:51 - Mean cosine (nn method, S2T build, 10000 max size): 0.49470
INFO - 01/24/18 09:18:24 - 1:29:05 - Building the train dictionary ...
INFO - 01/24/18 09:18:24 - 1:29:05 - New train dictionary of 6263 pairs.
INFO - 01/24/18 09:18:24 - 1:29:05 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.50656
INFO - 01/24/18 09:18:24 - 1:29:05 - log:{"n_iter": 149, "precision_at_1-nn": 10.28999064546305, "precision_at_5-nn": 17.516370439663238, "precision_at_10-nn": 21.246492048643592, "precision_at_1-csls_knn_10": 9.985968194574369, "precision_at_5-csls_knn_10": 16.300280636108514, "precision_at_10-csls_knn_10": 20.053788587464922, "mean_cosine-nn-S2T-10000": 0.49470236897468567, "mean_cosine-csls_knn_10-S2T-10000": 0.5065571665763855}
INFO - 01/24/18 09:18:24 - 1:29:05 - End of refinement iteration 149.

Reproducing the EN-ZH results in Table 1

Hi,

I tried training MUSE in the unsupervised way with the pretrained fasttext Wikipedia embeddings. On some European language pairs, such as EN-DE or EN-ES, I was able to get reasonable performance using the default parameters.
However, for EN-ZH or ZH-EN with the default parameters, the cross-lingual word similarity scores are always 0 (even for top 10).

As a comparison, to rule out problems with the data, I ran the supervised setting for EN-ZH, and it gave non-zero performance (though the number is a few points lower than that in the paper).

Any idea of what I might have done wrong?
Thank you.

About Multilingual word embeddings

Hi,

I downloaded from the Multilingual word Embeddings, embeddings for english and spanish.

If I understand correctly, the embeddings of similar words (like "good" and "bueno") should be close. However, when I calculate the cosine similarity between their embeddings I get a small value, around 0.15, and for distant words (like "bad" and "bueno") I get the same value.

Should the aligned embeddings have high cosine similarity? Am I missing something?

Error in dico_builder.py, ValueError: result of slicing is an empty tensor

Hi, I was trying your method on the unsupervised setting, with en-fr language pair. I trained my embedding model using fastText on newstest 2014. Dictionary sizes of en and fr are 1962 and 2018.
I downloaded your ground-truth dictionary of en-fr full set, and put it on /MUSE/data/crosslingual/dictionaries/, and renamed it as "en-fr.5000-6500.txt".

I used the following command:
python unsupervised.py --src_lang en --tgt_lang fr --src_emb ../fastText-0.1.0/myemb/en.tok.vec --tgt_emb ../fastText-0.1.0/myemb/fr.tok.vec --dis_most_frequent 1000

And then I got the following logs and error. Could you please help me with that? Thanks!

===========================================================

INFO - 03/22/18 03:15:36 - 0:07:24 - 996000 - Discriminator loss: 1.5441 - 4455 samples/s
INFO - 03/22/18 03:15:38 - 0:07:25 - Found 1335 pairs of words in the dictionary (1014 unique) 111951 other pairs contained at least one unknown word (109965 in lang1, 110720 in lang2)
INFO - 03/22/18 03:15:38 - 0:07:25 - 1014 source words - nn - Precision at k = 1: 0.098619
INFO - 03/22/18 03:15:38 - 0:07:26 - 1014 source words - nn - Precision at k = 5: 0.295858
INFO - 03/22/18 03:15:38 - 0:07:26 - 1014 source words - nn - Precision at k = 10: 0.493097
INFO - 03/22/18 03:15:39 - 0:07:26 - Found 1335 pairs of words in the dictionary (1014 unique) 111951 other pairs contained at least one unknown word (109965 in lang1, 110720 in lang2)
INFO - 03/22/18 03:15:39 - 0:07:26 - 1014 source words - csls_knn_10 - Precision at k = 1: 0.00000
INFO - 03/22/18 03:15:39 - 0:07:26 - 1014 source words - csls_knn_10 - Precision at k = 5: 0.43097
INFO - 03/22/18 03:15:39 - 0:07:26 - 1014 source words - csls_knn_10 - Precision at k = 10: 0.91716
Traceback (most recent call last):
  File "unsupervised.py", line 135, in <module>
    evaluator.all_eval(to_log)
  File "/var/storage/shared/pnrsy/sys/jobs/application_1513627406021_70260/MUSE/src/evaluation/evaluator.py", line 192, in all_eval
    self.dist_mean_cosine(to_log)
  File "/var/storage/shared/pnrsy/sys/jobs/application_1513627406021_70260/MUSE/src/evaluation/evaluator.py", line 172, in dist_mean_cosine
    s2t_candidates = get_candidates(src_emb, tgt_emb, _params)
  File "/var/storage/shared/pnrsy/sys/jobs/application_1513627406021_70260/MUSE/src/dico_builder.py", line 38, in get_candidates
    scores = emb2.mm(emb1[i:min(n_src, i + bs)].transpose(0, 1)).transpose(0, 1)
ValueError: result of slicing is an empty tensor

Unknown Error

INFO - 05/01/18 22:03:49 - 0:00:07 - Found 3896 pairs of words in the dictionary (1919 unique). 3836 other pairs contained at least one unknown word (1388 in lang1, 3089 in lang2)
INFO - 05/01/18 22:03:49 - 0:00:08 - 1919 source words - nn - Precision at k = 1: 9.692548
INFO - 05/01/18 22:03:49 - 0:00:08 - 1919 source words - nn - Precision at k = 5: 22.094841
INFO - 05/01/18 22:03:50 - 0:00:08 - 1919 source words - nn - Precision at k = 10: 29.233976
INFO - 05/01/18 22:03:50 - 0:00:08 - Found 3896 pairs of words in the dictionary (1919 unique). 3836 other pairs contained at least one unknown word (1388 in lang1, 3089 in lang2)
INFO - 05/01/18 22:03:50 - 0:00:09 - 1919 source words - csls_knn_10 - Precision at k = 1: 9.015112
INFO - 05/01/18 22:03:51 - 0:00:09 - 1919 source words - csls_knn_10 - Precision at k = 5: 19.124544
INFO - 05/01/18 22:03:51 - 0:00:09 - 1919 source words - csls_knn_10 - Precision at k = 10: 24.856696
Traceback (most recent call last):
  File "supervised.py", line 98, in <module>
    evaluator.all_eval(to_log)
  File "/home/jack/software/MUSE/src/evaluation/evaluator.py", line 217, in all_eval
    self.dist_mean_cosine(to_log)
  File "/home/jack/software/MUSE/src/evaluation/evaluator.py", line 197, in dist_mean_cosine
    s2t_candidates = get_candidates(src_emb, tgt_emb, _params)
  File "/home/jack/software/MUSE/src/dico_builder.py", line 38, in get_candidates
    scores = emb2.mm(emb1[i:min(n_src, i + bs)].transpose(0, 1)).transpose(0, 1)
ValueError: result of slicing is an empty tensor

I have specified --dico_max_rank 4000 --dico_method csls_knn_10, but the message coming from /home/jack/software/MUSE/src/dico_builder.py, line 38, still refers to 10000 nn.

conda install faiss-cpu -c pytorch - Not working

I am getting the error below. Any idea?
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

faiss-cpu
Current channels:

https://conda.anaconda.org/pytorch/osx-64
https://conda.anaconda.org/pytorch/noarch
https://repo.continuum.io/pkgs/main/osx-64
https://repo.continuum.io/pkgs/main/noarch
https://repo.continuum.io/pkgs/free/osx-64
https://repo.continuum.io/pkgs/free/noarch
https://repo.continuum.io/pkgs/r/osx-64
https://repo.continuum.io/pkgs/r/noarch
https://repo.continuum.io/pkgs/pro/osx-64
https://repo.continuum.io/pkgs/pro/noarch
INFO: deactivate_clangxx_osx-64.sh made the following environmental changes:
-CLANGXX=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-clang++
-CXX=x86_64-apple-darwin13.4.0-clang++
-CXXFLAGS=-march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++ -fvisibility-inlines-hidden -std=c++14 -fmessage-length=0
-DEBUG_CXXFLAGS=-Og -g -Wall -Wextra -fcheck=all -fbacktrace -fimplicit-none
INFO: deactivate_clang_osx-64.sh made the following environmental changes:
-AR=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-ar
-AS=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-as
-CC=x86_64-apple-darwin13.4.0-clang
-CFLAGS=-march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe
-CHECKSYMS=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-checksyms
-CLANG=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-clang
-CODESIGN_ALLOCATE=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-codesign_allocate
-CPPFLAGS=-D_FORTIFY_SOURCE=2 -mmacosx-version-min=10.9
-DEBUG_CFLAGS=-Og -g -Wall -Wextra -fcheck=all -fbacktrace -fimplicit-none
-INDR=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-indr
-INSTALL_NAME_TOOL=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-install_name_tool
-LD=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-ld
-LDFLAGS=-pie -headerpad_max_install_names
-LDFLAGS_CC=-Wl,-pie -Wl,-headerpad_max_install_names
-LIBTOOL=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-libtool
-LIPO=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-lipo
-NM=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-nm
-NMEDIT=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-nmedit
-OTOOL=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-otool
-PAGESTUFF=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-pagestuff
-RANLIB=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-ranlib
-REDO_PREBINDING=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-redo_prebinding
-SEGEDIT=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-segedit
-SEG_ADDR_TABLE=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-seg_addr_table
-SEG_HACK=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-seg_hack
-SIZE=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-size
-STRINGS=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-strings
-STRIP=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-strip
INFO: activate_clang_osx-64.sh made the following environmental changes:
+AR=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-ar
+AS=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-as
+CC=x86_64-apple-darwin13.4.0-clang
+CFLAGS=-march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe
+CHECKSYMS=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-checksyms
+CLANG=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-clang
+CODESIGN_ALLOCATE=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-codesign_allocate
+CPPFLAGS=-D_FORTIFY_SOURCE=2 -mmacosx-version-min=10.9
+DEBUG_CFLAGS=-Og -g -Wall -Wextra -fcheck=all -fbacktrace -fimplicit-none
+INDR=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-indr
+INSTALL_NAME_TOOL=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-install_name_tool
+LD=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-ld
+LDFLAGS=-pie -headerpad_max_install_names
+LDFLAGS_CC=-Wl,-pie -Wl,-headerpad_max_install_names
+LIBTOOL=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-libtool
+LIPO=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-lipo
+NM=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-nm
+NMEDIT=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-nmedit
+OTOOL=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-otool
+PAGESTUFF=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-pagestuff
+RANLIB=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-ranlib
+REDO_PREBINDING=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-redo_prebinding
+SEGEDIT=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-segedit
+SEG_ADDR_TABLE=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-seg_addr_table
+SEG_HACK=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-seg_hack
+SIZE=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-size
+STRINGS=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-strings
+STRIP=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-strip
INFO: activate_clangxx_osx-64.sh made the following environmental changes:
+CLANGXX=/Users//anaconda3/envs/anacondav1/bin/x86_64-apple-darwin13.4.0-clang++
+CXX=x86_64-apple-darwin13.4.0-clang++
+CXXFLAGS=-march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++ -fvisibility-inlines-hidden -std=c++14 -fmessage-length=0
+DEBUG_CXXFLAGS=-Og -g -Wall -Wextra -fcheck=all -fbacktrace -fimplicit-none

How to select the 5000/1500 words when building the dictionaries?

Hi, I was wondering how the 5000+ pairs and 1500+ pairs were selected to build the training/testing dictionaries. As the full dictionary can contain 100K+ pairs, do we just take the most frequent words? I understand the pre-defined dictionary is only used in the first iteration of supervised training, but how much will the initial selection of translation pairs affect the alignment performance? Another question: why 5000? Will it help to include more translation pairs in the training dictionary? Thanks in advance!

Why do we get different embeddings for target language?

According to the paper, the only trainable parameter is W, which is used to map the source embeddings into the target space. So why are the target embeddings written to a separate file, and why are they apparently different from the initial target embeddings?

AssertionError with unsupervised.py

Hi all,

not sure if this is an error on my end or not, but as I couldn't debug it myself I thought I would post it here. Same happens with both py2 and py3.

Using:
numpy (1.13.3)
torch (0.3.0.post4)
torchvision (0.2.0)

No Faiss installed.

When I run:

$CUDA_VISIBLE_DEVICES=2,3 /home/sevajuri/anaconda3/envs/torch3/bin/python unsupervised.py --src_lang en --tgt_lang fr --src_emb wiki.en.vec --tgt_emb wiki.fr.vec --max_vocab=1000000

I get (after the first epoch):

Traceback (most recent call last):
  File "unsupervised.py", line 135, in <module>
    evaluator.all_eval(to_log)
  File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/evaluator.py", line 190, in all_eval
    self.word_translation(to_log)
  File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/evaluator.py", line 94, in word_translation
    method=method
  File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/word_translation.py", line 88, in get_word_translation_accuracy
    dico = load_dictionary(path, word2id1, word2id2)
  File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/word_translation.py", line 48, in load_dictionary
    assert os.path.isfile(path)
AssertionError

Runtime error on training

While training embeddings with unsupervised.py, changing the default flag --dis_most_frequent 75000 to --dis_most_frequent 99999, I got the following error:

Traceback (most recent call last):
  File "unsupervised.py", line 135, in <module>
    evaluator.all_eval(to_log)
  File "/data/sls/qcri/asr/sjoty/sbmaruf/MUSE/MUSE/src/evaluation/evaluator.py", line 192, in all_eval
    self.dist_mean_cosine(to_log)
  File "/data/sls/qcri/asr/sjoty/sbmaruf/MUSE/MUSE/src/evaluation/evaluator.py", line 173, in dist_mean_cosine
    t2s_candidates = get_candidates(tgt_emb, src_emb, _params)
  File "/data/sls/qcri/asr/sjoty/sbmaruf/MUSE/MUSE/src/dico_builder.py", line 125, in get_candidates
    all_scores = all_scores[:params.dico_max_size]
RuntimeError: invalid argument 2: dimension 0 out of range of 0D tensor at /opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/TH/generic/THTensor.c:24

Error with binary fasttext embeddings

I trained fastText embeddings, all basic commands:
./fasttext cbow -input cleaned.et -thread 16 -dim 300 -output mdl.skip.et
./fasttext cbow -input cleaned.en -thread 16 -dim 300 -output mdl.skip.en

Now I wanted to train supervised MUSE, code:
python unsupervised.py --src_lang et --tgt_lang en --src_emb mdl.skip.et.bin --tgt_emb mdl.skip.en.bin --export pth --exp_name eten-skip-300 --emb_dim 300

I get the following stacktrace:
INFO - 04/13/18 09:24:37 - 0:00:00 - The experiment will be stored in /gpfs/hpchome/b02166/thesis/upd_muse/MUSE/dumped/eten-skip-300/c346kmebk8
Traceback (most recent call last):
  File "supervised.py", line 69, in <module>
    src_emb, tgt_emb, mapping, _ = build_model(params, False)
  File "/gpfs/hpchome/b02166/thesis/upd_muse/MUSE/src/models.py", line 46, in build_model
    src_dico, _src_emb = load_embeddings(params, source=True)
  File "/gpfs/hpchome/b02166/thesis/upd_muse/MUSE/src/utils.py", line 402, in load_embeddings
    return load_bin_embeddings(params, source, full_vocab)
  File "/gpfs/hpchome/b02166/thesis/upd_muse/MUSE/src/utils.py", line 369, in load_bin_embeddings
    assert embeddings.size() == (len(words), params.emb_dim)
AssertionError

So there is an AssertionError in utils.py, in load_bin_embeddings, at the assertion assert embeddings.size() == (len(words), params.emb_dim). What am I doing wrong here?

Why use adversarial training in the original paper?

Hi, this question is not related to the code implementation itself but rather to a detail of the original paper; since I don't know if there is a better place to ask, I am putting the question here.

Since, from the orthogonal Procrustes problem, we know that W can be computed in closed form from the SVD of (XY-transpose), why not just compute the SVD of (XY-transpose) instead of introducing the adversarial training approach?

Documentation

I am trying to create my own bilingual word embeddings and I am finding it difficult to understand some things. I will try to describe my problem in detail in the hope of finding some answers.
I have a parallel corpus and I want to create bilingual word embeddings. I preprocessed both sides and created my word vectors with gensim (I wish I had used fastText, but again the docs were not clear). Now that I have both sets of word embeddings, my question is: if I don't have the same number of words in both models, does MUSE still work?

Why only First 200,000 vocab selected?

According to the code,
https://github.com/facebookresearch/MUSE/blob/master/src/utils.py#L293

unsupervised.py only takes the first 200,000 (by default) or a restricted number of vocabulary words from the embedding text file.

From the paper 'Word Translation Without Parallel Data', Section 3.2:

The embedding quality of rare words is generally not as good as the one of frequent words and we observed that feeding discriminator with rare words had a small but not negligible negative impact.

What are the parameters you have added to handle this issue?

  • params.max_vocab
  • params.dis_most_frequent

From the name, it seems params.dis_most_frequent handles this issue, but in that case, why do we also restrict the total vocabulary size with params.max_vocab?

Results obtained are different from those published in the paper

My settings are below, and the rest of the parameters remain at their defaults.
These word vectors and the zh-en dictionary were downloaded from the official site.

export SRC_EMB=/home/jack/dev1.8t/corpus/zh-en/wiki.zh.vec 
export TGT_EMB=/home/jack/dev1.8t/corpus/zh-en/wiki.en.vec
nohup python unsupervised.py --src_lang zh --tgt_lang en --src_emb $SRC_EMB --tgt_emb $TGT_EMB --cuda 1 --export 1 --exp_path ./dumped/unsuperv/zh-mn --emb_dim 300 --refinement true --adversarial true > zh-en-unsuper.log &

but the results are just 0s:

... contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 1: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 5: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 10: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 5: 0.000000
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 10: 0.000000

EN-EO dictionary

Hi,

It seems I could not find the EN-EO dictionary in the list of released dictionaries to reproduce Table 1 results. Are the Esperanto dictionaries released?

Thanks.

AssertionError

Hi, I am new to deep learning and I got the errors shown below, even though I have already installed Faiss, PyTorch, CUDA and Anaconda.

Failed to load GPU Faiss: No module named _swigfaiss_gpu
Faiss falling back to CPU-only.
Impossible to import Faiss library!! Switching to standard nearest neighbors search implementation, this will be significantly slower.

Traceback (most recent call last):
  File "unsupervised.py", line 80, in <module>
    assert not params.cuda or torch.cuda.is_available()
AssertionError

PyTorch 0.4.0 compatibility - Problem running supervised experiment with EN-ZH language pairs

Steps to reproduce

  1. obtained evaluation dataset with the bash script.
  2. obtained wiki.zh.vec and wiki.en.vec pretrained from fasttext.
python supervised.py --src_lang zh --tgt_lang en --src_emb /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.zh.vec --tgt_emb /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.en.vec --n_refinement 5

Expected

Training finishes successfully

Actual

TypeError is thrown

Logs

INFO - 04/26/18 08:42:05 - 0:00:00 - ============ Initialized logger ============
INFO - 04/26/18 08:42:05 - 0:00:00 - cuda: True
                                     dico_build: S2T&T2S
                                     dico_eval: default
                                     dico_max_rank: 10000
                                     dico_max_size: 0
                                     dico_method: csls_knn_10
                                     dico_min_size: 0
                                     dico_threshold: 0
                                     dico_train: default
                                     emb_dim: 300
                                     exp_id: 
                                     exp_name: debug
                                     exp_path: /u/xiamengx/src/nlpcc/MUSE/dumped/debug/ufwt0sef93
                                     export: txt
                                     max_vocab: 200000
                                     n_refinement: 5
                                     normalize_embeddings: 
                                     seed: -1
                                     src_emb: /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.zh.vec
                                     src_lang: zh
                                     tgt_emb: /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.en.vec
                                     tgt_lang: en
                                     verbose: 2
INFO - 04/26/18 08:42:05 - 0:00:00 - The experiment will be stored in /u/xiamengx/src/nlpcc/MUSE/dumped/debug/ufwt0sef93
INFO - 04/26/18 08:42:24 - 0:00:18 - Loaded 200000 pre-trained word embeddings.
INFO - 04/26/18 08:42:48 - 0:00:43 - Loaded 200000 pre-trained word embeddings.
INFO - 04/26/18 08:42:53 - 0:00:48 - Found 8891 pairs of words in the dictionary (5000 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/26/18 08:42:54 - 0:00:48 - Starting iteration 0...
INFO - 04/26/18 08:42:54 - 0:00:48 - ====================================================================
INFO - 04/26/18 08:42:54 - 0:00:48 -                        Dataset      Found     Not found          Rho
INFO - 04/26/18 08:42:54 - 0:00:48 - ====================================================================
INFO - 04/26/18 08:42:54 - 0:00:48 -                  EN_WS-353-REL        252             0       0.6820
INFO - 04/26/18 08:42:54 - 0:00:48 -                    EN_VERB-143        144             0       0.3973
INFO - 04/26/18 08:42:54 - 0:00:49 -                   EN_MEN-TR-3k       3000             0       0.7637
INFO - 04/26/18 08:42:54 - 0:00:49 -                  EN_WS-353-SIM        203             0       0.7811
INFO - 04/26/18 08:42:54 - 0:00:49 -                   EN_SEMEVAL17        379             9       0.7216
INFO - 04/26/18 08:42:54 - 0:00:49 -                  EN_SIMLEX-999        998             1       0.3823
INFO - 04/26/18 08:42:54 - 0:00:49 -                      EN_YP-130        130             0       0.5333
INFO - 04/26/18 08:42:54 - 0:00:49 -                  EN_WS-353-ALL        353             0       0.7388
INFO - 04/26/18 08:42:54 - 0:00:49 -                       EN_MC-30         30             0       0.8123
INFO - 04/26/18 08:42:54 - 0:00:49 -                   EN_MTurk-287        286             1       0.6773
INFO - 04/26/18 08:42:54 - 0:00:49 -                   EN_MTurk-771        771             0       0.6689
INFO - 04/26/18 08:42:54 - 0:00:49 -                       EN_RG-65         65             0       0.7974
INFO - 04/26/18 08:42:54 - 0:00:49 -                 EN_RW-STANFORD       1323           711       0.5080
INFO - 04/26/18 08:42:54 - 0:00:49 - ====================================================================
INFO - 04/26/18 08:42:54 - 0:00:49 - Monolingual target word similarity score average: 0.65108
INFO - 04/26/18 08:42:54 - 0:00:49 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/26/18 08:42:55 - 0:00:49 - 2483 source words - nn - Precision at k = 1: 5.598067
INFO - 04/26/18 08:42:55 - 0:00:50 - 2483 source words - nn - Precision at k = 5: 15.304068
INFO - 04/26/18 08:42:55 - 0:00:50 - 2483 source words - nn - Precision at k = 10: 21.546516
INFO - 04/26/18 08:42:55 - 0:00:50 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/26/18 08:43:08 - 0:01:02 - 2483 source words - csls_knn_10 - Precision at k = 1: 15.183246
INFO - 04/26/18 08:43:08 - 0:01:02 - 2483 source words - csls_knn_10 - Precision at k = 5: 34.071687
INFO - 04/26/18 08:43:08 - 0:01:03 - 2483 source words - csls_knn_10 - Precision at k = 10: 42.690294
INFO - 04/26/18 08:43:15 - 0:01:09 - Building the train dictionary ...
INFO - 04/26/18 08:43:15 - 0:01:09 - New train dictionary of 6665 pairs.
INFO - 04/26/18 08:43:15 - 0:01:09 - Mean cosine (nn method, S2T build, 10000 max size): 0.60218
INFO - 04/26/18 08:43:48 - 0:01:42 - Building the train dictionary ...
INFO - 04/26/18 08:43:48 - 0:01:42 - New train dictionary of 5529 pairs.
INFO - 04/26/18 08:43:48 - 0:01:42 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.59617
Traceback (most recent call last):
  File "supervised.py", line 101, in <module>
    logger.info("__log__:%s" % json.dumps(to_log))
  File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/__init__.py", line 230, in dumps
    return _default_encoder.encode(obj)
  File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/encoder.py", line 198, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/encoder.py", line 256, in iterencode
    return _iterencode(o, 0)
  File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/encoder.py", line 179, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: tensor(0.6022, device='cuda:0') is not JSON serializable

Environment

Python 3.5.5 :: Anaconda, Inc
>>> torch.__version__
'0.4.0'
MUSE commit: a620cc8aa394d4eb345ecdeda5067b0b1ef30a6a
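
A likely cause, for anyone hitting the same trace: from PyTorch 0.4 onward, reductions such as `.mean()` return 0-dim tensors rather than Python floats, and `json.dumps` cannot serialize them. A minimal sketch of a workaround (not the official fix; the dictionary key below is made up for illustration):

```python
import json
import torch

def to_serializable(value):
    # 0-dim tensors (e.g. the mean-cosine scores logged above) become plain floats
    return value.item() if torch.is_tensor(value) else value

to_log = {"mean_cosine-nn-S2T-10000": torch.tensor(0.6022)}  # hypothetical entry
print(json.dumps({k: to_serializable(v) for k, v in to_log.items()}))
```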

How to open/view the best_mapping.t7 file?

I trained a model with unsupervised.py on the English↔Spanish pair, but I am unable to open the resulting best_mapping.t7 file. As far as I know it is a Torch 7 file format; how can I open or view it?
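
Not an authoritative answer, but a sketch of what usually works: recent MUSE checkouts save the mapping with `torch.save` (as `best_mapping.pth`, as in the log further down), so `torch.load` reads it back as the mapping matrix; a genuine Lua Torch `.t7` archive would instead need a dedicated reader such as the third-party `torchfile` package. Paths below are placeholders.

```python
import torch

# Placeholder path: substitute your own experiment directory.
W = torch.load("dumped/debug/<exp_id>/best_mapping.pth")
print(type(W), getattr(W, "shape", None))  # typically a 300x300 matrix

# For an actual Lua Torch .t7 file (assumption: written from Lua Torch),
# something like torchfile.load("best_mapping.t7") may be needed instead.
```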

Fewer than 5000/1500 unique source words in the ground-truth bilingual dictionaries.

Apart from the main language pairs such as 'en-es', several ground-truth bilingual dictionaries are missing words.

For example,

I counted the unique source words with a simple shell command:

awk -F' ' '{print $1}' no-en.0-5000.txt | uniq | wc -l

Here is a sample of the counts:

/en-ko.0-5000.txt 4870
/en-tr.0-5000.txt 4998
/en-vi.0-5000.txt 4993
/ko-en.0-5000.txt 4685
/ms-en.0-5000.txt 4998
/no-en.0-5000.txt 4999
/tr-en.0-5000.txt 4943
/vi-en.0-5000.txt 4998

/en-ko.5000-6500.txt 1465
/ko-en.5000-6500.txt 1461
/ms-en.5000-6500.txt 1499
/tr-en.5000-6500.txt 1499

Embedding files not generated.

After running the supervised alignment script, the aligned embedding vectors are not written to the dumped/ directory. Instead I get this error at the end of the last refinement iteration:

Flags:
n_refinement = 1
export = txt
cuda = False

Language pair: hi-en

INFO - 06/04/18 15:26:33 - 3:15:50 - End of iteration 1.
                                     
                                     
INFO - 06/04/18 15:26:33 - 3:15:50 - * Reloading the best model from /home/ravi/muse/MUSE/dumped/debug/n31qnn6tl6/best_mapping.pth ...
INFO - 06/04/18 15:26:33 - 3:15:50 - Reloading all embeddings for mapping ...
INFO - 06/04/18 15:26:50 - 3:16:08 - Loaded 158016 pre-trained word embeddings.
INFO - 06/04/18 15:30:55 - 3:20:12 - Loaded 2519370 pre-trained word embeddings.
Traceback (most recent call last):
  File "./MUSE/supervised.py", line 109, in <module>
    trainer.export()
  File "/home/myDir/muse/MUSE/src/trainer.py", line 252, in export
    params.tgt_dico, tgt_emb = load_embeddings(params, source=False, full_vocab=True)
  File "/home/myDir/muse/MUSE/src/utils.py", line 406, in load_embeddings
    return read_txt_embeddings(params, source, full_vocab)
  File "/home/myDir/muse/MUSE/src/utils.py", line 310, in read_txt_embeddings
    embeddings = torch.from_numpy(embeddings).float()
RuntimeError: $ Torch: not enough memory: you tried to allocate 2GB. Buy new RAM! at /pytorch/aten/src/TH/THGeneral.c:218

Can anyone help me out with this?
Thanks in advance.
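
For context, a back-of-the-envelope estimate of the memory the export step needs, using the target vocabulary size from the log above (the assumption that at least one full float32 copy is held in RAM is mine):

```python
# 2,519,370 target words x 300 dims x 4 bytes (float32)
n_words, dim, bytes_per_float = 2519370, 300, 4
gib = n_words * dim * bytes_per_float / float(1024 ** 3)
print("~%.1f GiB per in-memory copy of the target embeddings" % gib)  # ~2.8 GiB
```

The traceback shows the failure inside `torch.from_numpy(embeddings).float()`, i.e. while materializing such a copy, which is consistent with the allocation error on a machine with little free RAM.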

Questions about the output during training.

INFO - 01/23/18 01:13:58 - 0:06:00 - 4317 source words - csls_knn_10 - Precision at k = 1: 0.486449
INFO - 01/23/18 01:13:58 - 0:06:00 - 4317 source words - csls_knn_10 - Precision at k = 5: 0.625434
INFO - 01/23/18 01:13:58 - 0:06:00 - 4317 source words - csls_knn_10 - Precision at k = 10: 0.741256
INFO - 01/23/18 01:14:04 - 0:06:06 - Building the train dictionary ...
INFO - 01/23/18 01:14:04 - 0:06:06 - New train dictionary of 5223 pairs.
INFO - 01/23/18 01:14:04 - 0:06:06 - Mean cosine (nn method, S2T build, 10000 max size): 0.44828
INFO - 01/23/18 01:14:21 - 0:06:23 - Building the train dictionary ...
INFO - 01/23/18 01:14:21 - 0:06:23 - New train dictionary of 4368 pairs.
INFO - 01/23/18 01:14:21 - 0:06:23 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.45662

Above is part of the output produced during model training with unsupervised.py or supervised.py. What does "Precision" mean? Can it be used as a performance indicator for the trained model?
Thank you very much!
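
Precision at k is the word-translation accuracy on the evaluation dictionary: the fraction of source words whose correct translation appears among the k nearest target words under the current mapping, so it is indeed the main performance indicator reported during training. A minimal sketch of the computation (illustrative code, not MUSE's own):

```python
import numpy as np

def precision_at_k(src_emb, tgt_emb, gold, k):
    """gold maps a source row index to the set of correct target row indices."""
    # cosine similarity via length-normalized dot products
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    hits = 0
    for s, correct in gold.items():
        scores = tgt @ src[s]
        topk = np.argpartition(-scores, k)[:k]     # indices of the k best scores
        hits += bool(correct & set(topk.tolist()))
    return 100.0 * hits / len(gold)

# Toy usage: identical "aligned" spaces and an identity gold dictionary give 100.0.
rng = np.random.RandomState(0)
emb = rng.randn(1000, 300)
print(precision_at_k(emb, emb, {i: {i} for i in range(50)}, k=1))
```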

Discriminator inputs: frequency cutoff

I am having a hard time figuring out which embeddings to train the model on. Your paper says: "As a result, we only feed the discriminator with the 50,000 most frequent words."
Does that mean that the input files data/wiki.en.vec and data/wiki.es.vec each contain only 50,000 words?
If so, how do you then include 200,000 words in the experiments if the output files data/wiki.en-es.en.vec and data/wiki.en-es.es.vec only contain 50,000 words each?
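
My understanding (hedged, as I am not a maintainer): the input .vec files keep their full vocabulary, --max_vocab (200000 in the config dump above) controls how many words are loaded, mapped, and exported, and only the most frequent of those words are sampled when training the discriminator (MUSE exposes a --dis_most_frequent option for this, if I recall correctly). A toy sketch of that split:

```python
import torch

# Real runs use max_vocab=200000 and a discriminator cutoff around 50000;
# smaller toy numbers are used here so the sketch runs instantly.
max_vocab, dis_most_frequent, batch_size, dim = 2000, 500, 32, 300

# Toy stand-ins for the loaded fastText embeddings (.vec files are sorted by
# word frequency, so "most frequent" == lowest row indices).
src_emb = torch.randn(max_vocab, dim)
tgt_emb = torch.randn(max_vocab, dim)

# Adversarial batches are drawn only from the top rows...
src_ids = torch.randint(0, dis_most_frequent, (batch_size,)).long()
tgt_ids = torch.randint(0, dis_most_frequent, (batch_size,)).long()
disc_batch = torch.cat([src_emb[src_ids], tgt_emb[tgt_ids]], dim=0)

# ...but the learned mapping W is applied to, and exported for, all max_vocab words.
W = torch.eye(dim)
exported_src = src_emb @ W
```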
