facebookresearch / MUSE
A library for Multilingual Unsupervised or Supervised word Embeddings
License: Other
Hi, @glample
Here is the output during training, and I found these files (best_mapping.t7, params.pkl, train.log, vectors-latin.txt, vectors-zh.txt) in the dumped folder. But how can I make use of these word embeddings? Right now I simply take a word's vector from the source embedding file and find the 10 words whose vectors in the target embedding file are most similar to it. Is this the right way? The outputs below show two methods, nn and csls_knn_10; do I need to care about the difference between them? Thank you!
==============================================
OrderedDict([('n_iter', 149)])
nn
path:data/crosslingual/dictionaries/zh-latin.5000-6500.txt
INFO - 01/24/18 09:18:02 - 1:28:43 - Found 8552 pairs of words in the dictionary (8552 unique). 9297 other pairs contained at least one unknown word (4953 in lang1, 6954 in lang2)
INFO - 01/24/18 09:18:02 - 1:28:43 - 8552 source words - nn - Precision at k = 1: 10.289991
INFO - 01/24/18 09:18:02 - 1:28:43 - 8552 source words - nn - Precision at k = 5: 17.516370
INFO - 01/24/18 09:18:02 - 1:28:44 - 8552 source words - nn - Precision at k = 10: 21.246492
csls_knn_10
path:data/crosslingual/dictionaries/zh-latin.5000-6500.txt
INFO - 01/24/18 09:18:02 - 1:28:44 - Found 8552 pairs of words in the dictionary (8552 unique). 9297 other pairs contained at least one unknown word (4953 in lang1, 6954 in lang2)
INFO - 01/24/18 09:18:05 - 1:28:46 - 8552 source words - csls_knn_10 - Precision at k = 1: 9.985968
INFO - 01/24/18 09:18:05 - 1:28:46 - 8552 source words - csls_knn_10 - Precision at k = 5: 16.300281
INFO - 01/24/18 09:18:05 - 1:28:47 - 8552 source words - csls_knn_10 - Precision at k = 10: 20.053789
INFO - 01/24/18 09:18:10 - 1:28:51 - Building the train dictionary ...
INFO - 01/24/18 09:18:10 - 1:28:51 - New train dictionary of 7721 pairs.
INFO - 01/24/18 09:18:10 - 1:28:51 - Mean cosine (nn method, S2T build, 10000 max size): 0.49470
INFO - 01/24/18 09:18:24 - 1:29:05 - Building the train dictionary ...
INFO - 01/24/18 09:18:24 - 1:29:05 - New train dictionary of 6263 pairs.
INFO - 01/24/18 09:18:24 - 1:29:05 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.50656
INFO - 01/24/18 09:18:24 - 1:29:05 - log:{"n_iter": 149, "precision_at_1-nn": 10.28999064546305, "precision_at_5-nn": 17.516370439663238, "precision_at_10-nn": 21.246492048643592, "precision_at_1-csls_knn_10": 9.985968194574369, "precision_at_5-csls_knn_10": 16.300280636108514, "precision_at_10-csls_knn_10": 20.053788587464922, "mean_cosine-nn-S2T-10000": 0.49470236897468567, "mean_cosine-csls_knn_10-S2T-10000": 0.5065571665763855}
INFO - 01/24/18 09:18:24 - 1:29:05 - End of refinement iteration 149.
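Taking a word's source vector and searching for the closest target vectors is exactly what the nn method does; csls_knn_10 is a variant designed to mitigate the hubness problem. A minimal nearest-neighbor sketch over two exported .vec files (file names, max_vocab, and the helper functions are assumptions, not MUSE code):

```python
import numpy as np

def load_vec(path, max_vocab=50000):
    # fastText-style .vec file: first line is "count dim", then "word v1 ... vd"
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        next(f)  # skip header
        for i, line in enumerate(f):
            if i >= max_vocab:
                break
            w, rest = line.rstrip().split(" ", 1)
            words.append(w)
            vecs.append(np.array(rest.split(), dtype=np.float32))
    emb = np.vstack(vecs)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit norm, so dot = cosine
    return words, emb

def translate_nn(word, src_words, src_emb, tgt_words, tgt_emb, k=10):
    # nearest-neighbor retrieval in the aligned space
    scores = tgt_emb @ src_emb[src_words.index(word)]
    return [tgt_words[j] for j in np.argsort(-scores)[:k]]
```

Usage would be something like `translate_nn("好", *load_vec("vectors-zh.txt"), *load_vec("vectors-latin.txt"))`, assuming both files were exported by the same run and are therefore in the same space.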
Here are my settings below; the rest of the parameters were left at their defaults.
The word vectors and the zh-en dictionary were downloaded from the official site.
export SRC_EMB=/home/jack/dev1.8t/corpus/zh-en/wiki.zh.vec
export TGT_EMB=/home/jack/dev1.8t/corpus/zh-en/wiki.en.vec
nohup python unsupervised.py --src_lang zh --tgt_lang en --src_emb $SRC_EMB --tgt_emb $TGT_EMB --cuda 1 --export 1 --exp_path ./dumped/unsuperv/zh-mn --emb_dim 300 --refinement true --adversarial true > zh-en-unsuper.log &
But the results are all 0s:
Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 1: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 5: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 10: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 5: 0.000000
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 10: 0.000000
python3 supervised.py --src_lang en --tgt_lang es --src_emb ../wiki.en.vec --tgt_emb ../Spanish_wiki.es.vec --n_iter 5 --dico_train identical_char --cuda False
Failed to load GPU Faiss: No module named 'swigfaiss_gpu'
Faiss falling back to CPU-only.
Impossible to import Faiss-GPU. Switching to FAISS-CPU, this will be slower.
INFO - 01/06/18 12:20:38 - 0:00:00 - ============ Initialized logger ============
INFO - 01/06/18 12:20:38 - 0:00:00 - cuda: False
dico_build: S2T&T2S
dico_max_rank: 10000
dico_max_size: 0
dico_method: csls_knn_10
dico_min_size: 0
dico_threshold: 0
dico_train: identical_char
emb_dim: 300
exp_path: /Documents/MUSE-master/dumped/nrthsd26ay
export: True
max_vocab: 200000
n_iters: 5
normalize_embeddings:
seed: -1
src_emb: ../wiki.en.vec
src_lang: en
tgt_emb: ../Spanish_wiki.es.vec
tgt_lang: es
verbose: 2
INFO - 01/06/18 12:20:38 - 0:00:00 - The experiment will be stored in /Documents/MUSE-master/dumped/nrthsd26ay
INFO - 01/06/18 12:20:48 - 0:00:10 - Loaded 200000 pre-trained word embeddings
INFO - 01/06/18 12:21:02 - 0:00:24 - Loaded 200000 pre-trained word embeddings
INFO - 01/06/18 12:21:04 - 0:00:26 - Found 85912 pairs of identical character strings.
INFO - 01/06/18 12:21:05 - 0:00:26 - Starting refinement iteration 0...
INFO - 01/06/18 12:22:14 - 0:01:36 - ====================================================================
INFO - 01/06/18 12:22:14 - 0:01:36 - Dataset Found Not found Rho
INFO - 01/06/18 12:22:14 - 0:01:36 - ====================================================================
here:
n--> 771
m--> 771
Traceback (most recent call last):
File "supervised.py", line 92, in <module>
evaluator.all_eval(to_log)
File "/Documents/MUSE-master/src/evaluation/evaluator.py", line 188, in all_eval
self.monolingual_wordsim(to_log)
File "/Documents/MUSE-master/src/evaluation/evaluator.py", line 43, in monolingual_wordsim
self.mapping(self.src_emb.weight).data.cpu().numpy()
File "/Documents/MUSE-master/src/evaluation/wordsim.py", line 104, in get_wordsim_scores
coeff, found, not_found = get_spearman_rho(word2id, embeddings, filepath, lower)
File "/Documents/MUSE-master/src/evaluation/wordsim.py", line 83, in get_spearman_rho
return spearmanr(gold, pred).correlation, len(gold), not_found
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/scipy/stats/stats.py", line 3301, in spearmanr
rho, pval = mstats_basic.spearmanr(a, b, axis)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/scipy/stats/mstats_basic.py", line 461, in spearmanr
raise ValueError("The input must have at least 3 entries!")
ValueError: The input must have at least 3 entries!
Hi,
after the alignment I can see that the source embeddings have also changed; both spaces have been "moved". Is there any way to anchor the source space and modify only the target?
Hi, I was trying to train an alignment of embeddings from German to Japanese. Since we don't have a German-Japanese translation dictionary, I was using unsupervised alignment, assuming no dictionary is required. However, during training, unsupervised.py calls evaluator.all_eval and self.word_translation from evaluator.py and word_translation.py when it evaluates the embeddings / discriminator, and the program stops because no German-Japanese dictionary is found. Is a translation dictionary still required to perform unsupervised alignment in this case, or is there a parameter I need to change to avoid this?
Is it possible to provide some examples of how to use the monolingual aligned vectors (unsupervised) to generate translations? I have the aligned vectors for Malayalam-Hindi; an example of how to proceed further would be helpful.
After running the supervised alignment script, the embedding vectors are not dumped to the dump directory. Instead I get this error at the end of the last refinement iteration:
flags:
n_refinement = 1
export txt
cuda False
langs:
hi-en
INFO - 06/04/18 15:26:33 - 3:15:50 - End of iteration 1.
INFO - 06/04/18 15:26:33 - 3:15:50 - * Reloading the best model from /home/ravi/muse/MUSE/dumped/debug/n31qnn6tl6/best_mapping.pth ...
INFO - 06/04/18 15:26:33 - 3:15:50 - Reloading all embeddings for mapping ...
INFO - 06/04/18 15:26:50 - 3:16:08 - Loaded 158016 pre-trained word embeddings.
INFO - 06/04/18 15:30:55 - 3:20:12 - Loaded 2519370 pre-trained word embeddings.
Traceback (most recent call last):
File "./MUSE/supervised.py", line 109, in <module>
trainer.export()
File "/home/myDir/muse/MUSE/src/trainer.py", line 252, in export
params.tgt_dico, tgt_emb = load_embeddings(params, source=False, full_vocab=True)
File "/home/myDir/muse/MUSE/src/utils.py", line 406, in load_embeddings
return read_txt_embeddings(params, source, full_vocab)
File "/home/myDir/muse/MUSE/src/utils.py", line 310, in read_txt_embeddings
embeddings = torch.from_numpy(embeddings).float()
RuntimeError: $ Torch: not enough memory: you tried to allocate 2GB. Buy new RAM! at /pytorch/aten/src/TH/THGeneral.c:218
can anyone help me out with this?
Thanks in advance.
INFO - 05/01/18 22:03:49 - 0:00:07 - Found 3896 pairs of words in the dictionary (1919 unique). 3836 other pairs contained at least one unknown word (1388 in lang1, 3089 in lang2)
INFO - 05/01/18 22:03:49 - 0:00:08 - 1919 source words - nn - Precision at k = 1: 9.692548
INFO - 05/01/18 22:03:49 - 0:00:08 - 1919 source words - nn - Precision at k = 5: 22.094841
INFO - 05/01/18 22:03:50 - 0:00:08 - 1919 source words - nn - Precision at k = 10: 29.233976
INFO - 05/01/18 22:03:50 - 0:00:08 - Found 3896 pairs of words in the dictionary (1919 unique). 3836 other pairs contained at least one unknown word (1388 in lang1, 3089 in lang2)
INFO - 05/01/18 22:03:50 - 0:00:09 - 1919 source words - csls_knn_10 - Precision at k = 1: 9.015112
INFO - 05/01/18 22:03:51 - 0:00:09 - 1919 source words - csls_knn_10 - Precision at k = 5: 19.124544
INFO - 05/01/18 22:03:51 - 0:00:09 - 1919 source words - csls_knn_10 - Precision at k = 10: 24.856696
Traceback (most recent call last):
File "supervised.py", line 98, in <module>
evaluator.all_eval(to_log)
File "/home/jack/software/MUSE/src/evaluation/evaluator.py", line 217, in all_eval
self.dist_mean_cosine(to_log)
File "/home/jack/software/MUSE/src/evaluation/evaluator.py", line 197, in dist_mean_cosine
s2t_candidates = get_candidates(src_emb, tgt_emb, _params)
File "/home/jack/software/MUSE/src/dico_builder.py", line 38, in get_candidates
scores = emb2.mm(emb1[i:min(n_src, i + bs)].transpose(0, 1)).transpose(0, 1)
ValueError: result of slicing is an empty tensor
I have specified --dico_max_rank 4000 --dico_method csls_knn_10, but the message from /home/jack/software/MUSE/src/dico_builder.py, line 38, still says 10000 and nn.
I was trying to build cross-lingual word embeddings for Malayalam and Hindi.
Environment : Ubuntu 16, 8CPUs/52GB RAM, Tesla K80, Google Cloud, CUDA 8, Python 3.6, Faiss not installed
This is what I did,
curl -Lo data/wiki.ml.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.ml.vec
curl -Lo data/wiki.hi.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.hi.vec
Then
python unsupervised.py --src_lang ml --tgt_lang hi --src_emb ../data/wiki.ml.vec --tgt_emb ../data/wiki.hi.vec
After running it around 10 mins, I got this error,
INFO - 12/27/17 13:10:56 - 0:10:37 - 988000 - Discriminator loss: 0.4106 - 3290 samples/s
INFO - 12/27/17 13:10:58 - 0:10:39 - 992000 - Discriminator loss: 0.4109 - 3339 samples/s
INFO - 12/27/17 13:11:00 - 0:10:42 - 996000 - Discriminator loss: 0.4110 - 3344 samples/s
Traceback (most recent call last):
File "unsupervised.py", line 135, in <module>
evaluator.all_eval(to_log)
File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/evaluator.py", line 190, in all_eval
self.word_translation(to_log)
File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/evaluator.py", line 94, in word_translation
method=method
File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/word_translation.py", line 88, in get_word_translation_accuracy
dico = load_dictionary(path, word2id1, word2id2)
File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/word_translation.py", line 48, in load_dictionary
assert os.path.isfile(path)
AssertionError
INFO - 03/01/18 16:49:10 - 0:00:00 - The experiment will be stored in /usr2/home/nkvyas/Bayesian/MUSE/dumped/6uymzfdoug
Traceback (most recent call last):
File "unsupervised.py", line 91, in <module>
src_emb, tgt_emb, mapping, discriminator = build_model(params, True)
File "/usr2/home/nkvyas/Bayesian/MUSE/src/models.py", line 46, in build_model
src_dico, _src_emb = load_external_embeddings(params, source=True)
File "/usr2/home/nkvyas/Bayesian/MUSE/src/utils.py", line 282, in load_external_embeddings
assert len(split) == 2
AssertionError
Hi all,
Not sure if this is an error on my end or not, but as I couldn't debug it myself I thought I would post it here. The same happens with both py2 and py3.
Using:
numpy (1.13.3)
torch (0.3.0.post4)
torchvision (0.2.0)
No Faiss installed.
When I run:
CUDA_VISIBLE_DEVICES=2,3 /home/sevajuri/anaconda3/envs/torch3/bin/python unsupervised.py --src_lang en --tgt_lang fr --src_emb wiki.en.vec --tgt_emb wiki.fr.vec --max_vocab=1000000
I get (after the first epoch):
Traceback (most recent call last):
File "unsupervised.py", line 135, in <module>
evaluator.all_eval(to_log)
File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/evaluator.py", line 190, in all_eval
self.word_translation(to_log)
File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/evaluator.py", line 94, in word_translation
method=method
File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/word_translation.py", line 88, in get_word_translation_accuracy
dico = load_dictionary(path, word2id1, word2id2)
File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/word_translation.py", line 48, in load_dictionary
assert os.path.isfile(path)
AssertionError
Hi, I was wondering how the 5000+ and 1500+ pairs were selected to build the training/testing dictionaries. As the full dictionary can contain 100K+ pairs, do we just take the most frequent words? I understand the pre-defined dictionary is only used in the first iteration of supervised training, but how much does the initial selection of translation pairs affect the alignment performance? Another question: why select 5000? Would including more translation pairs in the training dictionary help? Thanks in advance!
According to the paper, the only trainable parameter is W, which maps the source embeddings into the target space. So why are the target embeddings written out to a separate file, apparently different from the initial target embeddings?
I trained fastText embeddings, all basic commands:
./fasttext cbow -input cleaned.et -thread 16 -dim 300 -output mdl.skip.et
./fasttext cbow -input cleaned.en -thread 16 -dim 300 -output mdl.skip.en
Now I wanted to train supervised MUSE, code:
python unsupervised.py --src_lang et --tgt_lang en --src_emb mdl.skip.et.bin --tgt_emb mdl.skip.en.bin --export pth --exp_name eten-skip-300 --emb_dim 300
I get the following stacktrace:
INFO - 04/13/18 09:24:37 - 0:00:00 - The experiment will be stored in /gpfs/hpchome/b02166/thesis/upd_muse/MUSE/dumped/eten-skip-300/c346kmebk8
Traceback (most recent call last):
File "supervised.py", line 69, in <module>
src_emb, tgt_emb, mapping, _ = build_model(params, False)
File "/gpfs/hpchome/b02166/thesis/upd_muse/MUSE/src/models.py", line 46, in build_model
src_dico, _src_emb = load_embeddings(params, source=True)
File "/gpfs/hpchome/b02166/thesis/upd_muse/MUSE/src/utils.py", line 402, in load_embeddings
return load_bin_embeddings(params, source, full_vocab)
File "/gpfs/hpchome/b02166/thesis/upd_muse/MUSE/src/utils.py", line 369, in load_bin_embeddings
assert embeddings.size() == (len(words), params.emb_dim)
AssertionError
So there is an AssertionError in load_bin_embeddings in utils.py, at assert embeddings.size() == (len(words), params.emb_dim). What am I doing wrong here?
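One quick sanity check: that assertion compares the loaded matrix against (len(words), params.emb_dim), so it fires when the trained model's dimension differs from --emb_dim. A small sketch (hypothetical helper; the file name is just an example) that reads the header of the .vec file fastText writes next to the .bin:

```python
def vec_header(path):
    # fastText writes "<vocab_size> <dim>" as the first line of its .vec output
    with open(path, encoding="utf-8") as f:
        n, d = f.readline().split()
    return int(n), int(d)

# e.g. vec_header("mdl.skip.et.vec") should report dim 300 to match --emb_dim 300
```

If the fasttext Python bindings are installed, `fasttext.load_model("mdl.skip.et.bin").get_dimension()` gives the same check on the .bin directly.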
According to the docs, when set to "identical_char" it will use identical character strings between the source and target languages to form a vocabulary. I understood that the dictionary was going to be created from the given corpus.
Hi, this question is not about the code implementation itself but about a detail of the original paper; since I don't know of a better place to ask, I'm putting it here.
From the orthogonal Procrustes problem, we know that W* = UV^T, where UΣV^T = SVD(YX^T). Why not just compute this SVD directly instead of introducing the adversarial training approach?
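For reference, when a seed dictionary does give paired embedding matrices X (source) and Y (target), the Procrustes solution really is just a few lines of NumPy. A sketch (note the adversarial stage exists precisely because, without any dictionary, no such row pairing is available to feed this SVD):

```python
import numpy as np

def procrustes(X, Y):
    # X, Y: (n, d) row-wise paired embeddings.
    # Returns the orthogonal W minimizing ||X W^T - Y||_F, i.e. W @ x ≈ y.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt
```

MUSE's refinement step alternates this closed-form update with rebuilding the synthetic dictionary, which is why adversarial training is only needed to get the first rough alignment.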
I am getting the error below, any idea?
Solving environment: failed
PackagesNotFoundError: The following packages are not available from current channels:
faiss-cpu
Current channels:
https://conda.anaconda.org/pytorch/osx-64
https://conda.anaconda.org/pytorch/noarch
https://repo.continuum.io/pkgs/main/osx-64
https://repo.continuum.io/pkgs/main/noarch
https://repo.continuum.io/pkgs/free/osx-64
https://repo.continuum.io/pkgs/free/noarch
https://repo.continuum.io/pkgs/r/osx-64
https://repo.continuum.io/pkgs/r/noarch
https://repo.continuum.io/pkgs/pro/osx-64
https://repo.continuum.io/pkgs/pro/noarch
(followed by conda's activate/deactivate clang_osx-64.sh and clangxx_osx-64.sh environment-change listings, omitted)
Wow, I was just working with it and you changed so much of the readme and now actually provide embeddings for 30 languages in one space!
Fantastic, thank you!
I was about to ask how it would be possible to have more than 2 languages in one space, and how accurate that still is.
One problem I encountered is that DIC_EVAL_PATH in src/evaluation/word_translation.py is not changeable with an argument, so it always looks in the default path.
Maybe add another argument for that path, or just one path parameter under which it looks for train.txt and eval.txt, or src_lang-tgt_lang-train.txt, ...
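Until such a flag exists, a local workaround is to thread the dictionary directory through argparse instead of the module-level DIC_EVAL_PATH constant. A sketch (the --dico_eval_path flag is made up, not an actual MUSE option):

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--dico_eval_path", type=str,
                    default=os.path.join("data", "crosslingual", "dictionaries"),
                    help="directory holding <src>-<tgt>.5000-6500.txt files")
# Demonstration: override the default directory
params = parser.parse_args(["--dico_eval_path", "/my/dicts"])

# Build the evaluation-dictionary path from the flag instead of DIC_EVAL_PATH
path = os.path.join(params.dico_eval_path, "%s-%s.5000-6500.txt" % ("en", "de"))
```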
I am having a hard time figuring out what embeddings to train the model on. Your paper says: "As a result, we only feed the discriminator with the 50,000 most frequent words."
Does that mean that input files data/wiki.en.vec and data/wiki.es.vec each contain 50,000 words only?
If so, how do you then include 200,000 words in the experiments if output files data/wiki.en-es.en.vec and data/wiki.en-es.es.vec only contain 50,000 words each?
python supervised.py --src_lang zh --tgt_lang en --src_emb /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.zh.vec --tgt_emb /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.en.vec --n_refinement 5
Training finishes successfully, but then a TypeError is thrown:
INFO - 04/26/18 08:42:05 - 0:00:00 - ============ Initialized logger ============
INFO - 04/26/18 08:42:05 - 0:00:00 - cuda: True
dico_build: S2T&T2S
dico_eval: default
dico_max_rank: 10000
dico_max_size: 0
dico_method: csls_knn_10
dico_min_size: 0
dico_threshold: 0
dico_train: default
emb_dim: 300
exp_id:
exp_name: debug
exp_path: /u/xiamengx/src/nlpcc/MUSE/dumped/debug/ufwt0sef93
export: txt
max_vocab: 200000
n_refinement: 5
normalize_embeddings:
seed: -1
src_emb: /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.zh.vec
src_lang: zh
tgt_emb: /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.en.vec
tgt_lang: en
verbose: 2
INFO - 04/26/18 08:42:05 - 0:00:00 - The experiment will be stored in /u/xiamengx/src/nlpcc/MUSE/dumped/debug/ufwt0sef93
INFO - 04/26/18 08:42:24 - 0:00:18 - Loaded 200000 pre-trained word embeddings.
INFO - 04/26/18 08:42:48 - 0:00:43 - Loaded 200000 pre-trained word embeddings.
INFO - 04/26/18 08:42:53 - 0:00:48 - Found 8891 pairs of words in the dictionary (5000 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/26/18 08:42:54 - 0:00:48 - Starting iteration 0...
INFO - 04/26/18 08:42:54 - 0:00:48 - ====================================================================
INFO - 04/26/18 08:42:54 - 0:00:48 - Dataset Found Not found Rho
INFO - 04/26/18 08:42:54 - 0:00:48 - ====================================================================
INFO - 04/26/18 08:42:54 - 0:00:48 - EN_WS-353-REL 252 0 0.6820
INFO - 04/26/18 08:42:54 - 0:00:48 - EN_VERB-143 144 0 0.3973
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_MEN-TR-3k 3000 0 0.7637
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_WS-353-SIM 203 0 0.7811
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_SEMEVAL17 379 9 0.7216
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_SIMLEX-999 998 1 0.3823
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_YP-130 130 0 0.5333
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_WS-353-ALL 353 0 0.7388
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_MC-30 30 0 0.8123
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_MTurk-287 286 1 0.6773
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_MTurk-771 771 0 0.6689
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_RG-65 65 0 0.7974
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_RW-STANFORD 1323 711 0.5080
INFO - 04/26/18 08:42:54 - 0:00:49 - ====================================================================
INFO - 04/26/18 08:42:54 - 0:00:49 - Monolingual target word similarity score average: 0.65108
INFO - 04/26/18 08:42:54 - 0:00:49 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/26/18 08:42:55 - 0:00:49 - 2483 source words - nn - Precision at k = 1: 5.598067
INFO - 04/26/18 08:42:55 - 0:00:50 - 2483 source words - nn - Precision at k = 5: 15.304068
INFO - 04/26/18 08:42:55 - 0:00:50 - 2483 source words - nn - Precision at k = 10: 21.546516
INFO - 04/26/18 08:42:55 - 0:00:50 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/26/18 08:43:08 - 0:01:02 - 2483 source words - csls_knn_10 - Precision at k = 1: 15.183246
INFO - 04/26/18 08:43:08 - 0:01:02 - 2483 source words - csls_knn_10 - Precision at k = 5: 34.071687
INFO - 04/26/18 08:43:08 - 0:01:03 - 2483 source words - csls_knn_10 - Precision at k = 10: 42.690294
INFO - 04/26/18 08:43:15 - 0:01:09 - Building the train dictionary ...
INFO - 04/26/18 08:43:15 - 0:01:09 - New train dictionary of 6665 pairs.
INFO - 04/26/18 08:43:15 - 0:01:09 - Mean cosine (nn method, S2T build, 10000 max size): 0.60218
INFO - 04/26/18 08:43:48 - 0:01:42 - Building the train dictionary ...
INFO - 04/26/18 08:43:48 - 0:01:42 - New train dictionary of 5529 pairs.
INFO - 04/26/18 08:43:48 - 0:01:42 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.59617
Traceback (most recent call last):
File "supervised.py", line 101, in <module>
logger.info("__log__:%s" % json.dumps(to_log))
File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/__init__.py", line 230, in dumps
return _default_encoder.encode(obj)
File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/encoder.py", line 198, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/encoder.py", line 256, in iterencode
return _iterencode(o, 0)
File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/encoder.py", line 179, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: tensor(0.6022, device='cuda:0') is not JSON serializable
Python 3.5.5 :: Anaconda, Inc
>>> torch.__version__
'0.4.0'
MUSE commit: a620cc8aa394d4eb345ecdeda5067b0b1ef30a6a
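A workaround several users applied (a sketch, assuming the offending values in to_log are 0-dim tensors): unwrap anything exposing .item() into a plain Python number before calling json.dumps in supervised.py.

```python
import json

def to_jsonable(log):
    # 0-dim torch tensors (and numpy scalars) expose .item(); unwrap them
    # so that json.dumps sees plain Python floats/ints.
    return {k: (v.item() if hasattr(v, "item") else v) for k, v in log.items()}

# logger.info("__log__:%s" % json.dumps(to_jsonable(to_log)))
```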
Faiss can now be installed using:
conda install faiss-cpu -c pytorch
# OR
conda install faiss-gpu -c pytorch
Hi, I am new to deep learning and I got the errors shown below even though I have already installed Faiss, PyTorch, CUDA, and Anaconda.
Failed to load GPU Faiss: No module named _swigfaiss_gpu
Faiss falling back to CPU-only.
Impossible to import Faiss library!! Switching to standard nearest neighbors search implementation, this will be significantly slower.
Traceback (most recent call last):
File "unsupervised.py", line 80, in <module>
assert not params.cuda or torch.cuda.is_available()
AssertionError
Hey,
is there any token for OOV words in the released models? If I understand correctly, the projection works with a finite set of word embeddings, not with the fastText model itself. So there is no way to use the original fastText mechanism to deal with OOV words, right?
Hi,
I tried training MUSE in the unsupervised way with the pretrained fasttext Wikipedia embeddings. On some European language pairs, such as EN-DE or EN-ES, I was able to get reasonable performance using the default parameters.
However, for EN-ZH or ZH-EN with the default parameters, the cross-lingual word similarity scores are always 0 (even for top 10).
As a comparison, to rule out problems with the data, I ran the supervised setting for EN-ZH, and it gave non-zero performance (though the number is a few points lower than that in the paper).
Any idea of what I might have done wrong?
Thank you.
While training embeddings with unsupervised.py, after changing the flag --dis_most_frequent from its default 75000 to 99999, I got the following error:
Traceback (most recent call last):
File "unsupervised.py", line 135, in <module>
evaluator.all_eval(to_log)
File "/data/sls/qcri/asr/sjoty/sbmaruf/MUSE/MUSE/src/evaluation/evaluator.py", line 192, in all_eval
self.dist_mean_cosine(to_log)
File "/data/sls/qcri/asr/sjoty/sbmaruf/MUSE/MUSE/src/evaluation/evaluator.py", line 173, in dist_mean_cosine
t2s_candidates = get_candidates(tgt_emb, src_emb, _params)
File "/data/sls/qcri/asr/sjoty/sbmaruf/MUSE/MUSE/src/dico_builder.py", line 125, in get_candidates
all_scores = all_scores[:params.dico_max_size]
RuntimeError: invalid argument 2: dimension 0 out of range of 0D tensor at /opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/TH/generic/THTensor.c:24
Hello, in the paper CSLS is calculated as CSLS(Wx, y) = 2 cos(Wx, y) − r_T(Wx) − r_S(y).
Since all three terms are pairwise cosine similarities (or mean similarities to the k nearest neighbors), I expected each tensor to have size [dictionary_size, 1]. However, when I look at the code, I found that cos(Wx, y) has size [128, dictionary_size], and average_dist1[i:min(n_src, i + bs)][:, None] + average_dist2[None, :] also has size [128, dictionary_size]. I wonder what this 128 means, or have I misunderstood some detail? Thank you!
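The 128 is just an evaluation batch size: scores are computed for a block of source words at a time, so the intermediate tensors have shape [batch, n_tgt] rather than [dictionary_size, 1]. A NumPy sketch of the same batched CSLS computation (names and layout assumed, not the exact MUSE code):

```python
import numpy as np

def csls(src, tgt, k=10, bs=128):
    # src: (n_src, d), tgt: (n_tgt, d), rows unit-normalized so dot = cosine
    sim_t = tgt @ src.T                                    # each target vs all mapped sources
    r_s = np.sort(sim_t, axis=1)[:, -k:].mean(axis=1)      # r_S(y): mean sim to k-NN, (n_tgt,)
    out = np.empty((src.shape[0], tgt.shape[0]))
    for i in range(0, src.shape[0], bs):                   # process source words in batches
        block = src[i:i + bs] @ tgt.T                      # shape [<=bs, n_tgt]: the "128"
        r_t = np.sort(block, axis=1)[:, -k:].mean(axis=1)  # r_T(Wx) for this batch
        out[i:i + bs] = 2 * block - r_t[:, None] - r_s[None, :]
    return out
```

Batching changes only the memory footprint, not the result: each row's penalty r_T depends solely on that row's similarities to all targets.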
When trying with cuda set to true, I am getting this error at supervised.py#L103:
TypeError: tensor(0.5628, device='cuda:0') is not JSON serializable
I tried to run the default data and program instructions for "Word translation without parallel data" for the En<--->Es language pair.
@glample
Why am I getting these errors in the end of the processing running unsupervised.py ?
(Errors are mentioned in the attachments.)
I am aligning English and Hindi fastText monolingual embeddings using the supervised method on a GPU. Are there any time estimates for how long it takes? It's been 4 hours, and it is still in the first refinement step.
I ran the following command:
python supervised.py --src_lang en --tgt_lang hi --src_emb wiki.en.vec --tgt_emb wiki.hi.vec --n_iter 5 --dico_train default
Update: it was running for close to 20 hours on a GeForce GTX 1080, constantly hogging 1 CPU core, but no entries were added to the log. I am running it again.
Log:
INFO - 12/27/17 17:57:14 - 0:00:00 - ============ Initialized logger ============
INFO - 12/27/17 17:57:14 - 0:00:00 - cuda: True
dico_build: S2T&T2S
dico_max_rank: 10000
dico_max_size: 0
dico_method: csls_knn_10
dico_min_size: 0
dico_threshold: 0
dico_train: default
emb_dim: 300
exp_path: /MUSE/dumped/hidden
export: True
max_vocab: 200000
n_iters: 5
normalize_embeddings:
seed: -1
src_emb: wiki.en.vec
src_lang: en
tgt_emb: wiki.hi.vec
tgt_lang: hi
verbose: 2
INFO - 12/27/17 17:57:14 - 0:00:00 - The experiment will be stored in hidden/MUSE/dumped/hidden
INFO - 12/27/17 17:57:25 - 0:00:11 - Loaded 200000 pre-trained word embeddings
INFO - 12/27/17 17:57:45 - 0:00:31 - Loaded 158016 pre-trained word embeddings
INFO - 12/27/17 17:57:49 - 0:00:34 - Found 8704 pairs of words in the dictionary (4998 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 12/27/17 17:57:49 - 0:00:34 - Starting refinement iteration 0...
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 - Dataset Found Not found Rho
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_MTurk-771 771 0 0.6689
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_MTurk-287 286 1 0.6773
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_SIMLEX-999 998 1 0.3823
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_WS-353-REL 252 0 0.6820
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_RW-STANFORD 1323 711 0.5080
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_MC-30 30 0 0.8123
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_WS-353-ALL 353 0 0.7388
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_VERB-143 144 0 0.3973
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_MEN-TR-3k 3000 0 0.7637
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_YP-130 130 0 0.5333
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_RG-65 65 0 0.7974
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_SEMEVAL17 379 9 0.7216
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_WS-353-SIM 203 0 0.7811
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 - Monolingual source word similarity score average: 0.65108
INFO - 12/27/17 17:57:49 - 0:00:35 - Found 2032 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 12/27/17 17:57:50 - 0:00:36 - 1500 source words - nn - Precision at k = 1: 23.800000
INFO - 12/27/17 17:57:51 - 0:00:36 - 1500 source words - nn - Precision at k = 5: 41.133333
INFO - 12/27/17 17:57:51 - 0:00:37 - 1500 source words - nn - Precision at k = 10: 48.133333
INFO - 12/27/17 17:57:51 - 0:00:37 - Found 2032 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
Hi,
I'm running the unsupervised alignment network on two sets of embeddings, one of which is in an undeciphered language.
I don't really care about the part of the training code that does evaluation against built-in dictionaries, since that isn't really well-defined for my application. Thus, I've tried running unsupervised.py with the default values (es-en) even though my embeddings are in Latin and an unknown language.
This works for the first epoch, but then it gives me the following error message after the epoch finishes and it tries to enter the evaluation code:
Traceback (most recent call last):
File "unsupervised.py", line 137, in <module>
evaluator.all_eval(to_log)
File "/gpfs/loomis/home.grace/fas/frank/wcm24/voynich2vec/MUSE/src/evaluation/evaluator.py", line 190, in all_eval
self.word_translation(to_log)
File "/gpfs/loomis/home.grace/fas/frank/wcm24/voynich2vec/MUSE/src/evaluation/evaluator.py", line 94, in word_translation
method=method
File "/gpfs/loomis/home.grace/fas/frank/wcm24/voynich2vec/MUSE/src/evaluation/word_translation.py", line 92, in get_word_translation_accuracy
assert dico[:, 0].max() < emb1.size(0)
IndexError: trying to index 2 dimensions of a 0 dimensional tensor
Any idea what could be going wrong? Or how I could just disable the part of the evaluation that is causing these issues? I have tried commenting out some lines in the code, but this always leaves me with other errors. I would prefer a cleaner solution.
Thanks!
Thanks for the software. Is there any document on controlling how many CPU threads MUSE spawns? I'm running it with FAISS-CPU.
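I don't know of a MUSE-specific flag for this, but since FAISS-CPU is OpenMP-backed, capping OMP_NUM_THREADS before the libraries are imported usually works. A sketch (the thread count 4 is just an example):

```python
import os

# Must be set before faiss/torch/numpy are imported to take effect.
os.environ["OMP_NUM_THREADS"] = "4"

# If faiss is already imported, its thread count can also be set at runtime
# (assuming a faiss build that exposes omp_set_num_threads):
# import faiss
# faiss.omp_set_num_threads(4)
```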
Apart from major language pairs like 'en-es', there are missing words in the ground-truth bilingual dictionaries.
For example,
I tried a simple shell pipeline for counting unique source words (using sort -u so that non-adjacent duplicates are also collapsed):
awk -F' ' '{print $1}' no-en.0-5000.txt | sort -u | wc -l
Here is a sample of the resulting counts:
/en-ko.0-5000.txt 4870
/en-tr.0-5000.txt 4998
/en-vi.0-5000.txt 4993
/ko-en.0-5000.txt 4685
/ms-en.0-5000.txt 4998
/no-en.0-5000.txt 4999
/tr-en.0-5000.txt 4943
/vi-en.0-5000.txt 4998
/en-ko.5000-6500.txt 1465
/ko-en.5000-6500.txt 1461
/ms-en.5000-6500.txt 1499
/tr-en.5000-6500.txt 1499
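As a cross-check of the shell counts above, the same tally can be done in a few lines of Python (the path is an example):

```python
def count_unique_source_words(path):
    """Count distinct first-column (source) words in a MUSE dictionary file."""
    with open(path, encoding="utf-8") as f:
        return len({line.split()[0] for line in f if line.strip()})

# e.g. count_unique_source_words("no-en.0-5000.txt")
```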
Hi,
It seems I could not find the EN-EO dictionary in the list of released dictionaries to reproduce Table 1 results. Are the Esperanto dictionaries released?
Thanks.
INFO - 01/23/18 01:13:58 - 0:06:00 - 4317 source words - csls_knn_10 - Precision at k = 1: 0.486449
INFO - 01/23/18 01:13:58 - 0:06:00 - 4317 source words - csls_knn_10 - Precision at k = 5: 0.625434
INFO - 01/23/18 01:13:58 - 0:06:00 - 4317 source words - csls_knn_10 - Precision at k = 10: 0.741256
INFO - 01/23/18 01:14:04 - 0:06:06 - Building the train dictionary ...
INFO - 01/23/18 01:14:04 - 0:06:06 - New train dictionary of 5223 pairs.
INFO - 01/23/18 01:14:04 - 0:06:06 - Mean cosine (nn method, S2T build, 10000 max size): 0.44828
INFO - 01/23/18 01:14:21 - 0:06:23 - Building the train dictionary ...
INFO - 01/23/18 01:14:21 - 0:06:23 - New train dictionary of 4368 pairs.
INFO - 01/23/18 01:14:21 - 0:06:23 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.45662
Above is part of the output during model training with unsupervised.py or supervised.py. What does "Precision" mean? Can it be used as a performance indicator for the trained model?
Thank you very much!
Hi, I was trying your method in the unsupervised setting with the en-fr language pair. I trained my embedding model using fastText on newstest2014. The vocabulary sizes of en and fr are 1962 and 2018.
I downloaded your ground-truth dictionary of en-fr full set, and put it on /MUSE/data/crosslingual/dictionaries/, and renamed it as "en-fr.5000-6500.txt".
I used the following command:
python unsupervised.py --src_lang en --tgt_lang fr --src_emb ../fastText-0.1.0/myemb/en.tok.vec --tgt_emb ../fastText-0.1.0/myemb/fr.tok.vec --dis_most_frequent 1000
And then I got the following logs and error. Could you please help me with that? Thanks!
===========================================================
INFO - 03/22/18 03:15:36 - 0:07:24 - 996000 - Discriminator loss: 1.5441 - 4455 samples/s
INFO - 03/22/18 03:15:38 - 0:07:25 - Found 1335 pairs of words in the dictionary (1014 unique) 111951 other pairs contained at least one unknown word (109965 in lang1, 110720 in lang2)
INFO - 03/22/18 03:15:38 - 0:07:25 - 1014 source words - nn - Precision at k = 1: 0.098619
INFO - 03/22/18 03:15:38 - 0:07:26 - 1014 source words - nn - Precision at k = 5: 0.295858
INFO - 03/22/18 03:15:38 - 0:07:26 - 1014 source words - nn - Precision at k = 10: 0.493097
INFO - 03/22/18 03:15:39 - 0:07:26 - Found 1335 pairs of words in the dictionary (1014 unique) 111951 other pairs contained at least one unknown word (109965 in lang1, 110720 in lang2)
INFO - 03/22/18 03:15:39 - 0:07:26 - 1014 source words - csls_knn_10 - Precision at k = 1: 0.00000
INFO - 03/22/18 03:15:39 - 0:07:26 - 1014 source words - csls_knn_10 - Precision at k = 5: 0.43097
INFO - 03/22/18 03:15:39 - 0:07:26 - 1014 source words - csls_knn_10 - Precision at k = 10: 0.91716
Traceback (most recent call last):
File "unsupervised.py", line 135, in <module>
evaluator.all_eval(to_log)
File "/var/storage/shared/pnrsy/sys/jobs/application_1513627406021_70260/MUSE/src/evaluation/evaluator.py", line 192, in all_eval
self.dist_mean_cosine(to_log)
File "/var/storage/shared/pnrsy/sys/jobs/application_1513627406021_70260/MUSE/src/evaluation/evaluator.py", line 172, in dist_mean_cosine
s2t_candidates = get_candidates(src_emb, tgt_emb, _params)
File "/var/storage/shared/pnrsy/sys/jobs/application_1513627406021_70260/MUSE/src/dico_builder.py", line 38, in get_candidates
scores = emb2.mm(emb1[i:min(n_src, i + bs)].transpose(0, 1)).transpose(0, 1)
ValueError: result of slicing is an empty tensor
What was the evaluation set used for table-2, both for results on Wacky (top 7 rows) and Wiki (bottom 2 rows)?
I got this error no matter how small I made the options.
Here are the options I specified. The dimensionality of the word embeddings trained by fastText is 300, and the GPU has 8 GB of memory:
--dis_hid_dim 32
--batch_size 5
--epoch_size 10
--n_epochs 1
Any suggestions? Thank you very much!
Suppose I have already obtained aligned vector spaces from an approach other than MUSE. Does it make sense to run those already-aligned vector spaces through the adversarial training proposed in MUSE? Will it make the vector spaces more aligned, and thus produce better results?
According to the code,
https://github.com/facebookresearch/MUSE/blob/master/src/utils.py#L293
unsupervised.py only takes a restricted number of vocabulary words (200,000 by default) from the embedding text file.
From the paper 'Word Translation without Parallel Data', section 3.2:
"The embedding quality of rare words is generally not as good as the one of frequent words and we observed that feeding the discriminator with rare words had a small but not negligible negative impact."
Which parameter did you add to handle this issue? From its name, params.dis_most_frequent seems to handle it, but in that case, why do we also restrict the total vocabulary with params.max_vocab?
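For what it's worth, my reading of the two flags is: max_vocab caps how many vectors are loaded at all (memory and evaluation), while dis_most_frequent only restricts which word ids the discriminator's training batches are sampled from. A toy sketch of that sampling (names and logic are my paraphrase, not the actual MUSE code):

```python
import random

max_vocab = 200000         # how many embeddings are loaded from the .vec file
dis_most_frequent = 75000  # discriminator batches only draw word ids below this rank
batch_size = 32

def sample_discriminator_ids():
    """Sample word ids for a discriminator batch from the most frequent words only."""
    limit = dis_most_frequent if dis_most_frequent > 0 else max_vocab
    return [random.randrange(limit) for _ in range(batch_size)]
```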
Thanks for this wonderful project!
I found I cannot evaluate on the cross-lingual word similarity task (i.e., the SEMEVAL17 task).
In get_evaluation.sh, the eval data are downloaded as crosslingual/wordsim/$lg_pair-SEMEVAL17.txt:
Line 93 in 26e3e40
But in src/evaluation/wordsim.py, the expected eval files look like $lg_pair/SEMEVAL17.txt:
MUSE/src/evaluation/wordsim.py
Lines 204 to 218 in 26e3e40
I am trying to create my own bilingual word embeddings and I am finding it difficult to understand some things. I will try to describe my problem in detail in the hope of finding some answers.
I have a parallel corpus and I want to create bilingual word embeddings. I preprocessed both sides and created my word vectors with gensim (I wish I had used fastText, but once again the docs were not clear). Now that I have both sets of embeddings, my question is: if I don't have the same number of words in both models, does MUSE still work?
I was wondering whether the English multilingual word embeddings are just a normalized subset (200K) of the fastText English embeddings?
I can see that the vectors differ between the two; I would be happy to know whether the reason is normalization or something else :)
Thanks!
Hi, I have a quick question about the embeddings used during evaluation.
In README.md, does the cross-lingual evaluation
python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000
use the pretrained embeddings, or embeddings that were first normalized and then mapped to the target space, as processed in trainer.export()?
It seems that the monolingual evaluation uses the pretrained embeddings in data/wiki.en.vec, which is different from data/wiki.en-es.en.vec. If the cross-lingual evaluation uses the exported embeddings, why does the code
src_emb = self.mapping(self.src_emb.weight).data.cpu().numpy()
in evaluator.py apply the mapping once again?
Am I getting anything wrong?
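To make concrete what "applying the mapping" amounts to, here is a tiny numeric sketch (toy numbers, not real embeddings): a source vector x is carried into the target space as y = W x, so mapping twice only coincides with mapping once when W is the identity.

```python
def apply_mapping(W, x):
    """Map a source vector x into the target space: y = W @ x (plain lists, no torch)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

# With the identity matrix the vector is unchanged; a learned W rotates it.
W_identity = [[1.0, 0.0], [0.0, 1.0]]
print(apply_mapping(W_identity, [0.3, 0.7]))  # [0.3, 0.7]
```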
I was trying to run the unsupervised mapping task and got this error:
INFO - 04/15/18 03:50:23 - 0:00:06 - 9 source words - csls_knn_10 - Precision at k = 10: 0.000000
Traceback (most recent call last):
File "unsupervised.py", line 136, in <module>
evaluator.all_eval(to_log)
File "/home/nghibui/codes/MUSE/src/evaluation/evaluator.py", line 192, in all_eval
self.dist_mean_cosine(to_log)
File "/home/nghibui/codes/MUSE/src/evaluation/evaluator.py", line 172, in dist_mean_cosine
s2t_candidates = get_candidates(src_emb, tgt_emb, _params)
File "/home/nghibui/codes/MUSE/src/dico_builder.py", line 38, in get_candidates
scores = emb2.mm(emb1[i:min(n_src, i + bs)].transpose(0, 1)).transpose(0, 1)
ValueError: result of slicing is an empty tensor
I have no idea why this happened. Any explanation?
I'm doing the mapping task on a pair of languages that is not in the available list, so I don't have a dictionary for evaluation. Also, the vocabulary for each language is quite small, around 4000 words each. I guess the code will work if I can get rid of the evaluation tasks in evaluator.py.
I trained the model using unsupervised.py for the English<---->Spanish pair, but I am unable to open the best_mapping.t7 file.
I know it is a Torch serialization format, but how can I open/view it?
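If the goal is just to use the aligned embeddings, the exported text files (e.g. vectors-en.txt, written when exporting is enabled) may be easier to work with: they are plain .vec format, a header line "n dim" followed by one "word v1 ... vd" line per word. A hedged loader sketch (file layout as described; the path is an example):

```python
def load_vec(path, max_vocab=None):
    """Load a word2vec/fastText-style text embedding file into a dict of lists."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())  # header: vocab size, dimension
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = [float(v) for v in values]
            if max_vocab and len(embeddings) >= max_vocab:
                break
    return embeddings
```

As for best_mapping.t7 itself, I believe MUSE writes it with PyTorch's torch.save despite the .t7 extension, so torch.load("best_mapping.t7") should return the mapping matrix W; treat that as an assumption and check against your MUSE version.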
Hi,
I downloaded the English and Spanish embeddings from the multilingual word embeddings page.
If I understand correctly, the embeddings for similar words (like "good" and "bueno") should be close. However, when I calculate the cosine similarity between their embeddings I get a small value, around 0.15, and for distant words (like "bad" and "bueno") I get the same value.
Shouldn't the aligned embeddings have high cosine similarity? Am I missing something?
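For sanity-checking pairs like good/bueno, it may help to compute cosine similarity explicitly rather than a raw dot product, since the downloaded vectors are not necessarily unit-norm. A minimal sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy check: parallel vectors score ~1.0, orthogonal vectors ~0.0.
print(cosine([1.0, 2.0], [2.0, 4.0]))
print(cosine([1.0, 0.0], [0.0, 3.0]))
```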
Thanks for releasing supervised word embeddings for 30 languages, aligned in a single vector space. My question is: how would you align 3 languages? For example, how do you align French, German, and English in a single vector space? Your examples only seem to show how to align two languages. Thanks in advance!
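One common recipe (my assumption, not an official MUSE instruction) is to pick one language as the pivot, usually English, and align every other language to it independently; all outputs then live in the pivot's vector space. A sketch, mirroring the supervised command used elsewhere in this thread (file names are examples):

```shell
# Align French -> English and German -> English separately; the two mapped
# embedding sets then share the English space together with wiki.en.vec.
python supervised.py --src_lang fr --tgt_lang en --src_emb wiki.fr.vec --tgt_emb wiki.en.vec --dico_train default
python supervised.py --src_lang de --tgt_lang en --src_emb wiki.de.vec --tgt_emb wiki.en.vec --dico_train default
```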