facebookresearch / MUSE
A library for Multilingual Unsupervised or Supervised word Embeddings
License: Other
Hi, @glample
Here is the output during training, and I found these files (best_mapping.t7, params.pkl, train.log, vectors-latin.txt, vectors-zh.txt) in the dumped folder. But how can I make use of these word embeddings? Right now I simply take a word's vector from the source embedding file and find the 10 words whose vectors in the target embedding file are most similar to it. Is this the right way? The outputs below show two methods, nn and csls_knn_10; do I need to care about the difference between them? Thank you!
==============================================
OrderedDict([('n_iter', 149)])
nn
path:data/crosslingual/dictionaries/zh-latin.5000-6500.txt
INFO - 01/24/18 09:18:02 - 1:28:43 - Found 8552 pairs of words in the dictionary (8552 unique). 9297 other pairs contained at least one unknown word (4953 in lang1, 6954 in lang2)
INFO - 01/24/18 09:18:02 - 1:28:43 - 8552 source words - nn - Precision at k = 1: 10.289991
INFO - 01/24/18 09:18:02 - 1:28:43 - 8552 source words - nn - Precision at k = 5: 17.516370
INFO - 01/24/18 09:18:02 - 1:28:44 - 8552 source words - nn - Precision at k = 10: 21.246492
csls_knn_10
path:data/crosslingual/dictionaries/zh-latin.5000-6500.txt
INFO - 01/24/18 09:18:02 - 1:28:44 - Found 8552 pairs of words in the dictionary (8552 unique). 9297 other pairs contained at least one unknown word (4953 in lang1, 6954 in lang2)
INFO - 01/24/18 09:18:05 - 1:28:46 - 8552 source words - csls_knn_10 - Precision at k = 1: 9.985968
INFO - 01/24/18 09:18:05 - 1:28:46 - 8552 source words - csls_knn_10 - Precision at k = 5: 16.300281
INFO - 01/24/18 09:18:05 - 1:28:47 - 8552 source words - csls_knn_10 - Precision at k = 10: 20.053789
INFO - 01/24/18 09:18:10 - 1:28:51 - Building the train dictionary ...
INFO - 01/24/18 09:18:10 - 1:28:51 - New train dictionary of 7721 pairs.
INFO - 01/24/18 09:18:10 - 1:28:51 - Mean cosine (nn method, S2T build, 10000 max size): 0.49470
INFO - 01/24/18 09:18:24 - 1:29:05 - Building the train dictionary ...
INFO - 01/24/18 09:18:24 - 1:29:05 - New train dictionary of 6263 pairs.
INFO - 01/24/18 09:18:24 - 1:29:05 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.50656
INFO - 01/24/18 09:18:24 - 1:29:05 - log:{"n_iter": 149, "precision_at_1-nn": 10.28999064546305, "precision_at_5-nn": 17.516370439663238, "precision_at_10-nn": 21.246492048643592, "precision_at_1-csls_knn_10": 9.985968194574369, "precision_at_5-csls_knn_10": 16.300280636108514, "precision_at_10-csls_knn_10": 20.053788587464922, "mean_cosine-nn-S2T-10000": 0.49470236897468567, "mean_cosine-csls_knn_10-S2T-10000": 0.5065571665763855}
INFO - 01/24/18 09:18:24 - 1:29:05 - End of refinement iteration 149.
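Taking a word's source vector and searching for the closest target vectors is exactly what the nn method does; csls_knn_10 is a variant designed to mitigate the hubness problem. A minimal nearest-neighbor sketch over two exported .vec files (file names, max_vocab, and the helper functions are assumptions, not MUSE code):

```python
import numpy as np

def load_vec(path, max_vocab=50000):
    # fastText-style .vec file: first line is "count dim", then "word v1 ... vd"
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        next(f)  # skip header
        for i, line in enumerate(f):
            if i >= max_vocab:
                break
            w, rest = line.rstrip().split(" ", 1)
            words.append(w)
            vecs.append(np.array(rest.split(), dtype=np.float32))
    emb = np.vstack(vecs)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit norm, so dot = cosine
    return words, emb

def translate_nn(word, src_words, src_emb, tgt_words, tgt_emb, k=10):
    # nearest-neighbor retrieval in the aligned space
    scores = tgt_emb @ src_emb[src_words.index(word)]
    return [tgt_words[j] for j in np.argsort(-scores)[:k]]
```

Usage would be something like `translate_nn("好", *load_vec("vectors-zh.txt"), *load_vec("vectors-latin.txt"))`, assuming both files were exported by the same run and are therefore in the same space.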
Here are my settings below; the rest of the parameters were left at their defaults.
The word vectors and the zh-en dictionary were downloaded from the official site.
export SRC_EMB=/home/jack/dev1.8t/corpus/zh-en/wiki.zh.vec
export TGT_EMB=/home/jack/dev1.8t/corpus/zh-en/wiki.en.vec
nohup python unsupervised.py --src_lang zh --tgt_lang en --src_emb $SRC_EMB --tgt_emb $TGT_EMB --cuda 1 --export 1 --exp_path ./dumped/unsuperv/zh-mn --emb_dim 300 --refinement true --adversarial true > zh-en-unsuper.log &
But the results are all 0s:
Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 1: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 5: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 10: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 5: 0.000000
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 10: 0.000000
python3 supervised.py --src_lang en --tgt_lang es --src_emb ../wiki.en.vec --tgt_emb ../Spanish_wiki.es.vec --n_iter 5 --dico_train identical_char --cuda False
Failed to load GPU Faiss: No module named 'swigfaiss_gpu'
Faiss falling back to CPU-only.
Impossible to import Faiss-GPU. Switching to FAISS-CPU, this will be slower.
INFO - 01/06/18 12:20:38 - 0:00:00 - ============ Initialized logger ============
INFO - 01/06/18 12:20:38 - 0:00:00 - cuda: False
dico_build: S2T&T2S
dico_max_rank: 10000
dico_max_size: 0
dico_method: csls_knn_10
dico_min_size: 0
dico_threshold: 0
dico_train: identical_char
emb_dim: 300
exp_path: /Documents/MUSE-master/dumped/nrthsd26ay
export: True
max_vocab: 200000
n_iters: 5
normalize_embeddings:
seed: -1
src_emb: ../wiki.en.vec
src_lang: en
tgt_emb: ../Spanish_wiki.es.vec
tgt_lang: es
verbose: 2
INFO - 01/06/18 12:20:38 - 0:00:00 - The experiment will be stored in /Documents/MUSE-master/dumped/nrthsd26ay
INFO - 01/06/18 12:20:48 - 0:00:10 - Loaded 200000 pre-trained word embeddings
INFO - 01/06/18 12:21:02 - 0:00:24 - Loaded 200000 pre-trained word embeddings
INFO - 01/06/18 12:21:04 - 0:00:26 - Found 85912 pairs of identical character strings.
INFO - 01/06/18 12:21:05 - 0:00:26 - Starting refinement iteration 0...
INFO - 01/06/18 12:22:14 - 0:01:36 - ====================================================================
INFO - 01/06/18 12:22:14 - 0:01:36 - Dataset Found Not found Rho
INFO - 01/06/18 12:22:14 - 0:01:36 - ====================================================================
here:
n--> 771
m--> 771
Traceback (most recent call last):
File "supervised.py", line 92, in <module>
evaluator.all_eval(to_log)
File "/Documents/MUSE-master/src/evaluation/evaluator.py", line 188, in all_eval
self.monolingual_wordsim(to_log)
File "/Documents/MUSE-master/src/evaluation/evaluator.py", line 43, in monolingual_wordsim
self.mapping(self.src_emb.weight).data.cpu().numpy()
File "/Documents/MUSE-master/src/evaluation/wordsim.py", line 104, in get_wordsim_scores
coeff, found, not_found = get_spearman_rho(word2id, embeddings, filepath, lower)
File "/Documents/MUSE-master/src/evaluation/wordsim.py", line 83, in get_spearman_rho
return spearmanr(gold, pred).correlation, len(gold), not_found
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/scipy/stats/stats.py", line 3301, in spearmanr
rho, pval = mstats_basic.spearmanr(a, b, axis)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/scipy/stats/mstats_basic.py", line 461, in spearmanr
raise ValueError("The input must have at least 3 entries!")
ValueError: The input must have at least 3 entries!
Hi,
after the alignment I can see that the source embeddings have also changed; both spaces have been "moved". Is there any way to anchor the source space and modify only the target?
Hi, I was trying to train an alignment of embeddings from German to Japanese. Since we don't have a German-Japanese translation dictionary, I was using unsupervised alignment, assuming no dictionary is required. However, during training, unsupervised.py calls evaluator.all_eval and self.word_translation from evaluator.py and word_translation.py when it evaluates the embeddings / discriminator, and the program stops because no German-Japanese dictionary is found. Is a translation dictionary still required to perform unsupervised alignment in this case, or is there a parameter I need to change to avoid this?
Is it possible to provide some examples of how to use the monolingual aligned vectors (unsupervised) to generate translations? I have the aligned vectors for Malayalam-Hindi; an example of how to proceed further would be helpful.
After running the supervised alignment script, the embedding vectors are not dumped to the dump directory. Instead I get this error at the end of the last refinement iteration:
flags:
n_refinement = 1
export txt
cuda False
langs:
hi-en
INFO - 06/04/18 15:26:33 - 3:15:50 - End of iteration 1.
INFO - 06/04/18 15:26:33 - 3:15:50 - * Reloading the best model from /home/ravi/muse/MUSE/dumped/debug/n31qnn6tl6/best_mapping.pth ...
INFO - 06/04/18 15:26:33 - 3:15:50 - Reloading all embeddings for mapping ...
INFO - 06/04/18 15:26:50 - 3:16:08 - Loaded 158016 pre-trained word embeddings.
INFO - 06/04/18 15:30:55 - 3:20:12 - Loaded 2519370 pre-trained word embeddings.
Traceback (most recent call last):
File "./MUSE/supervised.py", line 109, in <module>
trainer.export()
File "/home/myDir/muse/MUSE/src/trainer.py", line 252, in export
params.tgt_dico, tgt_emb = load_embeddings(params, source=False, full_vocab=True)
File "/home/myDir/muse/MUSE/src/utils.py", line 406, in load_embeddings
return read_txt_embeddings(params, source, full_vocab)
File "/home/myDir/muse/MUSE/src/utils.py", line 310, in read_txt_embeddings
embeddings = torch.from_numpy(embeddings).float()
RuntimeError: $ Torch: not enough memory: you tried to allocate 2GB. Buy new RAM! at /pytorch/aten/src/TH/THGeneral.c:218
can anyone help me out with this?
Thanks in advance.
INFO - 05/01/18 22:03:49 - 0:00:07 - Found 3896 pairs of words in the dictionary (1919 unique). 3836 other pairs contained at least one unknown word (1388 in lang1, 3089 in lang2)
INFO - 05/01/18 22:03:49 - 0:00:08 - 1919 source words - nn - Precision at k = 1: 9.692548
INFO - 05/01/18 22:03:49 - 0:00:08 - 1919 source words - nn - Precision at k = 5: 22.094841
INFO - 05/01/18 22:03:50 - 0:00:08 - 1919 source words - nn - Precision at k = 10: 29.233976
INFO - 05/01/18 22:03:50 - 0:00:08 - Found 3896 pairs of words in the dictionary (1919 unique). 3836 other pairs contained at least one unknown word (1388 in lang1, 3089 in lang2)
INFO - 05/01/18 22:03:50 - 0:00:09 - 1919 source words - csls_knn_10 - Precision at k = 1: 9.015112
INFO - 05/01/18 22:03:51 - 0:00:09 - 1919 source words - csls_knn_10 - Precision at k = 5: 19.124544
INFO - 05/01/18 22:03:51 - 0:00:09 - 1919 source words - csls_knn_10 - Precision at k = 10: 24.856696
Traceback (most recent call last):
File "supervised.py", line 98, in <module>
evaluator.all_eval(to_log)
File "/home/jack/software/MUSE/src/evaluation/evaluator.py", line 217, in all_eval
self.dist_mean_cosine(to_log)
File "/home/jack/software/MUSE/src/evaluation/evaluator.py", line 197, in dist_mean_cosine
s2t_candidates = get_candidates(src_emb, tgt_emb, _params)
File "/home/jack/software/MUSE/src/dico_builder.py", line 38, in get_candidates
scores = emb2.mm(emb1[i:min(n_src, i + bs)].transpose(0, 1)).transpose(0, 1)
ValueError: result of slicing is an empty tensor
I have specified --dico_max_rank 4000 --dico_method csls_knn_10, but the message from /home/jack/software/MUSE/src/dico_builder.py, line 38, still says 10000 and nn.
I was trying to build cross-lingual word embeddings for Malayalam and Hindi.
Environment : Ubuntu 16, 8CPUs/52GB RAM, Tesla K80, Google Cloud, CUDA 8, Python 3.6, Faiss not installed
This is what I did,
curl -Lo data/wiki.ml.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.ml.vec
curl -Lo data/wiki.hi.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.hi.vec
Then
python unsupervised.py --src_lang ml --tgt_lang hi --src_emb ../data/wiki.ml.vec --tgt_emb ../data/wiki.hi.vec
After running it around 10 mins, I got this error,
INFO - 12/27/17 13:10:56 - 0:10:37 - 988000 - Discriminator loss: 0.4106 - 3290 samples/s
INFO - 12/27/17 13:10:58 - 0:10:39 - 992000 - Discriminator loss: 0.4109 - 3339 samples/s
INFO - 12/27/17 13:11:00 - 0:10:42 - 996000 - Discriminator loss: 0.4110 - 3344 samples/s
Traceback (most recent call last):
File "unsupervised.py", line 135, in <module>
evaluator.all_eval(to_log)
File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/evaluator.py", line 190, in all_eval
self.word_translation(to_log)
File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/evaluator.py", line 94, in word_translation
method=method
File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/word_translation.py", line 88, in get_word_translation_accuracy
dico = load_dictionary(path, word2id1, word2id2)
File "/home/jamsheer/jamsheer/fasttext/MUSE/src/evaluation/word_translation.py", line 48, in load_dictionary
assert os.path.isfile(path)
AssertionError
INFO - 03/01/18 16:49:10 - 0:00:00 - The experiment will be stored in /usr2/home/nkvyas/Bayesian/MUSE/dumped/6uymzfdoug
Traceback (most recent call last):
File "unsupervised.py", line 91, in <module>
src_emb, tgt_emb, mapping, discriminator = build_model(params, True)
File "/usr2/home/nkvyas/Bayesian/MUSE/src/models.py", line 46, in build_model
src_dico, _src_emb = load_external_embeddings(params, source=True)
File "/usr2/home/nkvyas/Bayesian/MUSE/src/utils.py", line 282, in load_external_embeddings
assert len(split) == 2
AssertionError
Hi all,
Not sure if this is an error on my end or not, but as I couldn't debug it myself I thought I would post it here. The same happens with both py2 and py3.
Using:
numpy (1.13.3)
torch (0.3.0.post4)
torchvision (0.2.0)
No Faiss installed.
When I run:
CUDA_VISIBLE_DEVICES=2,3 /home/sevajuri/anaconda3/envs/torch3/bin/python unsupervised.py --src_lang en --tgt_lang fr --src_emb wiki.en.vec --tgt_emb wiki.fr.vec --max_vocab=1000000
I get (after the first epoch):
Traceback (most recent call last):
File "unsupervised.py", line 135, in <module>
evaluator.all_eval(to_log)
File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/evaluator.py", line 190, in all_eval
self.word_translation(to_log)
File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/evaluator.py", line 94, in word_translation
method=method
File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/word_translation.py", line 88, in get_word_translation_accuracy
dico = load_dictionary(path, word2id1, word2id2)
File "/home/sevajuri/projects/clef17/MUSE/src/evaluation/word_translation.py", line 48, in load_dictionary
assert os.path.isfile(path)
AssertionError
Hi, I was wondering how the 5000+ and 1500+ pairs were selected to build the training/testing dictionaries. As the full dictionary can contain 100K+ pairs, do we just take the most frequent words? I understand the pre-defined dictionary is only used in the first iteration of supervised training, but how much does the initial selection of translation pairs affect the alignment performance? Another question: why select 5000? Would including more translation pairs in the training dictionary help? Thanks in advance!
According to the paper, the only trainable parameter is W, which maps the source embeddings into the target space. So why are the target embeddings written out to a separate file, apparently different from the initial target embeddings?
I trained fastText embeddings, all basic commands:
./fasttext cbow -input cleaned.et -thread 16 -dim 300 -output mdl.skip.et
./fasttext cbow -input cleaned.en -thread 16 -dim 300 -output mdl.skip.en
Now I wanted to train supervised MUSE, code:
python unsupervised.py --src_lang et --tgt_lang en --src_emb mdl.skip.et.bin --tgt_emb mdl.skip.en.bin --export pth --exp_name eten-skip-300 --emb_dim 300
I get the following stacktrace:
INFO - 04/13/18 09:24:37 - 0:00:00 - The experiment will be stored in /gpfs/hpchome/b02166/thesis/upd_muse/MUSE/dumped/eten-skip-300/c346kmebk8
Traceback (most recent call last):
File "supervised.py", line 69, in <module>
src_emb, tgt_emb, mapping, _ = build_model(params, False)
File "/gpfs/hpchome/b02166/thesis/upd_muse/MUSE/src/models.py", line 46, in build_model
src_dico, _src_emb = load_embeddings(params, source=True)
File "/gpfs/hpchome/b02166/thesis/upd_muse/MUSE/src/utils.py", line 402, in load_embeddings
return load_bin_embeddings(params, source, full_vocab)
File "/gpfs/hpchome/b02166/thesis/upd_muse/MUSE/src/utils.py", line 369, in load_bin_embeddings
assert embeddings.size() == (len(words), params.emb_dim)
AssertionError
So there is an AssertionError in load_bin_embeddings in utils.py, at assert embeddings.size() == (len(words), params.emb_dim). What am I doing wrong here?
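One quick sanity check: that assertion compares the loaded matrix against (len(words), params.emb_dim), so it fires when the trained model's dimension differs from --emb_dim. A small sketch (hypothetical helper; the file name is just an example) that reads the header of the .vec file fastText writes next to the .bin:

```python
def vec_header(path):
    # fastText writes "<vocab_size> <dim>" as the first line of its .vec output
    with open(path, encoding="utf-8") as f:
        n, d = f.readline().split()
    return int(n), int(d)

# e.g. vec_header("mdl.skip.et.vec") should report dim 300 to match --emb_dim 300
```

If the fasttext Python bindings are installed, `fasttext.load_model("mdl.skip.et.bin").get_dimension()` gives the same check on the .bin directly.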
According to the docs, when set to "identical_char" it will use identical character strings between the source and target languages to form a vocabulary. I understood that the dictionary was going to be created from the given corpus.
Hi, this question is not about the code implementation itself but about a detail of the original paper; since I don't know of a better place to ask, I'm putting it here.
From the orthogonal Procrustes problem, we know that W* = UV^T, where UΣV^T = SVD(YX^T). Why not just compute this SVD directly instead of introducing the adversarial training approach?
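For reference, when a seed dictionary does give paired embedding matrices X (source) and Y (target), the Procrustes solution really is just a few lines of NumPy. A sketch (note the adversarial stage exists precisely because, without any dictionary, no such row pairing is available to feed this SVD):

```python
import numpy as np

def procrustes(X, Y):
    # X, Y: (n, d) row-wise paired embeddings.
    # Returns the orthogonal W minimizing ||X W^T - Y||_F, i.e. W @ x ≈ y.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt
```

MUSE's refinement step alternates this closed-form update with rebuilding the synthetic dictionary, which is why adversarial training is only needed to get the first rough alignment.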
I am getting the error below, any idea?
Solving environment: failed
PackagesNotFoundError: The following packages are not available from current channels:
faiss-cpu
Current channels:
https://conda.anaconda.org/pytorch/osx-64
https://conda.anaconda.org/pytorch/noarch
https://repo.continuum.io/pkgs/main/osx-64
https://repo.continuum.io/pkgs/main/noarch
https://repo.continuum.io/pkgs/free/osx-64
https://repo.continuum.io/pkgs/free/noarch
https://repo.continuum.io/pkgs/r/osx-64
https://repo.continuum.io/pkgs/r/noarch
https://repo.continuum.io/pkgs/pro/osx-64
https://repo.continuum.io/pkgs/pro/noarch
(followed by conda's activate/deactivate clang_osx-64.sh and clangxx_osx-64.sh environment-change listings, omitted)
Wow, I was just working with it and you changed so much of the readme and now actually provide embeddings for 30 languages in one space!
Fantastic, thank you!
I was about to ask how it would be possible to have more than 2 languages in one space, and how accurate that still is.
One problem I encountered is that DIC_EVAL_PATH in src/evaluation/word_translation.py is not changeable with an argument, so it always looks in the default path.
Maybe add another argument for that path, or just one path parameter under which it looks for train.txt and eval.txt, or src_lang-tgt_lang-train.txt, ...
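Until such a flag exists, a local workaround is to thread the dictionary directory through argparse instead of the module-level DIC_EVAL_PATH constant. A sketch (the --dico_eval_path flag is made up, not an actual MUSE option):

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--dico_eval_path", type=str,
                    default=os.path.join("data", "crosslingual", "dictionaries"),
                    help="directory holding <src>-<tgt>.5000-6500.txt files")
# Demonstration: override the default directory
params = parser.parse_args(["--dico_eval_path", "/my/dicts"])

# Build the evaluation-dictionary path from the flag instead of DIC_EVAL_PATH
path = os.path.join(params.dico_eval_path, "%s-%s.5000-6500.txt" % ("en", "de"))
```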
I am having a hard time figuring out what embeddings to train the model on. Your paper says: "As a result, we only feed the discriminator with the 50,000 most frequent words."
Does that mean that input files data/wiki.en.vec and data/wiki.es.vec each contain 50,000 words only?
If so, how do you then include 200,000 words in the experiments if output files data/wiki.en-es.en.vec and data/wiki.en-es.es.vec only contain 50,000 words each?
python supervised.py --src_lang zh --tgt_lang en --src_emb /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.zh.vec --tgt_emb /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.en.vec --n_refinement 5
Training finishes successfully, but then a TypeError is thrown:
INFO - 04/26/18 08:42:05 - 0:00:00 - ============ Initialized logger ============
INFO - 04/26/18 08:42:05 - 0:00:00 - cuda: True
dico_build: S2T&T2S
dico_eval: default
dico_max_rank: 10000
dico_max_size: 0
dico_method: csls_knn_10
dico_min_size: 0
dico_threshold: 0
dico_train: default
emb_dim: 300
exp_id:
exp_name: debug
exp_path: /u/xiamengx/src/nlpcc/MUSE/dumped/debug/ufwt0sef93
export: txt
max_vocab: 200000
n_refinement: 5
normalize_embeddings:
seed: -1
src_emb: /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.zh.vec
src_lang: zh
tgt_emb: /Tmp/xiamengx/cc18-st1-data/fasttext-embeddings/wiki.en.vec
tgt_lang: en
verbose: 2
INFO - 04/26/18 08:42:05 - 0:00:00 - The experiment will be stored in /u/xiamengx/src/nlpcc/MUSE/dumped/debug/ufwt0sef93
INFO - 04/26/18 08:42:24 - 0:00:18 - Loaded 200000 pre-trained word embeddings.
INFO - 04/26/18 08:42:48 - 0:00:43 - Loaded 200000 pre-trained word embeddings.
INFO - 04/26/18 08:42:53 - 0:00:48 - Found 8891 pairs of words in the dictionary (5000 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/26/18 08:42:54 - 0:00:48 - Starting iteration 0...
INFO - 04/26/18 08:42:54 - 0:00:48 - ====================================================================
INFO - 04/26/18 08:42:54 - 0:00:48 - Dataset Found Not found Rho
INFO - 04/26/18 08:42:54 - 0:00:48 - ====================================================================
INFO - 04/26/18 08:42:54 - 0:00:48 - EN_WS-353-REL 252 0 0.6820
INFO - 04/26/18 08:42:54 - 0:00:48 - EN_VERB-143 144 0 0.3973
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_MEN-TR-3k 3000 0 0.7637
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_WS-353-SIM 203 0 0.7811
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_SEMEVAL17 379 9 0.7216
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_SIMLEX-999 998 1 0.3823
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_YP-130 130 0 0.5333
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_WS-353-ALL 353 0 0.7388
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_MC-30 30 0 0.8123
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_MTurk-287 286 1 0.6773
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_MTurk-771 771 0 0.6689
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_RG-65 65 0 0.7974
INFO - 04/26/18 08:42:54 - 0:00:49 - EN_RW-STANFORD 1323 711 0.5080
INFO - 04/26/18 08:42:54 - 0:00:49 - ====================================================================
INFO - 04/26/18 08:42:54 - 0:00:49 - Monolingual target word similarity score average: 0.65108
INFO - 04/26/18 08:42:54 - 0:00:49 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/26/18 08:42:55 - 0:00:49 - 2483 source words - nn - Precision at k = 1: 5.598067
INFO - 04/26/18 08:42:55 - 0:00:50 - 2483 source words - nn - Precision at k = 5: 15.304068
INFO - 04/26/18 08:42:55 - 0:00:50 - 2483 source words - nn - Precision at k = 10: 21.546516
INFO - 04/26/18 08:42:55 - 0:00:50 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/26/18 08:43:08 - 0:01:02 - 2483 source words - csls_knn_10 - Precision at k = 1: 15.183246
INFO - 04/26/18 08:43:08 - 0:01:02 - 2483 source words - csls_knn_10 - Precision at k = 5: 34.071687
INFO - 04/26/18 08:43:08 - 0:01:03 - 2483 source words - csls_knn_10 - Precision at k = 10: 42.690294
INFO - 04/26/18 08:43:15 - 0:01:09 - Building the train dictionary ...
INFO - 04/26/18 08:43:15 - 0:01:09 - New train dictionary of 6665 pairs.
INFO - 04/26/18 08:43:15 - 0:01:09 - Mean cosine (nn method, S2T build, 10000 max size): 0.60218
INFO - 04/26/18 08:43:48 - 0:01:42 - Building the train dictionary ...
INFO - 04/26/18 08:43:48 - 0:01:42 - New train dictionary of 5529 pairs.
INFO - 04/26/18 08:43:48 - 0:01:42 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.59617
Traceback (most recent call last):
File "supervised.py", line 101, in <module>
logger.info("__log__:%s" % json.dumps(to_log))
File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/__init__.py", line 230, in dumps
return _default_encoder.encode(obj)
File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/encoder.py", line 198, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/encoder.py", line 256, in iterencode
return _iterencode(o, 0)
File "/u/xiamengx/anaconda3/envs/nlpcc/lib/python3.5/json/encoder.py", line 179, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: tensor(0.6022, device='cuda:0') is not JSON serializable
Python 3.5.5 :: Anaconda, Inc
>>> torch.__version__
'0.4.0'
MUSE commit: a620cc8aa394d4eb345ecdeda5067b0b1ef30a6a
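A workaround several users applied (a sketch, assuming the offending values in to_log are 0-dim tensors): unwrap anything exposing .item() into a plain Python number before calling json.dumps in supervised.py.

```python
import json

def to_jsonable(log):
    # 0-dim torch tensors (and numpy scalars) expose .item(); unwrap them
    # so that json.dumps sees plain Python floats/ints.
    return {k: (v.item() if hasattr(v, "item") else v) for k, v in log.items()}

# logger.info("__log__:%s" % json.dumps(to_jsonable(to_log)))
```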
Faiss can now be installed using:
conda install faiss-cpu -c pytorch
# OR
conda install faiss-gpu -c pytorch
Hi, I am new to deep learning and I got the errors shown below even though I have already installed Faiss, PyTorch, CUDA, and Anaconda.
Failed to load GPU Faiss: No module named _swigfaiss_gpu
Faiss falling back to CPU-only.
Impossible to import Faiss library!! Switching to standard nearest neighbors search implementation, this will be significantly slower.
Traceback (most recent call last):
File "unsupervised.py", line 80, in <module>
assert not params.cuda or torch.cuda.is_available()
AssertionError
Hey,
is there any token for OOV words in the released models? If I understand correctly, the projection works with a finite set of word embeddings, not with the fastText model itself. So there is no way to use the original fastText mechanism to deal with OOV words, right?
Hi,
I tried training MUSE in the unsupervised way with the pretrained fasttext Wikipedia embeddings. On some European language pairs, such as EN-DE or EN-ES, I was able to get reasonable performance using the default parameters.
However, for EN-ZH or ZH-EN with the default parameters, the cross-lingual word similarity scores are always 0 (even for top 10).
As a comparison, to rule out problems with the data, I ran the supervised setting for EN-ZH, and it gave non-zero performance (though the number is a few points lower than that in the paper).
Any idea of what I might have done wrong?
Thank you.
While training embeddings with unsupervised.py, after changing the flag --dis_most_frequent from its default 75000 to 99999, I got the following error:
Traceback (most recent call last):
File "unsupervised.py", line 135, in <module>
evaluator.all_eval(to_log)
File "/data/sls/qcri/asr/sjoty/sbmaruf/MUSE/MUSE/src/evaluation/evaluator.py", line 192, in all_eval
self.dist_mean_cosine(to_log)
File "/data/sls/qcri/asr/sjoty/sbmaruf/MUSE/MUSE/src/evaluation/evaluator.py", line 173, in dist_mean_cosine
t2s_candidates = get_candidates(tgt_emb, src_emb, _params)
File "/data/sls/qcri/asr/sjoty/sbmaruf/MUSE/MUSE/src/dico_builder.py", line 125, in get_candidates
all_scores = all_scores[:params.dico_max_size]
RuntimeError: invalid argument 2: dimension 0 out of range of 0D tensor at /opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/TH/generic/THTensor.c:24
Hello, in the paper CSLS is calculated as CSLS(Wx, y) = 2 cos(Wx, y) − r_T(Wx) − r_S(y).
Since all three terms are pairwise cosine similarities (or mean similarities to the k nearest neighbors), I expected each tensor to have size [dictionary_size, 1]. However, when I look at the code, I found that cos(Wx, y) has size [128, dictionary_size], and average_dist1[i:min(n_src, i + bs)][:, None] + average_dist2[None, :] also has size [128, dictionary_size]. I wonder what this 128 means, or have I misunderstood some detail? Thank you!
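The 128 is just an evaluation batch size: scores are computed for a block of source words at a time, so the intermediate tensors have shape [batch, n_tgt] rather than [dictionary_size, 1]. A NumPy sketch of the same batched CSLS computation (names and layout assumed, not the exact MUSE code):

```python
import numpy as np

def csls(src, tgt, k=10, bs=128):
    # src: (n_src, d), tgt: (n_tgt, d), rows unit-normalized so dot = cosine
    sim_t = tgt @ src.T                                    # each target vs all mapped sources
    r_s = np.sort(sim_t, axis=1)[:, -k:].mean(axis=1)      # r_S(y): mean sim to k-NN, (n_tgt,)
    out = np.empty((src.shape[0], tgt.shape[0]))
    for i in range(0, src.shape[0], bs):                   # process source words in batches
        block = src[i:i + bs] @ tgt.T                      # shape [<=bs, n_tgt]: the "128"
        r_t = np.sort(block, axis=1)[:, -k:].mean(axis=1)  # r_T(Wx) for this batch
        out[i:i + bs] = 2 * block - r_t[:, None] - r_s[None, :]
    return out
```

Batching changes only the memory footprint, not the result: each row's penalty r_T depends solely on that row's similarities to all targets.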
When trying with cuda set to true, I am getting this error at supervised.py#L103:
TypeError: tensor(0.5628, device='cuda:0') is not JSON serializable
I tried to run the default data and program instructions for "Word translation without parallel data" for the En<--->Es language pair.
@glample
Why am I getting these errors in the end of the processing running unsupervised.py ?
(Errors are mentioned in the attachments.)
I am aligning English and Hindi fastText monolingual embeddings using the supervised method on a GPU. Are there any time estimates for how long it takes? It's been 4 hours, and it is still in the first refinement step.
I ran the following command:
python supervised.py --src_lang en --tgt_lang hi --src_emb wiki.en.vec --tgt_emb wiki.hi.vec --n_iter 5 --dico_train default
Update: it was running for close to 20 hours on a GeForce GTX 1080, constantly hogging 1 CPU core, but no entries were added to the log. I am running it again.
Log:
INFO - 12/27/17 17:57:14 - 0:00:00 - ============ Initialized logger ============
INFO - 12/27/17 17:57:14 - 0:00:00 - cuda: True
dico_build: S2T&T2S
dico_max_rank: 10000
dico_max_size: 0
dico_method: csls_knn_10
dico_min_size: 0
dico_threshold: 0
dico_train: default
emb_dim: 300
exp_path: /MUSE/dumped/hidden
export: True
max_vocab: 200000
n_iters: 5
normalize_embeddings:
seed: -1
src_emb: wiki.en.vec
src_lang: en
tgt_emb: wiki.hi.vec
tgt_lang: hi
verbose: 2
INFO - 12/27/17 17:57:14 - 0:00:00 - The experiment will be stored in hidden/MUSE/dumped/hidden
INFO - 12/27/17 17:57:25 - 0:00:11 - Loaded 200000 pre-trained word embeddings
INFO - 12/27/17 17:57:45 - 0:00:31 - Loaded 158016 pre-trained word embeddings
INFO - 12/27/17 17:57:49 - 0:00:34 - Found 8704 pairs of words in the dictionary (4998 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 12/27/17 17:57:49 - 0:00:34 - Starting refinement iteration 0...
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 - Dataset Found Not found Rho
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_MTurk-771 771 0 0.6689
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_MTurk-287 286 1 0.6773
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_SIMLEX-999 998 1 0.3823
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_WS-353-REL 252 0 0.6820
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_RW-STANFORD 1323 711 0.5080
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_MC-30 30 0 0.8123
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_WS-353-ALL 353 0 0.7388
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_VERB-143 144 0 0.3973
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_MEN-TR-3k 3000 0 0.7637
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_YP-130 130 0 0.5333
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_RG-65 65 0 0.7974
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_SEMEVAL17 379 9 0.7216
INFO - 12/27/17 17:57:49 - 0:00:35 - EN_WS-353-SIM 203 0 0.7811
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 - Monolingual source word similarity score average: 0.65108
INFO - 12/27/17 17:57:49 - 0:00:35 - Found 2032 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 12/27/17 17:57:50 - 0:00:36 - 1500 source words - nn - Precision at k = 1: 23.800000
INFO - 12/27/17 17:57:51 - 0:00:36 - 1500 source words - nn - Precision at k = 5: 41.133333
INFO - 12/27/17 17:57:51 - 0:00:37 - 1500 source words - nn - Precision at k = 10: 48.133333
INFO - 12/27/17 17:57:51 - 0:00:37 - Found 2032 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
Hi,
I'm running the unsupervised alignment network on two sets of embeddings, one of which is in an undeciphered language.
I don't really care about the part of the training code that does evaluation against built-in dictionaries, since that isn't really well-defined for my application. Thus, I've tried running unsupervised.py with the default values (es-en) even though my embeddings are in Latin and an unknown language.
This works for the first epoch, but then it gives me the following error message after the epoch finishes and it tries to enter the evaluation code:
Traceback (most recent call last):
File "unsupervised.py", line 137, in <module>
evaluator.all_eval(to_log)
File "/gpfs/loomis/home.grace/fas/frank/wcm24/voynich2vec/MUSE/src/evaluation/evaluator.py", line 190, in all_eval
self.word_translation(to_log)
File "/gpfs/loomis/home.grace/fas/frank/wcm24/voynich2vec/MUSE/src/evaluation/evaluator.py", line 94, in word_translation
method=method
File "/gpfs/loomis/home.grace/fas/frank/wcm24/voynich2vec/MUSE/src/evaluation/word_translation.py", line 92, in get_word_translation_accuracy
assert dico[:, 0].max() < emb1.size(0)
IndexError: trying to index 2 dimensions of a 0 dimensional tensor
Any idea what could be going wrong? Or how I could just disable the part of the evaluation that is causing these issues? I have tried commenting out some lines in the code, but this always leaves me with other errors. I would prefer a cleaner solution.
Thanks!
Thanks for the software. Is there any document on controlling how many CPU threads MUSE spawns? I'm running it with FAISS-CPU.
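I don't know of a MUSE-specific flag for this, but since FAISS-CPU is OpenMP-backed, capping OMP_NUM_THREADS before the libraries are imported usually works. A sketch (the thread count 4 is just an example):

```python
import os

# Must be set before faiss/torch/numpy are imported to take effect.
os.environ["OMP_NUM_THREADS"] = "4"

# If faiss is already imported, its thread count can also be set at runtime
# (assuming a faiss build that exposes omp_set_num_threads):
# import faiss
# faiss.omp_set_num_threads(4)
```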
Apart from major language pairs like 'en-es', there are missing words in the ground-truth bilingual dictionaries.
For example,
I tried a simple shell pipeline for counting unique source words (using sort -u so that non-adjacent duplicates are also collapsed):
awk -F' ' '{print $1}' no-en.0-5000.txt | sort -u | wc -l
Here is a sample of the resulting counts:
/en-ko.0-5000.txt 4870
/en-tr.0-5000.txt 4998
/en-vi.0-5000.txt 4993
/ko-en.0-5000.txt 4685
/ms-en.0-5000.txt 4998
/no-en.0-5000.txt 4999
/tr-en.0-5000.txt 4943
/vi-en.0-5000.txt 4998
/en-ko.5000-6500.txt 1465
/ko-en.5000-6500.txt 1461
/ms-en.5000-6500.txt 1499
/tr-en.5000-6500.txt 1499
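As a cross-check of the shell counts above, the same tally can be done in a few lines of Python (the path is an example):

```python
def count_unique_source_words(path):
    """Count distinct first-column (source) words in a MUSE dictionary file."""
    with open(path, encoding="utf-8") as f:
        return len({line.split()[0] for line in f if line.strip()})

# e.g. count_unique_source_words("no-en.0-5000.txt")
```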
Hi,
It seems I could not find the EN-EO dictionary in the list of released dictionaries to reproduce Table 1 results. Are the Esperanto dictionaries released?
Thanks.
INFO - 01/23/18 01:13:58 - 0:06:00 - 4317 source words - csls_knn_10 - Precision at k = 1: 0.486449
INFO - 01/23/18 01:13:58 - 0:06:00 - 4317 source words - csls_knn_10 - Precision at k = 5: 0.625434
INFO - 01/23/18 01:13:58 - 0:06:00 - 4317 source words - csls_knn_10 - Precision at k = 10: 0.741256
INFO - 01/23/18 01:14:04 - 0:06:06 - Building the train dictionary ...
INFO - 01/23/18 01:14:04 - 0:06:06 - New train dictionary of 5223 pairs.
INFO - 01/23/18 01:14:04 - 0:06:06 - Mean cosine (nn method, S2T build, 10000 max size): 0.44828
INFO - 01/23/18 01:14:21 - 0:06:23 - Building the train dictionary ...
INFO - 01/23/18 01:14:21 - 0:06:23 - New train dictionary of 4368 pairs.
INFO - 01/23/18 01:14:21 - 0:06:23 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.45662
Above is part of the output during model training with unsupervised.py or supervised.py. What does "Precision" mean? Can it be used as a performance indicator for the trained model?
Thank you very much!
Hi, I was trying your method in the unsupervised setting with the en-fr language pair. I trained my embedding model using fastText on newstest2014. The vocabulary sizes of en and fr are 1962 and 2018.
I downloaded your ground-truth dictionary of en-fr full set, and put it on /MUSE/data/crosslingual/dictionaries/, and renamed it as "en-fr.5000-6500.txt".
I used the following command:
python unsupervised.py --src_lang en --tgt_lang fr --src_emb ../fastText-0.1.0/myemb/en.tok.vec --tgt_emb ../fastText-0.1.0/myemb/fr.tok.vec --dis_most_frequent 1000
And then I got the following logs and error. Could you please help me with that? Thanks!
===========================================================
INFO - 03/22/18 03:15:36 - 0:07:24 - 996000 - Discriminator loss: 1.5441 - 4455 samples/s
INFO - 03/22/18 03:15:38 - 0:07:25 - Found 1335 pairs of words in the dictionary (1014 unique) 111951 other pairs contained at least one unknown word (109965 in lang1, 110720 in lang2)
INFO - 03/22/18 03:15:38 - 0:07:25 - 1014 source words - nn - Precision at k = 1: 0.098619
INFO - 03/22/18 03:15:38 - 0:07:26 - 1014 source words - nn - Precision at k = 5: 0.295858
INFO - 03/22/18 03:15:38 - 0:07:26 - 1014 source words - nn - Precision at k = 10: 0.493097
INFO - 03/22/18 03:15:39 - 0:07:26 - Found 1335 pairs of words in the dictionary (1014 unique) 111951 other pairs contained at least one unknown word (109965 in lang1, 110720 in lang2)
INFO - 03/22/18 03:15:39 - 0:07:26 - 1014 source words - csls_knn_10 - Precision at k = 1: 0.00000
INFO - 03/22/18 03:15:39 - 0:07:26 - 1014 source words - csls_knn_10 - Precision at k = 5: 0.43097
INFO - 03/22/18 03:15:39 - 0:07:26 - 1014 source words - csls_knn_10 - Precision at k = 10: 0.91716
Traceback (most recent call last):
File "unsupervised.py", line 135, in <module>
evaluator.all_eval(to_log)
File "/var/storage/shared/pnrsy/sys/jobs/application_1513627406021_70260/MUSE/src/evaluation/evaluator.py", line 192, in all_eval
self.dist_mean_cosine(to_log)
File "/var/storage/shared/pnrsy/sys/jobs/application_1513627406021_70260/MUSE/src/evaluation/evaluator.py", line 172, in dist_mean_cosine
s2t_candidates = get_candidates(src_emb, tgt_emb, _params)
File "/var/storage/shared/pnrsy/sys/jobs/application_1513627406021_70260/MUSE/src/dico_builder.py", line 38, in get_candidates
scores = emb2.mm(emb1[i:min(n_src, i + bs)].transpose(0, 1)).transpose(0, 1)
ValueError: result of slicing is an empty tensor
What was the evaluation set used for table-2, both for results on Wacky (top 7 rows) and Wiki (bottom 2 rows)?
I got this error no matter how small I made the options.
Here are the options I specified. The dimensionality of the word embeddings trained by fastText is 300, and the GPU has 8 GB of memory:
--dis_hid_dim 32
--batch_size 5
--epoch_size 10
--n_epochs 1
Any suggestions? Thank you very much!
Suppose I have already obtained aligned vector spaces from an approach other than MUSE. Does it make sense to run those already-aligned vector spaces through the adversarial training proposed in MUSE? Will it make the vector spaces more aligned, and thus produce better results?
According to the code,
https://github.com/facebookresearch/MUSE/blob/master/src/utils.py#L293
unsupervised.py only takes a restricted number of vocabulary words (200,000 by default) from the embedding text file.
From the paper 'Word Translation without Parallel Data', section 3.2:
"The embedding quality of rare words is generally not as good as the one of frequent words and we observed that feeding the discriminator with rare words had a small but not negligible negative impact."
Which parameter did you add to handle this issue? From its name, params.dis_most_frequent seems to handle it, but in that case, why do we also restrict the total vocabulary with params.max_vocab?
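For what it's worth, my reading of the two flags is: max_vocab caps how many vectors are loaded at all (memory and evaluation), while dis_most_frequent only restricts which word ids the discriminator's training batches are sampled from. A toy sketch of that sampling (names and logic are my paraphrase, not the actual MUSE code):

```python
import random

max_vocab = 200000         # how many embeddings are loaded from the .vec file
dis_most_frequent = 75000  # discriminator batches only draw word ids below this rank
batch_size = 32

def sample_discriminator_ids():
    """Sample word ids for a discriminator batch from the most frequent words only."""
    limit = dis_most_frequent if dis_most_frequent > 0 else max_vocab
    return [random.randrange(limit) for _ in range(batch_size)]
```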
Thanks for this wonderful project!
I found I cannot evaluate on the cross-lingual word similarity task (i.e., the SEMEVAL17 task).
In get_evaluation.sh, the eval data are downloaded as crosslingual/wordsim/$lg_pair-SEMEVAL17.txt:
Line 93 in 26e3e40
But in src/evaluation/wordsim.py, the expected eval files look like $lg_pair/SEMEVAL17.txt:
MUSE/src/evaluation/wordsim.py
Lines 204 to 218 in 26e3e40
I am trying to create my own bilingual word embeddings and I am finding it difficult to understand some things. I will try to describe my problem in detail in the hope of finding some answers.
I have a parallel corpus and I want to create bilingual word embeddings. I preprocessed both sides and created my word vectors with gensim (I wish I had used fastText, but once again the docs were not clear). Now that I have both sets of embeddings, my question is: if I don't have the same number of words in both models, does MUSE still work?
I was wondering whether the English multilingual word embeddings are just a normalized subset (200K) of the fastText English embeddings?
I can see that the vectors differ between the two; I would be happy to know whether the reason is normalization or something else :)
Thanks!
Hi, I have a quick question about the embeddings used during evaluation.
In README.md, does the cross-lingual evaluation
python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000
use the pretrained embeddings, or embeddings that were first normalized and then mapped to the target space, as processed in trainer.export()?
It seems that the monolingual evaluation uses the pretrained embeddings in data/wiki.en.vec, which is different from data/wiki.en-es.en.vec. If the cross-lingual evaluation uses the exported embeddings, why does the code
src_emb = self.mapping(self.src_emb.weight).data.cpu().numpy()
in evaluator.py apply the mapping once again?
Am I getting anything wrong?
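To make concrete what "applying the mapping" amounts to, here is a tiny numeric sketch (toy numbers, not real embeddings): a source vector x is carried into the target space as y = W x, so mapping twice only coincides with mapping once when W is the identity.

```python
def apply_mapping(W, x):
    """Map a source vector x into the target space: y = W @ x (plain lists, no torch)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

# With the identity matrix the vector is unchanged; a learned W rotates it.
W_identity = [[1.0, 0.0], [0.0, 1.0]]
print(apply_mapping(W_identity, [0.3, 0.7]))  # [0.3, 0.7]
```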
I was trying to run the unsupervised mapping task and got this error:
INFO - 04/15/18 03:50:23 - 0:00:06 - 9 source words - csls_knn_10 - Precision at k = 10: 0.000000
Traceback (most recent call last):
File "unsupervised.py", line 136, in <module>
evaluator.all_eval(to_log)
File "/home/nghibui/codes/MUSE/src/evaluation/evaluator.py", line 192, in all_eval
self.dist_mean_cosine(to_log)
File "/home/nghibui/codes/MUSE/src/evaluation/evaluator.py", line 172, in dist_mean_cosine
s2t_candidates = get_candidates(src_emb, tgt_emb, _params)
File "/home/nghibui/codes/MUSE/src/dico_builder.py", line 38, in get_candidates
scores = emb2.mm(emb1[i:min(n_src, i + bs)].transpose(0, 1)).transpose(0, 1)
ValueError: result of slicing is an empty tensor
I have no idea why this happened. Any explanation?
I'm doing the mapping task on a pair of languages that is not in the available list, so I don't have a dictionary for evaluation. Also, the vocabulary for each language is quite small, around 4000 words each. I guess the code will work if I can get rid of the evaluation tasks in evaluator.py.
I trained the model using unsupervised.py for the English<---->Spanish pair, but I am unable to open the best_mapping.t7 file.
I know it is a Torch serialization format, but how can I open/view it?
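If the goal is just to use the aligned embeddings, the exported text files (e.g. vectors-en.txt, written when exporting is enabled) may be easier to work with: they are plain .vec format, a header line "n dim" followed by one "word v1 ... vd" line per word. A hedged loader sketch (file layout as described; the path is an example):

```python
def load_vec(path, max_vocab=None):
    """Load a word2vec/fastText-style text embedding file into a dict of lists."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())  # header: vocab size, dimension
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = [float(v) for v in values]
            if max_vocab and len(embeddings) >= max_vocab:
                break
    return embeddings
```

As for best_mapping.t7 itself, I believe MUSE writes it with PyTorch's torch.save despite the .t7 extension, so torch.load("best_mapping.t7") should return the mapping matrix W; treat that as an assumption and check against your MUSE version.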
Hi,
I downloaded the English and Spanish embeddings from the multilingual word embeddings page.
If I understand correctly, the embeddings for similar words (like "good" and "bueno") should be close. However, when I calculate the cosine similarity between their embeddings I get a small value, around 0.15, and for distant words (like "bad" and "bueno") I get the same value.
Shouldn't the aligned embeddings have high cosine similarity? Am I missing something?
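For sanity-checking pairs like good/bueno, it may help to compute cosine similarity explicitly rather than a raw dot product, since the downloaded vectors are not necessarily unit-norm. A minimal sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy check: parallel vectors score ~1.0, orthogonal vectors ~0.0.
print(cosine([1.0, 2.0], [2.0, 4.0]))
print(cosine([1.0, 0.0], [0.0, 3.0]))
```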
Thanks for releasing supervised word embeddings for 30 languages, aligned in a single vector space. My question is: how would you align 3 languages? For example, how do you align French, German, and English in a single vector space? Your examples only seem to show how to align two languages. Thanks in advance!
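One common recipe (my assumption, not an official MUSE instruction) is to pick one language as the pivot, usually English, and align every other language to it independently; all outputs then live in the pivot's vector space. A sketch, mirroring the supervised command used elsewhere in this thread (file names are examples):

```shell
# Align French -> English and German -> English separately; the two mapped
# embedding sets then share the English space together with wiki.en.vec.
python supervised.py --src_lang fr --tgt_lang en --src_emb wiki.fr.vec --tgt_emb wiki.en.vec --dico_train default
python supervised.py --src_lang de --tgt_lang en --src_emb wiki.de.vec --tgt_emb wiki.en.vec --dico_train default
```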