
Comments (9)

glample commented on June 16, 2024

Can you try adding the option --normalize_embeddings center to your python unsupervised.py command and see if this helps?
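For context, the center normalization simply subtracts the mean word vector from all embeddings before the mapping is learned. A minimal sketch of the idea (illustrative only, not the actual normalize_embeddings code in src/utils.py):

import torch

def center_embeddings(emb):
    # subtract the mean word vector so the embeddings are centered at the origin
    return emb - emb.mean(dim=0, keepdim=True)

emb = torch.randn(200000, 300)   # stand-in for loaded fastText vectors
emb = center_embeddings(emb)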

yudianer commented on June 16, 2024

Thank you @glample, I am trying it.

yudianer commented on June 16, 2024

I have some questions about the parameters:
--dis_dropout: Discriminator dropout.
For the parameter above, the only related thing I found in the paper was about --dis_input_dropout, namely "As a result, we only feed the discriminator with the 50,000 most frequent words." But I did not find any clue about --dis_dropout.

--dis_lambda: Discriminator loss feedback coefficient.
--dis_clip_weights: Clip discriminator weights (0 to disable)

I could not figure out anything about these two parameters from the paper.

--dico_max_rank: Maximum dictionary words rank (0 to disable)
--dico_min_size: Minimum generated dictionary size (0 to disable)
--dico_max_size: Maximum generated dictionary size (0 to disable)

In my opinion, as for the three parameters above, the generated dictionary is used for validation, so dico_min_size and dico_max_size specify the size of that dictionary. But what is dico_max_rank for?

Thank you @glample

yudianer commented on June 16, 2024

Specifying --normalize_embeddings center does not help; I still get results like:

INFO - 04/25/18 11:06:52 - 0:50:04 - 1500 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 04/25/18 11:06:53 - 0:50:04 - 1500 source words - csls_knn_10 - Precision at k = 5: 0.000000
INFO - 04/25/18 11:06:53 - 0:50:04 - 1500 source words - csls_knn_10 - Precision at k = 10: 0.000000

glample commented on June 16, 2024

Mmm, I think the problem is that the default epoch size we set in the code is too big. I forget exactly what we used in the paper, but I remember it was quite a small epoch size (and consequently more epochs). This way, the evaluations are more frequent, and you are more likely to get a good model. I just tried the command:

CUDA_VISIBLE_DEVICES=0 python unsupervised.py --src_lang en --tgt_lang zh --src_emb data/wiki.en.vec --tgt_emb data/wiki.zh.vec --n_refinement 10 --n_epochs 10 --epoch_size 250000 --normalize_embeddings center

and after half an hour I get ~33% P@1 accuracy. I guess using an even smaller epoch size like 100000 might be even better. I just noticed that there is a small bug when saving the embeddings while using center for normalize_embeddings; I'll fix this tomorrow, sorry about that.
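To make the effect of the epoch size concrete: the model is evaluated once per epoch and the best mapping seen so far is kept, so a smaller --epoch_size (with more --n_epochs) means more frequent evaluations and more chances to catch a good model. A rough, runnable schematic of that selection logic with stand-in functions (not the actual unsupervised.py training loop):

import random

def train_adversarial(n_steps):
    # stand-in for one epoch of adversarial (GAN) training steps
    pass

def evaluate_mean_cosine():
    # stand-in for the unsupervised CSLS mean-cosine validation criterion
    return random.random()

n_epochs, epoch_size = 10, 250000
best_valid = float('-inf')
for epoch in range(n_epochs):
    train_adversarial(epoch_size)
    valid = evaluate_mean_cosine()
    if valid > best_valid:
        # smaller epochs -> more frequent checks -> better chance of keeping a good mapping
        best_valid = valid   # the real code would also save the current mapping here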

Regarding --dico_max_rank, --dico_min_size and --dico_max_size, they correspond to parameters on the synthetic dictionaries we build during the refinement steps. In particular:

--dico_max_rank 15000  # means we will never consider pairs where either the source word or the target word is not among the 15000 most frequent words
--dico_max_size 10000  # means we will never consider more than 10000 pairs in total
--dico_min_size 1000   # means we will always keep at least 1000 pairs (used in combination with --dico_threshold, which removes translations for which we do not have high confidence)

You can check all this in dico_builder.py.
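A minimal sketch of how these three constraints could be applied when building the synthetic dictionary (illustrative only, not the actual dico_builder.py code; the candidate pairs and scores are made up, and word indices are assumed to equal frequency ranks):

import torch

dico_max_rank, dico_max_size, dico_min_size = 15000, 10000, 1000
dico_threshold = 0.0                                     # confidence threshold on the pair scores

pairs = torch.tensor([[3, 17], [120, 54], [20000, 2]])   # candidate (source_id, target_id) pairs
scores = torch.tensor([0.9, 0.4, 0.8])                   # confidence score of each pair

order = scores.argsort(descending=True)                  # sort pairs by decreasing confidence
pairs, scores = pairs[order], scores[order]

keep = (pairs < dico_max_rank).all(dim=1)                # dico_max_rank: drop pairs containing a rare word
pairs, scores = pairs[keep], scores[keep]

pairs, scores = pairs[:dico_max_size], scores[:dico_max_size]   # dico_max_size: cap the dictionary size

confident = scores >= dico_threshold                     # dico_threshold: keep only confident pairs...
confident[:dico_min_size] = True                         # dico_min_size: ...but always keep the top pairs
pairs = pairs[confident]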

Regarding --dis_dropout, this is just the dropout applied between the discriminator layers (whereas --dis_input_dropout is applied to the discriminator's input).
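For reference, a rough sketch of where the two dropout flags would act in a small feed-forward discriminator (layer sizes and activations here are assumptions for illustration, not necessarily the exact architecture):

import torch.nn as nn

dis_input_dropout = 0.1   # --dis_input_dropout: dropout on the embeddings fed to the discriminator
dis_dropout = 0.0         # --dis_dropout: dropout between the discriminator's hidden layers

discriminator = nn.Sequential(
    nn.Dropout(dis_input_dropout),
    nn.Linear(300, 2048),
    nn.LeakyReLU(0.2),
    nn.Dropout(dis_dropout),
    nn.Linear(2048, 1),
    nn.Sigmoid(),
)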

yudianer commented on June 16, 2024

Thank you @glample, training works normally with the latest code (downloaded 4/23/2018).

INFO - 04/25/18 23:42:52 - 3:13:13 - 1500 source words - csls_knn_10 - Precision at k = 1: 33.733333
INFO - 04/25/18 23:42:52 - 3:13:13 - 1500 source words - csls_knn_10 - Precision at k = 5: 53.533333
INFO - 04/25/18 23:42:52 - 3:13:13 - 1500 source words - csls_knn_10 - Precision at k = 10: 59.533333

But an error always occurs at the last step, when exporting the embeddings with the best mapping:

Traceback (most recent call last):
  File "unsupervised.py", line 184, in <module>
    trainer.export()
  File "/home/jack/software/MUSE/src/trainer.py", line 255, in export
    normalize_embeddings(src_emb, params.normalize_embeddings, mean=params.src_mean)
  File "/home/jack/software/MUSE/src/utils.py", line 419, in normalize_embeddings
    emb.sub_(mean)
RuntimeError: inconsistent tensor size, expected r_ [332647 x 300], t [332647 x 300] and src [200000 x 300] to have the same number of elements, but got 99794100, 99794100 and 60000000 elements respectively at /pytorch/torch/lib/TH/generic/THTensorMath.c:1008

Is this the bug you mentioned above?
Thank you!

glample commented on June 16, 2024

Yes, this is the bug I mentioned :) It is fixed in a620cc8.

glample commented on June 16, 2024

Closing for now, feel free to reopen if you still face this issue.

tegillis commented on June 16, 2024

I'm unable to reproduce the results above, even with the changes to epoch size and number of epochs.

I am running: python unsupervised.py --src_lang en --tgt_lang zh --src_emb wiki.en.vec --tgt_emb wiki.zh.vec --n_refinement 10 --n_epochs 10 --epoch_size 250000 --normalize_embeddings center and after 10 epochs (before refinement) I am only getting the following results:

INFO - 03/25/19 21:23:48 - 0:27:10 - 2230 source words - nn - Precision at k = 1: 1.165919
INFO - 03/25/19 21:23:48 - 0:27:10 - 2230 source words - nn - Precision at k = 5: 2.780269
INFO - 03/25/19 21:23:48 - 0:27:10 - 2230 source words - nn - Precision at k = 10: 3.587444
INFO - 03/25/19 21:23:48 - 0:27:10 - Found 2230 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 03/25/19 21:24:10 - 0:27:31 - 2230 source words - csls_knn_10 - Precision at k = 1: 1.121076
INFO - 03/25/19 21:24:10 - 0:27:31 - 2230 source words - csls_knn_10 - Precision at k = 5: 3.228700
INFO - 03/25/19 21:24:10 - 0:27:31 - 2230 source words - csls_knn_10 - Precision at k = 10: 4.843049

What's also interesting is that the mean cosine validation metric seems to decrease as the precision improves. The last epoch had the following value:
Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.38720

while the first epoch (with worse precision) had:
Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.62325

Any idea what's going on here?
