
Comments (6)

glample commented on June 20, 2024

Hello,

In the supervised approach, we generated translations for all words from the source language to the target language, and vice versa (a translation being a pair (x, y) associated with the probability of y being the correct translation of x). We then kept all pairs of words (x, y) such that y has a high probability of being a translation of x, and x also has a high probability of being a translation of y. Finally, we sorted the generated translation pairs by frequency of the source word, taking the first 5000 resulting pairs for training and the following 1500 for testing.
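The selection procedure described above could be sketched roughly as follows. This is a hypothetical illustration, not MUSE's actual code: it assumes the translation probabilities in both directions are already computed as dense matrices, and that `src_freq_rank` maps each source word index to its corpus frequency rank.

```python
import numpy as np

def select_pairs(p_src2tgt, p_tgt2src, src_freq_rank, threshold=0.5):
    """Keep (x, y) only when each word is a high-probability translation
    of the other, then sort by source-word frequency and split the result
    into 5000 training pairs and the following 1500 test pairs.

    p_src2tgt[x, y]: assumed probability that target word y translates source word x.
    p_tgt2src[y, x]: same, in the reverse direction.
    """
    pairs = []
    for x in range(p_src2tgt.shape[0]):
        y = int(np.argmax(p_src2tgt[x]))            # best target candidate for x
        mutual = int(np.argmax(p_tgt2src[y])) == x  # x must also be best for y
        if mutual and p_src2tgt[x, y] > threshold and p_tgt2src[y, x] > threshold:
            pairs.append((x, y))
    # Sort by frequency rank of the source word (0 = most frequent).
    pairs.sort(key=lambda xy: src_freq_rank[xy[0]])
    return pairs[:5000], pairs[5000:6500]
```

The mutual-best-translation check is what filters out ambiguous pairs: a pair survives only if the preference holds in both directions.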

The initial pair selection most likely has an impact on the alignment performance, but we did not study this extensively. We did notice that the results in the supervised setting varied depending on how we selected the pairs. In particular, when we selected pairs with very little ambiguity / no multiple possible translations, the translation accuracy was better; note, however, that the test set was not the same either, and the difference in test pairs alone may be enough to explain the differences.

Previous works have shown that using more than 5000 pairs of words does not improve performance (Artetxe et al., 2017) and can even be detrimental (see Dinu et al., 2015). This is why we decided to consider only 5000 pairs (and also because we wanted to be consistent with previous works).

from muse.

fallingstar621 commented on June 20, 2024

@glample thank you for providing more insights! Also Congratulations on the acceptance of the paper!


fallingstar621 commented on June 20, 2024

@glample Thanks for the reply. Again, great insights!


glample commented on June 20, 2024

Thank you :)


fallingstar621 commented on June 20, 2024

@glample Can I ask another question? Why is the pre-defined dictionary only used in the first iteration of supervised training? Could we use the pre-defined dictionary, rather than the one built from the embeddings, in the following iterations? I tried supervised training for several language pairs, and in some cases I observed that the precision@k metric actually drops over iterations (starting from the second iteration). Does that mean Procrustes can make the alignment worse? Have you experienced this kind of "convergence" problem in your experiments? Any suggestion on changing the parameters (e.g., number of iterations, dico_threshold, dico_max_rank, etc.)? Thanks in advance!


glample commented on June 20, 2024

Regarding "Can we use the pre-defined dictionary rather than the one built from the embeddings in the following iterations?": do you mean using the pre-defined dictionary in addition to the dictionary generated by the alignment, or instead of the generated dictionary? Currently, we use the generated dictionary for the next iteration and completely discard the pre-defined one. But it is true that you could probably use a combination of both and make the supervised + refinement model even stronger.
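One way to implement that combination is a simple de-duplicated union of the two dictionaries before each refinement step. A minimal sketch, assuming dictionaries are lists of (source index, target index) pairs; `merge_dictionaries` is a hypothetical helper, not part of MUSE:

```python
def merge_dictionaries(seed_pairs, generated_pairs):
    """Union of the pre-defined (seed) and generated dictionaries,
    de-duplicated, with seed pairs listed first."""
    seen, merged = set(), []
    for pair in list(seed_pairs) + list(generated_pairs):
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged
```

Keeping the seed pairs anchored in every iteration could, in principle, limit the drift that makes later iterations worse than the first.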

We sometimes observed that iterations at step t >= 2 were not as good as the initial one, but only for language pairs whose embeddings are difficult to align, such as en-ru or en-zh. For pairs of European languages we did not observe anything like this.
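For reference, the Procrustes step applied at each iteration has a closed-form solution via SVD. A minimal NumPy sketch (not the MUSE implementation), assuming X and Y hold the source and target embeddings of the current dictionary pairs, one pair per row:

```python
import numpy as np

def procrustes(X, Y):
    """Return the orthogonal matrix W minimizing ||X @ W - Y||_F.

    X, Y: (n_pairs, dim) arrays of source / target word embeddings for the
    current dictionary. The closed-form solution is W = U @ Vt, where
    U, S, Vt is the SVD of X^T Y (orthogonal Procrustes problem).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

Because W is constrained to be orthogonal, each refinement iteration can only rotate the source space; if the generated dictionary at step t >= 2 is noisier than the seed dictionary, the rotation it induces can indeed degrade precision@k, which matches the behavior described above.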

from muse.
