Hi awesome-align team, First, thanks for the great tool. It has real

Wrong mapping with non-matching sentences (awesome-align, open)

mzeidhassan commented on August 26, 2024

Wrong mapping with non-matching sentences

from awesome-align.

Comments (5)

zdou0830 commented on August 26, 2024

Hi,

Thanks for the interest!

In the demo, we only print the aligned word pairs, so most words in the second English sentence are not showing up because our model does not find any corresponding target words for them (which is a good thing). "I" appears twice because the model thinks it is aligned to two words in the target sentence, and "today." is mapped to "(there)." because both tokens contain a "." (remember that the inputs should be tokenized).

I am not sure if awesome-align can be used for filtering non-parallel sentences. I guess one thing you can do is score each sentence pair by computing the number of extracted word pairs divided by the sentence length, then filter out sentence pairs whose scores are below a pre-defined threshold. Also, awesome-align does have a parallel sentence identification objective, and our models can be trained to detect non-parallel sentences. However, I haven't tested it in this scenario, and I don't know whether these strategies would be better than, or complementary to, existing techniques like dual conditional cross-entropy filtering.
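
The scoring idea above could be sketched roughly as follows. This is only an illustration, not part of awesome-align: the helper names (`parse_pharaoh`, `alignment_score`, `keep_parallel`) and the 0.5 threshold are made up, and since "the sentence length" is not pinned down, the sketch assumes the longer of the two sentences. It expects alignments in the usual `i-j` text format.

```python
def parse_pharaoh(line):
    """Parse an alignment line such as "0-0 1-2" into a list of (i, j) pairs."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def alignment_score(pairs, src_len, tgt_len):
    """Number of distinct aligned word pairs divided by the sentence length.

    Assumption: "sentence length" is taken to be the longer of the two sides.
    With many-to-many alignments the score can exceed 1.0.
    """
    length = max(src_len, tgt_len)
    if length == 0:
        return 0.0
    return len(set(pairs)) / length


def keep_parallel(examples, threshold=0.5):
    """Keep (src_tokens, tgt_tokens, alignment_line) triples scoring >= threshold.

    The default threshold is a hypothetical value and would need tuning.
    """
    kept = []
    for src_tokens, tgt_tokens, alignment_line in examples:
        pairs = parse_pharaoh(alignment_line)
        if alignment_score(pairs, len(src_tokens), len(tgt_tokens)) >= threshold:
            kept.append((src_tokens, tgt_tokens, alignment_line))
    return kept
```

Whether such a score is well calibrated for filtering is an open question, so the threshold would have to be chosen on held-out data.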

I did test a few ways to generate alignment scores, but it is still worth investigating if the scores make sense or are well-calibrated.

The demo provides a (poor) visualization for the mappings and I will try to make it nicer :)

juncaofish commented on August 26, 2024

I encapsulated the alignment score calculation into a separate method that uses the simple harmonic mean of the aligned-token rates on both sides. Comments on the implementation are welcome. @mzeidhassan
https://github.com/juncaofish/awesome-align/blob/master/awesome_align/aligner.py#L125
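
For readers who do not want to follow the link, a harmonic mean of the aligned-token rates might look something like the sketch below. It is written from the one-sentence description above and is not copied from the linked file; the function name and the handling of empty sentences are assumptions.

```python
def harmonic_alignment_score(pairs, src_len, tgt_len):
    """Harmonic mean of the aligned-token rates on the source and target sides.

    pairs: iterable of (i, j) index pairs, one per aligned word pair.
    Returns 0.0 for empty sentences or when nothing is aligned (an assumption).
    """
    if src_len == 0 or tgt_len == 0:
        return 0.0
    pairs = list(pairs)
    # Fraction of source tokens that take part in at least one alignment.
    src_rate = len({i for i, _ in pairs}) / src_len
    # Fraction of target tokens that take part in at least one alignment.
    tgt_rate = len({j for _, j in pairs}) / tgt_len
    if src_rate + tgt_rate == 0:
        return 0.0
    return 2 * src_rate * tgt_rate / (src_rate + tgt_rate)
```

Compared with dividing the pair count by one sentence length, this stays within 0 to 1 and drops when either side has many unaligned tokens.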

jinyiyang-jhu commented on August 26, 2024

I have a similar question: what if a src token is not aligned to any target token (or a tgt token is not aligned to any src token)? If so, how should we preprocess the gold alignment, and will such a token be printed in the hypothesis? How will the AER be calculated?
In "example/", both the reference and the hypothesis are i-j pairs, so I'm not sure how to handle something like -j or i-; the aer.py script won't work with that.

zdou0830 commented on August 26, 2024

Hi @jinyiyang-jhu, the reference/outputs only contain aligned word pairs. If the i-th source word is not aligned to any target word, there will be no i-* pair in the reference/outputs.
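
To make that concrete, here is a small sketch of how unaligned tokens fall out of a pairs-only file, and how AER can be computed from sets of pairs alone. It is not the repository's aer.py: the function names are hypothetical, it uses the standard AER definition with sure and possible links, and it assumes both files use the same index base and that every gold link is written as a plain `i-j` sure link unless a separate possible set is passed in.

```python
def parse_pairs(line):
    """Parse a line such as "0-0 2-1" into a set of (i, j) pairs."""
    return {tuple(int(x) for x in pair.split("-")) for pair in line.split()}


def unaligned_source(pairs, src_len):
    """Source indices that never appear as i in any i-j pair (0-based here)."""
    aligned = {i for i, _ in pairs}
    return [i for i in range(src_len) if i not in aligned]


def aer(hyp, sure, possible=None):
    """Alignment error rate: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    hyp, sure, possible are sets of (i, j) pairs. Sure links always count as
    possible links too. Unaligned tokens need no special entry: they simply
    contribute no pair to any of the sets.
    """
    possible = sure | (possible or set())
    denom = len(hyp) + len(sure)
    if denom == 0:
        return 0.0
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / denom
```

So a gold line like `0-0 2-1` for a three-word source sentence already says that source word 1 is unaligned; no `1-` entry is needed.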

mzeidhassan commented on August 26, 2024

Thanks a million, @juncaofish! I will give it a try when I have a chance.
