Comments (5)

zdou0830 commented on August 26, 2024

Hi,

Thanks for the interest!

In the demo, we only print the aligned word pairs, so most words in the second English sentence are not showing up because our model does not find any corresponding target words for them (which is a good thing). "I" appears twice because the model thinks it is aligned to two words in the target sentence, and "today." is mapped to "(there)." because both tokens contain a "." (remember that the inputs should be tokenized).

I am not sure if awesome-align can be used for filtering non-parallel sentences. I guess one thing you can do is score each sentence pair by computing the number of extracted word pairs divided by the sentence length, then filter out sentence pairs whose scores are below a pre-defined threshold. Also, awesome-align does have a parallel sentence identification objective, and our models can be trained to detect non-parallel sentences. However, I haven't tested it in this scenario, and I don't know if these strategies would be better than, or complementary to, existing techniques like dual conditional cross-entropy filtering.
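A minimal sketch of the scoring idea above, not something tested in awesome-align itself. It assumes the alignments are available as "i-j" strings, that "sentence length" means the longer of the two whitespace-tokenized sentences, and that 0.5 is a placeholder threshold. The function names are hypothetical.

```python
def parse_pairs(line):
    """Turn a line such as '0-0 1-2' into a list of (src, tgt) index pairs."""
    return [tuple(int(x) for x in tok.split("-")) for tok in line.split()]


def alignment_score(pairs, src_len, tgt_len):
    """Number of extracted word pairs divided by the sentence length.

    Assumption: the length used is the longer of the two sentences.
    """
    length = max(src_len, tgt_len)
    return len(pairs) / length if length else 0.0


def filter_parallel(examples, threshold=0.5):
    """Keep (src, tgt, score) for pairs scoring at or above the threshold.

    `examples` is an iterable of (src_sentence, tgt_sentence, alignment_line).
    The threshold is a placeholder and would need tuning on real data.
    """
    kept = []
    for src, tgt, align_line in examples:
        pairs = parse_pairs(align_line)
        score = alignment_score(pairs, len(src.split()), len(tgt.split()))
        if score >= threshold:
            kept.append((src, tgt, score))
    return kept
```

Because one word can be aligned to several words, this ratio can exceed 1, so the threshold may need to be chosen with that in mind.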

I did test a few ways to generate alignment scores, but it is still worth investigating if the scores make sense or are well-calibrated.

The demo provides a (poor) visualization for the mappings and I will try to make it nicer :)

from awesome-align.

juncaofish commented on August 26, 2024

I encapsulated the alignment score calculation in a separate method that uses a simple harmonic mean of the aligned-token rates on both sides. Comments on the implementation are welcome. @mzeidhassan
https://github.com/juncaofish/awesome-align/blob/master/awesome_align/aligner.py#L125
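For readers who do not follow the link, here is one way such a score could look. This is an independent sketch, not the code in the linked file: it assumes the "aligned tokens rate" on each side is the fraction of distinct tokens that appear in at least one alignment pair, and the function name is hypothetical.

```python
def harmonic_alignment_score(pairs, src_len, tgt_len):
    """Harmonic mean of the aligned-token rates on the source and target side.

    `pairs` is an iterable of (src_index, tgt_index) tuples.
    """
    pairs = list(pairs)
    if src_len == 0 or tgt_len == 0:
        return 0.0
    # Count each token once, even if it is aligned to several words.
    src_rate = len({i for i, _ in pairs}) / src_len
    tgt_rate = len({j for _, j in pairs}) / tgt_len
    if src_rate + tgt_rate == 0:
        return 0.0
    return 2 * src_rate * tgt_rate / (src_rate + tgt_rate)
```

Unlike a raw pair count, this stays within [0, 1] and drops sharply when either side is mostly unaligned.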

jinyiyang-jhu commented on August 26, 2024

I have a similar question: what if a src token is not aligned to any target token (or a tgt token is not aligned to any src token)? In that case, how should we preprocess the gold alignment, and will such a token be printed in the hypothesis? How will the AER be calculated?
In "example/", both the reference and the hypothesis are i-j pairs, so I'm not sure how to handle something like -j or i-. The aer.py script won't work with that.

zdou0830 commented on August 26, 2024

Hi @jinyiyang-jhu, the reference/outputs only contain aligned word pairs. If the i-th source word is not aligned to any target word, there will be no i-* entry in the reference/outputs.
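To illustrate, the sketch below parses "i-j" lines, recovers which source positions are unaligned simply because no i-* entry exists, and computes AER with the standard definition 1 - (|A∩S| + |A∩P|) / (|A| + |S|). It is not the repository's aer.py. It assumes 0-based indices and, in the usage shown, treats every gold link as a sure link. All function names are hypothetical.

```python
def parse_alignment(line):
    """Parse '0-0 2-1' into a set of (src, tgt) index pairs."""
    return {tuple(int(x) for x in tok.split("-")) for tok in line.split()}


def unaligned_source(links, src_len):
    """Source positions with no i-* entry (assumes 0-based indices)."""
    aligned = {i for i, _ in links}
    return [i for i in range(src_len) if i not in aligned]


def aer(sure, possible, hypothesis):
    """Alignment error rate: 1 - (|A & S| + |A & P|) / (|A| + |S|)."""
    possible = possible | sure  # sure links always count as possible
    denom = len(hypothesis) + len(sure)
    if denom == 0:
        return 0.0
    hits = len(hypothesis & sure) + len(hypothesis & possible)
    return 1.0 - hits / denom


gold = parse_alignment("0-0 2-1")
hyp = parse_alignment("0-0 1-1")
print(unaligned_source(gold, src_len=3))  # source word 1 has no gold link
print(aer(gold, gold, hyp))
```

Unaligned words need no special marker such as i- or -j: they contribute nothing to either set, so they affect AER only if the hypothesis wrongly links them.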

mzeidhassan commented on August 26, 2024

Thanks a million @juncaofish ! I will give it a try when I have a chance.
