
cdalign's Introduction

Code for AAAI 2021 paper "Lexically Constrained Neural Machine Translation with Explicit Alignment Guidance". [Paper]

Install and Data preprocess

The code is implemented on fairseq v0.9.0. Follow the same steps to install it, and prepare the processed fairseq dataset with the script here. You may need to process other datasets similarly. The Python package fastBPE is also required.

git clone https://github.com/ghchen18/cdalign.git
cd cdalign
pip install --editable ./

Step 1: Train vanilla transformer

See scripts/run_train.sh

Step 2: Extract alignment using Att-Input method and process alignment data

See scripts/extract_alignment.sh

Step 3: Train with EAM-Output method

See scripts/run_train.sh

Step 4: Test on lexically constrained NMT task

See scripts/run_test.sh

Citation

@inproceedings{chen2021lexically,
  title     = {Lexically Constrained Neural Machine Translation with Explicit Alignment Guidance},
  author    = {Chen, Guanhua and Chen, Yun and Li, Victor O.K.},
  booktitle = {Proceedings of AAAI},
  year      = {2021},
  volume    = {35},
  number    = {14},
  pages     = {12630--12638},
}

cdalign's People

Contributors

ghchen18


cdalign's Issues

The usage of extract_phrase.py

Sorry to bother you; I'm stuck using extract_phrase.py to extract the constraints.

How can I use this script to extract the constraints from my own dataset? It always seems that some files are missing. What files do I need to provide besides the source and target BPE text, judging from the arguments in extract_phrase.py?

This has stumped me for a long time; I would really appreciate any details on how to use it.

External fairseq model loading issue

Hi @ghchen18 ,

Thanks for the code. Your work is very interesting.
I am trying to load a translation model trained on fairseq v0.10.0, which, as expected, gives errors since your models were trained on fairseq v0.9.
Is there any way to load the v0.10 model (this version has existed since Nov 2020)?

Thanks,

Got KeyError when training the EAM-Output model, and questions about one-to-many decoding

First of all, your work is very impressive. 😀
I encountered some problems while reproducing it and would like some help.

① I got a KeyError when training the EAM-Output model on the preprocessed alignment dataset.
I used an NVIDIA V100, and the environment settings are as follows:
fairseq 0.9.0
torch 1.11.0
Any idea about the error?

Traceback (most recent call last):
  File "/cdaAlign/cdalign-main/train.py", line 337, in <module>
    cli_main()
  File "/cdaAlign/cdalign-main/train.py", line 333, in cli_main
    main(args)
  File "/cdaAlign/cdalign-main/train.py", line 93, in main
    train(args, trainer, task, epoch_itr)
  File "/cdaAlign/cdalign-main/train.py", line 132, in train
    for i, samples in enumerate(progress, start=epoch_itr.iterations_in_epoch):
  File "/cdaAlign/cdalign-main/fairseq/progress_bar.py", line 181, in __iter__
    for i, obj in enumerate(self.iterable, start=self.offset):
  File "/cdaAlign/cdalign-main/fairseq/data/iterators.py", line 314, in __next__
    chunk.append(next(self.itr))
  File "/cdaAlign/cdalign-main/fairseq/data/iterators.py", line 43, in __next__
    return next(self.itr)
  File "/cdaAlign/cdalign-main/fairseq/data/iterators.py", line 36, in __iter__
    for x in self.iterable:
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/envs/cda-align/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/cdaAlign/cdalign-main/fairseq/data/language_pair_dataset.py", line 215, in __getitem__
    example['alignment'] = self.align_dataset[index]
  File "/cdaAlign/cdalign-main/fairseq/data/indexed_dataset.py", line 222, in __getitem__
    ptx = self.cache_index[i]
KeyError: 2206735
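For context (my own reading, not the authors'): a KeyError raised from cache_index usually means the binarized alignment dataset has no entry for that sample index, i.e. it is shorter than the text dataset. A hypothetical, fairseq-independent check for such gaps:

```python
def find_missing_alignment_indices(num_pairs, align_index_keys):
    """Return the sentence-pair indices 0..num_pairs-1 that have no entry
    in the alignment index.  Hypothetical helper: num_pairs is the size of
    the text dataset, align_index_keys the indices actually present in the
    cached alignment index."""
    present = set(align_index_keys)
    return [i for i in range(num_pairs) if i not in present]
```

If this returns a non-empty list, the alignment file was likely binarized from fewer lines than the source/target corpus.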

② In the one-to-many decoding experiment, I tried to make the model select the constraint candidate as described in your paper, which I quote here.

the model runs another decoder forward pass and selects the constraint with the highest length-averaged log-probability as the target constraint.

Does it mean to:

  1. Use the max probability as the probability for the target constraint tokens;
  2. Calculate the score the same way as beam search with beam_size 1?

My implementation, simplified below, gets a CSR score lower than decoding without constraints. I would really appreciate your help locating my mistake.

# lprobs is the original log-probability distribution at the current step
cur_max_prob = lprobs[idx, :].max().clone()
# tgt_p_toks is the list of tokens for the current target constraint candidate
tmp_cons_prob = cur_max_prob * len(tgt_p_toks)
# append a candidate target constraint to the original hypothesis,
# run another decoder forward pass on the new hypothesis,
# and obtain cur_lprobs, the new log-probability distribution
cur_score = (cur_lprobs.max().clone() + tmp_cons_prob
             + scores.view(bsz * beam_size, -1)[idx, step - 1]) / int(step + len(tgt_p_toks) + 1)
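As a sanity check on the quoted sentence, here is a minimal, fairseq-independent sketch of "highest length-averaged log-probability" selection. select_constraint and its input format are my own invention for illustration, not the authors' code:

```python
def select_constraint(candidates):
    """Given a dict mapping each candidate constraint to the list of
    per-token log-probabilities obtained from the extra decoder forward
    pass, return the candidate whose summed log-probability divided by
    its token count is highest (hypothetical helper)."""
    def length_averaged(logprobs):
        return sum(logprobs) / len(logprobs)
    return max(candidates, key=lambda c: length_averaged(candidates[c]))

# Example: a short candidate with good per-token scores beats a longer
# one whose extra tokens are unlikely.
best = select_constraint({
    "cat": [-0.1, -0.2],                    # average -0.15
    "feline animal": [-0.05, -0.5, -0.6],   # average about -0.38
})
```

Note that the average is taken only over the candidate's own tokens; mixing in the running beam score changes the ranking, which may be one source of discrepancy.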

IndexError in generate_align.py

This error occurred when using bash extract_alignment.sh to extract alignments in task 2.
It seems that sample_id is larger than the length of align_sents, which raises IndexError: list index out of range.
In lines 90-95 of generate_align.py:
[screenshot omitted]

How can I solve this problem?
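One hypothetical way to diagnose this before decoding is to compare the sample ids the generator will request against the number of alignment sentences; any id at or beyond len(align_sents) is the one that raises:

```python
def out_of_range_ids(align_sents, sample_ids):
    """Return the sample ids that would trigger 'list index out of range'
    against align_sents (hypothetical diagnostic helper)."""
    return [sid for sid in sample_ids if not (0 <= sid < len(align_sents))]
```

A non-empty result suggests the alignment file has fewer lines than the test set being decoded.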

Is the loop over word_num correct in `extract_phrase.py`?

Hi, I am using your repository to run some experiments. I appreciate the documentation.

I had a question about this line in the scripts to extract constraints:

for word_num in range(3):

I wanted to confirm the behavior: this means the loop runs with word_num = {0, 1, 2}. But does it make sense to pass max_src_len=0 in the call to phrase_extraction()? There is no way to extract a phrase of length 0. I checked the value of cons_dicts[0] right before it is written to file and found that it is None.

So is this unintended behavior, and should the loop actually be for word_num in range(1, 3+1)? Or am I misunderstanding?
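For reference, the two loop variants cover different phrase lengths (plain Python, nothing repo-specific):

```python
# Loop as written in extract_phrase.py: word_num takes 0, 1, 2,
# so the first iteration would use max_src_len=0.
lengths_as_written = list(range(3))

# Loop suggested in this issue: word_num takes 1, 2, 3.
lengths_suggested = list(range(1, 3 + 1))
```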

Error while running extract_alignment.sh

I am getting this error while trying to run extract_alignment.sh ..

Traceback (most recent call last):
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/generate_align.py", line 128, in <module>
    cli_main()
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/generate_align.py", line 125, in cli_main
    main(args)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/generate_align.py", line 33, in main
    task.load_dataset(args.gen_subset)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/tasks/translation.py", line 217, in load_dataset
    self.datasets[split] = load_langpair_dataset(
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/tasks/translation.py", line 54, in load_langpair_dataset
    src_dataset = data_utils.load_indexed_dataset(prefix + src, src_dict, dataset_impl)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/data_utils.py", line 73, in load_indexed_dataset
    dataset = indexed_dataset.make_dataset(
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 60, in make_dataset
    return MMapIndexedDataset(path)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 448, in __init__
    self._do_init(path)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 458, in _do_init
    self._index = self.Index(index_file_path(self._path))
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 408, in __init__
    self._dtype = dtypes[dtype_code]
KeyError: 9
Exception ignored in: <function MMapIndexedDataset.Index.__del__ at 0x7f0ed7d15b80>
Traceback (most recent call last):
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 423, in __del__
    self._bin_buffer_mmap._mmap.close()
AttributeError: 'Index' object has no attribute '_bin_buffer_mmap'
Exception ignored in: <function MMapIndexedDataset.__del__ at 0x7f0ed7d190d0>
Traceback (most recent call last):
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 465, in __del__
    self._bin_buffer_mmap._mmap.close()
AttributeError: 'MMapIndexedDataset' object has no attribute '_bin_buffer_mmap'

Looking forward to a response on this. @ghchen18
Thanks

What is the talp format?

I found many references to "talp" in scripts/extract_alignment.sh. What is it, and how can I get data in this format?
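Not an official answer, but as far as I know "talp" in alignment tooling usually refers to the Pharaoh-style word-alignment format used by the TALP-UPC alignment evaluation scripts: one line per sentence pair, containing space-separated i-j pairs of 0-based source and target word indices. A hypothetical parser for the plain variant:

```python
def parse_talp_line(line):
    """Parse one Pharaoh/talp-style alignment line such as '0-0 1-2 2-1'
    into (source_index, target_index) tuples.  Hypothetical helper;
    handles only the plain 'i-j' variant, not possible-alignment markers."""
    return [tuple(int(x) for x in pair.split('-')) for pair in line.split()]
```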
