
cdalign's Introduction

Code for AAAI 2021 paper "Lexically Constrained Neural Machine Translation with Explicit Alignment Guidance". [Paper]

Install and Data preprocess

The code is implemented on fairseq v0.9.0. Follow the same steps to install it, and prepare the processed fairseq dataset with the script here. You may need to process other datasets similarly. The Python package fastBPE is also required.

git clone https://github.com/ghchen18/cdalign.git
cd cdalign
pip install --editable ./

Step 1: Train vanilla transformer

See scripts/run_train.sh

Step 2: Extract alignment using Att-Input method and process alignment data

See scripts/extract_alignment.sh

Step 3: Train with EAM-Output method

See scripts/run_train.sh

Step 4: Test on lexically constrained NMT task

See scripts/run_test.sh

Citation

@inproceedings{chen2021lexically,
  title     = {Lexically Constrained Neural Machine Translation with Explicit Alignment Guidance},
  author    = {Chen, Guanhua and Chen, Yun and Li, Victor O.K.},
  booktitle = {Proceedings of AAAI},
  year      = {2021},
  volume    = {35},
  number    = {14},
  pages     = {12630--12638},
}

cdalign's People

Contributors

ghchen18


cdalign's Issues

The usage of extract_phrase.py

Sorry to bother you; I'm stuck using extract_phrase.py to extract the constraints.

How can I use this script to extract the constraints from my own dataset? It always seems that some files are missing. What files do I need to provide besides the source and target BPE text, judging from the arguments in extract_phrase.py?

This has stumped me for a long time; I would really appreciate any details on how to use it.

External fairseq model loading issue

Hi @ghchen18 ,

Thanks for the code. Your work is very interesting.
I am trying to load a translation model trained on fairseq v0.10.0, which, as expected, gives errors since your models were trained on fairseq v0.9.
Is there any way to load the v0.10 model (this version has existed since Nov 2020)?

Thanks,

Got KeyError when training the EAM-Output model, and questions about one-to-many decoding

First of all, your work is very impressive. 😀
I encountered some problems while reproducing it and would like some help.

① I got a KeyError when training the EAM-Output model on the preprocessed alignment dataset.
I used an NVIDIA V100, and the environment settings are as follows:
fairseq 0.9.0
torch 1.11.0
Any idea about the error?

Traceback (most recent call last):
  File "/cdaAlign/cdalign-main/train.py", line 337, in <module>
    cli_main()
  File "/cdaAlign/cdalign-main/train.py", line 333, in cli_main
    main(args)
  File "/cdaAlign/cdalign-main/train.py", line 93, in main
    train(args, trainer, task, epoch_itr)
  File "/cdaAlign/cdalign-main/train.py", line 132, in train
    for i, samples in enumerate(progress, start=epoch_itr.iterations_in_epoch):
  File "/cdaAlign/cdalign-main/fairseq/progress_bar.py", line 181, in __iter__
    for i, obj in enumerate(self.iterable, start=self.offset):
  File "/cdaAlign/cdalign-main/fairseq/data/iterators.py", line 314, in __next__
    chunk.append(next(self.itr))
  File "/cdaAlign/cdalign-main/fairseq/data/iterators.py", line 43, in __next__
    return next(self.itr)
  File "/cdaAlign/cdalign-main/fairseq/data/iterators.py", line 36, in __iter__
    for x in self.iterable:
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/envs/cda-align/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/cdaAlign/cdalign-main/fairseq/data/language_pair_dataset.py", line 215, in __getitem__
    example['alignment'] = self.align_dataset[index]
  File "/cdaAlign/cdalign-main/fairseq/data/indexed_dataset.py", line 222, in __getitem__
    ptx = self.cache_index[i]
KeyError: 2206735
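For context (my own reading, not the authors'): a KeyError raised from cache_index usually means the binarized alignment dataset has no entry for that sample index, i.e. it is shorter than the text dataset. A hypothetical, fairseq-independent check for such gaps:

```python
def find_missing_alignment_indices(num_pairs, align_index_keys):
    """Return the sentence-pair indices 0..num_pairs-1 that have no entry
    in the alignment index.  Hypothetical helper: num_pairs is the size of
    the text dataset, align_index_keys the indices actually present in the
    cached alignment index."""
    present = set(align_index_keys)
    return [i for i in range(num_pairs) if i not in present]
```

If this returns a non-empty list, the alignment file was likely binarized from fewer lines than the source/target corpus.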

② In the one-to-many decoding experiment, I tried to make the model select the constraint candidate as described in your paper, which I quote here.

the model runs another decoder forward pass and selects the constraint with the highest length-averaged log-probability as the target constraint.

Does it mean to:

  1. Use the max probability as the probability for the target constraint tokens;
  2. Calculate the score the same way as beam search with beam_size 1?

My implementation, simplified below, gets a CSR score lower than decoding without constraints. I would really appreciate your help locating my mistake.

# lprobs is the original log-probability distribution at the current step
cur_max_prob = lprobs[idx, :].max().clone()
# tgt_p_toks is the list of tokens for the current target constraint candidate
tmp_cons_prob = cur_max_prob * len(tgt_p_toks)
# append a candidate target constraint to the original hypothesis,
# run another decoder forward pass on the new hypothesis,
# and obtain cur_lprobs, the new log-probability distribution
cur_score = (cur_lprobs.max().clone() + tmp_cons_prob
             + scores.view(bsz * beam_size, -1)[idx, step - 1]) / int(step + len(tgt_p_toks) + 1)
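As a sanity check on the quoted sentence, here is a minimal, fairseq-independent sketch of "highest length-averaged log-probability" selection. select_constraint and its input format are my own invention for illustration, not the authors' code:

```python
def select_constraint(candidates):
    """Given a dict mapping each candidate constraint to the list of
    per-token log-probabilities obtained from the extra decoder forward
    pass, return the candidate whose summed log-probability divided by
    its token count is highest (hypothetical helper)."""
    def length_averaged(logprobs):
        return sum(logprobs) / len(logprobs)
    return max(candidates, key=lambda c: length_averaged(candidates[c]))

# Example: a short candidate with good per-token scores beats a longer
# one whose extra tokens are unlikely.
best = select_constraint({
    "cat": [-0.1, -0.2],                    # average -0.15
    "feline animal": [-0.05, -0.5, -0.6],   # average about -0.38
})
```

Note that the average is taken only over the candidate's own tokens; mixing in the running beam score changes the ranking, which may be one source of discrepancy.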

IndexError in generate_align.py

This error occurred when using bash extract_alignment.sh to extract alignments in task 2.
It seems that sample_id is larger than the length of align_sents, which raises IndexError: list index out of range.
In lines 90-95 of generate_align.py:
[screenshot omitted]

How can I solve this problem?
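One hypothetical way to diagnose this before decoding is to compare the sample ids the generator will request against the number of alignment sentences; any id at or beyond len(align_sents) is the one that raises:

```python
def out_of_range_ids(align_sents, sample_ids):
    """Return the sample ids that would trigger 'list index out of range'
    against align_sents (hypothetical diagnostic helper)."""
    return [sid for sid in sample_ids if not (0 <= sid < len(align_sents))]
```

A non-empty result suggests the alignment file has fewer lines than the test set being decoded.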

Is the loop over word_num correct in `extract_phrase.py`?

Hi, I am using your repository to run some experiments. I appreciate the documentation.

I had a question about this line in the scripts to extract constraints:

for word_num in range(3):

I wanted to confirm the behavior: this means the loop runs with word_num = {0, 1, 2}. But does it make sense to pass max_src_len=0 in the call to phrase_extraction()? There is no way to extract a phrase of length 0. I checked the value of cons_dicts[0] right before it is written to file and found that it is None.

So is this unintended behavior, and should the loop actually be for word_num in range(1, 3+1)? Or am I misunderstanding?
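For reference, the two loop variants cover different phrase lengths (plain Python, nothing repo-specific):

```python
# Loop as written in extract_phrase.py: word_num takes 0, 1, 2,
# so the first iteration would use max_src_len=0.
lengths_as_written = list(range(3))

# Loop suggested in this issue: word_num takes 1, 2, 3.
lengths_suggested = list(range(1, 3 + 1))
```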

Error while running extract_alignment.sh

I am getting this error while trying to run extract_alignment.sh ..

Traceback (most recent call last):
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/generate_align.py", line 128, in <module>
    cli_main()
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/generate_align.py", line 125, in cli_main
    main(args)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/generate_align.py", line 33, in main
    task.load_dataset(args.gen_subset)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/tasks/translation.py", line 217, in load_dataset
    self.datasets[split] = load_langpair_dataset(
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/tasks/translation.py", line 54, in load_langpair_dataset
    src_dataset = data_utils.load_indexed_dataset(prefix + src, src_dict, dataset_impl)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/data_utils.py", line 73, in load_indexed_dataset
    dataset = indexed_dataset.make_dataset(
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 60, in make_dataset
    return MMapIndexedDataset(path)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 448, in __init__
    self._do_init(path)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 458, in _do_init
    self._index = self.Index(index_file_path(self._path))
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 408, in __init__
    self._dtype = dtypes[dtype_code]
KeyError: 9
Exception ignored in: <function MMapIndexedDataset.Index.__del__ at 0x7f0ed7d15b80>
Traceback (most recent call last):
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 423, in __del__
    self._bin_buffer_mmap._mmap.close()
AttributeError: 'Index' object has no attribute '_bin_buffer_mmap'
Exception ignored in: <function MMapIndexedDataset.__del__ at 0x7f0ed7d190d0>
Traceback (most recent call last):
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 465, in __del__
    self._bin_buffer_mmap._mmap.close()
AttributeError: 'MMapIndexedDataset' object has no attribute '_bin_buffer_mmap'

Looking forward to a response on this. @ghchen18
Thanks

What is the talp format?

I found many references to "talp" in scripts/extract_alignment.sh. What is it, and how can I get data in this format?
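Not an official answer, but as far as I know "talp" in alignment tooling usually refers to the Pharaoh-style word-alignment format used by the TALP-UPC alignment evaluation scripts: one line per sentence pair, containing space-separated i-j pairs of 0-based source and target word indices. A hypothetical parser for the plain variant:

```python
def parse_talp_line(line):
    """Parse one Pharaoh/talp-style alignment line such as '0-0 1-2 2-1'
    into (source_index, target_index) tuples.  Hypothetical helper;
    handles only the plain 'i-j' variant, not possible-alignment markers."""
    return [tuple(int(x) for x in pair.split('-')) for pair in line.split()]
```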
