
AWESOME: Aligning Word Embedding Spaces of Multilingual Encoders

awesome-align is a tool that extracts word alignments from multilingual BERT (mBERT) [Demo] and allows you to fine-tune mBERT on parallel corpora for better alignment quality (see our paper for more details).

Dependencies

First, you need to install the dependencies:

pip install -r requirements.txt

Input format

Inputs should be tokenized, and each line should contain a source-language sentence and its target-language translation, separated by ` ||| `. You can see some examples in the examples folder.
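
For instance, a (hypothetical) tokenized German-English line would look like this:

ich habe einen apfel ||| i have an apple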

Extracting alignments

Here is an example of extracting word alignments from multilingual BERT:

DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=bert-base-multilingual-cased
OUTPUT_FILE=/path/to/output/file

CUDA_VISIBLE_DEVICES=0 python run_align.py \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --batch_size 32

This produces outputs in the i-j Pharaoh format. A pair i-j indicates that the ith word (zero-indexed) of the source sentence is aligned to the jth word of the target sentence.
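
As a minimal sketch (not part of the toolkit), the following Python reads such output back into sets of index pairs; the file name is hypothetical:

# Minimal sketch: read Pharaoh-format alignments ("0-0 1-2 ...") back into
# sets of (source_index, target_index) pairs. "output.align" is a hypothetical path.

def parse_pharaoh(line):
    return {tuple(map(int, pair.split("-"))) for pair in line.split()}

with open("output.align", encoding="utf-8") as f:
    alignments = [parse_pharaoh(line) for line in f]

print(alignments[0])  # e.g. {(0, 0), (1, 2)}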

You can also set MODEL_NAME_OR_PATH to the path of a fine-tuned model (for example, the OUTPUT_DIR produced by the fine-tuning step below).

Fine-tuning on parallel data

If there is parallel data available, you can fine-tune embedding models on that data.

Here is an example of fine-tuning mBERT that strikes a good balance between efficiency and effectiveness:

TRAIN_FILE=/path/to/train/file
EVAL_FILE=/path/to/eval/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 python run_train.py \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_tlm \
    --train_so \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 4000 \
    --max_steps 20000 \
    --do_eval \
    --eval_data_file=$EVAL_FILE

You can also fine-tune the model a bit longer with more training objectives for better quality:

TRAIN_FILE=/path/to/train/file
EVAL_FILE=/path/to/eval/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 python run_train.py \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_mlm \
    --train_tlm \
    --train_tlm_full \
    --train_so \
    --train_psi \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 10000 \
    --max_steps 40000 \
    --do_eval \
    --eval_data_file=$EVAL_FILE

If you want higher alignment recall, you can turn on the --train_co option, but note that alignment precision may drop.

Model performance

The following table shows the alignment error rates (AERs; lower is better) of our models and popular statistical word aligners on five language pairs. The De-En, Fr-En, and Ro-En datasets can be obtained following this repo, the Ja-En data is from this link, and the Zh-En data is available at this link. The best score in each column is marked with an asterisk (*).

Method                                                                De-En  Fr-En  Ro-En  Ja-En  Zh-En
fast_align                                                             27.0   10.5   32.1   51.1   38.1
eflomal                                                                22.6    8.2   25.1   47.5   28.7
Mgiza                                                                  20.6    5.9   26.4   48.0   35.1
Ours (w/o fine-tuning, softmax)                                        17.4    5.6   27.9   45.6   18.1
Ours (multilingually fine-tuned w/o --train_co, softmax) [Download]    15.2   *4.1   22.6  *37.4  *13.4
Ours (multilingually fine-tuned w/ --train_co, softmax) [Download]    *15.1    4.5  *20.7   38.4   14.5
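
For reference, AER is computed from a predicted link set A and gold "sure"/"possible" link sets S and P (with S a subset of P) as AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|); the --train_co trade-off above is precision (measured against P) versus recall (measured against S). A minimal sketch, with hypothetical link sets:

# Minimal sketch (not from the repo) of the standard alignment metrics.
# A: predicted links; S: "sure" gold links; P: "possible" gold links (S subset of P).
# Links are (source_index, target_index) pairs, as in the Pharaoh format above.

def precision(a, p):
    return len(a & p) / len(a)

def recall(a, s):
    return len(a & s) / len(s)

def aer(a, s, p):
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Hypothetical toy example: one sure link missed, one spurious link predicted.
A = {(0, 0), (2, 1)}
S = {(0, 0), (1, 1)}
P = S | {(2, 2)}
print(precision(A, P), recall(A, S), aer(A, S, P))  # 0.5 0.5 0.5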

Citation

If you use our tool, we'd appreciate it if you cite the following paper:

@inproceedings{dou2021word,
  title={Word Alignment by Fine-tuning Embeddings on Parallel Corpora},
  author={Dou, Zi-Yi and Neubig, Graham},
  booktitle={Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2021}
}

Acknowledgements

Some of the code is borrowed from HuggingFace Transformers, licensed under Apache 2.0, and the entmax implementation is from this repo.
