CSR4mBART

Code for our paper Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation and Understanding

For multilingual sequence-to-sequence pretrained language models (multilingual Seq2Seq PLMs) such as mBART, the self-supervised pretraining task covers a wide range of monolingual languages, e.g., 25 languages from CommonCrawl, while the downstream cross-lingual tasks generally involve only a bilingual subset, e.g., English-German. This leaves a data discrepancy, namely the domain discrepancy, and a cross-lingual learning-objective discrepancy, namely the task discrepancy, between the pretraining and finetuning stages. To bridge these cross-lingual domain and task gaps, we extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task. Specifically, the first stage employs the self-supervised code-switching restore task as a pretext task, allowing the multilingual Seq2Seq PLM to acquire some in-domain alignment information, and in the second stage we fine-tune the model on downstream data as usual. Experiments on both NLG evaluation (12 bilingual translation tasks, 30 zero-shot translation tasks, and 2 cross-lingual summarization tasks) and NLU evaluation (7 cross-lingual natural language inference tasks) show that our model consistently outperforms the strong baseline of mBART with the standard finetuning strategy. Analyses indicate that our approach narrows the Euclidean distance between cross-lingual sentence representations and improves model generalization at trivial computational cost.
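
To make the first-stage pretext task concrete, here is a minimal sketch of a code-switching restore objective, assuming a word-level bilingual lexicon: a fraction of the source words are swapped for dictionary translations, and the Seq2Seq model is trained to restore the original sentence. The lexicon, replacement ratio, and function below are illustrative assumptions, not the repository's actual implementation.

```python
import random

def code_switch(tokens, lexicon, ratio=0.3, seed=None):
    """Swap a fraction of the words for translations drawn from a
    bilingual lexicon (illustrative, not the repo's implementation)."""
    rng = random.Random(seed)
    switched = []
    for tok in tokens:
        if tok in lexicon and rng.random() < ratio:
            switched.append(rng.choice(lexicon[tok]))  # cross-lingual substitute
        else:
            switched.append(tok)
    return switched

# Hypothetical English->German lexicon extracted from the downstream data.
lexicon = {"bank": ["Bank"], "house": ["Haus"], "green": ["grün"]}

src = "the green house near the bank".split()
noisy = code_switch(src, lexicon, ratio=0.5, seed=0)
# Stage 1 trains the Seq2Seq PLM to map `noisy` back to `src`;
# stage 2 then fine-tunes on the bilingual task as usual.
print(" ".join(noisy))
```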

Installation

  • bash setup.sh

Usage

Example commands

Extracting the unsupervised translation vocabulary from the downstream task data:

  • bash scripts/extract_vocab.sh $SRC_LANG $TGT_LANG
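
One simple unsupervised way to approximate such a lexicon from the downstream parallel data is to rank target words by how often they co-occur with each source word across aligned sentence pairs. The sketch below is a rough heuristic under that assumption; the thresholds and scoring are not the script's actual logic.

```python
from collections import Counter

def extract_lexicon(parallel_pairs, min_count=2, min_ratio=0.5, top_k=1):
    """Rough co-occurrence heuristic: for each source word, keep target
    words that appear in most of the sentence pairs containing it."""
    cooc, src_count = Counter(), Counter()
    for src_sent, tgt_sent in parallel_pairs:
        src_toks, tgt_toks = set(src_sent.split()), set(tgt_sent.split())
        for s in src_toks:
            src_count[s] += 1
            for t in tgt_toks:
                cooc[(s, t)] += 1
    lexicon = {}
    for (s, t), c in cooc.items():
        if c >= min_count and c / src_count[s] >= min_ratio:
            lexicon.setdefault(s, []).append((c, t))
    # keep the top_k most frequent candidate translations per source word
    return {s: [t for _, t in sorted(cands, reverse=True)[:top_k]]
            for s, cands in lexicon.items()}

pairs = [("the green house", "das grüne Haus"),
         ("a green tree", "ein grüner Baum"),
         ("the old house", "das alte Haus")]
print(extract_lexicon(pairs))
```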

Training the translation model with our two-stage fine-tuning method:

  • bash scripts/train.sh
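
Schematically, the two stages can be pictured as follows; `model.train_step` is a placeholder and `code_switch` is the sketch from the introduction above, not the repository's actual fairseq-based training loop.

```python
from itertools import cycle

def train_two_stage(model, pairs, lexicon, csr_steps=1000, ft_steps=10000):
    """Two-stage schedule; `model.train_step` is a placeholder and
    `code_switch` is the sketch shown earlier in this README."""
    data = cycle(pairs)
    for _ in range(csr_steps):            # stage 1: code-switching restore
        src, _ = next(data)
        noisy = " ".join(code_switch(src.split(), lexicon))
        model.train_step(source=noisy, target=src)
    for _ in range(ft_steps):             # stage 2: standard fine-tuning
        src, tgt = next(data)
        model.train_step(source=src, target=tgt)
```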

Data

  • Bilingual Translation: The translation datasets are based on ML50 and IWSLT14.

  • Cross-Lingual Summarization: The cross-lingual summarization datasets are based on NCLS.

  • XNLI: The XNLI dataset is based on XGLUE.

  • Preprocess: We follow mBART to preprocess the data, including tokenization and binarization (see the tokenization sketch after this list).
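
For tokenization, mBART ships a SentencePiece model; below is a minimal sketch of applying it. The model path is an assumption (adjust to your checkout), and binarization is normally done afterwards with fairseq-preprocess.

```python
import sentencepiece as spm

# Path to mBART's released SentencePiece model (an assumption; adjust to
# wherever your mbart.cc25 checkpoint directory lives).
sp = spm.SentencePieceProcessor(model_file="mbart.cc25/sentence.bpe.model")

line = "UN Chief Says There Is No Military Solution in Syria"
pieces = sp.encode(line, out_type=str)  # subword tokenization
print(" ".join(pieces))
# The tokenized files are then binarized with fairseq-preprocess.
```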

Evaluate

  • bash scripts/evaluate.sh $checkpoint ${SRC} ${TGT}
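
The evaluation script decodes with the given checkpoint and scores the output; translation quality is typically reported with sacreBLEU, as in this minimal sketch with hypothetical hypotheses and references.

```python
import sacrebleu

# Hypothetical decoded hypotheses and references for one test set.
hyps = ["the green house is near the bank"]
refs = [["the green house is close to the bank"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"BLEU = {bleu.score:.2f}")
```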

Reference

@article{zan2022CSR4mBART,
  title={Bridging cross-lingual gaps during leveraging the multilingual sequence-to-sequence pretraining for text generation},
  author={Zan, Changtong and Ding, Liang and Shen, Li and Cao, Yu and Liu, Weifeng and Tao, Dacheng},
  journal={arXiv preprint},
  year={2022}
}
