ConST: Cross-modal Contrastive Learning for Speech Translation

This is an implementation of the NAACL 2022 paper "Cross-modal Contrastive Learning for Speech Translation" (read the paper here). The implementation is based on the fairseq codebase.

CONTRIBUTION: You are more than welcome to test our code on your machines and report feedback on results, bugs, and performance!

👀 Overview

The motivation of our ConST model is to learn similar representations for semantically similar speech and text.

ConST (1) inherits the advantages of multi-task learning (as shown in our previous paper XSTNet (with code)), and (2) employs a contrastive learning approach to bridge the gap between low-level speech representations and text embeddings.
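To illustrate the idea, here is a minimal NumPy sketch of an in-batch cross-modal contrastive (InfoNCE-style) loss. The function name, the temperature value, and the assumption of mean-pooled utterance/sentence vectors are ours for illustration, not the repository's actual implementation.

```python
import numpy as np

def cross_modal_contrastive_loss(speech_repr, text_repr, temperature=0.05):
    """InfoNCE-style loss: pull each speech vector toward its paired text
    vector and push it away from the other texts in the batch.
    speech_repr, text_repr: (batch, dim) pooled representations."""
    s = speech_repr / np.linalg.norm(speech_repr, axis=1, keepdims=True)
    t = text_repr / np.linalg.norm(text_repr, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature                     # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # positives on the diagonal
```

With matched speech/text pairs on the diagonal the loss is low; shuffling the pairing drives it up, which is the signal that pulls semantically similar speech and text together.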

Result on MuST-C En-X dataset

We report case-sensitive detokenized BLEU via sacrebleu toolkit.

| Model | En-De | En-Es | En-Fr | En-It | En-Nl | En-Pt | En-Ro | En-Ru | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| ConST-base | 25.7 | 30.4 | 36.8 | 26.3 | 30.6 | 32.0 | 24.8 | 17.3 | 28.0 |
| ConST-expand | 28.3 | 32.0 | 38.3 | 27.2 | 31.7 | 33.1 | 25.6 | 18.9 | 29.4 |
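For reference, the Avg. column is the unweighted mean of the eight En-X BLEU scores, rounded to one decimal place, which you can verify directly:

```python
# Sanity-check the Avg. column of the MuST-C results table.
base   = [25.7, 30.4, 36.8, 26.3, 30.6, 32.0, 24.8, 17.3]
expand = [28.3, 32.0, 38.3, 27.2, 31.7, 33.1, 25.6, 18.9]
avg_base   = round(sum(base) / len(base), 1)      # 28.0
avg_expand = round(sum(expand) / len(expand), 1)  # 29.4
```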

🤗 Huggingface Space Demo available now!

Experience our end-to-end voice translation system on Huggingface Space now! Record a sentence in English and translate it into other languages! You are a TRANSLATOR!

HERE IS THE WEBSITE:

https://huggingface.co/spaces/ReneeYe/ConST-speech2text-translator

P.S. Since Huggingface Space only provides CPU, it takes 12 to 20 seconds to run inference and generate the translation result.

⬇️ Download Trained Models

The models are trained with PyTorch. You may download all the models from the 🤗 huggingface model hub.

| Dataset | Model | SPM & Vocab |
|---|---|---|
| En-De | Download | SPM model; Vocab |
| En-Es | Download | SPM model; Vocab |
| En-Fr | Download | SPM model; Vocab |
| En-It | Download | SPM model; Vocab |
| En-Nl | Download | SPM model; Vocab |
| En-Pt | Download | SPM model; Vocab |
| En-Ro | Download | SPM model; Vocab |
| En-Ru | Download | SPM model; Vocab |

Training & Generation Instructions

⚙️ Requirements and Installation

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
git clone [email protected]:ReneeYe/ConST.git
cd ConST
pip3 install -r requirements.txt
pip3 install --editable ./

📉 Pre-processing and Training

Data pre-processing instructions are here. To train the model, taking En-De as an example, you may run:

bash ConST/scripts/train_en2x.sh de checkpoint/model_saved

🤖️ Inference, Generation and Evaluation

We strongly recommend that you average checkpoints once you have identified the best checkpoint (highest BLEU on the dev set).

python3 ConST/scripts/average_checkpoints.py --inputs checkpoint/model_saved \
--num-update-checkpoints 10 --checkpoint-upper-bound ${step-to-get-the-best-dev} \
--output ${path-to-averaged-ckpt}
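Checkpoint averaging takes the element-wise mean of the parameter tensors across the selected checkpoints. A minimal NumPy sketch of the idea (the helper name is ours, not fairseq's; real checkpoints hold torch tensors plus optimizer state):

```python
import numpy as np

def average_state_dicts(state_dicts):
    """Element-wise mean of parameter tensors across checkpoints,
    the core operation behind fairseq's average_checkpoints script."""
    keys = state_dicts[0].keys()
    return {k: np.mean(np.stack([sd[k] for sd in state_dicts]), axis=0)
            for k in keys}
```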

Then generate and evaluate your model.

fairseq-generate data/ --gen-subset tst-COMMON_st --task speech_to_text --prefix-size 1 \
--max-tokens 4000000 --max-source-positions 4000000 --beam 10 \
--config-yaml config_st.yaml  --path ${path-to-averaged-ckpt} \
--scoring sacrebleu
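If you want to post-process the output yourself (e.g. to rescore with a different metric), fairseq-generate prints detokenized hypotheses as D-prefixed lines. A small sketch that collects them in sample order, assuming the usual `D-<id><tab><score><tab><text>` line layout:

```python
def collect_hypotheses(generate_log):
    """Collect detokenized hypotheses from fairseq-generate console output.
    Assumes 'D-<id>\t<score>\t<text>' lines; S-/T-/H- lines are skipped.
    Returns hypotheses sorted by sample id so they align with references."""
    hyps = {}
    for line in generate_log.splitlines():
        if line.startswith("D-"):
            head, _score, text = line.split("\t", 2)
            hyps[int(head[2:])] = text
    return [hyps[i] for i in sorted(hyps)]
```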

✏️ Citation

@InProceedings{ye2022cross,
  author    = {Rong Ye and Mingxuan Wang and Lei Li},
  booktitle = {Proc. of NAACL},
  title     = {Cross-modal Contrastive Learning for Speech Translation},
  year      = {2022}
}

Contributors

  • reneeye

ConST Issues

Reproduce experiments with external MT data

Hello,

Thank you very much for sharing your work!
I followed your script to run the En->De ST experiments. Without external MT data I got a BLEU score of 25.31; with external MT data I got 25.82, which is lower than the 28.3 in your paper.
I'm wondering whether a pre-trained MT model was used to achieve the 28.3 score, or whether there is anything else I might have missed. Any suggestions or guidance would be greatly appreciated.

Thank you in advance!

Extra MT Data

Hi, in the script download_wmt.sh you restrict the WMT16 version to the En-Ro parallel data only, but your paper says that the En-De and En-Ru parallel data also come from WMT16. Is there an inconsistency between the paper and the script?

# download_wmt.sh
if [[ $version == "wmt16" && $target != "ro" ]] || [[ $version != "wmt16" && $target == "ro" ]]; then
    echo "--wmt16 if and only if target is ro"
    exit
fi
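The guard enforces an if-and-only-if relation between the two flags. In Python terms the accepted combinations are (hypothetical helper, written only to restate the shell condition):

```python
def wmt_version_ok(version, target):
    # the script exits unless "version is wmt16" holds exactly when "target is ro"
    return (version == "wmt16") == (target == "ro")
```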

Fail to run the experiment

Hi, when I used bash ConST/scripts/train_en2x.sh de checkpoint/model_saved to train the model, I ran into a problem. It would be a great help if you could take a look at my bug.
In the original ConST/scripts/train_en2x.sh file, the language prefix token looks like this:
(screenshot)
which gives me the following error:
(screenshot)
After deleting the '<' and '>' around lang:${TGT_LANG}, training starts on train_st, but when it begins to validate on dev_st, another assertion error occurs:
(screenshot)
which comes from fairseq/tasks/speech_to_text_triplet_with_extra_mt.py, line 377.
I tried adding '<' and '>' in speech_to_text_triplet_with_extra_mt.py before the assertion, but it didn't work. I would appreciate any solutions you know of. Thank you!

freeze wav2vec

Hello! I don't see a --freeze-w2v argument set in train_en2x.sh. Should wav2vec2 be frozen during ST training?

Following your “Training & Generation Instruction”, BLEU is 25.68.

Hello, I can't reproduce your results. The possible reasons are:

  1. I used 1 RTX 3090 (update-freq=2) instead of 8 V100s.
  2. The script you provided does not seem to pre-train on the external translation data.

Can you give me some ideas about how to reach your performance? Thanks!
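One likely factor behind such gaps is the effective batch size: with gradient accumulation, the tokens seen per optimizer step scale with the number of GPUs, the per-GPU batch, and --update-freq, so 1 GPU with update-freq=2 processes a quarter of what 8 GPUs with update-freq=1 do per update. A sketch, assuming equal per-GPU batch sizes:

```python
def effective_tokens_per_update(num_gpus, tokens_per_gpu, update_freq):
    # gradient accumulation (--update-freq) multiplies the per-step batch
    return num_gpus * tokens_per_gpu * update_freq
```

To match the original setup on fewer GPUs, raise --update-freq proportionally (e.g. update-freq=16 on a single GPU to match 8 GPUs at update-freq=2).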
