
opus-mt-train's Introduction

Train Opus-MT models

This package includes scripts for training NMT models using Marian-NMT and OPUS data for OPUS-MT. More details are given in the Makefile, but the documentation still needs to be improved. Note that the targets require a specific environment and currently only work well on the CSC HPC clusters in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge, also under a CC-BY 4.0 license.

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
git submodule update --init --recursive --remote
make install

Look into lib/env.mk and adjust any settings that you need in your environment. For CSC users: adjust lib/env/puhti.mk and lib/env/mahti.mk to match your setup (especially the locations where Marian-NMT and other tools are installed and the CSC project that you are using).

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release

More information is available in the documentation linked below.

Documentation

Tutorials

References

Please cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Acknowledgements

None of this would be possible without all the great open source software this project builds on, including Marian-NMT and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, and sacrebleu.

We would also like to acknowledge the support of the University of Helsinki, the CSC IT Center for Science, the funding received through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to OPUS, the open collection of parallel corpora.

opus-mt-train's People

Contributors

jorgtied, raphaelmerx, rrrepsac, traubert, veer66


opus-mt-train's Issues

Handle sublanguages with OPUS-CAT MT Engine memoQ Integration

Hello,
I have downloaded the en-it model, but I noticed that the OPUS-CAT MT Engine doesn't handle sublanguages via the memoQ plugin. For example, if I translate EN>IT-ITA I'd like the MT to return results as EN>IT, and the same for EN-US>IT, EN-UK>IT-ITA, etc. In other words, I want any sublanguage to fall back to the main language. How can I achieve this?
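A minimal sketch of the kind of fallback mapping being asked for, assuming language tags follow the usual locale pattern; the function and the place it would be hooked in are illustrative and not part of OPUS-CAT or the memoQ plugin:

def to_base_language(code: str) -> str:
    """Map a locale-style tag such as 'IT-ITA', 'en_US' or 'zh-HK'
    to its base language code ('it', 'en', 'zh')."""
    # Split on the first '-' or '_' and lower-case the language part.
    return code.replace("_", "-").split("-", 1)[0].lower()

# Normalize both sides of a memoQ language pair before picking a model.
pair = ("EN-US", "IT-ITA")
print(tuple(to_base_language(c) for c in pair))  # ('en', 'it')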

integrate OPUS-filter

integrate data filtering using OPUS-filter

  • parallel corpus filtering
  • monolingual corpus filtering (good for back-translation)
  • language (pair) specific configuration files

How to perform fine-tuning?

Hey, currently the fine-tuning commands ('tune' etc.) are not recognized when run from the main directory ('OPUS-MT-train/'), and they do not work correctly when run from the finetune directory ('OPUS-MT-train/finetune'): the makefile there looks for model-related files and tools that live under the main directory (in 'work' and 'tools') rather than under the finetune directory, so it fails.

Is there an easier way to run fine-tuning from the main directory (similar to running train -> eval -> dist)? For example, to train a model on a few corpora and then fine-tune it on the desired corpus (which is most relevant to the final translation task) before packaging it?

Are the models released under MIT License?

Hi,

Thank you for releasing this. Very very interesting work. I plan to test this in the coming months.

Can you confirm that the pretrained models you distribute on this repo and through huggingface are under MIT License?

Thank you,

Alex

Bad translation using marian-decoder

Hi, I've loaded the models from the following directory: https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models/ru-en
When I tried some of them, I often got translations like "▁Y O O O O O O O O O O O O O O O O O O O O" or "I 'm b@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@".
Then I tried to load the model from the Hugging Face site, but I got pretty similar outputs, while using the Hugging Face framework gives good translations. Probably something is wrong with the config.
I launch it using the Marian library. For example:

 echo "привет" | ./marian-decoder -c /path/to/opus_models/opus-2019-12-05-ru-en/decoder.yml

So what can be wrong?

en->bcl quality issue

Using the latest model for en-bcl at opusmt.wmflabs.org, with a sample translation:

Beautiful blue skies and golden sunshine

results in:

Moarvatezeobetuntreugennaddasevelurreizhkornabalamourd'urreizhkornuheloc'ha@-@enepurreizhkornuheloc'h.

Is this an issue with the model, or an escape-character or encoding issue? I can test on the command line if further debugging is needed.

knowledge distillation

Add knowledge distillation and teacher-student models

  • smaller student models + quantization
  • 3 layers encoder + 1-2 layers decoder?

How does marian use vocab.yaml?

Is it like this? I'm having trouble understanding the C++ code.

import sentencepiece
import yaml

text = 'What is for dinner ?'
vocab_file = 'en-de/opus.spm32k-spm32k.vocab.yml'

# The Marian vocab file maps each subword piece to its integer id;
# BaseLoader keeps the values as strings, so cast them to int.
vocab = yaml.load(open(vocab_file), Loader=yaml.BaseLoader)

spm_source = sentencepiece.SentencePieceProcessor()
spm_source.Load('en-de/source.spm')
pieces = spm_source.encode_as_pieces(text)
ids = [int(vocab[p]) for p in pieces]

Incremental ENSK training failed using the provided assets

Problem: incremental training with Marian 1.9 did not succeed when using additional bilingual data together with the EN-SK model and SPM model downloaded from this repo.

Exception stack:

[2020-05-13 12:56:00] Error: Requested shape shape=1x32000 size=32000 for existing parameter 'decoder_ff_logit_out_b' does not match original shape shape=1x60024 size=60024
[2020-05-13 12:56:00] Error: Aborted from marian::Expr marian::ExpressionGraph::param(const string&, const marian::Shape&, marian::Ptr<marian::inits::NodeInitializer>&, marian::Type, bool, bool) in /marian/src/graph/expression_graph.h:317

[CALL STACK]
[0x56519d56f34f]    marian::ExpressionGraph::  param  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  marian::Shape const&,  std::shared_ptr<marian::inits::NodeInitializer> const&,  marian::Type,  bool,  bool) + 0xf3f
[0x56519d7f414e]    marian::mlp::Output::  lazyConstruct  (int)        + 0x24e
[0x56519d7fe7ac]    marian::mlp::Output::  applyAsLogits  (IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>) + 0x6c
[0x56519d8d9b15]    marian::DecoderTransformer::  step  (std::shared_ptr<marian::DecoderState>) + 0x1b15
[0x56519d8dcdde]    marian::DecoderTransformer::  step  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::DecoderState>) + 0x3ee
[0x56519d8f59a5]    marian::EncoderDecoder::  stepAll  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>,  bool) + 0x225
[0x56519d8e6603]    marian::models::EncoderDecoderCECost::  apply  (std::shared_ptr<marian::models::IModel>,  std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::Batch>,  bool) + 0xf3
[0x56519d4d6742]    marian::models::Trainer::  build  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::Batch>,  bool) + 0xa2
[0x56519d94623d]                                                       + 0x70f23d
[0x56519d9d8383]    marian::NCCLCommunicator::  foreach  (std::function<void (unsigned long,unsigned long,unsigned long)> const&,  bool) const + 0x763
[0x56519d942b61]    marian::SyncGraphGroup::  initialize  (std::shared_ptr<marian::data::Batch> const&) + 0x61
[0x56519d94a72e]    marian::SyncGraphGroup::  update  (std::vector<std::shared_ptr<marian::data::Batch>,std::allocator<std::shared_ptr<marian::data::Batch>>>,  unsigned long) + 0x15e
[0x56519d94cd73]    marian::SyncGraphGroup::  update  (std::shared_ptr<marian::data::Batch>) + 0x283
[0x56519d596dcf]    marian::Train<marian::SyncGraphGroup>::  run  ()   + 0x6ff
[0x56519d4b41b1]    mainTrainer  (int,  char**)                        + 0x221
[0x56519d475e35]    main                                               + 0x35
[0x7fa3a5b9eb97]    __libc_start_main                                  + 0xe7
[0x56519d4b256a]    _start                                             + 0x2a

Steps:

  1. Obtained the zip file for the EN-SK engine and extracted it in the working directory.
  2. Attempted to continue training using the EN-SK corpus, using the following command "${marian_path}/marian" -c "${cfg_file}" --model "${fldr}/opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz" --no-restore-corpus, using source.spm and target.spm as the vocabulary files.
  3. This produced the following error: Error: Requested shape shape=32000x512 size=16384000 for existing parameter 'Wemb' does not match original shape shape=60024x512 size=30732288
  4. Upon examining the *.spm files, it turned out that wc -l source.spm returned 69880, and wc -l target.spm returned 69359. There's also a file opus.spm32k-spm32k.vocab.yml, wc -l opus.spm32k-spm32k.vocab.yml returns 60023.
  5. Retried the training using the same command, this time with opus.spm32k-spm32k.vocab.yml as both the source and target vocabulary, but this produced the following error: Error: Detokenizing BLEU validator expects the target vocabulary to be SentencePieceVocab or FactoredVocab. Current vocabulary type is DefaultVocab

I suspect the SPM shipped with the ENSK model was trained on a different corpus.

Thank you for any hints on how to solve this problem.

Language code table and Inconsistency about the benchmark test set & support languages

Hi!

It would be nice to have a table explaining the language codes, as some of them are unclear. By language code, I am referring to the output of

>>> from transformers import MarianMTModel, MarianTokenizer
>>> tokenizer = MarianTokenizer.from_pretrained('pretrained_models/opus-mt-en-mul')  # en-mul as an example here
>>> print(tokenizer.supported_language_codes)
['>>ewe<<', '>>sna<<', '>>lin<<', '>>toi_Latn<<', '>>ceb<<', '>>oss<<', '>>run<<', '>>mfe<<', '>>ilo<<', '>>zlm_Latn<<', '>>pes<<', '>>smo<<', '>>hil<<', '>>niu<<', '>>sag<<', '>>fij<<', '>>cmn_Hans<<', '>>nya<<', '>>tso<<', '>>war<<', '>>gil<<', '>>hau_Latn<<', '>>umb<<', '>>glv<<', '>>tvl<<', '>>ton<<', '>>zul<<', '>>kal<<', '>>pag<<', '>>cmn_Hant<<', '>>pus<<', '>>abk<<', '>>pap<<', '>>hat<<', '>>mkd<<', '>>tuk_Latn<<', '>>yor<<', '>>tuk<<', '>>sqi<<', '>>tir<<', '>>mlg<<', '>>tur<<', '>>ido_Latn<<', '>>mai<<', '>>ibo<<', '>>srp_Cyrl<<', '>>srp_Latn<<', '>>kir_Cyrl<<', '>>heb<<', '>>bos_Latn<<', '>>bak<<', '>>ast<<', '>>som<<', '>>tah<<', '>>chv<<', '>>kek_Latn<<', '>>lug<<', '>>vie<<', '>>wln<<', '>>isl<<', '>>hye<<', '>>mah<<', '>>yue_Hant<<', '>>crh_Latn<<', '>>amh<<', '>>nds<<', '>>pan_Guru<<', '>>xho<<', '>>ukr<<', '>>cat<<', '>>afr<<', '>>tat<<', '>>guj<<', '>>jpn<<', '>>mon<<', '>>eus<<', '>>nob<<', '>>glg<<', '>>ind<<', '>>sin<<', '>>cym<<', '>>zho_Hant<<', '>>zho_Hans<<', '>>tgk_Cyrl<<', '>>aze_Latn<<', '>>ltz<<', '>>bod<<', '>>asm<<', '>>tel<<', '>>urd<<', '>>kaz_Cyrl<<', '>>lat_Latn<<', '>>gla<<', '>>kan<<', '>>bul<<', '>>kin<<', '>>ina_Latn<<', '>>ron<<', '>>spa<<', '>>csb_Latn<<', '>>iba<<', '>>tha<<', '>>nno<<', '>>hrv<<', '>>fry<<', '>>bre<<', '>>mar<<', '>>sme<<', '>>swe<<', '>>deu<<', '>>jav<<', '>>snd_Arab<<', '>>ben<<', '>>cmn<<', '>>ces<<', '>>ita<<', '>>fin<<', '>>por<<', '>>hin<<', '>>hun<<', '>>mal<<', '>>pol<<', '>>fra<<', '>>nld<<', '>>epo<<', '>>slv<<', '>>hsb<<', '>>kur_Latn<<', '>>ori<<', '>>tam<<', '>>bel<<', '>>dan<<', '>>ara<<', '>>mya<<', '>>rus<<', '>>mri<<', '>>est<<', '>>uzb_Latn<<', '>>lao<<', '>>yid<<', '>>uzb_Cyrl<<', '>>uig_Arab<<', '>>lit<<', '>>zho<<', '>>lav<<', '>>ell<<', '>>kat<<', '>>gle<<', '>>mlt<<', '>>khm<<', '>>oci<<', '>>kur_Arab<<', '>>ang_Latn<<', '>>kaz_Latn<<', '>>wol<<', '>>sun<<', '>>chr<<', '>>tat_Latn<<', '>>mhr<<', '>>tyv<<', '>>rom<<', '>>cha<<', '>>kab<<', '>>nav<<', '>>arg<<', '>>khm_Latn<<', '>>bul_Latn<<', '>>udm<<', '>>quc<<', '>>cor<<', '>>san_Deva<<', '>>fao<<', '>>bel_Latn<<', '>>jbo_Latn<<', '>>yue<<', '>>grn<<', '>>sco<<', '>>arq<<', '>>ltg<<', '>>yue_Hans<<', '>>min<<', '>>nan<<', '>>bam_Latn<<', '>>ido<<', '>>ile_Latn<<', '>>wuu<<', '>>crh<<', '>>tlh_Latn<<', '>>lzh<<', '>>jbo<<', '>>lzh_Hans<<', '>>vol_Latn<<', '>>lfn_Latn<<', '>>arz<<']

Moreover, I've noticed that in the opus-mt-en-mul model card, Malay (ms/msa) is not listed as a supported language and I couldn't find it in the output above; however, in the Benchmark section there is a test set listed as Tatoeba-test.eng-msa.eng.msa with its corresponding scores. So I am a little confused... Is Malay supported in this model? If yes, how am I supposed to test the model on English-to-Malay translation; more specifically, what kind of prefix should I prepend to each sentence to translate from English to Malay?

A similar situation occurs with opus-mt-en-gem, where Norwegian (no/nor) is not listed as a supported language; however, Tatoeba-test.eng-nor.eng.nor is listed as a test set. The output is shown below:

>>> tokenizer = MarianTokenizer.from_pretrained('pretrained_models/opus-mt-en-gem')
>>> print(tokenizer.supported_language_codes)
['>>isl<<', '>>nob<<', '>>nds<<', '>>afr<<', '>>deu<<', '>>swe<<', '>>nno<<', '>>fry<<', '>>nld<<', '>>ltz<<', '>>dan<<', '>>yid<<', '>>ang_Latn<<', '>>fao<<', '>>sco<<']

Thank you so much!
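For reference, here is a minimal sketch of how the target-language token is typically used with these multilingual checkpoints in a recent version of transformers; the token '>>ind<<' is just one code taken from the list above and is not meant as an answer to the Malay question:

from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-mul'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# The first token of the source sentence selects the target language.
src = ['>>ind<< The weather is nice today.']
batch = tokenizer(src, return_tensors='pt', padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))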

SentencePiece models training

Hi!

Can you be so kind as to answer a question about SentencePiece (SPM) model training? You train separate models for the source and target languages (on the source and target sentences, respectively), but you have a single vocabulary file, and I don't understand that part. Here I see a recommendation to train one SPM model and one vocabulary file: https://github.com/google/sentencepiece#vocabulary-restriction.
In any case, how do you create one vocabulary file for two SPM models?
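One way such a joint vocabulary file could be put together, sketched under the assumption that Marian's vocab.yml is simply a YAML mapping from subword piece to integer id; this is an illustration, not necessarily what the OPUS-MT pipeline actually does:

import sentencepiece as spm
import yaml

def pieces(model_path):
    sp = spm.SentencePieceProcessor()
    sp.Load(model_path)
    return [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Union of the source and target subword inventories,
# numbered consecutively in order of first appearance.
joint = {}
for piece in pieces('source.spm') + pieces('target.spm'):
    if piece not in joint:
        joint[piece] = len(joint)

with open('opus.vocab.yml', 'w', encoding='utf-8') as f:
    yaml.safe_dump(joint, f, allow_unicode=True, sort_keys=False)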

[Language Codes] How are models named?

For example,
In,
cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de
Is there a table or some other source for what zh_HK, zh_yue, yue, etc. represent?
Is zh_yue different from yue?
Is zh_cn somehow different from cn?

Thanks in advance!

New models

Thanks for all of these models! Sometimes the output is comparable with Google Translate!

I noticed that you improved the model for French and several other languages. Do you have plans to do the same for the es-en, pt-en, da-en, and it-en pairs?

And what was the trick that improved results?

How to download training data?

It seems like make data is looking for /projappl/nlpl/data/OPUS/*/latest/xml/en-ro.xml.gz
I can fix the path, but I think I will still need to download en-ro.xml.gz.
Could you provide instructions for how to do that?

I found the opustools command opus_express -s en -t ro, is that the data the models were trained on?

Help: Create your own model from scratch or fine-tune a pre-trained model?

My goal is to create a Finnish-language chatbot based on a seq2seq model.

Can you give me some hints on getting started? Should I create my own model from scratch or fine-tune a pre-trained model? The https://huggingface.co/Helsinki-NLP/opus-mt-fi-fi model is particularly interesting. Maybe it is possible to fine-tune it on Finnish-language chat pairs such as:

Kuka sei? /t Mina Alex
................................

Any hints are appreciated. Maybe a ready-made project already exists?

In fact, for now I am using:
https://medium.com/axel-springer-tech/headliner-easy-training-and-deployment-of-seq2seq-models-2a26508b4dae
https://github.com/as-ideas/headliner

Thanks.

Should have a clean or clean-data Makefile target

It can happen that data directories get into a broken state, which breaks building the data target. It would be useful to have a "clean" target to clean the data, the models, or both.

E.g., if files in /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/ end up empty, this happens:

for d in /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/GNOME.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/KDE4.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/OpenSubtitles.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/QED.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/Ubuntu.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/wikimedia.br-en.clean.br.gz; do \
  l=`/projappl/project_2001194/bin/pigz -cd < $d  | wc -l`; \
  if [ $l -gt 0 ]; then \
    echo "$d" | xargs basename | \
    sed -e 's#.br.gz$##' \
	-e 's#.clean$##'\
	-e 's#.br-en$##' | tr "\n" ' '         >> /local_scratch/hardwick/br-en/train/README.md; \
    echo -n "($l) "                                  >> /local_scratch/hardwick/br-en/train/README.md; \
  fi \
done
pigz: skipping: <stdin> empty
pigz: skipping: <stdin> empty
pigz: skipping: <stdin> empty
pigz: skipping: <stdin> empty
pigz: skipping: <stdin> empty
pigz: skipping: <stdin> empty
echo ""                                               >> /local_scratch/hardwick/br-en/train/README.md
echo "only one target language"
only one target language
zcat /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/GNOME.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/KDE4.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/OpenSubtitles.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/QED.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/Ubuntu.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/wikimedia.br-en.clean.br.gz  > /local_scratch/hardwick/br-en/train/opus.src.br-en.src

gzip: /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/GNOME.br-en.clean.br.gz: unexpected end of file
make[1]: *** [add-to-local-train-data] Error 1
make[1]: Leaving directory `/scratch/clarin/hardwick/OPUS-MT-train'

I was able to fix this by

rm -rf work/data/simple
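
Until such a target exists, here is a rough sketch of the kind of check a clean-data step could run: find .gz files in the work directory that are empty or cannot be decompressed and delete them. This is only an illustration and not part of the current Makefile:

import glob
import gzip
import os

def is_broken(path):
    # A data file is considered broken if it is empty or cannot be
    # decompressed (e.g. truncated by an interrupted job).
    try:
        with gzip.open(path, 'rb') as f:
            return len(f.read()) == 0
    except (OSError, EOFError):
        return True

for path in glob.glob('work/data/simple/*.gz'):
    if is_broken(path):
        print('removing broken file:', path)
        os.remove(path)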

Corpus clean up and normalization

(This is a question, please redirect me if this is not the right place to ask)

I observed that the test and training datasets could be greatly improved with an automatic, language-specific cleanup and normalization pass. For example, consider this MT output for en-ml: "എന് റെ വീട് ഇന്ത്യയിലാണ്." Here, the space inside the first word is unwanted. This is a known issue in most Malayalam content found on the web, and I found these kinds of issues in both the training and test data.

If I want to fix this, where exactly do I need to add the cleanup code?
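As an illustration of the kind of language-specific rule being described, here is a minimal sketch that removes a stray space after the Malayalam virama (U+0D4D), which turns "എന് റെ" back into "എന്റെ". The rule itself and where it should be hooked into the pipeline are assumptions, not existing code:

import re

# Whitespace directly after the virama splits a conjunct like
# "എന്റെ" into "എന് റെ"; remove it when a Malayalam letter follows.
VIRAMA_SPACE = re.compile('\u0d4d\\s+(?=[\u0d00-\u0d7f])')

def normalize_malayalam(line):
    return VIRAMA_SPACE.sub('\u0d4d', line)

print(normalize_malayalam('എന് റെ വീട് ഇന്ത്യയിലാണ്.'))
# എന്റെ വീട് ഇന്ത്യയിലാണ്.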

Using pretrained models for translations

Hello, I have a doubt regarding the use of the released pretrained models.

I have a marian server running the opus en-fr model (BPE).
I'd like to test the model by translating some sentences of my choice.

According to the model documentation, I have to send preprocessed input to the server.
The file preprocess.sh usage is:

USAGE preprocess.sh langid bpecodes < input > output

While langid, input, and output are clear to me, I don't understand what I should pass as bpecodes.
Can you please point me in the right direction?

Thanks in advance

Missing bi-directional models for some language pairs

Hi,
Unless I am mistaken, bi-directional models are missing for some language pairs.
Can you please advise me on how to find bi-directional models for the following language pairs:

Punjabi - English
Hindi - English
Telugu - English

Thank you.

Chinese-English model?

Thank you for this great resource, it's a really impressive collection of models.

I've noticed there are models for zh->fi, zh->sv and zh->de, but no model for zh->en (or en->zh, for that matter). Since these are quite prominent language pairs, I'm wondering if there are plans to add them in the future? Or am I just looking in the wrong place?

Include some binary dependencies in lib/env.mk

Currently on puhti, the user needs to have /projappl/project_2001194/bin under $PATH, and possibly other subdirectories visible. This could be done automatically from lib/env.mk when we detect puhti.

How to get vocab.yml file when doing train->eval->dist

Hey,

First of all I wanted to thank you for this amazing project.

I followed the instructions in the repo, set up the environment correctly, and I can run train and eval as instructed without any problems. Release does not work for me, so I tried dist instead, which did work and packaged the model into a zip. However, it seems that the Hugging Face script for converting Marian models to PyTorch requires a vocab.yml file, which is present in all the pretrained OPUS-MT models but not in my zip file; I only have src.vocab and trg.vocab files.

Could you please explain how to get the vocab.yml file, and whether this is done with a make target or manually?

Thanks,
Best,
Oren

Continue training/fine tuning OPUS models leads to embedding size mismatch

Bug description

I want to fine-tune Marian models trained on OPUS. In particular, I'm currently working with https://huggingface.co/Helsinki-NLP/opus-mt-en-cs
I have two text files with sources and targets and I want to fine-tune the model on this data.
When I run the training script below, I get the following error:

[2021-01-28 17:03:34] Initialize model weights with the pre-trained model /home/alan/data/46898-custom-mt-benefit-estimate-train-custom/opus-models/encs-train//opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz
[2021-01-28 17:03:34] Loading model from /home/alan/data/46898-custom-mt-benefit-estimate-train-custom/opus-models/encs-train//opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz
[2021-01-28 17:03:35] Training started
[2021-01-28 17:03:35] [data] Shuffling data
[2021-01-28 17:03:35] [data] Done reading 5000 sentences
[2021-01-28 17:03:35] [data] Done shuffling 5000 sentences to temp files
[2021-01-28 17:03:35] Error: Requested shape shape=1x32000 size=32000 for existing parameter 'decoder_ff_logit_out_b' does not match original shape shape=1x58100 size=58100
[2021-01-28 17:03:35] Error: Aborted from marian::Expr marian::ExpressionGraph::param(const string&, const marian::Shape&, marian::Ptr<marian::inits::NodeInitializer>&, marian::Type, bool, bool) in /root/marian/src/graph/expression_graph.h:317

[CALL STACK]
[0x55e8ccbe50ef]    marian::ExpressionGraph::  param  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  marian::Shape const&,  std::shared_ptr<marian::inits::NodeInitializer> const&,  marian::Type,  bool,  bool) + 0xf2f
[0x55e8cce5054d]    marian::mlp::Output::  lazyConstruct  (int)        + 0x24d
[0x55e8cce5a6cc]    marian::mlp::Output::  applyAsLogits  (IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>) + 0x6c
[0x55e8ccf33dc7]    marian::DecoderTransformer::  step  (std::shared_ptr<marian::DecoderState>) + 0x1987
[0x55e8ccf36fad]    marian::DecoderTransformer::  step  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::DecoderState>) + 0x3fd
[0x55e8ccf4f5ed]    marian::EncoderDecoder::  stepAll  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>,  bool) + 0x21d
[0x55e8ccf402f4]    marian::models::EncoderDecoderCECost::  apply  (std::shared_ptr<marian::models::IModel>,  std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::Batch>,  bool) + 0xd4
[0x55e8ccb54890]    marian::models::Trainer::  build  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::Batch>,  bool) + 0xa0
[0x55e8ccfad805]    marian::SingletonGraph::  execute  (std::shared_ptr<marian::data::Batch>) + 0x95
[0x55e8ccc04bbc]    marian::Train<marian::SingletonGraph>::  run  ()   + 0x8ac
[0x55e8ccb323a2]    mainTrainer  (int,  char**)                        + 0x8a2
[0x55e8ccb10535]    main                                               + 0x35
[0x7f5fd86fe0b3]    __libc_start_main                                  + 0xf3
[0x55e8ccb3006a]    _start                                             + 0x2a

There should be a 32k vocabulary, but the model actually seems to have 58100 entries. I also tried the en-de model, which has an embedding size of 65000. I tried using --dim-vocabs 58100 58100, but it didn't help, as the accompanying .spm models really do have a 32k vocab size.

How to reproduce

Here is a minimal example. It requires two text files, sources.txt and targets.txt.

MODEL_DIR="my_model_dir"
marian \
  --model ${MODEL_DIR}/model.npz --type transformer \
  --pretrained-model ${MODEL_DIR}/opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz \
  --train-sets sources.txt targets.txt \
  --vocabs ${MODEL_DIR}/source.spm ${MODEL_DIR}/target.spm \
  --mini-batch 2  --maxi-batch 10 

adding links to the source datasets in benchmarks

  1. Currently it's hard to tell which datasets were used for the benchmark results posted here:
    https://huggingface.co/Helsinki-NLP/opus-mt-ru-en
    (and the other models from your user).

After quite some digging I derived these:

  2. There is also an ambiguity about the "year" used in the dataset names in the benchmark.
[...]
| newstest2015-enru.ru.en | 30.4 | 0.568 |
| newstest2016-enru.ru.en | 30.1 | 0.565 |
[...]
| newstest2019-ruen.ru.en | 31.4 | 0.576 |
| Tatoeba.ru.en           | 61.1 | 0.736 |

Is newstest2016-enru.ru.en referring to wmt16, or to the news crawl corpus that includes data from 2016 (i.e. wmt17)?

Thank you.

p.s. I originally posted about it here and was recommended to file an issue here instead.

Issues with Caucasian Languages

The models for Caucasian languages have very low accuracy. The languages Abkhaz, Adyghe and Chechen have quite a lot of data on the internet, as well as bilingual corpora. Chechen has a huge amount of Wikipedia data, and Abkhaz, Kabardian, Lezgi, Adyghe, Ingush, Lak and Avar also have Wikipedia data, yet the models for these languages have very low accuracy.

Multilingual preprocessing

For this multilingual model (en-ROMANCE), I am having trouble running inputs with language codes through Marian.
Specifically, if I add the codes after running spm_encode, things are fine.
If I try to use the preprocessing logic in preprocess.sh, things break.
test set: >>pt<< Don't spend so much time watching TV.

Using preprocess.sh, I get either

> > pt < < ▁Don ' t ▁spend ▁so ▁much ▁time ▁watching ▁TV .

or

>>pt<< ▁> > pt < < ▁Don ' t ▁spend ▁so ▁much ▁time ▁watching ▁TV .

but both of these cause marian-decoder to throw an error.

The correct input seems to be

>>pt<< ▁Don ' t ▁spend ▁so ▁much ▁time ▁watching ▁TV .

with the language code added after spm_encode is run. Is that correct?
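A minimal sketch of the workaround described above, i.e. prepending the raw language token only after SentencePiece segmentation, assuming the model's source.spm is used for encoding; the paths and helper name are illustrative:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('opus-mt-en-ROMANCE/source.spm')

def preprocess(sentence, target_lang_token):
    # Segment the plain sentence first, then prepend the untouched
    # language token so it does not get split into '> > pt < <'.
    pieces = sp.encode_as_pieces(sentence)
    return ' '.join([target_lang_token] + pieces)

print(preprocess("Don't spend so much time watching TV.", '>>pt<<'))
# >>pt<< ▁Don ' t ▁spend ▁so ▁much ▁time ▁watching ▁TV .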

Very high BLEU score for ta-en

Hello! I noticed that the BLEU score for ta-en is 89.1, which seems a little too high. Could this be a bug? Also, were the BLEU scores calculated on the de-tokenised outputs or the BPE-ed ones in opus-2019-12-05.test.txt?

Thank you in advance!

improved models for Sámi languages

improve models for translating from and to Sámi languages

  • multilingual models and transfer learning
  • integration of monolingual data / backtranslation
  • pivoting
  • data augmentation using rule-based MT (Apertium)

Hugging Face OPUS EN-PT Model

Curious whether there are plans to upload the EN-PT Portuguese model to the HF library of pre-trained models?
It seems that Helsinki-NLP/opus-mt-en-pt does not exist.

Thanks!

Conversion of models based on BPE tokenizers to pytorch

Hello,

While trying to convert the Portuguese-English model to PyTorch, I noticed that this is not possible since the tokenizer is a BPE one. Is there a way of converting it? Or do you plan to release an SPM version of this model at some point?

Thank you

Issue with it-en model

In some rare situations, specific sentences translated from Italian to English come out with "(Translated with Google Translate)" at the end of the output sentence. For example, the following Italian sentences will have it in the English output:
Gli oggetti ordinati sono arrivati in tempi piu'che rapidi e tutto anche piu'bello dal vivo....perfetto!!
Ho fatto alcuni acquistati da Mano Mano mi sono arrivati in tempi brevi e senza alcun problema,auguri grazie di ❤️!!!!
grazie. fino ad adesso buoni prodotti... speriamo anche il prossimo!

I guess some datasets in the training data contain this kind of rubbish in the translation pairs. I suggest removing "(Translated with Google Translate)" from the English sentences in the training data as part of the preprocessing pipeline.

I'm talking about this model:
models/it-en/
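A minimal sketch of the target-side cleanup rule being suggested here; the regex and the place to apply it are illustrative, not part of the existing pipeline:

import re

BOILERPLATE = re.compile(r'\s*\(Translated with Google Translate\)\s*$', re.IGNORECASE)

def clean_target(sentence):
    # Strip the trailing boilerplate from the English side of a pair.
    return BOILERPLATE.sub('', sentence).strip()

print(clean_target('The ordered items arrived quickly. (Translated with Google Translate)'))
# The ordered items arrived quickly.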

Improve platform independence by allowing missing languages in lib/data.mk

In lib/data.mk, there is the possibility of fetching data with opus_read, but it sits inside an if clause like this:

	elif [ -e ${OPUSHOME}/$$c/latest/xml/${LANGPAIR}.xml.gz ]; then \
          echo "extract $$c (${LANGPAIR}) from OPUS"; \
          opus_read ${OPUSREAD_ARGS} -rd ${OPUSHOME} -d $$c -s ${SRC} -t ${TRG} -wm moses -p raw > [email protected]; \

So this could have a special case for missing data.

Word Alignment Files

Hi,
I'm retraining the existing id-en model with my own training data. To train the model, the makefile passes the --guided-alignment parameter along with the path to a word alignment file, but that file is not included in the pre-trained models.
Can you share that file?

Thanks.

en-es / es-en : spm instead of bpe?

Hi,

Do you have SPM versions of the tokenization for the es-en / en-es models, since source and target SPM files are required to convert the models into PyTorch?

Thank you.

How long to build a pair of languages? e.g: en-ko

I'm building a language pair (en-ko) and have been training it for 30 days.
It is now starting epoch 19, and I don't know when it will finish.
Can anyone tell me how long it took you to build a language pair, and on what hardware configuration?

Here is the configuration of the computer I use for training:
CPU: i5 7200U
2 cores / 2.5 GHz
I use the default configuration of the OPUS-MT-train repository.
