
joeynmt's Introduction

Goal and Purpose

๐Ÿจ Joey NMT framework is developed for educational purposes. It aims to be a clean and minimalistic code base to help novices find fast answers to the following questions.

  • โ” How to implement classic NMT architectures (RNN and Transformer) in PyTorch?
  • โ” What are the building blocks of these architectures and how do they interact?
  • โ” How to modify these blocks (e.g. deeper, wider, ...)?
  • โ” How to modify the training procedure (e.g. add a regularizer)?

In contrast to other NMT frameworks, we will not aim for the most recent features or for speed through engineering or training tricks, since this often goes hand in hand with an increase in code complexity and a decrease in readability. 👀

However, Joey NMT re-implements baselines from major publications.

Check out the detailed documentation 📚 and our paper. 📰

Contributors

Joey NMT was initially developed and is maintained by Jasmijn Bastings (University of Amsterdam) and Julia Kreutzer (Heidelberg University), now both at Google Research. Mayumi Ohta at Fraunhofer Institute is continuing the legacy.

Welcome to our new contributors ♥️! Please don't hesitate to open a PR or an issue if there's something that needs improvement.

Features

Joey NMT implements the following features (aka the minimalist toolkit of NMT 🔧):

  • Recurrent Encoder-Decoder with GRUs or LSTMs
  • Transformer Encoder-Decoder
  • Attention Types: MLP, Dot, Multi-Head, Bilinear
  • Word-, BPE- and character-based tokenization
  • BLEU, ChrF evaluation
  • Beam search with length penalty and greedy decoding
  • Customizable initialization
  • Attention visualization
  • Learning curve plotting
  • Scoring hypotheses and references
  • Multilingual translation with language tags

Installation

Joey NMT is built on PyTorch. Please make sure you have a compatible environment. We tested Joey NMT v2.3 with

  • python 3.11
  • torch 2.1.2
  • cuda 12.1

โš ๏ธ Warning When running on GPU you need to manually install the suitable PyTorch version for your CUDA version. For example, you can install PyTorch 2.1.2 with CUDA v12.1 as follows:

python -m pip install --upgrade torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121
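A quick way to check that the installed wheel matches your CUDA setup (plain PyTorch calls, nothing Joey NMT specific):

import torch
print(torch.__version__)          # e.g. 2.1.2+cu121
print(torch.version.cuda)         # CUDA version the wheel was built for
print(torch.cuda.is_available())  # True if a GPU can actually be used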

See PyTorch installation instructions.

You can install Joey NMT either A. via pip or B. from source.

A. Via pip (the latest stable version)

python -m pip install joeynmt

B. From source (for local development)

git clone https://github.com/joeynmt/joeynmt.git  # Clone this repository
cd joeynmt
python -m pip install -e .  # Install Joey NMT and its requirements
python -m unittest  # Run the unit tests

๐Ÿ“ Info For Windows users, we recommend to check whether txt files (i.e. test/data/toy/*) have utf-8 encoding.

Changelog

v2.3

previous releases

v2.2.1

  • compatibility with torch 2.0 tested
  • configurable activation function #211
  • bug fix #207

v2.2

  • compatibility with torch 1.13 tested
  • torchhub introduced
  • bugfixes, minor refactoring

v2.1

  • upgrade to python 3.10, torch 1.12
  • switch automatic mixed precision from NVIDIA's amp to PyTorch's amp package
  • replace discord.py with pycord in the Discord Bot demo
  • data iterator refactoring
  • add wmt14 ende / deen benchmark trained on v2 from scratch
  • add tokenizer tutorial
  • minor bugfixes

v2.0 Breaking change!

  • upgrade to python 3.9, torch 1.11
  • torchtext.legacy dependencies are completely replaced by torch.utils.data
  • joeynmt/tokenizers.py: handles tokenization internally (also supports bpe-dropout!)
  • joeynmt/datasets.py: loads data from plaintext, tsv, and huggingface's datasets
  • scripts/build_vocab.py: trains subwords, creates joint vocab
  • enhancement in decoding
    • scoring with hypotheses or references
    • repetition penalty, ngram blocker
    • attention plots for transformers
  • yapf, isort, flake8 introduced
  • bugfixes, minor refactoring

โš ๏ธ Warning The models trained with Joey NMT v1.x can be decoded with Joey NMT v2.0. But there is no guarantee that you can reproduce the same score as before.

v1.4

  • upgrade to sacrebleu 2.0, python 3.7, torch 1.8
  • bugfixes

v1.3

  • upgrade to torchtext 0.9 (torchtext -> torchtext.legacy)
  • n-best decoding
  • demo colab notebook

v1.0

  • Multi-GPU support
  • fp16 (half precision) support

Documentation & Tutorials

We also updated the documentation thoroughly for Joey NMT 2.0!

For details, follow the tutorials in notebooks dir.

v2.x

v1.x

Usage

โš ๏ธ Warning For Joey NMT v1.x, please refer the archive here.

Joey NMT has three modes: train, test, and translate, and all of them take a YAML-style config file as an argument. You can find examples in the configs directory. transformer_small.yaml contains a detailed explanation of the configuration options.

Most importantly, the configuration contains the description of the model architecture (e.g. number of hidden units in the encoder RNN), paths to the training, development and test data, and the training hyperparameters (learning rate, validation frequency etc.).

๐Ÿ“ Info Note that subword model training and joint vocabulary creation is not included in the 3 modes above, has to be done separately. We provide a script that takes care of it: scritps/build_vocab.py.

python scripts/build_vocab.py configs/transformer_small.yaml --joint

train mode

For training, run

python -m joeynmt train configs/transformer_small.yaml

This will train a model on the training data, validate on validation data, and store model parameters, vocabularies, validation outputs. All needed information should be specified in the data, training and model sections of the config file (here configs/transformer_small.yaml).

model_dir/
โ”œโ”€โ”€ *.ckpt          # checkpoints
โ”œโ”€โ”€ *.hyps          # translated texts at validation
โ”œโ”€โ”€ config.yaml     # config file
โ”œโ”€โ”€ spm.model       # sentencepiece model / subword-nmt codes file
โ”œโ”€โ”€ src_vocab.txt   # src vocab
โ”œโ”€โ”€ trg_vocab.txt   # trg vocab
โ”œโ”€โ”€ train.log       # train log
โ””โ”€โ”€ validation.txt  # validation scores

💡 Tip Be careful not to overwrite model_dir; set overwrite: False in the config file.

test mode

This mode will generate translations for validation and test set (as specified in the configuration) in model_dir/out.[dev|test].

python -m joeynmt test configs/transformer_small.yaml

You can specify the ckpt path explicitly in the config file. If load_model is not given in the config, the best model in model_dir will be used to generate translations.

You can specify e.g. sacrebleu options in the test section of the config file.
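Under the hood, evaluation relies on the sacrebleu package; as a rough, standalone illustration of what such options control (the example sentences below are made up):

import sacrebleu

hyps = ["the cat sits on the mat"]
refs = ["the cat sat on the mat"]
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="13a", lowercase=True)
chrf = sacrebleu.corpus_chrf(hyps, [refs])
print(bleu.score, chrf.score)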

💡 Tip scripts/average_checkpoints.py will generate averaged checkpoints for you.

python scripts/average_checkpoints.py --inputs model_dir/*00.ckpt --output model_dir/avg.ckpt
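Conceptually, checkpoint averaging takes the element-wise mean of the parameter tensors of several checkpoints. A minimal sketch of that idea (not the project's script; the "model_state" key is an assumption about the checkpoint layout):

import torch

def average_checkpoints(paths, output_path):
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model_state"]  # assumed key name
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    avg = {k: v / len(paths) for k, v in avg.items()}
    torch.save({"model_state": avg}, output_path)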

If you want to output the log-probabilities of the hypotheses or references, you can specify return_score: 'hyp' or return_score: 'ref' in the testing section of the config, and run test mode with the --output-path and --save-scores options.

python -m joeynmt test configs/transformer_small.yaml --output-path model_dir/pred --save-scores

This will generate model_dir/pred.{dev|test}.{scores|tokens}, which contain the scores and the corresponding tokens.

๐Ÿ“ Info

  • If you set return_score: 'hyp' with greedy decoding, token-wise scores will be returned. Beam search will return sequence-level scores, because the scores are summed up per sequence during beam exploration (see the sketch after this list).
  • If you set return_score: 'ref', the model looks up the probabilities of the given ground-truth tokens, and both decoding and evaluation will be skipped.
  • If you specify n_best > 1 in the config, the first translation in the n-best list will be used in the evaluation.
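The relation between token-wise and sequence-level scores boils down to summing log-probabilities; a toy illustration with dummy tensors (not Joey NMT code):

import torch

logits = torch.randn(1, 4, 8)                    # (batch, trg_len, vocab) dummy model outputs
log_probs = torch.log_softmax(logits, dim=-1)    # token-wise log-probabilities
trg = torch.randint(0, 8, (1, 4))                # reference / hypothesis token ids
token_scores = log_probs.gather(-1, trg.unsqueeze(-1)).squeeze(-1)  # one score per token
seq_score = token_scores.sum(dim=-1)             # sequence-level score, as summed up in beam search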

translate mode

This mode accepts inputs from stdin and generates translations.

  • File translation

    python -m joeynmt translate configs/transformer_small.yaml < my_input.txt > output.txt
  • Interactive translation

    python -m joeynmt translate configs/transformer_small.yaml

    You'll be prompted to type an input sentence. Joey NMT will then translate with the model specified in the config file.

    💡 Tip Interactive translate mode doesn't work with multi-GPU setups. Please run it on a single GPU or on CPU.

Benchmarks & pretrained models

iwslt14 de/en/fr multilingual

We trained this multilingual model with JoeyNMT v2.3.0 using DDP.

Direction   Architecture   tok             dev   test    #params   download
en->de      Transformer    sentencepiece   -     28.88   200M      iwslt14_prompt
de->en                                     -     35.28
en->fr                                     -     38.86
fr->en                                     -     40.35

sacrebleu signature: nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.4.0

wmt14 ende / deen

We trained the models with JoeyNMT v2.1.0 from scratch.
cf) wmt14 deen leaderboard in paperswithcode

Direction Architecture tok dev test #params download
en->de Transformer sentencepiece 24.36 24.38 60.5M wmt14_ende.tar.gz (766M)
de->en Transformer sentencepiece 30.60 30.51 60.5M wmt14_deen.tar.gz (766M)

sacrebleu signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.2.0


โš ๏ธ Warning The following models are trained with JoeynNMT v1.x, and decoded with Joey NMT v2.0. See config_v1.yaml and config_v2.yaml in the linked zip, respectively. Joey NMT v1.x benchmarks are archived here.

iwslt14 deen

Pre-processing with Moses decoder tools as in this script.

Direction Architecture tok dev test #params download
de->en RNN subword-nmt 31.77 30.74 61M rnn_iwslt14_deen_bpe.tar.gz (672MB)
de->en Transformer subword-nmt 34.53 33.73 19M transformer_iwslt14_deen_bpe.tar.gz (221MB)

sacrebleu signature: nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0

๐Ÿ“ Info For interactive translate mode, you should specify pretokenizer: "moses" in the both src's and trg's tokenizer_cfg, so that you can input raw sentence. Then MosesTokenizer and MosesDetokenizer will be applied internally. For test mode, we used the preprocessed texts as input and set pretokenizer: "none" in the config.

Masakhane JW300 afen / enaf

We picked the pretrained models and configs (bpe codes file etc.) from masakhane.io.

Direction Architecture tok dev test #params download
af->en Transformer subword-nmt - 57.70 46M transformer_jw300_afen.tar.gz (525MB)
en->af Transformer subword-nmt 47.24 47.31 24M transformer_jw300_enaf.tar.gz (285MB)

sacrebleu signature: nrefs:1|case:mixed|eff:no|tok:intl|smooth:exp|version:2.0.0

JParaCrawl enja / jaen

For training, we split JParaCrawl v2 into train and dev sets and trained a model on them. Please check the preprocessing script here. We then tested on the KFTT test set and the WMT20 test set, respectively.

Direction Architecture tok wmt20 kftt #params download
en->ja Transformer sentencepiece 17.66 14.31 225M jparacrawl_enja.tar.gz (2.3GB)
ja->en Transformer sentencepiece 14.97 11.49 221M jparacrawl_jaen.tar.gz (2.2GB)

sacrebleu signature:

  • en->ja nrefs:1|case:mixed|eff:no|tok:ja-mecab-0.996-IPA|smooth:exp|version:2.0.0
  • ja->en nrefs:1|case:mixed|eff:no|tok:intl|smooth:exp|version:2.0.0

Note: In the WMT20 test set, newstest2020-enja has 1000 examples and newstest2020-jaen has 993 examples.

Coding

In order to keep the code clean and readable, we make use of:

  • Style checks:
    • pylint with (mostly) PEP8 conventions, see .pylintrc.
    • yapf, isort, and flake8; see .style.yapf, setup.cfg and Makefile.
  • Typing: Every function has documented input types.
  • Docstrings: Every function, class and module has docstrings describing their purpose and usage.
  • Unittests: Every module has unit tests, defined in test/unit/.
  • Documentation: Update documentation in docs/source/ accordingly.

To ensure the repository stays clean, unit tests and linters are triggered by GitHub workflows on every push or pull request to the main branch. Before you create a pull request, you can check the validity of your modifications with the following commands:

make test
make check
make -C docs clean html

Contributing

Since this codebase is supposed to stay clean and minimalistic, contributions addressing the following are welcome:

  • code correctness
  • code cleanliness
  • documentation quality
  • speed or memory improvements
  • resolving issues
  • providing pre-trained models

Code extending the functionalities beyond the basics will most likely not end up in the main branch, but we're curious to learn what you used Joey NMT for.

Projects and Extensions

Here we'll collect projects and repositories that are based on Joey NMT, so you can find inspiration and examples on how to modify and extend the code.

Joey NMT v2.x

  • 👂 JoeyS2T. Joey NMT is extended for Speech-to-Text tasks! Check out the code and the EMNLP 2022 paper.
  • 🗯️ Discord Joey. This script demonstrates how to deploy Joey NMT models as a chatbot on Discord. Code

Joey NMT v1.x

  • ๐Ÿ•ธ๏ธ Masakhane Web. @CateGitau, @Kabongosalomon, @vukosim and team built a whole web translation platform for the African NMT models that Masakhane built with Joey NMT. The best is: it's completely open-source, so anyone can contribute new models or features. Try it out here, and check out the code.
  • โš™๏ธ MutNMT. @sjarmero created a web application to train NMT: it lets the user train, inspect, evaluate and translate with Joey NMT --- perfect for NMT newbies! Code here. The tool was developed by Prompsit in the framework of the European project MultiTraiNMT.
  • ๐ŸŒŸ Cantonese-Mandarin Translator. @evelynkyl trained different NMT models for translating between the low-resourced Cantonese and Mandarin, with the help of some cool parallel sentence mining tricks! Check out her work here.
  • ๐Ÿ“– Russian-Belarusian Translator. @tsimafeip built a translator from Russian to Belarusian and adapted it to legal and medical domains. The code can be found here.
  • ๐Ÿ’ช Reinforcement Learning. @samuki implemented various policy gradient variants in Joey NMT: here's the code, could the logo be any more perfect? ๐Ÿ’ช ๐Ÿจ
  • โœ‹ Sign Language Translation. @neccam built a sign language translator that continuosly recognizes sign language and translates it. Check out the code and the CVPR 2020 paper!
  • ๐Ÿ”ค @bpopeters built Possum-NMT for multilingual grapheme-to-phoneme transduction and morphologic inflection. Read their paper for SIGMORPHON 2020!
  • ๐Ÿ“ท Image Captioning. @pperle and @stdhd built an image captioning tool on top of Joey NMT, check out the code and the demo!
  • ๐Ÿ’ก Joey Toy Models. @bricksdont built a collection of scripts showing how to install Joey NMT, preprocess data, train and evaluate models. This is a great starting point for anyone who wants to run systematic experiments, tends to forget python calls, or doesn't like to run notebook cells!
  • ๐ŸŒ African NMT. @jaderabbit started an initiative at the Indaba Deep Learning School 2019 to "put African NMT on the map". The goal is to build and collect NMT models for low-resource African languages. The Masakhane repository contains and explains all the code you need to train Joey NMT and points to data sources. It also contains benchmark models and configurations that members of Masakhane have built for various African languages. Furthermore, you might be interested in joining the Masakhane community if you're generally interested in low-resource NLP/NMT. Also see the EMNLP Findings paper.
  • ๐Ÿ’ฌ Slack Joey. Code to locally deploy a Joey NMT model as chat bot in a Slack workspace. It's a convenient way to probe your model without having to implement an API. And bad translations for chat messages can be very entertaining, too ;)
  • ๐ŸŒ Flask Joey. @kevindegila built a flask interface to Joey, so you can deploy your trained model in a web app and query it in the browser.
  • ๐Ÿ‘ฅ User Study. We evaluated the code quality of this repository by testing the understanding of novices through quiz questions. Find the details in Section 3 of the Joey NMT paper.
  • ๐Ÿ“ Self-Regulated Interactive Seq2Seq Learning. Julia Kreutzer and Stefan Riezler. Published at ACL 2019. Paper and Code. This project augments the standard fully-supervised learning regime by weak and self-supervision for a better trade-off of quality and supervision costs in interactive NMT.
  • ๐Ÿซ Hieroglyph Translation. Joey NMT was used to translate hieroglyphs in this IWSLT 2019 paper by Philipp Wiesenbach and Stefan Riezler. They gave Joey NMT multi-tasking abilities.

If you used Joey NMT for a project, publication or built some code on top of it, let us know and we'll link it here.

Contact

Please leave an issue if you have questions or issues with the code.

For general questions, email us at joeynmt <at> gmail.com. 💌

Reference

If you use Joey NMT in a publication or thesis, please cite the following paper:

@inproceedings{kreutzer-etal-2019-joey,
    title = "Joey {NMT}: A Minimalist {NMT} Toolkit for Novices",
    author = "Kreutzer, Julia  and
      Bastings, Jasmijn  and
      Riezler, Stefan",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-3019",
    doi = "10.18653/v1/D19-3019",
    pages = "109--114",
}

Naming

Joeys are infant marsupials. 🐨


joeynmt's Issues

Nice to have's

Not necessarily needed, but nice to have. Might or might not implement.

  • hybrid model: recurrent encoder and self-attentional decoder and vice versa
  • checkpoint averaging
  • transformer: output layer / embedding parameter sharing
  • multi-GPU support
  • log probs of hypotheses as additional output

Missing documentation

  • Explaining all hyperparameters
  • Create a good default ini with working example
  • Explaining the structure of a config file
  • Explaining the architecture of the code base
    -> "Annotated Joeynmt"?

Training with Transformer with BPE error

I am training an isiXhosa > English NMT model on a small corpus (XhosaNavy, <45k).
I tested Joey NMT with Luong and Bahdanau attention with no errors, but ...

What happened:

  1. Got errors when training the transformer_BPE approach for isiXhosa > English
  2. configuration file:
    name: "xhen_transformer"
    data:
    src: "xh"
    trg: "en"
    train: "test/data/navy_xhen/train-bpe"
    dev: "test/data/navy_xhen/dev-bpe"
    test: "test/data/navy_xhen/test-bpe"
    level: "bpe"
    lowercase: True
    max_sent_length: 50
    src_vocab: "test/data/navy_xhen/vocab-bpe.xh"
    trg_vocab: "test/data/navy_xhen/vocab-bpe.en"
    ....
  3. code call [!python -m joeynmt train configs/xhen_transformer_bpe.yaml]

Logged output
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/joeynmt/joeynmt/main.py", line 41, in
main()
File "/content/joeynmt/joeynmt/main.py", line 29, in main
train(cfg_file=args.config_path)
File "/content/joeynmt/joeynmt/training.py", line 535, in train
model = build_model(cfg["model"], src_vocab=src_vocab, trg_vocab=trg_vocab)
File "/content/joeynmt/joeynmt/model.py", line 226, in build_model
"Embedding cannot be tied since vocabularies differ.")
joeynmt.helpers.ConfigurationError: Embedding cannot be tied since vocabularies differ.

Expected behavior
output : Hello! This is Joey-NMT.
params: ...
config

System (please complete the following information):

  • OS: Google Colab VM runtime
  • GPU
  • Python >=3.0

Additional context
I am a novice and just started with NMT projects, and the JoeyNMT toolkit is a good start.

<pad> exists in prediction file

Describe the bug
After training, when I used the model to generate hypotheses on the dev and test sets, there are <pad> tokens at the end of some generated sentences.

Logged output
One example of the output is:
but/and horses then- redup- tell s.o. - <pad> <pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>

System (please complete the following information):

  • MAC OS
  • GPU
  • Python Version 3.7.4

possibly a bug in PullRequest #139 (N-best candidates)?

Hi @sjarmero,

After your pull request #139 has been merged, I've got an error at

assert len(sort_reverse_index) == len(data)

It should look like this, maybe?

assert len(sort_reverse_index) == batch.nseqs

and a small problem here, too.

# sort outputs back to original order
for reverse_index in sort_reverse_index:
    all_outputs.append(output[reverse_index])
    valid_attention_scores.append(
        attention_scores[reverse_index]
        if attention_scores is not None else [])

because now you use append, not extend as before, valid_attention_scores will become non-empty even if attention_scores is None, and it unexpectedly triggers this if-block:

if attention_scores:
    attention_name = "{}.{}.att".format(data_set_name, step)
    attention_path = os.path.join(model_dir, attention_name)
    logger.info("Saving attention plots. This might take a while..")
    store_attention_plots(attentions=attention_scores,
                          targets=hypotheses_raw,
                          sources=data_set.src,
                          indices=range(len(hypotheses)),
                          output_prefix=attention_path)
    logger.info("Attention plots saved to: %s", attention_path)


thank you for your effort to include the nbest list!

@juliakreutzer, FYI

stop after </s> for greedy decoding

Checking whether all elements in the batch are finished is already implemented in beam search, but it would be useful for greedy decoding during validation as well, since it would save some computation.
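A rough sketch of the idea with dummy tensors (the eos id and the greedy step below are stand-ins, not Joey NMT's actual code):

import torch

eos_index = 3                                   # assumed id of </s>
batch_size, max_output_length, vocab = 4, 10, 8
finished = torch.zeros(batch_size, dtype=torch.bool)
for step in range(max_output_length):
    # stand-in for one greedy decoder step: argmax over dummy logits
    next_tokens = torch.randn(batch_size, vocab).argmax(dim=-1)
    finished |= next_tokens.eq(eos_index)
    if finished.all():                          # every sequence has produced </s>
        break                                   # skip the remaining decoding steps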

RNN Multi GPU error

Describe the bug
When trying to use multiple GPUs to train RNN models, I get the following error:

AssertionError: Caught AssertionError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/students/kiegeland/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/students/kiegeland/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/home/students/kiegeland/test/joeynmt/joeynmt/model.py", line 87, in forward
trg_mask=kwargs["trg_mask"])
File "/home/students/kiegeland/test/joeynmt/joeynmt/model.py", line 147, in _encode_decode
trg_mask=trg_mask)
File "/home/students/kiegeland/test/joeynmt/joeynmt/model.py", line 186, in _decode
trg_mask=trg_mask)
File "/home/students/kiegeland/anaconda3/envs/nmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/home/students/kiegeland/test/joeynmt/joeynmt/decoders.py", line 340, in forward
prev_att_vector=prev_att_vector)
File "/home/students/kiegeland/test/joeynmt/joeynmt/decoders.py", line 192, in _check_shapes_input_forward
assert src_mask.shape[2] == encoder_output.shape[1]
AssertionError

It seems like the src_length is different for each device which results in different shapes for the encoder output.
Is there anything else I need to configure?

To Reproduce
Training the reverse experiment using 2 gpus on version=1.0
node_configuration.txt

Logged output
error_log.txt

Example scripts repo

Hi Julia and Joost!

Currently testing joeynmt for our MT class, verdict: awesome! Planning to use it for several exercises.

For the class I've put together example scripts in a small repo here:

https://github.com/bricksdont/joeynmt-toy-models

Just a collection of scripts that show how to make a virtualenv, install packages, download data, preprocess, train and evaluate. Your documentation mentions all of this of course, but the info is more distributed.

If this is useful in any way feel free to link to this somewhere. Otherwise, feel free to close :)

BPE removed during translation

Describe the bug

I believe that the current inference code internally removes BPE tokens, which in my opinion is unexpected. This happens here:

valid_hypotheses = [bpe_postprocess(v) for

since this function for validation is re-used for translation.

To Reproduce
Follow all the steps here :)

https://github.com/bricksdont/joeynmt-toy-models

especially:

https://github.com/bricksdont/joeynmt-toy-models/blob/master/scripts/evaluate.sh

where removing BPE from translations does not have any effect.

Expected behavior

Translation should not do any postprocessing in my opinion, since JoeyNMT also does not do any preprocessing during training.

Doing BPE postprocessing to show examples from the validation set during training is indeed helpful I think.
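For reference, the postprocessing in question essentially undoes subword-nmt's "@@ " continuation markers; a sketch of what that amounts to (not necessarily the exact helper):

def bpe_postprocess(s: str) -> str:
    return s.replace("@@ ", "").replace("@@", "")

print(bpe_postprocess("die Kat@@ ze sitzt"))   # -> "die Katze sitzt"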

joeyNMT 1.1/Google Colab - uses device:cpu instead of GPU

Problem
When training using Masakhane custom data notebook, and pip installing joeyNMT,
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml uses CPU instead of GPU.

This occurs if I use version 1.1, but NOT if I pip install 1.0 specifically.

  • If I use joeynmt==1.0 when pip installing, I get device:cuda
  • If I use joeynmt when pip installing, it installs 1.1, and I get device:cpu

System (please complete the following information):

Encountering error mid-training

Describe the bug
Model training is encountering an error after a few steps. The training log is shown below:

Logged Output
2020-06-23 03:52:05,935 Epoch 1 Step: 3900 Batch Loss: 4.291916 Tokens per Sec: 2948, Lr: 0.000300
2020-06-23 03:52:16,157 Epoch 1 Step: 4000 Batch Loss: 5.365633 Tokens per Sec: 2889, Lr: 0.000300
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/renz/joeynmt/joeynmt/__main__.py", line 41, in <module>
main()
File "/home/renz/joeynmt/joeynmt/__main__.py", line 29, in main
train(cfg_file=args.config_path)
File "/home/renz/joeynmt/joeynmt/training.py", line 653, in train
trainer.train_and_validate(train_data=train_data, valid_data=dev_data)
File "/home/renz/joeynmt/joeynmt/training.py", line 378, in train_and_validate
batch_type=self.eval_batch_type
File "/home/renz/joeynmt/joeynmt/prediction.py", line 88, in validate_on_data
for valid_batch in iter(valid_iter):
File "/opt/conda/lib/python3.7/site-packages/torchtext/data/iterator.py", line 156, in __iter__
yield Batch(minibatch, self.dataset, self.device)
File "/opt/conda/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
setattr(self, name, field.process(batch, device=device))
File "/opt/conda/lib/python3.7/site-packages/torchtext/data/field.py", line 236, in process
padded = self.pad(batch)
File "/opt/conda/lib/python3.7/site-packages/torchtext/data/field.py", line 254, in pad
max_len = max(len(x) for x in minibatch)
ValueError: max() arg is an empty sequence

Config file used
data:
level: bpe
max_sent_length: 80
...

training:
random_seed: 42
optimizer: "adam"
normalization: "tokens"
adam_betas: [0.9, 0.999]
scheduling: "plateau"
patience: 5
decrease_factor: 0.7
loss: "crossentropy"
learning_rate: 0.0003
learning_rate_min: 0.00000001
weight_decay: 0.0
label_smoothing: 0.1
batch_size: 512
batch_type: "token"
eval_batch_size: 256
eval_batch_type: "token"
batch_multiplier: 1
early_stopping_metric: "ppl"
epochs: 100
validation_freq: 4000
logging_freq: 100
eval_metric: "bleu"
model_dir: "models/one2many"
overwrite: False
shuffle: True
use_cuda: True
max_output_length: 100
print_valid_sents: [0, 1, 2, 3]
keep_last_ckpts: 3

model:
initializer: "xavier"
bias_initializer: "zeros"
init_gain: 1.0
embed_initializer: "xavier"
embed_init_gain: 1.0
tied_embeddings: True
tied_softmax: True
encoder:
type: "transformer"
num_layers: 6
num_heads: 8
embeddings:
embedding_dim: 512
scale: True
dropout: 0.
hidden_size: 512
ff_size: 2048
dropout: 0.1
decoder:
type: "transformer"
num_layers: 6
num_heads: 8
embeddings:
embedding_dim: 512
scale: True
dropout: 0.
hidden_size: 512
ff_size: 2048
dropout: 0.1

I tried to use different batch sizes for the tokens, e.g. 4096, 2048, 1028, etc., but I keep encountering the same error. I checked the dataset I used and it has been properly preprocessed according to the Sockeye paper, so I am not sure where the error is coming from.

Iwslt-envi-luong config fails to learn when using GRU instead of LSTM

Amazing tool! Really useful for fast testing / prototyping. I use it to test different RNN cells. Thanks!

I played with both the iwslt-envi-luong and the iwslt-envi-xnmt predefined configs and got the results as reported. But when I change the rnn_type in both encoder and decoder in both configs from LSTM to GRU, a strange thing happens. The iwslt-envi-xnmt config still learns well, no problem. But the iwslt-envi-luong config fails to learn at all.

I've included the validations.txt files for both LSTM and GRU

What am I missing? Is there any explanation for this behaviour? Is it due to the Luong attention mechanism/internals/implementation only working with LSTM?

validations-gru.txt
validations-lstm.txt

expected behavior for `keep_last_ckpts = -1`

Hi @juliakreutzer,

I wanted to save ckpts at every validation step regardless of early-stopping-metric score, so I set keep_last_ckpts = -1, according to the description here:

keep_last_ckpts: 3 # keep this many of the latest checkpoints, if -1: all of them, default: 5

But joeynmt didn't save ckpts at all in that case. Actually, the TrainManager doesn't call _save_checkpoint() func if keep_last_ckpts is less than or equal to zero (queue with infinite length: https://docs.python.org/3/library/queue.html).

self.ckpt_queue = queue.Queue(
maxsize=train_config.get("keep_last_ckpts", 5))

joeynmt/joeynmt/training.py

Lines 544 to 547 in 46b2fe3

if self.ckpt_queue.maxsize > 0:
    logger.info("Saving new checkpoint.")
    new_best = True
    self._save_checkpoint()

What is the expected behavior? Did you indeed intend no save action if keep_last_ckpts = -1 (that is, the description in the config was wrong), or can we change the code so that ckpts will be saved every time if keep_last_ckpts = -1?
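For context, Python's queue treats any non-positive maxsize as "infinite", which is exactly why the maxsize > 0 guard above never saves in this case:

import queue

q_default = queue.Queue(maxsize=3)   # bounded history of checkpoints
q_all = queue.Queue(maxsize=-1)      # maxsize <= 0 means "infinite" for queue.Queue

print(q_default.maxsize > 0)         # True  -> the guard saves checkpoints
print(q_all.maxsize > 0)             # False -> the same guard skips saving entirely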

File translation results different from interactive outputs

Describe the bug
File translation results are different from interactive translation outputs.

To Reproduce
Steps to reproduce the behavior:

I used the following command for generating file translations:

python3 -m joeynmt translate configs/small.yaml < my_input.txt > out.

To generate translations the interactive way:

python3 -m joeynmt translate configs/small.yaml

The interactive translation outputs are better than the file translation outputs.

Missing type prediction during testing

Describe the bug
A clear and concise description of what the bug is.
While running testing, it throws an error like the one below, saying that it is missing a required argument 'batch_class'.

! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${src}${tgt}_transformer/config.yaml"

2021-01-16 15:01:14,414 - INFO - root - Hello! This is Joey-NMT (version 1.0).
2021-01-16 15:01:14,418 - INFO - joeynmt.data - building vocabulary...
2021-01-16 15:01:26,060 - INFO - joeynmt.data - loading dev data...
2021-01-16 15:01:26,089 - INFO - joeynmt.data - loading test data...
2021-01-16 15:01:26,102 - INFO - joeynmt.data - data loaded.
2021-01-16 15:01:26,121 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 3600
2021-01-16 15:01:32,448 - INFO - joeynmt.prediction - Decoding on dev set (data/uzen/dev.bpe.en)...
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/joeynmt/joeynmt/main.py", line 41, in
main()
File "/content/joeynmt/joeynmt/main.py", line 32, in main
output_path=args.output_path, save_attention=args.save_attention)
File "/content/joeynmt/joeynmt/prediction.py", line 325, in test
bpe_type=bpe_type, sacrebleu=sacrebleu, n_gpu=n_gpu)
TypeError: validate_on_data() missing 1 required positional argument: 'batch_class'

To Reproduce
Steps to reproduce the behavior:

  1. Translation task with over 500k examples
  2. config file looks like this:

Create the config

config = """
name: "{name}_transformer"

data:
src: "{source_language}"
trg: "{target_language}"
train: "data/{name}/train.bpe"
dev: "data/{name}/dev.bpe"
test: "data/{name}/test.bpe" # change this to data/{name}/test2.bpe so that you can test it on Ted Talks
level: "bpe"
lowercase: False
max_sent_length: 128
src_vocab: "data/{name}/vocab.txt"
trg_vocab: "data/{name}/vocab.txt"

testing:
beam_size: 5
alpha: 1.0
sacrebleu: # sacrebleu options
remove_whitespace: True # remove_whitespace option in sacrebleu.corpus_chrf() function (defalut: True)
tokenize: "none" # tokenize option in sacrebleu.corpus_bleu() function (options include: "none" (use for already tokenized test data), "13a" (default minimal tokenizer), "intl" which mostly does punctuation and unicode, etc)

training:
load_model: "{gdrive_path}/models/{name}_transformer/best.ckpt" # if uncommented, load a pre-trained model from this checkpoint
random_seed: 42
optimizer: "adam"
normalization: "tokens"
adam_betas: [0.9, 0.998]
scheduling: "plateau"
patience: 5
learning_rate_factor: 0.5
learning_rate_warmup: 4000
decrease_factor: 0.7
loss: "crossentropy"
learning_rate: 0.0003
learning_rate_min: 0.00000001
weight_decay: 0.0
label_smoothing: 0.1
batch_size: 4096
batch_type: "token"
eval_batch_size: 3600
eval_batch_type: "token"
batch_multiplier: 8
early_stopping_metric: "ppl"
epochs: 100
validation_freq: 1000
logging_freq: 100
eval_metric: "bleu"
model_dir: "{gdrive_path}/models/{name}_transformer2"
overwrite: True # TODO: Set to True if you want to overwrite possibly existing models.
shuffle: True
use_cuda: True
fp16: True
max_output_length: 128
print_valid_sents: [0, 1, 2, 3]
keep_last_ckpts: 3

model:
initializer: "xavier"
bias_initializer: "zeros"
init_gain: 1.0
embed_initializer: "xavier"
embed_init_gain: 1.0
tied_embeddings: True
tied_softmax: True
encoder:
type: "transformer"
num_layers: 6
num_heads: 4
embeddings:
embedding_dim: 512
scale: True
dropout: 0.2
# typically ff_size = 4 x hidden_size
hidden_size: 512
ff_size: 2048
dropout: 0.3
decoder:
type: "transformer"
num_layers: 6
num_heads: 4
embeddings:
embedding_dim: 512
scale: True
dropout: 0.2
# typically ff_size = 4 x hidden_size
hidden_size: 512
ff_size: 2048
dropout: 0.3
"""

System (please complete the following information):

  • Google Colab
  • Python 3.6

Additional context
Installing from joeynmt from pip

Extend Tensorboard logging for better insights.

Joey NMT framework is developed for educational purposes.

It would be great to extend joeynmt with more Tensorboard metrics and diagrams to gain better insights into the internals of the encoders/decoders.

  • histograms of weights/bias per layer per timestamp
  • histograms of gradients or gradient_norm per step
    Etc.

Visualisation helps a lot in understanding what's going on and how the network learns or fails to learn in a particular setup.
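A hedged sketch of what such histogram logging could look like with torch.utils.tensorboard (tensorboardX exposes the same calls); the model and log directory here are stand-ins, not Joey NMT internals:

import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/inspect")          # hypothetical log directory
model = nn.Linear(4, 2)                         # stands in for the NMT model
global_step = 0
for name, param in model.named_parameters():
    writer.add_histogram(f"weights/{name}", param, global_step)
    if param.grad is not None:                  # gradients only exist after a backward pass
        writer.add_histogram(f"grads/{name}", param.grad, global_step)
writer.close()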

embedding dropout

Make clear in the documentation and in small.yaml that the encoder (and decoder?) embeddings section takes a separate dropout argument and defaults to the encoder dropout argument if missing.

Will send a pull request myself eventually and just document all issues I find along the way of setting joey up for me.

UnicodeEncodeError on Windows

Python 3.7 ; Windows ;

I get an error when trying to train a network on deen_bpe downloaded dataset task on a Windows machine.

Error
UnicodeEncodeError: 'charmap' codec can't encode character '\u201f' in position 0: character maps to <undefined>

Suggested solution
In vocabulary.py, the with open(file, "r") and with open(file, "w") calls lack the encoding="utf-8" argument.
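The fix amounts to passing an explicit encoding when reading and writing vocabulary files, for example (hypothetical path and tokens):

vocab_file = "vocab.txt"                        # hypothetical path
tokens = ["<unk>", "<pad>", "hello", "\u201f"]  # the last one triggers the charmap error on Windows
with open(vocab_file, "w", encoding="utf-8") as f:
    f.write("\n".join(tokens))
with open(vocab_file, "r", encoding="utf-8") as f:
    tokens_read = [line.strip() for line in f]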

Test cases

We need to include tests (urgently)! Most important:

  • unit tests for all modules
    • encoder
    • embeddings
    • decoder
    • attention
    • data creation
    • batch creation
    • data iterators
    • beam search
      • alpha values
    • greedy decoding
  • tiny benchmarks for quality checks
  • style checks, e.g. with pylint & pep8
  • shape checks for
    • encoder input
    • decoder input
    • attention

We can take inspiration from the tests in Sockeye, OpenNMT, or Neuralmonkey or any other NMT toolkit.

init_hidden

Hi,
I have a question about init_hidden in the decoder.
I understand "only feed the final state of the top-most layer to the decoder".
But don't the hidden state and the cell state of an LSTM have different meanings?
So is it appropriate to initialize the decoder with "(h, h) if isinstance(self.rnn, nn.LSTM)"?
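Some background for the question: nn.LSTM expects a (hidden, cell) tuple, while nn.GRU takes only a hidden state; reusing h as the cell state is one simplification, and zero-initializing the cell state is a common alternative. A small standalone illustration (generic PyTorch, not Joey NMT's bridge code):

import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
x = torch.randn(2, 5, 8)                      # (batch, time, features)
h = torch.zeros(1, 2, 16)                     # would come from the encoder in practice
out_hh, _ = rnn(x, (h, h))                    # (h, h): reuse h as the initial cell state
out_h0, _ = rnn(x, (h, torch.zeros_like(h)))  # alternative: zero-initialized cell state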

Continuing training from a checkpoint is different from uninterrupted training

Currently the checkpoints have no way to continue training from the same point in the dataset that the checkpoint left off at. Instead, the dataset is reset to how it was at the beginning of training using the same state given by the random seed.

This causes the following training progressions when training for two epochs and interrupting training after the first epoch:

                            Epoch 1    Epoch 2
Uninterrupted training      1 2 3 4    5 6 7 8
Interrupted training        1 2 3 4
Continued from checkpoint              1 2 3 4

So the interrupted training plus the continued training causes the model to see the same batches in the same order twice. If the training continued from the checkpoint is allowed to continue for a third epoch, it will then see what the uninterrupted training saw during the second epoch. In order to make sure uninterrupted training is comparable to continuing from checkpoint, it would be good if the checkpoint started and saw batches [5, 6, 7, 8].

Additionally, if the config file specifies two epochs, uninterrupted training will see exactly two epochs, but if you see one epoch before interrupting and continue from that checkpoint, the training will forget that it already saw one epoch and attempt to train on two more epochs instead of one more.

The training iterator also has a state_dict that can be saved and restored, which we should add to the checkpoints.
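One hedged sketch of what storing such extra resume state could look like (generic PyTorch/Python, not Joey NMT's actual checkpoint format):

import random
import torch

def save_resume_state(path, epoch, model, optimizer):
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "torch_rng_state": torch.get_rng_state(),
        "python_rng_state": random.getstate(),
    }, path)

def load_resume_state(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    torch.set_rng_state(ckpt["torch_rng_state"])
    random.setstate(ckpt["python_rng_state"])
    return ckpt["epoch"]                        # resume counting epochs from here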

Training in debug mode

Describe the bug
I want to train in debug mode in PyCharm (or VS code) to primarily inspect the structure of training data. How do I do this?

To Reproduce

I added ../configs/small.yaml as argument for file training.py before running training.py in debug mode. My data is stored in the required directory, and the data is in pickle format, so I need to load it first. I get an error saying the file can't be found, which makes sense because the training.py can't see ./data/training_file.pkl.
This is line 625 in training.py.

 # load the data
 train_data, dev_data, test_data, src_vocab, trg_vocab = load_data(
               data_cfg=cfg["data"])

Joey NMT for paper Correct Me If You Can

I am trying to replicate the results presented in the paper "Correct Me If You Can". I am using Google Colab. I have already cloned the Joey NMT repo as well as Downloaded the model into the same location. I notice that for the toy models a few shell scripts are present to evaluate the toy models. However, I am unclear how I can evaluate the WMT17 en-de model. Can someone please help?

batch_multiplier NotImplementedError (Transformer)

Hello, I'm testing JoeyNMT to see if it works for my translation studies master's thesis (right now I am looking at different toolkits with which to train several patent translation models in EN-JP).

Describe the bug

When trying to run training with batch_multiplier set to 4 instead of 1, training fails before starting.

Using the same settings but with batch_multiplier set to 1, training starts.

To Reproduce
Steps to reproduce the behavior:

  1. Task description: Training a patent-domain Transformer NMT model on a training corpus with approx. 680,000 lines in both src and tgt.

  2. configuration file [ntc7otrans.yaml]

name: "transformer"

data:
src: "jp.tk"
trg: "en.tk"
train: "/home/chris/NMT/data/ntc7o/traino"
dev: "/home/chris/NMT/data/ntc7o/val"
test: "/home/chris/NMT/data/ntc7o/test"
level: "word"
lowercase: False
max_sent_length: 100
src_vocab: "/home/chris/NMT/data/ntc7o/joeyNMT/vocabo.jp.tk"
trg_vocab: "/home/chris/NMT/data/ntc7o/joeyNMT/vocabo.en.tk"

testing:
beam_size: 5
alpha: 1.0

training:
random_seed: 3435
optimizer: "adam"
normalization: "tokens"
adam_betas: [0.9, 0.999]
scheduling: "plateau"
patience: 8
decrease_factor: 0.7
loss: "crossentropy"
learning_rate: 0.0002
learning_rate_min: 0.00000001
weight_decay: 0.0
label_smoothing: 0.1
batch_size: 512
batch_type: "token"
eval_batch_size: 32
eval_batch_type: "token"
batch_multiplier: 4
early_stopping_metric: "ppl"
epochs: 100
validation_freq: 5000
logging_freq: 100
eval_metric: "bleu"
model_dir: "models/ntc7otrans"
overwrite: True
shuffle: True
use_cuda: True
max_output_length: 100
print_valid_sents: [0, 1, 2, 3]
keep_last_ckpts: 5

model:
initializer: "xavier"
bias_initializer: "zeros"
init_gain: 1.0
embed_initializer: "xavier"
embed_init_gain: 1.0
tied_embeddings: False
tied_softmax: True
encoder:
type: "transformer"
num_layers: 6
num_heads: 8
embeddings:
embedding_dim: 512
scale: True
dropout: 0.
# typically ff_size = 4 x hidden_size
hidden_size: 512
ff_size: 2048
dropout: 0.1
decoder:
type: "transformer"
num_layers: 6
num_heads: 8
embeddings:
embedding_dim: 512
scale: True
dropout: 0.
# typically ff_size = 4 x hidden_size
hidden_size: 512
ff_size: 2048
dropout: 0.1

  1. code call [python3 -m joeynmt train configs/ntc7otrans.yaml]

Logged output

2020-03-27 13:49:27,311 Data set sizes:
train 682849,
valid 914,
test 899
2020-03-27 13:49:27,311 First training example:
[SRC] そして 、 上記 関係 を 少なくとも 10 万 枚 通 紙 し て も 維持 し なけれ ば なら ない 。
[TRG] This relation must be maintained even after passing at least 100,000 sheets .
2020-03-27 13:49:27,312 First 10 words (src): (0) (1) (2) (3) (4) 、 (5) の (6) に (7) １ (8) を (9) 。
2020-03-27 13:49:27,312 First 10 words (trg): (0) (1) (2) (3) (4) the (5) , (6) . (7) of (8) a (9) is
2020-03-27 13:49:27,312 Number of Src words (types): 67918
2020-03-27 13:49:27,312 Number of Trg words (types): 89223
2020-03-27 13:49:27,313 Model(
encoder=TransformerEncoder(num_layers=6, num_heads=8),
decoder=TransformerDecoder(num_layers=6, num_heads=8),
src_embed=Embeddings(embedding_dim=512, vocab_size=67918),
trg_embed=Embeddings(embedding_dim=512, vocab_size=89223))
2020-03-27 13:49:27,378 EPOCH 1
Traceback (most recent call last):
File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/chris/NMT/joeyNMT/joeynmt/main.py", line 41, in
main()
File "/home/chris/NMT/joeyNMT/joeynmt/main.py", line 29, in main
train(cfg_file=args.config_path)
File "/home/chris/NMT/joeyNMT/joeynmt/training.py", line 650, in train
trainer.train_and_validate(train_data=train_data, valid_data=dev_data)
File "/home/chris/NMT/joeyNMT/joeynmt/training.py", line 317, in train_and_validate
if self.batch_multiplier > 1 and i == len(train_iter) -
File "/home/chris/.virtualenvs/joeyNMT/lib/python3.6/site-packages/torchtext/data/iterator.py", line 136, in len
raise NotImplementedError
NotImplementedError

Expected behavior
Training should start and wait for 4 batches to complete before update.

System (please complete the following information):

  • OS: Manjaro Linux
    DISTRIB_ID=ManjaroLinux
    DISTRIB_RELEASE=19.0.2
    DISTRIB_CODENAME=Kyria
    DISTRIB_DESCRIPTION="Manjaro Linux"
  • CPU / GPU
    CPU: Core i7 2600k
    GPU: RTX 2060 6GB vram
  • Python Version
    3.6.6

Additional context
Add any other context about the problem here.

Add tensorboard logging

  • Add tensorboardX to the requirements, and use it to log quantities (e.g. loss, validation BLEU) during training (see the sketch after this list).
  • Check if tensorboard shows all logged values correctly.
  • Also use it to export all logged values at the end of training to a JSON file.
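For illustration, the scalar logging itself is one call per value; a sketch with today's torch.utils.tensorboard (tensorboardX exposes the same API, and the values here are made up):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/train")                          # hypothetical log directory
writer.add_scalar("train/batch_loss", 4.29, global_step=100)
writer.add_scalar("valid/bleu", 21.3, global_step=1000)
writer.close()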

No default ckpt

If there is no validation during training, the program chooses the checkpoint 0.ckpt for testing, but it does not exist.

RuntimeError: index_select(): Expected dtype int64 for index

Describe the bug
Runtime error while training using bandit-joeynmt fork, acl19 branch.

To Reproduce
Steps to reproduce the behavior:

  1. Run the train script with data from bandit-joeynmt/test/data/reverse
  2. configuration file: reverse.yaml
  3. code call: !python3 training.py /content/drive/My\ Drive/IIITH/regulator/bandit-joeynmt/configs/reverse.yaml

Logged output
Traceback (most recent call last):
File "training.py", line 1256, in
train(cfg_file=args.config)
File "training.py", line 1228, in train
beam_size=beam_size, beam_alpha=beam_alpha)
File "/usr/local/lib/python3.6/dist-packages/joeynmt/prediction.py", line 66, in validate_on_data
max_output_length=max_output_length)
File "/usr/local/lib/python3.6/dist-packages/joeynmt/model.py", line 1379, in run_batch
decoder=self.decoder)
File "/usr/local/lib/python3.6/dist-packages/joeynmt/search.py", line 208, in beam_search
[alive_seq.index_select(0, select_indices),
RuntimeError: index_select(): Expected dtype int64 for index

Additional Information
select_indices parameter in the call to index_select() is a tensor of float values, but a tensor of integers is expected.
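The fix the traceback suggests is a dtype cast before indexing; a toy reproduction of the working call (not the original beam search code):

import torch

alive_seq = torch.arange(12).reshape(4, 3)     # stands in for the beam hypotheses
select_indices = torch.tensor([0.0, 2.0])      # float indices are what trigger the RuntimeError
selected = alive_seq.index_select(0, select_indices.long())  # casting to int64 fixes it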

System - Using Google Colab:

  • GPU
  • Python Version: 3.6.9

Slow validation during training

I've been using the transformer_small.yaml configuration to train a model.
During training the validate_on_data() method takes 5 times longer than in 'test' mode. I did adapt the test mode code a bit to load all the lines from a file and batch them equally as in training mode.
I can't find a good explanation for it, since I'm using the same config file and the same validation data.

Data set sizes:
train 90727,
valid 926,
test 926

Expected behavior
I would suspect it to take more or less the same time since the metric calculation is only done after that method.

System:

  • Ubuntu 18
  • CPU
  • python 3.7.4

As my knowledge about transformers is rather limited, I was hoping someone had some insight into this.
Thank you for this really nice code base!

question: some positive result after python -m joeynmt train configs/small.yaml

I am wondering: should there be some positive result after using python3 -m joeynmt train configs/small.yaml?

In my case, I have only something like:
2019-07-23 10:00:47,923 Example #0
2019-07-23 10:00:47,923 Source: ich freue mich , dass ich da bin .
2019-07-23 10:00:47,923 Reference: i'm happy to be here .
2019-07-23 10:00:47,924 Hypothesis: <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>
2019-07-23 10:00:47,924 Example #1
2019-07-23 10:00:47,924 Source: ja , guten tag .
2019-07-23 10:00:47,924 Reference: yes , hello .
2019-07-23 10:00:47,924 Hypothesis: <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>
2019-07-23 10:00:47,924 Example #2
2019-07-23 10:00:47,924 Source: ja , also , was soll biohacking sein ?
2019-07-23 10:00:47,925 Reference: yes , so , what is biohacking ?
2019-07-23 10:00:47,925 Hypothesis: <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>

I commented out

src_voc_limit: trg_voc_limit:

and also changed

epochs: 1000.

But it doesn't help.
Also, should the files my_model/valid, out.dev and out.test be empty?

Occur RuntimeError when I execute translate

Describe the bug
When I execute "python3 -m joeynmt translate hoge", a following error message is displayed.

Error message:
RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.

To Reproduce
Steps to reproduce the behavior:

  1. task description
    I want to train a formality style transfer system from informal to formal on GYAFC Dataset (Entertainment_Music).
  2. configuration file
    ezyzip.zip
  3. code call
    python3 -m joeynmt translate gyafc/Entertainment_Music/result/config.yaml

Logged output
train.log
validations.txt

Expected behavior
I expect this error message is not displayed and I can execute the code call.

System (please complete the following information):

  • OS: mac OS Catalina
  • CPU
  • Python Version 3.7.6
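This error usually points at beam search computing beam and word indices with / on integer tensors; newer PyTorch versions require an explicit floor division instead. A toy sketch of the replacement (made-up numbers, not Joey NMT's variables):

import torch

vocab_size = 7
topk_ids = torch.tensor([3, 9, 16])
beam_index = topk_ids // vocab_size   # floor division (also: torch.floor_divide)
word_index = topk_ids % vocab_size    # remainder selects the word within the beam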

Remove Torchtext dependency

For more transparent and flexible data handling, batching and vocabulary building.

The goal would be to be as efficient as torchtext, but have more flexibility for the following use cases:

  • loading data without the target side (e.g. for inference)
  • an interactive mode where the data cannot be loaded from a file
  • shared vocabularies, e.g. for source and target side, for multiple encoders or decoders
  • for multi-tasking where more than one data set is used

AttributeError: module 'tensorflow' has no attribute 'io'

Describe the bug
Environment setup breaks when trying to run training code

To Reproduce

conda create --name joey python=3.8
conda activate joey
pip install git+https://github.com/joeynmt/joeynmt.git

The trainer uses tf.io, so let's try to call it:
[screenshot]

The error:

AttributeError: module 'tensorflow' has no attribute 'io'

Expected behavior
No error

System (please complete the following information):

  • OS: CentOS 7
  • CPU / GPU: CPU
  • Python Version: 3.8.6 (also tried 3.7)

Additional context
I'm not using Joey's CLI, but calling it programmatically from Python.

Translation on CPU: possible to set number of threads?

On a multicore machine, translation can potentially use multithreading for a speedup.

Does Joey respect OMP_NUM_THREADS?

I am testing on a n1-standard-4 GCP instance and running a command such as

CUDA_VISIBLE_DEVICES="" OMP_NUM_THREADS=4 python -m joeynmt translate \
   $configs/transformer_wmt17_ende.yaml < $data/test.bpe.$src > \
   $translations/test.bpe.$model_name.$trg

CPU utilization is constantly at 200% - do you know why that is and if it could be higher?

Thanks so much for your help
Mathias
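For context, PyTorch reads OMP_NUM_THREADS at startup, and the intra-op thread count can also be inspected or capped programmatically (a general PyTorch fact, not Joey NMT specific):

import torch

print(torch.get_num_threads())   # threads used for intra-op parallelism on CPU
torch.set_num_threads(4)         # e.g. match the number of cores on the instance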

โ€œEarly stoppingโ€ with accuracy metric does not work

When configuring a model with eval_metric: 'sequence_accuracy' and early_stopping_metric: 'eval_metric' in the training section, Joey thinks a small sequence accuracy is desirable and will (usually) save a checkpoint only at the first validation, since the accuracy is still small at this point and will grow as the model is trained further.

This is due to the following code in training.py:

        # if we schedule after BLEU/chrf, we want to maximize it, else minimize
        # early_stopping_metric decides on how to find the early stopping point:
        # ckpts are written when there's a new high/low score for this metric
        if self.early_stopping_metric in ["ppl", "loss"]:
            self.minimize_metric = True
        elif self.early_stopping_metric == "eval_metric":
            if self.eval_metric in ["bleu", "chrf"]:
                self.minimize_metric = False
            # eval metric that has to get minimized (not yet implemented)
            else:
                self.minimize_metric = True
        else:
            raise ConfigurationError(
                "Invalid setting for 'early_stopping_metric', "
                "valid options: 'loss', 'ppl', 'eval_metric'.")

By the way, actual early stopping (i.e. training being stopped once the patience is exhausted) is not implemented yet, right? Is this an open TODO?
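Returning to the metric-direction problem above, a hedged sketch of one possible fix: treat accuracy-style metrics as "higher is better" alongside BLEU/ChrF when deciding the direction (the exact metric names in the set are assumptions):

MAXIMIZE_METRICS = {"bleu", "chrf", "sequence_accuracy"}

def wants_minimization(early_stopping_metric: str, eval_metric: str) -> bool:
    if early_stopping_metric in ("ppl", "loss"):
        return True                              # lower perplexity/loss is better
    if early_stopping_metric == "eval_metric":
        return eval_metric not in MAXIMIZE_METRICS
    raise ValueError("Invalid setting for 'early_stopping_metric'")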

Multi-GPU training

Does joeynmt support multi-GPU training? After reading the source code, I haven't found any code distributing the graph across different devices. I might be wrong. Thanks for the work!

Size mismatch during validation

Describe the bug
The model ran perfectly with some of my dataset. However, for one of the datasets, the following bug appears. The log is as below.

Logged output
2020-05-24 18:50:40,236 EPOCH 1
2020-05-24 18:50:40,706 Epoch 1 Step: 10 Batch Loss: 129.308456 Tokens per Sec: 6443, Lr: 0.000014
2020-05-24 18:50:41,086 Epoch 1 Step: 20 Batch Loss: 88.039078 Tokens per Sec: 8751, Lr: 0.000028
2020-05-24 18:50:41,476 Epoch 1 Step: 30 Batch Loss: 94.100029 Tokens per Sec: 7615, Lr: 0.000042
2020-05-24 18:50:41,863 Epoch 1 Step: 40 Batch Loss: 128.856537 Tokens per Sec: 7575, Lr: 0.000056
2020-05-24 18:50:42,255 Epoch 1 Step: 50 Batch Loss: 114.877518 Tokens per Sec: 8100, Lr: 0.000070
2020-05-24 18:50:42,737 Epoch 1 Step: 60 Batch Loss: 85.376068 Tokens per Sec: 8259, Lr: 0.000084
2020-05-24 18:50:43,267 Epoch 1 Step: 70 Batch Loss: 298.312286 Tokens per Sec: 5568, Lr: 0.000098
2020-05-24 18:50:43,812 Epoch 1 Step: 80 Batch Loss: 101.327324 Tokens per Sec: 6123, Lr: 0.000112
2020-05-24 18:50:44,323 Epoch 1 Step: 90 Batch Loss: 172.604233 Tokens per Sec: 6887, Lr: 0.000126
2020-05-24 18:50:44,822 Epoch 1 Step: 100 Batch Loss: 93.117615 Tokens per Sec: 5750, Lr: 0.000140
Traceback (most recent call last):
File "/home/xingyuaz/anaconda3/envs/my_fisrt_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/xingyuaz/anaconda3/envs/my_fisrt_env/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/xingyuaz/test_joeynmt/joeynmt/joeynmt/__main__.py", line 41, in <module>
main()
File "/home/xingyuaz/test_joeynmt/joeynmt/joeynmt/__main__.py", line 29, in main
train(cfg_file=args.config_path)
File "/home/xingyuaz/test_joeynmt/joeynmt/joeynmt/training.py", line 650, in train
trainer.train_and_validate(train_data=train_data, valid_data=dev_data)
File "/home/xingyuaz/test_joeynmt/joeynmt/joeynmt/training.py", line 375, in train_and_validate
batch_type=self.eval_batch_type
File "/home/xingyuaz/test_joeynmt/joeynmt/joeynmt/prediction.py", line 98, in validate_on_data
batch, loss_function=loss_function)
File "/home/xingyuaz/test_joeynmt/joeynmt/joeynmt/model.py", line 133, in get_loss_for_batch
trg_mask=batch.trg_mask)
File "/home/xingyuaz/test_joeynmt/joeynmt/joeynmt/model.py", line 74, in forward
src_mask=src_mask)
File "/home/xingyuaz/test_joeynmt/joeynmt/joeynmt/model.py", line 92, in encode
return self.encoder(self.src_embed(src), src_length, src_mask)
File "/home/xingyuaz/anaconda3/envs/my_fisrt_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/xingyuaz/test_joeynmt/joeynmt/joeynmt/encoders.py", line 216, in forward
x = self.pe(embed_src) # add position encoding to word embeddings
File "/home/xingyuaz/anaconda3/envs/my_fisrt_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/xingyuaz/test_joeynmt/joeynmt/joeynmt/transformer_layers.py", line 159, in forward
return emb + self.pe[:, :emb.size(1)]
RuntimeError: The size of tensor a (9098) must match the size of tensor b (5000) at non-singleton dimension 1
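For context, the sinusoidal positional-encoding table is precomputed with a fixed maximum length (5000 here), and a 9098-token input overruns it. A toy reproduction of the failing condition (small model dimension, not the actual module):

import torch

max_len, d_model = 5000, 8
pe = torch.zeros(1, max_len, d_model)   # precomputed positional-encoding table
emb = torch.zeros(1, 9098, d_model)     # an unusually long input sequence
print(emb.size(1) <= pe.size(1))        # False -> emb + pe[:, :emb.size(1)] cannot broadcast
# Filtering or segmenting very long sequences (cf. max_sent_length) avoids hitting this limit.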

Query regarding how to distribute raw data

Hi,

I am trying to run a joeynmt model with this config file (transformer_iwslt14_deen_bpe.yaml). It ran successfully, but the BLEU score is very low, around 23.18. I suppose that might be because my dataset is small, which is why I am getting a low BLEU score. I have a few questions regarding this. I am a very new user and don't have much experience.

  1. What should be the size of the train, valid and test files?
  2. I already have a lot of raw data for the train file, but where can I download data for the valid set?
  3. Should I use the same parameters that are given by joeynmt in the config file, or can I change parameters like batch_size, logging_freq, validation_freq, num_layers and num_heads?
  4. Should the above parameters be chosen according to the size of the data?

Also, currently the size of the data that I am using is the following, after tokenization and BPE formatting:
train: 769107 KB
valid: 120,705 kB
test: 441 KB

Thanks

Joeynmt sees all available GPU devices but attempts to use memory of only 1 GPU. It complains CUDA out of memory

Describe the bug
Available GPUs: 4x Tesla V100-SXM2 with 15.78 GB RAM each. JoeyNMT sees them:
2020-08-25 21:36:56.896921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1, 2, 3

During self.forward CUDA out of memory pops up:
RuntimeError: CUDA out of memory. Tried to allocate 21.71 GiB (GPU 0; 15.78 GiB total capacity; 4.20 GiB already allocated; 10.50 GiB free; 4.21 GiB reserved in total by PyTorch)

How does JoeyNMT utilize the availability of multiple GPUs?

System (please complete the following information):

  • OS: Ubuntu 18.04
  • CPU / GPU
  • Python Version: 3.6.9
  • CUDA 10.1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   33C    P0    39W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:05:00.0 Off |                    0 |
| N/A   30C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:09:00.0 Off |                    0 |
| N/A   29C    P0    39W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   28C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

get_iwslt14_bpe.sh broken after link changed

The link that the script get_iwslt14_bpe.sh uses to download the data is broken and now leads to a 404 page. They moved the hosting of the data to a google drive, which I don't think you can access with curl. Additionally, the tgz file on google drive is all languages so some of the logic will need to be re-written. Since I need to use this data, I'll try to write a new script for the iwslt14 data.

Saving checkpoints in continued training

Hi there,

I have tried loading a previous checkpoint and continuing my training, but when I do this the training goes on without saving the checkpoints. It does the validation at every specified step but never saves the model in the directory. Is there a specific option I should set in the config for continued training?
Thanks so much!
