dl4mt-c2c's Introduction

Fully Character-Level Neural Machine Translation

Theano implementation of the models described in the paper Fully Character-Level Neural Machine Translation without Explicit Segmentation.

We present code for training and decoding four different models:

  1. bilingual bpe2char (from Chung et al., 2016)
  2. bilingual char2char
  3. multilingual bpe2char
  4. multilingual char2char

Dependencies

Python

  • Theano
  • Numpy
  • NLTK

GPU

  • CUDA (we recommend the latest version; version 8.0 was used in all our experiments)

Related code

Downloading Datasets & Pre-trained Models

The original WMT'15 corpora can be downloaded from here. For the preprocessed corpora used in our experiments, see below.

To obtain the pre-trained top-performing models, see below.

  • Pre-trained models (6.0GB): Tarball updated on Nov 21st 2016. The CS-EN bi-char2char model in the previous tarball was not the best-performing model.

Training Details

Using GPUs

Do the following before executing train*.py.

$ export THEANO_FLAGS=device=gpu,floatX=float32

If your GPU has enough free memory, enabling cnmem may speed up training:

$ export THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.95,allow_gc=False

On a pre-2016 Titan X GPU with 12GB of memory, our bpe2char models were trained with cnmem. Our char2char models (both bilingual and multilingual) were trained without cnmem because of insufficient GPU memory.
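The same flags can also be set from inside Python, as long as this happens before Theano is imported, since Theano reads THEANO_FLAGS at import time. The following is a minimal sketch; the helper name build_theano_flags is hypothetical and not part of this repository.

```python
import os


def build_theano_flags(use_cnmem=False, cnmem_fraction=0.95):
    """Return a THEANO_FLAGS string matching the two settings shown above.

    Hypothetical helper, not part of dl4mt-c2c.
    """
    flags = ["device=gpu", "floatX=float32"]
    if use_cnmem:
        # cnmem pre-allocates a fraction of GPU memory; only use it if it fits.
        flags.append("lib.cnmem={}".format(cnmem_fraction))
        flags.append("allow_gc=False")
    return ",".join(flags)


# This assignment must run before any `import theano` in the same process.
os.environ["THEANO_FLAGS"] = build_theano_flags(use_cnmem=True)
print(os.environ["THEANO_FLAGS"])
```

Calling build_theano_flags() with no arguments gives the plain setting without cnmem, which is the one to fall back to when memory is tight.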

Training models

Before running the commands below, edit train*.py so that it points to the directory containing the WMT'15 corpora.

Bilingual bpe2char

$ python bpe2char/train_bi_bpe2char.py -translate <LANGUAGE_PAIR>

Bilingual char2char

$ python char2char/train_bi_char2char.py -translate <LANGUAGE_PAIR>

Multilingual bpe2char

$ python bpe2char/train_multi_bpe2char.py 

Multilingual char2char

$ python char2char/train_multi_char2char.py 

Checkpoint

To resume training a model from a checkpoint, append -re_load and -re_load_old_setting to the training commands above. Make sure the checkpoint resides in the correct directory (.../dl4mt-c2c/models).

Using Custom Datasets

To train models on your own dataset (rather than the WMT'15 corpora), you first need to learn a vocabulary: use build_dictionary_char.py for a char2char model, or build_dictionary_word.py for a bpe2char model. For a bpe2char model, you additionally need to learn BPE segmentation rules on the source corpus using the Subword-NMT repository (see below).
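For intuition, learning a character vocabulary amounts to counting characters in the corpus and assigning each one an integer index. The sketch below illustrates that idea only; it is not the repository's build_dictionary_char.py, and the function name, the special tokens and the index ordering are assumptions made for this example. For real training, use the repository's script, since the training code expects its output format.

```python
from collections import Counter, OrderedDict


def build_char_vocab(lines, special_tokens=("<eos>", "<unk>")):
    """Map each character seen in `lines` to an integer index.

    Hypothetical illustration: special tokens come first, then characters
    sorted by descending frequency (ties broken by the character itself).
    """
    counts = Counter()
    for line in lines:
        counts.update(line.rstrip("\n"))  # spaces count as characters too

    vocab = OrderedDict((tok, idx) for idx, tok in enumerate(special_tokens))
    for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[ch] = len(vocab)
    return vocab


corpus = ["abba", "cab"]  # toy stand-in for a training file
vocab = build_char_vocab(corpus)
for token, index in vocab.items():
    print(index, repr(token))
```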

Decoding

Decoding WMT'15 validation / test files

Before running the commands below, edit translate*.py so that it points to the directory containing the WMT'15 corpora.

$ export THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.95,allow_gc=False
$ python translate/translate_bpe2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> # for bpe2char models
$ python translate/translate_char2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> # for char2char models

When choosing a pre-trained model to pass to -model, pick a file whose name contains .grads, e.g. .grads.123000.npz. These are the best-performing models, and the ones you should decode from.
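To pick such files out of a directory listing, a small filter on the file name is enough. This is a sketch under the assumption that the files end in .grads.<number>.npz, as in the example above; the function name and the sample file names are hypothetical.

```python
import re

# Matches names ending in ".grads.<digits>.npz", e.g. "x.grads.123000.npz".
GRADS_PATTERN = re.compile(r"\.grads\.(\d+)\.npz$")


def grads_checkpoints(filenames):
    """Return (number, filename) pairs for the .grads models, sorted by number."""
    found = []
    for name in filenames:
        match = GRADS_PATTERN.search(name)
        if match:
            found.append((int(match.group(1)), name))
    return sorted(found)


# Hypothetical directory listing, for illustration only.
names = [
    "bi-char2char.npz",
    "bi-char2char.grads.123000.npz",
    "bi-char2char.npz.pkl",
]
for number, name in grads_checkpoints(names):
    print(number, name)
```

In practice the list of names could come from os.listdir on the directory holding the extracted models.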

Decoding an arbitrary file

Remove -which <VALID/TEST_SET> and append -source <PATH_TO_SOURCE>.

If you decode your own source file, make sure it is:

  1. properly tokenized (using preprocess/preprocess.sh).
  2. BPE-tokenized, for bpe2char models.
  3. converted from Cyrillic to Latin characters, for multilingual models.

Decoding multilingual models

Append -many, and pass the path to a multilingual model to -model.
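The decoding options described above combine as follows: every call takes -model, -translate and -saveto, then either -which (WMT'15 validation / test files) or -source (an arbitrary file), plus -many for multilingual models. The helper below is a hypothetical sketch that assembles such a command line; it only uses the flags listed in this README, and the paths and the language pair string in the example are placeholders.

```python
def build_decode_command(script, model, language_pair, saveto,
                         which=None, source=None, many=False):
    """Assemble the argument list for a translate_*.py call.

    Hypothetical helper, not part of dl4mt-c2c. Exactly one of `which`
    (a WMT'15 validation / test set) or `source` (your own file) is required.
    """
    if (which is None) == (source is None):
        raise ValueError("pass exactly one of `which` or `source`")

    cmd = ["python", script,
           "-model", model,
           "-translate", language_pair,
           "-saveto", saveto]
    if which is not None:
        cmd += ["-which", which]
    else:
        cmd += ["-source", source]
    if many:
        cmd.append("-many")  # multilingual model
    return cmd


cmd = build_decode_command(
    script="translate/translate_char2char.py",
    model="models/multi-char2char.grads.123000.npz",  # placeholder path
    language_pair="de_en",                            # placeholder pair string
    saveto="output.txt",
    source="my_source.txt",
    many=True,
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.call once THEANO_FLAGS has been exported as shown earlier.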

Evaluation

We use the multi-bleu.perl script from MOSES to compute the BLEU score. The reference translations can be found in .../wmt15.

$ perl preprocess/multi-bleu.perl reference.txt < model_output.txt

Extra

Extracting & applying BPE rules

Clone the Subword-NMT repository.

$ git clone https://github.com/rsennrich/subword-nmt

Use the following commands (see the Subword-NMT repository for more information):

$ ./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
$ ./apply_bpe.py -c {codes_file} < {test_file}
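For intuition about what the learning step computes: BPE repeatedly finds the most frequent pair of adjacent symbols in the corpus and merges it into a new symbol, {num_operations} times. The toy sketch below illustrates that loop on a word-frequency dictionary. It is not the Subword-NMT implementation; the function names, the end-of-word marker and the tie-breaking are assumptions made for this example.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_operations):
    """Return the list of merges learned from a {word: count} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count
             for word, count in word_counts.items()}
    merges = []
    for _ in range(num_operations):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


print(learn_bpe({"abab": 1}, num_operations=1))
```

For real data, use learn_bpe.py and apply_bpe.py as shown above, so that the segmentation matches what the bpe2char models expect.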

Converting Cyrillic to Latin

$ python preprocess/iso.py russian_source.txt

This writes the converted text to russian_source.txt.iso9.
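Conceptually, the conversion is a character-by-character lookup. The sketch below uses a small subset of the ISO 9 Cyrillic-to-Latin table to show the idea; it is not the table or the logic in preprocess/iso.py, and how that script treats upper case or unlisted characters is not assumed here. In this example, characters outside the table are passed through unchanged.

```python
# Partial ISO 9 mapping, lower case only. Illustrative subset, not the
# table used by preprocess/iso.py.
ISO9_SUBSET = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "и": "i", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "ж": "ž", "ч": "č", "ш": "š",
}


def to_latin(text, table=ISO9_SUBSET):
    """Transliterate `text`, leaving characters not in `table` untouched."""
    return "".join(table.get(ch, ch) for ch in text)


print(to_latin("мир и дом"))
```

For actual preprocessing, run preprocess/iso.py so that the input matches what the multilingual models were trained on.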

Citation

@article{Lee:16,
  author    = {Jason Lee and Kyunghyun Cho and Thomas Hofmann},
  title     = {Fully Character-Level Neural Machine Translation without Explicit Segmentation},
  year      = {2016},
  journal   = {arXiv preprint arXiv:1610.03017},
}
