
Neural Machine Translation (seq2seq) Tutorial

Authors: Thang Luong, Eugene Brevdo, Rui Zhao (Google Research Blogpost, Github)

This version of the tutorial requires TensorFlow Nightly. To use a stable TensorFlow version, please consider other branches such as tf-1.4.

If you make use of this codebase for your research, please cite this.

Introduction

Sequence-to-sequence (seq2seq) models (Sutskever et al., 2014, Cho et al., 2014) have enjoyed great success in a variety of tasks such as machine translation, speech recognition, and text summarization. This tutorial gives readers a full understanding of seq2seq models and shows how to build a competitive seq2seq model from scratch. We focus on the task of Neural Machine Translation (NMT), which was the very first testbed for seq2seq models, with wild success. The included code is lightweight, high-quality, production-ready, and incorporates the latest research ideas. We achieve this goal by:

  1. Using the recent decoder / attention wrapper API and the TensorFlow 1.2 data iterator
  2. Incorporating our strong expertise in building recurrent and seq2seq models
  3. Providing tips and tricks for building the very best NMT models and replicating Google’s NMT (GNMT) system.

We believe that it is important to provide benchmarks that people can easily replicate. As a result, we have provided full experimental results and pretrained models on the following publicly available datasets:

  1. Small-scale: English-Vietnamese parallel corpus of TED talks (133K sentence pairs) provided by the IWSLT Evaluation Campaign.
  2. Large-scale: German-English parallel corpus (4.5M sentence pairs) provided by the WMT Evaluation Campaign.

We first build up some basic knowledge about seq2seq models for NMT, explaining how to build and train a vanilla NMT model. The second part will go into the details of building a competitive NMT model with an attention mechanism. We then discuss tips and tricks to build the best possible NMT models (both in speed and translation quality), such as TensorFlow best practices (batching, bucketing), bidirectional RNNs, beam search, as well as scaling up to multiple GPUs using GNMT attention.

Basic

Background on Neural Machine Translation

Back in the old days, traditional phrase-based translation systems performed their task by breaking up source sentences into multiple chunks and then translating them phrase by phrase. This led to disfluency in the translation outputs and was not quite like how we, humans, translate. We read the entire source sentence, understand its meaning, and then produce a translation. Neural Machine Translation (NMT) mimics that!


Figure 1. Encoder-decoder architecture – example of a general approach for NMT. An encoder converts a source sentence into a "meaning" vector which is passed through a decoder to produce a translation.

Specifically, an NMT system first reads the source sentence using an encoder to build a "thought" vector, a sequence of numbers that represents the sentence meaning; a decoder, then, processes the sentence vector to emit a translation, as illustrated in Figure 1. This is often referred to as the encoder-decoder architecture. In this manner, NMT addresses the local translation problem in the traditional phrase-based approach: it can capture long-range dependencies in languages, e.g., gender agreements; syntax structures; etc., and produce much more fluent translations as demonstrated by Google Neural Machine Translation systems.

NMT models vary in terms of their exact architectures. A natural choice for sequential data is the recurrent neural network (RNN), used by most NMT models. Usually an RNN is used for both the encoder and decoder. The RNN models, however, differ in terms of: (a) directionality – unidirectional or bidirectional; (b) depth – single- or multi-layer; and (c) type – often either a vanilla RNN, a Long Short-term Memory (LSTM), or a gated recurrent unit (GRU). Interested readers can find more information about RNNs and LSTM on this blog post.

In this tutorial, we consider as an example a deep multi-layer RNN which is unidirectional and uses LSTM as a recurrent unit. We show an example of such a model in Figure 2. In this example, we build a model to translate a source sentence "I am a student" into a target sentence "Je suis étudiant". At a high level, the NMT model consists of two recurrent neural networks: the encoder RNN simply consumes the input source words without making any prediction; the decoder, on the other hand, processes the target sentence while predicting the next words.

For more information, we refer readers to Luong (2016) which this tutorial is based on.


Figure 2. Neural machine translation – example of a deep recurrent architecture for translating a source sentence "I am a student" into a target sentence "Je suis étudiant". Here, "<s>" marks the start of the decoding process while "</s>" tells the decoder to stop.

Installing the Tutorial

To install this tutorial, you need to have TensorFlow installed on your system. This tutorial requires TensorFlow Nightly. To install TensorFlow, follow the installation instructions here.

Once TensorFlow is installed, you can download the source code of this tutorial by running:

git clone https://github.com/tensorflow/nmt/

Training – How to build our first NMT system

Let's first dive into the heart of building an NMT model with concrete code snippets through which we will explain Figure 2 in more detail. We defer data preparation and the full code to later. This part refers to file model.py.

At the bottom layer, the encoder and decoder RNNs receive as input the following: first, the source sentence, then a boundary marker "<s>" which indicates the transition from the encoding to the decoding mode, and the target sentence. For training, we will feed the system with the following tensors, which are in time-major format and contain word indices:

  • encoder_inputs [max_encoder_time, batch_size]: source input words.
  • decoder_inputs [max_decoder_time, batch_size]: target input words.
  • decoder_outputs [max_decoder_time, batch_size]: target output words, these are decoder_inputs shifted to the left by one time step with an end-of-sentence tag appended on the right.

Here for efficiency, we train with multiple sentences (batch_size) at once. Testing is slightly different, so we will discuss it later.

Embedding

Given the categorical nature of words, the model must first look up the source and target embeddings to retrieve the corresponding word representations. For this embedding layer to work, a vocabulary is first chosen for each language. Usually, a vocabulary size V is selected, and only the most frequent V words are treated as unique. All other words are converted to an "unknown" token and all get the same embedding. The embedding weights, one set per language, are usually learned during training.

# Embedding (variable_scope and embedding_ops come from
# tensorflow.python.ops; tf.get_variable and tf.nn.embedding_lookup
# are the equivalent public API calls)
embedding_encoder = variable_scope.get_variable(
    "embedding_encoder", [src_vocab_size, embedding_size], ...)
# Look up embedding:
#   encoder_inputs: [max_time, batch_size]
#   encoder_emb_inp: [max_time, batch_size, embedding_size]
encoder_emb_inp = embedding_ops.embedding_lookup(
    embedding_encoder, encoder_inputs)

Similarly, we can build embedding_decoder and decoder_emb_inp. Note that one can choose to initialize embedding weights with pretrained word representations such as word2vec or GloVe vectors. In general, given a large amount of training data we can learn these embeddings from scratch.
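For instance, a minimal sketch of initializing from pretrained vectors (pretrained_emb here is a hypothetical numpy array of shape [src_vocab_size, embedding_size] loaded beforehand):

# Initialize the embedding table from pretrained vectors instead of
# learning it from scratch. With a Tensor initializer, get_variable
# infers the shape from the initializer itself.
embedding_encoder = tf.get_variable(
    "embedding_encoder",
    initializer=tf.constant(pretrained_emb, dtype=tf.float32))

Passing trainable=False to get_variable would keep the pretrained vectors frozen during training.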

Encoder

Once retrieved, the word embeddings are then fed as input into the main network, which consists of two multi-layer RNNs – an encoder for the source language and a decoder for the target language. These two RNNs, in principle, can share the same weights; however, in practice, we often use two different RNN parameters (such models do a better job when fitting large training datasets). The encoder RNN uses zero vectors as its starting states and is built as follows:

# Build RNN cell
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

# Run Dynamic RNN
#   encoder_outputs: [max_time, batch_size, num_units]
#   encoder_state: [batch_size, num_units]
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_emb_inp,
    sequence_length=source_sequence_length, time_major=True)

Note that sentences have different lengths; to avoid wasting computation, we tell dynamic_rnn the exact source sentence lengths through source_sequence_length. Since our input is time-major, we set time_major=True. Here, we build only a single-layer LSTM, encoder_cell. We will describe how to build multi-layer LSTMs, add dropout, and use attention in a later section.
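As a preview, here is a minimal sketch of such a multi-layer cell with dropout; num_layers and dropout are assumed hyperparameters (as in the hands-on example later):

# Stack num_layers LSTM cells, each wrapped with input dropout.
cells = []
for _ in range(num_layers):
  cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
  cell = tf.nn.rnn_cell.DropoutWrapper(
      cell, input_keep_prob=1.0 - dropout)
  cells.append(cell)
encoder_cell = tf.nn.rnn_cell.MultiRNNCell(cells)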

Decoder

The decoder also needs to have access to the source information, and one simple way to achieve that is to initialize it with the last hidden state of the encoder, encoder_state. In Figure 2, we pass the hidden state at the source word "student" to the decoder side.

# Build RNN cell
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
# Helper
helper = tf.contrib.seq2seq.TrainingHelper(
    decoder_emb_inp, decoder_lengths, time_major=True)
# Decoder
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, encoder_state,
    output_layer=projection_layer)
# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)
logits = outputs.rnn_output

Here, the core part of this code is the BasicDecoder object, decoder, which receives decoder_cell (similar to encoder_cell), a helper, and the previous encoder_state as inputs. By separating out decoders and helpers, we can reuse different codebases; e.g., TrainingHelper can be substituted with GreedyEmbeddingHelper to do greedy decoding. See more in helper.py.

Lastly, we haven't mentioned projection_layer, which is a dense matrix that turns the top hidden states into logit vectors of dimension V. We illustrate this process at the top of Figure 2.

# layers_core is tensorflow.python.layers.core (layers_core.Dense is
# tf.layers.Dense in the public API)
projection_layer = layers_core.Dense(
    tgt_vocab_size, use_bias=False)

Loss

Given the logits above, we are now ready to compute our training loss:

crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=decoder_outputs, logits=logits)
train_loss = (tf.reduce_sum(crossent * target_weights) /
    batch_size)

Here, target_weights is a zero-one matrix of the same size as decoder_outputs. It masks padding positions outside of the target sequence lengths with the value 0.
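One way to build such a mask (a sketch; decoder_lengths is the same tensor passed to TrainingHelper above, and max_decoder_time is an assumed name for the padded target length):

# 1.0 at real target positions, 0.0 at padding positions.
target_weights = tf.sequence_mask(
    decoder_lengths, max_decoder_time, dtype=logits.dtype)
# sequence_mask returns [batch_size, max_time]; transpose to time-major.
target_weights = tf.transpose(target_weights)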

Important note: It's worth pointing out that we divide the loss by batch_size, so our hyperparameters are "invariant" to batch_size. Some people divide the loss by (batch_size * num_time_steps), which plays down the errors made on short sentences. More subtly, our hyperparameters (applied to the former way) can't be used for the latter way. For example, if both approaches use SGD with a learning rate of 1.0, the latter approach effectively uses a much smaller learning rate of 1 / num_time_steps.
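For concreteness, the alternative normalization would look like this (a sketch; num_time_steps is a hypothetical name for the number of decoding steps):

# Per-token loss: down-weights errors on short sentences relative to
# the per-sentence normalization used above.
train_loss_per_token = (tf.reduce_sum(crossent * target_weights) /
                        tf.to_float(batch_size * num_time_steps))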

Gradient computation & optimization

We have now defined the forward pass of our NMT model. Computing the backpropagation pass is just a matter of a few lines of code:

# Calculate and clip gradients
params = tf.trainable_variables()
gradients = tf.gradients(train_loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(
    gradients, max_gradient_norm)

One of the important steps in training RNNs is gradient clipping. Here, we clip by the global norm. The max value, max_gradient_norm, is often set to a value like 5 or 1. The last step is selecting the optimizer. The Adam optimizer is a common choice. We also select a learning rate. The value of learning_rate is usually in the range 0.0001 to 0.001 and can be set to decrease as training progresses.

# Optimization
optimizer = tf.train.AdamOptimizer(learning_rate)
update_step = optimizer.apply_gradients(
    zip(clipped_gradients, params))

In our own experiments, we use standard SGD (tf.train.GradientDescentOptimizer) with a decreasing learning rate schedule, which yields better performance. See the benchmarks.
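A sketch of such a decreasing schedule, with constants mirroring the IWSLT benchmark setup described later (halve the rate every decay_steps=1000 steps after start_decay_step=8000 steps; both names are assumptions):

global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.cond(
    global_step < start_decay_step,
    lambda: tf.constant(1.0),
    lambda: tf.train.exponential_decay(
        1.0, global_step - start_decay_step,
        decay_steps, 0.5, staircase=True))
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
update_step = optimizer.apply_gradients(
    zip(clipped_gradients, params), global_step=global_step)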

Hands-on – Let's train an NMT model

Let's train our very first NMT model, translating from Vietnamese to English! The entry point of our code is nmt.py.

We will use a small-scale parallel corpus of TED talks (133K training examples) for this exercise. All data we used here can be found at: https://nlp.stanford.edu/projects/nmt/. We will use tst2012 as our dev dataset, and tst2013 as our test dataset.

Run the following command to download the data for training the NMT model:
nmt/scripts/download_iwslt15.sh /tmp/nmt_data

Run the following command to start the training:

mkdir /tmp/nmt_model
python -m nmt.nmt \
    --src=vi --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab  \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/tst2012  \
    --test_prefix=/tmp/nmt_data/tst2013 \
    --out_dir=/tmp/nmt_model \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu

The above command trains a 2-layer LSTM seq2seq model with 128-dim hidden units and embeddings for 12,000 steps (~12 epochs). We use a dropout value of 0.2 (keep probability 0.8). If there are no errors, we should see logs similar to the ones below, with decreasing perplexity values as we train.

# First evaluation, global step 0
  eval dev: perplexity 17193.66
  eval test: perplexity 17193.27
# Start epoch 0, step 0, lr 1, Tue Apr 25 23:17:41 2017
  sample train data:
    src_reverse: </s> </s> Điều đó , dĩ nhiên , là câu chuyện trích ra từ học thuyết của Karl Marx .
    ref: That , of course , was the <unk> distilled from the theories of Karl Marx . </s> </s> </s>
  epoch 0 step 100 lr 1 step-time 0.89s wps 5.78K ppl 1568.62 bleu 0.00
  epoch 0 step 200 lr 1 step-time 0.94s wps 5.91K ppl 524.11 bleu 0.00
  epoch 0 step 300 lr 1 step-time 0.96s wps 5.80K ppl 340.05 bleu 0.00
  epoch 0 step 400 lr 1 step-time 1.02s wps 6.06K ppl 277.61 bleu 0.00
  epoch 0 step 500 lr 1 step-time 0.95s wps 5.89K ppl 205.85 bleu 0.00

See train.py for more details.

We can start TensorBoard to view the summary of the model during training:

tensorboard --port 22222 --logdir /tmp/nmt_model/

Training the reverse direction, from English to Vietnamese, can be done simply by changing:
--src=en --tgt=vi

Inference – How to generate translations

While you're training your NMT models (and once you have trained models), you can obtain translations given previously unseen source sentences. This process is called inference. There is a clear distinction between training and inference (testing): at inference time, we only have access to the source sentence, i.e., encoder_inputs. There are many ways to perform decoding, including greedy, sampling, and beam-search decoding. Here, we will discuss the greedy decoding strategy.

The idea is simple and we illustrate it in Figure 3:

  1. We still encode the source sentence in the same way as during training to obtain an encoder_state, and this encoder_state is used to initialize the decoder.
  2. The decoding (translation) process starts as soon as the decoder receives a starting symbol "<s>" (referred to as tgt_sos_id in our code);
  3. For each timestep on the decoder side, we treat the RNN's output as a set of logits. We choose the most likely word, the id associated with the maximum logit value, as the emitted word (this is the "greedy" behavior). For example in Figure 3, the word "moi" has the highest translation probability in the first decoding step. We then feed this word as input to the next timestep.
  4. The process continues until the end-of-sentence marker "</s>" is produced as an output symbol (referred to as tgt_eos_id in our code).


Figure 3. Greedy decoding – example of how a trained NMT model produces a translation for a source sentence "Je suis étudiant" using greedy search.

Step 3 is what makes inference different from training. Instead of always feeding the correct target words as an input, inference uses words predicted by the model. Here's the code to achieve greedy decoding. It is very similar to the training decoder.

# Helper
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding_decoder,
    tf.fill([batch_size], tgt_sos_id), tgt_eos_id)

# Decoder
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, encoder_state,
    output_layer=projection_layer)
# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, maximum_iterations=maximum_iterations)
translations = outputs.sample_id

Here, we use GreedyEmbeddingHelper instead of TrainingHelper. Since we do not know the target sequence lengths in advance, we use maximum_iterations to limit the translation lengths. One heuristic is to decode up to two times the source sentence lengths.

maximum_iterations = tf.round(tf.reduce_max(source_sequence_length) * 2)

Having trained a model, we can now create an inference file and translate some sentences:

cat > /tmp/my_infer_file.vi
# (copy and paste some sentences from /tmp/nmt_data/tst2013.vi)

python -m nmt.nmt \
    --out_dir=/tmp/nmt_model \
    --inference_input_file=/tmp/my_infer_file.vi \
    --inference_output_file=/tmp/nmt_model/output_infer

cat /tmp/nmt_model/output_infer # To view the inference as output

Note that the above commands can also be run while the model is still being trained, as long as there exists a training checkpoint. See inference.py for more details.

Intermediate

Having gone through the most basic seq2seq model, let's get more advanced! To build state-of-the-art neural machine translation systems, we will need more "secret sauce": the attention mechanism, which was first introduced by Bahdanau et al., 2015, then later refined by Luong et al., 2015 and others. The key idea of the attention mechanism is to establish direct short-cut connections between the target and the source by paying "attention" to relevant source content as we translate. A nice byproduct of the attention mechanism is an easy-to-visualize alignment matrix between the source and target sentences (as shown in Figure 4).


Figure 4. Attention visualization – example of the alignments between source and target sentences. Image is taken from (Bahdanau et al., 2015).

Remember that in the vanilla seq2seq model, we pass the last source state from the encoder to the decoder when starting the decoding process. This works well for short and medium-length sentences; however, for long sentences, the single fixed-size hidden state becomes an information bottleneck. Instead of discarding all of the hidden states computed in the source RNN, the attention mechanism provides an approach that allows the decoder to peek at them (treating them as a dynamic memory of the source information). By doing so, the attention mechanism improves the translation of longer sentences. Nowadays, attention mechanisms are the de facto standard and have been successfully applied to many other tasks (including image caption generation, speech recognition, and text summarization).

Background on the Attention Mechanism

We now describe an instance of the attention mechanism proposed in (Luong et al., 2015), which has been used in several state-of-the-art systems including open-source toolkits such as OpenNMT and in the TF seq2seq API in this tutorial. We will also provide connections to other variants of the attention mechanism.


Figure 5. Attention mechanism – example of an attention-based NMT system as described in (Luong et al., 2015). We highlight in detail the first step of the attention computation. For clarity, we don't show the embedding and projection layers of Figure 2.

As illustrated in Figure 5, the attention computation happens at every decoder time step. It consists of the following stages:

  1. The current target hidden state is compared with all source states to derive attention weights (which can be visualized as in Figure 4).
  2. Based on the attention weights, we compute a context vector as the weighted average of the source states.
  3. The context vector is combined with the current target hidden state to yield the final attention vector.
  4. The attention vector is fed as an input to the next time step (input feeding).

The first three steps can be summarized by the equations below:
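$$\alpha_{ts} = \frac{\exp(\mathrm{score}(h_t, \overline{h}_s))}{\sum_{s'=1}^{S} \exp(\mathrm{score}(h_t, \overline{h}_{s'}))} \qquad \text{[Attention weights]} \qquad (1)$$

$$c_t = \sum_s \alpha_{ts} \overline{h}_s \qquad \text{[Context vector]} \qquad (2)$$

$$a_t = f(c_t, h_t) = \tanh(W_c [c_t; h_t]) \qquad \text{[Attention vector]} \qquad (3)$$

$$\mathrm{score}(h_t, \overline{h}_s) = \begin{cases} h_t^\top W \overline{h}_s & \text{[Luong's multiplicative style]} \\ v_a^\top \tanh(W_1 h_t + W_2 \overline{h}_s) & \text{[Bahdanau's additive style]} \end{cases} \qquad (4)$$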


Here, the function score is used to compare the target hidden state $$h_t$$ with each of the source hidden states $$\overline{h}_s$$, and the result is normalized to produce attention weights (a distribution over source positions). There are various choices for the scoring function; popular scoring functions include the multiplicative and additive forms given in Eq. (4). Once computed, the attention vector $$a_t$$ is used to derive the softmax logit and loss. This is similar to the target hidden state at the top layer of a vanilla seq2seq model. The function f can also take other forms.


Various implementations of attention mechanisms can be found in attention_wrapper.py.

What matters in the attention mechanism?

As hinted in the above equations, there are many different attention variants. These variants depend on the form of the scoring function and the attention function, and on whether the previous state $$h_{t-1}$$ is used instead of $$h_t$$ in the scoring function as originally suggested in (Bahdanau et al., 2015). Empirically, we found that only certain choices matter. First, the basic form of attention, i.e., direct connections between target and source, needs to be present. Second, it's important to feed the attention vector to the next timestep to inform the network about past attention decisions as demonstrated in (Luong et al., 2015). Lastly, choices of the scoring function can often result in different performance. See more in the benchmark results section.

Attention Wrapper API

In our implementation of the AttentionWrapper, we borrow some terminology from (Weston et al., 2015) in their work on memory networks. Instead of having readable & writable memory, the attention mechanism presented in this tutorial is a read-only memory. Specifically, the set of source hidden states (or their transformed versions, e.g., $$W\overline{h}_s$$ in Luong's scoring style or $$W_2\overline{h}_s$$ in Bahdanau's scoring style) is referred to as the "memory". At each time step, we use the current target hidden state as a "query" to decide on which parts of the memory to read. Usually, the query needs to be compared with keys corresponding to individual memory slots. In the above presentation of the attention mechanism, we happen to use the set of source hidden states (or their transformed versions, e.g., $$W_1h_t$$ in Bahdanau's scoring style) as "keys". One can be inspired by this memory-network terminology to derive other forms of attention!

Thanks to the attention wrapper, extending our vanilla seq2seq code with attention is trivial. This part refers to file attention_model.py.

First, we need to define an attention mechanism, e.g., from (Luong et al., 2015):

# attention_states: [batch_size, max_time, num_units]
attention_states = tf.transpose(encoder_outputs, [1, 0, 2])

# Create an attention mechanism
attention_mechanism = tf.contrib.seq2seq.LuongAttention(
    num_units, attention_states,
    memory_sequence_length=source_sequence_length)

In the previous Encoder section, encoder_outputs is the set of all source hidden states at the top layer and has the shape of [max_time, batch_size, num_units] (since we use dynamic_rnn with time_major set to True for efficiency). For the attention mechanism, we need to make sure the "memory" passed in is batch major, so we need to transpose attention_states. We pass source_sequence_length to the attention mechanism to ensure that the attention weights are properly normalized (over non-padding positions only).
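Swapping in the additive variant is a one-line change; a sketch using the same memory and lengths:

# Bahdanau-style (additive) attention over the same memory.
attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
    num_units, attention_states,
    memory_sequence_length=source_sequence_length)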

Having defined an attention mechanism, we use AttentionWrapper to wrap the decoding cell:

decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    decoder_cell, attention_mechanism,
    attention_layer_size=num_units)

The rest of the code is almost the same as in the Section Decoder!
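One detail worth calling out (a sketch; attention_model.py has the authoritative version): the wrapped cell defines its own state type, so the decoder's initial state is built from the wrapper's zero state and then seeded with the encoder state:

# AttentionWrapper keeps extra state (attention, alignments), so we
# clone its zero state and plug in the encoder's final state.
decoder_initial_state = decoder_cell.zero_state(
    batch_size, tf.float32).clone(cell_state=encoder_state)
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, decoder_initial_state,
    output_layer=projection_layer)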

Hands-on – building an attention-based NMT model

To enable attention, we need to use one of luong, scaled_luong, bahdanau or normed_bahdanau as the value of the attention flag during training. The flag specifies which attention mechanism we are going to use. In addition, we need to create a new directory for the attention model, so we don't reuse the previously trained basic NMT model.

Run the following command to start the training:

mkdir /tmp/nmt_attention_model

python -m nmt.nmt \
    --attention=scaled_luong \
    --src=vi --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab  \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/tst2012  \
    --test_prefix=/tmp/nmt_data/tst2013 \
    --out_dir=/tmp/nmt_attention_model \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu

After training, we can use the same inference command with the new out_dir:

python -m nmt.nmt \
    --out_dir=/tmp/nmt_attention_model \
    --inference_input_file=/tmp/my_infer_file.vi \
    --inference_output_file=/tmp/nmt_attention_model/output_infer

Tips & Tricks

Building Training, Eval, and Inference Graphs

When building a machine learning model in TensorFlow, it's often best to build three separate graphs:

  • The Training graph, which:

    • Batches, buckets, and possibly subsamples input data from a set of files/external inputs.
    • Includes the forward and backprop ops.
    • Constructs the optimizer, and adds the training op.
  • The Eval graph, which:

    • Batches and buckets input data from a set of files/external inputs.
    • Includes the training forward ops, and additional evaluation ops that aren't used for training.
  • The Inference graph, which:

    • May not batch input data.
    • Does not subsample or bucket input data.
    • Reads input data from placeholders (data can be fed directly to the graph via feed_dict or from a C++ TensorFlow serving binary).
    • Includes a subset of the model forward ops, and possibly additional special inputs/outputs for storing state between session.run calls.

Building separate graphs has several benefits:

  • The inference graph is usually very different from the other two, so it makes sense to build it separately.
  • The eval graph becomes simpler since it no longer has all the additional backprop ops.
  • Data feeding can be implemented separately for each graph.
  • Variable reuse is much simpler. For example, in the eval graph there's no need to reopen variable scopes with reuse=True just because the Training model created these variables already. So the same code can be reused without sprinkling reuse= arguments everywhere.
  • In distributed training, it is commonplace to have separate workers perform training, eval, and inference. These need to build their own graphs anyway. So building the system this way prepares you for distributed training.

The primary source of complexity becomes how to share Variables across the three graphs in a single machine setting. This is solved by using a separate session for each graph. The training session periodically saves checkpoints, and the eval session and the infer session restore parameters from checkpoints. The example below shows the main differences between the two approaches.

Before: Three models in a single graph and sharing a single Session

with tf.variable_scope('root'):
  train_inputs = tf.placeholder()
  train_op, loss = BuildTrainModel(train_inputs)
  initializer = tf.global_variables_initializer()

with tf.variable_scope('root', reuse=True):
  eval_inputs = tf.placeholder()
  eval_loss = BuildEvalModel(eval_inputs)

with tf.variable_scope('root', reuse=True):
  infer_inputs = tf.placeholder()
  inference_output = BuildInferenceModel(infer_inputs)

sess = tf.Session()

sess.run(initializer)

for i in itertools.count():
  train_input_data = ...
  sess.run([loss, train_op], feed_dict={train_inputs: train_input_data})

  if i % EVAL_STEPS == 0:
    while data_to_eval:
      eval_input_data = ...
      sess.run([eval_loss], feed_dict={eval_inputs: eval_input_data})

  if i % INFER_STEPS == 0:
    sess.run(inference_output, feed_dict={infer_inputs: infer_input_data})

After: Three models in three graphs, with three Sessions sharing the same Variables

train_graph = tf.Graph()
eval_graph = tf.Graph()
infer_graph = tf.Graph()

with train_graph.as_default():
  train_iterator = ...
  train_model = BuildTrainModel(train_iterator)
  initializer = tf.global_variables_initializer()

with eval_graph.as_default():
  eval_iterator = ...
  eval_model = BuildEvalModel(eval_iterator)

with infer_graph.as_default():
  infer_iterator, infer_inputs = ...
  infer_model = BuildInferenceModel(infer_iterator)

checkpoints_path = "/tmp/model/checkpoints"

train_sess = tf.Session(graph=train_graph)
eval_sess = tf.Session(graph=eval_graph)
infer_sess = tf.Session(graph=infer_graph)

train_sess.run(initializer)
train_sess.run(train_iterator.initializer)

for i in itertools.count():

  train_model.train(train_sess)

  if i % EVAL_STEPS == 0:
    checkpoint_path = train_model.saver.save(train_sess, checkpoints_path, global_step=i)
    eval_model.saver.restore(eval_sess, checkpoint_path)
    eval_sess.run(eval_iterator.initializer)
    while data_to_eval:
      eval_model.eval(eval_sess)

  if i % INFER_STEPS == 0:
    checkpoint_path = train_model.saver.save(train_sess, checkpoints_path, global_step=i)
    infer_model.saver.restore(infer_sess, checkpoint_path)
    infer_sess.run(infer_iterator.initializer, feed_dict={infer_inputs: infer_input_data})
    while data_to_infer:
      infer_model.infer(infer_sess)

Notice how the latter approach is "ready" to be converted to a distributed version.

One other difference in the new approach is that instead of using feed_dicts to feed data at each session.run call (and thereby performing our own batching, bucketing, and manipulating of data), we use stateful iterator objects. These iterators make the input pipeline much easier in both the single-machine and distributed setting. We will cover the new input data pipeline (as introduced in TensorFlow 1.2) in the next section.

Data Input Pipeline

Prior to TensorFlow 1.2, users had three options for feeding data to the TensorFlow training and eval pipelines:

  1. Feed data directly via feed_dict at each training session.run call.
  2. Use the queueing mechanisms in tf.train (e.g. tf.train.batch) and tf.contrib.train.
  3. Use helpers from a higher level framework like tf.contrib.learn or tf.contrib.slim (which effectively use #2).

The first approach is easier for users who aren't familiar with TensorFlow or need to do exotic input modification (e.g., their own minibatch queueing) that can only be done in Python. The second and third approaches are more standard but a little less flexible; they also require starting multiple Python threads (queue runners). Furthermore, if used incorrectly, queues can lead to deadlocks or opaque error messages. Nevertheless, queues are significantly more efficient than using feed_dict and are the standard for both single-machine and distributed training.

Starting in TensorFlow 1.2, there is a new system available for reading data into TensorFlow models: dataset iterators, as found in the tf.data module. Data iterators are flexible, easy to reason about and to manipulate, and provide efficiency and multithreading by leveraging the TensorFlow C++ runtime.

A dataset can be created from a batch data Tensor, a filename, or a Tensor containing multiple filenames. Some examples:

# Training dataset consists of multiple files.
train_dataset = tf.data.TextLineDataset(train_files)

# Evaluation dataset uses a single file, but we may
# point to a different file for each evaluation round.
eval_file = tf.placeholder(tf.string, shape=())
eval_dataset = tf.data.TextLineDataset(eval_file)

# For inference, feed input data to the dataset directly via feed_dict.
infer_batch = tf.placeholder(tf.string, shape=(num_infer_examples,))
infer_dataset = tf.data.Dataset.from_tensor_slices(infer_batch)

All datasets can be treated similarly via input processing. This includes reading and cleaning the data, bucketing (in the case of training and eval), filtering, and batching.

To convert each sentence into vectors of word strings, for example, we use the dataset map transformation:

dataset = dataset.map(lambda string: tf.string_split([string]).values)

We can then turn each sentence vector into a tuple containing both the vector and its dynamic length:

dataset = dataset.map(lambda words: (words, tf.size(words)))

Finally, we can perform a vocabulary lookup on each sentence. Given a lookup table object table, this map converts the first tuple elements from a vector of strings to a vector of integers.

dataset = dataset.map(lambda words, size: (table.lookup(words), size))
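The table itself can be built with TF's lookup ops; a sketch, assuming a vocabulary file with one token per line and a hypothetical UNK_ID for out-of-vocabulary words:

# Map any word not found in the vocabulary file to UNK_ID.
table = tf.contrib.lookup.index_table_from_file(
    "/tmp/nmt_data/vocab.vi", default_value=UNK_ID)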

Joining two datasets is also easy. If two files contain line-by-line translations of each other and each one is read into its own dataset, then a new dataset containing the tuples of the zipped lines can be created via:

source_target_dataset = tf.data.Dataset.zip((source_dataset, target_dataset))

Batching of variable-length sentences is straightforward. The following transformation batches batch_size elements from source_target_dataset, and respectively pads the source and target vectors to the length of the longest source and target vector in each batch.

batched_dataset = source_target_dataset.padded_batch(
        batch_size,
        padded_shapes=((tf.TensorShape([None]),  # source vectors of unknown size
                        tf.TensorShape([])),     # size(source)
                       (tf.TensorShape([None]),  # target vectors of unknown size
                        tf.TensorShape([]))),    # size(target)
        padding_values=((src_eos_id,  # source vectors padded on the right with src_eos_id
                         0),          # size(source) -- unused
                        (tgt_eos_id,  # target vectors padded on the right with tgt_eos_id
                         0)))         # size(target) -- unused

Values emitted from this dataset will be nested tuples whose tensors have a leftmost dimension of size batch_size. The structure will be:

  • iterator[0][0] has the batched and padded source sentence matrices.
  • iterator[0][1] has the batched source size vectors.
  • iterator[1][0] has the batched and padded target sentence matrices.
  • iterator[1][1] has the batched target size vectors.

Finally, bucketing that batches similarly-sized source sentences together is also possible. Please see the file utils/iterator_utils.py for more details and the full implementation.
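As a rough illustration of the idea (a sketch only; bucket_width is an assumed hyperparameter, and the full implementation handles more corner cases):

def key_func(src, tgt):
  # Assign each example to a bucket based on its source length.
  return tf.to_int64(src[1] // bucket_width)

def reduce_func(unused_key, windowed_data):
  # Pad and batch each bucket exactly as with padded_batch above.
  return windowed_data.padded_batch(
      batch_size,
      padded_shapes=((tf.TensorShape([None]), tf.TensorShape([])),
                     (tf.TensorShape([None]), tf.TensorShape([]))),
      padding_values=((src_eos_id, 0), (tgt_eos_id, 0)))

batched_dataset = source_target_dataset.apply(
    tf.contrib.data.group_by_window(
        key_func=key_func, reduce_func=reduce_func, window_size=batch_size))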

Reading data from a Dataset requires three lines of code: create the iterator, get its values, and initialize it.

batched_iterator = batched_dataset.make_initializable_iterator()

((source, source_lengths), (target, target_lengths)) = batched_iterator.get_next()

# At initialization time.
session.run(batched_iterator.initializer, feed_dict={...})

Once the iterator is initialized, every session.run call that accesses source or target tensors will request the next minibatch from the underlying dataset.

Other details for better NMT models

Bidirectional RNNs

Bidirectionality on the encoder side generally gives better performance (with some degradation in speed as more layers are used). Here, we give a simplified example of how to build an encoder with a single bidirectional layer:

# Construct forward and backward cells
forward_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
backward_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

bi_outputs, encoder_state = tf.nn.bidirectional_dynamic_rnn(
    forward_cell, backward_cell, encoder_emb_inp,
    sequence_length=source_sequence_length, time_major=True)
encoder_outputs = tf.concat(bi_outputs, -1)

The variables encoder_outputs and encoder_state can be used in the same way as in Section Encoder. Note that, for multiple bidirectional layers, we need to manipulate encoder_state a bit; see model.py, method _build_bidirectional_rnn(), for more details.
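For intuition, roughly what that manipulation does (a sketch; bi_state and num_bi_layers are assumed names) is interleave the forward and backward states so the result still looks like a stack of unidirectional layer states:

# bi_state is a (forward, backward) pair of per-layer state tuples.
encoder_state = []
for layer_id in range(num_bi_layers):
  encoder_state.append(bi_state[0][layer_id])  # forward state
  encoder_state.append(bi_state[1][layer_id])  # backward state
encoder_state = tuple(encoder_state)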

Beam search

While greedy decoding can give us quite reasonable translation quality, a beam search decoder can further boost performance. The idea of beam search is to better explore the search space of all possible translations by keeping around a small set of top candidates as we translate. The size of the beam is called the beam width; a minimal beam width of, say, 10 is generally sufficient. For more information, we refer readers to Section 7.2.3 of Neubig (2017). Here's an example of how beam search can be done:

# Replicate encoder infos beam_width times
decoder_initial_state = tf.contrib.seq2seq.tile_batch(
    encoder_state, multiplier=hparams.beam_width)

# Define a beam-search decoder
decoder = tf.contrib.seq2seq.BeamSearchDecoder(
        cell=decoder_cell,
        embedding=embedding_decoder,
        start_tokens=start_tokens,
        end_token=end_token,
        initial_state=decoder_initial_state,
        beam_width=beam_width,
        output_layer=projection_layer,
        length_penalty_weight=0.0,
        coverage_penalty_weight=0.0)

# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)

Note that the same dynamic_decode() API call is used, similar to the Section Decoder. Once decoded, we can access the translations as follows:

translations = outputs.predicted_ids
# Make sure translations shape is [batch_size, beam_width, time]
if self.time_major:
   translations = tf.transpose(translations, perm=[1, 2, 0])

See model.py, method _build_decoder() for more details.

Hyperparameters

There are several hyperparameters that can lead to additional performance gains. Here, we list some based on our own experience [Disclaimer: others might not agree with everything we write!].

Optimizer: while Adam can lead to reasonable results for "unfamiliar" architectures, SGD with scheduling will generally lead to better performance if you can train with SGD.

Attention: Bahdanau-style attention often requires bidirectionality on the encoder side to work well, whereas Luong-style attention tends to work well across different settings. For this tutorial code, we recommend using the two improved variants of Luong- and Bahdanau-style attention: scaled_luong & normed_bahdanau.

Multi-GPU training

Training an NMT model may take several days. Placing different RNN layers on different GPUs can improve the training speed. Here's an example of how to create RNN layers on multiple GPUs.

cells = []
for i in range(num_layers):
  cells.append(tf.contrib.rnn.DeviceWrapper(
      tf.contrib.rnn.LSTMCell(num_units),
      "/gpu:%d" % (i % num_gpus)))  # place layer i on GPU i % num_gpus
cell = tf.contrib.rnn.MultiRNNCell(cells)

In addition, we need to enable the colocate_gradients_with_ops option in tf.gradients to parallelize the gradient computation.
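Concretely, this is a one-flag change to the gradient call shown earlier:

# Compute each gradient on the same device as its forward op.
gradients = tf.gradients(
    train_loss, params, colocate_gradients_with_ops=True)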

You may notice that the speed improvement of the attention-based NMT model is very small as the number of GPUs increases. One major drawback of the standard attention architecture is that it uses the top (final) layer's output to query attention at each time step. That means each decoding step must wait until its previous step has completely finished; hence, we can't parallelize the decoding process by simply placing RNN layers on multiple GPUs.

The GNMT attention architecture parallelizes the decoder's computation by using the bottom (first) layer's output to query attention. Therefore, each decoding step can start as soon as its previous step's first layer and attention computation have finished. We implemented the architecture in GNMTAttentionMultiCell, a subclass of tf.contrib.rnn.MultiRNNCell. Here's an example of how to create a decoder cell with the GNMTAttentionMultiCell.

cells = []
for i in range(num_layers):
  cells.append(tf.contrib.rnn.DeviceWrapper(
      tf.contrib.rnn.LSTMCell(num_units),
      "/gpu:%d" % (i % num_gpus)))
attention_cell = cells.pop(0)
attention_cell = tf.contrib.seq2seq.AttentionWrapper(
    attention_cell,
    attention_mechanism,
    attention_layer_size=None,  # don't add an additional dense layer.
    output_attention=False,)
cell = GNMTAttentionMultiCell(attention_cell, cells)

Benchmarks

IWSLT English-Vietnamese

Train: 133K examples, vocab=vocab.(vi|en), train=train.(vi|en) dev=tst2012.(vi|en), test=tst2013.(vi|en), download script.

Training details. We train 2-layer LSTMs of 512 units with a bidirectional encoder (i.e., 1 bidirectional layer for the encoder); the embedding dim is 512. LuongAttention (scale=True) is used together with a dropout keep_prob of 0.8. All parameters are initialized uniformly. We use SGD with learning rate 1.0 as follows: train for 12K steps (~12 epochs); after 8K steps, we start halving the learning rate every 1K steps.

Results.

Below are the averaged results of 2 models (model 1, model 2).
We measure the translation quality in terms of BLEU scores (Papineni et al., 2002).

Systems                    tst2012 (dev)   tst2013 (test)
NMT (greedy)               23.2            25.5
NMT (beam=10)              23.8            26.1
(Luong & Manning, 2015)    -               23.3

Training Speed: (0.37s step-time, 15.3K wps) on K40m & (0.17s step-time, 32.2K wps) on TitanX.
Here, step-time means the time taken to run one mini-batch (of size 128). For wps, we count words on both the source and target.

WMT German-English

Train: 4.5M examples, vocab=vocab.bpe.32000.(de|en), train=train.tok.clean.bpe.32000.(de|en), dev=newstest2013.tok.bpe.32000.(de|en), test=newstest2015.tok.bpe.32000.(de|en), download script

Training details. Our training hyperparameters are similar to the English-Vietnamese experiments except for the following details. The data is split into subword units using BPE (32K operations). We train 4-layer LSTMs of 1024 units with a bidirectional encoder (i.e., 2 bidirectional layers for the encoder); the embedding dim is 1024. We train for 350K steps (~10 epochs); after 170K steps, we start halving the learning rate every 17K steps.

Results.

The first 2 rows are the averaged results of 2 models (model 1, model 2). The result in the third row is with GNMT attention (model), trained with 4 GPUs.

Systems                          newstest2013 (dev)   newstest2015 (test)
NMT (greedy)                     27.1                 27.6
NMT (beam=10)                    28.0                 28.9
NMT + GNMT attention (beam=10)   29.0                 29.9
WMT SOTA                         -                    29.3

These results show that our code builds strong baseline systems for NMT.
(Note that WMT systems generally utilize a huge amount of monolingual data, which we currently do not.)

Training Speed: (2.1s step-time, 3.4K wps) on Nvidia K40m & (0.7s step-time, 8.7K wps) on Nvidia TitanX for standard models.
To see the speed-ups with GNMT attention, we benchmark on K40m only:

Systems                           1 GPU        4 GPUs       8 GPUs
NMT (4 layers)                    2.2s, 3.4K   1.9s, 3.9K   -
NMT (8 layers)                    3.5s, 2.0K   -            2.9s, 2.4K
NMT + GNMT attention (4 layers)   2.6s, 2.8K   1.7s, 4.3K   -
NMT + GNMT attention (8 layers)   4.2s, 1.7K   -            1.9s, 3.8K

These results show that without GNMT attention, the gains from using multiple GPUs are minimal.
With GNMT attention, we obtain 50%-100% speed-ups with multiple GPUs.

WMT English-German — Full Comparison

The first 2 rows are our models with GNMT attention: model 1 (4 layers), model 2 (8 layers).

Systems                                   newstest2014   newstest2015
Ours — NMT + GNMT attention (4 layers)    23.7           26.5
Ours — NMT + GNMT attention (8 layers)    24.4           27.6
WMT SOTA                                  20.6           24.9
OpenNMT (Klein et al., 2017)              19.3           -
tf-seq2seq (Britz et al., 2017)           22.2           25.2
GNMT (Wu et al., 2016)                    24.6           -

The above results show our models are very competitive among models of similar architectures.
[Note that OpenNMT uses smaller models, and the current best result (as of this writing) is 28.4, obtained by the Transformer network (Vaswani et al., 2017), which has a significantly different architecture.]

Standard HParams

We have provided a set of standard hparams for using pre-trained checkpoints for inference or for training the NMT architectures used in the Benchmarks.

We will use the WMT16 German-English data; you can download it with the following command.

nmt/scripts/wmt16_en_de.sh /tmp/wmt16

Here is an example command for loading the pre-trained GNMT WMT German-English checkpoint for inference.

python -m nmt.nmt \
    --src=de --tgt=en \
    --ckpt=/path/to/checkpoint/translate.ckpt \
    --hparams_path=nmt/standard_hparams/wmt16_gnmt_4_layer.json \
    --out_dir=/tmp/deen_gnmt \
    --vocab_prefix=/tmp/wmt16/vocab.bpe.32000 \
    --inference_input_file=/tmp/wmt16/newstest2014.tok.bpe.32000.de \
    --inference_output_file=/tmp/deen_gnmt/output_infer \
    --inference_ref_file=/tmp/wmt16/newstest2014.tok.bpe.32000.en

Here is an example command for training the GNMT WMT German-English model.

python -m nmt.nmt \
    --src=de --tgt=en \
    --hparams_path=nmt/standard_hparams/wmt16_gnmt_4_layer.json \
    --out_dir=/tmp/deen_gnmt \
    --vocab_prefix=/tmp/wmt16/vocab.bpe.32000 \
    --train_prefix=/tmp/wmt16/train.tok.clean.bpe.32000 \
    --dev_prefix=/tmp/wmt16/newstest2013.tok.bpe.32000 \
    --test_prefix=/tmp/wmt16/newstest2015.tok.bpe.32000

Other resources

For deeper reading on Neural Machine Translation and sequence-to-sequence models, we highly recommend the following materials by Luong, Cho, Manning, (2016); Luong, (2016); and Neubig, (2017).

There's a wide variety of tools for building seq2seq models, so we pick one per language:
Stanford NMT https://nlp.stanford.edu/projects/nmt/ [Matlab]
tf-seq2seq https://github.com/google/seq2seq [TensorFlow]
Nematus https://github.com/rsennrich/nematus [Theano]
OpenNMT http://opennmt.net/ [Torch]
OpenNMT-py https://github.com/OpenNMT/OpenNMT-py [PyTorch]

Acknowledgment

We would like to thank Denny Britz, Anna Goldie, Derek Murray, and Cinjon Resnick for their work bringing new features to TensorFlow and the seq2seq library. Additional thanks go to Lukasz Kaiser for the initial help on the seq2seq codebase; Quoc Le for the suggestion to replicate GNMT; Yonghui Wu and Zhifeng Chen for details on the GNMT systems; as well as the Google Brain team for their support and feedback!

References

BibTex

@article{luong17,
  author  = {Minh{-}Thang Luong and Eugene Brevdo and Rui Zhao},
  title   = {Neural Machine Translation (seq2seq) Tutorial},
  journal = {https://github.com/tensorflow/nmt},
  year    = {2017},
}


nmt's Issues

Question: Data parallel vs Split model training

The currently implemented setup allows splitting the model across multiple GPUs by distributing the layers.
Is this approach preferred over having the same model on multiple GPUs and doing data-parallel training with an increased batch size?

What are pros and cons of the two approaches?

The new seq2seq library does not provide the results the legacy seq2seq provided.

I am using a custom dataset.
First Step:
I was using the old seq2seq tutorial with tensorflow-gpu 1.0.0.
The perplexity used to converge to a minimum of 2.1, with the evaluation perplexity of each bucket in and around that value.

Second Step:
When I upgraded to the new version, tensorflow-gpu 1.2.0, and used this tutorial to train, the perplexity shows as 400 and never goes down. The reason for trying the new API was to later do scheduled sampling. But even without that, using the tutorial as is, the results are very bad.

Reason (Maybe):
Is it because the perplexity is multiplied by the batch size before being printed out, and that increases the value? If so, I read the comment on the loss function in the tutorial but did not understand it completely. It would be really nice if I could get a hypothetical answer on what might be happening.

Great tutorial!
Thank you in advance, appreciate your help and guidance.

Failed to run GNMT

It complains with a KeyError:
"KeyError: num_residual_layers"

Here is my script:

python -m nmt.nmt \
    --src=en --tgt=de \
    --vocab_prefix=${DATA_DIR}/vocab \
    --train_prefix=${DATA_DIR}/train \
    --dev_prefix=${DATA_DIR}/newstest2014 \
    --test_prefix=${DATA_DIR}/newstest2015 \
    --out_dir=${OUT_DIR}/test \
    --hparams_path=nmt/standard_hparams/wmt16_en_de_gnmt.json

AttributeError: 'dict' object has no attribute 'iterkeys'

When I trained the model in the tutorial, I got the following error.

I have updated my TensorFlow to version 1.2.1 and used the script to download the data.
My Python version is 3.6.

python -m nmt.nmt --src=vi --tgt=en --vocab_prefix=/tmp/nmt_data/vocab --train_prefix=/tmp/nmt_data/train --dev_prefix=/tmp/nmt_data/tst2012 --test_prefix=/tmp/nmt_data/tst2013 --out_dir=/tmp/nmt_model --num_train_steps=12000 --steps_per_stats=100 --num_layers=2 --num_units=128 --dropout=0.2 --metrics=bleu
b'# Job id 0'
b'# hparams:'
b' src=vi'
b' tgt=en'
b' train_prefix=/tmp/nmt_data/train'
b' dev_prefix=/tmp/nmt_data/tst2012'
b' test_prefix=/tmp/nmt_data/tst2013'
b' out_dir=/tmp/nmt_model'
b'# Vocab file /tmp/nmt_data/vocab.vi exists'
b'# Vocab file /tmp/nmt_data/vocab.en exists'
b' saving hparams to /tmp/nmt_model/hparams'
b' saving hparams to /tmp/nmt_model/best_bleu/hparams'
Traceback (most recent call last):
  File "/Applications/bin/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Applications/bin/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/jiahaohong/Downloads/nmt-master/nmt/nmt.py", line 479, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/Applications/bin/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/Users/jiahaohong/Downloads/nmt-master/nmt/nmt.py", line 289, in main
    hparams = load_train_hparams(out_dir)
  File "/Users/jiahaohong/Downloads/nmt-master/nmt/nmt.py", line 237, in load_train_hparams
    utils.print_hparams(hparams)
  File "/Users/jiahaohong/Downloads/nmt-master/nmt/utils/misc_utils.py", line 72, in print_hparams
    for key in sorted(values.iterkeys()):
AttributeError: 'dict' object has no attribute 'iterkeys'

TypeError: Expected binary or unicode string, got None

My Python version is 3.6.1, and my TF version is 1.2.1.
I got an error when I ran inference:

cat > /tmp/my_infer_file.vi
# (copy and paste some sentences from /tmp/nmt_data/tst2013.vi)

python -m nmt.nmt \
    --model_dir=/tmp/nmt_model \
    --inference_input_file=/tmp/my_infer_file.vi \
    --inference_output_file=/tmp/nmt_model/output_infer

cat /tmp/nmt_model/output_infer # To view the inference as output

The error message is here:

Traceback (most recent call last):
  File "/home_path/.pyenv/versions/3.6.1/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home_path/.pyenv/versions/3.6.1/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home_path/Documents/char-rnn/nmt-master/nmt/nmt.py", line 478, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/openmind/.pyenv/versions/3.6.1/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home_path/Documents/char-rnn/nmt-master/nmt/nmt.py", line 438, in main
    if not tf.gfile.Exists(out_dir): tf.gfile.MakeDirs(out_dir)
  File "/home_path/.pyenv/versions/3.6.1/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 252, in file_exists
    pywrap_tensorflow.FileExists(compat.as_bytes(filename), status)
  File "/home_path/.pyenv/versions/3.6.1/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
    (bytes_or_text,))
TypeError: Expected binary or unicode string, got None

Has anyone else faced the same issue?

How to resume training (fine-tuning) on a checkpointed (saved) model?

For example, I have trained a model for 300,000 rounds and saved it successfully. What if I want to continue training based on the saved model, say, for 300,000 more rounds? I cannot find any documentation. Does anybody know the command details?

How to run the tests?

I tried 2 ways:

  • python model_test.py
  • python nmt/model_test.py

with both Python 2.7 and Python 3.6.

In both cases, I get the error:

Traceback (most recent call last):
  File "model_test.py", line 26, in <module>
    from . import attention_model
ImportError: cannot import name 'attention_model'

I also tried to add the nmt directory to PYTHONPATH, without success.

It interests me very much that you've written tests, but I find no way to run them. Any help would be appreciated :)

BPE support seems missing

I'm trying to run wmt16_en_de_gnmt.json.
It first comes back with an error about a missing vocabulary file. Looking into the code, it doesn't look for the vocab files with the "bpe.32000" prefix, which are created by wmt16_en_de.sh. If I force it to look at the right vocab file, then the model starts to run and the graph build seems successful. However, it stops with the error "HashTable has different value for same key. Key <s> has 1 and trying to add value 4".

a small issue about the print function in train.py

The link is here:

utils.print_out(
    "# Start step %d, lr %g, %s" %
    (global_step, train_model.learning_rate.eval(session=train_sess),
     time.ctime()),
    log_f)

What's the meaning of train_model.learning_rate.eval()?
The train_model.learning_rate is just a TensorFlow constant; can we call the eval function on it?
Thanks for anyone's response.

question

Hey, great work here.

Can you please confirm whether the BLEU scores you report for EN<>DE are based on subwords or reformed words?

Thanks.

I can't use beam search

I got an error when I set beam_width=10 or load hparams from the standard_hparams:

InvalidArgumentError (see above for traceback): Multiple OpKernel registrations match NodeDef 'dynamic_seq2seq/decoder/decoder/GatherTree = GatherTree[T=DT_INT32](dynamic_seq2seq/decoder/decoder/TensorArrayStack_1/TensorArrayGatherV3, dynamic_seq2seq/decoder/decoder/TensorArrayStack_2/TensorArrayGatherV3, dynamic_seq2seq/decoder/decoder/while/Exit_16)': 'op: "GatherTree" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } }' and 'op: "GatherTree" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } }'
	 [[Node: dynamic_seq2seq/decoder/decoder/GatherTree = GatherTree[T=DT_INT32](dynamic_seq2seq/decoder/decoder/TensorArrayStack_1/TensorArrayGatherV3, dynamic_seq2seq/decoder/decoder/TensorArrayStack_2/TensorArrayGatherV3, dynamic_seq2seq/decoder/decoder/while/Exit_16)]]

ValueError when creating inference file

When trying to create an inference file, I get the following ValueError: hparams.vocab_prefix must be provided.

~/nmt $ python -m nmt.nmt \
>     --model_dir=/tmp/nmt_model \
>     --inference_input_file=/tmp/my_infer_file.vi \
>     --inference_output_file=/tmp/nmt_model/output_infer
# Job id 0
# Loading hparams from /tmp/nmt_model/hparams
# hparams:
  src=None
  tgt=None
  train_prefix=None
  dev_prefix=None
  test_prefix=None
  out_dir=None
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "~/nmt/nmt/nmt.py", line 479, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "~/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "~/nmt/nmt/nmt.py", line 267, in main
    hparams = ensure_compatible_hparams(hparams)
  File "~/nmt/nmt/nmt.py", line 195, in ensure_compatible_hparams
    new_hparams = extend_hparams(new_hparams)
  File "~/nmt/nmt/nmt.py", line 148, in extend_hparams
    raise ValueError("hparams.vocab_prefix must be provided.")
ValueError: hparams.vocab_prefix must be provided.

If I comment hparams = ensure_compatible_hparams(hparams) out, it runs without any trouble.

OOM due to large target vocab.

I am facing an OOM error because the target vocab size is very large. Any idea how to tackle this?
Maybe somehow using sampled_softmax_loss after decoding?
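(One hedged sketch of the sampled_softmax_loss idea, with made-up shapes; actually using it here would mean replacing the full-softmax loss in model.py, and note that evaluation/inference still needs the full softmax.)

import tensorflow as tf  # TF 1.x

vocab_size, hidden_size, batch_size = 200000, 512, 32
# output-projection parameters; sampled_softmax_loss expects the
# transposed weight layout [vocab_size, hidden_size]
w_t = tf.get_variable("proj_w_t", [vocab_size, hidden_size])
b = tf.get_variable("proj_b", [vocab_size])
decoder_outputs = tf.random_normal([batch_size, hidden_size])  # stand-in
labels = tf.random_uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)

# only num_sampled negative classes are scored per step instead of all
# 200k, which is where the memory savings come from
loss = tf.nn.sampled_softmax_loss(
    weights=w_t, biases=b, labels=labels, inputs=decoder_outputs,
    num_sampled=512, num_classes=vocab_size)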

Possible missing tanh() in calculating attention

Hi, I was looking at the attention section of the tutorial, and it says attention vector = tanh( Wc [c; h] ). The attention wrappers in the seq2seq library use attention_layer to perform an affine transformation on [c; h] via layers_core.Dense without an activation function, and I have trouble locating where the tanh() is applied.
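(For comparison, a minimal sketch of the formula as written in the tutorial, a_t = tanh(W_c [c_t; h_t]); passing activation=tf.tanh into the Dense layer is how one would realize it, not a claim about how AttentionWrapper is wired by default.)

import tensorflow as tf  # TF 1.x
from tensorflow.python.layers import core as layers_core

attention_size = 512
c_t = tf.random_normal([32, 512])  # context vector (stand-in shapes)
h_t = tf.random_normal([32, 512])  # decoder cell output

# attention vector a_t = tanh(W_c [c_t; h_t]); Dense with a tanh
# activation realizes the affine map followed by the nonlinearity
attention_layer = layers_core.Dense(
    attention_size, use_bias=False, activation=tf.tanh)
a_t = attention_layer(tf.concat([c_t, h_t], axis=-1))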

Question: Cross-Entropy and variable sized logits from dynamic_decode

I have a very similar setup, also using the new v1.2 APIs.
When I run the encoder+decoder with features of size e.g. [2, 23, 100] ([batch, sentence_length, source_vocabulary_size]), it will sometimes return a differently sized tensor, e.g. [2, 16, 100], from dynamic_decode. I know this is intended, as the target sentence can have a different length from the input sentence.

What I do not understand is the following:
Let's assume I have a label tensor of shape [2, 30] and dynamic_decode outputs [2, 16, 100] how do I compute the crossentropy like it is done here:

target_output = self.iterator.target_output
if self.time_major:
  target_output = tf.transpose(target_output)
max_time = self.get_max_time(target_output)
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=target_output, logits=logits)
target_weights = tf.sequence_mask(
    self.iterator.target_sequence_length, max_time, dtype=logits.dtype)
if self.time_major:
  target_weights = tf.transpose(target_weights)

loss = tf.reduce_sum(
    crossent * target_weights) / tf.to_float(self.batch_size)
return loss

https://github.com/tensorflow/nmt/blob/master/nmt/model.py#L414

...when the shapes of logits and labels do not match? I currently pad/trim my logits in that case, but I don't see that happening in the code here.
Could that point to a bug in my implementation or do I misunderstand something?
I know this might be more suitable for stackoverflow but I feel like I miss an important part of the model that is presented here.
I hope someone can clarify that for me.
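(Not an authoritative answer, but one observation: in this repo's training graph the decoder is driven by TrainingHelper over the gold target, so dynamic_decode runs for exactly the target's max_time steps and the two shapes match by construction; padding only becomes necessary when the decoder may stop on its own. A hedged sketch of the pad/trim approach, with hypothetical names:)

import tensorflow as tf  # TF 1.x

def pad_logits_to_labels(logits, labels):
    """Pad (or trim) [batch, time, vocab] logits to the labels' time axis."""
    label_time = tf.shape(labels)[1]
    logit_time = tf.shape(logits)[1]
    pad = tf.maximum(label_time - logit_time, 0)
    logits = tf.pad(logits, [[0, 0], [0, pad], [0, 0]])
    return logits[:, :label_time, :]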

Dev dataset cannot be very large?

If I use a large dev dataset, the code will crash. It also takes a long time at the following step:
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator:
After a long time, it crashes; the error log is as follows:
Resource exhausted: OOM when allocating tensor with shape[23296,47291]
But if I change the dev dataset to a small one, it works successfully.
I'm not sure whether I'm doing something wrong or whether the code cannot handle a large dev dataset.
Does anybody know how to solve this problem? Thank you very much!

Can't inference big file

When I use the configuration --out_dir=./nmt_attention_model --inference_input_file=./nmt_data/train.vi --inference_output_file=./nmt_model/output_infer --inference_ref_file=./nmt_data/train.en in PyCharm, I get the following error:

Traceback (most recent call last):
  File "/home/liyanyang/projects/baseline/nmt.py", line 481, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/liyanyang/projects/baseline/nmt.py", line 460, in main
    trans_file, hparams, num_workers, jobid)
  File "/home/liyanyang/projects/baseline/inference.py", line 164, in inference
    scope=scope)
  File "/home/liyanyang/projects/baseline/inference.py", line 222, in _single_worker_inference
    tgt_eos=hparams.eos)
  File "/home/liyanyang/projects/baseline/utils/nmt_utils.py", line 51, in decode_and_evaluate
    nmt_outputs, _ = model.decode(sess)
  File "/home/liyanyang/projects/baseline/model.py", line 448, in decode
    _, infer_summary, _, sample_words = self.infer(sess)
  File "/home/liyanyang/projects/baseline/model.py", line 435, in infer
    self.infer_logits, self.infer_summary, self.sample_id, self.sample_words
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [All values in memory_sequence_length must greater than zero.] [Condition x > 0 did not hold element-wise:] [x (IteratorGetNext:1) = ] [11 39 41...]
	 [[Node: dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/decoder/attention/assert_positive/assert_less/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/cpu:0"](dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/decoder/attention/assert_positive/assert_less/All/_191, dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/decoder/attention/assert_positive/assert_less/Assert/Assert/data_0, dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/decoder/attention/assert_positive/assert_less/Assert/Assert/data_1, dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/decoder/attention/assert_positive/assert_less/Assert/Assert/data_2, dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/decoder/attention/assert_positive/assert_less/Less/Enter/_193)]]

Caused by op u'dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/decoder/attention/assert_positive/assert_less/Assert/Assert', defined at:
  File "/home/liyanyang/projects/baseline/nmt.py", line 481, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/liyanyang/projects/baseline/nmt.py", line 460, in main
    trans_file, hparams, num_workers, jobid)
  File "/home/liyanyang/projects/baseline/inference.py", line 164, in inference
    scope=scope)
  File "/home/liyanyang/projects/baseline/inference.py", line 189, in _single_worker_inference
    infer_model = create_infer_model(model_creator, hparams, scope)
  File "/home/liyanyang/projects/baseline/inference.py", line 80, in create_infer_model
    scope=scope)
  File "/home/liyanyang/projects/baseline/attention_model.py", line 53, in __init__
    scope=scope)
  File "/home/liyanyang/projects/baseline/model.py", line 88, in __init__
    res = self.build_graph(hparams, scope=scope)
  File "/home/liyanyang/projects/baseline/model.py", line 226, in build_graph
    encoder_outputs, encoder_state, hparams)
  File "/home/liyanyang/projects/baseline/model.py", line 380, in _build_decoder
    scope=decoder_scope)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 286, in dynamic_decode
    swap_memory=swap_memory)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2770, in while_loop
    result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2599, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2549, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 234, in body
    decoder_finished) = decoder.step(time, inputs, state)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py", line 139, in step
    cell_outputs, cell_state = self._cell(inputs, state)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 844, in __call__
    return self._cell(inputs, state, scope=scope)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 180, in __call__
    return super(RNNCell, self).__call__(inputs, state)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 441, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py", line 727, in call
    cell_output, previous_alignments=state.alignments)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py", line 357, in __call__
    alignments = self._probability_fn(score, previous_alignments)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py", line 185, in <lambda>
    _maybe_mask_score(score, memory_sequence_length, score_mask_value),
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py", line 120, in _maybe_mask_score
    [check_ops.assert_positive(memory_sequence_length, message=message)]):
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/check_ops.py", line 198, in assert_positive
    return assert_less(zero, x, data=data, summarize=summarize)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/check_ops.py", line 401, in assert_less
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 124, in Assert
    condition, data, summarize, name="Assert")
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 37, in _assert
    summarize=summarize, name=name)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/liyanyang/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

But it works fine with --out_dir=./nmt_attention_model --inference_input_file=./nmt_data/tst2013.vi --inference_output_file=./nmt_attention_model/output_infer --inference_ref_file=./nmt_data/tst2013.en.
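(A likely cause, judging from the assertion text: train.vi contains empty lines, which yield zero-length source sentences and trip the attention mask's memory_sequence_length > 0 check. A minimal cleanup sketch, with hypothetical file names, that drops pairs where either side is empty:)

# drop sentence pairs where either side is blank, keeping the two
# files aligned line-for-line
with open('train.vi') as src, open('train.en') as tgt, \
     open('train.clean.vi', 'w') as src_out, \
     open('train.clean.en', 'w') as tgt_out:
    for s, t in zip(src, tgt):
        if s.strip() and t.strip():
            src_out.write(s)
            tgt_out.write(t)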

Hello, can you tell me what is wrong when reading data?

It said vocab.query has an empty line; I don't know what that means. Looking forward to your reply!
File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/taoyanqi_i/new_project/query_nmt/nmt/nmt.py", line 492, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/taoyanqi_i/new_project/query_nmt/nmt/nmt.py", line 485, in main
run_main(FLAGS, default_hparams, train_fn, inference_fn)
File "/home/taoyanqi_i/new_project/query_nmt/nmt/nmt.py", line 478, in run_main
train_fn(hparams)
File "nmt/train.py", line 277, in train
single_cell_fn)
File "nmt/train.py", line 61, in create_train_model
src_vocab_file, tgt_vocab_file, hparams.share_vocab)
File "nmt/utils/vocab_utils.py", line 75, in create_vocab_tables
src_vocab_file, default_value=UNK_ID)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/lookup_ops.py", line 947, in index_table_from_file
init, default_value, shared_name=shared_name, name=hash_table_scope)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/lookup_ops.py", line 275, in init
super(HashTable, self).init(table_ref, default_value, initializer)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/lookup_ops.py", line 161, in init
self._init = initializer.initialize(self)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/lookup_ops.py", line 519, in initialize
name=scope)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_lookup_ops.py", line 187, in _initialize_table_from_text_file_v2
name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in init
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Invalid content in ../query_nmt_attention_model/vocab.query: empty line found at position 35.
	 [[Node: string_to_index/hash_table/table_init = InitializeTableFromTextFileV2[delimiter="\t", key_index=-2, value_index=-1, vocab_size=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](string_to_index/hash_table, string_to_index/hash_table/table_init/asset_filepath)]]
	 [[Node: string_to_index_1/hash_table/table_init/_14 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_8_string_to_index_1/hash_table/table_init", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
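(The error text itself points at the fix: ../query_nmt_attention_model/vocab.query contains an empty line at position 35, i.e. the 36th line counting from zero; deleting it, or regenerating the vocab without blank lines, should let the lookup table initialize.)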

Clarifying How to Correctly Access Alignment History

I have a problem with the Seq2Seq library, and I'm trying to use this tutorial to find out where my bug is. My problem is that my model's alignment values are initially uniformly distributed across encoder outputs; however, the alignment values remain the same even after training, despite the fact that my model's accuracy climbs from chance (25%) to 100%. My problem is structured in a way that the only way to do well at it is for the decoder to learn to pay attention.

Copying and pasting from the earlier Github issue:

As background, both my inputs and labeled outputs at each time step are vectors of shape (4, ). I run my encoder for 500 steps i.e. inputs have shape (minibatch size, 500, 4), and my decoder runs for approximately 40-41 steps i.e. final output has shape (minibatch size, 41, 4). Each output label depends roughly on 12 sequential inputs, so for example, the first output depends on inputs 1-12, the second output depends on inputs 13-24, etc.

I don't use embeddings since doing so isn't applicable for my problem.

I reduced my model to a single layer encoder, single layer decoder to eliminate any mistake I might be making with multi-layered architectures. The encoder is a bidirectional RNN.

At the start of training, my alignment_history has roughly random uniform weights. Its shape is (41, minibatch size, 500) (although I could transpose it from time-major to batch-major). alignment_history will have values between 0.001739 and 0.002241, which makes sense - randomly initialized attention should be around 1/500 = 0.002. Additionally, my model performs at chance (25% classification accuracy).

During training, my model converges to 100% classification accuracy on both training and validation data, as shown below.

[screenshot: training and validation classification accuracy converging to 100%]

The model never sees the same training data twice, so I'm 99% confident that the model isn't memorizing the training data. However, after training, the values of alignment_history effectively haven't changed; the values now look randomly chosen from between 0.00185 and 0.00219.

My code is relatively straightforward. I have a class encapsulating my model. One method instantiates a RNN cell:

@staticmethod
def _create_lstm_cell(cell_size):
    """
    Creates a RNN cell. If lstm_or_gru is True (default), create a Layer
    Normalized LSTM cell (if layer_norm is True (default); otherwise,
    create a vanilla LSTM cell. If lstm_or_gru is False, create a Gated
    Recurrent Unit cell.
    """

    if tf.flags.FLAGS.lstm_or_gru:
        if tf.flags.FLAGS.layer_norm:
            return LayerNormBasicLSTMCell(cell_size)
        else:
            return BasicLSTMCell(cell_size)
    else:
        return GRUCell(cell_size)

I have one method for building the encoder:

def _define_encoder(self):
    """
    Construct an encoder RNN using a bidirectional layer.
    """

    with tf.variable_scope('define_encoder'):

        encoder_outputs, encoder_final_states = bidirectional_dynamic_rnn(
            cell_fw=self._create_lstm_cell(ENCODER_SINGLE_DIRECTION_SIZE),
            cell_bw=self._create_lstm_cell(ENCODER_SINGLE_DIRECTION_SIZE),
            inputs=self.x,
            dtype=tf.float32,
            sequence_length=self.x_lengths,
            time_major=False  # default
        )

        # concatenate forward and backwards encoder outputs
        encoder_outputs = tf.concat(encoder_outputs, axis=-1)

        # concatenate forward and backwards cell states
        new_c = tf.concat([encoder_final_states[0].c, encoder_final_states[1].c], axis=1)
        new_h = tf.concat([encoder_final_states[0].h, encoder_final_states[1].h], axis=1)
        encoder_final_states = (LSTMStateTuple(c=new_c, h=new_h),)

    return encoder_outputs, encoder_final_states

I similarly have another method for building the decoder:

def _define_decoder(self, encoder_outputs, encoder_final_states):
    """
    Construct a decoder complete with an attention mechanism. The encoder's
    final states will be used as the decoder's initial states.
    """



    with tf.variable_scope('define_decoder'):
        # instantiate attention mechanism
        attention_mechanism = BahdanauAttention(num_units=DECODER_SIZE,
                                                memory=encoder_outputs,
                                                normalize=True)

        # wrap LSTM cell with attention mechanism
        attention_cell = AttentionWrapper(cell=self._create_lstm_cell(cell_size=DECODER_SIZE),
                                          attention_mechanism=attention_mechanism,
                                          # output_attention=False,  # doesn't seem to affect alignments
                                          alignment_history=True,
                                          attention_layer_size=DECODER_SIZE)  # arbitrarily chosen

        # create initial attention state of zeros everywhere
        decoder_initial_state = attention_cell.zero_state(batch_size=tf.flags.FLAGS.batch_size, dtype=tf.float32).clone(cell_state=encoder_final_states[0])


        # TODO: switch this out at inference time
        training_helper = TrainingHelper(inputs=self.y,  # feed in ground truth
                                         sequence_length=self.y_lengths)  # feed in sequence lengths

        decoder = BasicDecoder(cell=attention_cell,
                               helper=training_helper,
                               initial_state=decoder_initial_state
                               )

        # run decoder over input sequence
        decoder_outputs, decoder_final_states, decoder_final_sequence_lengths = dynamic_decode(
            decoder=decoder,
            maximum_iterations=41,
            impute_finished=True)

        decoder_outputs = decoder_outputs[0]
        decoder_final_states = (decoder_final_states,)

    return decoder_outputs, decoder_final_states

I use both of these methods, and then project the output of the decoder to the same dimensionality as my labels.

def _add_inference(self):
    """
    Create a Sequence-to-Sequence model using a bidirectional encoder and an
    attention mechanism-wrapped decoder.
    
    The outputs of the decoder need to be projected to a lower dimensional
    space i.e. from DECODER_SIZE to 4.
    """

    with tf.variable_scope('add_inference'):
        encoder_outputs, encoder_final_states = self._define_encoder()
        decoder_outputs, decoder_final_states = self._define_decoder(encoder_outputs, encoder_final_states)

        weights = tf.Variable(tf.truncated_normal(shape=[DECODER_SIZE, 4]))
        bias = tf.Variable(tf.truncated_normal(shape=[4]))
        logits = tf.tensordot(decoder_outputs, weights, axes=[[2], [0]]) + bias  # 2nd dimension of decoder outputs, 0th dimension of weights

    return encoder_final_states, decoder_final_states, logits

Most of my code was written before the NMT tutorial was released, so I read the code and then stepped through it, but I can't find any glaring differences. I do have a couple of additional questions.

  1. I have two hypotheses. One is that I'm incorrectly accessing my model's alignments, and the other is that I'm screwing something up in a much more significant way. Just to eliminate the first as a possibility, the correct way to access the decoder's alignments is through setting alignment_history=True in AttentionWrapper and then examining the values in decoder_final_states[0].alignment_history.stack(). Is this correct?

  2. How is the attention mechanism's num_units chosen? Is the attention mechanism's number of units required to match the number of units in the RNN cell as well as the number of units in the AttentionWrapper, or is that not necessary?

  3. I'm confused by the terminology used regarding memory, queries and keys. Memory and keys are both defined in English as "the set of source hidden states", but mathematically they're defined differently i.e. memory is W_2\overline{h}_s for Bahdanau Attention, but the keys are W_1h_t for Bahdanau Attention. My guess is that the tutorial means to say that the query h_t is converted into a key using W_1, and that key is then compared against keys generated from the encoder's hidden states i.e. W\overline{h}_s. Is this correct, or am I misunderstanding something?
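(On question 1, at least: yes, with alignment_history=True the final AttentionWrapperState carries a TensorArray, and stack()-ing it is the standard way to materialize the alignments. A small self-contained sketch with stand-in tensors:)

import tensorflow as tf
from tensorflow.contrib import seq2seq  # TF 1.2-era contrib API

batch, enc_time, units = 2, 7, 8
memory = tf.random_normal([batch, enc_time, units])  # fake encoder outputs
mech = seq2seq.BahdanauAttention(num_units=units, memory=memory)
cell = seq2seq.AttentionWrapper(tf.contrib.rnn.BasicLSTMCell(units),
                                mech, alignment_history=True)
helper = seq2seq.TrainingHelper(tf.random_normal([batch, 5, units]),
                                tf.fill([batch], 5))
decoder = seq2seq.BasicDecoder(cell, helper,
                               cell.zero_state(batch, tf.float32))
_, final_state, _ = seq2seq.dynamic_decode(decoder)

# the TensorArray only exists because alignment_history=True;
# stack() yields [decoder_time, batch, encoder_time]
alignments = final_state.alignment_history.stack()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(alignments).shape)  # (5, 2, 7)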

Is there any way to get the meaning vector?

Hi,

I'm trying to use this model in research. I want to get the content/meaning vector produced by the encoder, which represents the sentence meaning, during inference. However, I found it is hard to get at.
The only way I can find is to use "tf.Print" to print data when evaluating "encoder_state", which is a local variable in the "build_graph" method of the model. I failed when using "tf.py_func" to attach extra processing to "encoder_state" as follows, which causes the error: AttributeError: 'tuple' object has no attribute 'encode'.

def printFunc(x):
    utils.print_out(encoder_state)
    return x

wrapped_encode_state_0_c = tf.py_func(printFunc, [encoder_state[0].c], tf.float32)
wrapped_encode_state_0_c.set_shape(encoder_state[0].c.get_shape())

Can you give some suggestions for extracting the meaning vector? Thanks a lot.
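(Incidentally, the AttributeError comes from handing the tuple encoder_state to utils.print_out, which expects a string. A simpler route is to remember that encoder_state is an ordinary graph tensor: expose it on the model, say via a hypothetical self.encoder_state set in build_graph, and fetch it in the same sess.run call as the decode so both come from the same batch. A toy stand-alone sketch of the principle:)

import tensorflow as tf  # TF 1.x

inputs = tf.random_normal([2, 5, 8])  # stand-in source batch
cell = tf.contrib.rnn.BasicLSTMCell(8)
_, encoder_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    state = sess.run(encoder_state)      # LSTMStateTuple of numpy arrays
    print(state.c.shape, state.h.shape)  # (2, 8) (2, 8)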

Run wmt16_en_de.sh got "ImportError: No module named site"

Hello! When I run the wmt16_en_de.sh to download and preprocess the data, I got the following error:

Learning BPE with merge_ops=32000. This may take a while...
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
ImportError: No module named site

I installed the tensorflow by anaconda2, and my python version is 2.7.
My .bash_profile is as follows:

PATH=$PATH:$HOME/bin
PATH=/disk1/wangfang/anaconda2/bin:$PATH
export LD_LIBRARY_PATH=/disk1/wangfang/anaconda2/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/disk1/wangfang/anaconda2/lib/python2.7:$PYTHONPATH
export PYTHONHOME=/disk1/wangfang/anaconda2/lib/python2.7:$PYTHONHOME
export PATH

I'm not sure if my environment variable is wrong. Who can tell me how to fix this problem? Thank you very much!!!
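(A likely cause: "ImportError: No module named site" is the classic symptom of a mis-set PYTHONHOME. PYTHONHOME should point at the installation prefix, not at the lib/python2.7 directory, so export PYTHONHOME=/disk1/wangfang/anaconda2, or simply unset PYTHONHOME, which Anaconda does not need, should fix the Python subprocess that the BPE-learning step launches.)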

"copy" functionality for unknown words

Would it be easy to add copying of unknown words in the decoding stage?
Most systems do that when they report BLEU.
It doesn't matter when using BPE or wordpieces, but for word-based systems it's nice to have.

about debug

There is no problem running the nmt example in a shell window with 'python -m nmt.nmt ...' commands,
but how can I debug the code in an Anaconda environment? I get relative-import problems when
I try to debug the nmt.py file.
Running nmt.py directly leaves it outside its package, so it's impossible to execute import statements
like 'from . import inference' and so on.
So why does the example use relative imports? It makes the code hard for users to debug.

Error when I set 'share_vocab' to true

Hi,

I was trying to train the model with my own data on a response generation task. I set the argument 'share_vocab' to true and got an 'AlreadyExistsError'.

Here's what it looks like

Caused by op u'ParallelMapDataset_1', defined at:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/Users/claud/ucsb/nmt/nmt/nmt.py", line 478, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/Library/Python/2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/Users/claud/ucsb/nmt/nmt/nmt.py", line 471, in main
train.train(hparams)
File "nmt/train.py", line 234, in train
model_creator, hparams, scope)
File "nmt/train.py", line 121, in create_eval_model
tgt_max_len=hparams.tgt_max_len_infer)
File "nmt/utils/iterator_utils.py", line 199, in get_iterator
batched_iter = batched_dataset.make_initializable_iterator()
File "/Library/Python/2.7/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 396, in make_initializable_iterator
return Iterator.from_dataset(self, shared_name)
File "/Library/Python/2.7/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 98, in from_dataset
initializer = gen_dataset_ops.make_iterator(dataset.make_dataset_resource(),
File "/Library/Python/2.7/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 1363, in make_dataset_resource
self._input_dataset.make_dataset_resource(),
File "/Library/Python/2.7/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 1450, in make_dataset_resource
input_resource = self._input_dataset.make_dataset_resource()
File "/Library/Python/2.7/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 1450, in make_dataset_resource
input_resource = self._input_dataset.make_dataset_resource()
File "/Library/Python/2.7/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 1466, in make_dataset_resource
output_shapes=nest.flatten(self.output_shapes))
File "/Library/Python/2.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 414, in parallel_map_dataset
output_shapes=output_shapes, name=name)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/Library/Python/2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in init
self._traceback = _extract_stack()

AlreadyExistsError (see above for traceback): Resource localhost/hash_table_/tmp/nmt_data/vocab.vi_-2_-1/N10tensorflow6lookup15LookupInterfaceE
[[Node: ParallelMapDataset_1 = ParallelMapDataset[Targuments=[DT_RESOURCE, DT_INT64, DT_RESOURCE, DT_INT64], f=tf_map_func_4aca3265[], output_shapes=[[-1], [-1]], output_types=[DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/cpu:0"](FilterDataset, string_to_index/hash_table, string_to_index/hash_table/Const, string_to_index_1/hash_table, string_to_index_1/hash_table/Const, num_threads_1, output_buffer_size_1)]]

Anyone has some clues?
Thank you!

Question: Using TrainingHelper vs GreedyEmbeddingHelper during training.

We are able to train a simple seq2seq model (our own code, but using the TF v1.2 contrib.seq2seq APIs) to perform simple tasks like string reversal (BLEU 100%).

However, on De-En data our inference gets stuck emitting one or two words. During training the model samples (using TrainingHelper) almost perfect translations, BUT inference keeps emitting the same token or two per sentence. (We tried BasicDecoder with GreedyEmbeddingHelper and BeamSearchDecoder for inference.)
Looking at the evaluation loss on De-En (computed using BasicDecoder + GreedyEmbeddingHelper), it looks as if the model overfits very quickly (eval loss shoots up while train loss goes down quickly).
When I instead train using BasicDecoder + GreedyEmbeddingHelper, both training loss and eval loss trend down (though it needs a few more hours to train).
Also, on another (bigger, with longer sequences) toy task, test results after training with BasicDecoder + GreedyEmbeddingHelper are much better than after training with BasicDecoder + TrainingHelper. In all our experiments we use Bahdanau attention.

Hence my questions are:

  1. Why not use GreedyEmbeddingHelper during training? This would make the decoder auto-regressive during training, which is more similar to what happens during inference.
  2. Any ideas why TrainingHelper fails to work on bigger tasks? (It looks like overfitting, as if the model only learned a target-language LM, but the encoder and attention weights/gradients aren't zeros.)
    @ebrevdo could you please comment on this?
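(For readers comparing the two helpers, a minimal runnable sketch of the difference; all tensors here are stand-ins. TrainingHelper feeds the embedded gold previous token at every step, i.e. teacher forcing, while GreedyEmbeddingHelper feeds back the embedding of the model's own argmax, which is why exposure bias only shows up at inference:)

import tensorflow as tf
from tensorflow.contrib import seq2seq  # TF 1.2-era API, as used above

batch_size, vocab_size, emb_size, sos_id, eos_id = 32, 1000, 64, 1, 2
embedding_decoder = tf.get_variable("emb", [vocab_size, emb_size])
target_ids = tf.random_uniform([batch_size, 20], maxval=vocab_size,
                               dtype=tf.int32)
target_lengths = tf.fill([batch_size], 20)

# teacher forcing: inputs are the embedded gold target sequence
train_helper = seq2seq.TrainingHelper(
    inputs=tf.nn.embedding_lookup(embedding_decoder, target_ids),
    sequence_length=target_lengths)

# autoregressive greedy decoding: next input = embedding of own argmax
infer_helper = seq2seq.GreedyEmbeddingHelper(
    embedding=embedding_decoder,
    start_tokens=tf.fill([batch_size], sos_id),
    end_token=eos_id)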

If I don't want to use a dev set and test set

Hello, everyone,

I'm training a model with a shell script that specifies my dev set and test set files like this:

--dev_prefix=./nmt_dev_test_data/dev
--test_prefix=./nmt_dev_test_data/test

But I found it very time-consuming when "decoding to output ./nmt_model/output_dev". On my machine with 12 CPU cores and 4 GPUs, it takes 12 hours! My dev set file has 100,000 lines.

Now I have two questions:

  1. If I don't use a dev set and test set during training, will this affect my trained model?
  2. If it doesn't affect the trained model, how can I modify my shell script or the source code to achieve this?

Thanks, everybody.
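(Partial answer: the dev/test sets drive evaluation only, not the gradient updates, so skipping them does not change the learned weights; what you lose is perplexity tracking and the BLEU-based selection of the best checkpoint. The cheap fix is to point --dev_prefix/--test_prefix at a few thousand held-out lines instead of 100,000, or to raise --steps_per_external_eval so the expensive decoding runs less often.)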

Error with import of "inference" in nmt.py

I can't get past the import of "inference.py". Here is my error message:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 163, in _run_module_as_main
    mod_name, _Error)
  File "/usr/lib/python2.7/runpy.py", line 102, in _get_module_details
    loader = get_loader(mod_name)
  File "/usr/lib/python2.7/pkgutil.py", line 464, in get_loader
    return find_loader(fullname)
  File "/usr/lib/python2.7/pkgutil.py", line 474, in find_loader
    for importer in iter_importers(fullname):
  File "/usr/lib/python2.7/pkgutil.py", line 430, in iter_importers
    __import__(pkg)
  File "nmt.py", line 28, in <module>
    from . import inference
ValueError: Attempted relative import in non-package

Cannot replicate the results mentioned in the repo (English-Vietnamese) .

I cannot replicate the result mentioned in the repo. Here are my settings:
Python 2.7
Tensorflow 1.2.1
Using a docker based on nvidia/cuda:8.0-cudnn5-devel-ubuntu14.04

The command I ran was:
python2 -m nmt.nmt
--src=vi --tgt=en
--vocab_prefix=/data/nmt/iwslt15/vocab
--train_prefix=/data/nmt/iwslt15/train
--dev_prefix=/data/nmt/iwslt15/tst2012
--test_prefix=/data/nmt/iwslt15/tst2013
--out_dir=/data/nmt/models/nmt_attention_model
--hparams_path=nmt/standard_hparams/iwslt15.json
--num_gpus=2

I got a BLEU score of 24.83; however, 26.1 is reported on the website.

How to fix this error?

I used attention here.

def decoding_layer(dec_input,encoder_outputs, encoder_state,source_sequence_length,
                   target_sequence_length, max_target_sequence_length,
                   rnn_size,
                   num_layers, target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    
    decoder_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    decoder_embed_input = tf.nn.embedding_lookup(decoder_embeddings, dec_input)
    
    def get_lstm(rnn_size):
        lstm = tf.contrib.rnn.LSTMCell(rnn_size, initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        cell = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
        return cell
    
    cell = tf.contrib.rnn.MultiRNNCell([get_lstm(rnn_size) for _ in range(num_layers)])
    attention_states = encoder_outputs
    attention_mechanism = tf.contrib.seq2seq.LuongAttention(100, attention_states,memory_sequence_length=source_sequence_length)
    cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention_mechanism,attention_layer_size=100)
    
    # output layer
    output_layer = layers_core.Dense(target_vocab_size,
                         kernel_initializer = tf.truncated_normal_initializer(mean=0.0, stddev=0.1))
    
    # training logits
    with tf.variable_scope("decode"):
        training_logits = decoding_layer_train(encoder_state, 
                                           cell, 
                                           decoder_embed_input, 
                                           target_sequence_length, 
                                           max_target_sequence_length, 
                                           output_layer, 
                                           keep_prob)
    # inference logits
    with tf.variable_scope("decode", reuse=True):
        inference_logits = decoding_layer_infer(encoder_state, 
                                                cell, 
                                                decoder_embeddings, 
                                                target_vocab_to_int['<GO>'], 
                                                target_vocab_to_int['<EOS>'], 
                                                max_target_sequence_length, 
                                                    target_vocab_size, 
                                                    output_layer, 
                                                    batch_size, 
                                                    keep_prob)
    return training_logits, inference_logits

I got an error as follows:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-43-0fc3bbadbc36> in <module>()
     21                                                    rnn_size,
     22                                                    num_layers,
---> 23                                                    target_word_to_int)
     24 
     25 

<ipython-input-41-b5a1043ac14d> in seq2seq_model(input_data, target_data, keep_prob, batch_size, source_sequence_length, target_sequence_length, max_target_sentence_length, source_vocab_size, target_vocab_size, enc_embedding_size, dec_embedding_size, rnn_size, num_layers, target_vocab_to_int)
     45                                                                        batch_size,
     46                                                                        keep_prob,
---> 47                                                                        dec_embedding_size) 
     48 
     49     return enc_output, training_decoder_output, inference_decoder_output

<ipython-input-40-eff236afd715> in decoding_layer(dec_input, encoder_outputs, encoder_state, source_sequence_length, target_sequence_length, max_target_sequence_length, rnn_size, num_layers, target_vocab_to_int, target_vocab_size, batch_size, keep_prob, decoding_embedding_size)
     45                                            max_target_sequence_length,
     46                                            output_layer,
---> 47                                            keep_prob)
     48     # inference logits
     49     with tf.variable_scope("decode", reuse=True):

<ipython-input-18-44bb37848c0e> in decoding_layer_train(encoder_state, dec_cell, dec_embed_input, target_sequence_length, max_summary_length, output_layer, keep_prob)
     24     training_decoder_output, _, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder, 
     25                                                                    impute_finished=True,
---> 26                                                                    maximum_iterations=max_summary_length)
     27 
     28     return training_decoder_output

/usr/local/lib/python3.5/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py in dynamic_decode(decoder, output_time_major, impute_finished, maximum_iterations, parallel_iterations, swap_memory, scope)
    284         ],
    285         parallel_iterations=parallel_iterations,
--> 286         swap_memory=swap_memory)
    287 
    288     final_outputs_ta = res[1]

/usr/local/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in while_loop(cond, body, loop_vars, shape_invariants, parallel_iterations, back_prop, swap_memory, name)
   2768     context = WhileContext(parallel_iterations, back_prop, swap_memory, name)
   2769     ops.add_to_collection(ops.GraphKeys.WHILE_CONTEXT, context)
-> 2770     result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
   2771     return result
   2772 

/usr/local/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in BuildLoop(self, pred, body, loop_vars, shape_invariants)
   2597       self.Enter()
   2598       original_body_result, exit_vars = self._BuildLoop(
-> 2599           pred, body, original_loop_vars, loop_vars, shape_invariants)
   2600     finally:
   2601       self.Exit()

/usr/local/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in _BuildLoop(self, pred, body, original_loop_vars, loop_vars, shape_invariants)
   2547         structure=original_loop_vars,
   2548         flat_sequence=vars_for_body_with_tensor_arrays)
-> 2549     body_result = body(*packed_vars_for_body)
   2550     if not nest.is_sequence(body_result):
   2551       body_result = [body_result]

/usr/local/lib/python3.5/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py in body(time, outputs_ta, state, inputs, finished, sequence_lengths)
    232       """
    233       (next_outputs, decoder_state, next_inputs,
--> 234        decoder_finished) = decoder.step(time, inputs, state)
    235       next_finished = math_ops.logical_or(decoder_finished, finished)
    236       if maximum_iterations is not None:

/usr/local/lib/python3.5/site-packages/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py in step(self, time, inputs, state, name)
    137     """
    138     with ops.name_scope(name, "BasicDecoderStep", (time, inputs, state)):
--> 139       cell_outputs, cell_state = self._cell(inputs, state)
    140       if self._output_layer is not None:
    141         cell_outputs = self._output_layer(cell_outputs)

/usr/local/lib/python3.5/site-packages/tensorflow/python/ops/rnn_cell_impl.py in __call__(self, inputs, state, scope)
    178       with vs.variable_scope(vs.get_variable_scope(),
    179                              custom_getter=self._rnn_get_variable):
--> 180         return super(RNNCell, self).__call__(inputs, state)
    181 
    182   def _rnn_get_variable(self, getter, *args, **kwargs):

/usr/local/lib/python3.5/site-packages/tensorflow/python/layers/base.py in __call__(self, inputs, *args, **kwargs)
    439         # Check input assumptions set after layer building, e.g. input shape.
    440         self._assert_input_compatibility(inputs)
--> 441         outputs = self.call(inputs, *args, **kwargs)
    442 
    443         # Apply activity regularization.

/usr/local/lib/python3.5/site-packages/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py in call(self, inputs, state)
    704     # Step 1: Calculate the true inputs to the cell based on the
    705     # previous attention value.
--> 706     cell_inputs = self._cell_input_fn(inputs, state.attention)
    707     cell_state = state.cell_state
    708     cell_output, next_cell_state = self._cell(cell_inputs, cell_state)

AttributeError: 'tuple' object has no attribute 'attention'
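(The traceback gives the clue: the AttentionWrapper's call expects its state to be an AttentionWrapperState, but decoding_layer_train/_infer above are seeded with the raw encoder_state tuple as the decoder's initial state. The usual fix, sketched under that assumption, is to build the wrapper's own zero state and clone the encoder state into its cell_state slot:)

# inside decoding_layer, after wrapping the cell with AttentionWrapper:
initial_state = cell.zero_state(batch_size, tf.float32).clone(
    cell_state=encoder_state)
# then pass initial_state (not encoder_state) to
# tf.contrib.seq2seq.BasicDecoder in both the train and infer branches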

Question: How to set the encoding to prevent unexpected conversion?

While following the tutorial, I met the two issues below. Would you please share your comments and suggestions? Thanks in advance!

  1. Encoding?
    The target strings were converted into escaped bytes such as "B\xc2\xa0...". I notice that in the tutorial, text should be displayed as it is (for example, Vietnamese text). I use Ubuntu x64 + Python 3.5. The default encoding of the system is UTF-8. Any idea what I might be doing wrong here?
    [screenshot: issue01]

  2. Errors as below.
    There's no error during training; however, we cannot use the trained model and get the errors below. I guess it might relate to the first issue, but I cannot figure out where exactly it goes wrong. Any suggestion would be welcome.
    [screenshot: issue02]

getting <unk> for all words

Hi,

I have successfully trained the Vietnamese-English system as given in the tutorial.

I am training on an English-Hindi pair of around 1.5 million parallel sentences. The training completed without any error.

But while translating test sentences, I am getting all <unk> <unk>.

I have taken a very small vocabulary file for Hindi, i.e. a size of 8k words.

Is this the reason?

Thanks,
Sriram

en-zh or zh-en translation

Is there any tutorial for en-zh or zh-en translation? How can I get training and eval datasets for this language pair?

dynamic_decode raises a segmentation fault & inference.load_data filters some lines

Hello,

I use en-zh data as in tmp.zip
and put these files in /tmp/nmt_data:

python -m nmt.nmt --src=zh --tgt=en --vocab_prefix=/tmp/nmt_data/vocab --train_prefix=/tmp/nmt_data/dev2 --dev_prefix=/tmp/nmt_data/dev2 --test_prefix=/tmp/nmt_data/dev2 --out_dir=/tmp/nmt_model_zh2en --num_train_steps=12000 --steps_per_stats=100 --num_layers=2 --num_units=128 --dropout=0.2 --metrics=ble

Then I get a segmentation fault; it should be reproducible.

After some debugging, I located the problem in dynamic_decode, at line 326 of model.py, but I can't get any further in solving it.
Can you give some suggestions? Thanks a lot.

Another small problem is that inference.load_data filters some lines, which makes the loaded zh and en data have different lengths.

Note: tmp.zip is placed at https://github.com/hxsnow10/nmt_problem

TypeError: Expected binary or unicode string, got None

I have successfully finished the training task with the following parameters:
drwael@drwael-VirtualBox:~/nmt$ python -m nmt.nmt \

--src=vi --tgt=en \
--vocab_prefix=/home/drwael/mynmt/tmp/nmt_data/vocab  \
--train_prefix=/home/drwael/mynmt/tmp/nmt_data/train \
--dev_prefix=/home/drwael/mynmt/tmp/nmt_data/tst2012  \
--test_prefix=/home/drwael/mynmt/tmp/nmt_data/tst2013 \
--out_dir=/home/drwael/mynmt/tmp/nmt_model \
--num_train_steps=12000 \
--steps_per_stats=100 \
--num_layers=2 \
--num_units=128 \
--dropout=0.2 \
--metrics=bleu 

But I received the following error when trying to test the model after copying some sentences to my_infer_file.vi:
drwael@drwael-VirtualBox:~/nmt$ python -m nmt.nmt --model_dir=/home/drwael/mynmt/tmp/nmt_model --inference_input_file=/home/drwael/mynmt/tmp/my_infer_file.vi --inference_output_file=/home/drwael/mynmt/tmp/nmt_model/output_infer

Job id 0

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/drwael/nmt/nmt/nmt.py", line 478, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/drwael/nmt/nmt/nmt.py", line 438, in main
    if not tf.gfile.Exists(out_dir): tf.gfile.MakeDirs(out_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 252, in file_exists
    pywrap_tensorflow.FileExists(compat.as_bytes(filename), status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
    (bytes_or_text,))
TypeError: Expected binary or unicode string, got None
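(Whatever else is going on, the traceback shows out_dir is None by the time tf.gfile.Exists runs, which suggests --model_dir is not a flag this version of nmt.py recognizes. The README's inference examples pass the trained model's directory as --out_dir, e.g. --out_dir=/home/drwael/mynmt/tmp/nmt_model, which should get past this error.)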

Failure during inference

Hi,

I followed the installation instructions (installed TF 1.2.1, cloned the repo, ran 'nmt/scripts/download_iwslt15.sh /tmp/nmt_data' and downloaded the pre-trained weight files).

I ran the following command which is specified in the envi_model_1 README:

python -m nmt.nmt --src=en --tgt=vi
--ckpt=models/envi_model_1/translate.ckpt
--hparams_path=nmt/standard_hparams/iwslt15.json
--out_dir=/tmp/envi
--vocab_prefix=/tmp/nmt_data/vocab
--inference_input_file=/tmp/nmt_data/tst2013.en
--inference_output_file=/tmp/envi/output_infer --inference_ref_file=/tmp/nmt_data/tst2013.vi

The result was:

Job id 0

Loading hparams from /tmp/envi/hparams

Loading standard hparams from nmt/standard_hparams/iwslt15.json

saving hparams to /tmp/envi/hparams
saving hparams to /tmp/envi/best_bleu/hparams
attention=scaled_luong
attention_architecture=standard
batch_size=128
beam_width=10
best_bleu=0
best_bleu_dir=/tmp/envi/best_bleu
bpe_delimiter=None
colocate_gradients_with_ops=True
decay_factor=0.5
decay_steps=1000
dev_prefix=None
dropout=0.2
encoder_type=bi
eos=</s>
epoch_step=0
forget_bias=1.0
infer_batch_size=32
init_weight=0.1
learning_rate=1.0
length_penalty_weight=0.0
log_device_placement=False
max_gradient_norm=5.0
max_train=0
metrics=[u'bleu']
num_buckets=5
num_gpus=1
num_layers=2
num_residual_layers=0
num_train_steps=12000
num_units=512
optimizer=sgd
out_dir=/tmp/envi
pass_hidden_state=True
random_seed=None
residual=False
share_vocab=False
sos=<s>
source_reverse=False
src=vi
src_max_len=50
src_max_len_infer=None
src_vocab_file=nmt_data/vocab.vi
src_vocab_size=7709
start_decay_step=8000
steps_per_external_eval=None
steps_per_stats=100
test_prefix=None
tgt=en
tgt_max_len=50
tgt_max_len_infer=None
tgt_vocab_file=nmt_data/vocab.en
tgt_vocab_size=17191
time_major=True
train_prefix=None
unit_type=lstm
vocab_prefix=nmt_data/vocab

creating infer graph ...

num_bi_layers = 1, num_bi_residual_layers=0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
start_decay_step=8000, learning_rate=1, decay_steps 1000,decay_factor 0.5

Trainable variables

embeddings/encoder/embedding_encoder:0, (7709, 512),
embeddings/decoder/embedding_decoder:0, (17191, 512),
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/fw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/encoder/bidirectional_rnn/bw/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/memory_layer/kernel:0, (1024, 512),
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (1536, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (1536, 512), /device:GPU:0
dynamic_seq2seq/decoder/output_projection/kernel:0, (512, 17191),
2017-08-09 16:35:38.750263: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-09 16:35:38.750287: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-09 16:35:38.750295: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-09 16:35:38.750303: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-09 16:35:38.750309: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
models/envi_model_1/translate.ckpt
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/emeiri/work/nmt/nmt/nmt.py", line 483, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/emeiri/work/nmt/nmt/nmt.py", line 476, in main
    run_main(FLAGS, default_hparams, train_fn, inference_fn)
  File "/home/emeiri/work/nmt/nmt/nmt.py", line 455, in run_main
    trans_file, hparams, num_workers, jobid)
  File "nmt/inference.py", line 165, in inference
    hparams)
  File "nmt/inference.py", line 191, in single_worker_inference
    infer_model.model, ckpt, sess, "infer")
  File "nmt/model_helper.py", line 207, in load_model
    model.saver.restore(session, ckpt)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1548, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [512,17191] rhs shape= [512,7709]
	 [[Node: save/Assign_8 = Assign[T=DT_FLOAT, _class=["loc:@dynamic_seq2seq/decoder/output_projection/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](dynamic_seq2seq/decoder/output_projection/kernel, save/RestoreV2_8)]]

Caused by op u'save/Assign_8', defined at:
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/emeiri/work/nmt/nmt/nmt.py", line 483, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/emeiri/work/nmt/nmt/nmt.py", line 476, in main
    run_main(FLAGS, default_hparams, train_fn, inference_fn)
  File "/home/emeiri/work/nmt/nmt/nmt.py", line 455, in run_main
    trans_file, hparams, num_workers, jobid)
  File "nmt/inference.py", line 157, in inference
    single_cell_fn)
  File "nmt/inference.py", line 79, in create_infer_model
    single_cell_fn=single_cell_fn)
  File "nmt/attention_model.py", line 55, in __init__
    single_cell_fn=single_cell_fn)
  File "nmt/model.py", line 167, in __init__
    self.saver = tf.train.Saver(tf.global_variables())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 419, in _AddRestoreOps
    assign_ops.append(saveable.restore(tensors, shapes))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 155, in restore
    self.op.get_shape().is_fully_defined())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/state_ops.py", line 271, in assign
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 45, in assign
    use_locking=use_locking, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [512,17191] rhs shape= [512,7709]
[[Node: save/Assign_8 = Assign[T=DT_FLOAT, _class=["loc:@dynamic_seq2seq/decoder/output_projection/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](dynamic_seq2seq/decoder/output_projection/kernel, save/RestoreV2_8)]]

Any idea what the problem is?

Thanks a lot,

Etay
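(One hedged reading of the log above: the loaded hparams say src=vi, tgt=en with tgt_vocab_size=17191, so the graph's output projection is sized for the English vocabulary, while the envi checkpoint was trained with a 7709-word (Vietnamese) target vocabulary, exactly the [512,17191] vs [512,7709] mismatch in the error. The hparams restored from /tmp/envi/hparams appear to be overriding the --src=en --tgt=vi direction given on the command line.)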

Error when using multiple metrics

There seems to be a bug when using multiple metrics.
It exits with an error after outputting "# Best %s" % metrics...

Here's the error traceback:

...
bleu dev: 0.0
rouge dev: 0.0
bleu test: 0.0
rouge test: 0.0
# Best rouge, step 0 step-time 0.38 wps 9.17K, dev ppl 17190.87, dev bleu 0.0, dev rouge 0.0, test ppl 17190.47, test bleu 0.0, test rouge 0.0, Sun Jul 23 21:36:22 2017
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/xianda/nmt/nmt/nmt.py", line 478, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/xianda/nmt/env/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/xianda/nmt/nmt/nmt.py", line 471, in main
    train.train(hparams)
  File "nmt/train.py", line 464, in train
    return (dev_scores, test_scores, dev_ppl, test_ppl, global_step)
UnboundLocalError: local variable 'dev_scores' referenced before assignment
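(From the traceback this looks like a genuine bug in that version of train.py: the return at line 464 references dev_scores, which evidently is not assigned on the multiple-metrics code path. Initializing dev_scores/test_scores before the final-evaluation block, or assigning them in every branch, would be the likely patch.)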

ImportError: No module named _collections

Under the "gnmt-mater" directory, I executed the following command to train model:
python -m nmt.nmt --hparams_path=./nmt/standard_hparams/iwslt15.json --src=vi --tgt=en --vocab_prefix=./nmt/nmt_data/iwslt15/vocab --train_prefix=./nmt/nmt_data/iwslt15/train --dev_prefix=./nmt/nmt_data/iwslt15/tst2012 --test_prefix=./nmt/nmt_data/iwslt15/tst2013 --out_dir=./nmt/nmt_attention_model/iwslt152
But it reported an error:

Traceback (most recent call last):
  File "/disk1/wangfang/anaconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/disk1/wangfang/anaconda2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/disk1/wangfang/gnmt-master/nmt/nmt.py", line 19, in <module>
    import argparse
  File "/disk1/wangfang/anaconda2/lib/python2.7/argparse.py", line 85, in <module>
    import collections as _collections
  File "/disk1/wangfang/anaconda2/lib/python2.7/collections.py", line 20, in <module>
    from _collections import deque, defaultdict
ImportError: No module named _collections

I executed the training command successfully two days ago! I don't know what happened. Does anybody know how to solve this problem?

Translated strings still contain BPE delimiter and EOS token.

I am using Python 3.6.1 with TensorFlow version 1.2.1.
It seems like the function get_translation doesn't work correctly.
The problem may be caused by the reverse_target_vocab_table, which converts target sample ids into target words.
The result of the lookup operation on reverse_target_vocab_table is not a pure str type; instead it is something like a byte string:

[b'<s>', b'</s>']

And this causes a mismatch between '</s>' and b'</s>'.
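(A sketch of one way to make the comparison consistent: normalize the EOS marker to bytes before comparing it with the lookup-table output, then decode at the end.)

# plain-Python illustration of the bytes-vs-str mismatch and one fix
tgt_eos = "</s>".encode("utf-8")        # b'</s>', matching table output
output = [b"hello", b"world", b"</s>"]
if tgt_eos in output:
    output = output[:output.index(tgt_eos)]  # cut at EOS
translation = b" ".join(output).decode("utf-8")
print(translation)  # hello world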
