tensor2tensor's Introduction

Tensor2Tensor

Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

T2T was developed by researchers and engineers in the Google Brain team and a community of users. It is now deprecated — we keep it running and welcome bug-fixes, but encourage users to use the successor library Trax.

Quick Start

This IPython notebook explains T2T and runs in your browser using a free VM from Google; no installation is needed. Alternatively, here is a one-command version that installs T2T, downloads MNIST, trains a model, and evaluates it:

pip install tensor2tensor && t2t-trainer \
  --generate_data \
  --data_dir=~/t2t_data \
  --output_dir=~/t2t_train/mnist \
  --problem=image_mnist \
  --model=shake_shake \
  --hparams_set=shake_shake_quick \
  --train_steps=1000 \
  --eval_steps=100

Suggested Datasets and Models

Below we list a number of tasks that can be solved with T2T when you train the appropriate model on the appropriate problem. We give the problem and model below and we suggest a setting of hyperparameters that we know works well in our setup. We usually run either on Cloud TPUs or on 8-GPU machines; you might need to modify the hyperparameters if you run on a different setup.

Mathematical Language Understanding

For evaluating mathematical expressions at the character level involving addition, subtraction and multiplication of both positive and negative decimal numbers with variable digits assigned to symbolic variables, use

  • the MLU data-set: --problem=algorithmic_math_two_variables

You can try solving the problem with different transformer models and hyperparameters, as described in the paper (a full training command is sketched after the list):

  • Standard transformer: --model=transformer --hparams_set=transformer_tiny
  • Universal transformer: --model=universal_transformer --hparams_set=universal_transformer_tiny
  • Adaptive universal transformer: --model=universal_transformer --hparams_set=adaptive_universal_transformer_tiny
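
For example, the flags above can be combined into a single run in the same style as the Quick Start (a sketch; the output directory and step counts are illustrative):

t2t-trainer \
  --generate_data \
  --data_dir=~/t2t_data \
  --output_dir=~/t2t_train/mlu \
  --problem=algorithmic_math_two_variables \
  --model=universal_transformer \
  --hparams_set=universal_transformer_tiny \
  --train_steps=100000 \
  --eval_steps=100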

Story, Question and Answer

For answering questions based on a story, use

  • the bAbi data-set: --problem=babi_qa_concat_task1_1k

You can choose the bAbi task from the range [1,20] and the subset from 1k or 10k. To combine test data from all tasks into a single test set, use --problem=babi_qa_concat_all_tasks_10k.

Image Classification

For image classification, we have a number of standard data-sets:

  • ImageNet (a large data-set): --problem=image_imagenet, or one of the re-scaled versions (image_imagenet224, image_imagenet64, image_imagenet32)
  • CIFAR-10: --problem=image_cifar10 (or --problem=image_cifar10_plain to turn off data augmentation)
  • CIFAR-100: --problem=image_cifar100
  • MNIST: --problem=image_mnist

For ImageNet, we suggest using ResNet or Xception, i.e., --model=resnet --hparams_set=resnet_50 or --model=xception --hparams_set=xception_base. ResNet should reach above 76% top-1 accuracy on ImageNet.

For CIFAR and MNIST, we suggest trying the shake-shake model: --model=shake_shake --hparams_set=shakeshake_big. This setting, trained for --train_steps=700000, should yield close to 97% accuracy on CIFAR-10.

Image Generation

For (un)conditional image generation, we have a number of standard data-sets:

  • CelebA: --problem=img2img_celeba for image-to-image translation, namely, superresolution from 8x8 to 32x32.
  • CelebA-HQ: --problem=image_celeba256_rev for a downsampled 256x256 version.
  • CIFAR-10: --problem=image_cifar10_plain_gen_rev for class-conditional 32x32 generation.
  • LSUN Bedrooms: --problem=image_lsun_bedrooms_rev
  • MS-COCO: --problem=image_text_ms_coco_rev for text-to-image generation.
  • Small ImageNet (a large data-set): --problem=image_imagenet32_gen_rev for 32x32 or --problem=image_imagenet64_gen_rev for 64x64.

We suggest using the Image Transformer, i.e., --model=imagetransformer, or the Image Transformer Plus, i.e., --model=imagetransformerpp, which uses a discretized mixture of logistics, or the variational auto-encoder, i.e., --model=transformer_ae. For CIFAR-10, using --hparams_set=imagetransformer_cifar10_base or --hparams_set=imagetransformer_cifar10_base_dmol yields 2.90 bits per dimension. For ImageNet-32, using --hparams_set=imagetransformer_imagenet32_base yields 3.77 bits per dimension.

Language Modeling

For language modeling, we have these data-sets in T2T:

  • PTB (a small data-set): --problem=languagemodel_ptb10k for word-level modeling and --problem=languagemodel_ptb_characters for character-level modeling.
  • LM1B (a billion-word corpus): --problem=languagemodel_lm1b32k for subword-level modeling and --problem=languagemodel_lm1b_characters for character-level modeling.

We suggest starting with --model=transformer on this task, using --hparams_set=transformer_small for PTB and --hparams_set=transformer_base for LM1B.

Sentiment Analysis

For the task of recognizing the sentiment of a sentence, use

  • the IMDB data-set: --problem=sentiment_imdb

We suggest using --model=transformer_encoder here; since it is a small data-set, try --hparams_set=transformer_tiny and train for a few steps (e.g., --train_steps=2000).
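
For example, a full run might look like this (a sketch following the Quick Start pattern; directories are illustrative):

t2t-trainer \
  --generate_data \
  --data_dir=~/t2t_data \
  --output_dir=~/t2t_train/imdb \
  --problem=sentiment_imdb \
  --model=transformer_encoder \
  --hparams_set=transformer_tiny \
  --train_steps=2000 \
  --eval_steps=100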

Speech Recognition

For speech-to-text, we have these data-sets in T2T:

  • Librispeech (US English): --problem=librispeech for the whole set and --problem=librispeech_clean for a smaller but nicely filtered part.

  • Mozilla Common Voice (US English): --problem=common_voice for the whole set and --problem=common_voice_clean for a quality-checked subset.

Summarization

For summarizing longer texts into shorter ones, we have these data-sets:

  • CNN/DailyMail articles summarized into a few sentences: --problem=summarize_cnn_dailymail32k

We suggest using --model=transformer and --hparams_set=transformer_prepend for this task. This yields good ROUGE scores.

Translation

There are a number of translation data-sets in T2T:

  • English-German: --problem=translate_ende_wmt32k
  • English-French: --problem=translate_enfr_wmt32k
  • English-Czech: --problem=translate_encs_wmt32k
  • English-Chinese: --problem=translate_enzh_wmt32k
  • English-Vietnamese: --problem=translate_envi_iwslt32k
  • English-Spanish: --problem=translate_enes_wmt32k

You can get translations in the other direction by appending _rev to the problem name, e.g., for German-English use --problem=translate_ende_wmt32k_rev (note that you still need to download the original data with t2t-datagen --problem=translate_ende_wmt32k).
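
For example, training German-English might look like this (a sketch; directories are illustrative, and the datagen step is the same one used for the forward problem):

t2t-datagen \
  --data_dir=~/t2t_data \
  --tmp_dir=/tmp/t2t_datagen \
  --problem=translate_ende_wmt32k

t2t-trainer \
  --data_dir=~/t2t_data \
  --output_dir=~/t2t_train/translate_deen \
  --problem=translate_ende_wmt32k_rev \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu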

For all translation problems, we suggest trying the Transformer model: --model=transformer. At first it is best to try the base setting, --hparams_set=transformer_base. When trained on 8 GPUs for 300K steps this should reach a BLEU score of about 28 on the English-German data-set, which is close to state of the art. If training on a single GPU, try the --hparams_set=transformer_base_single_gpu setting. For very good results or larger data-sets (e.g., for English-French), try the big model with --hparams_set=transformer_big.

See this example for a walkthrough of how translation works.

Basics

Walkthrough

Here's a walkthrough training a good English-to-German translation model using the Transformer model from Attention Is All You Need on WMT data.

pip install tensor2tensor

# See what problems, models, and hyperparameter sets are available.
# You can easily swap between them (and add new ones).
t2t-trainer --registry_help

PROBLEM=translate_ende_wmt32k
MODEL=transformer
HPARAMS=transformer_base_single_gpu

DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM

# Train
# If you run out of memory, add --hparams='batch_size=1024'.
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR

# Decode

DECODE_FILE=$DATA_DIR/decode_this.txt
echo "Hello world" >> $DECODE_FILE
echo "Goodbye world" >> $DECODE_FILE
echo -e 'Hallo Welt\nAuf Wiedersehen Welt' > ref-translation.de

BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE \
  --decode_to_file=translation.de

# See the translations
cat translation.de

# Evaluate the BLEU score
# Note: Report this BLEU score in papers, not the internal approx_bleu metric.
t2t-bleu --translation=translation.de --reference=ref-translation.de

Installation

# Assumes tensorflow or tensorflow-gpu installed
pip install tensor2tensor

# Installs with tensorflow-gpu requirement
pip install tensor2tensor[tensorflow_gpu]

# Installs with tensorflow (cpu) requirement
pip install tensor2tensor[tensorflow]

Binaries:

# Data generator
t2t-datagen

# Trainer
t2t-trainer --registry_help

Library usage:

python -c "from tensor2tensor.models.transformer import Transformer"
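
The registry can also be queried from Python; a minimal sketch (API names as used in the T2T Colab notebook; importing the models and problems packages is what triggers registration):

from tensor2tensor import models  # registers the built-in models on import
from tensor2tensor import problems
from tensor2tensor.utils import registry

print(problems.available()[:5])    # problem names usable with --problem=...
print(registry.list_models()[:5])  # model names usable with --model=...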

Features

  • Many state of the art and baseline models are built-in and new models can be added easily (open an issue or pull request!).
  • Many datasets across modalities - text, audio, image - available for generation and use, and new ones can be added easily (open an issue or pull request for public datasets!).
  • Models can be used with any dataset and input mode (or even multiple); all modality-specific processing (e.g. embedding lookups for text tokens) is done with bottom and top transformations, which are specified per-feature in the model.
  • Support for multi-GPU machines and synchronous (1 master, many workers) and asynchronous (independent workers synchronizing through a parameter server) distributed training.
  • Easily swap amongst datasets and models by command-line flag with the data generation script t2t-datagen and the training script t2t-trainer.
  • Train on Google Cloud ML and Cloud TPUs.

T2T overview

Problems

Problems consist of features such as inputs and targets, and metadata such as each feature's modality (e.g. symbol, image, audio) and vocabularies. Problem features are given by a dataset, which is stored as a TFRecord file with tensorflow.Example protocol buffers. All problems are imported in all_problems.py or are registered with @registry.register_problem. Run t2t-datagen to see the list of available problems and download them.
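
For instance, a problem's generated data can be inspected from Python; a minimal sketch (assuming the TFRecords were already generated into data_dir, e.g. with t2t-datagen; API names as used in the T2T Colab notebook):

import os
import tensorflow as tf
from tensor2tensor import problems

data_dir = os.path.expanduser("~/t2t_data")
mnist = problems.problem("image_mnist")
dataset = mnist.dataset(tf.estimator.ModeKeys.EVAL, data_dir)
# Each element is a feature dict, e.g. {"inputs": <image>, "targets": <label>}.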

Models

T2TModels define the core tensor-to-tensor computation. They apply a default transformation to each input and output so that models may deal with modality-independent tensors (e.g. embeddings at the input; and a linear transform at the output to produce logits for a softmax over classes). All models are imported in the models subpackage, inherit from T2TModel, and are registered with @registry.register_model.

Hyperparameter Sets

Hyperparameter sets are encoded in HParams objects and registered with @registry.register_hparams. Every model and problem has its own HParams. A basic set of hyperparameters is defined in common_hparams.py, and hyperparameter set functions can compose other hyperparameter set functions.
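
For example, a new set can start from transformer_base and override a few values (a sketch; the set name and the override are illustrative); once registered, it is usable via --hparams_set=transformer_base_batch1k:

from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

@registry.register_hparams
def transformer_base_batch1k():
  """transformer_base with a smaller batch size to save memory."""
  hparams = transformer.transformer_base()
  hparams.batch_size = 1024
  return hparams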

Trainer

The trainer binary is the entrypoint for training, evaluation, and inference. Users can easily switch between problems, models, and hyperparameter sets by using the --model, --problem, and --hparams_set flags. Specific hyperparameters can be overridden with the --hparams flag. --schedule and related flags control local and distributed training/evaluation (distributed training documentation).
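
For example, reusing the walkthrough variables (a sketch; the override values are illustrative):

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --hparams='batch_size=1024,learning_rate=0.1' \
  --output_dir=$TRAIN_DIR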

Adding your own components

T2T's components are registered using a central registration mechanism that enables easily adding new ones and easily swapping amongst them by command-line flag. You can add your own components without editing the T2T codebase by specifying the --t2t_usr_dir flag in t2t-trainer.

You can do so for models, hyperparameter sets, modalities, and problems. Please do submit a pull request if your component might be useful to others.

See the example_usr_dir for an example user directory.
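
For example, if ~/my_usr_dir/__init__.py imports a module that registers a problem named my_problem (the directory and problem name are illustrative), it can be trained without touching the T2T codebase:

t2t-trainer \
  --t2t_usr_dir=~/my_usr_dir \
  --data_dir=~/t2t_data \
  --output_dir=~/t2t_train/my_problem \
  --problem=my_problem \
  --model=transformer \
  --hparams_set=transformer_base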

Adding a dataset

To add a new dataset, subclass Problem and register it with @registry.register_problem. See TranslateEndeWmt8k for an example. Also see the data generators README.
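
A minimal sketch of such a subclass, using the text_problems helpers (the class name and toy data are illustrative; the registered problem name is the snake_cased class name, here my_reverse_words):

from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry

@registry.register_problem
class MyReverseWords(text_problems.Text2TextProblem):
  """Toy problem: predict the words of the input in reverse order."""

  @property
  def approx_vocab_size(self):
    return 2**13  # ~8k subwords

  @property
  def is_generate_per_split(self):
    return False  # generate one stream; T2T carves out the eval split

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    del data_dir, tmp_dir, dataset_split  # unused by this toy generator
    for s in ["hello brave new world", "the quick brown fox"]:
      yield {"inputs": s, "targets": " ".join(reversed(s.split()))}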

Run on FloydHub

Click this button to open a Workspace on FloydHub. You can use the workspace to develop and test your code on a fully configured cloud GPU machine.

Tensor2Tensor comes preinstalled in the environment; you can simply open a Terminal and run your code.

# Test the quick-start on a Workspace's Terminal with this command
t2t-trainer \
  --generate_data \
  --data_dir=./t2t_data \
  --output_dir=./t2t_train/mnist \
  --problem=image_mnist \
  --model=shake_shake \
  --hparams_set=shake_shake_quick \
  --train_steps=1000 \
  --eval_steps=100

Note: Ensure compliance with the FloydHub Terms of Service.

Papers

When referencing Tensor2Tensor, please cite this paper.

@article{tensor2tensor,
  author    = {Ashish Vaswani and Samy Bengio and Eugene Brevdo and
    Francois Chollet and Aidan N. Gomez and Stephan Gouws and Llion Jones and
    \L{}ukasz Kaiser and Nal Kalchbrenner and Niki Parmar and Ryan Sepassi and
    Noam Shazeer and Jakob Uszkoreit},
  title     = {Tensor2Tensor for Neural Machine Translation},
  journal   = {CoRR},
  volume    = {abs/1803.07416},
  year      = {2018},
  url       = {http://arxiv.org/abs/1803.07416},
}

Tensor2Tensor was used to develop a number of state-of-the-art models and deep learning methods. Here we list some papers that were based on T2T from the start and benefited from its features and architecture in ways described in the Google Research Blog post introducing T2T.

NOTE: This is not an official Google product.

tensor2tensor's People

Contributors

afrozenator, aidangomez, artitw, blazejosinski, cbockman, conchylicultor, dusenberrymw, dustinvtran, endingcredits, katelee168, keyonvafa, kolloldas, konradczechowski, koz4k, lgeiger, lmthang, lukaszkaiser, martinpopel, mbz, mechcoder, nshazeer, redeipirati, royaurko, stefan-falk, stefan-it, toponado-zz, urvashik, vthorsteinsson, wangpengmit, ziy

tensor2tensor's Issues

Decoding problem for char-based translation

Hi,

I modified wmt_ende_characters to translate Macedonian to English (the BLEU score after training was 0.526888).

The input sentence is:

Kosovskiot proces na privatizaciјa se ispituva

Then the t2t-trainer command shows some weird output:

INFO:tensorflow:Restoring parameters from t2t_train/model.ckpt-250000
INFO:tensorflow:Inference results INPUT: Mquqxumkqv"rtqegu"pc"rtkxcvk|cekӚc"ug"kurkvwxc
INFO:tensorflow:Inference results OUTPUT: Mukwak.cwave.gurk.fe.ce.sce.gurkwe.ce.ce
INFO:tensorflow:Writing decodes into test.txt.transformer.transformer_base.beam4.alpha0.6.decodes

Tested with versions 1.0.5 and 1.0.7. Is this a bug?

algorithmic_reverse_decimal40 with the baseline_lstm_seq2seq model produces a NaN loss error

Steps to reproduce

Before training on a new generator (nlplike), I tried training the baseline_lstm_seq2seq model to see how it works on algorithmic_reverse_decimal40. This is the result (I am inside a Docker container):

root@df1a91a7be96:/t2t# PROBLEM=algorithmic_reverse_decimal40
root@df1a91a7be96:/t2t# MODEL=baseline_lstm_seq2seq
root@df1a91a7be96:/t2t# HPARAMS=basic_1
root@df1a91a7be96:/t2t# DATA_DIR=/tmp/t2t_data
root@df1a91a7be96:/t2t# TMP_DIR=/tmp/t2t_datagen
root@df1a91a7be96:/t2t# TRAIN_DIR=/tmp/t2t_train/$PROBLEM/$MODEL-$HPARAMS
root@df1a91a7be96:/t2t# mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
root@df1a91a7be96:/t2t# # Generate data
root@df1a91a7be96:/t2t# t2t-datagen \
>   --data_dir=$DATA_DIR \
>   --tmp_dir=$TMP_DIR \
>   --problem=$PROBLEM

INFO:tensorflow:Generating training data for algorithmic_reverse_decimal40.
INFO:tensorflow:Generating case 0 for algorithmic_reverse_decimal40-unshuffled-train.
INFO:tensorflow:Generating development data for algorithmic_reverse_decimal40.
INFO:tensorflow:Generating case 0 for algorithmic_reverse_decimal40-unshuffled-dev.
INFO:tensorflow:Shuffling data...
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
root@df1a91a7be96:/t2t# 
root@df1a91a7be96:/t2t# t2t-trainer \
>   --data_dir=$DATA_DIR \
>   --problems=$PROBLEM \
>   --model=$MODEL \
>   --hparams_set=$HPARAMS \
>   --output_dir=$TRAIN_DIR
INFO:tensorflow:Creating experiment, storing model files in /tmp/t2t_train/algorithmic_reverse_decimal40/baseline_lstm_seq2seq-basic_1
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:Using config: {'_model_dir': '/tmp/t2t_train/algorithmic_reverse_decimal40/baseline_lstm_seq2seq-basic_1', '_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 20, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fdc4e2b5390>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_num_worker_replicas': 0, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': '', '_session_config': allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
}
INFO:tensorflow:Performing local training.
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:Doing model_fn_body took 0.668 sec.
INFO:tensorflow:This model_fn took 0.983 sec.
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias        	shape    (256,)              	size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel      	shape    (128, 256)          	size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias        	shape    (256,)              	size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel      	shape    (128, 256)          	size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_2/basic_lstm_cell/bias        	shape    (256,)              	size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_2/basic_lstm_cell/kernel      	shape    (128, 256)          	size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_3/basic_lstm_cell/bias        	shape    (256,)              	size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_3/basic_lstm_cell/kernel      	shape    (128, 256)          	size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias        	shape    (256,)              	size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel      	shape    (128, 256)          	size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias        	shape    (256,)              	size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel      	shape    (128, 256)          	size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_2/basic_lstm_cell/bias        	shape    (256,)              	size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_2/basic_lstm_cell/kernel      	shape    (128, 256)          	size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_3/basic_lstm_cell/bias        	shape    (256,)              	size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_3/basic_lstm_cell/kernel      	shape    (128, 256)          	size    32768
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_0                                       	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_10                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_11                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_12                                      	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_13                                      	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_14                                      	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_15                                      	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_1                                       	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_2                                       	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_3                                       	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_4                                       	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_5                                       	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_6                                       	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_7                                       	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_8                                       	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/input_emb/weights_9                                       	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_0                                         	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_10                                        	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_11                                        	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_12                                        	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_13                                        	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_14                                        	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_15                                        	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_1                                         	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_2                                         	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_3                                         	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_4                                         	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_5                                         	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_6                                         	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_7                                         	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_8                                         	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/softmax/weights_9                                         	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_0                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_10                                     	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_11                                     	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_12                                     	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_13                                     	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_14                                     	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_15                                     	shape    (0, 64)             	size    0
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_1                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_2                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_3                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_4                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_5                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_6                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_7                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_8                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Weight    symbol_modality_12_64/target_emb/weights_9                                      	shape    (1, 64)             	size    64
INFO:tensorflow:Total trainable variables size: 266496
INFO:tensorflow:Total embedding variables size: 0
INFO:tensorflow:Total non-embedding variables size: 266496
INFO:tensorflow:Computing gradients for global model_fn.
INFO:tensorflow:Global model_fn finished.
INFO:tensorflow:Create CheckpointSaverHook.
2017-06-23 11:14:29.014420: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-23 11:14:29.014488: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-23 11:14:29.014519: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-23 11:14:29.076780: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-06-23 11:14:29.077245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX 670MX
major: 3 minor: 0 memoryClockRate (GHz) 0.601
pciBusID 0000:01:00.0
Total memory: 2.94GiB
Free memory: 2.62GiB
2017-06-23 11:14:29.077338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-06-23 11:14:29.077374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-06-23 11:14:29.084741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 670MX, pci bus id: 0000:01:00.0)
2017-06-23 11:14:38.368302: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 5370 get requests, put_count=3301 evicted_count=1000 eviction_rate=0.302939 and unsatisfied allocation rate=0.59013
2017-06-23 11:14:38.368375: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Saving checkpoints for 1 into /tmp/t2t_train/algorithmic_reverse_decimal40/baseline_lstm_seq2seq-basic_1/model.ckpt.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/local/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 240, in run
    run_locally(exp_fn(output_dir))
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 531, in run_locally
    exp.train()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1007, in _train_model
    _, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 842, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 960, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 477, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

Have I missed something during the configuration? I have also tried the same problem with the transformer model and the training seems fine, but during inference it doesn't reproduce the reversed input! (Later I'll post the output of this last command and my configuration.)

Edit: With transformer everything is ok.

Data download corrupted when running demo

When running the demo (also in the README: English-to-German translation model using the Transformer model from Attention Is All You Need on WMT data), downloading the data gives a corrupted version.
Eventually this causes the tokenizer to run into errors.

PROBLEM=wmt_ende_tokens_32k
MODEL=transformer
HPARAMS=transformer_base

DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --num_shards=100

The output of the previous data generation command:

INFO:tensorflow:Generating training data for wmt_ende_tokens_32k.
INFO:tensorflow:Downloading http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz to /tmp/t2t_datagen/training-parallel-nc-v11.tgz
INFO:tensorflow:Succesfully downloaded training-parallel-nc-v11.tgz, 75178032 bytes.
INFO:tensorflow:Reading file: training-parallel-nc-v11/news-commentary-v11.de-en.en
INFO:tensorflow:Reading file: training-parallel-nc-v11/news-commentary-v11.de-en.de
INFO:tensorflow:Downloading http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz to /tmp/t2t_datagen/training-parallel-commoncrawl.tgz

At this point, the download just keeps hanging even though the data has been downloaded successfully (checked in /tmp/t2t_datagen), and I abort with CTRL-C. When trying again, it gives the following error:

Traceback (most recent call last):
File "/usr/local/bin/t2t-datagen", line 361, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/usr/local/bin/t2t-datagen", line 344, in main
training_gen(), FLAGS.problem + UNSHUFFLED_SUFFIX + "-train",
File "/usr/local/bin/t2t-datagen", line 140, in
lambda: wmt.ende_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/wmt.py", line 224, in ende_wordpiece_token_generator
tmp_dir, "tokens.vocab.%d" % vocab_size, vocab_size)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/generator_utils.py", line 220, in get_or_generate_vocab
corpus_tar.extractall(tmp_dir)
File "/usr/lib/python2.7/tarfile.py", line 2079, in extractall
self.extract(tarinfo, path)
File "/usr/lib/python2.7/tarfile.py", line 2116, in extract
self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
File "/usr/lib/python2.7/tarfile.py", line 2192, in _extract_member
self.makefile(tarinfo, targetpath)
File "/usr/lib/python2.7/tarfile.py", line 2233, in makefile
copyfileobj(source, target)
File "/usr/lib/python2.7/tarfile.py", line 266, in copyfileobj
shutil.copyfileobj(src, dst)
File "/usr/lib/python2.7/shutil.py", line 49, in copyfileobj
buf = fsrc.read(length)
File "/usr/lib/python2.7/tarfile.py", line 831, in read
buf += self.fileobj.read(size - len(buf))
File "/usr/lib/python2.7/tarfile.py", line 743, in read
return self.readnormal(size)
File "/usr/lib/python2.7/tarfile.py", line 758, in readnormal
return self.__read(size)
File "/usr/lib/python2.7/tarfile.py", line 748, in __read
buf = self.fileobj.read(size)
File "/usr/lib/python2.7/gzip.py", line 268, in read
self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 315, in _read
self._read_eof()
File "/usr/lib/python2.7/gzip.py", line 354, in _read_eof
hex(self.crc)))
IOError: CRC check failed 0x75d9e49c != 0xd122220fL

One strategy might be to manually download the final tar.gz from http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz and unpack it in /tmp/t2t_data. When trying this, the download is extremely slow, approx. 2 hours for 876MB...

Results of manual download:
Working:

INFO:tensorflow:Not downloading, file already found: /tmp/t2t_datagen/training-parallel-commoncrawl.tgz
INFO:tensorflow:Reading file: commoncrawl.de-en.en
INFO:tensorflow:Reading file: commoncrawl.de-en.de
INFO:tensorflow:Reading file: commoncrawl.fr-en.en
INFO:tensorflow:Reading file: commoncrawl.fr-en.fr

Next in line for (hopefully not too slow) download:

INFO:tensorflow:Downloading
http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz to /tmp/t2t_datagen/training-parallel-europarl-v7.tgz

The above as an FYI or possible issue to be resolved.

Adding a new dataset + problem

I've tried to add a new text dataset for a basic text classification task: the data generator works flawlessly. I then tried to add my task in problem_hparams.py, but it crashes with:

Variable symbol_modality_13_512/shared/weights_0 does not exist

Why? I have the feeling that I added all the required hyperparameters. Should I add something else elsewhere?

Thanks !

Here are my problem hparams:

def txtclassif_tokens(model_hparams):
    p = default_problem_hparams()

    class DetTextEncoder():
        def __init__(self, vocfile):
            import codecs
            # just stupid code that builds a vocabulary dict
            self.v = {}
            self._reverse = False  # encode()/decode() below read this flag
            with codecs.open(vocfile,"r","utf-8") as f:
                for l in f:
                    s=l.strip()
                    if len(s)>0:
                        i=s.rfind(' ')
                        self.v[s[0:i]]=int(s[i+1:])

        def encode(self, sentence):
            """Converts a space-separated string of tokens to a list of ids."""
            ret = []
            for tok in sentence.strip().split():
                if tok in self.v: ret.append(self.v[tok])
                else: ret.append(self.v['UNK'])
            if self._reverse: ret = ret[::-1]
            return ret

        def decode(self, ids):
            if self._reverse: ids = ids[::-1]
            toks=[]
            for i in ids:
                for w in self.v.keys():
                    if self.v[w]==i:
                        toks.append(w)
                        break
            return ' '.join(toks)
        @property
        def vocab_size(self):
            return len(self.v)

    wvoc = DetTextEncoder(model_hparams.data_dir+"/voc.txt")
    lvoc = DetTextEncoder(model_hparams.data_dir+"/voclab.txt")
    p.input_modality = {
      "inputs": (registry.Modalities.SYMBOL, wvoc.vocab_size)
    }
    p.target_modality = (registry.Modalities.SYMBOL, lvoc.vocab_size)

    p.vocabulary = {
      "inputs": wvoc,
      "targets": lvoc,
    }
    p.input_space_id = 3
    p.target_space_id = 3
    return p

SYMBOL modality vocab size

I'm trying to train a bytes-to-subwords model:

def problem(model_hparams):
    # This vocab file must be present within the data directory.
    vocab_filename = os.path.join(model_hparams.data_dir, 'vocab')

    source_encoder = text_encoder.ByteTextEncoder()
    target_encoder = text_encoder.SubwordTextEncoder(vocab_filename)

    p = problem_hparams.default_problem_hparams()
    p.input_modality = {"inputs": (registry.Modalities.SYMBOL, source_encoder.vocab_size)}
    p.target_modality = (registry.Modalities.SYMBOL, target_encoder.vocab_size)
    p.vocabulary = {
        "inputs": source_encoder,
        "targets": target_encoder,
    }

    return p

This fails catastrophically during model construction. It appears to work if the input and target modalities have the same vocab size (e.g., switching both to share the same SubwordTextEncoder) but fails if they differ in size. This appears not to be the case for other modalities (e.g., changing both of the above to CLASS_LABEL appears to work).

Lack of LSTM(RNN)-CNN classification problem

Among the current problems, we have NMT problems and image classification problems,
but LSTM(RNN)-CNN classification problems are missing.
I've implemented a toy data set which generates random sequences representing people's names.
Some are boys' names while others are girls'.
If all names in the sequence are boys' names, it's classified as 1.
If all names in the sequence are girls' names, it's classified as 2.
Otherwise, it's classified as 3.
I'm now training my toy dataset using the default Transformer model.
Do you think it would be good to pull this problem into master?

WSJ parsing can parse one child only

It seems that words_and_tags_from_wsj_tree can parse wsj trees (lists) like
(A (B (C c)))
but not trees like
(A (B (C c d) e))

This is because it assumes either an opening or a closing parenthesis for each token.

Evaluation metrics in the WMT task

Hi,
I read the paper Attention is all you need. The results of wmt tasks are really exciting.

But I found that there's no detailed explanation in the paper of what exact metrics were used for the WMT translation task.

What I really mean by detailed explanation:

  1. What evaluation script was used? For example, mteval-v11b.pl, or multi-bleu.perl
  2. Is the evaluation case sensitive or insensitive?
  3. Do we need to de-tokenize the output before evaluating?

update

a tiny mis-spelling here
deocding -> decoding

Thank you so much

Vocabulary size in WMT translation task

Hi all,
I'm a little confused with the vocab_size defined in problem_hparams.py.

In wmt_ende_bpe32k, the vocab_size param is set to 40960, while in wmt_enfr_tokens there's a wrong_vocab_size param, which is set to 2**13 in wmt_enfr_tokens_8k. I'm guessing that this might not be the actual size of the vocabulary.

My question is:
When setting input_modality, how to set the vocab_size?

More specifically, if I have separate source and target vocabs, with sizes n_src and n_tgt respectively, should I set the vocab_size to n_src + n_tgt or something else?

Thank you

Proper size of wmt_ende_tokens_32k-{dev, train}* files?

I got quite low performance compared to the paper.

So, I did some research, and I found that the sizes of the wmt_ende_tokens_32k-{dev, train}* files are too small, as follows.
444K wmt_ende_tokens_32k-dev-00000-of-00001
730M wmt_ende_tokens_32k-train-00000-of-00001

I ran t2t-datagen again, and then I got the following sizes (with the 100-shard option).
820K wmt_ende_tokens_32k-dev-00000-of-00001
14M wmt_ende_tokens_32k-train-00000-of-00100
....
(total 1400M)

What is the proper size of the wmt_ende_tokens_32k-* files?

tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[27,4,0] = -1 is not in [0, 31488)

Hi all,

When I ran the walkthrough shell script for training a good English-to-German translation model using the Transformer model, I encountered the following problem.

My problem is:
INFO:tensorflow:Total trainable variables size: 60276736
INFO:tensorflow:Total embedding variables size: 16384
INFO:tensorflow:Total non-embedding variables size: 60260352
INFO:tensorflow:Computing gradients for global model_fn.
INFO:tensorflow:Global model_fn finished.
INFO:tensorflow:Create CheckpointSaverHook.
2017-06-30 15:05:58.562782: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 15:05:58.562814: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 15:05:58.562820: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 15:05:58.562824: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 15:05:58.562829: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Traceback (most recent call last):
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1139, in _do_call
return fn(*args)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
status, run_metadata)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/contextlib.py", line 66, in exit
next(self.gen)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[27,4,0] = -1 is not in [0, 31488)
[[Node: symbol_modality_31488_512/parallel_0/symbol_modality_31488_512/shared/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](symbol_modality_31488_512/parallel_0/symbol_modality_31488_512/shared/ConvertGradientToTensor_cc661786, symbol_modality_31488_512/parallel_0/symbol_modality_31488_512/shared/Squeeze)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/sycmss/tc/anaconda3/envs/tensorflow/bin/t2t-trainer", line 83, in
tf.app.run()
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/sycmss/tc/anaconda3/envs/tensorflow/bin/t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/trainer_utils.py", line 240, in run
run_locally(exp_fn(output_dir))
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/trainer_utils.py", line 532, in run_locally
exp.train()
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
hooks=self._train_monitors + extra_hooks)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
monitors=hooks)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1007, in _train_model
_, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 842, in run
run_metadata=run_metadata)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 952, in run
run_metadata=run_metadata)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[27,4,0] = -1 is not in [0, 31488)
[[Node: symbol_modality_31488_512/parallel_0/symbol_modality_31488_512/shared/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](symbol_modality_31488_512/parallel_0/symbol_modality_31488_512/shared/ConvertGradientToTensor_cc661786, symbol_modality_31488_512/parallel_0/symbol_modality_31488_512/shared/Squeeze)]]

Caused by op 'symbol_modality_31488_512/parallel_0/symbol_modality_31488_512/shared/Gather', defined at:
File "/home/sycmss/tc/anaconda3/envs/tensorflow/bin/t2t-trainer", line 83, in
tf.app.run()
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/sycmss/tc/anaconda3/envs/tensorflow/bin/t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/trainer_utils.py", line 240, in run
run_locally(exp_fn(output_dir))
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/trainer_utils.py", line 532, in run_locally
exp.train()
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
hooks=self._train_monitors + extra_hooks)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
monitors=hooks)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 955, in _train_model
model_fn_ops = self._get_train_ops(features, labels)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1162, in _get_train_ops
return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1133, in _call_model_fn
model_fn_results = self._model_fn(features, labels, **kwargs)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/trainer_utils.py", line 424, in model_fn
len(hparams.problems) - 1)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/trainer_utils.py", line 751, in _cond_on_index
return fn(cur_idx)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/trainer_utils.py", line 406, in nth_model
features, skip=(skipping_is_on and skip_this_one))
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/t2t_model.py", line 377, in model_fn
sharded_features[key], dp)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/modality.py", line 91, in bottom_sharded
return data_parallelism(self.bottom, xs)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/expert_utils.py", line 294, in call
outputs.append(fns[i](*my_args[i], **my_kwargs[i]))
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/models/modalities.py", line 88, in bottom
return self.bottom_simple(x, "shared", reuse=None)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/models/modalities.py", line 80, in bottom_simple
ret = tf.gather(var, x)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1179, in gather
validate_indices=validate_indices, name=name)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1269, in init
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): indices[27,4,0] = -1 is not in [0, 31488)
[[Node: symbol_modality_31488_512/parallel_0/symbol_modality_31488_512/shared/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](symbol_modality_31488_512/parallel_0/symbol_modality_31488_512/shared/ConvertGradientToTensor_cc661786, symbol_modality_31488_512/parallel_0/symbol_modality_31488_512/shared/Squeeze)]]

INFO:tensorflow:Creating experiment, storing model files in /root/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:Using config: {'_task_type': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f884bf21630>, '_model_dir': '/root/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base', '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_session_config': allow_soft_placement: true
graph_options {
optimizer_options {
}
}
, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
, '_task_id': 0, '_tf_random_seed': None, '_num_ps_replicas': 0, '_evaluation_master': '', '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_master': '', '_is_chief': True, '_num_worker_replicas': 0, '_save_checkpoints_steps': None, '_environment': 'local'}
INFO:tensorflow:Performing Decoding from a file.
INFO:tensorflow:Getting sorted inputs
Traceback (most recent call last):
File "/home/sycmss/tc/anaconda3/envs/tensorflow/bin/t2t-trainer", line 83, in
tf.app.run()
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/sycmss/tc/anaconda3/envs/tensorflow/bin/t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/trainer_utils.py", line 240, in run
run_locally(exp_fn(output_dir))
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/trainer_utils.py", line 544, in run_locally
decode_from_file(estimator, FLAGS.decode_from_file)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensor2tensor/utils/trainer_utils.py", line 648, in decode_from_file
as_iterable=True)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 590, in predict
as_iterable=as_iterable)
File "/home/sycmss/tc/anaconda3/envs/tensorflow/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 878, in _infer_model
% self._model_dir)
tensorflow.contrib.learn.python.learn.estimators._sklearn.NotFittedError: Couldn't find trained model at /root/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base.
cat: /root/t2t_data/decode_this.txt.transformer.transformer_base.beam4.alpha0.6.decodes: No such file or directory
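
For reference, the NotFittedError above means that no checkpoint was found in the output directory when decoding started. A minimal check, assuming TF 1.x's tf.train.latest_checkpoint (the path is the one from the log):

import tensorflow as tf

model_dir = "/root/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base"
# None means no model.ckpt-* has been written yet, which is exactly the
# condition that raises NotFittedError in decode-only runs.
print(tf.train.latest_checkpoint(model_dir))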

How should I solve this problem?

Thank you

Is a GPU a hard requirement?

Hi,

I'd very much like to try this, but I don't have an NVIDIA GPU... Is the dependency on tensorflow-gpu a hard requirement?

Thanks a lot
Sigrid

ImportError: No module named 'cPickle', although cPickle is obsolete in Python 3

$t2t-datagen --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR --num_shards=100 --problem=$PROBLEM
Traceback (most recent call last):
File "/tf1.2py3_venv/venv/bin/t2t-datagen", line 39, in
from tensor2tensor.data_generators import image
File "/tf1.2py3_venv/venv/lib/python3.6/site-packages/tensor2tensor/data_generators/image.py", line 21, in
import cPickle
ImportError: No module named 'cPickle'

Multi-GPU decoding support

Hi all,

I'm wondering whether tensor2tensor currently supports multi-GPU decoding (for the WMT translation task)?

I ask because when I tried to use multiple GPUs to decode data (a translation task), the following exception was raised, whereas no exception occurs in the single-GPU decoding scenario.

I'm putting the decoding script and full exception trace here. Thank you.

decoding script

t2t-trainer \
  --data_dir=/tensor2tensor/t2t_data \
  --problems=wmt_ende_tokens_32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --worker_gpu=3 \
  --output_dir=/tensor2tensor/exp/8cards/wmt_ende_tokens_32k/transformer-transformer_base \
  --train_steps=0 \
  --eval_steps=0 \
  --decode_beam_size=4 \
  --decode_alpha=0.6 \
  --decode_use_last_position_only \
  --decode_batch_size=128 \
  --decode_from_file=/tensor2tensor/t2t_data/validate.en

exception info

INFO:tensorflow:Restoring parameters from /search/odin/public/experiments/tensor2tensor/exp/8cards/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt-56426
2017-06-23 11:34:20.978020: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 128) and num_split 3
	 [[Node: while/split = Split[T=DT_INT32, num_split=3, _device="/job:localhost/replica:0/task:0/cpu:0"](while/split/split_dim, while/split/Enter)]]
2017-06-23 11:34:20.978210: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 128) and num_split 3
	 [[Node: while/split = Split[T=DT_INT32, num_split=3, _device="/job:localhost/replica:0/task:0/cpu:0"](while/split/split_dim, while/split/Enter)]]
Traceback (most recent call last):
  File "/search/odin/public/anaconda2/bin/t2t-trainer", line 4, in <module>
    __import__('pkg_resources').run_script('tensor2tensor==1.0.4', 't2t-trainer')
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1507, in run_script
    exec(script_code, namespace, namespace)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensor2tensor-1.0.4-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 55, in <module>
    
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensor2tensor-1.0.4-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 51, in main
    
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 240, in run
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 543, in run_locally
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 646, in decode_from_file
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 902, in _predict_generator
    preds = mon_sess.run(predictions, feed_fn() if feed_fn else None)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 505, in run
    run_metadata=run_metadata)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 842, in run
    run_metadata=run_metadata)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
    return self._sess.run(*args, **kwargs)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 952, in run
    run_metadata=run_metadata)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
    return self._sess.run(*args, **kwargs)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 128) and num_split 3
	 [[Node: while/split = Split[T=DT_INT32, num_split=3, _device="/job:localhost/replica:0/task:0/cpu:0"](while/split/split_dim, while/split/Enter)]]
	 [[Node: while/GatherNd/_1405 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_4080_while/GatherNd", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](^_cloopwhile/parallel_0/Identity/_1292)]]

Caused by op u'while/split', defined at:
  File "/search/odin/public/anaconda2/bin/t2t-trainer", line 4, in <module>
    __import__('pkg_resources').run_script('tensor2tensor==1.0.4', 't2t-trainer')
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1507, in run_script
    exec(script_code, namespace, namespace)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensor2tensor-1.0.4-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 55, in <module>
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensor2tensor-1.0.4-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 51, in main
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 240, in run
    run_locally(exp_fn(output_dir))
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 543, in run_locally
    decode_from_file(estimator, FLAGS.decode_from_file)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 645, in decode_from_file
    result_iter = estimator.predict(input_fn=input_fn.next, as_iterable=True)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 590, in predict
    as_iterable=as_iterable)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 884, in _infer_model
    infer_ops = self._get_predict_ops(features)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1218, in _get_predict_ops
    return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.INFER)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1133, in _call_model_fn
    model_fn_results = self._model_fn(features, labels, **kwargs)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 423, in model_fn
    len(hparams.problems) - 1)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 748, in _cond_on_index
    return fn(cur_idx)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 396, in nth_model
    decode_length=FLAGS.decode_extra_length)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/t2t_model.py", line 154, in infer
    last_position_only, alpha)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/t2t_model.py", line 211, in _beam_decode
    alpha)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/beam_search.py", line 405, in beam_search
    back_prop=False)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2766, in while_loop
    result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2595, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2545, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/beam_search.py", line 336, in inner_loop
    i, alive_seq, alive_log_probs)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/beam_search.py", line 240, in grow_topk
    flat_logits = symbols_to_logits_fn(flat_ids)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/t2t_model.py", line 181, in symbols_to_logits_fn
    features, False, last_position_only=last_position_only)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/t2t_model.py", line 352, in model_fn
    sharded_features = self._shard_features(features)
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/t2t_model.py", line 332, in _shard_features
    0))
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1214, in split
    split_dim=axis, num_split=num_or_size_splits, value=value, name=name)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3261, in _split
    num_split=num_split, name=name)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/search/odin/public/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 128) and num_split 3
	 [[Node: while/split = Split[T=DT_INT32, num_split=3, _device="/job:localhost/replica:0/task:0/cpu:0"](while/split/split_dim, while/split/Enter)]]
	 [[Node: while/GatherNd/_1405 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_4080_while/GatherNd", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](^_cloopwhile/parallel_0/Identity/_1292)]]
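
For what it's worth, the failing node is a plain tf.split of the decode batch across workers, which requires the batch size to divide evenly by the number of GPUs. A minimal sketch that reproduces this, assuming standard tf.split semantics (run eagerly):

import tensorflow as tf

batch = tf.zeros([128, 10])
try:
  # 128 is not divisible by 3, reproducing the InvalidArgumentError above.
  tf.split(batch, num_or_size_splits=3, axis=0)
except Exception as e:  # typically tf.errors.InvalidArgumentError
  print(e)

# A batch size that worker_gpu divides evenly splits cleanly, e.g. 126 on 3 GPUs:
parts = tf.split(tf.zeros([126, 10]), num_or_size_splits=3, axis=0)
print([p.shape for p in parts])  # three shards of shape (42, 10)

So a possible workaround is to pick a --decode_batch_size that is a multiple of --worker_gpu.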

Resume training

Is it possible to resume training for a certain number of additional steps?

Walkthrough training error

When running the walkthrough:

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 \
  --eval_steps=0 \
  --decode_beam_size=$BEAM_SIZE \
  --decode_alpha=$ALPHA \
  --decode_from_file=$DECODE_FILE

the error information is as follows:

INFO:tensorflow:Creating experiment, storing model files in /root/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
Traceback (most recent call last):
File "/usr/local/bin/t2t-trainer", line 83, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/usr/local/bin/t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 240, in run
run_locally(exp_fn(output_dir))
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 126, in experiment_fn
eval_steps=eval_steps)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 138, in create_experiment
model_name=model_name)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 174, in create_experiment_components
keep_checkpoint_max=FLAGS.keep_checkpoint_max))
TypeError: __init__() got an unexpected keyword argument 'session_config'

Python3 compatibility

There are some holes in the Python 3 compatibility of the Tensor2Tensor code. For instance:

In data_generators/generator_utils.py, import urllib needs to be:

import sys
if sys.version_info[0] >= 3:
  import urllib.request as urllib
else:
  import urllib

In data_generators/image.py, import cPickle needs to be:

try:
  import cPickle
except ImportError:
  import pickle as cPickle

Finally, data_generators/tokenizer.py needs to be revised, as it assumes that a char ordinal is always in the range [0, 256), which is not a safe assumption in Python 3. A better solution uses a set instead of array subscripts based on char ordinals, as sketched below. Would you like me to submit a revised version in a pull request?
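
To make the suggestion concrete, here is a minimal sketch of the set-based check; the table-based code it replaces and the name _ALPHANUMERIC_CHAR_SET are assumptions for illustration:

import string

_ALPHANUMERIC_CHAR_SET = set(string.ascii_letters + string.digits)

def is_alphanumeric(char):
  # Set membership works for any Unicode character, whereas
  # table[ord(char)] breaks once ord(char) >= 256 in Python 3.
  return char in _ALPHANUMERIC_CHAR_SET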

Training step in Walkthrough fails

I downloaded and ran the TensorFlow Docker image, then started following the walkthrough by installing tensor2tensor with pip install, setting the environment variables, and running t2t-datagen.

Next, I ran t2t-trainer:
t2t-trainer --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR

It looked like it was training for a minute, until it failed with:
t2t-trainer --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
INFO:tensorflow:Creating experiment, storing model files in /root/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:Using config: {'_model_dir': '/root/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base', '_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 20, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f054c81bb50>, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
, '_num_worker_replicas': 0, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': '', '_session_config': allow_soft_placement: true
graph_options {
optimizer_options {
}
}
}
INFO:tensorflow:Performing local training.
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:Doing model_fn_body took 2.320 sec.
INFO:tensorflow:This model_fn took 2.521 sec.
INFO:tensorflow:Weight body/decoder/layer_0/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/decoder/layer_0/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/decoder/layer_0/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_0/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/decoder/layer_0/decoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_0/decoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_0/decoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/decoder/layer_0/decoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/decoder/layer_0/encdec_attention/kv_transform_single/bias shape (1024,) size 1024
INFO:tensorflow:Weight body/decoder/layer_0/encdec_attention/kv_transform_single/kernel shape (1, 1, 512, 1024) size 524288
INFO:tensorflow:Weight body/decoder/layer_0/encdec_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_0/encdec_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_0/encdec_attention/q_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_0/encdec_attention/q_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_0/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_0/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_0/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_0/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_0/layer_norm_2/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_0/layer_norm_2/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_1/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/decoder/layer_1/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/decoder/layer_1/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_1/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/decoder/layer_1/decoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_1/decoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_1/decoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/decoder/layer_1/decoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/decoder/layer_1/encdec_attention/kv_transform_single/bias shape (1024,) size 1024
INFO:tensorflow:Weight body/decoder/layer_1/encdec_attention/kv_transform_single/kernel shape (1, 1, 512, 1024) size 524288
INFO:tensorflow:Weight body/decoder/layer_1/encdec_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_1/encdec_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_1/encdec_attention/q_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_1/encdec_attention/q_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_1/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_1/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_1/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_1/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_1/layer_norm_2/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_1/layer_norm_2/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_2/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/decoder/layer_2/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/decoder/layer_2/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_2/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/decoder/layer_2/decoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_2/decoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_2/decoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/decoder/layer_2/decoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/decoder/layer_2/encdec_attention/kv_transform_single/bias shape (1024,) size 1024
INFO:tensorflow:Weight body/decoder/layer_2/encdec_attention/kv_transform_single/kernel shape (1, 1, 512, 1024) size 524288
INFO:tensorflow:Weight body/decoder/layer_2/encdec_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_2/encdec_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_2/encdec_attention/q_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_2/encdec_attention/q_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_2/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_2/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_2/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_2/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_2/layer_norm_2/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_2/layer_norm_2/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_3/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/decoder/layer_3/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/decoder/layer_3/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_3/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/decoder/layer_3/decoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_3/decoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_3/decoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/decoder/layer_3/decoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/decoder/layer_3/encdec_attention/kv_transform_single/bias shape (1024,) size 1024
INFO:tensorflow:Weight body/decoder/layer_3/encdec_attention/kv_transform_single/kernel shape (1, 1, 512, 1024) size 524288
INFO:tensorflow:Weight body/decoder/layer_3/encdec_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_3/encdec_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_3/encdec_attention/q_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_3/encdec_attention/q_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_3/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_3/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_3/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_3/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_3/layer_norm_2/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_3/layer_norm_2/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_4/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/decoder/layer_4/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/decoder/layer_4/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_4/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/decoder/layer_4/decoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_4/decoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_4/decoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/decoder/layer_4/decoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/decoder/layer_4/encdec_attention/kv_transform_single/bias shape (1024,) size 1024
INFO:tensorflow:Weight body/decoder/layer_4/encdec_attention/kv_transform_single/kernel shape (1, 1, 512, 1024) size 524288
INFO:tensorflow:Weight body/decoder/layer_4/encdec_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_4/encdec_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_4/encdec_attention/q_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_4/encdec_attention/q_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_4/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_4/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_4/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_4/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_4/layer_norm_2/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_4/layer_norm_2/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_5/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/decoder/layer_5/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/decoder/layer_5/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_5/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/decoder/layer_5/decoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_5/decoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_5/decoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/decoder/layer_5/decoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/decoder/layer_5/encdec_attention/kv_transform_single/bias shape (1024,) size 1024
INFO:tensorflow:Weight body/decoder/layer_5/encdec_attention/kv_transform_single/kernel shape (1, 1, 512, 1024) size 524288
INFO:tensorflow:Weight body/decoder/layer_5/encdec_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_5/encdec_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_5/encdec_attention/q_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_5/encdec_attention/q_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/decoder/layer_5/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_5/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_5/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_5/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_5/layer_norm_2/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/decoder/layer_5/layer_norm_2/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_0/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/encoder/layer_0/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/encoder/layer_0/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_0/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/encoder/layer_0/encoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_0/encoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/encoder/layer_0/encoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/encoder/layer_0/encoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/encoder/layer_0/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_0/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_0/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_0/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_1/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/encoder/layer_1/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/encoder/layer_1/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_1/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/encoder/layer_1/encoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_1/encoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/encoder/layer_1/encoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/encoder/layer_1/encoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/encoder/layer_1/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_1/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_1/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_1/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_2/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/encoder/layer_2/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/encoder/layer_2/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_2/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/encoder/layer_2/encoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_2/encoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/encoder/layer_2/encoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/encoder/layer_2/encoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/encoder/layer_2/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_2/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_2/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_2/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_3/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/encoder/layer_3/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/encoder/layer_3/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_3/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/encoder/layer_3/encoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_3/encoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/encoder/layer_3/encoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/encoder/layer_3/encoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/encoder/layer_3/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_3/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_3/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_3/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_4/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/encoder/layer_4/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/encoder/layer_4/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_4/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/encoder/layer_4/encoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_4/encoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/encoder/layer_4/encoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/encoder/layer_4/encoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/encoder/layer_4/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_4/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_4/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_4/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_5/conv_hidden_relu/conv1_single/bias shape (2048,) size 2048
INFO:tensorflow:Weight body/encoder/layer_5/conv_hidden_relu/conv1_single/kernel shape (1, 1, 512, 2048) size 1048576
INFO:tensorflow:Weight body/encoder/layer_5/conv_hidden_relu/conv2_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_5/conv_hidden_relu/conv2_single/kernel shape (1, 1, 2048, 512) size 1048576
INFO:tensorflow:Weight body/encoder/layer_5/encoder_self_attention/output_transform_single/bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_5/encoder_self_attention/output_transform_single/kernel shape (1, 1, 512, 512) size 262144
INFO:tensorflow:Weight body/encoder/layer_5/encoder_self_attention/qkv_transform_single/bias shape (1536,) size 1536
INFO:tensorflow:Weight body/encoder/layer_5/encoder_self_attention/qkv_transform_single/kernel shape (1, 1, 512, 1536) size 786432
INFO:tensorflow:Weight body/encoder/layer_5/layer_norm/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_5/layer_norm/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_5/layer_norm_1/layer_norm_bias shape (512,) size 512
INFO:tensorflow:Weight body/encoder/layer_5/layer_norm_1/layer_norm_scale shape (512,) size 512
INFO:tensorflow:Weight body/target_space_embedding/kernel shape (32, 512) size 16384
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_0 shape (1953, 512) size 999936
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_10 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_11 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_12 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_13 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_14 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_15 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_1 shape (1953, 512) size 999936
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_2 shape (1953, 512) size 999936
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_3 shape (1953, 512) size 999936
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_4 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_5 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_6 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_7 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_8 shape (1952, 512) size 999424
INFO:tensorflow:Weight symbol_modality_31236_512/shared/weights_9 shape (1952, 512) size 999424
INFO:tensorflow:Total trainable variables size: 60147712
INFO:tensorflow:Total embedding variables size: 16384
INFO:tensorflow:Total non-embedding variables size: 60131328
INFO:tensorflow:Computing gradients for global model_fn.
INFO:tensorflow:Global model_fn finished.
INFO:tensorflow:Create CheckpointSaverHook.
2017-06-27 04:34:58.910748: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-27 04:34:58.910798: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-27 04:34:58.910821: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
INFO:tensorflow:Saving checkpoints for 1 into /root/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt.
INFO:tensorflow:loss = 8.79561, step = 1
Traceback (most recent call last):
File "/usr/local/bin/t2t-trainer", line 83, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/usr/local/bin/t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 240, in run
run_locally(exp_fn(output_dir))
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 531, in run_locally
exp.train()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
hooks=self._train_monitors + extra_hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
monitors=hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1007, in _train_model
_, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 842, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[16,1,0] = -1 is not in [0, 31236)
[[Node: symbol_modality_31236_512_1/parallel_0/symbol_modality_31236_512/shared/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](symbol_modality_31236_512_1/parallel_0/symbol_modality_31236_512/shared/ConvertGradientToTensor_cc661786, symbol_modality_31236_512_1/parallel_0/symbol_modality_31236_512/shared/Squeeze)]]

Caused by op u'symbol_modality_31236_512_1/parallel_0/symbol_modality_31236_512/shared/Gather', defined at:
File "/usr/local/bin/t2t-trainer", line 83, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/usr/local/bin/t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 240, in run
run_locally(exp_fn(output_dir))
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 531, in run_locally
exp.train()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
hooks=self._train_monitors + extra_hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
monitors=hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 955, in _train_model
model_fn_ops = self._get_train_ops(features, labels)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1162, in _get_train_ops
return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1133, in _call_model_fn
model_fn_results = self._model_fn(features, labels, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 423, in model_fn
len(hparams.problems) - 1)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 748, in _cond_on_index
return fn(cur_idx)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 405, in nth_model
features, train, skip=(skipping_is_on and skip_this_one))
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/t2t_model.py", line 387, in model_fn
sharded_features["targets"], dp)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/modality.py", line 115, in targets_bottom_sharded
return data_parallelism(self.targets_bottom, xs)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/expert_utils.py", line 294, in call
outputs.append(fns[i](*my_args[i], **my_kwargs[i]))
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/models/modalities.py", line 94, in targets_bottom
return self.bottom_simple(x, "shared", reuse=True)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/models/modalities.py", line 80, in bottom_simple
ret = tf.gather(var, x)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1179, in gather
validate_indices=validate_indices, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in init
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): indices[16,1,0] = -1 is not in [0, 31236)
[[Node: symbol_modality_31236_512_1/parallel_0/symbol_modality_31236_512/shared/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](symbol_modality_31236_512_1/parallel_0/symbol_modality_31236_512/shared/ConvertGradientToTensor_cc661786, symbol_modality_31236_512_1/parallel_0/symbol_modality_31236_512/shared/Squeeze)]]
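
For reference, the failing op can be reproduced in isolation: on CPU, tf.gather validates its indices, so a -1 (often a padding or out-of-vocabulary id leaking into the inputs) raises exactly this error. A minimal sketch, assuming standard tf.gather semantics (run eagerly on CPU):

import tensorflow as tf

params = tf.zeros([31236, 512])  # stands in for the shared embedding table
ids = tf.constant([[0], [-1]])   # contains a -1 index, as reported above
try:
  tf.gather(params, ids)
except Exception as e:  # typically tf.errors.InvalidArgumentError on CPU
  print(e)  # indices[1,0] = -1 is not in [0, 31236)

This points at the generated data or vocabulary containing an id of -1 rather than an id in [0, vocab_size).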

Trying to reproduce wmt_ende_bpe32k

I'm trying to reproduce wmt_ende_bpe32k. However, data generation fails with the following error:

INFO:tensorflow:Generating training data for wmt_ende_bpe32k.
Traceback (most recent call last):
  File "/usr/local/bin/t2t-datagen", line 361, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/local/bin/t2t-datagen", line 345, in main
    FLAGS.data_dir, FLAGS.num_shards, FLAGS.max_cases)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/generator_utils.py", line 113, in generate_files
    for case in generator:
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/wmt.py", line 83, in token_generator
    source_ints = token_vocab.encode(source.strip()) + eos_list
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/text_encoder.py", line 120, in encode
    ret = [self._token_to_id[tok] for tok in sentence.strip().split()]
AttributeError: 'TokenTextEncoder' object has no attribute '_token_to_id'

I'm using the following command:

PROBLEM=wmt_ende_bpe32k
MODEL=transformer
HPARAMS=transformer_base

DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --num_shards=100 \
  --problem=$PROBLEM

mv $TMP_DIR/vocab.bpe.32000 $DATA_DIR

# Train
# *  If you run out of memory, add --hparams='batch_size=2048' or even 1024.
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --worker_gpu=2 \
  --log_dir logs

# Decode

DECODE_FILE=$DATA_DIR/decode_this.txt
echo "Hello world" >> $DECODE_FILE
echo "Goodbye world" >> $DECODE_FILE

BEAM_SIZE=4
ALPHA=0.6

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 \
  --eval_steps=0 \
  --decode_beam_size=$BEAM_SIZE \
  --decode_alpha=$ALPHA \
  --decode_from_file=$DECODE_FILE

cat $DECODE_FILE.$MODEL.$HPARAMS.beam$BEAM_SIZE.alpha$ALPHA.decodes

I've downloaded wmt16_en_de.tar.gz and placed it in /tmp/t2t_datagen as specified in wmt.py:

def _get_wmt_ende_dataset(directory, filename):
  """Extract the WMT en-de corpus `filename` to directory unless it's there."""
  train_path = os.path.join(directory, filename)
  if not (tf.gfile.Exists(train_path + ".de") and
          tf.gfile.Exists(train_path + ".en")):
    # We expect that this file has been downloaded from:
    # https://drive.google.com/open?id=0B_bZck-ksdkpM25jRUN2X2UxMm8 and placed
    # in `directory`.
    corpus_file = os.path.join(directory, "wmt16_en_de.tar.gz")
    with tarfile.open(corpus_file, "r:gz") as corpus_tar:
      corpus_tar.extractall(directory)
  return train_path
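
As a hedged pointer: TokenTextEncoder builds its _token_to_id table from a vocab file at construction time, so the AttributeError above is consistent with the vocab file not being found where the generator expects it. Note that the script above moves vocab.bpe.32000 into $DATA_DIR only after t2t-datagen has run, which may be the culprit. A minimal check (the constructor signature is an assumption based on this era's text_encoder.py, and the path is illustrative):

from tensor2tensor.data_generators.text_encoder import TokenTextEncoder

# If the vocab file is missing or empty, _token_to_id is never populated
# and encode() fails as in the traceback above.
enc = TokenTextEncoder("/tmp/t2t_datagen/vocab.bpe.32000")
print(enc.encode("Hello world"))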

SubwordTextEncoder should be bytes-based

@lukaszkaiser @vthorsteinsson @nshazeer

@vthorsteinsson's recent PR improved the compatibility between Python 2 and 3 but we seem to have lost some valuable functionality.

We want to have a SubwordTextEncoder that is fully invertible with a limited vocabulary and able to encode anything, i.e., it should operate exclusively on bytes, so that the vocabulary only needs to grow by at most 256 entries. So if the input is Unicode (UTF-8 encoded, or otherwise), it will be read in as individual bytes (not Unicode characters), which means that decoding might break (i.e., the decoder might produce a sequence of bytes that is not valid UTF-8).

For datasets or tasks that wish to handle Unicode characters directly as part of the vocabulary, there can be a different version of the SubwordTextEncoder that does that (e.g., the one that is currently checked in).

So the suggestion is to have two SubwordTextEncoders: one that operates on bytes only, and another that deals with Unicode (pretty much the one that is currently checked in).

I may be misunderstanding the current functionality so please correct my mental model where it's wrong.
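
To make the proposal concrete, a minimal sketch of the byte-level behavior described above (an illustration, not the actual SubwordTextEncoder):

def encode_bytes(text):
  # Any string maps to ids via its UTF-8 bytes, so the base vocabulary
  # needs at most 256 entries and encoding can never fail.
  return list(bytearray(text.encode("utf-8")))

def decode_bytes(ids):
  # Decoding may yield byte sequences that are not valid UTF-8,
  # which is the trade-off mentioned above.
  return bytearray(ids).decode("utf-8", errors="replace")

print(encode_bytes(u"héllo"))                # [104, 195, 169, 108, 108, 111]
print(decode_bytes(encode_bytes(u"héllo")))  # héllo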

Out of memory on GPU in wmt_ende_tokens_32k

t2t-trainer runs out of GPU memory when training on a single NVIDIA GTX 1080 (8 GB) with the following parameters:

PROBLEM=wmt_ende_tokens_32k
MODEL=transformer
HPARAMS=transformer_base_single_gpu

Any hints on this?

More specifically, the exception ResourceExhaustedError is raised, cf. the following dump:

2017-06-22 11:55:49.859461: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size:
2017-06-22 11:55:49.859468: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 142 Chunks of size 256 totalling 35.5KiB
2017-06-22 11:55:49.859473: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
2017-06-22 11:55:49.859477: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 414 Chunks of size 2048 totalling 828.0KiB
2017-06-22 11:55:49.859482: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 24 Chunks of size 4096 totalling 96.0KiB
2017-06-22 11:55:49.859487: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 48 Chunks of size 6144 totalling 288.0KiB
2017-06-22 11:55:49.859491: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 48 Chunks of size 8192 totalling 384.0KiB
2017-06-22 11:55:49.859495: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 32768 totalling 160.0KiB
2017-06-22 11:55:49.859500: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 45824 totalling 44.8KiB
2017-06-22 11:55:49.859505: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 65536 totalling 256.0KiB
2017-06-22 11:55:49.859509: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 96 Chunks of size 1048576 totalling 96.00MiB
2017-06-22 11:55:49.859514: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 24 Chunks of size 2097152 totalling 48.00MiB
2017-06-22 11:55:49.859518: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 46 Chunks of size 3145728 totalling 138.00MiB
2017-06-22 11:55:49.859523: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 65 Chunks of size 4030464 totalling 249.84MiB
2017-06-22 11:55:49.859527: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 97 Chunks of size 4194304 totalling 388.00MiB
2017-06-22 11:55:49.859532: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 227 Chunks of size 16711680 totalling 3.53GiB
2017-06-22 11:55:49.859536: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 43 Chunks of size 20889600 totalling 856.64MiB
2017-06-22 11:55:49.859541: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 64487424 totalling 61.50MiB
2017-06-22 11:55:49.859546: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 12 Chunks of size 66846720 totalling 765.00MiB
2017-06-22 11:55:49.859550: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1288182528 totalling 1.20GiB
2017-06-22 11:55:49.859554: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 7.28GiB
2017-06-22 11:55:49.859561: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                  7969800192
InUse:                  7813304576
MaxInUse:               7966874112
NumAllocs:                    1912
MaxAllocSize:           1288182528

2017-06-22 11:55:49.859613: W tensorflow/core/common_runtime/bfc_allocator.cc:277] *************************************************************************************************xxx
2017-06-22 11:55:49.859627: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[102,80,1,1,31488]
Traceback (most recent call last):
  File "/home/villi/tf/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "/home/villi/tf/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.6/contextlib.py", line 89, in __exit__
    next(self.gen)
  File "/home/villi/tf/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[102,80,1,1,31488]
         [[Node: symbol_modality_31488_512_2/parallel_0_1/symbol_modality_31488_512/padded_cross_entropy/smoothing_cross_entropy/one_hot = OneHot[T=DT_FLOAT, TI=DT_INT32, axis=-1, _device="/job:localhost/replica:0/task:0/gpu:0"](symbol_modality_31488_512_2/parallel_0_1/symbol_modality_31488_512/padded_cross_entropy/pad_with_zeros/pad_to_same_length/Pad_1/_2643, symbol_modality_31488_512_2/parallel_0_1/symbol_modality_31488_512/strided_slice, symbol_modality_31488_512_2/parallel_0_1/symbol_modality_31488_512/padded_cross_entropy/smoothing_cross_entropy/one_hot/on_value, symbol_modality_31488_512_2/parallel_0_1/symbol_modality_31488_512/padded_cross_entropy/smoothing_cross_entropy/truediv)]]
         [[Node: training/train/update/_2730 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_13171_training/train/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
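A common workaround, offered as an assumption rather than a confirmed fix: shrink the per-step memory footprint, e.g. with --hparams='batch_size=1024' on the command line. The equivalent override in code (function name taken from the hparams_set above):

from tensor2tensor.models import transformer

hparams = transformer.transformer_base_single_gpu()
hparams.batch_size = 1024  # fewer tokens per batch means less activation memory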

Issue when trying to decode a file that was not part of the training

When I try to decode a file that was not part of the training/testing set, the following error occurs:

INFO:tensorflow:Performing Decoding from a file.
INFO:tensorflow:Getting sorted inputs
INFO:tensorflow: batch 94
INFO:tensorflow:Deocding batch 0
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/local/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 240, in run
    run_locally(exp_fn(output_dir))
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 543, in run_locally
    decode_from_file(estimator, FLAGS.decode_from_file)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 645, in decode_from_file
    result_iter = estimator.predict(input_fn=input_fn.next, as_iterable=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 590, in predict
    as_iterable=as_iterable)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 883, in _infer_model
    features = self._get_features_from_input_fn(input_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 863, in _get_features_from_input_fn
    result = input_fn()
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 725, in _decode_batch_input_fn
    input_ids = vocabulary.encode(inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/text_encoder.py", line 132, in encode
    ret = [self._token_to_id[tok] for tok in sentence.strip().split()]
KeyError: '@-@'
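For context, the failing line (text_encoder.py:132 in the traceback) is a plain dict lookup, so any token absent from the vocabulary, here the BPE joiner '@-@', raises KeyError. A tolerant variant might map out-of-vocabulary tokens to an UNK id (a sketch with an illustrative id, not the actual T2T reserved ids):

UNK_ID = 3  # illustrative only

def encode_with_unk(token_to_id, sentence):
  # Same whitespace tokenization as the failing line, but OOV-safe.
  return [token_to_id.get(tok, UNK_ID) for tok in sentence.strip().split()]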

The decoding command I use is as follows:

PROBLEM=wmt_ende_bpe32k
MODEL=transformer
HPARAMS=transformer_base

DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS


BEAM_SIZE=4
ALPHA=0.6

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 \
  --eval_steps=0 \
  --decode_beam_size=$BEAM_SIZE \
  --decode_alpha=$ALPHA \
  --decode_from_file=/tmp/t2t_datagen/newsdev2016.bpe.en

How to run the Walkthrough example with data other than WMT?

Is it possible to run the Walkthrough example from the website with data other than WMT?

I've tried changing the data paths in wmt.py:

_ENDE_TRAIN_DATASETS = [
    [
        "http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz",  # pylint: disable=line-too-long
        ("training-parallel-nc-v11/news-commentary-v11.de-en.en",
         "training-parallel-nc-v11/news-commentary-v11.de-en.de")
    ],
    [
        "http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz",
        ("commoncrawl.de-en.en", "commoncrawl.de-en.de")
    ],
    [
        "http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz",
        ("training/europarl-v7.de-en.en", "training/europarl-v7.de-en.de")
    ],
]
_ENDE_TEST_DATASETS = [
    [
        "http://data.statmt.org/wmt16/translation-task/dev.tgz",
        ("dev/newstest2013.en", "dev/newstest2013.de")

But when I run the example with new paths, it still downloads the WMT data...
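One hedged explanation: the data generators cache downloads in tmp_dir and skip fetching when a file of the expected name already exists, roughly like this sketch of generator_utils.maybe_download, so previously downloaded WMT tarballs keep being reused until you clear tmp_dir:

import os
from six.moves import urllib
import tensorflow as tf

def maybe_download(directory, filename, url):
  filepath = os.path.join(directory, filename)
  if not tf.gfile.Exists(filepath):          # a cached copy wins over new URLs
    urllib.request.urlretrieve(url, filepath)
  return filepath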

Session error when running distributed training

Hi

When I run distributed training following the guide at https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/docs/distributed_training.md,
I configure 1 ps and 2 workers. The ps works OK, but all the workers show errors:

tensorflow.python.framework.errors_impl.NotFoundError: No session factory registered for the given session options: {target: "10.150.144.48:1111" config: allow_soft_placement: true graph_options { optimizer_options { } }} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.

The details of this error are as follows:

2017-06-25 06:41:26.914625: E tensorflow/core/common_runtime/session.cc:69] Not found: No session factory registered for the given session options: {target: "10.150.144.48:1111" config: allow_soft_placement: true graph_options { optimizer_options { } }} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
{u'cluster': {u'ps': [u'10.150.144.48:3333'], u'worker': [u'10.150.144.48:1111', u'10.150.144.48:2222']}, u'task': {u'index': 0, u'type': u'worker'}}
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 62, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/local/bin/t2t-trainer", line 58, in main
    schedule=FLAGS.schedule)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 247, in run
    output_dir=FLAGS.output_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
    return _execute_schedule(experiment, schedule)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
    return task()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 669, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model
    config=self._session_config
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 412, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 273, in prepare_session
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 178, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1292, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 562, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: No session factory registered for the given session options: {target: "10.150.144.48:1111" config: allow_soft_placement: true graph_options { optimizer_options { } }} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables_1/boolean_mask/Gather:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method. It was originally created here:
['File "/usr/local/bin/t2t-trainer", line 62, in <module>\n tf.app.run()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run\n _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "/usr/local/bin/t2t-trainer", line 58, in main\n schedule=FLAGS.schedule)', 'File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 247, in run\n output_dir=FLAGS.output_dir)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run\n return _execute_schedule(experiment, schedule)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule\n return task()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train\n hooks=self._train_monitors + extra_hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 669, in _call_train\n monitors=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit\n loss = self._train_model(input_fn=input_fn, hooks=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model\n config=self._session_config', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession\n stop_grace_period_secs=stop_grace_period_secs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__\n stop_grace_period_secs=stop_grace_period_secs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__\n self._sess = _RecoverableSession(self._coordinated_creator)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__\n _WrappedSession.__init__(self, self._create_session())', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session\n return self._sess_creator.create_session()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session\n self.tf_sess = self._session_creator.create_session()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session\n self._scaffold.finalize()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 192, in finalize\n default_ready_for_local_init_op)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 254, in get_or_default\n op = default_constructor()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 189, in default_ready_for_local_init_op\n variables.global_variables())', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped\n return _add_should_use_warning(fn(*args, **kwargs))', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning\n wrapped = TFShouldUseWarningWrapper(x)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__\n stack = [s.strip() for s in traceback.format_stack()]']
==================================

It seems no registered session factory ({DIRECT_SESSION, GRPC_SESSION}) accepts this target. Can you help me look into this problem?
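A hedged reading of the error: the session target "10.150.144.48:1111" has no URI scheme, and the registered factories only accept an empty target (DIRECT_SESSION) or one starting with grpc:// (GRPC_SESSION). If the master address is being assembled by hand, the fix would look like:

import tensorflow as tf

# Note the scheme; a bare host:port matches no registered session factory.
sess = tf.Session("grpc://10.150.144.48:1111")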

"Invalid argument: slice index 1 of dimension 1 out of bounds." Error when decoding the text to Class_Label using Transformer model

Hi,
I added a dataset to do classification using the Transformer model. I generated the dataset successfully and trained the model successfully, but when I decode, an invalid-argument error is thrown.
The following code is used to generate the dataset; it simply picks data from several datasets and makes combinations. Each combination is a category, and the task is to find out which category a new set of data belongs to.
For example, there are four datasets A, B, C, D with several members each:
A: a0,a1,a2
B: b0,b1,b2,b3
C: c0,c1
D: d0,d1,d2
So there will be 15 combinations (i.e., categories): A,B,C,D,AB,AC,AD,BC,BD,CD,ABC,ABD,ACD,BCD,ABCD. My generator generates a random number of sequences that pick members from sets A, B, C, D, with a target label between 0 and 15.
All data generation and training worked well, but when I tried to decode, the invalid-argument error was thrown. The error is too unspecific, and I have no idea how to fix it.
Please kindly review my code and point out what is wrong. Thanks a lot!
Generator:
This generator picks members from several files (each file is named after a single set) and assigns a target classification label.

import numpy as np
import os, sys

from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import text_encoder
from six.moves import xrange 
import ssl
import itertools	

def generateDicts(dataDir,nameFiles,idFrom=100):
	dictFileName=dataDir+"/name.dict"
	categories=len(nameFiles)
	nameset=dict()
	for name in nameFiles:
		nameFile=open(dataDir+"/"+name)
		names=list()
		for individual in nameFile:
			names.append(individual.strip())
		nameset[name]=names
		nameFile.close()
	allNames=set()
	for (k,v) in zip(nameset.keys(),nameset.values()) :
		allNames=allNames.union(set(v))
		print("lenOfAllNames after add "+k+" is "+str(len(allNames)))
	allNamesList=sorted(allNames)
	names=dict()
	idx=idFrom+1
	for name in allNamesList:
		names[idx]=name
		idx+=1
	dictFile=open(dictFileName,"w")
	for (k,v) in zip(names.keys(),names.values()):
		dictFile.write(v+"\n")
	dictFile.close()

	combos=dict()

	for i in xrange(len(nameFiles)):
		combination=itertools.combinations(nameFiles,i+1)
		for signleComb in combination:
			combos["_".join(signleComb)]=i
	return nameset,names,combos

def generateCase(nameset,names,nameFiles,combos,maxMembers):
	categories=len(nameFiles)
	categorySize=list()
	members = np.random.randint(maxMembers)+1
	leftMembers = members
	categoryList=list()
	inputs=list()

	for i in xrange(categories-1):
		categorySize.append(np.random.randint(leftMembers))
		leftMembers-=categorySize[i]
		if categorySize[i] != 0:
			categoryList.append(nameFiles[i])
			names=nameset[nameFiles[i]]
			nameLen=len(names)
			for j in xrange(categorySize[i]):
				nameIndex=np.random.randint(nameLen)
				inputs.append(names[nameIndex])

	i+=1
	if leftMembers != 0:
		categoryList.append(nameFiles[i])
		names=nameset[nameFiles[i]]
		nameLen=len(names)
		for j in xrange(leftMembers):
			nameIndex=np.random.randint(nameLen)
			inputs.append(names[nameIndex])

	cateStr="_".join(categoryList)
	outputs=[cateStr]
	return inputs,outputs

def party_party_generator(dataDir,nameFiles,maxMembers,numOfCases):
	nameset,names,combos=generateDicts(dataDir,nameFiles)
	#targetDict=dataDir+"/targets.dict"
	#targetDictFile=open(targetDict,"w")
	#for combo in combos:
	#	targetDictFile.write(combo+"\n")
	#targetDictFile.close()
	dictFileName=dataDir+"/name.dict"
	inputsTextToken=text_encoder.TokenTextEncoder(dictFileName)
	#targetsTextToken=text_encoder.TokenTextEncoder(targetDict)

	for i in xrange(numOfCases):
		inputs,outputs=generateCase(nameset,names,nameFiles,combos,maxMembers)
		strInput=" ".join(inputs)
		encodedInputs=inputsTextToken.encode(strInput)
		np.random.shuffle(encodedInputs)
		encodedOutputs=[combos[outputs[0]]]
		yield {"inputs":encodedInputs,"targets":encodedOutputs}

HParams I added
I am a little confused about input_space_id and target_space_id; it seems I can set them to anything without any problems.

def party_party(model_hparams):
  """Party Party."""
  p = default_problem_hparams()
  nameDict=model_hparams.data_dir+"/name.dict"
  targetDict=model_hparams.data_dir+"/targets.dict"
  num_lines = sum(1 for line in open(nameDict))
  target_lines= sum(1 for line in open(targetDict))
  p.input_modality = {"inputs": (registry.Modalities.SYMBOL, num_lines)}
  p.vocabulary = {
      "inputs": text_encoder.TokenTextEncoder(vocab_filename=nameDict),
      "targets": text_encoder.TextEncoder(),
  }
  p.target_modality = (registry.Modalities.CLASS_LABEL, target_lines)
  p.batch_size_multiplier = 4
  p.max_expected_batch_size_per_shard = 8
  p.loss_multiplier = 3.0
  p.input_space_id = 1
  p.target_space_id = 1
  return p

"party_party": lambda p:party_party(p),

The training script I am using:
Since I need to change the T2T code frequently, I don't install it into my site-packages directory.

tensor2tensor/bin/t2t-trainer --data_dir /mnt/5efa3937-4221-48b5-9660-85a4a7eb0cfd/data/ --problems=party_party --model=transformer --hparams_set=transformer_base_single_gpu --keep_checkpoint_max=5 --save_checkpoints_secs=3600 --hparams='batch_size=2048' --output_dir /mnt/5efa3937-4221-48b5-9660-85a4a7eb0cfd/model

The predict script I'm using:


# Decode
DATA_DIR=/mnt/5efa3937-4221-48b5-9660-85a4a7eb0cfd/data
PROBLEM=party_party
MODEL=transformer
DECODE_FILE=decode.txt
TRAIN_DIR=/mnt/5efa3937-4221-48b5-9660-85a4a7eb0cfd/model
HPARAMS=transformer_base_single_gpu
BEAM_SIZE=4
ALPHA=0.6

tensor2tensor/bin/t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 \
  --eval_steps=0 \
  --decode_beam_size=$BEAM_SIZE \
  --decode_alpha=$ALPHA \
  --decode_from_file=$DECODE_FILE

The decode.txt is pretty simple:

324 f2r32f2r5 3f2fsfda fewfoIFE fselfj203 fselfj203 fj203rf2 3jr22iofj dsfslfkj23LJ dsf2f dsfslfkj23LJ dsflw2mf>K

This text will be encoded to integer ids, just like what the generator does.

Although this text is randomly generated, the procedure is a very common way to classify a text corpus.
Please help me see where I went wrong. Thanks a lot!

Tensorboard Support?

Does the trainer currently write out logs for Tensorboard? I looked through the code in utils/trainer_utils.py, and while I see calls to tf.summary.scalar, I don't see a call to tf.summary.FileWriter.

If it does support Tensorboard, how do I configure it?
If not, I'll start working on a pull request tomorrow to implement this.
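For what it's worth, tf.contrib.learn estimators write event files to their model_dir through summary hooks even without an explicit tf.summary.FileWriter, so TensorBoard can usually point straight at the training directory. A quick check (the path is assumed from the walkthrough):

import glob
import os

train_dir = os.path.expanduser(
    "~/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base")
# A non-empty result means TensorBoard has something to show for this run.
print(glob.glob(os.path.join(train_dir, "events.out.tfevents.*")))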

Decoding speed per sentence

Hi,

I have trained a transformer_big model for the wmt_ende_tokens_32k problem.
After 37118 steps, I found that it gives a decent result:

INFO:tensorflow:Saving dict for global step 37118:
	global_step = 37118,
	loss = 0.980365,
	metrics-wmt_ende_tokens_32k/accuracy = 0.789868,
	metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.0,
	metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.90224,
	metrics-wmt_ende_tokens_32k/bleu_score = 0.493593,
	metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.11336,
	metrics/accuracy = 0.789868,
	metrics/accuracy_per_sequence = 0.0,
	metrics/accuracy_top5 = 0.90224,
	metrics/bleu_score = 0.493593,
	metrics/neg_log_perplexity = -1.11336

I then tried to translate a newstest2014-deen-src.en file which consists of 10008 lines.
I followed the default HPARAMS setting for transformer_big, and set BEAM_SIZE=3 and ALPHA=0.6.

However, as the decoding process seemed to be taking forever, I re-tried the same process with a smaller file consisting of just 10 lines. This time, decoding took approx. 30 seconds after the learned model parameters were loaded.
Taking a second or more to decode each source sentence seems too long, as it suggests that translating the newstest2014-deen-src.en file would take a couple of hours.

Am I missing some options here?
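A rough cost model, stated as an approximation rather than a measured profile: beam search decodes autoregressively, one decoder pass per output token per beam hypothesis, so per-sentence time grows with output length times beam size. Under made-up but plausible constants:

def decode_seconds(num_sentences, avg_len=25, beam=3, sec_per_pass=0.04):
  # One forward pass per generated token per beam hypothesis.
  return num_sentences * avg_len * beam * sec_per_pass

print(decode_seconds(10))      # ~30 s, matching the 10-line file above
print(decode_seconds(10008))   # ~8 hours for newstest2014 at this rate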

How to run the reverse_decimal40 task?

Here is my run script:

[g@pc:/home/g/Desktop/tensor2tensor/reverse]$ cat run.sh 
PROBLEM=algorithmic_reverse_decimal40
MODEL=baseline_lstm_seq2seq
HPARAMS=basic1
DATA_DIR=./t2t_data
TMP_DIR=./t2t_datagen
TRAIN_DIR=./t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM

# mv $TMP_DIR/tokens.vocab.32768 $DATA_DIR

# Train
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR

# Decode

DECODE_FILE=$DATA_DIR/decode_this.txt
echo "Hello world" >> $DECODE_FILE
echo "Goodbye world" >> $DECODE_FILE

BEAM_SIZE=4
ALPHA=0.6

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 \
  --eval_steps=0 \
  --beam_size=$BEAM_SIZE \
  --alpha=$ALPHA \
  --decode_from_file=$DECODE_FILE

cat $DECODE_FILE.$MODEL.$HPARAMS.beam$BEAM_SIZE.alpha$ALPHA.decodes

Output:

[g@pc:/home/g/Desktop/tensor2tensor/reverse]$ bash run.sh 
INFO:tensorflow:Generating training data for algorithmic_reverse_decimal40.
INFO:tensorflow:Generating case 0 for algorithmic_reverse_decimal40-unshuffled-train.
INFO:tensorflow:Generating development data for algorithmic_reverse_decimal40.
INFO:tensorflow:Generating case 0 for algorithmic_reverse_decimal40-unshuffled-dev.
INFO:tensorflow:Shuffling data...
INFO:tensorflow:read: 10000
INFO:tensorflow:read: 20000
INFO:tensorflow:read: 30000
INFO:tensorflow:read: 40000
INFO:tensorflow:read: 50000
INFO:tensorflow:read: 60000
INFO:tensorflow:read: 70000
INFO:tensorflow:read: 80000
INFO:tensorflow:read: 90000
INFO:tensorflow:read: 100000
INFO:tensorflow:write: 0
INFO:tensorflow:write: 10000
INFO:tensorflow:write: 20000
INFO:tensorflow:write: 30000
INFO:tensorflow:write: 40000
INFO:tensorflow:write: 50000
INFO:tensorflow:write: 60000
INFO:tensorflow:write: 70000
INFO:tensorflow:write: 80000
INFO:tensorflow:write: 90000
INFO:tensorflow:read: 10000
INFO:tensorflow:write: 0
INFO:tensorflow:Registry contents:

  Models: ['multi_model', 'baseline_lstm_seq2seq', 'slice_net', 'diagonal_neural_gpu', 'byte_net', 'transformer', 'attention_lm', 'neural_gpu', 'xception']

  HParams: ['transformer_h32', 'transformer_big_dr2', 'transformer_big_dr3', 'transformer_big_dr1', 'slicenet1', 'transformer_tiny', 'xception_base', 'transformer_dr2', 'transformer_parsing_base_dr6', 'basic1', 'transformer_k256', 'transformer_h16', 'transformer_ff1024', 'transformer_k128', 'slicenet1tiny', 'transformer_big_enfr', 'multimodel1p8', 'transformer_dr0', 'transformer_base', 'transformer_l8', 'transformer_parsing_big', 'transformer_hs1024', 'slicenet1noam', 'transformer_big_single_gpu', 'attention_lm_base', 'transformer_ff4096', 'transformer_single_gpu', 'transformer_ls2', 'transformer_ls0', 'transformer_hs256', 'neural_gpu1', 'transformer_h1', 'transformer_h4', 'transformer_l4', 'transformer_l2', 'bytenet_base']

  RangedHParams: ['transformer_big_single_gpu', 'basic1', 'slicenet1']
  
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 20, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f670e058c10>, '_model_dir': './t2t_train/algorithmic_reverse_decimal40/baseline_lstm_seq2seq-basic1', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
, '_tf_random_seed': None, '_environment': 'local', '_num_worker_replicas': 0, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_evaluation_master': '', '_master': ''}
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:Performing local training.
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:Doing model_fn_body took 0.418 sec.
INFO:tensorflow:This model_fn took 0.649 sec.
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias              shape    (256,)                 size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel            shape    (128, 256)             size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias              shape    (256,)                 size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel            shape    (128, 256)             size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_2/basic_lstm_cell/bias              shape    (256,)                 size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_2/basic_lstm_cell/kernel            shape    (128, 256)             size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_3/basic_lstm_cell/bias              shape    (256,)                 size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/decoder/rnn/multi_rnn_cell/cell_3/basic_lstm_cell/kernel            shape    (128, 256)             size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias              shape    (256,)                 size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel            shape    (128, 256)             size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias              shape    (256,)                 size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel            shape    (128, 256)             size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_2/basic_lstm_cell/bias              shape    (256,)                 size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_2/basic_lstm_cell/kernel            shape    (128, 256)             size    32768
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_3/basic_lstm_cell/bias              shape    (256,)                 size    256
INFO:tensorflow:Weight    body/lstm_seq2seq/encoder/rnn/multi_rnn_cell/cell_3/basic_lstm_cell/kernel            shape    (128, 256)             size    32768
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_0                                             shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_10                                            shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_11                                            shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_12                                            shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_13                                            shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_14                                            shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_15                                            shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_1                                             shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_2                                             shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_3                                             shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_4                                             shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_5                                             shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_6                                             shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_7                                             shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_8                                             shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/input_emb/weights_9                                             shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_0                                               shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_10                                              shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_11                                              shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_12                                              shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_13                                              shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_14                                              shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_15                                              shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_1                                               shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_2                                               shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_3                                               shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_4                                               shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_5                                               shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_6                                               shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_7                                               shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_8                                               shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/softmax/weights_9                                               shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_0                                            shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_10                                           shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_11                                           shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_12                                           shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_13                                           shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_14                                           shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_15                                           shape    (0, 64)                size    0
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_1                                            shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_2                                            shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_3                                            shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_4                                            shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_5                                            shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_6                                            shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_7                                            shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_8                                            shape    (1, 64)                size    64
INFO:tensorflow:Weight    symbol_modality_11_64/target_emb/weights_9                                            shape    (1, 64)                size    64
INFO:tensorflow:Total trainable variables size: 266304
INFO:tensorflow:Total embedding variables size: 0
INFO:tensorflow:Total non-embedding variables size: 266304
INFO:tensorflow:Computing gradients for global model_fn.
INFO:tensorflow:Global model_fn finished.
INFO:tensorflow:Create CheckpointSaverHook.
2017-06-18 14:00:13.936233: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-18 14:00:13.936254: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-18 14:00:13.936261: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-18 14:00:13.936270: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-18 14:00:13.936276: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-06-18 14:00:14.068278: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-06-18 14:00:14.068727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:01:00.0
Total memory: 10.91GiB
Free memory: 10.57GiB
2017-06-18 14:00:14.068740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-06-18 14:00:14.068744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-06-18 14:00:14.068749: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
2017-06-18 14:00:17.068205: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 4488 get requests, put_count=3034 evicted_count=1000 eviction_rate=0.329598 and unsatisfied allocation rate=0.569073
2017-06-18 14:00:17.068238: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Saving checkpoints for 1 into ./t2t_train/algorithmic_reverse_decimal40/baseline_lstm_seq2seq-basic1/model.ckpt.
INFO:tensorflow:loss = inf, step = 1
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 4, in <module>
    __import__('pkg_resources').run_script('tensor2tensor==1.0.2', 't2t-trainer')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1511, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor-1.0.2-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 55, in <module>
    
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor-1.0.2-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 51, in main
    
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 234, in run
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 562, in run_locally
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1007, in _train_model
    _, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 842, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 960, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 477, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
INFO:tensorflow:Registry contents:

  Models: ['multi_model', 'baseline_lstm_seq2seq', 'slice_net', 'diagonal_neural_gpu', 'byte_net', 'transformer', 'attention_lm', 'neural_gpu', 'xception']

  HParams: ['transformer_h32', 'transformer_big_dr2', 'transformer_big_dr3', 'transformer_big_dr1', 'slicenet1', 'transformer_tiny', 'xception_base', 'transformer_dr2', 'transformer_parsing_base_dr6', 'basic1', 'transformer_k256', 'transformer_h16', 'transformer_ff1024', 'transformer_k128', 'slicenet1tiny', 'transformer_big_enfr', 'multimodel1p8', 'transformer_dr0', 'transformer_base', 'transformer_l8', 'transformer_parsing_big', 'transformer_hs1024', 'slicenet1noam', 'transformer_big_single_gpu', 'attention_lm_base', 'transformer_ff4096', 'transformer_single_gpu', 'transformer_ls2', 'transformer_ls0', 'transformer_hs256', 'neural_gpu1', 'transformer_h1', 'transformer_h4', 'transformer_l4', 'transformer_l2', 'bytenet_base']

  RangedHParams: ['transformer_big_single_gpu', 'basic1', 'slicenet1']
  
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 20, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f35ad359c10>, '_model_dir': './t2t_train/algorithmic_reverse_decimal40/baseline_lstm_seq2seq-basic1', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
, '_tf_random_seed': None, '_environment': 'local', '_num_worker_replicas': 0, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_evaluation_master': '', '_master': ''}
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:Performing Decoding from a file.
INFO:tensorflow:Getting sorted inputs
INFO:tensorflow: batch 1
INFO:tensorflow:Deocding batch 0
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 4, in <module>
    __import__('pkg_resources').run_script('tensor2tensor==1.0.2', 't2t-trainer')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1511, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor-1.0.2-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 55, in <module>
    
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor-1.0.2-py2.7.egg/EGG-INFO/scripts/t2t-trainer", line 51, in main
    
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 234, in run
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 623, in run_locally
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 590, in predict
    as_iterable=as_iterable)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 883, in _infer_model
    features = self._get_features_from_input_fn(input_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 863, in _get_features_from_input_fn
    result = input_fn()
  File "build/bdist.linux-x86_64/egg/tensor2tensor/utils/trainer_utils.py", line 743, in _decode_batch_input_fn
  File "build/bdist.linux-x86_64/egg/tensor2tensor/data_generators/text_encoder.py", line 60, in encode
ValueError: invalid literal for int() with base 10: 'Goodbye'
cat: ./t2t_data/decode_this.txt.baseline_lstm_seq2seq.basic1.beam4.alpha0.6.decodes: No such file or directory
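Two hedged observations on this run: training diverged (loss = inf at step 1, then NaN), so no usable model was saved for decoding; and the final ValueError shows that this problem's encoder parses each token with int(), so free text like "Goodbye world" cannot be encoded. A decode file for algorithmic_reverse_decimal40 presumably needs space-separated decimal digits, e.g.:

# Write digit sequences instead of English text (sketch).
with open("t2t_data/decode_this.txt", "w") as f:
  f.write("1 2 3 4 5 6 7 8 9 0\n")
  f.write("9 8 7 6 5 4 3 2 1 0\n")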

Loss declines very slowly in wmt_ende_tokens_32k

Training on a single NVIDIA GTX 1080 (12 GB) with the following parameters:
--problems=wmt_ende_tokens_32k
--model=transformer
--hparams_set=transformer_big_single_gpu
--hparams='batch_size=2048'

When the loss value is around 3, it drops very slowly:
steps 0 to ~30k: loss drops from 8.4 to 3
INFO:tensorflow:global_step/sec: 2.03158
INFO:tensorflow:loss = 3.62252, step = 29601 (49.221 sec)
INFO:tensorflow:global_step/sec: 2.03539
INFO:tensorflow:loss = 3.64336, step = 29701 (49.130 sec)
INFO:tensorflow:global_step/sec: 2.03153
INFO:tensorflow:loss = 3.58582, step = 29801 (49.226 sec)
INFO:tensorflow:global_step/sec: 2.02831
INFO:tensorflow:loss = 3.38816, step = 29901 (49.301 sec)
INFO:tensorflow:global_step/sec: 2.02674
INFO:tensorflow:loss = 3.40213, step = 30001 (49.340 sec)
INFO:tensorflow:global_step/sec: 2.03157
INFO:tensorflow:loss = 3.44571, step = 30101 (49.235 sec)
INFO:tensorflow:global_step/sec: 2.03308
INFO:tensorflow:loss = 3.15277, step = 30201 (49.175 sec)

around step 120k: the loss jitters around 3
INFO:tensorflow:loss = 3.12874, step = 125101 (76.470 sec)
INFO:tensorflow:global_step/sec: 2.04413
INFO:tensorflow:loss = 3.09151, step = 125201 (48.925 sec)
INFO:tensorflow:global_step/sec: 2.03683
INFO:tensorflow:loss = 3.2518, step = 125301 (49.093 sec)
INFO:tensorflow:global_step/sec: 2.03616
INFO:tensorflow:loss = 3.90474, step = 125401 (49.113 sec)
INFO:tensorflow:global_step/sec: 2.04036
INFO:tensorflow:loss = 2.87875, step = 125501 (49.010 sec)
INFO:tensorflow:global_step/sec: 2.0414
INFO:tensorflow:loss = 3.47175, step = 125601 (48.986 sec)
INFO:tensorflow:global_step/sec: 2.03132
INFO:tensorflow:loss = 3.00751, step = 125701 (49.230 sec)
INFO:tensorflow:global_step/sec: 2.0305
INFO:tensorflow:loss = 2.81739, step = 125801 (49.247 sec)
INFO:tensorflow:global_step/sec: 2.03291
INFO:tensorflow:loss = 3.60361, step = 125901 (49.191 sec)
INFO:tensorflow:global_step/sec: 2.03915
INFO:tensorflow:loss = 2.91831, step = 126001 (49.041 sec)
INFO:tensorflow:global_step/sec: 2.02992
INFO:tensorflow:loss = 2.98262, step = 126101 (49.263 sec)
INFO:tensorflow:global_step/sec: 2.03459

Is this normal?
Can you give a reference value?
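One way to read these numbers, assuming the reported loss is per-token cross-entropy: a loss of 3.0 corresponds to a perplexity of roughly e^3 ≈ 20, versus e^8.4 ≈ 4400 at the start, so a curve flattening near 3 still leaves the model far from the ~1.0-loss region reported elsewhere in these issues.

import math

for loss in (8.4, 3.0, 1.0):
  # perplexity = exp(cross-entropy), under the stated assumption
  print(loss, "->", round(math.exp(loss), 1))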

GPU usage

Hi all,
I tested the training example in the README.
I found that the volatile GPU-util of almost all GPUs is 0% except for the first one, even though the process took all the GPU memory on every card. I'm not sure whether it's a TensorFlow or tensor2tensor error.

Thank you

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 0000:04:00.0     Off |                    0 |
| N/A   56C    P0   187W / 250W |  21871MiB / 22939MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      Off  | 0000:05:00.0     Off |                    0 |
| N/A   28C    P0    56W / 250W |  21806MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40 24GB      Off  | 0000:08:00.0     Off |                    0 |
| N/A   28C    P0    55W / 250W |  21804MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40 24GB      Off  | 0000:09:00.0     Off |                    0 |
| N/A   29C    P0    55W / 250W |  21804MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M40 24GB      Off  | 0000:86:00.0     Off |                    0 |
| N/A   29C    P0    56W / 250W |  21808MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M40 24GB      Off  | 0000:87:00.0     Off |                    0 |
| N/A   27C    P0    57W / 250W |  21806MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M40 24GB      Off  | 0000:8A:00.0     Off |                    0 |
| N/A   30C    P0    57W / 250W |  21804MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla M40 24GB      Off  | 0000:8B:00.0     Off |                    0 |
| N/A   27C    P0    56W / 250W |  21804MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
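A hedged note: by default TensorFlow greedily reserves memory on every visible GPU even if the computation is placed on only one device, so the memory numbers alone do not show whether all cards are computing. To check, restrict visibility (a standard CUDA mechanism, not a T2T-specific one):

import os

# Expose only GPU 0 to this process; must be set before TensorFlow initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

If all eight cards are meant to do work, the trainer's worker-GPU flag (per the walkthrough's multi-GPU setup) is presumably what controls the data-parallel split.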

t2t-trainer: command not found

I have run into some issues installing on my GPU machine with Ubuntu 14.04.5. It reaches the
"Installing collected packages: mpmath, sympy, tensor2tensor" line, and then there is a permission-denied error.
[screenshot: permission-denied error during pip install, 2017-06-22 1:11 PM]

I tried pip-installing each individual package into the directory specified, and then it said that everything was installed, but t2t-trainer --registry_help did not work.

This step works fine on my CPU machine, but there I also run into the known issue of downloading the dataset.

Compatibility with TensorFlow 1.1.0

TensorFlow version : 1.1.0
OS:CentOS : 7.0
tensor2tensor version : 1.0.7 from pip

I encounter an exception when I run the 'Walkthrough'.
tensor2tensor/utils/trainer_utils.py uses tf.contrib.learn.Estimator and invokes the Estimator constructor with session_config, but in TensorFlow 1.1.0 the tf.contrib.learn.Estimator constructor has no session_config argument.
So tensor2tensor is not compatible with TensorFlow 1.1.0?
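A minimal guard illustrating the mismatch (assuming TF >= 1.2 is where the missing argument appeared, which I have not verified):

from distutils.version import LooseVersion
import tensorflow as tf

assert LooseVersion(tf.__version__) >= LooseVersion("1.2.0"), (
    "this tensor2tensor version passes session_config, which this "
    "TensorFlow build does not accept")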

Decoding problem for the wmt_ende_tokens_32k task

Hi,
I ran the basic wmt_ende_tokens_32k problem as shown in the Walkthrough in README.md, using 8 GPUs; the loss curve is as follows:
[figure: training-loss curve]

But when I run decoding with decode_from_file on the standard WMT 2014 En-De test set (newstest2014.en), the decoding output means nothing; it just shows irrelevant sentences, as follows:
[figure: sample decoding output]

The inputs and outputs do not match, and the BLEU score is 0.

When I run the eval script following the answer provided by @lukaszkaiser in #36, the BLEU score is nearly 0.006.

So what is the problem? Thanks!

generate_calculus_integrate_sample failure - Exceptions and/or ComplexInfinity

When called from t2t-datagen --problem=algorithmic_calculus_integrate ..., generate_calculus_integrate_sample() in tensor2tensor/data_generators/algorithmic_math.py raises an exception, or causes a subsequent KeyError in int_encoder().

The reason is an attempt to integrate expressions like "(b-d)/(a-a)" w.r.t. "b"; this leads to sympy.polys.polyerrors.PolynomialDivisionFailed, or builds expressions containing ComplexInfinity (aka zoo), which confuses int_encoder().

The straightforward fix would be to put a retry loop into calculus_integrate(), as sketched below; alternatively, random_expr_with_required_var() could be refined.
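A minimal retry-loop sketch for the first option (a hypothetical wrapper, not the actual patch):

import sympy

def sample_with_retry(sample_fn, is_valid, max_tries=100):
  for _ in range(max_tries):
    try:
      sample = sample_fn()                 # may integrate e.g. (b-d)/(a-a)
    except sympy.polys.polyerrors.PolynomialDivisionFailed:
      continue                             # bad draw, resample
    if is_valid(sample):                   # e.g. reject ComplexInfinity (zoo)
      return sample
  raise RuntimeError("no valid sample after %d tries" % max_tries)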

How to use an existing model to decode in the WMT problem

I trained the WMT transformer_base model and see some model checkpoints like:

-rw-rw-r-- 1 public public         24 Jun 23 09:47 model.ckpt-55975.data-00000-of-00002
-rw-rw-r-- 1 public public 1320167432 Jun 23 09:47 model.ckpt-55975.data-00001-of-00002
-rw-rw-r-- 1 public public      10449 Jun 23 09:47 model.ckpt-55975.index
-rw-rw-r-- 1 public public   24328177 Jun 23 09:47 model.ckpt-55975.meta
-rw-rw-r-- 1 public public         24 Jun 23 09:57 model.ckpt-56426.data-00000-of-00002
-rw-rw-r-- 1 public public 1320167432 Jun 23 09:57 model.ckpt-56426.data-00001-of-00002
-rw-rw-r-- 1 public public      10414 Jun 23 09:57 model.ckpt-56426.index

There are 2 questions:

  1. Is model.ckpt-xxxx.data-00001-of-00002 the saved model?
  2. How do I specify which model to use when decoding? I looked at the demo experiment in the README and saw that the --model param is set to transformer, which I thought should be set to a specific model file. Will t2t-trainer automatically use the latest checkpoint of the saved model? (See the sketch after the command below.)

DECODE_FILE=$DATA_DIR/decode_this.txt
echo "Hello world" >> $DECODE_FILE
echo "Goodbye world" >> $DECODE_FILE

BEAM_SIZE=4
ALPHA=0.6

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 \
  --eval_steps=0 \
  --decode_beam_size=$BEAM_SIZE \
  --decode_alpha=$ALPHA \
  --decode_from_file=$DECODE_FILE
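For reference, a hedged answer sketch: a TensorFlow checkpoint is the whole set of files sharing a model.ckpt-NNNN prefix (the .data-* shards hold the weights, .index the lookup table, .meta the graph), and --model names an architecture, not a file. The estimator restores whichever checkpoint the TRAIN_DIR/checkpoint file lists as latest, which you can inspect with:

import tensorflow as tf

# Prints e.g. ".../model.ckpt-56426" if that is the newest checkpoint.
print(tf.train.latest_checkpoint(
    "t2t_train/wmt_ende_tokens_32k/transformer-transformer_base"))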

Something wrong with the decoder result of the Walkthrough example

I trained the model on two Tesla M60s, each with 8 GB of memory. I did not modify any hyper-parameters. The loss does not seem to change after 50,000 steps.

INFO:tensorflow:Saving checkpoints for 54877 into /data/t2t/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt.
INFO:tensorflow:global_step/sec: 0.858429
INFO:tensorflow:loss = 1.96779, step = 54901 (116.492 sec)
INFO:tensorflow:global_step/sec: 0.876813
INFO:tensorflow:loss = 1.96174, step = 55001 (114.049 sec)
INFO:tensorflow:global_step/sec: 0.864947
INFO:tensorflow:loss = 1.98628, step = 55101 (115.614 sec)
INFO:tensorflow:global_step/sec: 0.860629
INFO:tensorflow:loss = 2.26156, step = 55201 (116.195 sec)
INFO:tensorflow:global_step/sec: 0.864128
INFO:tensorflow:loss = 1.98318, step = 55301 (115.723 sec)
INFO:tensorflow:Saving checkpoints for 55396 into /data/t2t/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt.
INFO:tensorflow:global_step/sec: 0.852946
INFO:tensorflow:loss = 2.30657, step = 55401 (117.241 sec)
INFO:tensorflow:global_step/sec: 0.870939
INFO:tensorflow:loss = 2.11571, step = 55501 (114.819 sec)
INFO:tensorflow:global_step/sec: 0.86979
INFO:tensorflow:loss = 1.99461, step = 55601 (114.970 sec)
INFO:tensorflow:global_step/sec: 0.86269
INFO:tensorflow:loss = 2.01496, step = 55701 (115.916 sec)
INFO:tensorflow:global_step/sec: 0.869183
INFO:tensorflow:loss = 1.98261, step = 55801 (115.051 sec)
INFO:tensorflow:global_step/sec: 0.862935
INFO:tensorflow:loss = 1.88075, step = 55901 (115.883 sec)
INFO:tensorflow:Saving checkpoints for 55915 into /data/t2t/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt.
INFO:tensorflow:global_step/sec: 0.855085
INFO:tensorflow:loss = 1.9415, step = 56001 (116.948 sec)
INFO:tensorflow:global_step/sec: 0.86353
INFO:tensorflow:loss = 2.26614, step = 56101 (115.804 sec)
INFO:tensorflow:global_step/sec: 0.871136
INFO:tensorflow:loss = 2.14308, step = 56201 (114.793 sec)
INFO:tensorflow:global_step/sec: 0.860752
INFO:tensorflow:loss = 1.96734, step = 56301 (116.178 sec)
INFO:tensorflow:global_step/sec: 0.871609
INFO:tensorflow:loss = 1.98928, step = 56401 (114.730 sec)

However, the decoder result does not make any sense. Does anyone know the reason?

INFO:tensorflow:Inference results INPUT: Goodbye world
INFO:tensorflow:Inference results OUTPUT: Esconnectentareaconnectentkannconnectent
INFO:tensorflow:Inference results INPUT: Hello world
INFO:tensorflow:Inference results OUTPUT: Esconnectentareaconnectentkannconnectent

attention_lm based rescoring

@lukaszkaiser Can you please help me with an inference script that takes a set of hypothesis sentences and gives a score for each sentence using the tensor2tensor approach? The current RNN-based LM approach is quite slow. Meanwhile, I will try training a character-level language model using the same technique.
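
To be concrete, by "score" I mean something like a length-normalized sum of token log-probabilities. A minimal sketch, where lm_log_probs is a hypothetical callable wrapping whatever trained language model is used (not a ready-made T2T function):

def rescore(hypotheses, lm_log_probs):
    """Score sentences by the length-normalized sum of token log-probs.

    lm_log_probs(tokens) is assumed to return a list containing
    log P(token_i | tokens_<i) for each position i.
    """
    scores = []
    for sentence in hypotheses:
        tokens = sentence.split()
        lp = lm_log_probs(tokens)
        scores.append(sum(lp) / max(len(lp), 1))  # normalize by length
    return scores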

Thanks

I followed the guideline below to register a new hyperparameter set, but failed

I followed this guideline:


You can currently do so for models, hyperparameter sets, and modalities. Please do submit a pull request if your component might be useful to others.

Here's an example with a new hyperparameter set:

In ~/usr/t2t_usr/my_registrations.py

from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

@registry.register_hparams
def transformer_my_very_own_hparams_set():
  hparams = transformer.transformer_base()
  hparams.hidden_size = 1024
  ...

In ~/usr/t2t_usr/__init__.py

import my_registrations

Then run:

t2t-trainer --t2t_usr_dir=~/usr/t2t_usr --registry_help

You'll see your transformer_my_very_own_hparams_set under the registered HParams; you can use it directly on the command line with the --hparams_set flag.


I did the same, but I could not find "transformer_my_very_own_hparams_set" in the result. Here is the log:

[@nmyjs_160_20 t2t_usr]# t2t-trainer --t2t_usr_dir=~/usr/t2t_usr --registry_help
INFO:tensorflow:
Registry contents:

Models: ['attention_lm', 'attention_lm_moe', 'baseline_lstm_seq2seq', 'byte_net', 'diagonal_neural_gpu', 'multi_model', 'neural_gpu', 'slice_net', 'transformer', 'xception']

HParams (by model):
* attention: ['attention_lm_base', 'attention_lm_moe_base', 'attention_lm_moe_large', 'attention_lm_moe_small']
* basic: ['basic_1']
* bytenet: ['bytenet_base']
* multimodel: ['multimodel_1p8']
* neuralgpu: ['neuralgpu_1']
* slicenet: ['slicenet_1', 'slicenet_1noam', 'slicenet_1tiny']
* transformer: ['transformer_base', 'transformer_base_single_gpu', 'transformer_big', 'transformer_big_dr1', 'transformer_big_dr2', 'transformer_big_enfr', 'transformer_big_single_gpu', 'transformer_dr0', 'transformer_dr2', 'transformer_ff1024', 'transformer_ff4096', 'transformer_h1', 'transformer_h16', 'transformer_h32', 'transformer_h4', 'transformer_hs1024', 'transformer_hs256', 'transformer_k128', 'transformer_k256', 'transformer_l2', 'transformer_l4', 'transformer_l8', 'transformer_ls0', 'transformer_ls2', 'transformer_parsing_base', 'transformer_parsing_big', 'transformer_tiny']
* xception: ['xception_base']

RangedHParams: ['basic1', 'slicenet1', 'transformer_big_single_gpu']

Modalities: ['audio:audio_spectral_modality', 'audio:default', 'audio:identity', 'class_label:class_label_2d', 'class_label:default', 'class_label:identity', 'generic:default', 'image:default', 'image:identity', 'image:small_image_modality', 'symbol:default', 'symbol:identity']

[@nmyjs_160_20 t2t_usr]# ls
__init__.py my_registrations.py

Can anyone help me?
