Comments (22)
It turns out that the separate "_" was a bug introduced inadvertently in a recent PR by Villi (see the chat on Gitter). We didn't have it before, so it might be responsible for some of the lower BLEU, though maybe not that much -- we should correct it in any case.
Another point is that all results in the paper are obtained with checkpoint averaging. Use the avg_checkpoints script from utils on the last 20 checkpoints saved in your $TRAIN_DIR. It's a poor man's version of Polyak averaging, but it's needed to reproduce our results (we're planning to add true Polyak averaging to the trainer at a later point).
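The idea behind checkpoint averaging is just an element-wise mean of each variable over the last N checkpoints. A minimal Python sketch of that logic (not the actual avg_checkpoints script, which reads TensorFlow checkpoint files; the function name and dict representation here are illustrative):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of every variable across a list of checkpoints.

    Each checkpoint is represented as a {variable_name: np.ndarray} dict,
    standing in for the tensors stored in a saved TensorFlow checkpoint.
    """
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return averaged
```

The real script additionally writes the averaged variables back out as a new checkpoint that the decoder can load.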
And then you need to (1) tokenize the newstest references and the (separated) decodes:
perl ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < $decodes_file > $decodes_file.tok
(2) Split on hyphens to be compatible with BLEU scores from other papers:
Put compounds in ATAT format (comparable to GNMT, ConvS2S)
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $tok_gold_targets > $tok_gold_targets.atat
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $decodes_file.target > $decodes_file.atat
(3) Finally run multi-bleu:
perl ~/mosesdecoder/scripts/generic/multi-bleu.perl $tok_gold_targets.atat < $decodes_file.atat
Doing the averaging and the tokenization in (1) is especially important; detokenized BLEU is often quite a bit lower than tokenized BLEU.
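For reference, the hyphen-splitting step can be mirrored in Python (a sketch equivalent to the perl one-liner above; the helper name is made up):

```python
import re

def split_hyphens_atat(line):
    # Equivalent of perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g':
    # separate intra-word hyphens so compounds are scored the same
    # way as in GNMT / ConvS2S.
    return re.sub(r"(\S)-(\S)", r"\1 ##AT##-##AT## \2", line)
```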
from tensor2tensor.
A model trained for 300k steps on 4 Titan X GPUs,
4th row of the (C) models (d_model = 256, d_k = 32, d_v = 32):
newstest2013.{en,de} BLEU 24.2.
The paper said 24.5 without averaging, so now we are at the same level.
By the way, after averaging 20 checkpoints, I got 24.78.
The first size looks correct: 444K for the dev set (it's only a few thousand sentence pairs; each sentence is ~20 ints, 2-4 bytes/int gives ~160 bytes/sentence pair, so ~400KB looks OK). My -train set is sharded 100x and I have 7MB in each file (the dataset is 4M pairs, so again, it makes sense).
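The back-of-the-envelope calculation above can be written out explicitly; the pair count is an assumption ("a few thousand"), and the 2-4 bytes/int figure assumes a varint-style encoding:

```python
# Rough sanity check of the dev-set file size quoted above.
ints_per_sentence = 20     # average subword tokens per sentence (from the comment)
bytes_per_int = 4          # 2-4 bytes per int; take the upper end
sentences_per_pair = 2     # source + target
dev_pairs = 3000           # "a few thousand sentence pairs" (assumed)

bytes_per_pair = ints_per_sentence * bytes_per_int * sentences_per_pair
dev_size_kb = dev_pairs * bytes_per_pair / 1024
# bytes_per_pair == 160; dev_size_kb lands in the few-hundred-KB range,
# consistent with the ~444K dev file.
```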
Since a few people are complaining, could you post the details of your training and results? Did you train on 1 or many GPUs? For how many steps? What is the eval printing out? If it's 1 GPU, you should use the transformer_base_single_gpu hparams config; we should make this clearer in the readme.
[3x Titan Black, base model]
global_step = 331361, loss = 1.5812, metrics-wmt_ende_tokens_32k/accuracy = 0.663404, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.000824742, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.84237, metrics-wmt_ende_tokens_32k/bleu_score = 0.325035, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.79698, metrics/accuracy = 0.663404, metrics/accuracy_per_sequence = 0.000824742, metrics/accuracy_top5 = 0.84237, metrics/bleu_score = 0.325035, metrics/neg_log_perplexity = -1.79698
=> actual BLEU: 21.x on newstest2013
[4x Titan X, small model: d_model = 256, d_k = 32, d_v = 32, i.e. the 4th (C) model in Table 3 of the paper]
global_step = 167127, loss = 1.48184, metrics-wmt_ende_tokens_32k/accuracy = 0.675814, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.00234962, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.852813, metrics-wmt_ende_tokens_32k/bleu_score = 0.340511, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.6753, metrics/accuracy = 0.675814, metrics/accuracy_per_sequence = 0.00234962, metrics/accuracy_top5 = 0.852813, metrics/bleu_score = 0.340511, metrics/neg_log_perplexity = -1.6753
=> actual BLEU: 23.85 on newstest2013
However, I got an "out of range" error when eval steps are larger than 23.
If I understood correctly, the dev dataset has many more pairs, and more than 100 eval steps should not result in "out of range". This is why I suspect the dataset is behind the low performance.
I trained on 8 NVIDIA Titan Xp GPUs with the "transformer_base" parameters:
@registry.register_hparams
def transformer_base():
"""Set of hyperparameters."""
hparams = common_hparams.basic_params1()
hparams.hidden_size = 512
hparams.batch_size = 4096
hparams.max_length = 256
hparams.dropout = 0.0
hparams.clip_grad_norm = 0. # i.e. no gradient clipping
hparams.optimizer_adam_epsilon = 1e-9
hparams.learning_rate_decay_scheme = "noam"
hparams.learning_rate = 0.1
hparams.learning_rate_warmup_steps = 4000
hparams.initializer_gain = 1.0
hparams.num_hidden_layers = 6
hparams.initializer = "uniform_unit_scaling"
hparams.weight_decay = 0.0
hparams.optimizer_adam_beta1 = 0.9
hparams.optimizer_adam_beta2 = 0.98
hparams.num_sampled_classes = 0
hparams.label_smoothing = 0.1
hparams.shared_embedding_and_softmax_weights = int(True)
hparams.add_hparam("filter_size", 2048) # Add new ones like this.
# attention-related flags
hparams.add_hparam("num_heads", 8)
hparams.add_hparam("attention_key_channels", 0)
hparams.add_hparam("attention_value_channels", 0)
hparams.add_hparam("ffn_layer", "conv_hidden_relu")
hparams.add_hparam("parameter_attention_key_channels", 0)
hparams.add_hparam("parameter_attention_value_channels", 0)
# All hyperparameters ending in "dropout" are automatically set to 0.0
# when not in training mode.
hparams.add_hparam("attention_dropout", 0.0)
hparams.add_hparam("relu_dropout", 0.0)
hparams.add_hparam("residual_dropout", 0.1)
hparams.add_hparam("pos", "timing") # timing, none
hparams.add_hparam("nbr_decoder_problems", 1)
return hparams
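For context, the "noam" decay scheme set above corresponds to the schedule from the paper: the learning rate grows linearly for warmup_steps and then decays as the inverse square root of the step. A sketch of the formula (the actual t2t implementation also multiplies in the learning_rate hparam, so absolute values may differ):

```python
def noam_learning_rate(step, hidden_size=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    # Linear warmup up to warmup_steps, then inverse-sqrt decay.
    return hidden_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```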
The data is split into 100 parts.
The loss is quite low, yet the BLEU on newstest2013 is only 20.37 (multi-bleu.perl).
Can this configuration achieve the performance in the paper?
Is 140K steps on 8 GPUs enough?
Why are attention_dropout and relu_dropout set to 0? Does this hurt BLEU?
INFO:tensorflow:Evaluation [1/20]
...
INFO:tensorflow:Evaluation [20/20]
INFO:tensorflow:Finished evaluation at 2017-06-28-04:44:23
INFO:tensorflow:Saving dict for global step 145673: global_step = 145673, loss = 0.787518, metrics-wmt_ende_tokens_32k/accuracy = 0.8182, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.00633413, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.925367, metrics-wmt_ende_tokens_32k/bleu_score = 0.496107, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -0.944883, metrics/accuracy = 0.8182, metrics/accuracy_per_sequence = 0.00633413, metrics/accuracy_top5 = 0.925367, metrics/bleu_score = 0.496107, metrics/neg_log_perplexity = -0.944883
@zxw866 : that looks like a very strong model!
@neverdoubt: when you say "=> actual bleu : 23.85 on newstest2013", how do you measure that exactly? Do you use the MOSES scripts, the recent version? Remember that newstest2014 often scores 0.5 BLEU or more higher than '13 -- could you run on that? There is also the hyphenation-split issue, which can make around a 0.2 difference. We should probably publish the BLEU calculation we use somewhere too.
Ah, also, we average the last 20 checkpoints with this script:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/avg_checkpoints.py
Did you try that? Let's get your results to the same level as ours!
But with your hardware, guys, you should try transformer_big too!
I used the recent MOSES multi-bleu.perl (https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl).
Actual BLEU scores are computed using the newstest2013.en file (which is the dev set).
I expect 25.8 (as in Table 3) from my base trained model.
Anyway, I'll try the big model.
Although "metrics-wmt_ende_tokens_32k/bleu_score = 0.496107" is high, the BLEU on newstest2013 is only 20.37 (multi-bleu.perl).
Can this configuration achieve the 25.8 reported in the paper?
Is 140K steps on 8 GPUs enough?
@neverdoubt Did you use newstest2013.en without preprocessing, or did you postprocess the Tensor2Tensor output before BLEU scoring? I think multi-bleu needs the source and reference to be tokenized...
The result of data generation in the Walkthrough is about 1400MB, which is double the size of the BPE training set. I'm guessing '_' is used as an independent token in sentences, which led to the very low loss, as shown in the data generation process.
I wonder if this is a bug?
Was this with the newest version of T2T, i.e. 1.0.8 -- the one with the separate underscores? It would be nice to get confirmation that those don't necessarily hurt model performance (and may even make it better ;-) )
@vthorsteinsson I used 1.0.7 for training (which has the separate '_' issue), but my training data was created with an earlier version (maybe 1.0.2 or 1.0.4).
I added utils/get_ende_bleu.sh in 1.0.9. This script includes the commands we used to get BLEU in the paper. You might need to fix the path to MOSES and the tokenized newstest2013 there.
Please: could you average your checkpoints with utils/avg_checkpoints.py, then run utils/get_ende_bleu.sh and report back the results? Just to make sure where your models really stand compared to ours, despite possible tokenization differences. Thanks!
When using the BPE training set, I got 24.64 on newstest2013. It's close to the result in the paper.
Next I will try utils/get_ende_bleu.sh. Thanks!
Just remember that wmt_ende_bpe32k is tokenized, so instead of the tokenizer call in the get_ende_bleu.sh script, do this: perl -ple 's{@@ }{}g' > $decodes_file.target. Also, did you average checkpoints? Let us know what numbers you get!
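The @@-removal step can likewise be done in Python (a sketch of the same substitution; the helper name is made up):

```python
def remove_bpe(line):
    # Equivalent of perl -ple 's{@@ }{}g': delete the "@@ " continuation
    # markers that BPE inserts, rejoining subwords into full tokens.
    return line.replace("@@ ", "")
```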
transformer_base hparams.
110k steps using 1 Titan Xp, then 140k steps using 8 Titan Xp.
I averaged 7 checkpoints.
I removed '@@' using "sed -r 's/(@@ )|(@@ ?$)//g'".
Then I got 24.64 on newstest2013.
I'm guessing the learning rate decay was affected in my experiment.
Next I plan to run the big model.
I really appreciate your help!
That looks reasonable. Did you run get_ende_bleu, I mean especially the "atat" part? That can cost 0.2 or 0.3 BLEU if you forget it.
Hi @lukaszkaiser, I have read the discussion above and found that the details for calculating BLEU in #44 are different. So I want to make sure of something:
What format is needed for the decode input file (--decode_from_file=$DECODE_FILE, such as newstest2013 or newstest2014)? Do I need to tokenize and put compounds in ATAT format before feeding it into the decoding process? Do I also need to apply BPE, or just use the raw text sentences?
The input file should be detokenized, pure text. (Except if you do BPE, but I suggest trying without it.)
@lukaszkaiser When I feed pure text for decoding and average checkpoints, I get a BLEU score of 26 with the base configuration model.
That sounds reasonable. I'm closing this issue for now as it's gotten long and tokenization changed in 1.0.11. I hope things are ok now, but please either re-open or make a new issue if you see the problems again!