Comments (22)
It turns out that the separate "_" was a bug introduced inadvertently in a recent PR by Villi (see the chat on Gitter). We didn't have it before, so it might be responsible for some of the lower BLEU, though maybe not that much -- we should correct it in any case.
Another point is that all results in the paper are obtained with checkpoint averaging. Use the avg_checkpoints script from utils on the last 20 checkpoints saved in your $TRAIN_DIR. It's a poor man's version of Polyak averaging, but it's needed to reproduce our results (we're planning to add true Polyak averaging to the trainer at a later point).
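The idea behind checkpoint averaging is just an element-wise mean of each variable over the last N checkpoints. A minimal Python sketch of that logic (not the actual avg_checkpoints script, which reads TensorFlow checkpoint files; the function name and dict representation here are illustrative):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of every variable across a list of checkpoints.

    Each checkpoint is represented as a {variable_name: np.ndarray} dict,
    standing in for the tensors stored in a saved TensorFlow checkpoint.
    """
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return averaged
```

The real script additionally writes the averaged variables back out as a new checkpoint that the decoder can load.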
And then you need to (1) tokenize the newstest references and the (separated) decodes:
perl ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < $decodes_file > $decodes_file.tok
(2) Split on hyphens to be compatible with BLEU scores from other papers:
Put compounds in ATAT format (comparable to GNMT, ConvS2S)
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $tok_gold_targets > $tok_gold_targets.atat
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $decodes_file.target > $decodes_file.atat
(3) Finally run multi-bleu:
perl ~/mosesdecoder/scripts/generic/multi-bleu.perl $tok_gold_targets.atat < $decodes_file.atat
Doing the averaging and the tokenization in (1) is especially important; detokenized BLEU is often quite a bit lower than tokenized BLEU.
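For reference, the hyphen-splitting step can be mirrored in Python (a sketch equivalent to the perl one-liner above; the helper name is made up):

```python
import re

def split_hyphens_atat(line):
    # Equivalent of perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g':
    # separate intra-word hyphens so compounds are scored the same
    # way as in GNMT / ConvS2S.
    return re.sub(r"(\S)-(\S)", r"\1 ##AT##-##AT## \2", line)
```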
from tensor2tensor.
A model trained for 300k steps on 4 Titan X GPUs,
4th row of the (C) models (d_model = 256, d_k = 32, d_v = 32):
newstest2013.{en,de} BLEU 24.2.
The paper said 24.5 without averaging, so now we are at the same level.
By the way, after averaging 20 checkpoints, I got 24.78.
The first size looks correct: 444K for the dev set (it's only a few thousand sentence pairs; each sentence is ~20 ints, 2-4 bytes/int gives ~160 bytes/sentence pair, so ~400KB looks OK). My -train set is sharded 100x and I have 7MB in each file (the dataset is 4M pairs, so again, it makes sense).
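The back-of-the-envelope calculation above can be written out explicitly; the pair count is an assumption ("a few thousand"), and the 2-4 bytes/int figure assumes a varint-style encoding:

```python
# Rough sanity check of the dev-set file size quoted above.
ints_per_sentence = 20     # average subword tokens per sentence (from the comment)
bytes_per_int = 4          # 2-4 bytes per int; take the upper end
sentences_per_pair = 2     # source + target
dev_pairs = 3000           # "a few thousand sentence pairs" (assumed)

bytes_per_pair = ints_per_sentence * bytes_per_int * sentences_per_pair
dev_size_kb = dev_pairs * bytes_per_pair / 1024
# bytes_per_pair == 160; dev_size_kb lands in the few-hundred-KB range,
# consistent with the ~444K dev file.
```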
Since a few people are complaining, could you post the details of your training and results? Did you train on 1 or many GPUs? For how many steps? What is the eval printing out? If it's 1 GPU, you should use the transformer_base_single_gpu hparams config; we should make this clearer in the readme.
[3x Titan Black, base model]
global_step = 331361, loss = 1.5812, metrics-wmt_ende_tokens_32k/accuracy = 0.663404, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.000824742, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.84237, metrics-wmt_ende_tokens_32k/bleu_score = 0.325035, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.79698, metrics/accuracy = 0.663404, metrics/accuracy_per_sequence = 0.000824742, metrics/accuracy_top5 = 0.84237, metrics/bleu_score = 0.325035, metrics/neg_log_perplexity = -1.79698
=> actual BLEU: 21.x on newstest2013
[4x Titan X, small model: d_model = 256, d_k = 32, d_v = 32, i.e. the 4th (C) model in Table 3 of the paper]
global_step = 167127, loss = 1.48184, metrics-wmt_ende_tokens_32k/accuracy = 0.675814, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.00234962, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.852813, metrics-wmt_ende_tokens_32k/bleu_score = 0.340511, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.6753, metrics/accuracy = 0.675814, metrics/accuracy_per_sequence = 0.00234962, metrics/accuracy_top5 = 0.852813, metrics/bleu_score = 0.340511, metrics/neg_log_perplexity = -1.6753
=> actual BLEU: 23.85 on newstest2013
However, I got an "out of range" error when eval steps are larger than 23.
If I understood correctly, the dev dataset has many more pairs, and more than 100 eval steps should not result in "out of range". This is why I suspect the dataset is behind the low performance.
I trained on 8 NVIDIA Titan Xp GPUs with the "transformer_base" parameters:
@registry.register_hparams
def transformer_base():
"""Set of hyperparameters."""
hparams = common_hparams.basic_params1()
hparams.hidden_size = 512
hparams.batch_size = 4096
hparams.max_length = 256
hparams.dropout = 0.0
hparams.clip_grad_norm = 0. # i.e. no gradient clipping
hparams.optimizer_adam_epsilon = 1e-9
hparams.learning_rate_decay_scheme = "noam"
hparams.learning_rate = 0.1
hparams.learning_rate_warmup_steps = 4000
hparams.initializer_gain = 1.0
hparams.num_hidden_layers = 6
hparams.initializer = "uniform_unit_scaling"
hparams.weight_decay = 0.0
hparams.optimizer_adam_beta1 = 0.9
hparams.optimizer_adam_beta2 = 0.98
hparams.num_sampled_classes = 0
hparams.label_smoothing = 0.1
hparams.shared_embedding_and_softmax_weights = int(True)
hparams.add_hparam("filter_size", 2048) # Add new ones like this.
# attention-related flags
hparams.add_hparam("num_heads", 8)
hparams.add_hparam("attention_key_channels", 0)
hparams.add_hparam("attention_value_channels", 0)
hparams.add_hparam("ffn_layer", "conv_hidden_relu")
hparams.add_hparam("parameter_attention_key_channels", 0)
hparams.add_hparam("parameter_attention_value_channels", 0)
# All hyperparameters ending in "dropout" are automatically set to 0.0
# when not in training mode.
hparams.add_hparam("attention_dropout", 0.0)
hparams.add_hparam("relu_dropout", 0.0)
hparams.add_hparam("residual_dropout", 0.1)
hparams.add_hparam("pos", "timing") # timing, none
hparams.add_hparam("nbr_decoder_problems", 1)
return hparams
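For context, the "noam" decay scheme set above corresponds to the schedule from the paper: the learning rate grows linearly for warmup_steps and then decays as the inverse square root of the step. A sketch of the formula (the actual t2t implementation also multiplies in the learning_rate hparam, so absolute values may differ):

```python
def noam_learning_rate(step, hidden_size=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    # Linear warmup up to warmup_steps, then inverse-sqrt decay.
    return hidden_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```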
The data is split into 100 parts.
The loss is quite low, yet the BLEU on newstest2013 is only 20.37 (multi-bleu.perl).
Can this configuration achieve the performance in the paper?
Is 140K steps on 8 GPUs enough?
Why are attention_dropout and relu_dropout set to 0? Does this hurt BLEU?
INFO:tensorflow:Evaluation [1/20]
...
INFO:tensorflow:Evaluation [20/20]
INFO:tensorflow:Finished evaluation at 2017-06-28-04:44:23
INFO:tensorflow:Saving dict for global step 145673: global_step = 145673, loss = 0.787518, metrics-wmt_ende_tokens_32k/accuracy = 0.8182, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.00633413, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.925367, metrics-wmt_ende_tokens_32k/bleu_score = 0.496107, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -0.944883, metrics/accuracy = 0.8182, metrics/accuracy_per_sequence = 0.00633413, metrics/accuracy_top5 = 0.925367, metrics/bleu_score = 0.496107, metrics/neg_log_perplexity = -0.944883
@zxw866 : that looks like a very strong model!
@neverdoubt: when you say "=> actual bleu : 23.85 on newstest2013", how do you measure that exactly? Do you use the MOSES scripts, the recent version? Remember that newstest2014 often scores 0.5 BLEU or more higher than '13 -- could you run on that? There is also the hyphenation-split issue, which can make around a 0.2 difference. We should probably publish the BLEU calculation we use somewhere too.
Ah, also, we average the last 20 checkpoints with this script:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/avg_checkpoints.py
Did you try that? Let's get your results to the same level as ours!
But with your hardware, guys, you should try transformer_big too!
I used the recent MOSES multi-bleu.perl (https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl).
Actual BLEU scores are computed using the newstest2013.en file (which is the dev set).
I expect 25.8 (as in Table 3) from my base trained model.
Anyway, I'll try the big model.
Although "metrics-wmt_ende_tokens_32k/bleu_score = 0.496107" is high, the BLEU on newstest2013 is only 20.37 (multi-bleu.perl).
Can this configuration achieve the 25.8 reported in the paper?
Is 140K steps on 8 GPUs enough?
@neverdoubt Did you use newstest2013.en without preprocessing, or did you postprocess the Tensor2Tensor output before BLEU scoring? I think multi-bleu needs the source and reference to be tokenized...
The result of data generation in the Walkthrough is about 1400MB, which is double the size of the BPE training set. I'm guessing '_' is used as an independent token in sentences, which led to the very low loss, as shown in the data generation process.
I wonder if this is a bug?
Was this with the newest version of T2T, i.e. 1.0.8 -- the one with the separate underscores? It would be nice to get confirmation that those don't necessarily hurt model performance (and may even make it better ;-) )
@vthorsteinsson I used 1.0.7 for training (which has the separate '_' issue), but my training data was created with an earlier version (maybe 1.0.2 or 1.0.4).
I added utils/get_ende_bleu.sh in 1.0.9. This script includes the commands we used to get BLEU in the paper. You might need to fix the path to MOSES and the tokenized newstest2013 there.
Please: could you average your checkpoints with utils/avg_checkpoints.py, then run utils/get_ende_bleu.sh and report back the results? Just to make sure where your models really stand compared to ours, despite possible tokenization differences. Thanks!
When using the BPE training set, I got 24.64 on newstest2013. It's close to the result in the paper.
Next I will try utils/get_ende_bleu.sh. Thanks!
Just remember that wmt_ende_bpe32k is tokenized, so instead of the tokenizer call in the get_ende_bleu.sh script, do this: perl -ple 's{@@ }{}g' > $decodes_file.target. Also, did you average checkpoints? Let us know what numbers you get!
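The @@-removal step can likewise be done in Python (a sketch of the same substitution; the helper name is made up):

```python
def remove_bpe(line):
    # Equivalent of perl -ple 's{@@ }{}g': delete the "@@ " continuation
    # markers that BPE inserts, rejoining subwords into full tokens.
    return line.replace("@@ ", "")
```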
transformer_base hparams.
110k steps using 1 Titan Xp, then 140k steps using 8 Titan Xp.
I averaged 7 checkpoints.
I removed '@@' using "sed -r 's/(@@ )|(@@ ?$)//g'".
Then I got 24.64 on newstest2013.
I'm guessing the learning rate decay was affected in my experiment.
Next I plan to run the big model.
I really appreciate your help!
That looks reasonable. Did you run get_ende_bleu, I mean especially the "atat" part? That can cost 0.2 or 0.3 BLEU if you forget it.
Hi @lukaszkaiser, I have read the discussion above and found that the details for calculating BLEU in #44 are different. So I want to make sure of something:
What format is needed for the decode input file (--decode_from_file=$DECODE_FILE, such as newstest2013 or newstest2014)? Do I need to tokenize and put compounds in ATAT format before feeding it into the decoding process? Do I also need to apply BPE, or just use the raw text sentences?
The input file should be detokenized, pure text. (Except if you do BPE, but I suggest trying without it.)
@lukaszkaiser When I feed pure text for decoding and average checkpoints, I get a BLEU score of 26 with the base configuration model.
That sounds reasonable. I'm closing this issue for now as it's gotten long and tokenization changed in 1.0.11. I hope things are ok now, but please either re-open or make a new issue if you see the problems again!