
Comments (22)

lukaszkaiser avatar lukaszkaiser commented on August 27, 2024 2

It turns out that the separate "_" was a bug introduced inadvertently in a recent PR by Villi (see the chat on Gitter). We didn't have it before, so it might be responsible for some of the lower BLEU, though maybe not that much -- we should correct it in any case.

Another point is that all results in the paper were obtained with checkpoint averaging. Use the avg_checkpoints.py script from utils on the last 20 checkpoints saved in your $TRAIN_DIR. It's a poor man's version of Polyak averaging, but it's needed to reproduce our results (we're planning to add true Polyak averaging to the trainer at a later point).
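
For example, an averaging invocation might look roughly like this (a sketch only: the flag names follow a reading of avg_checkpoints.py and may differ between versions, and the checkpoint numbers are made up):

# Sketch: average a list of checkpoints into one; adjust flags to your version.
python tensor2tensor/utils/avg_checkpoints.py \
  --prefix=$TRAIN_DIR/ \
  --checkpoints=model.ckpt-249000,model.ckpt-249500,model.ckpt-250000 \
  --output_path=$TRAIN_DIR/averaged.ckpt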

And then you need to (1) tokenize the newstest and the (separated) decodes:
perl ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < $decodes_file > $decodes_file.tok
(2) Split on hyphens to be compatible with BLEU scores from other papers:

Put compounds in ATAT format (comparable to GNMT, ConvS2S)

perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $tok_gold_targets > $tok_gold_targets.atat
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $decodes_file.tok > $decodes_file.atat
(3) Finally run multi-bleu:
perl ~/mosesdecoder/scripts/generic/multi-bleu.perl $tok_gold_targets.atat < $decodes_file.atat

The averaging and the tokenization in (1) are especially important; detokenized BLEU is often quite a bit lower than tokenized BLEU.
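
Putting it all together, the whole pipeline reads roughly like this (the file names are placeholders):

# Placeholders: point these at your own decodes and tokenized references.
decodes_file=decodes.de
tok_gold_targets=newstest2013.tok.de
# (1) tokenize the decodes
perl ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < $decodes_file > $decodes_file.tok
# (2) split hyphenated compounds into ATAT format on both sides
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $tok_gold_targets > $tok_gold_targets.atat
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $decodes_file.tok > $decodes_file.atat
# (3) score
perl ~/mosesdecoder/scripts/generic/multi-bleu.perl $tok_gold_targets.atat < $decodes_file.atat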


neverdoubt avatar neverdoubt commented on August 27, 2024 1

300k-step model trained on 4 Titan X GPUs,
4th row of the (C) models in Table 3 (d_model = 256, d_k = 32, d_v = 32):
BLEU 24.2 on newstest2013.{en,de}.
The paper reports 24.5 without averaging, so we are now at the same level.

By the way, after averaging 20 checkpoints, I got 24.78.


lukaszkaiser avatar lukaszkaiser commented on August 27, 2024

The first size looks correct: 444K for the dev set (it's only a few thousand sentence pairs; each sentence is ~20 ints, and 2*20*4 bytes gives ~160 bytes/sentence pair, so ~400KB looks ok). My -train is sharded 100x and I have 7MB in each file (the dataset is 4M pairs, so again, it makes sense).


lukaszkaiser avatar lukaszkaiser commented on August 27, 2024

Since a few people are complaining, could you post the details of your training and results? Did you train on 1 or many GPUs? For how many steps? What is the eval printing out? If it's 1 GPU, you should use the transformer_base_single_gpu hparams config; we should make this clearer in the README.
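
For reference, selecting that config looks roughly like this (a sketch along the lines of the walkthrough in the README; flag names may differ in your version):

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=wmt_ende_tokens_32k \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --output_dir=$TRAIN_DIR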


neverdoubt avatar neverdoubt commented on August 27, 2024

[3x Titan Black, base model]

global_step = 331361, loss = 1.5812, metrics-wmt_ende_tokens_32k/accuracy = 0.663404, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.000824742, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.84237, metrics-wmt_ende_tokens_32k/bleu_score = 0.325035, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.79698, metrics/accuracy = 0.663404, metrics/accuracy_per_sequence = 0.000824742, metrics/accuracy_top5 = 0.84237, metrics/bleu_score = 0.325035, metrics/neg_log_perplexity = -1.79698
=> actual BLEU: 21.x on newstest2013

[4x Titan X, small model: d_model = 256, d_k = 32, d_v = 32, which is the 4th (C) model in Table 3 of the paper]
global_step = 167127, loss = 1.48184, metrics-wmt_ende_tokens_32k/accuracy = 0.675814, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.00234962, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.852813, metrics-wmt_ende_tokens_32k/bleu_score = 0.340511, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.6753, metrics/accuracy = 0.675814, metrics/accuracy_per_sequence = 0.00234962, metrics/accuracy_top5 = 0.852813, metrics/bleu_score = 0.340511, metrics/neg_log_perplexity = -1.6753
=> actual BLEU: 23.85 on newstest2013
However, I got "ran out of range" when eval steps were larger than 23.
If I understood correctly, the dev dataset has many more pairs, and more than 100 eval steps should not result in "ran out of range"; this is why I suspect the dataset is behind the low performance.


zxw866 avatar zxw866 commented on August 27, 2024

I trained on 8 NVIDIA Titan Xp GPUs with the "transformer_base" parameters:
@registry.register_hparams
def transformer_base():
  """Set of hyperparameters."""
  hparams = common_hparams.basic_params1()
  hparams.hidden_size = 512
  hparams.batch_size = 4096
  hparams.max_length = 256
  hparams.dropout = 0.0
  hparams.clip_grad_norm = 0.  # i.e. no gradient clipping
  hparams.optimizer_adam_epsilon = 1e-9
  hparams.learning_rate_decay_scheme = "noam"
  hparams.learning_rate = 0.1
  hparams.learning_rate_warmup_steps = 4000
  hparams.initializer_gain = 1.0
  hparams.num_hidden_layers = 6
  hparams.initializer = "uniform_unit_scaling"
  hparams.weight_decay = 0.0
  hparams.optimizer_adam_beta1 = 0.9
  hparams.optimizer_adam_beta2 = 0.98
  hparams.num_sampled_classes = 0
  hparams.label_smoothing = 0.1
  hparams.shared_embedding_and_softmax_weights = int(True)

  hparams.add_hparam("filter_size", 2048)  # Add new ones like this.
  # attention-related flags
  hparams.add_hparam("num_heads", 8)
  hparams.add_hparam("attention_key_channels", 0)
  hparams.add_hparam("attention_value_channels", 0)
  hparams.add_hparam("ffn_layer", "conv_hidden_relu")
  hparams.add_hparam("parameter_attention_key_channels", 0)
  hparams.add_hparam("parameter_attention_value_channels", 0)
  # All hyperparameters ending in "dropout" are automatically set to 0.0
  # when not in training mode.
  hparams.add_hparam("attention_dropout", 0.0)
  hparams.add_hparam("relu_dropout", 0.0)
  hparams.add_hparam("residual_dropout", 0.1)
  hparams.add_hparam("pos", "timing")  # timing, none
  hparams.add_hparam("nbr_decoder_problems", 1)
  return hparams

[loss curve screenshot omitted]

The data is split into 100 parts.
The loss seems suspiciously low, and yet the BLEU on newstest2013 is only 20.37 (multi-bleu.perl).
Can this configuration achieve the performance in the paper?
Are 140K steps on 8 GPUs enough?
Why are attention_dropout and relu_dropout set to 0? Does this hurt BLEU?

INFO:tensorflow:Evaluation [1/20]
INFO:tensorflow:Evaluation [2/20]
...
INFO:tensorflow:Evaluation [20/20]
INFO:tensorflow:Finished evaluation at 2017-06-28-04:44:23
INFO:tensorflow:Saving dict for global step 145673: global_step = 145673, loss = 0.787518, metrics-wmt_ende_tokens_32k/accuracy = 0.8182, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.00633413, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.925367, metrics-wmt_ende_tokens_32k/bleu_score = 0.496107, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -0.944883, metrics/accuracy = 0.8182, metrics/accuracy_per_sequence = 0.00633413, metrics/accuracy_top5 = 0.925367, metrics/bleu_score = 0.496107, metrics/neg_log_perplexity = -0.944883


lukaszkaiser avatar lukaszkaiser commented on August 27, 2024

@zxw866: that looks like a very strong model!

@neverdoubt: when you say "=> actual bleu : 23.85 on newstest2013", how do you measure that exactly? Do you use the MOSES scripts, in a recent version? Remember that newstest2014 is often 0.5 BLEU or more higher than '13; could you run on that? There is also the hyphenation-split issue, which can make around a 0.2 difference. We should probably also share the exact BLEU calculation we use somewhere.
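
To illustrate the hyphenation split, here is what the ATAT substitution from above does to a made-up example:

echo "ein state-of-the-art System" | perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g'
# prints: ein state ##AT##-##AT## of ##AT##-##AT## the ##AT##-##AT## art System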

Ah, also, we average the last 20 checkpoints with this script:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/avg_checkpoints.py

Did you try that? Let's get your results to the same level as ours!

But with your hardware, guys, you should try transformer_big too!


neverdoubt avatar neverdoubt commented on August 27, 2024

I used the recent MOSES multi-bleu.perl (https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl).

The actual BLEU scores are computed using the newstest2013.en file (which is the dev set).
I expect 25.8 (as in Table 3) from my base-trained model.
Anyway, I'll try the big model.


zxw866 avatar zxw866 commented on August 27, 2024

Although "metrics-wmt_ende_tokens_32k/bleu_score = 0.496107" is high, the BLEU on newstest2013 is only 20.37 (multi-bleu.perl).
Can this configuration achieve the 25.8 reported in the paper?
Are 140K steps on 8 GPUs enough?


mehmedes avatar mehmedes commented on August 27, 2024

@neverdoubt Did you use newstest2013.en without preprocessing, or did you postprocess the Tensor2Tensor output before BLEU scoring? I think multi-bleu needs the hypothesis and the reference to be tokenized...


zxw866 avatar zxw866 commented on August 27, 2024

The result of data generation in the Walkthrough is about 1400M, which is double the size of the BPE training sets. I'm guessing '_' is used as an independent token in the sentences, which led to the very low loss.
As shown in the data generation process:

[screenshot of the data-generation output omitted]

I wonder if this is a bug?


vthorsteinsson avatar vthorsteinsson commented on August 27, 2024

Was this with the newest version of T2T, i.e. 1.0.8? The one with the separate underscores? It would be nice to get confirmation that those don't necessarily hurt model performance (and may even make it better ;-) )


neverdoubt avatar neverdoubt commented on August 27, 2024

@vthorsteinsson I used 1.0.7 for training (which has the separate '_' issue), but my training data was created with an earlier version (maybe 1.0.2 or 1.0.4).


lukaszkaiser avatar lukaszkaiser commented on August 27, 2024

I added utils/get_ende_bleu.sh in 1.0.9. This is a script that includes the commands we used to get BLEU in the paper. You might need to fix the path to MOSES and to the tokenized newstest2013 there.

Please: could you average your checkpoints with utils/avg_checkpoints.py, then run utils/get_ende_bleu.sh and report back the results? Just to make sure where your models really stand compared to our results, even despite possible tokenization differences. Thanks!


zxw866 avatar zxw866 commented on August 27, 2024

When using the BPE training set, I got 24.64 on newstest2013. That's close to the results in the paper.
Next I will try utils/get_ende_bleu.sh. Thanks!


lukaszkaiser avatar lukaszkaiser commented on August 27, 2024

Just remember that wmt_ende_bpe32k is tokenized, so instead of the tokenizer call in the get_ende_bleu.sh script, do this: perl -ple 's{@@ }{}g' < $decodes_file > $decodes_file.target. Also, did you average checkpoints? Let us know what numbers you get!
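
For illustration, on a made-up BPE token sequence this does:

echo "Frei@@ zeit@@ park" | perl -ple 's{@@ }{}g'
# prints: Freizeitpark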


zxw866 avatar zxw866 commented on August 27, 2024

transformer_base hparams.
110k steps using 1 Titan Xp, then 140k steps using 8 Titan Xp.
I averaged 7 checkpoints.
I removed '@@' using "sed -r 's/(@@ )|(@@ ?$)//g'".
Then I got 24.64 on newstest2013.
I'm guessing the learning rate decay was affected in my experiment.
Next I plan to run the big model.
I really appreciate your help!


lukaszkaiser avatar lukaszkaiser commented on August 27, 2024

That looks reasonable. Did you run get_ende_bleu.sh, especially the "atat" part? That can cost 0.2 or 0.3 BLEU if you forget it.


tobyyouup avatar tobyyouup commented on August 27, 2024

Hi @lukaszkaiser, I have read the discussion above and found that the details of the BLEU calculation in #44 are different, so I want to make sure of a few things:

What format is needed for the decoding file (--decode_from_file=$DECODE_FILE, e.g. newstest2013 or newstest2014)? Do I need to tokenize it and put compounds in ATAT format before feeding it into the decoding process? Do I also need to apply BPE, or should I just use the raw text sentences?


lukaszkaiser avatar lukaszkaiser commented on August 27, 2024

The input file should be detokenized, pure text. (Except if you do BPE, but I suggest trying without it.)
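
For illustration, a decode invocation along the lines of the walkthrough README (a sketch: flag names may vary by version, and the problem/hparams names here are just examples):

# DECODE_FILE contains plain, detokenized source sentences, one per line.
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=wmt_ende_tokens_32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 --eval_steps=0 \
  --decode_from_file=$DECODE_FILE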


tobyyouup avatar tobyyouup commented on August 27, 2024

@lukaszkaiser When I feed pure text for decoding and do checkpoint averaging, I get a BLEU score of 26 with the base configuration model.


lukaszkaiser avatar lukaszkaiser commented on August 27, 2024

That sounds reasonable. I'm closing this issue for now, as it's gotten long and tokenization changed in 1.0.11. I hope things are OK now, but please re-open or file a new issue if you see these problems again!
