kimiyoung / transformer-xl
License: Apache License 2.0
- data : ../data/wikitext-103/
- dataset : wt103
- n_layer : 16
- n_head : 10
- d_head : 41
- d_embed : 410
- d_model : 410
- d_inner : 2100
- dropout : 0.1
- dropatt : 0.0
- init : normal
- emb_init : normal
- init_range : 0.1
- emb_init_range : 0.01
- init_std : 0.02
- proj_init_std : 0.01
- optim : adam
- lr : 0.00025
- mom : 0.0
- scheduler : cosine
- warmup_step : 0
- decay_rate : 0.5
- lr_min : 0.0
- clip : 0.25
- clip_nonemb : False
- max_step : 200000
- batch_size : 60
- batch_chunk : 1
- tgt_len : 150
- eval_tgt_len : 150
- ext_len : 0
- mem_len : 0
- not_tied : False
- seed : 1111
- cuda : True
- adaptive : True
- div_val : 1
- pre_lnorm : False
- varlen : False
- multi_gpu : True
- log_interval : 200
- eval_interval : 4000
- work_dir : wt103_workdir/-wt103/20190121-201645
- restart : False
- restart_dir :
- debug : False
- same_length : False
- attn_type : 2
- clamp_len : -1
- eta_min : 0.0
- gpu0_bsz : 4
- max_eval_steps : -1
- sample_softmax : -1
- patience : 0
- finetune_v2 : False
- finetune_v3 : False
- fp16 : False
- static_loss_scale : 1
- dynamic_loss_scale : False
- tied : True
- n_token : 267735
- n_all_param : 148417118
- n_nonemb_param : 38376800
- clip : 0.25
- eta_min : 0.0
- finetune_v3 : False
- n_layer : 12
- pre_lnorm : False
- n_head : 8
- proj_init_std : 0.01
- emb_init_range : 0.01
- fp16 : False
- n_nonemb_param : 40949760
- scheduler : cosine
- work_dir : enwiki8_task-enwik8/20190127-180347
- batch_size : 22
- debug : False
- dropatt : 0.0
- init_std : 0.02
- lr : 0.00025
- cuda : True
- data : ../data/enwik8/
- emb_init : normal
- ext_len : 0
- sample_softmax : -1
- eval_tgt_len : 128
- restart_dir :
- mom : 0.0
- clamp_len : -1
- max_eval_steps : -1
- batch_chunk : 1
- multi_gpu : True
- mem_len : 512
- dynamic_loss_scale : False
- d_embed : 512
- max_step : 400000
- attn_type : 0
- lr_min : 0.0
- static_loss_scale : 1
- init : normal
- patience : 0
- dropout : 0.1
- finetune_v2 : False
- d_head : 64
- same_length : False
- dataset : enwik8
- init_range : 0.1
- d_model : 512
- tgt_len : 512
- optim : adam
- d_inner : 2048
- warmup_step : 0
- restart : False
- seed : 1111
- adaptive : False
- n_token : 204
- log_interval : 200
- varlen : False
- tied : True
- clip_nonemb : False
- decay_rate : 0.5
- div_val : 1
- gpu0_bsz : 4
- not_tied : False
- eval_interval : 4000
- n_all_param : 41055436
====================================================================================================
#params = 41055436
#non emb params = 40949760
Traceback (most recent call last):
File "train.py", line 539, in
train()
File "train.py", line 445, in train
ret = para_model(data, target, *mems)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/ub16c9/ub16_prj/transformer-xl/pytorch/utils/data_parallel.py", line 64, in forward
inputs, kwargs = self.scatter(inputs, kwargs, device_ids)
File "/home/ub16c9/ub16_prj/transformer-xl/pytorch/utils/data_parallel.py", line 80, in scatter
bsz_unit = (bsz - gpu0_bsz) // (num_dev - 1)
ZeroDivisionError: integer division or modulo by zero
ub16c9@ub16c9-gpu:~/ub16_prj/transformer-xl/pytorch$
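For what it's worth, my reading of this traceback (an assumption, not an official fix): the balanced scatter in utils/data_parallel.py gives GPU 0 a batch of gpu0_bsz and splits the rest over the remaining devices, so it needs at least two visible GPUs. With a single GPU, dropping --multi_gpu avoids this path entirely. A rough standalone sketch of the split and of the failing division:

# Minimal sketch of the batch split behind the error (not the repo code verbatim).
def bsz_per_gpu(bsz, gpu0_bsz, num_dev):
    if num_dev < 2:
        # Only one visible device: nothing to balance.  The unguarded version
        # computes (bsz - gpu0_bsz) // (num_dev - 1) here -> ZeroDivisionError.
        return [bsz]
    bsz_unit = (bsz - gpu0_bsz) // (num_dev - 1)
    return [gpu0_bsz] + [bsz_unit] * (num_dev - 1)

print(bsz_per_gpu(60, 4, 1))  # [60]
print(bsz_per_gpu(60, 4, 4))  # [4, 18, 18, 18] (remainder handling omitted)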
What changes do we need to make inside the script to train on a new corpus?
I have checked the script and there are a lot of if conditions that depend on each corpus.
In the past couple of months I've been trying to get a vanilla Transformer to do text generation, more specifically to generate a long sentence starting from a small prime sentence.
This has so far failed, and in multiple discussions it was pointed out that a Transformer only works on a fixed-length context, while text generation is an incomplete-sentence task and therefore not fixed-length.
In your educated guess, is there merit in trying to fit this new architecture to the task of text generation now that the fixed-length problem has been solved?
Hi,
thanks for releasing the TensorFlow and PyTorch code for your Transformer-XL ❤️
I would like to ask if you plan to provide some pre-trained models for the PyTorch implementation? I was only able to find the TensorFlow checkpoints...
Thanks in advance,
Stefan
RuntimeError: CUDA out of memory.
My GPU has 11441 MiB of memory.
How can I reproduce the 128M model?
Thank you
@kimiyoung @zihangdai
I think it would be good to find out Transformer-XL's potential for supporting different languages, so: is there any plan to release multilingual models?
When I use bash sota/enwik8.sh:
Preprocess test set...
Loading cached dataset...
Traceback (most recent call last):
File "data_utils.py", line 586, in
tf.app.run(main)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "data_utils.py", line 382, in main
corpus = get_lm_corpus(FLAGS.data_dir, FLAGS.dataset)
File "data_utils.py", line 345, in get_lm_corpus
corpus = pickle.load(fp)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 153592: ordinal not in range(128)
Run evaluation on test set...
I0304 14:20:51.687984 139744890042112 tf_logging.py:115] n_token 204
Traceback (most recent call last):
File "train_gpu.py", line 475, in
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train_gpu.py", line 471, in main
evaluate(n_token, cutoffs, "/gpu:0")
File "train_gpu.py", line 366, in evaluate
use_tpu=False)
File "/home/gaodihe/PycharmProjects/transformer-xl/tf/data_utils.py", line 424, in get_input_fn
num_core_per_host, use_tpu=use_tpu)
File "/home/gaodihe/PycharmProjects/transformer-xl/tf/data_utils.py", line 415, in load_record_info
with open(record_info_path, "r") as fp:
FileNotFoundError: [Errno 2] No such file or directory: './/pretrained_xl/tf_enwik8/data/tfrecords/record_info-test.bsz-16.tlen-128.json'
this error happens.
How can I fix it?
Thanks
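For the first traceback (the UnicodeDecodeError), my guess at the cause, not confirmed by the authors: the cached corpus pickle was produced under Python 2, so Python 3's default ASCII decoding fails. Deleting the cache so data_utils.py rebuilds it, or loading it with an explicit encoding, is the usual workaround:

import pickle

# Hypothetical workaround sketch; "cache.pkl" is the cached corpus file that
# get_lm_corpus() loads (the exact path may differ in your setup).
with open("../data/enwik8/cache.pkl", "rb") as fp:
    corpus = pickle.load(fp, encoding="latin1")  # or simply delete the file and re-run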
Hello, I have read the paper but I'm confused about the relative attention score. I don't know why transpose(u) and transpose(v) were defined separately, as they seem to have the same meaning. Is there anything I should consider?
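For reference, my reading of the paper (not an authoritative answer): the relative attention score between query position i and key position j decomposes as

A^{\mathrm{rel}}_{i,j}
  = \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{(a)}
  + \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{(b)}
  + \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{(c)}
  + \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{(d)}

u and v both stand in for the query-side factor E_{x_i}^{\top} W_q^{\top} once the query's absolute position is dropped, but they are kept as two separate learnable vectors because u is paired with the content keys W_{k,E} E_{x_j} in (c) while v is paired with the positional keys W_{k,R} R_{i-j} in (d), so they are trained to play different roles despite the symmetric-looking notation.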
Hello, very nice work, and thank you for sharing the source.
I was looking at the PyTorch implementation and I was wondering how you are able to make multi-GPU training work with sparse updates, especially when fp16 is activated, because neither sparse/fp16 nor sparse/distributed is currently implemented in PyTorch. My feeling is that in the current code you have an optimizer that synchronizes the parameters across GPUs as expected, but the sparse updates are never synchronized, which should result in slightly different models in each process. Or maybe I am missing something?
Thank you
Hi,
Thanks for this great piece of work (research and code), it's very impressive!
I am wondering why the PyTorch version has the additional parameter ext_len
which doesn't seem to be used in the TensorFlow version.
I am not sure where it is going wrong, but I have been training enwik8 with your TensorFlow code and default parameters on 4 GPUs for 4 days, and the loss never drops below 4.2 while the learning rate has already dropped to 0.000001. Are there any special tricks needed to replicate the experiment?
Thanks.
P.S. I am using Python 3 and TensorFlow 1.11.0. I have not tried the other 3 datasets yet. I also tried Transformer-XL on a private dataset (where a single-layer word-level LSTM can achieve around 60%+ accuracy), and its loss also never drops below 4.2 and its accuracy never goes above 15%.
Hi,
How do I create a TFRecord for a new sentence and get its perplexity using the evaluate function?
Thanks.
I am trying to train with the 1 Billion Word corpus on a Tesla P40. The following values are being used:
N_LAYER = 12
D_MODEL = 512
D_EMBED = 512
D_INNER = 2048
D_HEAD = 64
I also tried with a BSZ of 128, but it still gives an OOM error.
Hello,
Thanks for the PyTorch version of Transformer-XL. I trained a model on my own corpus and it ran smoothly, but I can't seem to load the model back from the checkpoint. I tried loading it the same way as in the eval.py script. Printing the model gives an attribute error:
model = torch.load('model.pt')
print(model)
AttributeError: 'NoneType' object has no attribute 'size'
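For reference, the loading pattern I believe eval.py uses (a sketch under my assumptions, not a diagnosis of the AttributeError): torch.load() unpickles the whole module object, so the class definition has to be importable and a map_location is needed if the checkpoint was saved on GPU but loaded on a CPU-only machine.

import torch
import mem_transformer  # noqa: F401  (MemTransformerLM must be importable to unpickle the checkpoint)

with open('model.pt', 'rb') as f:
    model = torch.load(f, map_location='cpu')
model.eval()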
Love the work and really glad to see getdata.sh was useful and that you extended it! Wrangling datasets is never the fun part ;)
Can you clarify the license for the released code by adding a LICENSE file?
So after I have trained this model and replicated it, how could I go about getting the sentence-level log probability of a given sentence?
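Not an official recipe, just how I would sketch it on the PyTorch side, assuming the cached Corpus/Vocab from data_utils.py (with its tokenize/convert_to_tensor helpers) and a loaded MemTransformerLM: the model's forward already returns per-token negative log-likelihoods, so the sentence log probability is their negated sum.

import torch

sent = "the sentence to score"
syms = corpus.vocab.tokenize(sent, add_eos=True)   # corpus: the cached Corpus object
ids = corpus.vocab.convert_to_tensor(syms)

data = ids[:-1].unsqueeze(1)    # [seq_len, bsz=1] inputs
target = ids[1:].unsqueeze(1)   # next-token targets

with torch.no_grad():
    ret = model(data, target, *tuple())  # forward returns [per-token NLL, *new_mems]
log_prob = -ret[0].sum().item()          # sentence-level log probability
ppl = torch.exp(ret[0].mean()).item()
print(log_prob, ppl)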
Thank you for releasing such awesome and easy-to-use code!
Could you please elaborate a little on the PyTorch implementation's multi-GPU setup? More concretely, what does the parameter "gpu0_bsz" mean, and what parameters should I change to scale this code to setups with more (or fewer) than 4 GPUs?
From the description it seems that "gpu0_bsz" is the batch size for GPU 0, but it is not clear to me why it should differ from the batch sizes on the other GPUs.
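My understanding (not from the authors): gpu0_bsz is the slice of the batch placed on GPU 0. In DataParallel-style training GPU 0 also gathers the outputs and holds extra bookkeeping, so it runs out of memory first; giving it a smaller share lets the other GPUs use a larger per-GPU batch. Roughly, with the wt103 defaults:

# Rough arithmetic of the balanced split (my reading, remainder handling omitted):
batch_size, gpu0_bsz, n_gpu = 60, 4, 4
per_other_gpu = (batch_size - gpu0_bsz) // (n_gpu - 1)   # 18
print([gpu0_bsz] + [per_other_gpu] * (n_gpu - 1))        # [4, 18, 18, 18]
# To scale to a different number of GPUs, keep batch_size and gpu0_bsz such that
# (batch_size - gpu0_bsz) divides reasonably over (n_gpu - 1), and lower gpu0_bsz
# (or raise batch_chunk) until GPU 0 no longer runs out of memory.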
Hi there,
So nice that you released the original code. It may be a little difficult for me to reproduce, though :(
After nearly 1.5 days of matching your paper and code, I still have some questions about the model structure; hope you can help (maybe some of them are foolish).
What's the difference between RelLearnableMultiHeadAttn and RelPartialLearnableMultiHeadAttn? It seems the most important part is the construction of the (A+B+C+D) terms, but the first one doesn't use the positional embedding from "Attention Is All You Need"?
Can you explain the function _rel_shift in detail for me, especially the first ~4 lines of code? I don't know why we need this (see the sketch after this post).
What happens when the param div_val > 1, and what's the meaning of the cutoff_xxx values? More specifically, I think what we need is the part of the code where div_val == 1.
Hope you can help me, thanks.
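For the _rel_shift question above, here is my 2-D paraphrase of the pad/reshape/slice trick (a sketch, not the repo code verbatim; the real version also carries the batch and head dimensions):

import torch

def rel_shift(x):
    # x[i, j] holds q_i . R_k with the relative embeddings stacked from the
    # largest distance (klen - 1) down to 0.  The pad/reshape/slice below moves
    # row i left by (qlen - 1 - i) positions, so afterwards entry (i, j) lines
    # up with relative distance (mlen + i - j); the spilled-over entries past
    # the causal mask are leftovers that get masked later.
    qlen, klen = x.size()
    zero_pad = torch.zeros(qlen, 1, dtype=x.dtype)
    x_padded = torch.cat([zero_pad, x], dim=1)   # [qlen, klen + 1]
    x_padded = x_padded.view(klen + 1, qlen)     # re-read the same buffer "diagonally"
    return x_padded[1:].view(qlen, klen)         # drop one row, reshape back

x = torch.arange(12.).view(3, 4)   # qlen=3, klen=4 toy scores
print(rel_shift(x))                # row 0 shifted the most, the last row not at all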
Hi, I'm getting NaN values in the first forward pass of the model (in the first layer), generally caused by the first AC calculation. I'm wondering if this is an issue with the initial weights of the model? If so, any advice to help with this issue? I have made some changes to the model, and this will help me determine whether this is a known issue or whether I have introduced a bug. Thanks.
Thank you for your transformer code!
When I ran the code, I encountered this issue:
Traceback (most recent call last):
File "train.py", line 539, in
train()
File "train.py", line 451, in train
loss.backward()
File "/data1/baiye/miniconda3/envs/torch04py3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/data1/baiye/miniconda3/envs/torch04py3/lib/python3.6/site-packages/torch/autograd/init.py", line 89, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
My environment is PyTorch 0.4. I checked the code and did not find any in-place operation.
Good work! I have two questions about your TF code.
The first:
In the paper, the query vector is calculated using the previous layer's hidden state rather than the concatenation of the previous layer's memory and hidden state. However, in the TF code, I found that the query vector is calculated in the same way as the key and value vectors.
The second:
Each layer has a memory tensor with shape [mem_len, batch_size, d_model]. When calculating the query, key and value vectors, the input of tf.layers.dense is the concatenation of the current layer's memory and the previous layer's output, which seems to conflict with the paper. Besides, why stop the gradient in the _cache_mem method rather than in rel_multihead_attn? The latter seems to make better sense.
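On the first question, my understanding (a sketch, not the repo code): q, k and v are all projected from the concatenation for convenience, but the query is then sliced back to the current segment, and because the projection is a position-wise linear map, slicing after projecting equals projecting only the current hidden states, which matches the paper.

import torch
import torch.nn as nn

d_model, qlen, mlen, bsz = 8, 4, 6, 2
proj = nn.Linear(d_model, d_model, bias=False)

mems = torch.randn(mlen, bsz, d_model)
hidden = torch.randn(qlen, bsz, d_model)
cat = torch.cat([mems, hidden], dim=0)

q_from_cat = proj(cat)[-qlen:]   # what the code does (project, then slice)
q_direct = proj(hidden)          # what the paper writes (project only the segment)
print(torch.allclose(q_from_cat, q_direct, atol=1e-6))  # True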
Hello!
Could you please provide hyperparameters for training models with close-to-SOTA perplexity on PTB and WT2 (if you experimented with the latter, as it has a corresponding choice in the data utils)? Am I right that the two changes I need to make to the released code are adding variational dropout and the ASGD optimizer? If you have code which produces the necessary changes, that would be great.
Thanks
Can you include a simple script for generating text with a pretrained Transformer-XL language model? We are primarily using the PyTorch codebase, but I am sure TensorFlow users would also appreciate this example.
If including this script is outside the scope of the project repository, could an informal example be provided in this issue thread?
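No such script ships with the repo as far as I can tell, so here is only an informal sketch: an autoregressive sampling loop that carries the Transformer-XL memory between steps. It assumes a wrapper lm(tokens, mems) -> (logits_for_last_position, new_mems), which you would have to write yourself on top of MemTransformerLM, whose forward returns the loss and new memories rather than logits.

import torch

@torch.no_grad()
def sample(lm, prime_ids, steps, temperature=1.0):
    # lm is the assumed wrapper described above; prime_ids is a list of token ids.
    mems = None
    tokens = torch.tensor(prime_ids).unsqueeze(1)   # [len, bsz=1]
    out = list(prime_ids)
    for _ in range(steps):
        logits, mems = lm(tokens, mems)             # assumed interface
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1).item()
        out.append(nxt)
        tokens = torch.tensor([[nxt]])              # feed only the new token; mems keep the context
    return out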
I noticed that there are two flags in the PyTorch train.py script (--finetune_v2 and --finetune_v3) that don't seem to be used anywhere in the code. These flags suggest that there might be something special I need to do for fine-tuning Transformer-XL. Might I be missing something?
Currently, I am running fine-tuning experiments simply by specifying --restart as well as --restart_dir and changing the dataset.
Hi,
Does anyone know the function of the parameters 'bin_sizes' and 'cutoffs' used for the lm1b model?
Thanks for your help
First of all, thanks for providing the training scripts.
Is there any script available to do fine-tuning for sentence classification?
In the invocation of _update_mems(self, hids, mems, qlen, mlen), the swap of parameters seems to be a typo?
transformer-xl/pytorch/mem_transformer.py, line 733 in 44781ed
transformer-xl/pytorch/mem_transformer.py, line 619 in 44781ed
Thank you for such easy-to-read code and repo; it can be seen that a lot of hard work has gone into it! Secondly, I found your work through Sebastian Ruder's NLP newsletter, and as he put it: "Peer review is an imprecise process and gems may sometimes fall through the cracks." Your work was one of the gems, and I totally agree!
Now specifically, I tried using wt103 in Tensor2Tensor and I'm getting this error:
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key transformer/body/decoder/layer_0/ffn/conv1/bias not found in checkpoint
[[node save/RestoreV2_1 (defined at /home/ubuntu/tensor2tensor/venv/lib/python3.5/site-packages/tensor2tensor/utils/decoding.py:586) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]
I suppose it comes from the wrong hparams I am using?
@registry.register_hparams
def transformer_xl():
  """Hparams for transformer-xl."""
  hparams = transformer.transformer_base()
  hparams.batch_size = 2048
  hparams.hidden_size = 4096
  hparams.filter_size = 3072
  hparams.num_hidden_layers = 18
  hparams.num_heads = 16
  hparams.max_length = 1024
  hparams.eval_drop_long_sequences = True
  return hparams
Thanks for the code! I am sure my question will be asked over and over again in the near future. I have also read your paper, which is all about the comparison against the vanilla Transformer.
But still, in terms of performance, have you compared your model against BERT? I mean, it may not be a 100% fair comparison. But at the end of the day... which one (BERT or Transformer-XL) is better on typical NLP tasks? Thanks.
Hence, the model lacks necessary contextual information needed to well predict the first few symbols.
So the backward-direction part of the model cannot do this: predict the first few symbols?
Hi,
I am reading the code of the model, but I cannot find which part is the encoder and which is the decoder. So I want to know: is Transformer-XL more like a word-embedding layer, as in BERT, rather than a seq2seq model? Thanks a lot.
Hi,
I would like to train a model on TPU, but I'm not able to find the correct settings for a v2-8 TPU.
What parameters are needed for NUM_HOST and NUM_CORE? I tried different values, but I always get "num_replicas should be (8), got (XXX)." error messages.
What TPU model did you use for the 1 Billion word benchmark?
Can I create the tfrecords locally (on a non-TPU machine) in the train_data step?
Thanks :)
It seems VLIDA_BSZ should be replaced with VALID_BSZ in a couple of places. For example:
Do you have any plans to release the trained models listed in the paper?
Hello!
Can you please provide the bash script for training Transformer-XL on the PTB dataset with PyTorch?
Thanks!
Hi, thanks for your excellent work. Transformer-XL is the most elegant model for long sequences right now. Do you plan to fine-tune the pretrained models for document classification, just like BERT?
Is the fast prediction speed due to state reuse? Is the comparison between two segments of seq_len 32 with state reuse and one segment of seq_len 64 without state reuse?
Thank you very much!
@kimiyoung
@zihangdai
--dataset=enwik8 is passed, but the dataset name is lm1b.
When I set the dataset parameter to lm1b and run sh sota/lm1b.sh, I get this error: Assign requires shapes of both tensors to match. lhs shape= [153470,20] rhs shape= [153472,20]
[[node save/Assign_5 (defined at /Users/path/transformer-xl/tf/train_gpu.py:419) ]]
With this config: python train.py --cuda --data ../data/one-billion-words/ --dataset lm1b --adaptive --n_layer 18 --d_model 1024 --div_val 4 --n_head 8 --d_head 128 --d_inner 4096 --dropout 0.0 --dropatt 0.0 --optim adam --log-interval 5 --eval-interval 20 --warmup_step 20000 --max_step 500000 --lr 0.00025 --tgt_len 32 --mem_len 32 --eval_tgt_len 32 --batch_size 240 --batch_chunk 8 --work_dir exps, running on only one GPU, do you think it is possible to achieve similar results as in the paper?
Hi,
I observed values to be slightly different when evaluating the perplexity of a set of sentences with batch_size = 1 versus looping through the sentences one by one (all other parameters being the same).
The difference in loss is 0.7707 in one case vs 0.7564 in the other.
I created the data iterator using dataset="lm1b".
Note: I modified corpus.vocab.encode_file to encode the input sentence instead of reading from a file.
Is there any particular reason why this is observed?
I use the code here: https://github.com/huggingface/pytorch-pretrained-BERT
I run run_classifier.py with bert-base-uncased and max_seq_length=128 on the MRPC task.
The log:
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
02/21/2019 12:11:44 - INFO - __main__ - device: cpu n_gpu: 1, distributed training: False, 16-bits training: False
02/21/2019 12:11:45 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /home/tong.guo/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
02/21/2019 12:11:45 - INFO - pytorch_pretrained_bert.modeling - loading archive file ../model_file/bert-base-uncased.tar.gz
02/21/2019 12:11:45 - INFO - pytorch_pretrained_bert.modeling - extracting archive file ../model_file/bert-base-uncased.tar.gz to temp dir /tmp/tmpaho9_3dk
02/21/2019 12:11:50 - INFO - pytorch_pretrained_bert.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}
02/21/2019 12:11:55 - INFO - pytorch_pretrained_bert.modeling - Weights of BertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
02/21/2019 12:11:55 - INFO - pytorch_pretrained_bert.modeling - Weights from pretrained model not used in BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
02/21/2019 12:11:55 - INFO - pytorch_pretrained_bert.modeling - loading archive file ../model_file/bert-base-uncased.tar.gz
02/21/2019 12:11:55 - INFO - pytorch_pretrained_bert.modeling - extracting archive file ../model_file/bert-base-uncased.tar.gz to temp dir /tmp/tmpfehb71wu
02/21/2019 12:11:59 - INFO - pytorch_pretrained_bert.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}
02/21/2019 12:12:03 - INFO - pytorch_pretrained_bert.modeling - Weights of BertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
02/21/2019 12:12:03 - INFO - pytorch_pretrained_bert.modeling - Weights from pretrained model not used in BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
02/21/2019 12:12:03 - INFO - __main__ - *** Example ***
02/21/2019 12:12:03 - INFO - __main__ - guid: dev-1
02/21/2019 12:12:03 - INFO - __main__ - tokens: [CLS] [UNK] ' s chief operating officer , [UNK] [UNK] , and [UNK] [UNK] , the chief financial officer , will report directly to [UNK] [UNK] . [SEP] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] and [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] will report to [UNK] . [SEP]
02/21/2019 12:12:03 - INFO - __main__ - input_ids: 101 100 1005 1055 2708 4082 2961 1010 100 100 1010 1998 100 100 1010 1996 2708 3361 2961 1010 2097 3189 3495 2000 100 100 1012 102 100 100 100 100 100 100 1998 100 100 100 100 100 100 2097 3189 2000 100 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - label: 1 (id = 1)
02/21/2019 12:12:03 - INFO - __main__ - *** Example ***
02/21/2019 12:12:03 - INFO - __main__ - guid: dev-2
02/21/2019 12:12:03 - INFO - __main__ - tokens: [CLS] [UNK] world ' s two largest auto ##makers said their [UNK] . [UNK] . sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected . [SEP] [UNK] sales at both [UNK] and [UNK] . 2 [UNK] [UNK] [UNK] . declined more than predicted as a late summer sales frenzy prompted a larger - than - expected industry backlash . [SEP]
02/21/2019 12:12:03 - INFO - __main__ - input_ids: 101 100 2088 1005 1055 2048 2922 8285 12088 2056 2037 100 1012 100 1012 4341 6430 2062 2084 10173 2197 3204 2004 1037 2397 2621 4341 21517 3303 2062 1997 2019 3068 25748 2084 3517 1012 102 100 4341 2012 2119 100 1998 100 1012 1016 100 100 100 1012 6430 2062 2084 10173 2004 1037 2397 2621 4341 21517 9469 1037 3469 1011 2084 1011 3517 3068 25748 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - label: 1 (id = 1)
02/21/2019 12:12:03 - INFO - __main__ - *** Example ***
02/21/2019 12:12:03 - INFO - __main__ - guid: dev-3
02/21/2019 12:12:03 - INFO - __main__ - tokens: [CLS] [UNK] to the federal [UNK] for [UNK] [UNK] and [UNK] ( news - web sites ) , there were 19 reported cases of me ##as ##les in the [UNK] [UNK] in 2002 . [SEP] [UNK] [UNK] for [UNK] [UNK] and [UNK] said there were 19 reported cases of me ##as ##les in the [UNK] [UNK] in 2002 . [SEP]
02/21/2019 12:12:03 - INFO - __main__ - input_ids: 101 100 2000 1996 2976 100 2005 100 100 1998 100 1006 2739 1011 4773 4573 1007 1010 2045 2020 2539 2988 3572 1997 2033 3022 4244 1999 1996 100 100 1999 2526 1012 102 100 100 2005 100 100 1998 100 2056 2045 2020 2539 2988 3572 1997 2033 3022 4244 1999 1996 100 100 1999 2526 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - label: 1 (id = 1)
02/21/2019 12:12:03 - INFO - __main__ - *** Example ***
02/21/2019 12:12:03 - INFO - __main__ - guid: dev-4
02/21/2019 12:12:03 - INFO - __main__ - tokens: [CLS] [UNK] tropical storm rapidly developed in the [UNK] of [UNK] [UNK] and was expected to hit somewhere along the [UNK] or [UNK] coasts by [UNK] night . [SEP] [UNK] tropical storm rapidly developed in the [UNK] of [UNK] on [UNK] and could have hurricane - force winds when it hits land somewhere along the [UNK] coast [UNK] night . [SEP]
02/21/2019 12:12:03 - INFO - __main__ - input_ids: 101 100 5133 4040 5901 2764 1999 1996 100 1997 100 100 1998 2001 3517 2000 2718 4873 2247 1996 100 2030 100 20266 2011 100 2305 1012 102 100 5133 4040 5901 2764 1999 1996 100 1997 100 2006 100 1998 2071 2031 7064 1011 2486 7266 2043 2009 4978 2455 4873 2247 1996 100 3023 100 2305 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - label: 0 (id = 0)
02/21/2019 12:12:03 - INFO - __main__ - *** Example ***
02/21/2019 12:12:03 - INFO - __main__ - guid: dev-5
02/21/2019 12:12:03 - INFO - __main__ - tokens: [CLS] [UNK] company didn ' t detail the costs of the replacement and repairs . [SEP] [UNK] company officials expect the costs of the replacement work to run into the millions of dollars . [SEP]
02/21/2019 12:12:03 - INFO - __main__ - input_ids: 101 100 2194 2134 1005 1056 6987 1996 5366 1997 1996 6110 1998 10315 1012 102 100 2194 4584 5987 1996 5366 1997 1996 6110 2147 2000 2448 2046 1996 8817 1997 6363 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
02/21/2019 12:12:03 - INFO - __main__ - label: 0 (id = 0)
02/21/2019 12:12:04 - INFO - __main__ - ***** Running evaluation *****
02/21/2019 12:12:04 - INFO - __main__ - Num examples = 1725
02/21/2019 12:12:04 - INFO - __main__ - Batch size = 8
Evaluating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 216/216 [06:06<00:00, 1.70s/it]
02/21/2019 12:18:11 - INFO - __main__ - ***** Eval results *****
02/21/2019 12:18:11 - INFO - __main__ - eval_accuracy = 0.33507246376811595
02/21/2019 12:18:11 - INFO - __main__ - eval_loss = 1.002936492777533
02/21/2019 12:18:11 - INFO - __main__ - global_step = 0
02/21/2019 12:18:11 - INFO - __main__ - loss = None
The speed is about 1.7 s/batch
I run run_transfo_xl.py on the wikitext-103 task.
The log:
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
02/20/2019 19:49:30 - INFO - __main__ - device: cuda
02/20/2019 19:49:30 - INFO - pytorch_pretrained_bert.tokenization_transfo_xl - loading vocabulary file ../model_file/transfo-xl-wt103-vocab.bin
02/20/2019 19:49:30 - INFO - pytorch_pretrained_bert.tokenization_transfo_xl - loading corpus file ../model_file/transfo-xl-wt103-corpus.bin
02/20/2019 19:49:36 - INFO - pytorch_pretrained_bert.modeling_transfo_xl - loading weights file ../model_file/transfo-xl-wt103-pytorch_model.bin
02/20/2019 19:49:36 - INFO - pytorch_pretrained_bert.modeling_transfo_xl - loading configuration file ../model_file/transfo-xl-wt103-config.json
02/20/2019 19:49:36 - INFO - pytorch_pretrained_bert.modeling_transfo_xl - Model config {
"adaptive": true,
"attn_type": 0,
"clamp_len": 1000,
"cutoffs": [
20000,
40000,
200000
],
"d_embed": 1024,
"d_head": 64,
"d_inner": 4096,
"d_model": 1024,
"div_val": 4,
"dropatt": 0.0,
"dropout": 0.1,
"ext_len": 0,
"init": "normal",
"init_range": 0.01,
"init_std": 0.02,
"mem_len": 1600,
"n_head": 16,
"n_layer": 18,
"n_token": 267735,
"pre_lnorm": false,
"proj_init_std": 0.01,
"same_length": true,
"sample_softmax": -1,
"tgt_len": 128,
"tie_projs": [
false,
true,
true,
true
],
"tie_weight": true,
"untie_r": true
}
02/20/2019 19:49:51 - INFO - __main__ - Evaluating with bsz 10 tgt_len 128 ext_len 0 mem_len 1600 clamp_len 1000
02/20/2019 19:57:35 - INFO - __main__ - Time : 464.00s, 2416.66ms/segment
02/20/2019 19:57:35 - INFO - __main__ - ====================================================================================================
02/20/2019 19:57:35 - INFO - __main__ - | test loss 2.90 | test ppl 18.213
02/20/2019 19:57:35 - INFO - __main__ - ====================================================================================================
The speed is about 2.4 s/batch
When I read your TF code, I am really confused about the code below:
rw_head_q = w_head_q + r_w_bias
rr_head_q = w_head_q + r_r_bias
AC = tf.einsum('ibnd,jbnd->ijbn', rw_head_q, w_head_k)
BD = tf.einsum('ibnd,jnd->ijbn', rr_head_q, r_head_k)
BD = rel_shift(BD)
Could you tell me what the variables r_w_bias and r_r_bias mean in your paper?
And could you explain this code briefly?
Thanks for your effort
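My reading (not an authoritative answer): r_w_bias is the u vector and r_r_bias is the v vector from the paper's relative attention decomposition, so the two einsums compute the content and position halves of the score,

\mathrm{AC}_{i,j} = (W_q E_{x_i} + u)^{\top} W_{k,E}\, E_{x_j},
\qquad
\mathrm{BD}_{i,j} = (W_q E_{x_i} + v)^{\top} W_{k,R}\, R_{i-j},

where BD is first formed against the stack of relative embeddings for every distance and then realigned by rel_shift so that column j corresponds to the distance between query i and key j.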