namisan / mt-dnn
Multi-Task Deep Neural Networks for Natural Language Understanding
License: MIT License
Hi, I have a couple of questions about the test set numbers; could you please clarify:
What is the test set number of the original single MT-DNN model (non-KD)? The paper says it is 82.7, but the leaderboard says 85.1. The difference seems to be more than just WNLI.
What is the difference between MT-DNN-KD and MT-DNN-ensemble in Table 3 of https://arxiv.org/pdf/1904.09482.pdf?
If I first fine-tune BERT on task_1, then fine-tune the same BERT on task_2, and finally get a BERT fine-tuned on task_1 & task_2, is that the same as mt-dnn on task_1 & task_2?
Looking forward to an answer.
In the arXiv paper it is stated:
In the multi-task fine-tuning stage, we use minibatch-based stochastic gradient descent (SGD) to learn the parameters of our model (i.e., the parameters of all shared layers and task-specific layers) as shown in Algorithm 1. In each epoch, a mini-batch b_t is selected (e.g., among all 9 GLUE tasks), and the model is updated according to the task-specific objective for the task t. This approximately optimizes the sum of all multi-task objectives.
If I understand it correctly, in your code this multi-task fine-tuning stage is called MTL refinement. Then why do you fine-tune each task in a single-task setting in your fine-tuning stage? There is no such stage in the original paper.
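For reference, the Algorithm 1 loop quoted above can be sketched in plain Python; `multitask_epoch` and `update_fn` are illustrative names, not functions from this repo:

```python
import random

def multitask_epoch(task_batches, update_fn, seed=0):
    """Pool every task's mini-batches, shuffle, and take one step per batch."""
    schedule = [(task, b) for task, batches in task_batches.items() for b in batches]
    random.Random(seed).shuffle(schedule)  # each step draws a mini-batch b_t for some task t
    for task, batch in schedule:
        update_fn(task, batch)  # update w.r.t. that task's objective only
    return len(schedule)

# Toy usage: count how often each task's objective is touched in one epoch.
counts = {}
steps = multitask_epoch(
    {"mnli": [0, 1, 2], "rte": [0]},
    lambda task, batch: counts.__setitem__(task, counts.get(task, 0) + 1),
)
```

Summing per-task updates this way is what "approximately optimizes the sum of all multi-task objectives" in the quoted passage.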
Also, in run_mt_dnn.sh there are the lines:
train_datasets="mnli,rte,qqp,qnli,mrpc,sst,cola,stsb" test_datasets="mnli_matched,mnli_mismatched,rte"
Why do you only test on mnli and rte, and not on all the other tasks?
I would also like to ask whether I can switch from BERT-large to BERT-base there, because I only have one GTX 1080 card.
Thank you.
In line 45 of mt_dnn/model.py,
self.optimizer = optim.sgd(parameters, opt['learning_rate'],
shouldn't parameters be optimizer_parameters?
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Namespace(answer_att_hidden_size=128, answer_att_type='bilinear', answer_dropout_p=0.1, answer_mem_drop_p=0.1, answer_mem_type=1, answer_merge_opt=1, answer_num_turn=5, answer_opt=0, answer_rnn_type='gru', answer_sum_att_type='bilinear', answer_weight_norm_on=False, batch_size=8, batch_size_eval=8, bert_dropout_p=0.1, bert_l2norm=0.0, cuda=True, data_dir='data/mt_dnn', data_sort_on=False, dropout_p=0.1, dropout_w=0.0, dump_state_on=False, ema_gamma=0.995, ema_opt=0, embedding_opt=0, epochs=5, freeze_layers=-1, global_grad_clipping=1.0, grad_clipping=0, have_lr_scheduler=True, init_checkpoint='mt_dnn_models/bert_model_base.pt', init_ratio=1, label_size='3', learning_rate=5e-05, log_file='mt-dnn-train.log', log_per_updates=500, lr_gamma=0.5, max_seq_len=512, mem_cum_type='simple', mix_opt=0, momentum=0, mtl_opt=0, multi_gpu_on=False, multi_step_lr='10,20,30', name='farmer', optimizer='adamax', output_dir='checkpoint', pw_tasks=['qnnli'], ratio=0, scheduler_type='ms', seed=2018, task_config_path='configs/tasks_config.json', test_datasets=['mnli_mismatched', 'mnli_matched'], train_datasets=['mnli'], update_bert_opt=0, vb_dropout=True, warmup=0.1, warmup_schedule='warmup_linear', weight_decay=0)
07/01/2019 01:37:37 0
07/01/2019 01:37:37 Launching the MT-DNN training
07/01/2019 01:37:37 Loading data/mt_dnn/mnli_train.json as task 0
Traceback (most recent call last):
File "train.py", line 350, in
main()
File "train.py", line 178, in main
train_data = BatchGen(BatchGen.load(train_path, True, pairwise=pw_task, maxlen=args.max_seq_len),
File "/home/ubuntu/paraphrase/mt-dnn/mt_dnn/batcher.py", line 55, in load
with open(path, 'r', encoding='utf-8') as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'data/mt_dnn/mnli_train.json'
I used the tutorial to train; what does this error mean, and what should the JSON file contain?
When mtl_opt is on (the default), do all GLUE tasks with the same n_class share the same task-specific classifier? Intuitively it seems a little strange that tasks as distinct as SST and QQP use the same classifier.
Another question: answer_opt is set to 0 in all scripts, so when is SAN applied?
Thank you.
Hi, I am having trouble reproducing the result on the QNLI dataset. How did you choose the hyper-parameters?
I tried multi-task learning on the original BERT for seven Chinese tasks, feeding data with both task sampling and data sampling. The results are somewhat confusing. When feeding the model with the task-sampling strategy, just like MT-DNN, the MTL-trained model cannot achieve better results than single-task training. But when I apply data sampling, the MTL-trained model achieves better results than BERT and ERNIE on LCQMC, ChnSentiCorp, and XNLI. Is there any theoretical guidance for MTL on different datasets?
PyTorch 0.4.1 cannot be installed with CUDA 10; it gives an error. How can I proceed?
Please help. Thank you.
As per the current scripts/run_rte.sh, you set answer_opt=0 for rte. I am curious why the SAN module was dropped for fine-tuning the pair tasks, going back to using the pooler instead.
Hello,
I would like to ask you two questions.
1. The gains from multi-task learning presumably depend on how the tasks relate to each other. When does one task improve another, and when does it instead hurt the other's performance? Is there any theoretical analysis of this?
2. For the multiple task objective functions in the paper, did you consider giving them different weights, e.g., final loss = weight1*loss1 + weight2*loss2 + ...?
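The weighted sum asked about in question 2 can be written down directly; `weighted_multitask_loss` and its weights are hypothetical knobs, not something the paper or this repo exposes:

```python
# Hypothetical weighted multi-task objective: loss = w1*loss1 + w2*loss2 + ...
# With no weights given, it reduces to the paper's plain sum of task losses.
def weighted_multitask_loss(losses, weights=None):
    if weights is None:
        weights = [1.0] * len(losses)
    assert len(weights) == len(losses), "one weight per task loss"
    return sum(w * l for w, l in zip(weights, losses))

total = weighted_multitask_loss([0.5, 2.0], weights=[1.0, 0.25])  # 0.5 + 0.5 = 1.0
```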
I just noticed that the performance on WNLI has seen a huge improvement! That's really marvelous. I can't wait to learn more details about it.
Hi,
In PyTorch BERT I used to provide a txt file and run the extract_features.py file to get embeddings; how should I do the same here?
Abhinandan
Thanks for your work; we have a problem with reproduction. We use the released pretrained model "mt_dnn_large.pt" and directly run "run_stsb.sh", "run_rte.sh", and so on; however, we get low accuracy. Is there a problem with the running procedure or the released model, or with something else such as the number of GPUs or the batch size?
Hi,
Is there any online demo interface for mt-dnn? I am working on a task similar to textual entailment, and I would like to see how mt-dnn performs on a very small subset of my dataset.
What changes do we have to make in the train.py file in order to train it for a specific purpose (STS-B)? Some of the changes I have made are the following:
Will there be a TensorFlow version of mt-dnn?
prepro.py uses the tokenizer for bert-base-uncased. Then if we use a large and/or cased model, would there be inconsistencies?
Is there any way to load the model after training and classify based on the task-specific dense layers (I assume a specific dense layer is learned for classification for each task), or are these not stored with the model.pt file and lost? E.g., if one wants to predict classes for samples of an unseen dataset with, let's say, the specific MNLI dense layer.
When the procedure is interrupted accidentally, I want to resume training from the checkpoint in checkpoints/model_*.pt. But I don't know why it produces the following error:
Traceback (most recent call last):
File "../train.py", line 352, in
main()
File "../train.py", line 314, in main
model.update(batch_meta, batch_data)
File "/home/mt-dnn/mt_dnn/model.py", line 171, in update
self.optimizer.step()
File "/home/mt-dnn/module/bert_optim.py", line 120, in step
exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: Expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #4 'other'
I didn't change any code, only hyperparameters. How can I deal with this problem?
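A common remedy for this kind of CPU/CUDA tensor mismatch, assuming (not confirmed by the maintainers) that the checkpoint's optimizer state was loaded onto the CPU while the model lives on the GPU, is to move the state tensors over before resuming:

```python
import torch

def move_optimizer_state(optimizer, device):
    """Move every tensor in the optimizer's state (e.g. exp_avg) to `device`."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)

# After loading the checkpoint one would call, e.g.:
# move_optimizer_state(model.optimizer, torch.device("cuda"))
```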
What preprocessing/changes do we need to make in order to get a prediction on the NLI task for any two user-defined input statements?
On the GLUE leaderboard, QNLI performance reaches 96.0, but when I train the network, dev accuracy only reaches 92.2. I don't have V100 GPUs; my training setting is batch size 8 on 8 TITAN X cards. Can you provide some details of how you trained the network on the QNLI dataset?
Line 25 in 25c3bd1
When running on the cuDNN backend, there may be reproducibility issues.
According to the PyTorch documentation, two further options must be set:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
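Combining those two flags with the usual seeds gives a minimal reproducibility helper (a sketch; exact run-to-run determinism still depends on which ops and hardware are used):

```python
import random
import torch

def set_deterministic(seed=2018):
    """Seed the Python and PyTorch RNGs and force deterministic cuDNN kernels."""
    random.seed(seed)
    torch.manual_seed(seed)  # also seeds the CUDA RNGs when CUDA is present
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```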
Hello,
Do you have a rough timeline for when you might release:
Thank you!
If "ratio" argument is provided, then in each epoch only random_picks=min(len(MNLI task batches) * args.ratio, len(all other tasks batches))
from len(all other tasks batches)
are used, instead of using all other tasks batches. Such behavior seems strange to me, considering the fact that MNLI has most examples throughout the GLUE dataset. Could you please explain, what is the benefit of not using some of the examples from smaller datasets on each epoch?
Thank you very much.
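My reading of the described "ratio" behavior, sketched with illustrative names (`pick_other_batches` is not a function from the repo):

```python
import random

def pick_other_batches(mnli_batches, other_batches, ratio, seed=0):
    """Cap the non-MNLI batches used this epoch at len(mnli_batches) * ratio."""
    n = min(int(len(mnli_batches) * ratio), len(other_batches))
    return random.Random(seed).sample(other_batches, n)

# With 100 MNLI batches and ratio=0.2, only 20 of the 30 other batches survive.
picked = pick_other_batches(list(range(100)), list(range(30)), ratio=0.2)
```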
I'm trying to fine-tune this model on a specific task, but when I run the example script run_rte.sh, multi-GPU training does not work: with CUDA_VISIBLE_DEVICES="0,1,2,3" set, it seems that only the first GPU is working.
I have only four 1080 Ti GPUs; when I run the run_rte.sh script, I have to reduce the batch size to 1 to avoid running out of memory. Is training feasible on 1080 Ti cards?
Is the downloaded mt_dnn_base.pt the same as the model produced by running scripts/run_mt_dnn.sh? If not, what is their relationship?
The new one is: https://github.com/jsalt18-sentence-repl/jiant
The current one is deprecated and will fail to download some of the datasets.
Thanks for your work; we have a question about the stochastic prediction dropout.
def generate_mask(new_data, dropout_p=0.0, is_training=False):
    if not is_training:
        dropout_p = 0.0
    new_data = (1 - dropout_p) * (new_data.zero_() + 1)
    for i in range(new_data.size(0)):
        one = random.randint(0, new_data.size(1) - 1)
        new_data[i][one] = 1
    print(new_data)
    mask = Variable(1.0 / (1 - dropout_p) * torch.bernoulli(new_data), requires_grad=False)
    return mask
When this function is called as above, elements of the mask can be greater than one. I want to know why the result is multiplied by 1.0/(1 - dropout_p), since this causes elements to be greater than one.
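For context, the 1/(1 - dropout_p) factor is the standard "inverted dropout" scaling: it keeps the expected value of each masked element at 1, which is exactly why individual surviving elements exceed one. A pure-Python check of that expectation (illustrative only, not repo code):

```python
import random

def inverted_dropout_mask(n, p, rng):
    """Each element survives with probability (1 - p) and is scaled up to 1/(1 - p)."""
    return [1.0 / (1.0 - p) if rng.random() > p else 0.0 for _ in range(n)]

mask = inverted_dropout_mask(100_000, 0.1, random.Random(0))
mean = sum(mask) / len(mask)  # close to 1.0, so E[mask * x] stays close to x
```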
wget https://mrc.blob.core.windows.net/mt-dnn-model/bert_base_chinese.pt
Note that it is converted from Google's BERT model: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
Hi namisan. How do you convert the BERT model to your format? I am very interested in that. Can you share your conversion method? Thank you very much.
I tried to fine-tune the pretrained mt-dnn model on a specific task, but GPU memory usage keeps growing. After training for 1000 iterations, the memory is fully occupied, causing an 'out of memory' error. The experiment was conducted on a 1080 Ti.
It looks like SciTail does not use SAN by default, because it uses the default value for --answer_opt, which is 0. Why is that?
I'm on commit 25c3bd1
05/27/2019 03:38:53 Loaded 5463 QNLI test samples
Traceback (most recent call last):
File "prepro.py", line 352, in
main(args)
File "prepro.py", line 193, in main
qnnli_train_data = load_qnnli(qnli_train_path, GLOBAL_MAP['qnli'])
File "/home/markus/mt-dnn/data_utils/glue_utils.py", line 113, in load_qnnli
assert len(lines) % 2 == 0
AssertionError
Dear,
While an nn.ModuleList() is used for the scoring_list, this is not the case for the dropout_list:
Lines 16 to 24 in eb0aef4
This seems to have an impact when switching the network mode in class MTDNNModel from training to evaluation: the scoring_list is correctly switched to evaluation mode, but the dropout_list is not. As a consequence, dropout is active at prediction time.
Jan Luts
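The reported effect is easy to reproduce in isolation; the `Demo` module below is a minimal hypothetical example, not code from this repo:

```python
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.registered = nn.ModuleList([nn.Dropout(0.1)])  # tracked by .eval()
        self.plain = [nn.Dropout(0.1)]                      # invisible to .eval()

net = Demo()
net.eval()
# The registered dropout switches to eval mode; the plain-list one stays in
# training mode, so it would keep dropping activations at prediction time.
```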
Hi,
Thanks for the repo and your model, and thanks for your time.
I have an intro question about the model.
This model takes a pretrained BERT model and then jointly fine-tunes it on four task types: single-sentence classification, pairwise text similarity, pairwise text classification, and pairwise ranking.
For pairwise text classification it attaches an additional stochastic answer network to the head, while for the other three tasks it just attaches each task's specific loss function.
The whole network is trained end to end.
Is this correct?
Thanks for the help and God bless!
While printing the values of test_predictions for the STS-B task, I was getting 0 in all cases, so I printed the scores. I thought the scores ranged from 0 to 5, with 5 meaning most similar, but when I printed them I found that some of the similarity scores were negative. Can you clarify the range of the similarity score?
Another problem: the test results change if I run the model again (I am not training the model, just predicting on the test set). I thought it was due to dropout, so I set dropout to 0, but I am still getting different results every time I run it.
I even saw that you use self.network.eval(), which puts the network in testing mode, so even if there is a dropout layer it should be ignored.
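One way to check whether some dropout module silently stayed in training mode (a guess at the cause, not a confirmed diagnosis) is to list the submodules whose training flag is still set:

```python
import torch.nn as nn

def modules_still_training(model):
    """Return the names of submodules whose training flag is still True."""
    return [name for name, module in model.named_modules() if module.training]

# Usage sketch: after self.network.eval(), this list should be empty.
# leftovers = modules_still_training(self.network)
```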
Hello,
I have trained mt-dnn, initialized with mt_dnn_large.pt, on some new input with the following parameters:
`{'log_file': 'checkpoints/mt-dnn-NL-labels_adamax_answer_opt1_gc0_ggc1_2019-04-15T2138/log.log', 'init_checkpoint': '../mt_dnn_models/mt_dnn_large.pt', 'data_dir': '../data/mt_dnn', 'data_sort_on': False, 'name': 'farmer', 'train_datasets': ['mnli', 'rte', 'qnli'], 'test_datasets': ['mnli_matched', 'mnli_mismatched', 'rte'], 'pw_tasks': ['qnnli'], 'update_bert_opt': 0, 'multi_gpu_on': True, 'mem_cum_type': 'simple', 'answer_num_turn': 5, 'answer_mem_drop_p': 0.1, 'answer_att_hidden_size': 128, 'answer_att_type': 'bilinear', 'answer_rnn_type': 'gru', 'answer_sum_att_type': 'bilinear', 'answer_merge_opt': 1, 'answer_mem_type': 1, 'answer_dropout_p': 0.1, 'answer_weight_norm_on': False, 'dump_state_on': False, 'answer_opt': [1, 1, 1], 'label_size': '3,2,2', 'mtl_opt': 0, 'ratio': 0, 'mix_opt': 0, 'max_seq_len': 512, 'init_ratio': 1, 'cuda': True, 'log_per_updates': 500, 'epochs': 5, 'batch_size': 16, 'batch_size_eval': 8, 'optimizer': 'adamax', 'grad_clipping': 0.0, 'global_grad_clipping': 1.0, 'weight_decay': 0, 'learning_rate': 5e-05, 'momentum': 0, 'warmup': 0.1, 'warmup_schedule': 'warmup_linear', 'vb_dropout': True, 'dropout_p': 0.1, 'dropout_w': 0.0, 'bert_dropout_p': 0.1, 'ema_opt': 0, 'ema_gamma': 0.995, 'have_lr_scheduler': True, 'multi_step_lr': '10,20,30', 'freeze_layers': -1, 'embedding_opt': 0, 'lr_gamma': 0.5, 'bert_l2norm': 0.0, 'scheduler_type': 'ms', 'output_dir': 'checkpoints/mt-dnn-NL-labels_adamax_answer_opt1_gc0_ggc1_2019-04-15T2138', 'seed': 2018, 'task_config_path': 'configs/tasks_config.json', 'tasks_dropout_p': [0.1, 0.1, 0.1]}``
I train this model for, e.g., 4 epochs and want to perform the fine-tuning and domain-adaptation experiments based on the resulting model, "checkpoint/model_4.pt". Therefore, in run_rte.sh or snli_domain_adaptation_bash.sh I replace the initial checkpoint ../mt_dnn_models/mt_dnn_base.pt with checkpoint/model_4.pt.
Doing so, I get the following error:
export CUDA_VISIBLE_DEVICES=4,5
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Namespace(answer_att_hidden_size=128, answer_att_type='bilinear', answer_dropout_p=0.1, answer_mem_drop_p=0.1, answer_mem_type=1, answer_merge_opt=1, answer_num_turn=5, answer_opt=0, answer_rnn_type='gru', answer_sum_att_type='bilinear', answer_weight_norm_on=False, batch_size=32, batch_size_eval=8, bert_dropout_p=0.1, bert_l2norm=0.0, cuda=True, data_dir='../data/mt_dnn/', data_sort_on=False, dropout_p=0.1, dropout_w=0.0, dump_state_on=False, ema_gamma=0.995, ema_opt=0, embedding_opt=0, epochs=5, freeze_layers=-1, global_grad_clipping=1.0, grad_clipping=0.0, have_lr_scheduler=True, init_checkpoint='checkpoints/mt-dnn-NL-labels_adamax_answer_opt1_gc0_ggc1_2019-04-15T2138/model_4.pt', init_ratio=1, label_size='3', learning_rate=2e-05, log_file='checkpoints/mt-dnn-rte_adamax_answer_opt0_gc0_ggc1_2019-04-16T1636/log.log', log_per_updates=500, lr_gamma=0.5, max_seq_len=512, mem_cum_type='simple', mix_opt=0, momentum=0, mtl_opt=0, multi_gpu_on=False, multi_step_lr='10,20,30', name='farmer', optimizer='adamax', output_dir='checkpoints/mt-dnn-rte_adamax_answer_opt0_gc0_ggc1_2019-04-16T1636', pw_tasks=['qnnli'], ratio=0, scheduler_type='ms', seed=2018, task_config_path='configs/tasks_config.json', test_datasets=['rte'], train_datasets=['rte'], update_bert_opt=0, vb_dropout=True, warmup=0.1, warmup_schedule='warmup_linear', weight_decay=0)
04/16/2019 04:36:15 0
04/16/2019 04:36:15 Launching the MT-DNN training
04/16/2019 04:36:15 Loading ../data/mt_dnn/rte_train.json as task 0
Loaded 2490 samples out of 2490
04/16/2019 04:36:15 2
Loaded 277 samples out of 277
Loaded 3000 samples out of 3000
04/16/2019 04:36:15 ####################
04/16/2019 04:36:15 {'log_file': 'checkpoints/mt-dnn-rte_adamax_answer_opt0_gc0_ggc1_2019-04-16T1636/log.log', 'init_checkpoint': 'checkpoints/mt-dnn-NL-labels_adamax_answer_opt1_gc0_ggc1_2019-04-15T2138/model_4.pt', 'data_dir': '../data/mt_dnn/', 'data_sort_on': False, 'name': 'farmer', 'train_datasets': ['rte'], 'test_datasets': ['rte'], 'pw_tasks': ['qnnli'], 'update_bert_opt': 0, 'multi_gpu_on': False, 'mem_cum_type': 'simple', 'answer_num_turn': 5, 'answer_mem_drop_p': 0.1, 'answer_att_hidden_size': 128, 'answer_att_type': 'bilinear', 'answer_rnn_type': 'gru', 'answer_sum_att_type': 'bilinear', 'answer_merge_opt': 1, 'answer_mem_type': 1, 'answer_dropout_p': 0.1, 'answer_weight_norm_on': False, 'dump_state_on': False, 'answer_opt': [0], 'label_size': '2', 'mtl_opt': 0, 'ratio': 0, 'mix_opt': 0, 'max_seq_len': 512, 'init_ratio': 1, 'cuda': True, 'log_per_updates': 500, 'epochs': 5, 'batch_size': 32, 'batch_size_eval': 8, 'optimizer': 'adamax', 'grad_clipping': 0.0, 'global_grad_clipping': 1.0, 'weight_decay': 0, 'learning_rate': 2e-05, 'momentum': 0, 'warmup': 0.1, 'warmup_schedule': 'warmup_linear', 'vb_dropout': True, 'dropout_p': 0.1, 'dropout_w': 0.0, 'bert_dropout_p': 0.1, 'ema_opt': 0, 'ema_gamma': 0.995, 'have_lr_scheduler': True, 'multi_step_lr': '10,20,30', 'freeze_layers': -1, 'embedding_opt': 0, 'lr_gamma': 0.5, 'bert_l2norm': 0.0, 'scheduler_type': 'ms', 'output_dir': 'checkpoints/mt-dnn-rte_adamax_answer_opt0_gc0_ggc1_2019-04-16T1636', 'seed': 2018, 'task_config_path': 'configs/tasks_config.json', 'tasks_dropout_p': [0.1]}
04/16/2019 04:36:15 ####################
04/16/2019 04:36:35
############# Model Arch of MT-DNN #############
SANBertNetwork(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 1024)
(position_embeddings): Embedding(512, 1024)
(token_type_embeddings): Embedding(2, 1024)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(1)-(23): 23 more BertLayer blocks identical to (0)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(23): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(activation): Tanh()
)
)
(scoring_list): ModuleList(
(0): SANClassifier(
(dropout): DropoutWrapper()
(query_wsum): SelfAttnWrapper(
(att): LinearSelfAttn(
(linear): Linear(in_features=1024, out_features=1, bias=True)
(dropout): DropoutWrapper()
)
)
(attn): FlatSimilarityWrapper(
(att_dropout): DropoutWrapper()
(score_func): BilinearFlatSim(
(linear): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): DropoutWrapper()
)
)
(rnn): GRUCell(1024, 1024)
(classifier): Classifier(
(dropout): DropoutWrapper()
(proj): Linear(in_features=4096, out_features=3, bias=True)
)
)
(1): SANClassifier(
(dropout): DropoutWrapper()
(query_wsum): SelfAttnWrapper(
(att): LinearSelfAttn(
(linear): Linear(in_features=1024, out_features=1, bias=True)
(dropout): DropoutWrapper()
)
)
(attn): FlatSimilarityWrapper(
(att_dropout): DropoutWrapper()
(score_func): BilinearFlatSim(
(linear): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): DropoutWrapper()
)
)
(rnn): GRUCell(1024, 1024)
(classifier): Classifier(
(dropout): DropoutWrapper()
(proj): Linear(in_features=4096, out_features=2, bias=True)
)
)
(2): SANClassifier(
(dropout): DropoutWrapper()
(query_wsum): SelfAttnWrapper(
(att): LinearSelfAttn(
(linear): Linear(in_features=1024, out_features=1, bias=True)
(dropout): DropoutWrapper()
)
)
(attn): FlatSimilarityWrapper(
(att_dropout): DropoutWrapper()
(score_func): BilinearFlatSim(
(linear): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): DropoutWrapper()
)
)
(rnn): GRUCell(1024, 1024)
(classifier): Classifier(
(dropout): DropoutWrapper()
(proj): Linear(in_features=4096, out_features=2, bias=True)
)
)
)
)
04/16/2019 04:36:35 Total number of params: 357215242
04/16/2019 04:36:36 At epoch 0
Traceback (most recent call last):
File "../train.py", line 354, in <module>
main()
File "../train.py", line 316, in main
model.update(batch_meta, batch_data)
File "../mt_dnn/model.py", line 151, in update
self.optimizer.step()
File "../module/bert_optim.py", line 119, in step
exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
May I ask what the problem is?
I can run either of these scripts with ../mt_dnn_models/mt_dnn_large.pt, but not with the model I trained myself, i.e., model_4.pt.
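For context on this class of error (a hedged sketch of the mechanism, not the repo's actual code): Adam-style optimizers create state buffers such as exp_avg on whatever device the parameters lived on when the optimizer was built, so if the model is moved to the GPU afterwards, grad ends up on CUDA while exp_avg stays on the CPU, and the in-place update raises exactly this mismatch. The usual fix is to call model.cuda() before constructing the optimizer. The mechanism can be simulated without torch:

```python
class FakeTensor:
    """Minimal stand-in for a tensor that tracks its device (illustration only)."""

    def __init__(self, value, device):
        self.value = value
        self.device = device  # "cpu" or "cuda"

    def mul_(self, scalar):
        self.value *= scalar
        return self

    def add_(self, scalar, other):
        # Mirrors the device check torch performs in exp_avg.mul_(beta1).add_(1 - beta1, grad)
        if other.device != self.device:
            raise RuntimeError(
                "expected type torch.FloatTensor but got torch.cuda.FloatTensor"
            )
        self.value += scalar * other.value
        return self


beta1 = 0.9
exp_avg = FakeTensor(0.0, "cpu")  # optimizer state created while params were on CPU
grad = FakeTensor(1.0, "cuda")    # gradients after the model was moved to the GPU

try:
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
except RuntimeError as e:
    print(e)  # same mismatch the traceback shows

# Fix: make sure state and grads share a device before stepping.
exp_avg.device = "cuda"
exp_avg.mul_(beta1).add_(1 - beta1, grad)
```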
Hello!
I'd like to know more about the details of ensembling multiple MT-DNN models.
Specifically for the STS-B task, could you describe the steps you took to build the ensemble? Given the large model sizes, did you load them onto the GPUs sequentially and just save the individual model outputs to be averaged later? I'd appreciate your response!
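For what it's worth, the sequential approach described above can work without ever holding two models in GPU memory at once: run each checkpoint separately, save its scores, then average. A minimal sketch (the function name is hypothetical, not from the repo):

```python
def average_predictions(per_model_scores):
    """Average per-example scores across models (hypothetical helper).

    per_model_scores: list of score lists, one per model, aligned by example.
    For a regression task like STS-B this is a plain mean of the raw outputs.
    """
    n_models = len(per_model_scores)
    return [sum(scores) / n_models for scores in zip(*per_model_scores)]


# Scores saved by two separately-run checkpoints:
model_a = [4.2, 1.0, 3.6]
model_b = [4.0, 1.4, 3.0]
print(average_predictions([model_a, model_b]))
```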
I specified --multi_gpu_on and set CUDA_VISIBLE_DEVICES to multiple devices, but it looks like only one GPU is used. Why is that?
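One thing worth checking (a general observation about PyTorch setups, not a diagnosis of this repo): CUDA_VISIBLE_DEVICES must be exported before the Python process starts, and its value determines how many devices a DataParallel-style wrapper can split a batch across. A small hypothetical helper to sanity-check the environment:

```python
import os


def visible_gpu_count(env=None):
    """Return how many GPUs CUDA_VISIBLE_DEVICES exposes (hypothetical helper).

    None means the variable is unset, so all physical GPUs are visible.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([d for d in value.split(",") if d.strip()])


# e.g. launched as: CUDA_VISIBLE_DEVICES=0,1 python train.py --multi_gpu_on
print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1"}))  # prints 2
```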
Hi,
I am working on a sentiment analysis task and found that this architecture might be a good choice for my application.
However, I am quite confused about how to use this architecture on my own data. More specifically,
if I would like to apply this architecture directly to my custom data, what should I do? Is there a more detailed tutorial on using the architecture?
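In case it helps: adapting custom sentiment data usually starts with writing it out as plain tab-separated files, one per task, which is the shape the repo's preprocessing scripts consume. The exact column order should be verified against the repo's prepro script; the layout below (id, label, sentence) is an assumption for a single-sentence classification task:

```python
import csv

# Hypothetical rows for a binary sentiment task; the column order
# (id, label, sentence) is an assumption -- verify against the repo's prepro script.
rows = [
    ("0", "1", "a genuinely moving film"),
    ("1", "0", "flat characters and a boring plot"),
]

with open("sentiment_train.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)
```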
Looking forward to the inclusion of an open source license; otherwise the default copyright applies, which doesn't allow reproduction.