
fast-bert's People

Contributors

4ertovo4ka, aaronbriel, anreu, benfielding, bharatr21, cclauss, connect2ajith, danduma, darshanpatel11, ddelange, enzoampil, harmanpreet93, itssimon, jmcarlock, kaushaltrivedi, kirankunapuli, lingdoc, pawel-kranzberg, piercefreeman, shukanat, sufianj, tbastable, trivedigaurav, trivedikaushal, v-ko, washcycle, wwwehr

fast-bert's Issues

Unsupported operand type(s) for /: 'str' and 'str'

When I tried to run the example, I got the following error:

databunch = BertDataBunch('./data/', './data/',
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='bert',
                          no_cache=True)


TypeError Traceback (most recent call last)
in
11 multi_label=False,
12 model_type='bert',
---> 13 no_cache=True)

/data/miniconda3/envs/pt/lib/python3.7/site-packages/fast_bert/data_cls.py in init(self, data_dir, label_dir, tokenizer, train_file, val_file, test_data, label_file, text_col, label_col, batch_size_per_gpu, max_seq_length, multi_gpu, multi_label, backend, model_type, logger, clear_cache, no_cache)
288 self.tokenizer = tokenizer
289 self.data_dir = data_dir
--> 290 self.cache_dir = data_dir/'cache'
291 self.max_seq_length = max_seq_length
292 self.batch_size_per_gpu = batch_size_per_gpu

TypeError: unsupported operand type(s) for /: 'str' and 'str'
Could you help me to deal with that?
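
The / in data_dir/'cache' is pathlib's path-join operator, so it only works when data_dir is a pathlib.Path; passing plain strings produces exactly this TypeError. A minimal sketch of one workaround, keeping the same arguments as above but passing Path objects for the data and label directories:

from pathlib import Path
from fast_bert.data_cls import BertDataBunch

DATA_PATH = Path('./data/')    # Path objects support the / operator used internally
LABEL_PATH = Path('./data/')

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv', val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text', label_col='label',
                          batch_size_per_gpu=16, max_seq_length=512,
                          multi_gpu=True, multi_label=False,
                          model_type='bert', no_cache=True)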

BertLearner.from_pretrained_model stuck

Everything works perfectly until I want to create the BertLearner.
When I run the following cell
learner = BertLearner.from_pretrained_model(databunch, 'bert-base-multilingual-uncased', metrics, device, logger, finetuned_wgts_path=None, is_fp16=args['fp16'], loss_scale=args['loss_scale'], multi_gpu=multi_gpu, multi_label=False)

the cell is stuck loading.
The logger gives me the following hints:

`07/17/2019 10:05:36 - INFO - pytorch_pretrained_bert.modeling - loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz from cache at /home/ec2-user/.pytorch_pretrained_bert/437da855f7aeb6dcc47ee03b11ac55bfbc069d31354f6867f3b298aad8429925.dd2dce7e7331017693bd2230dbc8015b12a975201a420a856a6efbf7ae9d84c5
07/17/2019 10:05:36 - INFO - pytorch_pretrained_bert.modeling - extracting archive file /home/ec2-user/.pytorch_pretrained_bert/437da855f7aeb6dcc47ee03b11ac55bfbc069d31354f6867f3b298aad8429925.dd2dce7e7331017693bd2230dbc8015b12a975201a420a856a6efbf7ae9d84c5 to temp dir /tmp/tmp5yuiacnx
07/17/2019 10:05:43 - INFO - pytorch_pretrained_bert.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 105879
}

07/17/2019 10:05:48 - INFO - pytorch_pretrained_bert.modeling - Weights of BertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
07/17/2019 10:05:48 - INFO - pytorch_pretrained_bert.modeling - Weights from pretrained model not used in BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']`

RuntimeError: The size of tensor a (2) must match the size of tensor b (9833) at non-singleton dimension

I followed your instructions using my own data.
Since the batch size was too big for my data, I changed it to 6.

Then i got this error during evaluation:

08/23/2019 17:50:14 - INFO - root - Running evaluation
08/23/2019 17:50:14 - INFO - root - Num examples = 9833
08/23/2019 17:50:14 - INFO - root - Batch size = 6
Traceback (most recent call last):
  File "train_fast_bert_doc_rerank.py", line 81, in <module>
    optimizer_type="lamb"
  File "/usr/local/lib/python3.6/site-packages/fast_bert/learner_cls.py", line 295, in fit
    results = self.validate()
  File "/usr/local/lib/python3.6/site-packages/fast_bert/learner_cls.py", line 382, in validate
    validation_scores[metric['name']] = metric['function'](all_logits, all_labels)
  File "/usr/local/lib/python3.6/site-packages/fast_bert/metrics.py", line 31, in accuracy_thresh
    return ((y_pred > thresh) == y_true.byte()).float().mean().item()
RuntimeError: The size of tensor a (2) must match the size of tensor b (9833) at non-singleton dimension
Could you help me?
Thank you in advance
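
The shapes in the error fit a multi-label style metric (accuracy_thresh) being applied to a single-label, two-class problem: the logits are (9833, 2) while the labels are a flat vector of 9833 integers, and those two cannot broadcast. A standalone sketch reproducing the mismatch (the 9833/2 sizes are taken from the log above); using a single-label metric such as accuracy instead of accuracy_thresh avoids it:

import torch

y_pred = torch.randn(9833, 2)             # logits for two classes
y_true = torch.randint(0, 2, (9833,))     # one integer label per example

try:
    # accuracy_thresh-style comparison: shapes (9833, 2) vs (9833,) cannot broadcast
    ((y_pred.sigmoid() > 0.5) == y_true.byte()).float().mean()
except RuntimeError as e:
    print(e)   # "The size of tensor a (2) must match the size of tensor b (9833) ..."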

High confidence for False Positive results

I have trained a multi-class text classifier using BERT and I am getting accuracy around 90%. The only issue is that the model classifies out-of-domain sentences with a very high confidence score (e.g. 0.9954564).
I have seen other supervised models (such as spaCy's classifiers) classify out-of-domain sentences with very low confidence, which helps to detect them. Is there any method to solve this problem?

learner.fit and learner.validate - AttributeError: 'Tensor' object has no attribute 'bool'

/content/xlnet_cased_L-12_H-768_A-12/output/tensorboard
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
09/08/2019 14:37:51 - INFO - root - ***** Running training *****
09/08/2019 14:37:51 - INFO - root - Num examples = 1000
09/08/2019 14:37:51 - INFO - root - Num Epochs = 6
09/08/2019 14:37:51 - INFO - root - Total train batch size (w. parallel, distributed & accumulation) = 8
09/08/2019 14:37:51 - INFO - root - Gradient Accumulation steps = 1
09/08/2019 14:37:51 - INFO - root - Total optimization steps = 750
0.00% [0/6 00:00<00:00]
100.00% [125/125 04:24<00:00]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
09/08/2019 14:42:16 - INFO - root - Running evaluation
09/08/2019 14:42:16 - INFO - root - Num examples = 1000
09/08/2019 14:42:16 - INFO - root - Batch size = 8
100.00% [125/125 01:19<00:00]

AttributeError Traceback (most recent call last)
in ()
----> 1 learner.fit(args.num_train_epochs, args.learning_rate, validate=True)

2 frames
/usr/local/lib/python3.6/dist-packages/fast_bert/metrics.py in accuracy_thresh(y_pred, y_true, thresh, sigmoid)
29 if sigmoid:
30 y_pred = y_pred.sigmoid()
---> 31 return ((y_pred > thresh) == y_true.bool()).float().mean().item()
32 # return np.mean(((y_pred>thresh)==y_true.byte()).float().cpu().numpy(), axis=1).sum()
33

AttributeError: 'Tensor' object has no attribute 'bool'
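
Tensor.bool() only exists from PyTorch 1.2 onwards, so on older installs this metric raises the AttributeError above. Upgrading PyTorch, or casting with .byte() as in the commented-out line of metrics.py, both avoid it; a version-tolerant sketch (not the library's code) looks like:

import torch

def accuracy_thresh(y_pred, y_true, thresh=0.5, sigmoid=True):
    # Element-wise accuracy at a threshold, without relying on Tensor.bool()
    # (which was only added in PyTorch 1.2).
    if sigmoid:
        y_pred = y_pred.sigmoid()
    y_true = y_true.bool() if hasattr(y_true, 'bool') else y_true.byte()
    return ((y_pred > thresh) == y_true).float().mean().item()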

F1-score always 0

I use metrics as [{'name': 'F1-score', 'function': F1}] and run the sample data for 4 epochs.

However, after each epoch the F1 score is 0. What's wrong?

from fast_bert.learner import *
from fast_bert.metrics import *
from pytorch_pretrained_bert.tokenization import BertTokenizer

from bert_data import *

import torch
from fastai.text import *
import datetime

run_start_time = datetime.datetime.today().strftime('%Y-%m-%d_%H-%M-%S')

LOG_PATH=Path('logs/')  
MODEL_PATH=Path('models/') 

if not LOG_PATH.exists():
  LOG_PATH.mkdir()
import logging
import sys

args = {
    "run_text": "my_test",
    "max_seq_length": 512,
    "do_lower_case": True,
    "train_batch_size": 16,
    "learning_rate": 6e-5,
    "num_train_epochs": 12.0,
    "warmup_proportion": 0.002,
    "local_rank": -1,
    "gradient_accumulation_steps": 1,
    "fp16": True,
    "loss_scale": 128
}

logfile = str(LOG_PATH/'log-{}-{}.txt'.format(run_start_time, args["run_text"]))

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
    datefmt='%m/%d/%Y %H:%M:%S',
    handlers=[
        logging.FileHandler(logfile),
        logging.StreamHandler(sys.stdout)
    ])

logger = logging.getLogger()

device = torch.device('cuda')

if torch.cuda.device_count() > 1:
    multi_gpu = True
else:
    multi_gpu = False
    
print('multi_gpu={}'.format('True' if multi_gpu else 'False'))

DATA_PATH = Path('data/sample/data/')     
LABEL_PATH = Path('data/sample/labels')  

BERT_PRETRAINED_MODEL = "bert/bert-base-uncased"

args["do_lower_case"] = True
args["train_batch_size"] = 16
args["learning_rate"] = 6e-5
args["max_seq_length"] = 512
args["fp16"] = True

tokenizer = BertTokenizer.from_pretrained(BERT_PRETRAINED_MODEL, 
                                          do_lower_case=args['do_lower_case'])

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
databunch = BertDataBunch(DATA_PATH, LABEL_PATH, tokenizer, train_file='train.csv', val_file='valid.csv',
                          test_data='test.csv', label_file="labels.csv",
                          text_col="comment_text", label_col=label_cols,
                          bs=args['train_batch_size'], maxlen=args['max_seq_length'], 
                          multi_gpu=multi_gpu, multi_label=True)

#metrics = [{'name': 'accuracy', 'function': accuracy_multilabel}]                          
#metrics = [{'name': 'roc_auc', 'function': roc_auc}]                          
metrics = [{'name': 'F1-score', 'function': F1}]                          
learner = BertLearner.from_pretrained_model(databunch, BERT_PRETRAINED_MODEL, metrics, device, logger, 
                                            is_fp16=args['fp16'], loss_scale=args['loss_scale'], 
                                            multi_gpu=multi_gpu,  multi_label=True)
learner.fit(4, lr=args['learning_rate'], schedule_type="warmup_linear") 
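
If the F1 metric thresholds sigmoid outputs (as accuracy_thresh in metrics.py does), a score that stays at 0 is worth checking against the raw predictions directly. A quick, standalone sanity check with scikit-learn (not part of fast-bert), trying a lower threshold:

import numpy as np
from sklearn.metrics import f1_score

def f1_at_threshold(logits, y_true, thresh=0.3):
    # logits, y_true: arrays of shape (n_samples, n_labels)
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))   # sigmoid
    return f1_score(np.asarray(y_true), (probs > thresh).astype(int), average='micro')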

Attention Weights

Does this return the attention weights that it is possible to obtain from the BERT model through pytorch-transformers?
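
Nothing in this issue shows fast-bert exposing attentions directly, but the underlying Hugging Face models can return them when configured with output_attentions=True. A minimal sketch with a recent version of the transformers library (not fast-bert's API):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

inputs = tokenizer("fast-bert makes BERT easier to use", return_tensors='pt')
outputs = model(**inputs)
attentions = outputs.attentions   # one tensor per layer, shape (batch, heads, seq, seq)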

500+ multi-labels only predicts zeros

This might not be an issue related to fast-bert, but I give it a shot here either way. I now have a dataset of 500+ labels. At first, fast-bert predicts various values between 0-1 for every label which seems fine, but the more I train it the more it predicts only zeros for everything. Logically, it seems wise as only 1/500 is a positive label while the rest are zeros. Is there a way to fix this? Can I change the loss function somehow? Possibly introduce class weights to really penalize false-negatives?
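
One standard lever against the "all zeros" behaviour in extremely sparse multi-label problems is to up-weight positive targets in the loss. This is not fast-bert's built-in loss, but a standalone sketch with torch's BCEWithLogitsLoss and its pos_weight argument shows the idea:

import torch
import torch.nn as nn

num_labels = 500
logits = torch.randn(8, num_labels)                  # model outputs for a batch of 8
targets = torch.zeros(8, num_labels)
targets[torch.arange(8), torch.randint(0, num_labels, (8,))] = 1.0   # ~1 positive per row

# Up-weight positives, e.g. by the negative/positive frequency ratio per label,
# so that false negatives are penalised much more heavily than false positives.
pos_weight = torch.full((num_labels,), float(num_labels - 1))
loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, targets)
print(loss.item())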

TypeError: __init__() got an unexpected keyword argument 'max_grad_norm'

Hi

I have an issue when running
learner.fit(epochs=6,
            lr=6e-5,
            validate=True,  # Evaluate the model after each epoch
            schedule_type="warmup_linear")

using the following learner object:

logger = logging.getLogger()
device_cuda = torch.device("cuda")
metrics = [{'name': 'accuracy', 'function': accuracy}]

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path='bert-base-uncased',
    metrics=metrics,
    device=device_cuda,
    logger=logger,
    #output_dir=OUTPUT_DIR,
    finetuned_wgts_path=None,
    #warmup_steps=500,
    multi_gpu=True,
    is_fp16=True,
    multi_label=False,
    max_grad_norm=1.0)


TypeError Traceback (most recent call last)
in
2 lr=6e-5,
3 validate=True, # Evaluate the model after each epoch
----> 4 schedule_type="warmup_linear")

~/.conda/envs/transformers/lib/python3.7/site-packages/fast_bert/learner.py in fit(self, epochs, lr, validate, schedule_type)
462
463 if self.use_amp_optimizer == False:
--> 464 self.fit_old(epochs, lr, validate=validate, schedule_type=schedule_type)
465 return
466

~/.conda/envs/transformers/lib/python3.7/site-packages/fast_bert/learner.py in fit_old(self, epochs, lr, validate, schedule_type)
573 num_train_steps = int(len(self.data.train_dl) / self.grad_accumulation_steps * epochs)
574 if self.optimizer is None:
--> 575 self.optimizer, self.schedule = self.get_optimizer_old(lr , num_train_steps)
576
577 t_total = num_train_steps

~/.conda/envs/transformers/lib/python3.7/site-packages/fast_bert/learner.py in get_optimizer_old(self, lr, num_train_steps, schedule_type)
233 lr=lr,
234 bias_correction=False,
--> 235 max_grad_norm=1.0)
236
237 if self.loss_scale == 0:

TypeError: init() got an unexpected keyword argument 'max_grad_norm'

Does anyone know how to fix?
Thanks!

Can't see any metric while training

Earlier I was able to see the accuracy and fbeta score while training the model, but now I can't see anything. The model just completes its epochs without printing anything.
Any suggestions?

Can't read in train.csv

Hi,

I'm trying to test out fast-bert, and when I set up a train.csv file as follows:
index text label
0 test neg
2 test2 pos

It is a tab-separated test file. I get the following error:

Traceback (most recent call last):
File "/home/w3pt/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4729, in get_value
return libindex.get_value_box(s, key)
File "pandas/_libs/index.pyx", line 51, in pandas._libs.index.get_value_box
File "pandas/_libs/index.pyx", line 47, in pandas._libs.index.get_value_at
File "pandas/_libs/util.pxd", line 98, in pandas._libs.util.get_value_at
File "pandas/_libs/util.pxd", line 83, in pandas._libs.util.validate_indexer
TypeError: 'str' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "bert.py", line 17, in
model_type='bert')
File "/home/w3pt/.local/lib/python3.7/site-packages/fast_bert/data_cls.py", line 332, in init
train_file, text_col=text_col, label_col=label_col)
File "/home/w3pt/.local/lib/python3.7/site-packages/fast_bert/data_cls.py", line 222, in get_train_examples
return self._create_examples(data_df, "train", text_col=text_col, label_col=label_col)
File "/home/w3pt/.local/lib/python3.7/site-packages/fast_bert/data_cls.py", line 257, in _create_examples
return list(df.apply(lambda row: InputExample(guid=row.index, text_a=row[text_col], label=str(row[label_col])), axis=1))
File "/home/w3pt/.local/lib/python3.7/site-packages/pandas/core/frame.py", line 6906, in apply
return op.get_result()
File "/home/w3pt/.local/lib/python3.7/site-packages/pandas/core/apply.py", line 186, in get_result
return self.apply_standard()
File "/home/w3pt/.local/lib/python3.7/site-packages/pandas/core/apply.py", line 292, in apply_standard
self.apply_series_generator()
File "/home/w3pt/.local/lib/python3.7/site-packages/pandas/core/apply.py", line 321, in apply_series_generator
results[i] = self.f(v)
File "/home/w3pt/.local/lib/python3.7/site-packages/fast_bert/data_cls.py", line 257, in
return list(df.apply(lambda row: InputExample(guid=row.index, text_a=row[text_col], label=str(row[label_col])), axis=1))
File "/home/w3pt/.local/lib/python3.7/site-packages/pandas/core/series.py", line 1064, in getitem
result = self.index.get_value(self, key)
File "/home/w3pt/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4737, in get_value
raise e1
File "/home/w3pt/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4723, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas/_libs/index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('text', 'occurred at index 0')

Code:
from fast_bert.data_cls import BertDataBunch
from pathlib import Path
DATA_PATH = Path('./')
LABEL_PATH = Path('./')

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
tokenizer='bert-base-uncased',
train_file='train.csv',
val_file='val.csv',
label_file='labels.csv',
text_col='text',
label_col='label',
batch_size_per_gpu=16,
max_seq_length=512,
multi_gpu=True,
multi_label=False,
model_type='bert')

Am I doing something wrong?
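
The data files are read with pandas, and the examples elsewhere on this page use comma-separated files with named text and label columns, whereas the file above is tab-separated. A sketch of writing the toy data in that format (the exact reader options of data_cls.py are not shown here, so the comma-separated layout and the labels.csv layout are assumptions based on those other examples):

import pandas as pd

pd.DataFrame({'text': ['test', 'test2'],
              'label': ['neg', 'pos']}).to_csv('train.csv', index=True)

# labels.csv: the set of possible labels, one per line, no header (assumed layout)
pd.DataFrame({'label': ['neg', 'pos']}).to_csv('labels.csv', index=False, header=False)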

Incomplete class Learner(object)

Hi @kaushaltrivedi. Thanks so much for creating this library, it's great.

I was using it a few days ago and it worked well. But now I'm getting an import error from fast_bert/learner.py. I think it's due to an incomplete class Learner(object):. Complete message below:

File "/usr/local/lib/python3.6/dist-packages/fast_bert/learner.py", line 61
    class BertLearner(object):
        ^
IndentationError: expected an indented block

Problem with multiclass model

When I try to run the model for a multi-class problem, after training it throws the following during evaluation:

RuntimeError Traceback (most recent call last)
1 learner.fit(args.num_train_epochs, args.learning_rate, validate=True)
52 if len(types) <= 1:
---> 53     return orig_fn(*args, **kwargs)
54 elif len(types) == 2 and types == set(['HalfTensor', 'FloatTensor']):
55     new_args = utils.casted_args(cast_fn,
RuntimeError: The size of tensor a (4) must match the size of tensor b (74) at non-singleton dimension 1

The metric I have used is fbeta.

Could the lamb optimizer be used in ImageNet classification?

Thank you for your contribution.
As the paper says, the LAMB optimizer can also be used for ImageNet classification. I am trying to incorporate the LAMB implementation here into my own code. Could the optimizer you contributed here also be applied to this kind of classification?
Many thanks.
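
A LAMB implementation that follows the torch.optim.Optimizer interface can, in principle, be dropped into any training loop, including image classification. A minimal self-contained sketch with a tiny CNN and random data; torch.optim.Adam is used as a placeholder where the Lamb class would go, since the exact import path of the implementation is not shown in this issue:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
)

# Swap in the LAMB class here, e.g. optimizer = Lamb(model.parameters(), lr=1e-3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)            # stand-in for an ImageNet-style batch
labels = torch.randint(0, 10, (8,))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()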

notebook not working out of the box

I'm trying to just get the included toxicity notebook to work from a fresh clone and am having some issues:

  1. Out of the box, the data & labels directories are pointing to the wrong place and the DataBunch is using filenames that are not part of the repo. These are fixed easily enough.

  2. It would help if there was a pointer to where to get the PyTorch pretrained model uncased_L-12_H-768_A-12. There is a Google download which will not work with the from_pretrained_model cell:

FileNotFoundError: [Errno 2] No such file or directory: '../../bert/bert-models/uncased_L-12_H-768_A-12/pytorch_model.bin'

I have been able to get past this step by using 'bert-base-uncased' instead of BERT_PRETRAINED_PATH as the model spec in the tokenizer and from_pretrained_model steps.

  3. Once I get everything loaded: RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 7.43 GiB total capacity; 6.91 GiB already allocated; 10.94 MiB free; 24.36 MiB cached)

This is a standard 8G GPU compute engine instance on GCP. Advice on how to not run out of memory would help the tutorial a lot.
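
On an 8 GB card the usual levers are a shorter max_seq_length, a smaller per-GPU batch size, gradient accumulation, and fp16. A sketch of more conservative DataBunch arguments, reusing the keyword names and variable names that appear in other issues on this page (treat the exact values as starting points, not recommendations):

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv', val_file='val.csv',
                          label_file='labels.csv',
                          text_col='comment_text', label_col=label_cols,
                          batch_size_per_gpu=4,        # smaller batches
                          max_seq_length=256,          # shorter sequences use far less memory
                          multi_gpu=False, multi_label=True, model_type='bert')

# On the learner side, is_fp16=True and grad_accumulation_steps=4 (both accepted by
# BertLearner.from_pretrained_model, see the signature quoted further down this page)
# keep the effective batch size while cutting per-step memory.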

weights not initialized when saving/loading

When I train a fast-bert model and save it using save_and_reload(), the model output is not consistent with the model's output before saving.

code to reproduce:

from fast_bert import BertClassificationPredictor


databunch = BertDataBunch(args['data_dir'], LABEL_PATH, tokenizer, train_file='train.csv', val_file='val.csv',
                      test_data=test_df['content'].tolist(),
                      text_col="content", label_col=label_cols,
                      bs=args['train_batch_size'], maxlen=args['max_seq_length'], 
                      multi_gpu=True, multi_label=True)
databunch.save()

metrics = []
metrics.append({'name': 'accuracy_thresh', 'function': accuracy_thresh})
metrics.append({'name': 'roc_auc', 'function': roc_auc})
metrics.append({'name': 'fbeta', 'function': fbeta})
metrics.append({'name': 'accuracy_single', 'function': accuracy_multilabel})

learner = BertLearner.from_pretrained_model(databunch, BERT_PRETRAINED_PATH, metrics, device, logger, 
                                            finetuned_wgts_path=FINETUNED_PATH, 
                                            is_fp16=args['fp16'], loss_scale=args['loss_scale'], 
                                            multi_gpu=True,  multi_label=True,)
learner.fit(4, lr=args['learning_rate'], schedule_type="warmup_cosine_hard_restarts",validate=True)

#save prediction on test set
prediction_before_saving = learner.predict_batch(test_df['content'].tolist())

model_path = os.getcwd()+'/fastBertModels'
model_name = 'fastBert_split_'+str(idx)+'_test'
learner.save_and_reload(model_path,model_name)
predictor = BertClassificationPredictor(model_path=model_path+'/'+model_name+'.bin', pretrained_path = BERT_PRETRAINED_PATH, label_path = LABEL_PATH, multi_label=True)

#save prediction on test set (again)
prediction_after_loading = predictor.predict_batch(test_df['content'].tolist())

#remove column names from predictions 
prediction_before_saving = [[x[0][1],x[1][1]] for x in prediction_before_saving]
prediction_after_loading = [[x[0][1],x[1][1]] for x in prediction_after_loading]


for x,y in zip(prediction_before_saving,prediction_after_loading):
    print(x==y,x,y)

I also get a bunch of warnings regarding the BERT model weights when I run save_and_reload(), as well as when I load the model into a BertClassificationPredictor. I suspect this to be the culprit (example below).

 05/28/2019 22:13:30 - INFO - pytorch_pretrained_bert.modeling -   loading archive file uncased_L-12_H-768_A-12 from cache at uncased_L-12_H-768_A-12
05/28/2019 22:13:30 - INFO - pytorch_pretrained_bert.modeling -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
05/28/2019 22:13:36 - INFO - pytorch_pretrained_bert.modeling -   Weights of BertForMultiLabelSequenceClassification not initialized from pretrained model: ['bert.embeddings.word_embeddings.weight', 'bert.embeddings.position_embeddings.weight', 'bert.embeddings.token_type_embeddings.weight', 'bert.embeddings.LayerNorm.weight', 'bert.embeddings.LayerNorm.bias', 'bert.encoder.layer.0.attention.self.query.weight', 'bert.encoder.layer.0.attention.self.query.bias', 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.layer.0.attention.self.key.bias', 'bert.encoder.layer.0.attention.self.value.weight', 'bert.encoder.layer.0.attention.self.value.bias', 'bert.encoder.layer.0.attention.output.dense.weight', 'bert.encoder.layer.0.attention.output.dense.bias', 'bert.encoder.layer.0.attention.output.LayerNorm.weight', 'bert.encoder.layer.0.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.intermediate.dense.weight', 'bert.encoder.layer.0.intermediate.dense.bias', 'bert.encoder.layer.0.output.dense.weight', 'bert.encoder.layer.0.output.dense.bias', 'bert.encoder.layer.0.output.LayerNorm.weight', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encoder.layer.1.attention.self.query.weight', 'bert.encoder.layer.1.attention.self.query.bias', 'bert.encoder.layer.1.attention.self.key.weight', 'bert.encoder.layer.1.attention.self.key.bias', 'bert.encoder.layer.1.attention.self.value.weight', 'bert.encoder.layer.1.attention.self.value.bias', 'bert.encoder.layer.1.attention.output.dense.weight', 'bert.encoder.layer.1.attention.output.dense.bias', 'bert.encoder.layer.1.attention.output.LayerNorm.weight', 'bert.encoder.layer.1.attention.output.LayerNorm.bias', 'bert.encoder.layer.1.intermediate.dense.weight', 'bert.encoder.layer.1.intermediate.dense.bias', 'bert.encoder.layer.1.output.dense.weight', 'bert.encoder.layer.1.output.dense.bias', 'bert.encoder.layer.1.output.LayerNorm.weight', 'bert.encoder.layer.1.output.LayerNorm.bias', 'bert.encoder.layer.2.attention.self.query.weight', 'bert.encoder.layer.2.attention.self.query.bias', 'bert.encoder.layer.2.attention.self.key.weight', 'bert.encoder.layer.2.attention.self.key.bias', 'bert.encoder.layer.2.attention.self.value.weight', 'bert.encoder.layer.2.attention.self.value.bias', 'bert.encoder.layer.2.attention.output.dense.weight', 'bert.encoder.layer.2.attention.output.dense.bias', 'bert.encoder.layer.2.attention.output.LayerNorm.weight', 'bert.encoder.layer.2.attention.output.LayerNorm.bias', 'bert.encoder.layer.2.intermediate.dense.weight', 'bert.encoder.layer.2.intermediate.dense.bias', 'bert.encoder.layer.2.output.dense.weight', 'bert.encoder.layer.2.output.dense.bias', 'bert.encoder.layer.2.output.LayerNorm.weight', 'bert.encoder.layer.2.output.LayerNorm.bias', 'bert.encoder.layer.3.attention.self.query.weight', 'bert.encoder.layer.3.attention.self.query.bias', 'bert.encoder.layer.3.attention.self.key.weight', 'bert.encoder.layer.3.attention.self.key.bias', 'bert.encoder.layer.3.attention.self.value.weight', 'bert.encoder.layer.3.attention.self.value.bias', 'bert.encoder.layer.3.attention.output.dense.weight', 'bert.encoder.layer.3.attention.output.dense.bias', 'bert.encoder.layer.3.attention.output.LayerNorm.weight', 'bert.encoder.layer.3.attention.output.LayerNorm.bias', 'bert.encoder.layer.3.intermediate.dense.weight', 'bert.encoder.layer.3.intermediate.dense.bias', 'bert.encoder.layer.3.output.dense.weight', 'bert.encoder.layer.3.output.dense.bias', 'bert.encoder.layer.3.output.LayerNorm.weight', 
'bert.encoder.layer.3.output.LayerNorm.bias', 'bert.encoder.layer.4.attention.self.query.weight', 'bert.encoder.layer.4.attention.self.query.bias', 'bert.encoder.layer.4.attention.self.key.weight', 'bert.encoder.layer.4.attention.self.key.bias', 'bert.encoder.layer.4.attention.self.value.weight', 'bert.encoder.layer.4.attention.self.value.bias', 'bert.encoder.layer.4.attention.output.dense.weight', 'bert.encoder.layer.4.attention.output.dense.bias', 'bert.encoder.layer.4.attention.output.LayerNorm.weight', 'bert.encoder.layer.4.attention.output.LayerNorm.bias', 'bert.encoder.layer.4.intermediate.dense.weight', 'bert.encoder.layer.4.intermediate.dense.bias', 'bert.encoder.layer.4.output.dense.weight', 'bert.encoder.layer.4.output.dense.bias', 'bert.encoder.layer.4.output.LayerNorm.weight', 'bert.encoder.layer.4.output.LayerNorm.bias', 'bert.encoder.layer.5.attention.self.query.weight', 'bert.encoder.layer.5.attention.self.query.bias', 'bert.encoder.layer.5.attention.self.key.weight', 'bert.encoder.layer.5.attention.self.key.bias', 'bert.encoder.layer.5.attention.self.value.weight', 'bert.encoder.layer.5.attention.self.value.bias', 'bert.encoder.layer.5.attention.output.dense.weight', 'bert.encoder.layer.5.attention.output.dense.bias', 'bert.encoder.layer.5.attention.output.LayerNorm.weight', 'bert.encoder.layer.5.attention.output.LayerNorm.bias', 'bert.encoder.layer.5.intermediate.dense.weight', 'bert.encoder.layer.5.intermediate.dense.bias', 'bert.encoder.layer.5.output.dense.weight', 'bert.encoder.layer.5.output.dense.bias', 'bert.encoder.layer.5.output.LayerNorm.weight', 'bert.encoder.layer.5.output.LayerNorm.bias', 'bert.encoder.layer.6.attention.self.query.weight', 'bert.encoder.layer.6.attention.self.query.bias', 'bert.encoder.layer.6.attention.self.key.weight', 'bert.encoder.layer.6.attention.self.key.bias', 'bert.encoder.layer.6.attention.self.value.weight', 'bert.encoder.layer.6.attention.self.value.bias', 'bert.encoder.layer.6.attention.output.dense.weight', 'bert.encoder.layer.6.attention.output.dense.bias', 'bert.encoder.layer.6.attention.output.LayerNorm.weight', 'bert.encoder.layer.6.attention.output.LayerNorm.bias', 'bert.encoder.layer.6.intermediate.dense.weight', 'bert.encoder.layer.6.intermediate.dense.bias', 'bert.encoder.layer.6.output.dense.weight', 'bert.encoder.layer.6.output.dense.bias', 'bert.encoder.layer.6.output.LayerNorm.weight', 'bert.encoder.layer.6.output.LayerNorm.bias', 'bert.encoder.layer.7.attention.self.query.weight', 'bert.encoder.layer.7.attention.self.query.bias', 'bert.encoder.layer.7.attention.self.key.weight', 'bert.encoder.layer.7.attention.self.key.bias', 'bert.encoder.layer.7.attention.self.value.weight', 'bert.encoder.layer.7.attention.self.value.bias', 'bert.encoder.layer.7.attention.output.dense.weight', 'bert.encoder.layer.7.attention.output.dense.bias', 'bert.encoder.layer.7.attention.output.LayerNorm.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.7.intermediate.dense.weight', 'bert.encoder.layer.7.intermediate.dense.bias', 'bert.encoder.layer.7.output.dense.weight', 'bert.encoder.layer.7.output.dense.bias', 'bert.encoder.layer.7.output.LayerNorm.weight', 'bert.encoder.layer.7.output.LayerNorm.bias', 'bert.encoder.layer.8.attention.self.query.weight', 'bert.encoder.layer.8.attention.self.query.bias', 'bert.encoder.layer.8.attention.self.key.weight', 'bert.encoder.layer.8.attention.self.key.bias', 'bert.encoder.layer.8.attention.self.value.weight', 'bert.encoder.layer.8.attention.self.value.bias', 
'bert.encoder.layer.8.attention.output.dense.weight', 'bert.encoder.layer.8.attention.output.dense.bias', 'bert.encoder.layer.8.attention.output.LayerNorm.weight', 'bert.encoder.layer.8.attention.output.LayerNorm.bias', 'bert.encoder.layer.8.intermediate.dense.weight', 'bert.encoder.layer.8.intermediate.dense.bias', 'bert.encoder.layer.8.output.dense.weight', 'bert.encoder.layer.8.output.dense.bias', 'bert.encoder.layer.8.output.LayerNorm.weight', 'bert.encoder.layer.8.output.LayerNorm.bias', 'bert.encoder.layer.9.attention.self.query.weight', 'bert.encoder.layer.9.attention.self.query.bias', 'bert.encoder.layer.9.attention.self.key.weight', 'bert.encoder.layer.9.attention.self.key.bias', 'bert.encoder.layer.9.attention.self.value.weight', 'bert.encoder.layer.9.attention.self.value.bias', 'bert.encoder.layer.9.attention.output.dense.weight', 'bert.encoder.layer.9.attention.output.dense.bias', 'bert.encoder.layer.9.attention.output.LayerNorm.weight', 'bert.encoder.layer.9.attention.output.LayerNorm.bias', 'bert.encoder.layer.9.intermediate.dense.weight', 'bert.encoder.layer.9.intermediate.dense.bias', 'bert.encoder.layer.9.output.dense.weight', 'bert.encoder.layer.9.output.dense.bias', 'bert.encoder.layer.9.output.LayerNorm.weight', 'bert.encoder.layer.9.output.LayerNorm.bias', 'bert.encoder.layer.10.attention.self.query.weight', 'bert.encoder.layer.10.attention.self.query.bias', 'bert.encoder.layer.10.attention.self.key.weight', 'bert.encoder.layer.10.attention.self.key.bias', 'bert.encoder.layer.10.attention.self.value.weight', 'bert.encoder.layer.10.attention.self.value.bias', 'bert.encoder.layer.10.attention.output.dense.weight', 'bert.encoder.layer.10.attention.output.dense.bias', 'bert.encoder.layer.10.attention.output.LayerNorm.weight', 'bert.encoder.layer.10.attention.output.LayerNorm.bias', 'bert.encoder.layer.10.intermediate.dense.weight', 'bert.encoder.layer.10.intermediate.dense.bias', 'bert.encoder.layer.10.output.dense.weight', 'bert.encoder.layer.10.output.dense.bias', 'bert.encoder.layer.10.output.LayerNorm.weight', 'bert.encoder.layer.10.output.LayerNorm.bias', 'bert.encoder.layer.11.attention.self.query.weight', 'bert.encoder.layer.11.attention.self.query.bias', 'bert.encoder.layer.11.attention.self.key.weight', 'bert.encoder.layer.11.attention.self.key.bias', 'bert.encoder.layer.11.attention.self.value.weight', 'bert.encoder.layer.11.attention.self.value.bias', 'bert.encoder.layer.11.attention.output.dense.weight', 'bert.encoder.layer.11.attention.output.dense.bias', 'bert.encoder.layer.11.attention.output.LayerNorm.weight', 'bert.encoder.layer.11.attention.output.LayerNorm.bias', 'bert.encoder.layer.11.intermediate.dense.weight', 'bert.encoder.layer.11.intermediate.dense.bias', 'bert.encoder.layer.11.output.dense.weight', 'bert.encoder.layer.11.output.dense.bias', 'bert.encoder.layer.11.output.LayerNorm.weight', 'bert.encoder.layer.11.output.LayerNorm.bias', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'classifier.weight', 'classifier.bias']
05/28/2019 22:13:36 - INFO - pytorch_pretrained_bert.modeling -   Weights from pretrained model not used in BertForMultiLabelSequenceClassification: ['module.bert.embeddings.word_embeddings.weight', 'module.bert.embeddings.position_embeddings.weight', 'module.bert.embeddings.token_type_embeddings.weight', 'module.bert.embeddings.LayerNorm.weight', 'module.bert.embeddings.LayerNorm.bias', 'module.bert.encoder.layer.0.attention.self.query.weight', 'module.bert.encoder.layer.0.attention.self.query.bias', 'module.bert.encoder.layer.0.attention.self.key.weight', 'module.bert.encoder.layer.0.attention.self.key.bias', 'module.bert.encoder.layer.0.attention.self.value.weight', 'module.bert.encoder.layer.0.attention.self.value.bias', 'module.bert.encoder.layer.0.attention.output.dense.weight', 'module.bert.encoder.layer.0.attention.output.dense.bias', 'module.bert.encoder.layer.0.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.0.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.0.intermediate.dense.weight', 'module.bert.encoder.layer.0.intermediate.dense.bias', 'module.bert.encoder.layer.0.output.dense.weight', 'module.bert.encoder.layer.0.output.dense.bias', 'module.bert.encoder.layer.0.output.LayerNorm.weight', 'module.bert.encoder.layer.0.output.LayerNorm.bias', 'module.bert.encoder.layer.1.attention.self.query.weight', 'module.bert.encoder.layer.1.attention.self.query.bias', 'module.bert.encoder.layer.1.attention.self.key.weight', 'module.bert.encoder.layer.1.attention.self.key.bias', 'module.bert.encoder.layer.1.attention.self.value.weight', 'module.bert.encoder.layer.1.attention.self.value.bias', 'module.bert.encoder.layer.1.attention.output.dense.weight', 'module.bert.encoder.layer.1.attention.output.dense.bias', 'module.bert.encoder.layer.1.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.1.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.1.intermediate.dense.weight', 'module.bert.encoder.layer.1.intermediate.dense.bias', 'module.bert.encoder.layer.1.output.dense.weight', 'module.bert.encoder.layer.1.output.dense.bias', 'module.bert.encoder.layer.1.output.LayerNorm.weight', 'module.bert.encoder.layer.1.output.LayerNorm.bias', 'module.bert.encoder.layer.2.attention.self.query.weight', 'module.bert.encoder.layer.2.attention.self.query.bias', 'module.bert.encoder.layer.2.attention.self.key.weight', 'module.bert.encoder.layer.2.attention.self.key.bias', 'module.bert.encoder.layer.2.attention.self.value.weight', 'module.bert.encoder.layer.2.attention.self.value.bias', 'module.bert.encoder.layer.2.attention.output.dense.weight', 'module.bert.encoder.layer.2.attention.output.dense.bias', 'module.bert.encoder.layer.2.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.2.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.2.intermediate.dense.weight', 'module.bert.encoder.layer.2.intermediate.dense.bias', 'module.bert.encoder.layer.2.output.dense.weight', 'module.bert.encoder.layer.2.output.dense.bias', 'module.bert.encoder.layer.2.output.LayerNorm.weight', 'module.bert.encoder.layer.2.output.LayerNorm.bias', 'module.bert.encoder.layer.3.attention.self.query.weight', 'module.bert.encoder.layer.3.attention.self.query.bias', 'module.bert.encoder.layer.3.attention.self.key.weight', 'module.bert.encoder.layer.3.attention.self.key.bias', 'module.bert.encoder.layer.3.attention.self.value.weight', 'module.bert.encoder.layer.3.attention.self.value.bias', 'module.bert.encoder.layer.3.attention.output.dense.weight', 
'module.bert.encoder.layer.3.attention.output.dense.bias', 'module.bert.encoder.layer.3.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.3.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.3.intermediate.dense.weight', 'module.bert.encoder.layer.3.intermediate.dense.bias', 'module.bert.encoder.layer.3.output.dense.weight', 'module.bert.encoder.layer.3.output.dense.bias', 'module.bert.encoder.layer.3.output.LayerNorm.weight', 'module.bert.encoder.layer.3.output.LayerNorm.bias', 'module.bert.encoder.layer.4.attention.self.query.weight', 'module.bert.encoder.layer.4.attention.self.query.bias', 'module.bert.encoder.layer.4.attention.self.key.weight', 'module.bert.encoder.layer.4.attention.self.key.bias', 'module.bert.encoder.layer.4.attention.self.value.weight', 'module.bert.encoder.layer.4.attention.self.value.bias', 'module.bert.encoder.layer.4.attention.output.dense.weight', 'module.bert.encoder.layer.4.attention.output.dense.bias', 'module.bert.encoder.layer.4.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.4.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.4.intermediate.dense.weight', 'module.bert.encoder.layer.4.intermediate.dense.bias', 'module.bert.encoder.layer.4.output.dense.weight', 'module.bert.encoder.layer.4.output.dense.bias', 'module.bert.encoder.layer.4.output.LayerNorm.weight', 'module.bert.encoder.layer.4.output.LayerNorm.bias', 'module.bert.encoder.layer.5.attention.self.query.weight', 'module.bert.encoder.layer.5.attention.self.query.bias', 'module.bert.encoder.layer.5.attention.self.key.weight', 'module.bert.encoder.layer.5.attention.self.key.bias', 'module.bert.encoder.layer.5.attention.self.value.weight', 'module.bert.encoder.layer.5.attention.self.value.bias', 'module.bert.encoder.layer.5.attention.output.dense.weight', 'module.bert.encoder.layer.5.attention.output.dense.bias', 'module.bert.encoder.layer.5.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.5.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.5.intermediate.dense.weight', 'module.bert.encoder.layer.5.intermediate.dense.bias', 'module.bert.encoder.layer.5.output.dense.weight', 'module.bert.encoder.layer.5.output.dense.bias', 'module.bert.encoder.layer.5.output.LayerNorm.weight', 'module.bert.encoder.layer.5.output.LayerNorm.bias', 'module.bert.encoder.layer.6.attention.self.query.weight', 'module.bert.encoder.layer.6.attention.self.query.bias', 'module.bert.encoder.layer.6.attention.self.key.weight', 'module.bert.encoder.layer.6.attention.self.key.bias', 'module.bert.encoder.layer.6.attention.self.value.weight', 'module.bert.encoder.layer.6.attention.self.value.bias', 'module.bert.encoder.layer.6.attention.output.dense.weight', 'module.bert.encoder.layer.6.attention.output.dense.bias', 'module.bert.encoder.layer.6.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.6.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.6.intermediate.dense.weight', 'module.bert.encoder.layer.6.intermediate.dense.bias', 'module.bert.encoder.layer.6.output.dense.weight', 'module.bert.encoder.layer.6.output.dense.bias', 'module.bert.encoder.layer.6.output.LayerNorm.weight', 'module.bert.encoder.layer.6.output.LayerNorm.bias', 'module.bert.encoder.layer.7.attention.self.query.weight', 'module.bert.encoder.layer.7.attention.self.query.bias', 'module.bert.encoder.layer.7.attention.self.key.weight', 'module.bert.encoder.layer.7.attention.self.key.bias', 'module.bert.encoder.layer.7.attention.self.value.weight', 
'module.bert.encoder.layer.7.attention.self.value.bias', 'module.bert.encoder.layer.7.attention.output.dense.weight', 'module.bert.encoder.layer.7.attention.output.dense.bias', 'module.bert.encoder.layer.7.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.7.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.7.intermediate.dense.weight', 'module.bert.encoder.layer.7.intermediate.dense.bias', 'module.bert.encoder.layer.7.output.dense.weight', 'module.bert.encoder.layer.7.output.dense.bias', 'module.bert.encoder.layer.7.output.LayerNorm.weight', 'module.bert.encoder.layer.7.output.LayerNorm.bias', 'module.bert.encoder.layer.8.attention.self.query.weight', 'module.bert.encoder.layer.8.attention.self.query.bias', 'module.bert.encoder.layer.8.attention.self.key.weight', 'module.bert.encoder.layer.8.attention.self.key.bias', 'module.bert.encoder.layer.8.attention.self.value.weight', 'module.bert.encoder.layer.8.attention.self.value.bias', 'module.bert.encoder.layer.8.attention.output.dense.weight', 'module.bert.encoder.layer.8.attention.output.dense.bias', 'module.bert.encoder.layer.8.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.8.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.8.intermediate.dense.weight', 'module.bert.encoder.layer.8.intermediate.dense.bias', 'module.bert.encoder.layer.8.output.dense.weight', 'module.bert.encoder.layer.8.output.dense.bias', 'module.bert.encoder.layer.8.output.LayerNorm.weight', 'module.bert.encoder.layer.8.output.LayerNorm.bias', 'module.bert.encoder.layer.9.attention.self.query.weight', 'module.bert.encoder.layer.9.attention.self.query.bias', 'module.bert.encoder.layer.9.attention.self.key.weight', 'module.bert.encoder.layer.9.attention.self.key.bias', 'module.bert.encoder.layer.9.attention.self.value.weight', 'module.bert.encoder.layer.9.attention.self.value.bias', 'module.bert.encoder.layer.9.attention.output.dense.weight', 'module.bert.encoder.layer.9.attention.output.dense.bias', 'module.bert.encoder.layer.9.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.9.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.9.intermediate.dense.weight', 'module.bert.encoder.layer.9.intermediate.dense.bias', 'module.bert.encoder.layer.9.output.dense.weight', 'module.bert.encoder.layer.9.output.dense.bias', 'module.bert.encoder.layer.9.output.LayerNorm.weight', 'module.bert.encoder.layer.9.output.LayerNorm.bias', 'module.bert.encoder.layer.10.attention.self.query.weight', 'module.bert.encoder.layer.10.attention.self.query.bias', 'module.bert.encoder.layer.10.attention.self.key.weight', 'module.bert.encoder.layer.10.attention.self.key.bias', 'module.bert.encoder.layer.10.attention.self.value.weight', 'module.bert.encoder.layer.10.attention.self.value.bias', 'module.bert.encoder.layer.10.attention.output.dense.weight', 'module.bert.encoder.layer.10.attention.output.dense.bias', 'module.bert.encoder.layer.10.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.10.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.10.intermediate.dense.weight', 'module.bert.encoder.layer.10.intermediate.dense.bias', 'module.bert.encoder.layer.10.output.dense.weight', 'module.bert.encoder.layer.10.output.dense.bias', 'module.bert.encoder.layer.10.output.LayerNorm.weight', 'module.bert.encoder.layer.10.output.LayerNorm.bias', 'module.bert.encoder.layer.11.attention.self.query.weight', 'module.bert.encoder.layer.11.attention.self.query.bias', 'module.bert.encoder.layer.11.attention.self.key.weight', 
'module.bert.encoder.layer.11.attention.self.key.bias', 'module.bert.encoder.layer.11.attention.self.value.weight', 'module.bert.encoder.layer.11.attention.self.value.bias', 'module.bert.encoder.layer.11.attention.output.dense.weight', 'module.bert.encoder.layer.11.attention.output.dense.bias', 'module.bert.encoder.layer.11.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.11.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.11.intermediate.dense.weight', 'module.bert.encoder.layer.11.intermediate.dense.bias', 'module.bert.encoder.layer.11.output.dense.weight', 'module.bert.encoder.layer.11.output.dense.bias', 'module.bert.encoder.layer.11.output.LayerNorm.weight', 'module.bert.encoder.layer.11.output.LayerNorm.bias', 'module.bert.pooler.dense.weight', 'module.bert.pooler.dense.bias', 'module.classifier.weight', 'module.classifier.bias']
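
Every weight in the "not used" list above carries a module. prefix, which is what nn.DataParallel adds to parameter names; the loader therefore finds none of the keys it expects and reinitialises the whole model. One common workaround (a sketch, not fast-bert's own code) is to strip the prefix from the state dict before saving or loading:

import torch

state_dict = torch.load('pytorch_model.bin', map_location='cpu')

# Drop the 'module.' prefix that nn.DataParallel prepends to every parameter name.
cleaned = {k[len('module.'):] if k.startswith('module.') else k: v
           for k, v in state_dict.items()}

torch.save(cleaned, 'pytorch_model.bin')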

Save model weights on epoch with best score

It would be nice to have an option to save the model with the best validation score for a given metric.
It would also be nice to have a callback that runs at the end of each epoch.
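
Until such an option exists, a manual workaround is to fit one epoch at a time and keep the best checkpoint yourself. A rough sketch, assuming learner.validate() returns a dict keyed by metric name (as the tracebacks elsewhere on this page suggest) and reusing save_and_reload() from the saving/loading issue above; num_epochs, learning_rate and model_path are placeholders:

best_score = float('-inf')
for epoch in range(num_epochs):
    learner.fit(1, lr=learning_rate, validate=False)
    scores = learner.validate()
    if scores['accuracy'] > best_score:           # pick whichever metric matters
        best_score = scores['accuracy']
        learner.save_and_reload(model_path, 'best_model')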

roc_auc

When I tried to use the roc_auc metric, I got an error:
ValueError: Found input variables with inconsistent numbers of samples: [64, 128]

"train_batch_size": 64, "eval_batch_size": 64,

multi_label=False
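
The 64 vs 128 mismatch fits 64 labels being compared against 64 × 2 flattened class scores, which is what happens when a micro-averaged multi-label ROC-AUC is applied to a two-class, single-label run (multi_label=False); this is an inference from the error, not a confirmed cause. A standalone scikit-learn sketch of the single-label case with the shapes from the error:

import numpy as np
from sklearn.metrics import roc_auc_score

logits = np.random.randn(64, 2)                     # two-class logits for one batch
labels = np.random.randint(0, 2, size=64)           # one integer label per example

probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax
print(roc_auc_score(labels, probs[:, 1]))           # score the positive-class column only

# roc_auc_score(labels, probs.ravel()) would compare 64 labels with 128 scores,
# i.e. "inconsistent numbers of samples: [64, 128]" as in the error above.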

Multiple Output Predictions

Hello,

Is it possible to create a model that uses pre-trained BERT (or any other model) and feeds data from multiple datasets to predict multiple outputs?

For example, I have 4 text datasets:
Dataset A contains [ ValueA, ValueB, ValueC ]
Dataset B contains [ ValueA, ValueB, ValueC, ValueD, ValueE, ValueF ]
Dataset C contains [ ValueA, ValueB ]
Dataset D contains [ ValueD, ValueE, ValueF ]

Since all of them are in English, I hope to use BERT to enhance the similarity between datasets.

Approaches I have thought of:

  • Create a general y and fill with 0 the fields a given dataset does not provide. In this case, my prediction would be [ ValueA, ValueB, ValueC, ValueD, ValueE, ValueF ] (a masking variant of this is sketched below).
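
A related option is to keep the merged six-value target but mask out the entries a dataset does not label, so those positions contribute nothing to the loss instead of being trained towards 0. A standalone sketch with BCEWithLogitsLoss(reduction='none'); the mask layout here is an assumption about how the merged label matrix would be built:

import torch
import torch.nn as nn

logits = torch.randn(4, 6)         # predictions for [ValueA .. ValueF]
targets = torch.zeros(4, 6)
mask = torch.zeros(4, 6)           # 1 where the source dataset actually provides that value
mask[:, :3] = 1.0                  # e.g. rows from Dataset A only label ValueA..ValueC

per_element = nn.BCEWithLogitsLoss(reduction='none')(logits, targets)
loss = (per_element * mask).sum() / mask.sum().clamp(min=1)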

Accuracy_multilabel function probably incorrect

Hi,

def accuracy_multilabel(y_pred:Tensor, y_true:Tensor, sigmoid:bool=True):
    if sigmoid: y_pred = y_pred.sigmoid()
    outputs = np.argmax(y_pred, axis=1)
    real_vals = np.argmax(y_true, axis=1)
    return np.mean(outputs.numpy() == real_vals.numpy())

in this block.

This piece of code seems incorrect as the shape of y_pred and y_true is (Batch_size, class_space).
Doing a np.argmax with axis=1 returns a single class index value for each sample.
This is what we do for multi-class classification.

However in multi-class classification we don't normally use sigmoid on y_pred, although it is not wrong.

This function looks much more like accuracy_multiclass than accuracy_multilabel.
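
For genuinely multi-label targets, accuracy is usually computed element-wise at a threshold, or as an exact match over the whole label vector, rather than with argmax. A sketch of both variants (not the library's code):

import torch

def accuracy_elementwise(y_pred, y_true, thresh=0.5, sigmoid=True):
    # Fraction of individual (sample, label) decisions that are correct.
    if sigmoid:
        y_pred = y_pred.sigmoid()
    return ((y_pred > thresh).float() == y_true.float()).float().mean().item()

def accuracy_exact_match(y_pred, y_true, thresh=0.5, sigmoid=True):
    # A sample counts as correct only if every one of its labels matches.
    if sigmoid:
        y_pred = y_pred.sigmoid()
    return ((y_pred > thresh).float() == y_true.float()).all(dim=1).float().mean().item()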

NameError: name 'threshold' is not defined


NameError Traceback (most recent call last)
in ()
8 from pytorch_pretrained_bert.tokenization import BertTokenizer
9
---> 10 from fast_bert.data import BertDataBunch
11 from fast_bert.learner import BertLearner
12 from fast_bert.metrics import accuracy, accuracy_thresh, fbeta, roc_auc

/opt/conda/lib/python3.6/site-packages/fast_bert/init.py in ()
1 from .modeling import BertForMultiLabelSequenceClassification
2 from .data import BertDataBunch, InputExample, InputFeatures, MultiLabelTextProcessor, convert_examples_to_features
----> 3 from .metrics import accuracy, accuracy_thresh, fbeta, roc_auc, accuracy_multilabel
4 from .learner import BertLearner
5 from .prediction import BertClassificationPredictor

/opt/conda/lib/python3.6/site-packages/fast_bert/metrics.py in ()
54 return roc_auc["micro"]
55
---> 56 def Hamming_loss(y_pred:Tensor, y_true:Tensor, sigmoid:bool = True, thresh:float = threshold, sample_weight = None):
57 if sigmoid: y_pred = y_pred.sigmoid()
58 y_pred = (y_pred > thresh).float()

NameError: name 'threshold' is not defined
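
The NameError comes from the default argument thresh: float = threshold referring to a module-level name that is never defined, so importing fast_bert.metrics fails. Until that is fixed upstream, a self-contained replacement with a literal default (a sketch, not the library's exact implementation) is:

import torch

def hamming_loss(y_pred, y_true, sigmoid=True, thresh=0.5):
    # Fraction of (sample, label) positions where the thresholded prediction
    # disagrees with the ground truth.
    if sigmoid:
        y_pred = y_pred.sigmoid()
    y_pred = (y_pred > thresh).float()
    return (y_pred != y_true.float()).float().mean().item()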

[Question]:comparison of DistilBERT

I was checking the memory consumption of RoBERTa and DistilBERT. I found there is no significant difference in memory usage, although inference time is around 1 sec for DistilBERT and around 2 sec for RoBERTa.
Memory usage on CPU:
Port 9000: DistilBERT
Port 9002: RoBERTa

Have you seen any significant change in memory usage, @kaushaltrivedi?

Target Size not same as input size.

Hi,

Target size (torch.Size([0, 6])) must be the same as input size (torch.Size([32, 6]))

Below is the code.

databunch = BertDataBunch('fast-bert/sample_data/multi_label_toxic_comments/data',
                          'fast-bert/sample_data/multi_label_toxic_comments/label',
                          tokenizer,
                          train_file='train_sample.csv', val_file='val_sample.csv',
                          label_file='labels.csv', label_col=None,
                          bs=args['train_batch_size'], maxlen=args['max_seq_length'],
                          multi_gpu=multi_gpu, multi_label=True)

metrics = []
metrics.append({'name': 'accuracy', 'function': accuracy})

learner = BertLearner.from_pretrained_model(databunch,
                                            'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz',
                                            metrics, device, logger=None,
                                            finetuned_wgts_path=None,
                                            is_fp16=args['fp16'], loss_scale=args['loss_scale'],
                                            multi_gpu=multi_gpu, multi_label=True)

learner.fit(1, lr=args['learning_rate'],
            schedule_type="warmup_cosine_hard_restarts")
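
A target of shape [0, 6] against an input of shape [32, 6] means the data bunch produced empty label rows, which is consistent with label_col=None in the call above. The multi-label examples elsewhere on this page pass the list of label column names instead; a sketch for the toxic-comments sample (column names taken from those other issues, so treat them as assumptions for your data):

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

databunch = BertDataBunch('fast-bert/sample_data/multi_label_toxic_comments/data',
                          'fast-bert/sample_data/multi_label_toxic_comments/label',
                          tokenizer,
                          train_file='train_sample.csv', val_file='val_sample.csv',
                          label_file='labels.csv',
                          text_col='comment_text',      # assumption: same text column as the toxic sample
                          label_col=label_cols,         # the six label columns, not None
                          bs=args['train_batch_size'], maxlen=args['max_seq_length'],
                          multi_gpu=multi_gpu, multi_label=True)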

Error with multi_label=False in BertDataBunch

I am trying to detect lies in text, so it can either be the person telling the truth or a lie.

So this is not a multi_label problem, and therefore my BertDataBunch call looks like:


databunch = BertDataBunch(args['data_dir'], LABEL_PATH, tokenizer, train_file='train.csv', val_file='val.csv',
                          test_data='test.csv',
                          text_col="content", label_col=label_cols,
                          bs=args['train_batch_size'], maxlen=args['max_seq_length'], 
                          multi_gpu=multi_gpu, multi_label=False)

However, I then get a KeyError:

'lie 0\nName: 0, dtype: object'
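
The key 'lie 0\nName: 0, dtype: object' looks like the string form of a pandas row slice, which is what row[label_col] yields when label_col is a list of columns; this is an inference from the error string, not a confirmed cause. For a single-label problem the other examples on this page pass one column name, whose values must also appear in labels.csv. A sketch of that change, keeping the other keywords from the call above:

databunch = BertDataBunch(args['data_dir'], LABEL_PATH, tokenizer,
                          train_file='train.csv', val_file='val.csv',
                          test_data='test.csv',
                          text_col="content",
                          label_col="label",            # one column holding the class per row, not label_cols
                          bs=args['train_batch_size'], maxlen=args['max_seq_length'],
                          multi_gpu=multi_gpu, multi_label=False)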

Inference on CPU crashes

I'm unable to load a trained model for inference on my Mac which doesn't have an Nvidia GPU.
I think it is because of this line. It should have a check around it to make sure CUDA is available before being called.
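
On a machine without an NVIDIA GPU, anything that calls into torch.cuda unconditionally will fail. The standard PyTorch pattern is to pick the device with torch.cuda.is_available() and to deserialize GPU-trained checkpoints with map_location; a minimal sketch (the checkpoint filename is a placeholder):

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# map_location lets a checkpoint that was saved on a GPU be loaded onto the CPU.
state_dict = torch.load('pytorch_model.bin', map_location=device)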

TypeError: unsupported operand type(s) for /: 'str' and 'str'

Hi,
I'm getting
TypeError: unsupported operand type(s) for /: 'str' and 'str'
error when calling the BertDataBunch constructor. I'm actually surprised it works for others, because at line 294 of data_cls.py the division operator is applied to two strings:

292 self.tokenizer = tokenizer
293 self.data_dir = data_dir
--> 294 self.cache_dir = data_dir/'cache'
295 self.max_seq_length = max_seq_length
296 self.batch_size_per_gpu = batch_size_per_gpu
Thanks!

module 'torch.distributed' has no attribute 'init_process_group'

Running the following code results in the following error,

databunch = BertDataBunch(DATA_PATH, LABEL_PATH, tokenizer, 
                          train_file='train.csv', val_file='valid.csv', label_file='labels.csv',
                          bs=args['train_batch_size'], maxlen=args['max_seq_length'], 
                          multi_gpu=multi_gpu, multi_label=False)
    373                 train_sampler = RandomSampler(train_data)
    374             else:
--> 375                 torch.distributed.init_process_group(backend="nccl", 
    376                                      init_method = "tcp://localhost:23459",
    377                                      rank=0, world_size=1)

AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
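
torch.distributed only exposes init_process_group on builds where the distributed package is available (it is missing on some macOS/Windows builds), and the code path above is only taken for multi-GPU runs. A small guard before building the DataBunch avoids it (a sketch; simply passing multi_gpu=False has the same effect):

import torch
import torch.distributed as dist

# Only ask for multi-GPU behaviour when CUDA and the distributed backend exist.
multi_gpu = (
    torch.cuda.is_available()
    and torch.cuda.device_count() > 1
    and dist.is_available()        # False on builds without distributed support
)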

AttributeError: 'str' object has no attribute 'input_ids'

Hi, I ran into this error. How do I solve it?

Traceback (most recent call last):
File "fastBertDemo.py", line 23, in
model_type='bert')
File "/usr/local/python3/lib/python3.6/site-packages/fast_bert/data_cls.py", line 332, in init
train_dataset = self.get_dataset_from_examples(train_examples, 'train')
File "/usr/local/python3/lib/python3.6/site-packages/fast_bert/data_cls.py", line 431, in get_dataset_from_examples
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
File "/usr/local/python3/lib/python3.6/site-packages/fast_bert/data_cls.py", line 431, in
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
AttributeError: 'str' object has no attribute 'input_ids'

logits as a result

What do we need to specify for the labels if we need logits as a result?

Saving bin file from learner and then load it

I used learner.save_and_reload to save my model and an output file pretrained_bert.bin was produced. How can I use this .bin file and classify with learner.predict_batch()? I have been stuck for ages and I don't know how.
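
The saving/loading issue further up this page loads such a .bin through BertClassificationPredictor and then calls predict_batch on it; a sketch following that same pattern (paths are placeholders, and the constructor arguments are copied from that issue, so they may differ between fast-bert versions):

from fast_bert import BertClassificationPredictor

predictor = BertClassificationPredictor(
    model_path=MODEL_PATH + '/pretrained_bert.bin',   # the .bin written by save_and_reload
    pretrained_path=BERT_PRETRAINED_PATH,
    label_path=LABEL_PATH,
    multi_label=False)

predictions = predictor.predict_batch(list_of_texts)   # list of raw strings to classify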

Unable to load pretrained model using BertLearner

Getting "TypeError: init_weights() takes 1 positional argument but 2 were given" when running the code below for any of the bert or xlnet models. Please note that this code was working a couple of days back.

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path='bert-base-uncased',  # xlnet-large-cased, bert-base-uncased
    metrics=metrics,
    device=device_cuda,
    logger=logger,
    output_dir=OUTPUT_DIR,
    finetuned_wgts_path=None,
    warmup_steps=500,
    multi_gpu=True,
    is_fp16=True,
    multi_label=True,
    logging_steps=50)

Finetune on all layers

How can I train this model while fine-tuning all layers?

args = Box({ "run_text": "multilabel toxic comments with freezable layers", "train_size": -1, "val_size": -1, "log_path": LOG_PATH, "full_data_dir": DATA_PATH, "data_dir": DATA_PATH, "task_name": "toxic_classification_lib", "no_cuda": False, "bert_model": BERT_PRETRAINED_PATH, "output_dir": OUTPUT_PATH, "max_seq_length": 512, "do_train": True, "do_eval": True, "do_lower_case": True, "train_batch_size": 8, "eval_batch_size": 16, "learning_rate": 5e-5, "num_train_epochs": 4, "warmup_proportion": 0.0, "no_cuda": False, "local_rank": -1, "seed": 42, "gradient_accumulation_steps": 1, "optimize_on_cpu": False, "fp16": True, "fp16_opt_level": "O1", "weight_decay": 0.0, "adam_epsilon": 1e-8, "max_grad_norm": 1.0, "max_steps": -1, "warmup_steps": 500, "logging_steps": 50, "eval_all_checkpoints": True, "overwrite_output_dir": True, "overwrite_cache": False, "seed": 42, "loss_scale": 128, "task_name": 'intent', "model_name": 'bert-base-uncased', "model_type": 'bert' })

databunch = BertDataBunch(args['data_dir'], LABEL_PATH, args.model_name,
                          train_file='train.csv', val_file='val.csv', test_data='test.csv',
                          text_col="text", label_col=label_cols,
                          batch_size_per_gpu=args['train_batch_size'],
                          max_seq_length=args['max_seq_length'],
                          multi_gpu=args.multi_gpu, multi_label=True,
                          model_type=args.model_type)

learner = BertLearner.from_pretrained_model(databunch, args.model_name,
                                            metrics=metrics, device=device, logger=logger,
                                            output_dir=args.output_dir,
                                            finetuned_wgts_path=FINETUNED_PATH,
                                            warmup_steps=args.warmup_steps,
                                            multi_gpu=args.multi_gpu, is_fp16=args.fp16,
                                            multi_label=True, logging_steps=0)

Runtime Crashes on Google Colab

I was trying to create a DataBunch on Google Colab, using the Sentiment140 Twitter dataset. But no matter what batch size I use, the runtime crashes. I tried every batch size from 2 to 256 and it crashes every single time. Can anyone please help me solve the issue?

databunch = BertDataBunch(DATA_PATH, LABEL_PATH, tokenizer='xlnet-base-cased',
                          train_file='df_train2.csv', val_file='df_valid2.csv',
                          label_file='labels.csv',
                          text_col='text', label_col='label',
                          batch_size_per_gpu=2, max_seq_length=128,
                          multi_gpu=False, multi_label=False, model_type='xlnet')

This is the code where it crashes.

Torch not compiled with Cuda enabled

from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy
import logging

logger = logging.getLogger()
device_cuda = torch.device('cpu') #torch.device("cuda")
metrics = [{'name': 'accuracy', 'function': accuracy}]

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path='bert-base-uncased',
    metrics=metrics,
    device=device_cuda,
    logger=logger,
    output_dir=MODEL_PATH,
    finetuned_wgts_path=None,
    warmup_steps=500,
    multi_gpu=multi_gpu,
    is_fp16=True,
    multi_label=False,
    logging_steps=50)


AssertionError Traceback (most recent call last)
in
19 is_fp16=True,
20 multi_label=False,
---> 21 logging_steps=50)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fast_bert/learner_cls.py in from_pretrained_model(dataBunch, pretrained_path, output_dir, metrics, device, logger, finetuned_wgts_path, multi_gpu, is_fp16, loss_scale, warmup_steps, fp16_opt_level, grad_accumulation_steps, multi_label, max_grad_norm, adam_epsilon, logging_steps)
67 model = model_class[0].from_pretrained(pretrained_path, config=config)
68
---> 69 device_id = torch.cuda.current_device()
70 model.to(device)
71

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/cuda/init.py in current_device()
349 def current_device():
350 r"""Returns the index of a currently selected device."""
--> 351 _lazy_init()
352 return torch._C._cuda_getDevice()
353

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/cuda/init.py in _lazy_init()
160 raise RuntimeError(
161 "Cannot re-initialize CUDA in forked subprocess. " + msg)
--> 162 _check_driver()
163 torch._C._cuda_init()
164 _cudart = _load_cudart()

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/cuda/init.py in _check_driver()
73 def _check_driver():
74 if not hasattr(torch._C, '_cuda_isDriverSufficient'):
---> 75 raise AssertionError("Torch not compiled with CUDA enabled")
76 if not torch._C._cuda_isDriverSufficient():
77 if torch._C._cuda_getDriverVersion() == 0:

AssertionError: Torch not compiled with CUDA enabled

I can't run a model on macOS, and I was wondering if I could train without using CUDA?
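
In the version shown in this traceback, from_pretrained_model calls torch.cuda.current_device() unconditionally, so it fails on a CPU-only build before the device argument is even used; whether a later release guards this is unverified. As a generic sketch, selecting the device defensively at least makes the intent explicit (if the library still insists on CUDA, the remaining options are a CUDA machine or patching that line locally).

import torch

# Generic device-selection sketch: fall back to CPU when no CUDA build/driver is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
multi_gpu = torch.cuda.is_available() and torch.cuda.device_count() > 1
print(f"using device: {device}, multi_gpu: {multi_gpu}")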

Unresolved problem: TypeError in learner.fit when output_dir is a string

/usr/local/lib/python3.6/dist-packages/fast_bert/learner_cls.py in fit(self, epochs, lr, validate, schedule_type, optimizer_type)
211 def fit(self, epochs, lr, validate=True, schedule_type="warmup_cosine", optimizer_type='lamb'):
212
--> 213 tensorboard_dir = self.output_dir/'tensorboard'
214 tensorboard_dir.mkdir(exist_ok=True)
215 print(tensorboard_dir)

TypeError: unsupported operand type(s) for /: 'str' and 'str'
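
The '/' in output_dir/'tensorboard' is pathlib's join operator, so this error suggests output_dir was passed as a plain string. A minimal sketch of the workaround, assuming the fix is simply to hand the learner a pathlib.Path (the other variable names are placeholders from the earlier examples):

from pathlib import Path

OUTPUT_DIR = Path('output/')              # a Path, not the plain string 'output/'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path='bert-base-uncased',
    metrics=metrics,
    device=device_cuda,
    logger=logger,
    output_dir=OUTPUT_DIR,                # Path object, so output_dir / 'tensorboard' works in fit()
    is_fp16=False,
    multi_label=False,
    logging_steps=50)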

How to train on unlabeled data only, to get domain-specific embedding representations

I saw that we can train on a labeled dataset using your module, but I have a huge corpus of unlabeled text data, stored as sequences of sentences. I just want to train a language-model-style model on my data so it learns domain-specific word or sentence representations in terms of embeddings, which I can then use for downstream unsupervised tasks. Do you have any idea how I can train a pretrained BERT model on my corpus? Thank you.
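
fast-bert does ship a language-model fine-tuning path that fits this use case; the sketch below follows my recollection of the BertLMDataBunch / BertLMLearner API from the README, so treat the exact argument names as assumptions and check them against the installed version.

from fast_bert.data_lm import BertLMDataBunch
from fast_bert.learner_lm import BertLMLearner

# texts: a plain Python list of raw, unlabeled sentences/documents from the target domain
databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=texts,
    tokenizer='bert-base-uncased',
    batch_size_per_gpu=16,
    max_seq_length=512,
    multi_gpu=False,
    model_type='bert',
    logger=logger)

lm_learner = BertLMLearner.from_pretrained_model(
    dataBunch=databunch_lm,
    pretrained_path='bert-base-uncased',
    output_dir=MODEL_PATH,
    metrics=[],
    device=device_cuda,
    logger=logger,
    multi_gpu=False,
    logging_steps=50)

lm_learner.fit(epochs=1, lr=1e-4, validate=True)
lm_learner.save_model()   # the saved weights can then seed a downstream classification learner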

More than 6 multi-labels possible?

I'm trying to train fast-bert on a custom multi-labeled dataset (10 labels). It works perfectly when I strip down my dataset to only 6 labels (the same number as the provided toxic comments dataset), but when I change the number of labels to be more or fewer than that, I get the following error:

Traceback (most recent call last):
File "multilabel.py", line 149, in <module> learner.fit(args.num_train_epochs, args.learning_rate, validate=True)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/fast_bert/learner_cls.py", line 271, in fit outputs = self.model(**inputs)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__ result = self.forward(*input, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/fast_bert/modeling.py", line 194, in forward loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1, self.num_labels))
RuntimeError: shape '[-1, 10]' is invalid for input of size 36

It seems as if fast-bert is hard-coded to work with exactly 6 labels, especially considering the different errors I get when I change the batch size as follows, with 10 labels in my dataset:

batch_size = 2 --> RuntimeError: shape '[-1, 10]' is invalid for input of size 12
(2 batch_size * 6 (labels?) = 12?)

batch_size = 4 --> RuntimeError: shape '[-1, 10]' is invalid for input of size 24
(4 batch_size * 6 (labels?) = 24?)

batch_size = 6 --> RuntimeError: shape '[-1, 10]' is invalid for input of size 36
(6 batch_size * 6 (labels?) = 36?)

batch_size = 8 --> RuntimeError: shape '[-1, 10]' is invalid for input of size 48
(8 batch_size * 6 (labels?) = 48?)

Any ideas how I can get fast-bert to use more than 6 labels?
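
The batch_size × 6 pattern suggests the databunch is still emitting 6 label columns while the model head was built for 10, rather than a hard-coded limit in fast-bert. A sanity-check sketch (paths are assumed to be pathlib.Path objects, and the file/column names follow the library's multi-label examples):

import pandas as pd

# Sanity check: labels.csv, the label_col list and the training CSV must all agree on 10 labels,
# otherwise num_labels (from labels.csv) and the label tensor width will disagree.
labels = pd.read_csv(LABEL_PATH / 'labels.csv', header=None)[0].tolist()
train = pd.read_csv(DATA_PATH / 'train.csv')

print(len(labels), labels)                                    # expect 10
missing = [c for c in labels if c not in train.columns]
assert not missing, f"label columns missing from train.csv: {missing}"

databunch = BertDataBunch(DATA_PATH, LABEL_PATH, 'bert-base-uncased',
                          train_file='train.csv', val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text', label_col=labels,   # all 10 columns, not the 6 from the toxic example
                          multi_label=True, model_type='bert')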

error at learner.fit while running the new-toxic-multilabel sample notebook

learner.fit(args.num_train_epochs, args.learning_rate, validate=True)

RuntimeError Traceback (most recent call last)
in
----> 1 learner.fit(args.num_train_epochs, args.learning_rate, validate=True)

~/.conda/envs/fastbert/lib/python3.6/site-packages/fast_bert/learner_cls.py in fit(self, epochs, lr, validate, schedule_type, optimizer_type)
311 # Evaluate the model after every epoch
312 if validate:
--> 313 results = self.validate()
314 for key, value in results.items():
315 self.logger.info("eval_{} after epoch {}: {}: ".format(key, (epoch + 1), value))

~/.conda/envs/fastbert/lib/python3.6/site-packages/fast_bert/learner_cls.py in validate(self)
382 # Evaluation metrics
383 for metric in self.metrics:
--> 384 validation_scores[metric['name']] = metric['function'](all_logits, all_labels)
385
386 results = {'loss': eval_loss }

~/.conda/envs/fastbert/lib/python3.6/site-packages/fast_bert/metrics.py in accuracy_thresh(y_pred, y_true, thresh, sigmoid)
29 if sigmoid:
30 y_pred = y_pred.sigmoid()
---> 31 return ((y_pred > thresh) == y_true.byte()).float().mean().item()
32 # return np.mean(((y_pred>thresh)==y_true.byte()).float().cpu().numpy(), axis=1).sum()
33

~/.conda/envs/fastbert/lib/python3.6/site-packages/apex/amp/wrap.py in wrapper(*args, **kwargs)
51
52 if len(types) <= 1:
---> 53 return orig_fn(*args, **kwargs)
54 elif len(types) == 2 and types == set(['HalfTensor', 'FloatTensor']):
55 new_args = utils.casted_args(cast_fn,

RuntimeError: Expected object of scalar type Bool but got scalar type Byte for argument #2 'other'
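
This looks like the PyTorch >= 1.2 change where comparisons return bool tensors, so (y_pred > thresh) == y_true.byte() mixes Bool and Byte. A hedged local fix is to compare bool with bool, mirroring the metric shown in the traceback; this is not necessarily how the library itself later resolved it.

import torch

def accuracy_thresh(y_pred: torch.Tensor, y_true: torch.Tensor, thresh: float = 0.3, sigmoid: bool = True):
    """Element-wise multi-label accuracy at a fixed probability threshold."""
    if sigmoid:
        y_pred = y_pred.sigmoid()
    # `>` returns a bool tensor on PyTorch >= 1.2, so compare it against a bool tensor as well.
    return ((y_pred > thresh) == y_true.bool()).float().mean().item()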

Unable to use learner.fit() because of Apex dependencies

Hi, I'm trying to follow the notebook example provided in this repo with some of my own data. However, when I go to fit the model, I get the following:


ModuleNotFoundError Traceback (most recent call last)
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fast_bert/learner.py in get_optimizer(self, lr, num_train_steps, schedule_type)
197 try:
--> 198 from apex.optimizers import FP16_Optimizer
199 from apex.optimizers import FusedAdam

ModuleNotFoundError: No module named 'apex.optimizers'


I have installed Apex correctly using NVIDIA's documentation, and the Apex directory appears the same as in their repo, which leads me to think it's a fast-bert issue. I am using an AWS instance (ml.p3.8xlarge), and my environment is conda_pytorch_p36.

Thanks in advance for any help,

Darren
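
For what it's worth, apex.optimizers.FP16_Optimizer appears to have been dropped from newer Apex builds, so an installation can be perfectly healthy and still lack that module; treating this as a version mismatch is an assumption, not a confirmed diagnosis. One low-effort workaround sketch is to detect the missing import and fall back to full-precision training so the Apex code path is never taken (BERT_PRETRAINED_PATH and OUTPUT_PATH are placeholder names from the earlier config examples):

# Hedged workaround sketch: only request fp16 when the old Apex API this learner version
# imports is actually available; otherwise train in fp32 (slower, but removes the dependency).
try:
    from apex.optimizers import FP16_Optimizer, FusedAdam  # old Apex API used by this learner version
    fp16_available = True
except ImportError:
    fp16_available = False

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path=BERT_PRETRAINED_PATH,
    output_dir=OUTPUT_PATH,
    metrics=metrics,
    device=device,
    logger=logger,
    multi_gpu=True,
    is_fp16=fp16_available,   # fall back to fp32 when the Apex import fails
    multi_label=True)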
