charles9n / bert-sklearn
A sklearn wrapper for Google's BERT model.
License: Apache License 2.0
Thanks for making this available! Great stuff!
Wanted to let you know about the dead link in your README.md:
https://github.com/charles9n/bert-sklearn-tmp/blob/master/other_examples/IMDb.ipynb
should be
https://github.com/charles9n/bert-sklearn/blob/master/other_examples/IMDb.ipynb
Some parameters like "epochs" should be an argument to the fit function and not to the model constructor.
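For illustration, a minimal sketch of the two styles (X_train/y_train are placeholders; epochs is currently a constructor parameter, as the __init__ signature further down also shows):

from bert_sklearn import BertClassifier

# current API: the number of epochs is fixed when the model is constructed
model = BertClassifier(epochs=4)
model.fit(X_train, y_train)

# proposed API (not implemented): pass it per fit() call instead
# model.fit(X_train, y_train, epochs=4)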
transformers version: 4.3.3
I use os.environ["CUDA_VISIBLE_DEVICES"]="6,7" to choose GPUs, and everything else in the code is pretty straightforward, using BertClassifier() as the model. I am able to run it on CPU with no such issue.
model = BertClassifier()
model.bert_model = 'bert-base-uncased'
model.max_seq_length = 512
model.train_batch_size = 8
model.eval_batch_size = 8
I had a similar issue with Transformers, which I resolved by removing the bits of code that set up DataParallel (huggingface/transformers#10634). I am still not sure why this happens.
0it [00:00, ?it/s]Building sklearn text classifier...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 1320, validation data size: 146
Training : 0%| | 0/42 [00:09<?, ?it/s]
0it [00:27, ?it/s] | 0/42 [00:00<?, ?it/s]
Traceback (most recent call last):
File "seg_pred_skl.py", line 46, in <module>
model.fit(X_train, y_train)
File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/sklearn.py", line 374, in fit
self.model = finetune(self.model, texts_a, texts_b, labels, config)
File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/finetune.py", line 121, in finetune
loss, _ = model(*batch)
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/model/model.py", line 95, in forward
output_all_encoded_layers=False)
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/model/pytorch_pretrained/modeling.py", line 959, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
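For anyone hitting this: the failing line is next(self.parameters()) inside forward(), which raises StopIteration in nn.DataParallel replicas on PyTorch >= 1.5 because replicas no longer expose their parameters. A minimal workaround sketch, if a single GPU is enough, is to hide the other devices so the model is never wrapped in DataParallel (X_train/y_train as in the script above):

import os
# expose a single GPU before anything CUDA-related runs,
# so bert-sklearn never wraps the model in nn.DataParallel
os.environ["CUDA_VISIBLE_DEVICES"] = "6"

from bert_sklearn import BertClassifier
model = BertClassifier()
model.fit(X_train, y_train)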
When I tried to load a saved model, for example,
from bert_sklearn import load_model
m = load_model('/tmp/trained_model.bin')
I got the following output (I installed the latest BERT version, with PyTorch=1.0 and Python=3.6):
...
03/15/2019 21:15:50 - INFO - pytorch_pretrained_bert.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-3b5446aaea48> in <module>()
----> 1 m = load_model('/tmp/trained_model.bin')
~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in load_model(filename)
516
517 model_ctor = classes[class_name]
--> 518 model = model_ctor(restore_file = filename)
519 return model
520
~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in __init__(self, label_list, bert_model, num_mlp_hiddens, num_mlp_layers, restore_file, epochs, max_seq_length, train_batch_size, eval_batch_size, learning_rate, warmup_proportion, gradient_accumulation_steps, fp16, loss_scale, local_rank, use_cuda, random_state, validation_fraction, logfile)
115
116 if restore_file is not None:
--> 117 self.load_model(restore_file)
118 else:
119 args, _, _, values = inspect.getargvalues(inspect.currentframe())
~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in load_model(self, restore_file)
151 else:
152 # restore model from restore_file
--> 153 self.model, self.tokenizer, state = self.restore(restore_file)
154 params = state['params']
155 self.set_params(**params)
~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in restore(self, model_filename)
367 model_type=model_type,
368 num_mlp_layers=num_mlp_layers,
--> 369 num_mlp_hiddens=num_mlp_hiddens)
370
371 return model, tokenizer, state
~/anaconda3/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py in from_pretrained(cls, pretrained_model_name, cache_dir, *inputs, **kwargs)
502 print(inputs)
503 print(kwargs.keys())
--> 504 model = cls(config, *inputs, **kwargs)
505 weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
506 state_dict = torch.load(weights_path)
TypeError: __init__() got an unexpected keyword argument 'state_dict'
According to the subject.
Hi!
Is it possible to train a regression for multiple output variables? If I pass a multi-column label as y, the model fails.
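One workaround sketch while multi-output is unsupported: fit one BertRegressor per output column. Expensive, but it works with the current API (X_new is a placeholder for new texts, and y is assumed to be a 2-D numpy array):

import numpy as np
from bert_sklearn import BertRegressor

# one regressor per target column
models = [BertRegressor().fit(X, y[:, j]) for j in range(y.shape[1])]
y_pred = np.column_stack([m.predict(X_new) for m in models])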
When I changed the import of PreTrainedBertModel to import BertPreTrainedModel as PreTrainedBertModel, it seemed to be a quick temporary fix.
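For reference, that shim is a one-line import alias (newer pytorch_pretrained_bert releases renamed the class; the exact module path depends on where the import error occurs):

from pytorch_pretrained_bert.modeling import BertPreTrainedModel as PreTrainedBertModel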
Which version of scikit-learn do you use for this project?
Thanks.
I trained this model on Google Colab on a GPU. I am trying to load the model on my local machine (CPU) using load_model(), but I am unable to load it. I am also unable to use torch.save() to save the model; it gives an error about being unable to pickle threads.
How can I solve this?
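One workaround sketch that may help, assuming model.model is the underlying torch module once fit() has run (the tracebacks above suggest it is): move the network to CPU on the Colab side before saving, and use the wrapper's own save() rather than torch.save() on the whole estimator:

# on the Colab side, after training
model.model.to('cpu')            # drop the CUDA references (model.model is an assumption)
model.save('trained_model.bin')  # bert-sklearn's save(), not torch.save() on the wrapper

# on the local (CPU-only) machine
from bert_sklearn import load_model
m = load_model('trained_model.bin')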
How can I change the "hidden_dropout_prob" or "attention_probs_dropout_prob" values?
I used the code below. Thank you!
from bert_sklearn import BertClassifier
from bert_sklearn import BertRegressor
from bert_sklearn import load_model
# define model
model = BertClassifier() # text/text pair classification
# try different options..
model.max_seq_length = 512
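As far as I can tell, BertClassifier does not expose those config values, so here is an untested sketch that overwrites every nn.Dropout module after the network has been built. Note that model.model only exists once fit() or load_model() has run, so this cannot change the dropout used during the initial fine-tuning:

import torch

# both hidden and attention dropout are plain nn.Dropout modules in
# pytorch_pretrained_bert, so one loop catches them all
for module in model.model.modules():
    if isinstance(module, torch.nn.Dropout):
        module.p = 0.2  # new dropout probability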
Thank you Charles for providing bert_sklearn.
This is more a general question than a specific issue.
In the SciBERT/BioBERT tutorial for biomedical NER, tokenisation of the dataset is done manually.
I was wondering why a BERT tokenizer is not used, and how the use of a manual tokenizer affects performance?
Thanks
Hi! I'm using this model for a multiclass text classification problem, and the results I get are not as good as I expected. I suspect one problem might be that my classes are not well defined and overlap (i.e. some classes are too similar to each other). To verify this, I would like to do some analysis of the embeddings the model uses internally. Is there a way to retrieve or access the internal embeddings of the training set? I haven't been able to find one so far.
Thanks in advance!
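A rough sketch of one way to get at the internals: after fit(), the wrapper keeps the tokenizer and the BERT module around (model.tokenizer and model.model.bert are internals, not public API), so you can run texts through BERT yourself and collect the pooled [CLS] vectors:

import torch

def cls_embedding(model, text):
    # tokenize one text and return BERT's pooled [CLS] vector
    tokens = ['[CLS]'] + model.tokenizer.tokenize(text)[:model.max_seq_length - 2] + ['[SEP]']
    input_ids = torch.tensor([model.tokenizer.convert_tokens_to_ids(tokens)])
    device = next(model.model.parameters()).device  # follow the model to GPU if needed
    model.model.eval()
    with torch.no_grad():
        _, pooled = model.model.bert(input_ids.to(device), output_all_encoded_layers=False)
    return pooled[0].cpu()

train_embeddings = torch.stack([cls_embedding(model, t) for t in X_train])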
I use bert-sklearn in a benchmark scenario,
so I repeatedly construct and use BertClassifiers, like this:
m1 = BertClassifier(bert_model="biobert-base-cased")
m1.fit(..)
m1.predict(..)
m1.save(..)
....
m2 = BertClassifier()
m2.fit(..)
m2.predict(..)
m2.save(..)
Doing so fails when using the second classifier, with an "out of GPU memory" error.
Executing the code with only one model at a time works.
So I suppose there is a GPU memory leak somewhere. Or do I need to do something special to free memory?
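A sketch of the usual mitigation (plain PyTorch, nothing bert-sklearn-specific): drop all references to the previous classifier and flush the caching allocator before constructing the next one:

import gc
import torch

m1.save('m1.bin')
del m1                    # drop the Python reference to the old model
gc.collect()              # make sure it is actually collected
torch.cuda.empty_cache()  # return cached blocks to the CUDA driver

m2 = BertClassifier()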
Hello,
Thank you for the excellent work on bert-sklearn!
I can install it fine from the repository, but to distribute another package that relies on bert-sklearn it would be great to be able to install from the Python Package Index. Would it be possible to add a package there? I can also do it myself if that is fine with you. Is there anything I should be aware of before starting?
Edit: I managed to use the GitHub URL to install with pip, so this is no longer blocking.
Kind regards,
Fabio
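For reference, the GitHub install mentioned in the edit uses pip's standard VCS syntax (untested here, but the repository's clone-and-install instructions suggest it is pip-installable):

pip install git+https://github.com/charles9n/bert-sklearn.git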