charles9n / bert-sklearn
A sklearn wrapper for Google's BERT model.
License: Apache License 2.0
Thanks for making this available! Great stuff!
Wanted to let you know about the dead link in your README.md:
https://github.com/charles9n/bert-sklearn-tmp/blob/master/other_examples/IMDb.ipynb
should be
https://github.com/charles9n/bert-sklearn/blob/master/other_examples/IMDb.ipynb
Some parameters like "epochs" should be an argument to the fit function and not to the model constructor.
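For illustration, a minimal sketch of the two styles (X_train/y_train are placeholders; epochs is currently a constructor parameter, as the __init__ signature further down also shows):

from bert_sklearn import BertClassifier

# current API: the number of epochs is fixed when the model is constructed
model = BertClassifier(epochs=4)
model.fit(X_train, y_train)

# proposed API (not implemented): pass it per fit() call instead
# model.fit(X_train, y_train, epochs=4)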
transformers version: 4.3.3
I use os.environ["CUDA_VISIBLE_DEVICES"]="6,7" to choose GPUs, and everything else in the code is pretty straightforward, using BertClassifier() as the model. I am able to run it on CPU with no such issue.
model = BertClassifier()
model.bert_model = 'bert-base-uncased'
model.max_seq_length = 512
model.train_batch_size = 8
model.eval_batch_size = 8
I had a similar issue with Transformers, which I resolved by removing the bits of code that set up DataParallel (huggingface/transformers#10634). I am still not sure why this happens.
0it [00:00, ?it/s]Building sklearn text classifier...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 1320, validation data size: 146
Training : 0%| | 0/42 [00:09<?, ?it/s]
0it [00:27, ?it/s] | 0/42 [00:00<?, ?it/s]
Traceback (most recent call last):
File "seg_pred_skl.py", line 46, in <module>
model.fit(X_train, y_train)
File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/sklearn.py", line 374, in fit
self.model = finetune(self.model, texts_a, texts_b, labels, config)
File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/finetune.py", line 121, in finetune
loss, _ = model(*batch)
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/model/model.py", line 95, in forward
output_all_encoded_layers=False)
File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/model/pytorch_pretrained/modeling.py", line 959, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
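For anyone hitting this: the failing line is next(self.parameters()) inside forward(), which raises StopIteration in nn.DataParallel replicas on PyTorch >= 1.5 because replicas no longer expose their parameters. A minimal workaround sketch, if a single GPU is enough, is to hide the other devices so the model is never wrapped in DataParallel (X_train/y_train as in the script above):

import os
# expose a single GPU before anything CUDA-related runs,
# so bert-sklearn never wraps the model in nn.DataParallel
os.environ["CUDA_VISIBLE_DEVICES"] = "6"

from bert_sklearn import BertClassifier
model = BertClassifier()
model.fit(X_train, y_train)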
When I tried to load a saved model, for example,
from bert_sklearn import load_model
m = load_model('/tmp/trained_model.bin')
I got the following output (I installed the latest BERT version, with PyTorch=1.0 and Python=3.6):
...
03/15/2019 21:15:50 - INFO - pytorch_pretrained_bert.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-3b5446aaea48> in <module>()
----> 1 m = load_model('/tmp/trained_model.bin')
~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in load_model(filename)
516
517 model_ctor = classes[class_name]
--> 518 model = model_ctor(restore_file = filename)
519 return model
520
~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in __init__(self, label_list, bert_model, num_mlp_hiddens, num_mlp_layers, restore_file, epochs, max_seq_length, train_batch_size, eval_batch_size, learning_rate, warmup_proportion, gradient_accumulation_steps, fp16, loss_scale, local_rank, use_cuda, random_state, validation_fraction, logfile)
115
116 if restore_file is not None:
--> 117 self.load_model(restore_file)
118 else:
119 args, _, _, values = inspect.getargvalues(inspect.currentframe())
~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in load_model(self, restore_file)
151 else:
152 # restore model from restore_file
--> 153 self.model, self.tokenizer, state = self.restore(restore_file)
154 params = state['params']
155 self.set_params(**params)
~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in restore(self, model_filename)
367 model_type=model_type,
368 num_mlp_layers=num_mlp_layers,
--> 369 num_mlp_hiddens=num_mlp_hiddens)
370
371 return model, tokenizer, state
~/anaconda3/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py in from_pretrained(cls, pretrained_model_name, cache_dir, *inputs, **kwargs)
502 print(inputs)
503 print(kwargs.keys())
--> 504 model = cls(config, *inputs, **kwargs)
505 weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
506 state_dict = torch.load(weights_path)
TypeError: __init__() got an unexpected keyword argument 'state_dict'
According to the subject.
Hi!
Is it possible to train a regression for multiple output variables? If I pass a multi-column label as y, the model fails.
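One workaround sketch while multi-output is unsupported: fit one BertRegressor per output column. Expensive, but it works with the current API (X_new is a placeholder for new texts, and y is assumed to be a 2-D numpy array):

import numpy as np
from bert_sklearn import BertRegressor

# one regressor per target column
models = [BertRegressor().fit(X, y[:, j]) for j in range(y.shape[1])]
y_pred = np.column_stack([m.predict(X_new) for m in models])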
When I changed the import of PreTrainedBertModel to import BertPreTrainedModel as PreTrainedBertModel, it seemed to be a quick temporary fix.
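For reference, that shim is a one-line import alias (newer pytorch_pretrained_bert releases renamed the class; the exact module path depends on where the import error occurs):

from pytorch_pretrained_bert.modeling import BertPreTrainedModel as PreTrainedBertModel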
Which version of scikit-learn do you use for this project?
Thanks.
I trained this model on Google Colab on a GPU. I am trying to load the model on my local machine (CPU) using load_model(), but I am unable to load it. I am also unable to use torch.save() to save the model; it gives an error about being unable to pickle threads.
How can I solve this?
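One workaround sketch that may help, assuming model.model is the underlying torch module once fit() has run (the tracebacks above suggest it is): move the network to CPU on the Colab side before saving, and use the wrapper's own save() rather than torch.save() on the whole estimator:

# on the Colab side, after training
model.model.to('cpu')            # drop the CUDA references (model.model is an assumption)
model.save('trained_model.bin')  # bert-sklearn's save(), not torch.save() on the wrapper

# on the local (CPU-only) machine
from bert_sklearn import load_model
m = load_model('trained_model.bin')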
How can I change the "hidden_dropout_prob" or "attention_probs_dropout_prob" values?
I used the code below. Thank you!
from bert_sklearn import BertClassifier
from bert_sklearn import BertRegressor
from bert_sklearn import load_model
# define model
model = BertClassifier() # text/text pair classification
# try different options..
model.max_seq_length = 512
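As far as I can tell, BertClassifier does not expose those config values, so here is an untested sketch that overwrites every nn.Dropout module after the network has been built. Note that model.model only exists once fit() or load_model() has run, so this cannot change the dropout used during the initial fine-tuning:

import torch

# both hidden and attention dropout are plain nn.Dropout modules in
# pytorch_pretrained_bert, so one loop catches them all
for module in model.model.modules():
    if isinstance(module, torch.nn.Dropout):
        module.p = 0.2  # new dropout probability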
Thank you Charles for providing bert_sklearn.
This is more a general question than a specific issue.
In the SciBERT/BioBERT tutorial for biomedical NER, tokenisation of the dataset is done manually.
I was wondering why a BERT tokenizer is not used, and how the use of a manual tokenizer affects performance?
Thanks
Hi! I'm using this model for a multiclass text classification problem, and the results I get are not as good as I expected. I suspect one problem might be that my classes are not well defined and overlap (i.e. some classes are too similar to each other). To verify this, I would like to do some analysis of the embeddings the model uses internally. Is there a way to retrieve or access the internal embeddings of the training set? I haven't been able to find one so far.
Thanks in advance!
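A rough sketch of one way to get at the internals: after fit(), the wrapper keeps the tokenizer and the BERT module around (model.tokenizer and model.model.bert are internals, not public API), so you can run texts through BERT yourself and collect the pooled [CLS] vectors:

import torch

def cls_embedding(model, text):
    # tokenize one text and return BERT's pooled [CLS] vector
    tokens = ['[CLS]'] + model.tokenizer.tokenize(text)[:model.max_seq_length - 2] + ['[SEP]']
    input_ids = torch.tensor([model.tokenizer.convert_tokens_to_ids(tokens)])
    device = next(model.model.parameters()).device  # follow the model to GPU if needed
    model.model.eval()
    with torch.no_grad():
        _, pooled = model.model.bert(input_ids.to(device), output_all_encoded_layers=False)
    return pooled[0].cpu()

train_embeddings = torch.stack([cls_embedding(model, t) for t in X_train])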
I use bert-sklearn in a benchmark scenario,
so I repeatedly construct and use BertClassifiers, like this:
m1 = BertClassifier(bert_model="biobert-base-cased")
m1.fit(..)
m1.predict(..)
m1.save(..)
....
m2 = BertClassifier()
m2.fit(..)
m2.predict(..)
m2.save(..)
Doing so fails when using the second classifier, with an "out of GPU memory" error.
Executing the code with only one model at a time works.
So I suppose there is a GPU memory leak somewhere. Or do I need to do something special to free memory?
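A sketch of the usual mitigation (plain PyTorch, nothing bert-sklearn-specific): drop all references to the previous classifier and flush the caching allocator before constructing the next one:

import gc
import torch

m1.save('m1.bin')
del m1                    # drop the Python reference to the old model
gc.collect()              # make sure it is actually collected
torch.cuda.empty_cache()  # return cached blocks to the CUDA driver

m2 = BertClassifier()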
Hello,
Thank you for the excellent work on bert-sklearn!
I can install it fine from the repository, but to distribute another package that relies on bert-sklearn it would be great to be able to install from the Python Package Index. Would it be possible to add a package there? I can also do it myself if that is fine with you. Is there anything I should be aware of before starting?
Edit: I managed to use the GitHub URL to install with pip, so this is no longer blocking.
Kind regards,
Fabio
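For reference, the GitHub install mentioned in the edit uses pip's standard VCS syntax (untested here, but the repository's clone-and-install instructions suggest it is pip-installable):

pip install git+https://github.com/charles9n/bert-sklearn.git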