bert-sklearn's People

Contributors

charles9n, ezesalta

bert-sklearn's Issues

URL for IMDb notebook

Thanks for making this available! Great stuff!

Wanted to let you know about the dead link in your README.md:

https://github.com/charles9n/bert-sklearn-tmp/blob/master/other_examples/IMDb.ipynb

should be

https://github.com/charles9n/bert-sklearn/blob/master/other_examples/IMDb.ipynb

Issue with Multi-GPU

  • transformers version: 4.3.3
  • Platform: Linux-4.15.0-132-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): 2.3.0 (True)
  • Using GPU in script?: Yes, multiple GeForce RTX 2080 Ti GPUs
  • NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2

I use os.environ["CUDA_VISIBLE_DEVICES"]="6,7" to select GPUs; everything else in the code is straightforward, using BertClassifier() as the model. Running on CPU works without this issue.

    model = BertClassifier()
    model.bert_model = 'bert-base-uncased'  # pretrained checkpoint to fine-tune
    model.max_seq_length = 512              # maximum input sequence length
    model.train_batch_size = 8              # per-step training batch size
    model.eval_batch_size = 8               # per-step evaluation batch size

I had a similar issue with Transformers and resolved it by removing the code that sets up DataParallel (see huggingface/transformers#10634). I am still not sure why this happens.

0it [00:00, ?it/s]Building sklearn text classifier...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 1320, validation data size: 146
Training  :   0%| 0/42 [00:09<?, ?it/s]
0it [00:27, ?it/s]
Traceback (most recent call last):
  File "seg_pred_skl.py", line 46, in <module>
    model.fit(X_train, y_train)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/sklearn.py", line 374, in fit
    self.model = finetune(self.model, texts_a, texts_b, labels, config)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/finetune.py", line 121, in finetune
    loss, _ = model(*batch)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/model/model.py", line 95, in forward
    output_all_encoded_layers=False)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/model/pytorch_pretrained/modeling.py", line 959, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
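
A hedged workaround sketch: the StopIteration is raised by next(self.parameters()) inside a DataParallel replica (the last frame above), a known incompatibility between the bundled pretrained-BERT code and PyTorch >= 1.5, where replicas no longer expose their parameters. Exposing only a single GPU sidesteps DataParallel entirely; the device index below is just an example.

    # Workaround sketch: make one GPU visible so bert-sklearn never wraps
    # the model in DataParallel. Set the variable before importing torch.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "6"  # any single device index

    from bert_sklearn import BertClassifier

    model = BertClassifier()
    model.bert_model = 'bert-base-uncased'
    model.max_seq_length = 512
    model.train_batch_size = 8
    model.eval_batch_size = 8
    # model.fit(X_train, y_train)  # continue as in the original script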

Problem using load_model() (training the model works fine)

When I tried to load a saved model, for example,

from bert_sklearn import load_model
m = load_model('/tmp/trained_model.bin')

I got the following output (I installed the latest bert-sklearn, with PyTorch 1.0 and Python 3.6):

...
03/15/2019 21:15:50 - INFO - pytorch_pretrained_bert.modeling -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-3b5446aaea48> in <module>()
----> 1 m = load_model('/tmp/trained_model.bin')

~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in load_model(filename)
    516 
    517     model_ctor = classes[class_name]
--> 518     model = model_ctor(restore_file = filename)
    519     return model
    520 

~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in __init__(self, label_list, bert_model, num_mlp_hiddens, num_mlp_layers, restore_file, epochs, max_seq_length, train_batch_size, eval_batch_size, learning_rate, warmup_proportion, gradient_accumulation_steps, fp16, loss_scale, local_rank, use_cuda, random_state, validation_fraction, logfile)
    115 
    116         if restore_file is not None:
--> 117             self.load_model(restore_file)
    118         else:
    119             args, _, _, values = inspect.getargvalues(inspect.currentframe())

~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in load_model(self, restore_file)
    151         else:
    152             # restore model from restore_file
--> 153             self.model, self.tokenizer, state = self.restore(restore_file)
    154             params = state['params']
    155             self.set_params(**params)

~/anaconda3/lib/python3.6/site-packages/bert_sklearn/sklearn.py in restore(self, model_filename)
    367                             model_type=model_type,
    368                             num_mlp_layers=num_mlp_layers,
--> 369                             num_mlp_hiddens=num_mlp_hiddens)
    370 
    371         return model, tokenizer, state

~/anaconda3/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py in from_pretrained(cls, pretrained_model_name, cache_dir, *inputs, **kwargs)
    502         print(inputs)
    503         print(kwargs.keys())
--> 504         model = cls(config, *inputs, **kwargs)
    505         weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
    506         state_dict = torch.load(weights_path)

TypeError: __init__() got an unexpected keyword argument 'state_dict'
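
For comparison, a minimal save/load round trip as shown in the project README (X_train and y_train are placeholders for your data). The traceback suggests a dependency mismatch: the from_pretrained frame shown above simply forwards state_dict into the model constructor, whereas newer pytorch-pretrained-bert releases handle that keyword explicitly, so checking the installed version is a likely first step.

    # Minimal round trip per the README; X_train/y_train are placeholders.
    from bert_sklearn import BertClassifier, load_model

    model = BertClassifier()
    model.fit(X_train, y_train)
    model.save('/tmp/trained_model.bin')
    m = load_model('/tmp/trained_model.bin')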

Regression for multiple labels?

Hi!
Is it possible to train a regression with multiple output variables? If I pass a multi-column label as y, the model fails.
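
For context, a hypothetical illustration of the failing call (the DataFrame and column names are invented); a single target column trains, while a multi-column y does not:

    # Hypothetical example: BertRegressor appears to accept only a 1-D target.
    from bert_sklearn import BertRegressor

    model = BertRegressor()
    model.fit(X, df['score'])                  # works: one output variable
    model.fit(X, df[['score_a', 'score_b']])   # fails: multi-column label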

load_model() error

Thank you Charles for providing this awesome package.

I ran into an error when reloading the model with load_model(). How should I fix it?

(screenshot of the error attached)

Unable to load GPU trained model on CPU

I trained this model on Google Colab on a GPU. I am trying to load it on my local machine (CPU) using load_model(), but I am unable to. I am also unable to use torch.save() to save the model; it gives an error about being unable to pickle threads.
How can I solve this?
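
A hedged sketch of the usual PyTorch remedy, mapping all tensors onto the CPU at load time. load_model does not appear to expose a map_location argument (an assumption, not documented behavior), so this temporarily patches torch.load around the call:

    # Sketch, not documented API: wrap torch.load so every tensor is
    # mapped to the CPU while bert-sklearn restores the checkpoint.
    from functools import partial
    import torch
    from bert_sklearn import load_model

    _original_load = torch.load
    torch.load = partial(_original_load, map_location=torch.device('cpu'))
    try:
        m = load_model('trained_model.bin')  # hypothetical file name
    finally:
        torch.load = _original_load  # always restore the real torch.load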

How to change hidden_dropout_prob value?

How can I change the "hidden_dropout_prob" or "attention_probs_dropout_prob" values?
I used the code below. Thank you!

from bert_sklearn import BertClassifier
from bert_sklearn import BertRegressor
from bert_sklearn import load_model

# define model
model = BertClassifier()         # text/text pair classification

# try different options..
model.max_seq_length = 512
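
Not an answer from the bert-sklearn API, just an illustration of where these values live: in pytorch_pretrained_bert they are plain attributes on the BertConfig object (the same keys printed in the model config above), so the open question is whether the estimator exposes a hook for passing in a modified config.

    # General pytorch_pretrained_bert usage, not bert-sklearn API;
    # the config file path is hypothetical.
    from pytorch_pretrained_bert.modeling import BertConfig

    config = BertConfig.from_json_file('bert_config.json')
    config.hidden_dropout_prob = 0.2
    config.attention_probs_dropout_prob = 0.2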

Tokenization for Bio NER

Thank you Charles for providing bert_sklearn.
This is more a general question than a specific issue.
In the SciBERT/BioBERT tutorial for biomedical NER, tokenization of the dataset is done manually.
I was wondering why a BERT tokenizer is not used, and how the use of a manual tokenizer affects performance.
Thanks
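
For what it's worth, a small illustration (not taken from the tutorials) of how the two tokenization levels interact: NER corpora arrive pre-split into words with one label per word, and BERT's WordPiece tokenizer is applied on top of those words, so the manual step fixes the label alignment rather than replacing the subword vocabulary.

    # Illustration only: pre-tokenized NER words are further split into
    # WordPieces by the BERT tokenizer, one word at a time.
    from pytorch_pretrained_bert import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
    words = ['Fibroblast', 'growth', 'factor']  # pre-tokenized NER input
    print([tokenizer.tokenize(w) for w in words])
    # e.g. [['Fi', '##bro', '##blast'], ['growth'], ['factor']]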

Is there a way to access the embeddings of the training set of a fitted model?

Hi! I'm using this model for a multiclass text classification problem and the results I get are not as good as I expected. I suspect that one problem might be that my classes are not well defined and are overlapping (i.e. there are classes that are too similar to each other). In order to verify this I would like to make some analysis on the embeddings that the model is using internally. Is there a way to retrieve/access the internal embeddings of the training set? I haven't been able to find a way so far.

Thanks in advance!
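
A sketch under stated assumptions: judging by the tracebacks elsewhere on this page, a fitted estimator keeps the fine-tuned network as model.model with a .bert submodule, plus a .tokenizer attribute. Running a text through them yields the pooled [CLS] embedding that feeds the classifier head. These attribute names are inferred from the source, not documented API.

    # Assumption-laden sketch: model.model.bert and model.tokenizer are
    # internal attributes of a fitted BertClassifier, not public API.
    import torch

    bert = model.model.bert
    tokenizer = model.tokenizer
    bert.eval()

    def embed(text):
        tokens = ['[CLS]'] + tokenizer.tokenize(text)[:510] + ['[SEP]']
        ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
        with torch.no_grad():
            # pytorch_pretrained_bert's BertModel returns (layers, pooled);
            # move ids to the model's device first if it lives on a GPU
            _, pooled = bert(ids, output_all_encoded_layers=False)
        return pooled[0]  # 768-dim sentence embedding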

NameError

Thank you for providing this work. I get an error in model.fit(): "NameError: name 'pretrained_model_name_or_path' is not defined".

GPU memory leak ?

I use bert-sklearn in a benchmark scenario,
so I repeatedly construct and use BertClassifiers, like this:

m1 = BertClassifier( bert_model="biobert-base-cased")
m1.fit(..)
m1.predict(..)
m1.save(..)

....

m2 = BertClassifier( )
m2.fit(..)
m2.predict(..)
m2.save(..)

Doing so fails when the second classifier is used, with an "out of GPU memory" error. Executing the code with only one model at a time works.

So I suppose there is a GPU memory leak somewhere. Or do I need to do something special to free the memory?
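
A hedged sketch of the usual manual cleanup between runs, assuming the failure is simply the previous model still holding its CUDA tensors:

    # Standard PyTorch housekeeping between benchmark runs; whether
    # bert-sklearn needs an explicit release hook is the open question.
    import gc
    import torch

    m1 = BertClassifier(bert_model="biobert-base-cased")
    m1.fit(X_train, y_train)
    m1.save("m1.bin")

    del m1                    # drop the last reference to the model
    gc.collect()              # collect any cycles still pinning tensors
    torch.cuda.empty_cache()  # hand cached blocks back to the allocator

    m2 = BertClassifier()
    m2.fit(X_train, y_train)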

PyPI package

Hello,

Thank you for the excellent work on bert-sklearn!

I can install it fine from the repository, but to distribute another package that relies on bert-sklearn it would be great to be able to install it from the Python Package Index. Would it be possible to publish a package there? I can also do it, if that is fine with you. Is there anything I should be aware of before starting?

Edit: I managed to install with pip using the GitHub URL, so this is no longer needed.
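
For anyone landing here, the command was presumably of the form:

    pip install git+https://github.com/charles9n/bert-sklearn.git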

Kind regards,
Fabio
