kaushaltrivedi / bert-toxic-comments-multilabel
Multilabel classification for the Toxic Comments challenge using BERT
Dear @kaushaltrivedi,
Apart from the loss function you mentioned in the blog, what are your suggestions for using this implementation for multi-class, multi-label classification?
Thanks.
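For what it's worth, a minimal sketch of the usual multi-label setup (not necessarily the author's exact code): replace CrossEntropyLoss with BCEWithLogitsLoss, so each label gets an independent sigmoid output.

```python
import torch
import torch.nn as nn

# A minimal sketch, assuming six labels as in the Toxic Comments task.
# BCEWithLogitsLoss treats each output unit as an independent binary task.
num_labels = 6
logits = torch.randn(4, num_labels)                    # model outputs, batch of 4
labels = torch.randint(0, 2, (4, num_labels)).float()  # multi-hot targets, as floats

loss_fct = nn.BCEWithLogitsLoss()
loss = loss_fct(logits, labels)
print(loss.item())
```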
I tried to rerun the code and get a FileNotFoundError. Where can I download pytorch_model.bin, or what method do I need to use? I can't find anything helpful on the net.
I tried to download the model with BertModel.from_pretrained('bert-base-uncased'), but couldn't find the pytorch_model.bin.
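In case it helps: from_pretrained downloads the weights into a local cache rather than the working directory, so pytorch_model.bin never shows up next to the notebook. A hedged sketch (cache_dir is an example path):

```python
from pytorch_pretrained_bert import BertModel

# Downloads and caches the weights; pytorch_model.bin ends up inside cache_dir,
# not in the current working directory.
model = BertModel.from_pretrained('bert-base-uncased', cache_dir='bert_cache/')
```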
Hi, thanks for sharing this wonderful work.
I recently tried to re-run the code and could not import pytorch_pretrained_bert correctly. I figured out that the module in huggingface/pytorch-pretrained-BERT has been renamed from PreTrainedBertModel to BertPreTrainedModel. Just a reminder for those facing the same issue.
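A hedged compatibility shim along those lines, for code that still imports the old name:

```python
# Newer pytorch-pretrained-BERT renamed PreTrainedBertModel to BertPreTrainedModel;
# fall back to the old name on older installs.
try:
    from pytorch_pretrained_bert.modeling import BertPreTrainedModel
except ImportError:
    from pytorch_pretrained_bert.modeling import PreTrainedBertModel as BertPreTrainedModel
```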
Hi,
I am getting issue in installation of apex
As given in the blog, I tried the following commands:
!git clone https://github.com/NVIDIA/apex
cd apex
!pip install -v --no-cache-dir --global-option="--pyprof" --global-option="--cpp_ext" --global-option="--cuda_ext" ./
But the installation fails with an error message.
Kindly help as I am stuck.
Thanks,
Deepti
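One thing worth checking: the pip command needs a build target at the end (the trailing ./ pointing at the cloned apex checkout), and the --pyprof build option was removed from apex at some point, so dropping --global-option="--pyprof" may help if the build fails on it.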
I am having trouble with this line:
model = BertForMultiLabelSequenceClassification.from_pretrained(bert_model_path, num_labels=num_labels)
Where bert_model_path is a path to a pytorch_model.bin.tar.gz file.
First, I get a complaint that the bert_config.json file (in the same folder) is not in the new temp folder. If I move it there manually, I get an error (an INFO message really) saying:
Weights of MultiLabelBert not initialized from pretrained model
Is this a bug or am I missing something?
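For context, pytorch-pretrained-bert unpacks a .tar.gz model into a temp folder and expects bert_config.json inside the archive itself, which would explain the first complaint. And the "Weights ... not initialized" line is an expected INFO message for a fresh classification head: those layers are not in the pretrained checkpoint and stay randomly initialized until fine-tuning. A hedged sketch of packaging the archive (the file names are the library's expected names; the archive path is an example):

```python
import tarfile

# Bundle the config together with the weights so the unpacked temp folder
# contains both files.
with tarfile.open("bert_model.tar.gz", "w:gz") as archive:
    archive.add("pytorch_model.bin", arcname="pytorch_model.bin")
    archive.add("bert_config.json", arcname="bert_config.json")
```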
I run the code in a Jupyter notebook cell by cell. When it reaches the fit() function it starts to train, but then crashes with Segmentation fault (core dumped). Does anyone know how to solve this? Thanks a lot!
While running the notebook I'm stuck at the above mentioned error. The code is:
eval_examples = processor.get_dev_examples(args['data_dir'], size=args['val_size'])
def eval():
......
Error:
FileNotFoundError                         Traceback (most recent call last)
<ipython-input> in <module>
      1 # Eval Fn
----> 2 eval_examples = processor.get_dev_examples(args['data_dir'], size=args['val_size'])
      3 def eval():
      4     args['output_dir'].mkdir(exist_ok=True)
      5
<ipython-input> in get_dev_examples(self, data_dir, size)
     22         filename = 'val.csv'
     23         if size == -1:
---> 24             data_df = pd.read_csv(os.path.join(data_dir, filename))
     25             # data_df['comment_text'] = data_df['comment_text'].apply(cleanHtml)
     26             return self._create_examples(data_df, "dev")
/anaconda/envs/py36/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    676                     skip_blank_lines=skip_blank_lines)
    677
--> 678         return _read(filepath_or_buffer, kwds)
    679
    680     parser_f.__name__ = name
/anaconda/envs/py36/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    438
    439     # Create the parser.
--> 440     parser = TextFileReader(filepath_or_buffer, **kwds)
    441
    442     if chunksize or iterator:
/anaconda/envs/py36/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    785                 self.options['has_index_names'] = kwds['has_index_names']
    786
--> 787         self._make_engine(self.engine)
    788
    789     def close(self):
/anaconda/envs/py36/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1012     def _make_engine(self, engine='c'):
   1013         if engine == 'c':
-> 1014             self._engine = CParserWrapper(self.f, **self.options)
   1015         else:
   1016             if engine == 'python':
/anaconda/envs/py36/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1706             kwds['usecols'] = self.usecols
   1707
-> 1708         self._reader = parsers.TextReader(src, **kwds)
   1709
   1710         passed_names = self.names is None
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()
FileNotFoundError: File b'kaggle_data/toxic_comments/tmp/val.csv' does not exist
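The notebook expects a val.csv alongside train.csv, which the Kaggle download does not include. A hedged sketch of creating one by splitting the Kaggle train.csv (the directory path matches the error above; the 10% split is an arbitrary choice):

```python
import os
import pandas as pd

DATA_PATH = 'kaggle_data/toxic_comments/tmp'

# Hold out 10% of the Kaggle training data as a validation split.
df = pd.read_csv(os.path.join(DATA_PATH, 'train.csv'))
val_df = df.sample(frac=0.1, random_state=42)
train_df = df.drop(val_df.index)

train_df.to_csv(os.path.join(DATA_PATH, 'train.csv'), index=False)
val_df.to_csv(os.path.join(DATA_PATH, 'val.csv'), index=False)
```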
Great work, thanks for sharing, @kaushaltrivedi . I am trying to run this code, but having issues related to the input data. I downloaded the dataset from Kaggle. But, it looks like there are missing files like the classes.txt. Could you please explain the format of those?
Thanks.
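Judging from how the code reads it (pd.read_csv with header=None, taking a single column), classes.txt is most likely just one label name per line. For the Toxic Comments task, that would presumably be the six label columns from the Kaggle csv:

```
toxic
severe_toxic
obscene
threat
insult
identity_hate
```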
model.module.freeze_bert_encoder() and model.module.unfreeze_bert_encoder() produce an error. Calling those methods from model works fine.
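That pattern usually means the model is only wrapped in torch.nn.DataParallel on multi-GPU runs, where the wrapper exposes the real model as .module. A hedged guard that works either way:

```python
# Unwrap a DataParallel wrapper if present, otherwise use the model directly.
def freeze_encoder(model):
    target = model.module if hasattr(model, 'module') else model
    target.freeze_bert_encoder()
```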
I have loaded the data (Toxic dataset) and tried to run the model using batch_size_per_gpu = 4, but I am getting the error below.
ValueError: Target size (torch.Size([0, 6])) must be the same as input size (torch.Size([4, 6]))
Could you please help here?
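A target shape of torch.Size([0, 6]) means the label tensor for that batch is empty while the logits are not; one possible cause is that the label columns were never picked up when the features were built (for example, an eval set without labels, or a dataframe slice that dropped the six label columns), so that is worth checking before suspecting the loss itself.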
While reading the code, I found that you pass label_ids as a list. Although you define a dict named 'label_map', you never convert label_ids to floats. Is anything wrong with that?
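For reference, BCEWithLogitsLoss does need float targets, so the multi-hot label lists have to be cast somewhere. A hedged sketch of that conversion (label_lists is example data):

```python
import torch

label_lists = [[1, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0]]  # example multi-hot labels
all_label_ids = torch.tensor(label_lists, dtype=torch.float)
```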
I am having issues while importing apex. I get an error similar to the ones posted in run_classifier repository.
<ipython-input> in <module>
----> 1 import apex
      2 import pandas as pd
      3 import numpy as np
      4 import torch
      5
/anaconda3/lib/python3.7/site-packages/apex/__init__.py in <module>
     16 from apex.exceptions import (ApexAuthSecret,
     17                              ApexSessionSecret)
---> 18 from apex.interfaces import (ApexImplementation,
     19                              IApex)
     20 from apex.lib.libapex import (groupfinder,
/anaconda3/lib/python3.7/site-packages/apex/interfaces.py in <module>
      8     pass
      9
---> 10 class ApexImplementation(object):
     11     """ Class so that we can tell if Apex is installed from other
     12         applications
/anaconda3/lib/python3.7/site-packages/apex/interfaces.py in ApexImplementation()
     12         applications
     13     """
---> 14     implements(IApex)
/anaconda3/lib/python3.7/site-packages/zope/interface/declarations.py in implements(*interfaces)
    481         # the coverage for this block there. :(
    482         if PYTHON3:
--> 483             raise TypeError(_ADVICE_ERROR % 'implementer')
    484         _implements("implements", interfaces, classImplements)
    485
TypeError: Class advice impossible in Python3. Use the @implementer class decorator instead.
Could anyone help me with this?
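From the traceback, the imported apex appears to be the unrelated Pyramid authentication library that PyPI publishes under the name apex (it is the package with an apex.interfaces module and a zope.interface dependency), not NVIDIA's apex. If so, pip uninstall apex and then building NVIDIA apex from https://github.com/NVIDIA/apex should fix the import.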
When I run the notebook up to
model.module.unfreeze_bert_encoder()
I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-48-e5502767395c> in <module>
----> 1 model.module.unfreeze_bert_encoder()
c:\users\jiang\.conda\envs\python3.6\lib\site-packages\torch\nn\modules\module.py in __getattr__(self, name)
946 return modules[name]
947 raise AttributeError("'{}' object has no attribute '{}'".format(
--> 948 type(self).__name__, name))
949
950 def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:
AttributeError: 'BertForMultiLabelSequenceClassification' object has no attribute 'module'
What am I missing?
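As in the freeze/unfreeze report above, .module only exists when the model is wrapped in torch.nn.DataParallel (typically on multi-GPU runs); on a bare model, calling model.unfreeze_bert_encoder() directly should work.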
I would like to split my data and train the model in small chunks. So after one training run, when the model is saved, I would like to load that model, but instead of making predictions I want to train it further. Is that possible?
@kaushaltrivedi thanks in advance
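This should be possible; a hedged sketch of the save/reload cycle (the class and from_pretrained call follow the repo's notebook, but the file names are examples):

```python
import torch

# After one training run, save the fine-tuned weights.
torch.save(model.state_dict(), 'finetuned_model.bin')

# Later: rebuild the same architecture, load the saved weights, and keep
# training on the next chunk of data instead of predicting.
model = BertForMultiLabelSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=num_labels)
model.load_state_dict(torch.load('finetuned_model.bin'))
```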
Hi, I am trying to run the BERT multilabel classification and was wondering what is contained in the 'val.csv' file? Thanks :)
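Most likely val.csv is just a held-out slice of the Kaggle train.csv with the same columns: id, comment_text, and the six label columns (toxic, severe_toxic, obscene, threat, insult, identity_hate).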
I have a Tesla GPU with only 16 GB -- much less than what you used for the experiment described in the Medium article. As a result, I had to reduce the max sequence length from 512 to 128, and the batch size from 32 to 16. After 4 epochs, the validation accuracies of the various toxic comment categories were around 0.6 to 0.65. I wonder if increasing the number of epochs would help increase the performance.
In addition, is there a way to continue training a model -- say after 4 epochs, if the validation results are not good, can I continue the training rather than restart it with a larger number of epochs? Is it sufficient to just rerun fit()?
Thanks !
The function get_labels is used to get the labels from the source csv files, and its length is used to set the size of the model's last layer. However, the first column is the ID of the document, not a label, so using it as-is results in a size mismatch in the model, which is then unable to train.
if self.labels == None: self.labels = list(pd.read_csv(os.path.join(self.data_dir, "classes.txt"), header=None)[0].values)
Removing the first value (or setting num_labels = len(labels) - 1) fixes this problem.
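A hedged sketch of that fix applied to get_labels, assuming (as the report says) that the first entry read from classes.txt is the document id rather than a label:

```python
import os
import pandas as pd

def get_labels(self):
    if self.labels is None:
        values = pd.read_csv(
            os.path.join(self.data_dir, "classes.txt"), header=None)[0].values
        self.labels = list(values)[1:]  # drop the leading document-id entry
    return self.labels
```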