ai-forever / ner-bert

BERT-NER (ner-bert) with Google BERT https://github.com/google-research.

License: MIT License

Python 33.77% Jupyter Notebook 66.23%

python python3 pytorch bert ner bilstm-crf attention nlp transfer-learning pytorch-model

ner-bert's Introduction

0. Papers

There are two solutions based on this architecture.

BSNLP 2019 ACL workshop: solution and paper on multilingual shared task.
The second place solution of Dialogue AGRR-2019 task and paper.

1. Description

This repository contains a solution to the NER task based on a PyTorch reimplementation of Google's TensorFlow repository for the BERT model, which was released together with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

This implementation can load any pre-trained TensorFlow checkpoint for BERT (in particular Google's pre-trained models).

The old version is in the "old" branch.

2. Usage

2.1 Create data

from modules.data import bert_data
data = bert_data.LearnData.create(
    train_df_path=train_df_path,
    valid_df_path=valid_df_path,
    idx2labels_path="/path/to/vocab",
    clear_cache=True
)

2.2 Create model

from modules.models.bert_models import BERTBiLSTMAttnCRF
model = BERTBiLSTMAttnCRF.create(len(data.train_ds.idx2label))

2.3 Create Learner

from modules.train.train import NerLearner
num_epochs = 100
learner = NerLearner(
    model, data, "/path/for/save/best/model", t_total=num_epochs * len(data.train_dl))

2.4 Predict

from modules.data.bert_data import get_data_loader_for_predict
learner.load_model()
dl = get_data_loader_for_predict(data, df_path="/path/to/df/for/predict")
preds = learner.predict(dl)

2.5 Evaluate

from sklearn_crfsuite.metrics import flat_classification_report
from modules.analyze_utils.utils import bert_labels2tokens, voting_choicer
from modules.analyze_utils.plot_metrics import get_bert_span_report
from modules.analyze_utils.main_metrics import precision_recall_f1


pred_tokens, pred_labels = bert_labels2tokens(dl, preds)
true_tokens, true_labels = bert_labels2tokens(dl, [x.bert_labels for x in dl.dataset])
tokens_report = flat_classification_report(true_labels, pred_labels, digits=4)
print(tokens_report)

results = precision_recall_f1(true_labels, pred_labels)

3. Results

We did not search for the best parameters and obtained the following results.

Model	Data set	Dev F1 tok	Dev F1 span	Test F1 tok	Test F1 span
OURS
M-BERTCRF-IO	FactRuEval	-	-	0.8543	0.8409
M-BERTNCRF-IO	FactRuEval	-	-	0.8637	0.8516
M-BERTBiLSTMCRF-IO	FactRuEval	-	-	0.8835	0.8718
M-BERTBiLSTMNCRF-IO	FactRuEval	-	-	0.8632	0.8510
M-BERTAttnCRF-IO	FactRuEval	-	-	0.8503	0.8346
M-BERTBiLSTMAttnCRF-IO	FactRuEval	-	-	0.8839	0.8716
M-BERTBiLSTMAttnNCRF-IO	FactRuEval	-	-	0.8807	0.8680
M-BERTBiLSTMAttnCRF-fit_BERT-IO	FactRuEval	-	-	0.8823	0.8709
M-BERTBiLSTMAttnNCRF-fit_BERT-IO	FactRuEval	-	-	0.8583	0.8456
-	-	-	-	-	-
BERTBiLSTMCRF-IO	CoNLL-2003	0.9629	-	0.9221	-
B-BERTBiLSTMCRF-IO	CoNLL-2003	0.9635	-	0.9229	-
B-BERTBiLSTMAttnCRF-IO	CoNLL-2003	0.9614	-	0.9237	-
B-BERTBiLSTMAttnNCRF-IO	CoNLL-2003	0.9631	-	0.9249	-
Current SOTA
DeepPavlov-RuBERT-NER	FactRuEval	-	-	-	0.8266
CSE	CoNLL-2003	-	-	0.931	-
BERT-LARGE	CoNLL-2003	0.966	-	0.928	-
BERT-BASE	CoNLL-2003	0.964	-	0.924	-


ner-bert's Issues

how to add some new feature?

I want to add some features to the data, e.g. is_in_some_vocab, to train a more generalized model.
How can I do this?

Sorry, this is embarrassing; it is my first time submitting an issue on GitHub.

This model uses a lot of memory

Besides the BERT model itself, after I train my model, loading BERT, the trained model and the data needs 4.5 GB, which is very large.
In this situation it is very hard to deploy online.
Is there any way to reduce memory use?

Hello, is there a LICENSE for this project? Thanks

No BERT Fine-Tuning?

Thanks for the comprehensive examples in this repository.

Please correct me if I'm wrong but from my understanding, the benefits of this repository rely on the replacement of ELMo embeddings with BERT embeddings. This seems to work just fine resulting in good improvements. However, it doesn't seem very intuitive to me since it misses the fine-tuning of BERT for the NER task.

Have you tried anything in this direction of fine-tuning BERT for NER? Or to express it differently: Why do you need an LSTM encoder as a wrapper around BERT encodings?

RuntimeError: Expected tensor [2, 512, 1], src [2, 371, 6] and index [2, 512, 1] to have the same size apart from dimension 2

I use my own dataset, and during CRF training the following error occurs:

line 483, in main
 global_step, tr_loss = train(args, train_dataset, model, tokenizer, labels,pos, pad_token_label_id)

line 138, in train
   loss = model.score(batch)  # model outputs are always tuple in pytorch-transformers (see doc)

line 52, in score
   return self.crf.score(output, labels_mask, labels)
 
line 46, in score
   gold_score = self.crf.calc_gold_score(logits, labels, lens)

line 99, in calc_gold_score
   unary_score = self.calc_unary_score(logits, labels, lens).sum(

line 93, in calc_unary_score
   scores = torch.gather(logits, 2, labels_exp).squeeze(-1)

RuntimeError: Expected tensor [2, 512, 1], src [2, 371, 6] and index [2, 512, 1] to have the same size apart from dimension 2
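
The two shapes in the message differ along dimension 1 (371 versus 512), which is the mismatch the gather call rejects. One possible cause, offered as an assumption rather than a diagnosis, is that the logits were trimmed to the longest real sequence in the batch while the labels stayed padded to the maximum length. The plain-Python sketch below mimics the gather on nested lists and trims the labels first; `trim_to_length` and `gather_unary_scores` are hypothetical helpers, not functions from this repository.

```python
def trim_to_length(rows, length):
    """Cut every row of a padded batch down to `length` positions."""
    return [row[:length] for row in rows]


def gather_unary_scores(logits, labels):
    """Pick logits[b][t][labels[b][t]] for every position (a gather on the class axis)."""
    scores = []
    for b, (logit_row, label_row) in enumerate(zip(logits, labels)):
        if len(logit_row) != len(label_row):
            raise ValueError(
                f"sequence {b}: {len(logit_row)} logit steps vs {len(label_row)} labels"
            )
        scores.append([step[label] for step, label in zip(logit_row, label_row)])
    return scores


# Toy batch: 1 sequence, 2 time steps, 3 classes; labels padded to length 4.
logits = [[[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]]]
padded_labels = [[1, 0, 0, 0]]

# Trim the labels to the number of logit steps before gathering.
labels = trim_to_length(padded_labels, len(logits[0]))
unary = gather_unary_scores(logits, labels)
```

With tensors, the equivalent step would be slicing the label and mask tensors to the sequence length of the logits before the CRF score is computed.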

question about model

Should we calculate the F1 score with micro-average or macro-average?

In the Jupyter notebook "conll2003 BERTBiLSTMCRF" in the "examples" folder, the result report is as follows:

I noticed you put the macro-avg value "0.9221" in the "README.md" file, but it seems that the code at "https://paperswithcode.com/sota/named-entity-recognition-ner-on-conll-2003" adopts the micro-avg value as the final F1 value.

I would appreciate it very much if you can tell me why, thanks.
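
To make the difference concrete, here is a minimal token-level sketch of the two averages. It is not the repository's `precision_recall_f1`; the function names are hypothetical, and it counts every label (including "O") at token level, whereas CoNLL-style evaluation is usually done on entity spans.

```python
from collections import Counter


def per_label_counts(true_labels, pred_labels):
    """Count true positives, false positives and false negatives per label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(true_labels, pred_labels):
        for t, p in zip(true_seq, pred_seq):
            if t == p:
                tp[t] += 1
            else:
                fp[p] += 1
                fn[t] += 1
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 from raw counts; 0.0 when the label never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(true_labels, pred_labels):
    """Return (micro_f1, macro_f1) over all labels seen in either sequence."""
    tp, fp, fn = per_label_counts(true_labels, pred_labels)
    labels = sorted(set(tp) | set(fp) | set(fn))
    if not labels:
        return 0.0, 0.0
    # Macro: average the per-label F1 scores, so rare labels weigh as much as frequent ones.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts first, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


micro, macro = micro_macro_f1([["O", "O", "O", "PER"]], [["O", "O", "O", "O"]])
```

In this toy case the single missed PER token pulls the macro average far below the micro average, which is why the two numbers should not be compared against each other.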

Predict a sentence using BERTBiLSTMAttnNCRF without passing a dataloader

I had an issue while building a function that predicts a single sentence without passing a dataloader instance.
These are the steps I followed:
from pytorch_pretrained_bert import BertTokenizer

sentence = 'put a sentence'
bert_tokens = []
tok_map = []
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# label2idx = {"[PAD]": pad_idx, '[CLS]': 1, '[SEP]': 2, "X": 3}
# idx2label = ["[PAD]", '[CLS]', '[SEP]', "X"]
orig_tokens = sentence.split()
orig_tokens = ["[CLS]"] + orig_tokens + ["[SEP]"]
for origin_token in orig_tokens:
    cur_tokens = tokenizer.tokenize(origin_token)
    bert_tokens.extend(cur_tokens)
    tok_map.append(len(bert_tokens))
input_ids = tokenizer.convert_tokens_to_ids(bert_tokens)
input_mask = [1] * len(input_ids)
# pad everything up to a fixed sequence length of 424
while len(input_ids) < 424:
    input_mask.append(0)
    tok_map.append(-1)
    input_ids.append(0)
input_type_ids = [0] * len(input_ids)

The problem is I couldn't figure out what batch is in order to predict using model.forward(batch)
I tried this:

batch=[[0],[0],[0]]
batch[0]=input_ids
batch[1]=input_type_ids
batch[2]=input_mask
learner.model.forward(batch)
and this is what I got:

~/ner-bert-master-last-version/ner-bert-master-last-version/modules/models/bert_models.py in forward(self, batch)
46 def forward(self, batch):
47 input_, labels_mask, input_type_ids = batch[:3]
---> 48 input_embeddings = self.embeddings(batch)
49 output, _ = self.lstm.forward(batch)
50 output, _ = self.attn(output, output, output, None)

~/.local/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/ner-bert-master-last-version/ner-bert-master-last-version/modules/layers/embedders.py in forward(self, batch)
59 token_type_ids=batch[2],
60 attention_mask=batch[1],
---> 61 output_all_encoded_layers=self.config["mode"] == "weighted")
62 if self.config["mode"] == "weighted":
63 encoded_layers = torch.stack([a * b for a, b in zip(encoded_layers, self.bert_weights)])

~/.local/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py in forward(self, input_ids, token_type_ids, attention_mask, output_all_encoded_layers)
718 # this attention mask is more simple than the triangular masking of causal attention
719 # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
--> 720 extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
721
722 # Since attention_mask is 1.0 for positions we want to attend and 0.0 for

AttributeError: 'list' object has no attribute 'unsqueeze'

can you please help!
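
The AttributeError arises because plain Python lists were passed where tensors are expected. The traceback also shows that embedders.py reads the attention mask from batch[1] and the token type ids from batch[2], so the order used in the attempt above is swapped. A minimal sketch of building the three features with a leading batch dimension follows; `build_single_sentence_batch` and the toy tokenizer callables are hypothetical, and whether the model needs further batch entries is not confirmed here.

```python
def build_single_sentence_batch(sentence, tokenize, convert_tokens_to_ids, max_seq_len=424):
    """Return (input_ids, input_mask, input_type_ids), each shaped [1, max_seq_len]."""
    bert_tokens = []
    for word in ["[CLS]"] + sentence.split() + ["[SEP]"]:
        bert_tokens.extend(tokenize(word))
    # Simplification: overly long inputs are cut, which may drop the final [SEP].
    bert_tokens = bert_tokens[:max_seq_len]

    input_ids = list(convert_tokens_to_ids(bert_tokens))
    input_mask = [1] * len(input_ids)
    padding = max_seq_len - len(input_ids)
    input_ids += [0] * padding
    input_mask += [0] * padding
    input_type_ids = [0] * max_seq_len

    # Order follows the traceback: batch[1] is the attention mask, batch[2] the type ids.
    return [input_ids], [input_mask], [input_type_ids]


# Toy stand-ins for BertTokenizer.tokenize / convert_tokens_to_ids (assumptions).
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "put": 3, "a": 4, "sentence": 5}
batch = build_single_sentence_batch(
    "put a sentence",
    tokenize=lambda word: [word],
    convert_tokens_to_ids=lambda tokens: [toy_vocab[t] for t in tokens],
    max_seq_len=8,
)
# With PyTorch installed, each entry would then become a tensor, e.g.
# batch = [torch.tensor(x, dtype=torch.long) for x in batch]
```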

No uploaded ckpts and data?

Hi
Do I understand correctly that you do not intend to share checkpoints from decoder training, and are sharing these notebooks for others to replicate your results?

Also, I really can't find the FactRuEval corpus data anywhere (not only in this repository). Maybe you can share a link to where you downloaded it?
I've found only the opencorpora corpus on Kaggle, but when I opened it, I found that there are no classic NER tags such as 'loc', 'per', etc., only tags such as adj, verb and others.

I am reimplementing Atis-NMT but I Couldn't find train filtered.csv atis dataset.

what is

No module named 'elmoformanylangs'

Hello! How can I solve this error? I can't get this package with pip install.

It arises while I convert the checkpoints:

Traceback (most recent call last):
  File "/home/jupyter/ner-bert-master/convert_tf_checkpoint_to_pytorch.py", line 27, in <module>
    from modules.layers.bert_modeling import BertConfig, BertModel
  File "/home/jupyter/ner-bert-master/modules/__init__.py", line 1, in <module>
    from .train.train import NerLearner
  File "/home/jupyter/ner-bert-master/modules/train/train.py", line 9, in <module>
    from modules.models.released_models import released_models
  File "/home/jupyter/ner-bert-master/modules/models/__init__.py", line 1, in <module>
    from .bert_models import BertBiLSTMCRF, BertBiLSTMAttnCRF
  File "/home/jupyter/ner-bert-master/modules/models/bert_models.py", line 1, in <module>
    from modules.layers.encoders import *
  File "/home/jupyter/ner-bert-master/modules/layers/encoders.py", line 3, in <module>
    from .embedders import BertEmbedder
  File "/home/jupyter/ner-bert-master/modules/layers/embedders.py", line 9, in <module>
    from elmoformanylangs.modules.embedding_layer import EmbeddingLayer
ModuleNotFoundError: No module named 'elmoformanylangs'

It also arises when I do the import:

ModuleNotFoundError Traceback (most recent call last)
in
----> 1 from modules import BertNerData as NerData

~/ner-bert-master/modules/__init__.py in <module>
----> 1 from .train.train import NerLearner
2 from .data.bert_data import BertNerData
3 from .models.bert_models import BertBiLSTMCRF
4
5

~/ner-bert-master/modules/train/train.py in
7 import json
8 from modules.data.bert_data import BertNerData
----> 9 from modules.models.released_models import released_models
10
11

~/ner-bert-master/modules/models/__init__.py in <module>
----> 1 from .bert_models import BertBiLSTMCRF, BertBiLSTMAttnCRF
2
3
4 __all__ = ["BertBiLSTMCRF", "BertBiLSTMAttnCRF"]

~/ner-bert-master/modules/models/bert_models.py in
----> 1 from modules.layers.encoders import *
2 from modules.layers.decoders import *
3 from modules.layers.embedders import *
4 import abc
5 import sys

~/ner-bert-master/modules/layers/encoders.py in
1 from torch import nn
2 import torch
----> 3 from .embedders import BertEmbedder
4
5

~/ner-bert-master/modules/layers/embedders.py in
7 import json
8 from torch import nn
----> 9 from elmoformanylangs.modules.embedding_layer import EmbeddingLayer
10 from elmoformanylangs.frontend import Model
11

ModuleNotFoundError: No module named 'elmoformanylangs'
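
The import chain above shows modules/layers/embedders.py importing elmoformanylangs unconditionally, so the error appears even when only BERT is used. One common workaround, given here as an assumption and not as the maintainers' fix, is to make the import optional and fail only when an ELMo embedder is actually requested. `optional_import` and `require_elmo` are hypothetical names.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # also covers ModuleNotFoundError
        return None


# Hypothetical guard for the optional ELMo dependency.
elmo = optional_import("elmoformanylangs")
ELMO_AVAILABLE = elmo is not None


def require_elmo():
    """Raise a readable error only when an ELMo embedder is actually needed."""
    if not ELMO_AVAILABLE:
        raise ImportError("elmoformanylangs is needed for ELMo embedders only")
    return elmo
```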

loss explosion

I am using a BERT-CRF model for a Named Entity Recognition task, with the average of the last four layers as the input to the CRF model. The loss increases and becomes NaN within a few batches. Has anyone met this problem before? Any suggestions will be appreciated!

The PyTorch version?

Which PyTorch version should be used?

Key Error creating NerData

Hello again!

I have a strange error while I run
data = NerData.create(train_path, valid_path, vocab_file)

KeyError Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3063 try:
-> 3064 return self._engine.get_loc(key)
3065 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '1'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
----> 1 data = NerData.create(train_path, valid_path, vocab_file)

~/ner-bert-master/modules/data/bert_data.py in create(cls, train_path, valid_path, vocab_file, batch_size, cuda, is_cls, data_type, max_seq_len, is_meta)
389 raise NotImplementedError("No requested mode :(.")
390 return cls(train_path, valid_path, vocab_file, data_type, *fn(
--> 391 train_path, valid_path, vocab_file, batch_size, cuda, is_cls, do_lower_case, max_seq_len, is_meta),
392 batch_size=batch_size, cuda=cuda, is_meta=is_meta)

~/ner-bert-master/modules/data/bert_data.py in get_bert_data_loaders(train, valid, vocab_file, batch_size, cuda, is_cls, do_lower_case, max_seq_len, is_meta, label2idx, cls2idx)
279 tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)
280 train_f, label2idx = get_data(
--> 281 train, tokenizer, label2idx, cls2idx=cls2idx, is_cls=is_cls, max_seq_len=max_seq_len, is_meta=is_meta)
282 if is_cls:
283 label2idx, cls2idx = label2idx

~/ner-bert-master/modules/data/bert_data.py in get_data(df, tokenizer, label2idx, max_seq_len, pad, cls2idx, is_cls, is_meta)
145 all_args.extend([df["1"].tolist(), df["0"].tolist(), df["2"].tolist()])
146 else:
--> 147 all_args.extend([df["1"].tolist(), df["0"].tolist()])
148 if is_meta:
149 all_args.append(df["3"].tolist())

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2484 res = cache.get(item)
2485 if res is None:
-> 2486 values = self._data.get(item)
2487 res = self._box_item_values(item, values)
2488 cache[item] = res

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3064 return self._engine.get_loc(key)
3065 except KeyError:
-> 3066 return self._engine.get_loc(self._maybe_cast_indexer(key))
3067
3068 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '1'

Maybe you've faced it too. I couldn't find anything similar to my case by searching.
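
The traceback shows get_data indexing df["1"] and df["0"], that is, columns whose names are the strings "0" and "1". KeyError: '1' therefore means the loaded table has no column with that name, for example because the file has a different header row. A minimal sketch for checking the header before calling NerData.create, assuming a tab-delimited file (the delimiter and the helper name `missing_columns` are assumptions):

```python
import csv
import io


def missing_columns(handle, required=("0", "1"), delimiter="\t"):
    """Read the header row and return the required column names that are absent."""
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return [name for name in required if name not in header]


# In-memory stand-ins for two differently headed files.
with_expected_header = io.StringIO("0\t1\na\tb\n")
with_other_header = io.StringIO("labels\ttext\na\tb\n")

ok = missing_columns(with_expected_header)
problem = missing_columns(with_other_header)
```

For a real file, the same function can be called with an open file handle, for example `open(train_path, encoding="utf-8", newline="")`.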

prev_label = "" Error

Hi,

I got an error when I run

data = NerData.create(train_path, valid_path, vocab_file)

~/Downloads/ner-bert-master/modules/data/bert_data.py in get_data(df, tokenizer, label2idx, max_seq_len, pad, cls2idx, is_cls, is_meta)
189 if label != "O":
190 label = label.split("_")[1]
--> 191 if label == prev_label:
192 prefix = "I"
193 prev_label = label

UnboundLocalError: local variable 'prev_label' referenced before assignment

And if I uncomment line 187 (prev_label = ""), the 'I_O' label turns out to be missing.

more details for the models ?

The models are only briefly introduced here. Are there any papers or blog posts that describe models like BertBiLSTMAttnCRF in detail?

Mistake in main_metrics

current_tag = current_token.split('_', 1)[-1] # replace '-' with "_"
This bug will totally change the evaluation result.

some errors in readme

For atis project, it seems you have put conll.ipynb as examples...and vice versa

UnicodeDecodeError on the vocab file

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3799: character maps to <undefined>

When passing the vocab file to NerData.create() I get a Unicode error.
Something like this needs to be added in the code to make it work:
open(vocab_file,encoding='utf8').read()
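
The 'charmap' codec in the message indicates that the file was opened with the platform's default encoding instead of UTF-8. A minimal sketch of a vocabulary reader that always decodes as UTF-8 is shown below; `load_vocab` is a hypothetical helper, not the repository's loader, and it assumes the usual BERT layout of one token per line.

```python
import os
import tempfile


def load_vocab(vocab_file):
    """Read a one-token-per-line vocabulary, always decoding as UTF-8."""
    vocab = {}
    with open(vocab_file, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            vocab[line.rstrip("\n")] = index
    return vocab


# Round trip through a temporary file containing a non-ASCII token.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("[PAD]\n[UNK]\nпривет\n")
    vocab = load_vocab(path)
```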

Do you plan to train bert on Russian and use it as pretrained?

Question about this release

I see this message in the readme: "Sorry, we are in developing. Release is coming soon :("
What does the message mean? Does it mean that some functionality is missing from the code? Is the code complete?
Thanks

FP16 and NVIDIA Apex support

Hi @king-menin ,

Do you support FP16 and NVIDIA Apex, as the PyTorch BERT examples do?
Thanks.

BIO vs IO

Hello.

In your example (https://github.com/sberbank-ai/ner-bert/blob/master/examples/factrueval-nmt.ipynb) you are using BIO markup.
But in the code (bert_data.py, line 187):

  # prev_label = ""

    for idx_, (orig_token, label) in enumerate(zip(orig_tokens, labels)):
        # Fix BIO to IO as BERT proposed https://arxiv.org/pdf/1810.04805.pdf

        try:

you use IO. How do you get the original markup back after training?
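
One way to rebuild B/I prefixes from IO-style predictions is to mark the first token of every run of the same entity type as B and the rest as I. The sketch below assumes labels of the form "I_PER" with "_" as separator (the helper name `io_to_bio` is hypothetical). Note the inherent limitation: two adjacent entities of the same type are merged into one, so the original BIO markup cannot always be recovered exactly.

```python
def io_to_bio(labels, sep="_"):
    """Rebuild B/I prefixes from IO-style labels such as "I_PER", "I_O" and "O"."""
    bio = []
    prev_type = ""  # initialised before the loop, so the first comparison is safe
    for label in labels:
        entity_type = label.split(sep, 1)[-1]
        if entity_type == "O":
            bio.append("O")
            prev_type = ""
            continue
        # Same type as the previous token continues the entity; otherwise a new one starts.
        prefix = "I" if entity_type == prev_type else "B"
        bio.append(prefix + sep + entity_type)
        prev_type = entity_type
    return bio


restored = io_to_bio(["I_PER", "I_PER", "I_O", "I_LOC", "I_ORG", "O"])
```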

convert_tf_checkpoint_to_pytorch.py error

Hi,

When running the conversion script convert_tf_checkpoint_to_pytorch.py I get the following error when trying to convert the BERT-Base, uncased model:

Traceback (most recent call last):
  File "convert_tf_checkpoint_to_pytorch.py", line 107, in <module>
    convert()
  File "convert_tf_checkpoint_to_pytorch.py", line 100, in convert
    pointer.data = torch.from_numpy(array)
TypeError: expected np.ndarray (got numpy.ndarray)

Any ideas? Thanks

a small bug in `get_mean_max_metric` function

Thanks so much for your super nice implementation of NER-BERT! I'm writing to report a small bug.

We use the get_mean_max_metric in BERT-NER/modules/utils/plot_metrics.py to calculate the metric results, which will determine whether to save model during training time.

However, in this function, [3 + m_idx] in [float(h.split("\n")[-2].split()[3 + m_idx]) for h in history] should be [2 + m_idx], in order to obtain the correct results of target metrics.

Please fix this bug, thanks a lot!
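
Whether offset 2 or 3 is right depends on how many words the row label of the summary line contains. As far as I know, older scikit-learn releases print "avg / total" (three fields) while newer ones print rows such as "weighted avg" (two fields); treat that as an assumption. A sketch that avoids the fixed offset by indexing from the end of the line, where the layout is precision, recall, F1, support (`parse_summary_metrics` and the sample report lines are illustrative, not taken from the repository):

```python
def parse_summary_metrics(report):
    """Return (precision, recall, f1) from the last line of a text report."""
    last_line = report.rstrip("\n").split("\n")[-1]
    fields = last_line.split()
    # The line ends with: precision recall f1 support
    precision, recall, f1 = (float(x) for x in fields[-4:-1])
    return precision, recall, f1


# Illustrative summary lines in the two layouts.
old_style = "        PER       0.90      0.85      0.87       500\n\navg / total       0.88      0.87      0.86      1000\n"
new_style = "        PER       0.90      0.85      0.87       500\n\nweighted avg       0.88      0.87      0.86      1000\n"

old_metrics = parse_summary_metrics(old_style)
new_metrics = parse_summary_metrics(new_style)
```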

The length of token sequence is different from tag sequence

Hi! Thank you for sharing this implementation.
I am using it to train on a NER dataset in Norwegian. I wanted to store the model predictions in CONLL file format but encounter some issues with, I assume, token mapping in the input sequence.
Basically, I got the tokens, y_true, y_pred from function get_bert_span_report() in plot_metrics.py.
And I want to write the predictions into a file by:

def write_to_conll(tokens, y_true, y_pred, conll_fpath):
    assert len(tokens) == len(y_true) and len(y_true) == len(y_pred)
    with conll_fpath.open('a') as f:
        for sentence, y_t, y_p in zip(tokens, y_true, y_pred): 
            assert len(sentence) == len(y_t) and len(y_t) == len(y_p)     
            for i in range(len(sentence)):
                newline = '{}\t{}\t{}\n'.format(sentence[i], y_t[i], y_p[i])
                f.write(newline)
            f.write('\n')

The assertion assert len(sentence) == len(y_t) fails. The length of token sequence is always longer than the length of tag sequence. For example:

['Dette', 'er', 'som', 'George', 'Orwells', 'nytale', ',', 'av', 'samme', 'logiske', 'gehalt', 'som', '"', 'fred', 'er', 'krig', '"', ',', '"', 'stillhet', 'er', 'larm', '"', ',', '"', 'lys', 'er', 'mørke', '"', '.']
['O', 'O', 'O', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
# len(sentence)=30, len(y_t)=29, len(y_p)=29
['I', 'diktet', '"', 'Om', 'å', 'vokse', 'nedover', '"', 'skriver', 'Rolf', 'Jacobsen', ':']
['O', 'O', 'O', 'PROD', 'O', 'O', 'PER', 'O']
['O', 'O', 'O', 'PROD', 'O', 'O', 'PER', 'O']
# len(sentence)=12, len(y_t)=8, len(y_p)=8

What should I do to get the equal length of token sequence and tag sequence?
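
The cause of the mismatch is not established in this thread. As a first diagnostic step, a small helper can list the sentences whose lengths disagree instead of stopping at the first failed assertion, which makes it easier to see which tokens are involved. `find_length_mismatches` is a hypothetical helper, not part of the repository.

```python
def find_length_mismatches(tokens, y_true, y_pred):
    """Return (index, n_tokens, n_true, n_pred) for every sentence whose lengths disagree."""
    mismatches = []
    for idx, (sentence, y_t, y_p) in enumerate(zip(tokens, y_true, y_pred)):
        if not (len(sentence) == len(y_t) == len(y_p)):
            mismatches.append((idx, len(sentence), len(y_t), len(y_p)))
    return mismatches


# Toy data: the second sentence has one more token than tags.
tokens = [["Rolf", "Jacobsen"], ["I", "diktet", ":"]]
y_true = [["PER", "PER"], ["O", "O"]]
y_pred = [["PER", "PER"], ["O", "O"]]

report = find_length_mismatches(tokens, y_true, y_pred)
```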