ner-bert's Introduction

0. Papers

There are two solutions based on this architecture.

  1. BSNLP 2019 ACL workshop: solution and paper on multilingual shared task.
  2. The second place solution of Dialogue AGRR-2019 task and paper.

1. Description

This repository contains a solution to the NER task based on a PyTorch reimplementation of Google's TensorFlow repository for the BERT model, which was released together with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

This implementation can load any pre-trained TensorFlow checkpoint for BERT (in particular Google's pre-trained models).

The old version is in the "old" branch.

2. Usage

2.1 Create data

from modules.data import bert_data
data = bert_data.LearnData.create(
    train_df_path=train_df_path,
    valid_df_path=valid_df_path,
    idx2labels_path="/path/to/vocab",
    clear_cache=True
)

2.2 Create model

from modules.models.bert_models import BERTBiLSTMAttnCRF
model = BERTBiLSTMAttnCRF.create(len(data.train_ds.idx2label))

2.3 Create Learner

from modules.train.train import NerLearner
num_epochs = 100
learner = NerLearner(
    model, data, "/path/for/save/best/model", t_total=num_epochs * len(data.train_dl))

2.4 Predict

from modules.data.bert_data import get_data_loader_for_predict
learner.load_model()
dl = get_data_loader_for_predict(data, df_path="/path/to/df/for/predict")
preds = learner.predict(dl)

2.5 Evaluate

from sklearn_crfsuite.metrics import flat_classification_report
from modules.analyze_utils.utils import bert_labels2tokens, voting_choicer
from modules.analyze_utils.plot_metrics import get_bert_span_report
from modules.analyze_utils.main_metrics import precision_recall_f1


pred_tokens, pred_labels = bert_labels2tokens(dl, preds)
true_tokens, true_labels = bert_labels2tokens(dl, [x.bert_labels for x in dl.dataset])
tokens_report = flat_classification_report(true_labels, pred_labels, digits=4)
print(tokens_report)

results = precision_recall_f1(true_labels, pred_labels)

3. Results

We did not search for the best hyperparameters and obtained the following results.

Model                             Data set    Dev F1 tok  Dev F1 span  Test F1 tok  Test F1 span
OURS
M-BERTCRF-IO                      FactRuEval  -           -            0.8543       0.8409
M-BERTNCRF-IO                     FactRuEval  -           -            0.8637       0.8516
M-BERTBiLSTMCRF-IO                FactRuEval  -           -            0.8835       0.8718
M-BERTBiLSTMNCRF-IO               FactRuEval  -           -            0.8632       0.8510
M-BERTAttnCRF-IO                  FactRuEval  -           -            0.8503       0.8346
M-BERTBiLSTMAttnCRF-IO            FactRuEval  -           -            0.8839       0.8716
M-BERTBiLSTMAttnNCRF-IO           FactRuEval  -           -            0.8807       0.8680
M-BERTBiLSTMAttnCRF-fit_BERT-IO   FactRuEval  -           -            0.8823       0.8709
M-BERTBiLSTMAttnNCRF-fit_BERT-IO  FactRuEval  -           -            0.8583       0.8456
-                                 -           -           -            -            -
BERTBiLSTMCRF-IO                  CoNLL-2003  0.9629      -            0.9221       -
B-BERTBiLSTMCRF-IO                CoNLL-2003  0.9635      -            0.9229       -
B-BERTBiLSTMAttnCRF-IO            CoNLL-2003  0.9614      -            0.9237       -
B-BERTBiLSTMAttnNCRF-IO           CoNLL-2003  0.9631      -            0.9249       -
Current SOTA
DeepPavlov-RuBERT-NER             FactRuEval  -           -            -            0.8266
CSE                               CoNLL-2003  -           -            0.931        -
BERT-LARGE                        CoNLL-2003  0.966       -            0.928        -
BERT-BASE                         CoNLL-2003  0.964       -            0.924        -

ner-bert's Issues

How to add a new feature?

I want to add a feature to the data, e.g. is_in_some_vocab, in order to train a more generalized model.
How can I do this?

This model uses a lot of memory

After I train my model, loading BERT, the trained model and the data needs 4.5 GB of memory, which is very large.
This makes it very hard to deploy online.
Is there any way to reduce memory use?

No BERT Fine-Tuning?

Thanks for the comprehensive examples in this repository.

Please correct me if I'm wrong but from my understanding, the benefits of this repository rely on the replacement of ELMo embeddings with BERT embeddings. This seems to work just fine resulting in good improvements. However, it doesn't seem very intuitive to me since it misses the fine-tuning of BERT for the NER task.

Have you tried anything in this direction of fine-tuning BERT for NER? Or to express it differently: Why do you need an LSTM encoder as a wrapper around BERT encodings?

RuntimeError: Expected tensor [2, 512, 1], src [2, 371, 6] and index [2, 512, 1] to have the same size apart from dimension 2

I use my own dataset, and during CRF training the following error occurs:

line 483, in main
 global_step, tr_loss = train(args, train_dataset, model, tokenizer, labels,pos, pad_token_label_id)

line 138, in train
   loss = model.score(batch)  # model outputs are always tuple in pytorch-transformers (see doc)

line 52, in score
   return self.crf.score(output, labels_mask, labels)
 
line 46, in score
   gold_score = self.crf.calc_gold_score(logits, labels, lens)

line 99, in calc_gold_score
   unary_score = self.calc_unary_score(logits, labels, lens).sum(

line 93, in calc_unary_score
   scores = torch.gather(logits, 2, labels_exp).squeeze(-1)

RuntimeError: Expected tensor [2, 512, 1], src [2, 371, 6] and index [2, 512, 1] to have the same size apart from dimension 2

Predict a sentence using BERTBiLSTMAttnNCRF without passing a dataloader

I had an issue while building a function that predicts a single sentence without passing a dataloader instance.
These are the steps I followed:

from pytorch_pretrained_bert import BertTokenizer  # package seen in the traceback below

sentence = 'put a sentence'
bert_tokens = []
tok_map = []
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# label2idx = {"[PAD]": pad_idx, '[CLS]': 1, '[SEP]': 2, "X": 3}
# idx2label = ["[PAD]", '[CLS]', '[SEP]', "X"]
orig_tokens = sentence.split()
orig_tokens = ["[CLS]"] + orig_tokens + ["[SEP]"]
for origin_token in orig_tokens:
    # Split each word into WordPiece tokens and remember where it ends
    cur_tokens = tokenizer.tokenize(origin_token)
    bert_tokens.extend(cur_tokens)
    tok_map.append(len(bert_tokens))
input_ids = tokenizer.convert_tokens_to_ids(bert_tokens)
input_mask = [1] * len(input_ids)
# Pad everything to a fixed length of 424
while len(input_ids) < 424:
    input_mask.append(0)
    tok_map.append(-1)
    input_ids.append(0)
input_type_ids = [0] * len(input_ids)

The problem is that I couldn't figure out what batch should be in order to predict using model.forward(batch).
I tried this:

batch = [[0], [0], [0]]
batch[0] = input_ids
batch[1] = input_type_ids
batch[2] = input_mask
learner.model.forward(batch)

and this is what I got:

~/ner-bert-master-last-version/ner-bert-master-last-version/modules/models/bert_models.py in forward(self, batch)
46 def forward(self, batch):
47 input_, labels_mask, input_type_ids = batch[:3]
---> 48 input_embeddings = self.embeddings(batch)
49 output, _ = self.lstm.forward(batch)
50 output, _ = self.attn(output, output, output, None)

~/.local/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/ner-bert-master-last-version/ner-bert-master-last-version/modules/layers/embedders.py in forward(self, batch)
59 token_type_ids=batch[2],
60 attention_mask=batch[1],
---> 61 output_all_encoded_layers=self.config["mode"] == "weighted")
62 if self.config["mode"] == "weighted":
63 encoded_layers = torch.stack([a * b for a, b in zip(encoded_layers, self.bert_weights)])

~/.local/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/.local/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py in forward(self, input_ids, token_type_ids, attention_mask, output_all_encoded_layers)
718 # this attention mask is more simple than the triangular masking of causal attention
719 # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
--> 720 extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
721
722 # Since attention_mask is 1.0 for positions we want to attend and 0.0 for

AttributeError: 'list' object has no attribute 'unsqueeze'

Can you please help?

No uploaded checkpoints and data?

Hi
Did I understand correctly that you do not intend to share checkpoints from training the decoders, and are sharing these notebooks so that others can replicate your results?

Also, I can't find the FactRuEval corpus data anywhere (not only in this repository). Maybe you can share a link to where you downloaded it?
I've found only the opencorpora corpus on Kaggle, but when I opened it, I found that it has no classic NER tags such as 'loc', 'per', etc., only tags such as adj, verb and others.

No module named 'elmoformanylangs'

Hello! How can I solve this error? I can't get this module with just pip install.

It arises while I convert the checkpoints:

Traceback (most recent call last):
  File "/home/jupyter/ner-bert-master/convert_tf_checkpoint_to_pytorch.py", line 27, in <module>
    from modules.layers.bert_modeling import BertConfig, BertModel
  File "/home/jupyter/ner-bert-master/modules/__init__.py", line 1, in <module>
    from .train.train import NerLearner
  File "/home/jupyter/ner-bert-master/modules/train/train.py", line 9, in <module>
    from modules.models.released_models import released_models
  File "/home/jupyter/ner-bert-master/modules/models/__init__.py", line 1, in <module>
    from .bert_models import BertBiLSTMCRF, BertBiLSTMAttnCRF
  File "/home/jupyter/ner-bert-master/modules/models/bert_models.py", line 1, in <module>
    from modules.layers.encoders import *
  File "/home/jupyter/ner-bert-master/modules/layers/encoders.py", line 3, in <module>
    from .embedders import BertEmbedder
  File "/home/jupyter/ner-bert-master/modules/layers/embedders.py", line 9, in <module>
    from elmoformanylangs.modules.embedding_layer import EmbeddingLayer
ModuleNotFoundError: No module named 'elmoformanylangs'

It also arises when I do the import:


ModuleNotFoundError Traceback (most recent call last)
in
----> 1 from modules import BertNerData as NerData

~/ner-bert-master/modules/__init__.py in
----> 1 from .train.train import NerLearner
2 from .data.bert_data import BertNerData
3 from .models.bert_models import BertBiLSTMCRF
4
5

~/ner-bert-master/modules/train/train.py in
7 import json
8 from modules.data.bert_data import BertNerData
----> 9 from modules.models.released_models import released_models
10
11

~/ner-bert-master/modules/models/__init__.py in
----> 1 from .bert_models import BertBiLSTMCRF, BertBiLSTMAttnCRF
2
3
4 __all__ = ["BertBiLSTMCRF", "BertBiLSTMAttnCRF"]

~/ner-bert-master/modules/models/bert_models.py in
----> 1 from modules.layers.encoders import *
2 from modules.layers.decoders import *
3 from modules.layers.embedders import *
4 import abc
5 import sys

~/ner-bert-master/modules/layers/encoders.py in
1 from torch import nn
2 import torch
----> 3 from .embedders import BertEmbedder
4
5

~/ner-bert-master/modules/layers/embedders.py in
7 import json
8 from torch import nn
----> 9 from elmoformanylangs.modules.embedding_layer import EmbeddingLayer
10 from elmoformanylangs.frontend import Model
11

ModuleNotFoundError: No module named 'elmoformanylangs'

Loss explosion

I am using a BERT-CRF model for a Named Entity Recognition task, with the average of the last four layers as the input to the CRF. The loss increases and becomes NaN within a few batches. Has anyone met this problem before? Any suggestions will be appreciated!

Key Error creating NerData

Hello again!

I get a strange error when I run
data = NerData.create(train_path, valid_path, vocab_file)


KeyError Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3063 try:
-> 3064 return self._engine.get_loc(key)
3065 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '1'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
----> 1 data = NerData.create(train_path, valid_path, vocab_file)

~/ner-bert-master/modules/data/bert_data.py in create(cls, train_path, valid_path, vocab_file, batch_size, cuda, is_cls, data_type, max_seq_len, is_meta)
389 raise NotImplementedError("No requested mode :(.")
390 return cls(train_path, valid_path, vocab_file, data_type, *fn(
--> 391 train_path, valid_path, vocab_file, batch_size, cuda, is_cls, do_lower_case, max_seq_len, is_meta),
392 batch_size=batch_size, cuda=cuda, is_meta=is_meta)

~/ner-bert-master/modules/data/bert_data.py in get_bert_data_loaders(train, valid, vocab_file, batch_size, cuda, is_cls, do_lower_case, max_seq_len, is_meta, label2idx, cls2idx)
279 tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)
280 train_f, label2idx = get_data(
--> 281 train, tokenizer, label2idx, cls2idx=cls2idx, is_cls=is_cls, max_seq_len=max_seq_len, is_meta=is_meta)
282 if is_cls:
283 label2idx, cls2idx = label2idx

~/ner-bert-master/modules/data/bert_data.py in get_data(df, tokenizer, label2idx, max_seq_len, pad, cls2idx, is_cls, is_meta)
145 all_args.extend([df["1"].tolist(), df["0"].tolist(), df["2"].tolist()])
146 else:
--> 147 all_args.extend([df["1"].tolist(), df["0"].tolist()])
148 if is_meta:
149 all_args.append(df["3"].tolist())

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2484 res = cache.get(item)
2485 if res is None:
-> 2486 values = self._data.get(item)
2487 res = self._box_item_values(item, values)
2488 cache[item] = res

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3064 return self._engine.get_loc(key)
3065 except KeyError:
-> 3066 return self._engine.get_loc(self._maybe_cast_indexer(key))
3067
3068 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '1'

Maybe you've faced it too. I couldn't find anything similar to my case by searching.

prev_label = "" Error

Hi,

I got an error when I run

data = NerData.create(train_path, valid_path, vocab_file)

~/Downloads/ner-bert-master/modules/data/bert_data.py in get_data(df, tokenizer, label2idx, max_seq_len, pad, cls2idx, is_cls, is_meta)
189 if label != "O":
190 label = label.split("_")[1]
--> 191 if label == prev_label:
192 prefix = "I_"
193 prev_label = label

UnboundLocalError: local variable 'prev_label' referenced before assignment

And if I uncomment line 187, prev_label = "", the 'I_O' label turns out to be missing.
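The prefix logic from the traceback can be tried in isolation. The sketch below is not the repository's get_data(); to_prefixed is a hypothetical helper, and it assumes every label is either "O" or of the form "<prefix>_<TYPE>" (for example "B_PER"). Treating "O" like any other type, so that "I_O" can appear, is also an assumption made only to match the label the issue mentions.

```python
def to_prefixed(labels):
    """Re-derive B_/I_ prefixes from a sequence of labels.

    Hypothetical helper, not the repository's get_data().
    Assumes each label is "O" or "<prefix>_<TYPE>", e.g. "B_PER".
    """
    prev_label = ""  # set before the loop, so the first comparison is safe
    result = []
    for label in labels:
        if label != "O":
            label = label.split("_", 1)[1]  # "B_PER" -> "PER"
        # Same type as the previous token means continuation, else a new span.
        prefix = "I_" if label == prev_label else "B_"
        prev_label = label  # updated for "O" too, so "I_O" can be produced
        result.append(prefix + label)
    return result


print(to_prefixed(["B_PER", "I_PER", "O", "O", "B_LOC"]))
# ['B_PER', 'I_PER', 'B_O', 'I_O', 'B_LOC']
```

With prev_label initialized outside the loop the UnboundLocalError cannot occur, and because prev_label is also updated on "O" tokens, a second consecutive "O" receives the "I_" prefix.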

More details on the models?

The models are only briefly introduced here. Are there any papers or blog posts that describe models like BertBiLSTMAttnCRF in detail?

Mistake in main_metrics

current_tag = current_token.split('_', 1)[-1] # replace '-' with "_"
This bug will totally change the evaluation results.
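A small illustration of why the separator matters, assuming labels are written with an underscore (such as "B_PER"). strip_prefix is a hypothetical helper used only for this demonstration, not a function from main_metrics.

```python
def strip_prefix(token_label, sep="_"):
    """Drop the leading B/I prefix and keep only the entity type."""
    return token_label.split(sep, 1)[-1]


print(strip_prefix("B_PER"))       # PER
print(strip_prefix("B_PER", "-"))  # B_PER (separator not found, prefix kept)
print(strip_prefix("O"))           # O
```

When the separator does not occur in the label, split() returns the whole string, so "B_PER" and "I_PER" are treated as two different tags instead of one entity type.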

UnicodeDecodeError on the vocab file

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3799: character maps to <undefined>

I get this Unicode error when passing the vocab file to NerData.create().
Something like the following needs to be added to the code to make it work:
open(vocab_file, encoding='utf8').read()
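A minimal sketch of the same idea as a standalone function. read_vocab is a hypothetical helper, not part of the repository, and it assumes the vocab file holds one token per line. Passing encoding explicitly means the result no longer depends on the platform default, which is where the 'charmap' codec comes from on some systems.

```python
import os
import tempfile


def read_vocab(vocab_file):
    """Read a one-token-per-line vocab file, always decoding it as UTF-8."""
    with open(vocab_file, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Self-contained demo with a temporary file containing non-ASCII tokens.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("[PAD]\n[CLS]\nмир\n")
    vocab = read_vocab(path)

print(vocab)
```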

Question about this release

I see this message in the readme: "Sorry, we are in developing. Release is coming soon :("
What does the message mean? Does it mean that the code is missing some functionality? Is the code complete?
Thanks

convert_tf_checkpoint_to_pytorch.py error

Hi,

When running the conversion script convert_tf_checkpoint_to_pytorch.py I get the following error when trying to convert the BERT-Base, uncased model:

Traceback (most recent call last):
  File "convert_tf_checkpoint_to_pytorch.py", line 107, in <module>
    convert()
  File "convert_tf_checkpoint_to_pytorch.py", line 100, in convert
    pointer.data = torch.from_numpy(array)
TypeError: expected np.ndarray (got numpy.ndarray)

Any ideas? Thanks

a small bug in `get_mean_max_metric` function

Thanks so much for your super nice implementation of NER-BERT! I'm writing to report a small bug.

We use the get_mean_max_metric in BERT-NER/modules/utils/plot_metrics.py to calculate the metric results, which will determine whether to save model during training time.

However, in this function, [3 + m_idx] in [float(h.split("\n")[-2].split()[3 + m_idx]) for h in history] should be [2 + m_idx], in order to obtain the correct results of target metrics.

Please fix this bug, thanks a lot!
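To see how the offset depends on the label of the summary row, the parsing can be tried on hand-written report tails. One possible explanation, which is an assumption and not something stated in this issue, is that the last row of the text report reads "avg / total" in older scikit-learn versions and "weighted avg" in newer ones. metric_from_last_row is a hypothetical helper, and the numbers are made up for illustration.

```python
def metric_from_last_row(report, first_metric_idx, m_idx=0):
    """Read one metric from the last row of a text classification report.

    first_metric_idx is the position of the precision column after split();
    m_idx selects precision (0), recall (1) or F1 (2).
    """
    last_row = report.split("\n")[-2]  # the report ends with "\n", so [-1] is empty
    return float(last_row.split()[first_metric_idx + m_idx])


# Hand-written report tails, for illustration only.
old_style = "      B_PER   0.90   0.80   0.85   10\n\navg / total   0.91   0.81   0.86   10\n"
new_style = "      B_PER   0.90   0.80   0.85   10\n\nweighted avg   0.91   0.81   0.86   10\n"

print(metric_from_last_row(old_style, 3, m_idx=2))  # 0.86: "avg / total" is three tokens
print(metric_from_last_row(new_style, 2, m_idx=2))  # 0.86: "weighted avg" is two tokens
print(metric_from_last_row(new_style, 3, m_idx=2))  # 10.0: off by one, reads the support column
```

If the row label has two words, an offset of 3 silently returns the wrong column, which would change which checkpoint is considered the best.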

The length of token sequence is different from tag sequence

Hi! Thank you for sharing this implementation.
I am using it to train on a NER dataset in Norwegian. I wanted to store the model predictions in CoNLL file format but ran into some issues with, I assume, the token mapping in the input sequence.
Basically, I got the tokens, y_true, y_pred from function get_bert_span_report() in plot_metrics.py.
And I want to write the predictions into a file by:

def write_to_conll(tokens, y_true, y_pred, conll_fpath):
    assert len(tokens) == len(y_true) and len(y_true) == len(y_pred)
    with conll_fpath.open('a') as f:
        for sentence, y_t, y_p in zip(tokens, y_true, y_pred): 
            assert len(sentence) == len(y_t) and len(y_t) == len(y_p)     
            for i in range(len(sentence)):
                newline = '{}\t{}\t{}\n'.format(sentence[i], y_t[i], y_p[i])
                f.write(newline)
            f.write('\n')

The assertion assert len(sentence) == len(y_t) fails. The length of token sequence is always longer than the length of tag sequence. For example:

['Dette', 'er', 'som', 'George', 'Orwells', 'nytale', ',', 'av', 'samme', 'logiske', 'gehalt', 'som', '"', 'fred', 'er', 'krig', '"', ',', '"', 'stillhet', 'er', 'larm', '"', ',', '"', 'lys', 'er', 'mørke', '"', '.']
['O', 'O', 'O', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
# len(sentence)=30, len(y_t)=29, len(y_p)=29
['I', 'diktet', '"', 'Om', 'å', 'vokse', 'nedover', '"', 'skriver', 'Rolf', 'Jacobsen', ':']
['O', 'O', 'O', 'PROD', 'O', 'O', 'PER', 'O']
['O', 'O', 'O', 'PROD', 'O', 'O', 'PER', 'O']
# len(sentence)=12, len(y_t)=8, len(y_p)=8

What should I do to get the equal length of token sequence and tag sequence?

Explanation for the tables

Hi, what does "Total spans in test set" mean in the tables?

BTW, I got similar results on the CoNLL-2003 data.
It seems BERT is not so effective for NER.
