ai-forever / ner-bert

BERT-NER (ner-bert) with Google BERT https://github.com/google-research.

License: MIT License

Python 33.77% Jupyter Notebook 66.23%

python python3 pytorch bert ner bilstm-crf attention nlp transfer-learning pytorch-model

ner-bert's Issues

prev_label = "" Error

Hi,

I got an error when I run

data = NerData.create(train_path, valid_path, vocab_file)

~/Downloads/ner-bert-master/modules/data/bert_data.py in get_data(df, tokenizer, label2idx, max_seq_len, pad, cls2idx, is_cls, is_meta)
189 if label != "O":
190 label = label.split("_")[1]
--> 191 if label == prev_label:
192 prefix = "I"
193 prev_label = label

UnboundLocalError: local variable 'prev_label' referenced before assignment

And if I uncomment line 187 (prev_label = ""), the 'I_O' label ends up missing.
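The error itself can be reproduced in isolation. The sketch below is not the repository's code; `count_repeats_broken` and `count_repeats` are hypothetical helpers that only illustrate why a variable compared inside a loop has to be bound before the loop starts. It says nothing about the missing 'I_O' label, which would need a look at the rest of `get_data`.

```python
def count_repeats_broken(labels):
    """Hypothetical: counts labels equal to their predecessor, with the bug."""
    repeats = 0
    for label in labels:
        # prev_label is assigned later in this function, so Python treats it
        # as a local name; reading it on the first iteration fails.
        if label == prev_label:
            repeats += 1
        prev_label = label
    return repeats


def count_repeats(labels):
    """Same logic, with prev_label bound before the loop."""
    prev_label = ""  # no real label is empty, so the first comparison is False
    repeats = 0
    for label in labels:
        if label == prev_label:
            repeats += 1
        prev_label = label
    return repeats
```

Calling `count_repeats_broken(["PER"])` raises `UnboundLocalError`, while `count_repeats(["PER", "PER", "O"])` runs normally.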

The length of token sequence is different from tag sequence

Hi! Thank you for sharing this implementation.
I am using it to train on a NER dataset in Norwegian. I wanted to store the model predictions in CoNLL file format but ran into what I assume is a token-mapping issue in the input sequence.
Basically, I got the tokens, y_true, y_pred from function get_bert_span_report() in plot_metrics.py.
And I want to write the predictions into a file by:

def write_to_conll(tokens, y_true, y_pred, conll_fpath):
    assert len(tokens) == len(y_true) and len(y_true) == len(y_pred)
    with conll_fpath.open('a') as f:
        for sentence, y_t, y_p in zip(tokens, y_true, y_pred): 
            assert len(sentence) == len(y_t) and len(y_t) == len(y_p)     
            for i in range(len(sentence)):
                newline = '{}\t{}\t{}\n'.format(sentence[i], y_t[i], y_p[i])
                f.write(newline)
            f.write('\n')

The assertion assert len(sentence) == len(y_t) fails. The length of token sequence is always longer than the length of tag sequence. For example:

['Dette', 'er', 'som', 'George', 'Orwells', 'nytale', ',', 'av', 'samme', 'logiske', 'gehalt', 'som', '"', 'fred', 'er', 'krig', '"', ',', '"', 'stillhet', 'er', 'larm', '"', ',', '"', 'lys', 'er', 'mørke', '"', '.']
['O', 'O', 'O', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
# len(sentence)=30, len(y_t)=29, len(y_p)=29
['I', 'diktet', '"', 'Om', 'å', 'vokse', 'nedover', '"', 'skriver', 'Rolf', 'Jacobsen', ':']
['O', 'O', 'O', 'PROD', 'O', 'O', 'PER', 'O']
['O', 'O', 'O', 'PROD', 'O', 'O', 'PER', 'O']
# len(sentence)=12, len(y_t)=8, len(y_p)=8

What should I do to get the equal length of token sequence and tag sequence?
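Before writing the file, it may help to list which sentences are misaligned and by how much. The sketch below assumes the same three parallel lists as `write_to_conll`; `find_length_mismatches` is a hypothetical helper and does not repair the alignment, whose cause is not established in this thread.

```python
def find_length_mismatches(tokens, y_true, y_pred):
    """Return (index, n_tokens, n_true, n_pred) for every misaligned sentence."""
    mismatches = []
    for i, (sentence, y_t, y_p) in enumerate(zip(tokens, y_true, y_pred)):
        if not (len(sentence) == len(y_t) == len(y_p)):
            mismatches.append((i, len(sentence), len(y_t), len(y_p)))
    return mismatches
```

In the two examples above, the gap (1 and 4) happens to equal the number of `"` tokens after the first one in each sentence, which may be a starting point for checking how quotation marks are tokenized and mapped back.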

No uploaded ckpts and data?

Hi
Do I understand correctly that you do not intend to share checkpoints from the decoder training, and that the notebooks are shared so that others can replicate your results?

Also, I can't find the data from the FactRuEval corpus anywhere (not only in this repository). Maybe you could share a link to where you downloaded it?
I've only found the opencorpora corpus on Kaggle, but when I opened it, I found that it has no classic NER tags such as 'loc', 'per', etc., only tags such as adj, verb, and others.

How to add a new feature?

I want to add a feature to the data, e.g. is_in_some_vocab,
to train a more generalized model.
How can I do this?

Sorry, this is embarrassing; it is my first time submitting an issue on GitHub.

RuntimeError: Expected tensor [2, 512, 1], src [2, 371, 6] and index [2, 512, 1] to have the same size apart from dimension 2

I use my own dataset, and the following error occurs while training the CRF:

line 483, in main
 global_step, tr_loss = train(args, train_dataset, model, tokenizer, labels,pos, pad_token_label_id)

line 138, in train
   loss = model.score(batch)  # model outputs are always tuple in pytorch-transformers (see doc)

line 52, in score
   return self.crf.score(output, labels_mask, labels)
 
line 46, in score
   gold_score = self.crf.calc_gold_score(logits, labels, lens)

line 99, in calc_gold_score
   unary_score = self.calc_unary_score(logits, labels, lens).sum(

line 93, in calc_unary_score
   scores = torch.gather(logits, 2, labels_exp).squeeze(-1)

RuntimeError: Expected tensor [2, 512, 1], src [2, 371, 6] and index [2, 512, 1] to have the same size apart from dimension 2
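The message reports that the source (shape [2, 371, 6], the logits passed to `torch.gather`) and the index (shape [2, 512, 1], built from the labels) disagree on dimension 1. A framework-free sketch of the rule as the message states it; `gather_shapes_compatible` is a hypothetical helper for illustration, not a PyTorch function.

```python
def gather_shapes_compatible(src_shape, index_shape, dim):
    """True if the two shapes match on every dimension except `dim`."""
    if len(src_shape) != len(index_shape):
        return False
    return all(
        s == i
        for d, (s, i) in enumerate(zip(src_shape, index_shape))
        if d != dim
    )


# Shapes taken from the error message above.
logits_shape = (2, 371, 6)
labels_shape = (2, 512, 1)
print(gather_shapes_compatible(logits_shape, labels_shape, dim=2))
```

A plausible reading, not confirmed by the thread, is that the labels are padded to the maximum sequence length (512) while the logits cover only the longest sequence in the batch (371), so one of the two would need to be truncated or padded to match the other.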

Key Error creating NerData

Hello again!

I get a strange error when I run
data = NerData.create(train_path, valid_path, vocab_file)

KeyError Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3063 try:
-> 3064 return self._engine.get_loc(key)
3065 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '1'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
----> 1 data = NerData.create(train_path, valid_path, vocab_file)

~/ner-bert-master/modules/data/bert_data.py in create(cls, train_path, valid_path, vocab_file, batch_size, cuda, is_cls, data_type, max_seq_len, is_meta)
389 raise NotImplementedError("No requested mode :(.")
390 return cls(train_path, valid_path, vocab_file, data_type, *fn(
--> 391 train_path, valid_path, vocab_file, batch_size, cuda, is_cls, do_lower_case, max_seq_len, is_meta),
392 batch_size=batch_size, cuda=cuda, is_meta=is_meta)

~/ner-bert-master/modules/data/bert_data.py in get_bert_data_loaders(train, valid, vocab_file, batch_size, cuda, is_cls, do_lower_case, max_seq_len, is_meta, label2idx, cls2idx)
279 tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)
280 train_f, label2idx = get_data(
--> 281 train, tokenizer, label2idx, cls2idx=cls2idx, is_cls=is_cls, max_seq_len=max_seq_len, is_meta=is_meta)
282 if is_cls:
283 label2idx, cls2idx = label2idx

~/ner-bert-master/modules/data/bert_data.py in get_data(df, tokenizer, label2idx, max_seq_len, pad, cls2idx, is_cls, is_meta)
145 all_args.extend([df["1"].tolist(), df["0"].tolist(), df["2"].tolist()])
146 else:
--> 147 all_args.extend([df["1"].tolist(), df["0"].tolist()])
148 if is_meta:
149 all_args.append(df["3"].tolist())

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2484 res = cache.get(item)
2485 if res is None:
-> 2486 values = self._data.get(item)
2487 res = self._box_item_values(item, values)
2488 cache[item] = res

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3064 return self._engine.get_loc(key)
3065 except KeyError:
-> 3066 return self._engine.get_loc(self._maybe_cast_indexer(key))
3067
3068 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '1'

Maybe you've faced it too. I couldn't find anything similar to my case by searching.
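The traceback shows `get_data` indexing the DataFrame with the string keys "0" and "1" (plus "2" or "3" in some modes), so `KeyError: '1'` suggests the loaded file has no column with exactly that name. A stdlib sketch for checking a file's header before handing it to the loader; the comma delimiter and the helper name `missing_columns` are assumptions.

```python
import csv
import io


def missing_columns(csv_file, required=("0", "1"), delimiter=","):
    """Return the required column names that are absent from the header row."""
    header = next(csv.reader(csv_file, delimiter=delimiter), [])
    return [name for name in required if name not in header]


# A file whose header uses descriptive names instead of "0" and "1".
sample = io.StringIO("labels,text\nO O,hello world\n")
print(missing_columns(sample))
```

If the list is non-empty, the file would need a header row with the expected names, or the columns would need renaming after loading.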

Explanation for the tables

Hi, what does "Total spans in test set" mean in the tables?

BTW, I got similar results on the CoNLL-2003 data.
It seems BERT is not that effective for NER.

Something wrong in bert_data.py?

In the get_data function, should the variable cls be a list?

No module named 'elmoformanylangs'

Hello! How can I solve this error? I can't simply install the module with pip.

It arises while I convert the checkpoints:

Traceback (most recent call last):
  File "/home/jupyter/ner-bert-master/convert_tf_checkpoint_to_pytorch.py", line 27, in <module>
    from modules.layers.bert_modeling import BertConfig, BertModel
  File "/home/jupyter/ner-bert-master/modules/__init__.py", line 1, in <module>
    from .train.train import NerLearner
  File "/home/jupyter/ner-bert-master/modules/train/train.py", line 9, in <module>
    from modules.models.released_models import released_models
  File "/home/jupyter/ner-bert-master/modules/models/__init__.py", line 1, in <module>
    from .bert_models import BertBiLSTMCRF, BertBiLSTMAttnCRF
  File "/home/jupyter/ner-bert-master/modules/models/bert_models.py", line 1, in <module>
    from modules.layers.encoders import *
  File "/home/jupyter/ner-bert-master/modules/layers/encoders.py", line 3, in <module>
    from .embedders import BertEmbedder
  File "/home/jupyter/ner-bert-master/modules/layers/embedders.py", line 9, in <module>
    from elmoformanylangs.modules.embedding_layer import EmbeddingLayer
ModuleNotFoundError: No module named 'elmoformanylangs'

It also arises on import:

ModuleNotFoundError Traceback (most recent call last)
in
----> 1 from modules import BertNerData as NerData

~/ner-bert-master/modules/__init__.py in <module>
----> 1 from .train.train import NerLearner
2 from .data.bert_data import BertNerData
3 from .models.bert_models import BertBiLSTMCRF
4
5

~/ner-bert-master/modules/train/train.py in <module>
7 import json
8 from modules.data.bert_data import BertNerData
----> 9 from modules.models.released_models import released_models
10
11

~/ner-bert-master/modules/models/__init__.py in <module>
----> 1 from .bert_models import BertBiLSTMCRF, BertBiLSTMAttnCRF
2
3
4 __all__ = ["BertBiLSTMCRF", "BertBiLSTMAttnCRF"]

~/ner-bert-master/modules/models/bert_models.py in <module>
----> 1 from modules.layers.encoders import *
2 from modules.layers.decoders import *
3 from modules.layers.embedders import *
4 import abc
5 import sys

~/ner-bert-master/modules/layers/encoders.py in <module>
1 from torch import nn
2 import torch
----> 3 from .embedders import BertEmbedder
4
5

~/ner-bert-master/modules/layers/embedders.py in <module>
7 import json
8 from torch import nn
----> 9 from elmoformanylangs.modules.embedding_layer import EmbeddingLayer
10 from elmoformanylangs.frontend import Model
11

ModuleNotFoundError: No module named 'elmoformanylangs'

Some errors in the README

For the ATIS project, it seems you have linked conll.ipynb as the example, and vice versa.

No BERT Fine-Tuning?

Thanks for the comprehensive examples in this repository.

Please correct me if I'm wrong but from my understanding, the benefits of this repository rely on the replacement of ELMo embeddings with BERT embeddings. This seems to work just fine resulting in good improvements. However, it doesn't seem very intuitive to me since it misses the fine-tuning of BERT for the NER task.

Have you tried anything in this direction of fine-tuning BERT for NER? Or to express it differently: Why do you need an LSTM encoder as a wrapper around BERT encodings?

Mistake in main_metrics

current_tag = current_token.split('_', 1)[-1] # replace '-' with '_'
This bug will totally change the evaluation result.
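To see why the separator matters for evaluation, here is a small sketch. It assumes tags join prefix and entity type with an underscore (as in the `I_O` label mentioned elsewhere on this page); `strip_prefix` is a hypothetical helper, not the function in main_metrics.

```python
def strip_prefix(tag, sep="_"):
    """Drop a leading B/I prefix; tags without the separator come back unchanged."""
    return tag.split(sep, 1)[-1]


for tag in ["B_PER", "I_PER", "O"]:
    # With the matching separator the entity type is isolated; with a
    # non-matching one the whole tag is returned, so B_PER and I_PER
    # would be counted as two different entity types.
    print(tag, strip_prefix(tag, sep="_"), strip_prefix(tag, sep="-"))
```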

Which PyTorch version?

Which PyTorch version should be used?

FP16 and NVIDIA Apex support

Hi @king-menin ,

Do you support FP16 and NVIDIA Apex, as the PyTorch BERT examples do?
Thanks.

UnboundLocalError: local variable 'prev_label' referenced before assignment

This model uses a lot of memory

After training, loading BERT, the trained model, and the data takes 4.5 GB, which is very large.
That makes it very hard to deploy online.
Is there any way to reduce memory use?

More details on the models?

The models are only briefly introduced here. Are there any papers or blog posts that describe models like BertBiLSTMAttnCRF in detail?

loss explosion

I am using a BERT-CRF model for a Named Entity Recognition task, with the average of the last four layers as input to the CRF. But the loss increases and becomes NaN within a few batches. Has anyone met this problem before? Any suggestions would be appreciated!

a small bug in `get_mean_max_metric` function

Thanks so much for your super nice implementation of NER-BERT! I'm writing to report a small bug.

We use the get_mean_max_metric in BERT-NER/modules/utils/plot_metrics.py to calculate the metric results, which will determine whether to save model during training time.

However, in this function, [3 + m_idx] in [float(h.split("\n")[-2].split()[3 + m_idx]) for h in history] should be [2 + m_idx], in order to obtain the correct results of target metrics.

Please fix this bug, thanks a lot!
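Whether the offset should be 2 or 3 depends on how many whitespace-separated tokens the row label occupies: a row labelled `avg / total` splits into three label tokens, while `macro avg` splits into two. The sketch below sidesteps the offset by indexing from the right. It assumes the summary row ends with precision, recall, F1, and support, in that order; `metric_from_row` is a hypothetical helper, not the repository's function.

```python
def metric_from_row(row, m_idx):
    """Read a metric from a report row; m_idx: 0 = precision, 1 = recall, 2 = F1."""
    fields = row.split()
    # The last four fields are assumed to be precision, recall, F1, support,
    # so counting from the end is independent of the label's token count.
    return float(fields[-4 + m_idx])


print(metric_from_row("avg / total 0.91 0.92 0.93 100", 2))
print(metric_from_row("macro avg 0.91 0.92 0.93 100", 2))
```

Both calls read the same column even though the label widths differ.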

I am reimplementing Atis-NMT but I couldn't find the train filtered.csv ATIS dataset.

Question about this release

I see this message in the README: "Sorry, we are in developing. Release is coming soon :("
What does the message mean? Does it mean that some functionality is missing from the code? Is the code complete?
Thanks

UnicodeDecodeError on the vocab file

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3799: character maps to <undefined>

I get this Unicode error when passing the vocab file to NerData.create().
Something like this needs to be added to the code to make it work:
open(vocab_file,encoding='utf8').read()
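The 'charmap' codec in the message indicates that the file was opened with a platform default encoding rather than UTF-8, which is common on Windows. A minimal sketch of reading a vocabulary with an explicit encoding; `load_vocab` is a hypothetical helper and assumes one token per line.

```python
def load_vocab(vocab_file, encoding="utf-8"):
    """Read one token per line, decoding explicitly instead of using the locale default."""
    with open(vocab_file, encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]
```

Passing `encoding` explicitly makes the result independent of the machine's locale settings.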

Predict a sentence using BERTBiLSTMAttnNCRF without passing a dataloader

I had an issue while building a function that predicts a single sentence without passing a dataloader instance.
These are the steps I followed:
sentence = 'put a sentence'
bert_tokens = []
tok_map = []
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# label2idx = {"[PAD]": pad_idx, '[CLS]': 1, '[SEP]': 2, "X": 3}
# idx2label = ["[PAD]", '[CLS]', '[SEP]', "X"]
orig_tokens = sentence.split()
orig_tokens = ["[CLS]"] + orig_tokens + ["[SEP]"]
for origin_token in orig_tokens:
    cur_tokens = tokenizer.tokenize(origin_token)
    bert_tokens.extend(cur_tokens)
    tok_map.append(len(bert_tokens))
input_ids = tokenizer.convert_tokens_to_ids(bert_tokens)
input_mask = [1] * len(input_ids)
while len(input_ids) < 424:
    input_mask.append(0)
    tok_map.append(-1)
    input_ids.append(0)
input_type_ids = [0] * len(input_ids)

The problem is that I couldn't figure out what batch should contain in order to predict with model.forward(batch).
I tried this:

batch=[[0],[0],[0]]
batch[0]=input_ids
batch[1]=input_type_ids
batch[2]=input_mask
learner.model.forward(batch)
and this is what I got:

~/ner-bert-master-last-version/ner-bert-master-last-version/modules/models/bert_models.py in forward(self, batch)
46 def forward(self, batch):
47 input_, labels_mask, input_type_ids = batch[:3]
---> 48 input_embeddings = self.embeddings(batch)
49 output, _ = self.lstm.forward(batch)
50 output, _ = self.attn(output, output, output, None)

~/.local/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/ner-bert-master-last-version/ner-bert-master-last-version/modules/layers/embedders.py in forward(self, batch)
59 token_type_ids=batch[2],
60 attention_mask=batch[1],
---> 61 output_all_encoded_layers=self.config["mode"] == "weighted")
62 if self.config["mode"] == "weighted":
63 encoded_layers = torch.stack([a * b for a, b in zip(encoded_layers, self.bert_weights)])

~/.local/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py in forward(self, input_ids, token_type_ids, attention_mask, output_all_encoded_layers)
718 # this attention mask is more simple than the triangular masking of causal attention
719 # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
--> 720 extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
721
722 # Since attention_mask is 1.0 for positions we want to attend and 0.0 for

AttributeError: 'list' object has no attribute 'unsqueeze'

Can you please help?
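Two things stand out in the traceback. The embedder reads `batch[1]` as `attention_mask` and `batch[2]` as `token_type_ids`, which is the reverse of the order used in the attempt above, and `unsqueeze` is a tensor method, so plain Python lists fail. The sketch below only prepares padded lists with a leading batch dimension in that order; `build_batch_lists` is a hypothetical helper, and the final conversion to tensors (for example with `torch.tensor`, on the model's device) is assumed rather than shown.

```python
def build_batch_lists(input_ids, max_len, pad_id=0):
    """Pad one sentence to max_len and wrap each field as a batch of size 1."""
    ids = list(input_ids)[:max_len]
    mask = [1] * len(ids)          # 1 marks real tokens
    padding = max_len - len(ids)
    ids += [pad_id] * padding
    mask += [0] * padding          # 0 marks padding positions
    type_ids = [0] * max_len       # single-segment input
    # Order follows the traceback: [input ids, attention mask, token type ids].
    # The extra outer list gives each field the shape [1, max_len].
    return [[ids], [mask], [type_ids]]


batch_lists = build_batch_lists([101, 5, 102], max_len=5)
print(batch_lists)
```

Each of the three nested lists would then need to become an integer tensor before being passed to the model.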

Should we calculate the F1-score with micro-average or macro-average?

In the Jupyter notebook "conll2003 BERTBiLSTMCRF" in the "examples" folder, the result report is as follows:

I notice you put the macro-avg value "0.9221" in the "README.md" file, but it seems that the leaderboard at "https://paperswithcode.com/sota/named-entity-recognition-ner-on-conll-2003" adopts the micro-avg value as the final F1 value.

I would appreciate it very much if you could tell me why, thanks.
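For reference, the two averages can differ noticeably when classes are unbalanced. A small worked sketch with made-up counts (the numbers are illustrative only and unrelated to the notebook's results); the helper names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts):
    """F1 computed from counts pooled over all classes."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


# Illustrative (tp, fp, fn) per class: one frequent class, one rare class.
counts = {"PER": (90, 10, 10), "LOC": (5, 5, 5)}
print(macro_f1(counts), micro_f1(counts))
```

Here the rare class pulls the macro average down to 0.7, while the pooled micro average stays near 0.86, so the two figures are not interchangeable when comparing against a leaderboard.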

Traceback (most recent call last):
  File "convert_tf_checkpoint_to_pytorch.py", line 107, in <module>
    convert()
  File "convert_tf_checkpoint_to_pytorch.py", line 100, in convert
    pointer.data = torch.from_numpy(array)
TypeError: expected np.ndarray (got numpy.ndarray)

Any ideas? Thanks

BIO vs IO

Hello.

In your example (https://github.com/sberbank-ai/ner-bert/blob/master/examples/factrueval-nmt.ipynb) you are using BIO markup.
But in the code (bert_data.py, line 187):

    # prev_label = ""
    for idx_, (orig_token, label) in enumerate(zip(orig_tokens, labels)):
        # Fix BIO to IO as BERT proposed https://arxiv.org/pdf/1810.04805.pdf
        try:

you use IO. How do you recover the original markup after training?
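One common way to get BIO tags back from IO predictions is to mark the first token of each run of the same entity type as B. The sketch below assumes tags of the form `I_TYPE` with an underscore separator; `io_to_bio` is a hypothetical helper, not part of the repository.

```python
def io_to_bio(tags, sep="_"):
    """Rebuild BIO tags from IO tags: the first token of each run becomes B."""
    bio = []
    prev_type = None
    for tag in tags:
        if tag == "O":
            bio.append("O")
            prev_type = None
            continue
        entity = tag.split(sep, 1)[-1]
        prefix = "I" if entity == prev_type else "B"
        bio.append(prefix + sep + entity)
        prev_type = entity
    return bio


print(io_to_bio(["O", "I_PER", "I_PER", "O", "I_LOC"]))
```

The conversion is lossy in one case: two adjacent entities of the same type are indistinguishable in IO and come back as a single span.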
