ai-forever / ner-bert
BERT-NER (nert-bert) with google bert https://github.com/google-research.
License: MIT License
Hi,
I got an error when I run
data = NerData.create(train_path, valid_path, vocab_file)
~/Downloads/ner-bert-master/modules/data/bert_data.py in get_data(df, tokenizer, label2idx, max_seq_len, pad, cls2idx, is_cls, is_meta)
189 if label != "O":
190 label = label.split("")[1]
--> 191 if label == prev_label:
192 prefix = "I"
193 prev_label = label
UnboundLocalError: local variable 'prev_label' referenced before assignment
And if I uncomment line 187 (prev_label = ""), the 'I_O' label turns out to be missing.
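The UnboundLocalError itself follows from Python's scoping rules: a name that is assigned anywhere in a function is local to it, so reading it before the first assignment fails. The sketch below reproduces that mechanism in isolation. The function names are hypothetical and this is not the repository's get_data code.

```python
def first_repeat_broken(labels):
    """Mirrors the failing pattern: prev_label is read before it is ever assigned."""
    for label in labels:
        if label == prev_label:  # raises UnboundLocalError on the first iteration
            return label
        prev_label = label
    return None


def first_repeat_fixed(labels):
    """Same logic, with prev_label initialised before the loop."""
    prev_label = ""  # plays the role of the commented-out line 187
    for label in labels:
        if label == prev_label:
            return label
        prev_label = label
    return None
```

Initialising the variable only removes the exception. Whether the resulting prefixes (for example the missing 'I_O') are what the rest of the pipeline expects is a separate question that this sketch does not answer.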
Hi! Thank you for sharing this implementation.
I am using it to train on a NER dataset in Norwegian. I want to store the model predictions in CoNLL file format but ran into some issues with, I assume, token mapping in the input sequence.
Basically, I got the tokens, y_true, and y_pred from the function get_bert_span_report() in plot_metrics.py.
And I want to write the predictions into a file by:
def write_to_conll(tokens, y_true, y_pred, conll_fpath):
    assert len(tokens) == len(y_true) and len(y_true) == len(y_pred)
    with conll_fpath.open('a') as f:
        for sentence, y_t, y_p in zip(tokens, y_true, y_pred):
            assert len(sentence) == len(y_t) and len(y_t) == len(y_p)
            for i in range(len(sentence)):
                newline = '{}\t{}\t{}\n'.format(sentence[i], y_t[i], y_p[i])
                f.write(newline)
            f.write('\n')
The assertion assert len(sentence) == len(y_t) fails. The token sequence is always longer than the tag sequence. For example:
['Dette', 'er', 'som', 'George', 'Orwells', 'nytale', ',', 'av', 'samme', 'logiske', 'gehalt', 'som', '"', 'fred', 'er', 'krig', '"', ',', '"', 'stillhet', 'er', 'larm', '"', ',', '"', 'lys', 'er', 'mørke', '"', '.']
['O', 'O', 'O', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
# len(sentence)=30, len(y_t)=29, len(y_p)=29
['I', 'diktet', '"', 'Om', 'å', 'vokse', 'nedover', '"', 'skriver', 'Rolf', 'Jacobsen', ':']
['O', 'O', 'O', 'PROD', 'O', 'O', 'PER', 'O']
['O', 'O', 'O', 'PROD', 'O', 'O', 'PER', 'O']
# len(sentence)=12, len(y_t)=8, len(y_p)=8
What should I do to get a token sequence and a tag sequence of equal length?
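One possible reading of the numbers above, not confirmed by the repository: the tag sequences hold one tag per entity span rather than one per token. In the second sentence, "Om å vokse nedover" (4 tokens) and "Rolf Jacobsen" (2 tokens) would each collapse to a single tag, which accounts for 12 - 3 - 1 = 8. The sketch below shows such a collapse on hypothetical token-level tags; collapse_spans is an illustrative helper, not a function from plot_metrics.py.

```python
from itertools import groupby


def collapse_spans(token_tags):
    """Merge each run of identical non-"O" tags into one tag; keep every "O" as is."""
    collapsed = []
    for tag, run in groupby(token_tags):
        if tag == "O":
            collapsed.extend("O" for _ in run)
        else:
            collapsed.append(tag)
    return collapsed


# Hypothetical token-level tags for the 12-token sentence quoted above.
token_tags = ["O", "O", "O", "PROD", "PROD", "PROD", "PROD", "O", "O", "PER", "PER", "O"]
span_tags = collapse_spans(token_tags)
```

If this reading is right, a CoNLL writer would need the token-level tags from before the collapse, since span boundaries cannot be recovered from the collapsed list alone.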
Hi
Do I understand correctly that you do not intend to share checkpoints from decoder training, and that you share these notebooks so that others can replicate your results?
Also, I can't find the FactRuEval corpus data anywhere (not only in this repository). Maybe you could share a link to where you downloaded it?
I've found only the opencorpora corpus on Kaggle, but when I opened it, I found that it has no classic NER tags such as 'loc', 'per', etc., only tags such as adj, verb, and others.
I want to add a feature to the data, e.g. is_in_some_vocab, in order to train a more generalized model.
How can I do this?
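As a starting point, a per-token vocabulary flag can be computed independently of the model. The sketch below is only the feature extraction step; vocab_feature and its arguments are hypothetical names. How the repository consumes extra features is not shown here (a traceback elsewhere on this page mentions an is_meta flag and a df["3"] column, which may be the intended hook, but that is an assumption).

```python
def vocab_feature(tokens, vocab):
    """Return 1 for each token found in vocab (case-insensitive), else 0."""
    lowered = {word.lower() for word in vocab}
    return [int(token.lower() in lowered) for token in tokens]


# Example: flag tokens that appear in a small hypothetical name list.
flags = vocab_feature(["skriver", "Rolf", "Jacobsen"], {"rolf", "jacobsen"})
# flags == [0, 1, 1]
```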
I use my own dataset, and the following error occurs while training the CRF:
line 483, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer, labels,pos, pad_token_label_id)
line 138, in train
loss = model.score(batch) # model outputs are always tuple in pytorch-transformers (see doc)
line 52, in score
return self.crf.score(output, labels_mask, labels)
line 46, in score
gold_score = self.crf.calc_gold_score(logits, labels, lens)
line 99, in calc_gold_score
unary_score = self.calc_unary_score(logits, labels, lens).sum(
line 93, in calc_unary_score
scores = torch.gather(logits, 2, labels_exp).squeeze(-1)
RuntimeError: Expected tensor [2, 512, 1], src [2, 371, 6] and index [2, 512, 1] to have the same size apart from dimension 2
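The message says the logits have sequence length 371 while the label index has length 512; torch.gather requires the index to be no larger than the source in every dimension other than the gathered one. A plausible cause, which is an assumption here, is that the encoder output was trimmed to the longest sequence in the batch while the labels stayed padded to the maximum length. The sketch below shows the alignment step on plain Python lists; fit_to_length is a hypothetical helper, and with tensors the equivalent would be a slice such as labels[:, :logits.size(1)].

```python
def fit_to_length(rows, target_len, pad_value=0):
    """Truncate or right-pad every row so that all rows have exactly target_len items."""
    return [
        row[:target_len] + [pad_value] * (target_len - len(row))
        for row in rows
    ]


# Labels padded to 6 positions, aligned to an encoder output of length 4.
labels = [[3, 4, 4, 0, 0, 0], [5, 0, 0, 0, 0, 0]]
aligned = fit_to_length(labels, 4)
```

The labels mask would need the same treatment so that all three inputs to the CRF agree on the sequence length.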
Hello again!
I have a strange error while I run
data = NerData.create(train_path, valid_path, vocab_file)
KeyError Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3063 try:
-> 3064 return self._engine.get_loc(key)
3065 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: '1'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
in
----> 1 data = NerData.create(train_path, valid_path, vocab_file)
~/ner-bert-master/modules/data/bert_data.py in create(cls, train_path, valid_path, vocab_file, batch_size, cuda, is_cls, data_type, max_seq_len, is_meta)
389 raise NotImplementedError("No requested mode :(.")
390 return cls(train_path, valid_path, vocab_file, data_type, *fn(
--> 391 train_path, valid_path, vocab_file, batch_size, cuda, is_cls, do_lower_case, max_seq_len, is_meta),
392 batch_size=batch_size, cuda=cuda, is_meta=is_meta)
~/ner-bert-master/modules/data/bert_data.py in get_bert_data_loaders(train, valid, vocab_file, batch_size, cuda, is_cls, do_lower_case, max_seq_len, is_meta, label2idx, cls2idx)
279 tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)
280 train_f, label2idx = get_data(
--> 281 train, tokenizer, label2idx, cls2idx=cls2idx, is_cls=is_cls, max_seq_len=max_seq_len, is_meta=is_meta)
282 if is_cls:
283 label2idx, cls2idx = label2idx
~/ner-bert-master/modules/data/bert_data.py in get_data(df, tokenizer, label2idx, max_seq_len, pad, cls2idx, is_cls, is_meta)
145 all_args.extend([df["1"].tolist(), df["0"].tolist(), df["2"].tolist()])
146 else:
--> 147 all_args.extend([df["1"].tolist(), df["0"].tolist()])
148 if is_meta:
149 all_args.append(df["3"].tolist())
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2484 res = cache.get(item)
2485 if res is None:
-> 2486 values = self._data.get(item)
2487 res = self._box_item_values(item, values)
2488 cache[item] = res
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3064 return self._engine.get_loc(key)
3065 except KeyError:
-> 3066 return self._engine.get_loc(self._maybe_cast_indexer(key))
3067
3068 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: '1'
Maybe you've faced it too. I couldn't find anything similar to my case by searching.
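The traceback shows df["1"] failing, which means the loaded DataFrame has no column whose name is the string "1". A quick way to check is to look at the header row of the data file. The sketch below does that with the standard csv module; the helper names, the comma delimiter, and the required column names "0" and "1" (taken from the df["0"] and df["1"] lookups in the traceback) are assumptions about how the file is expected to look.

```python
import csv


def header_names(path, delimiter=","):
    """Return the fields of the first row of a delimited text file."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f, delimiter=delimiter))


def missing_columns(path, required=("0", "1"), delimiter=","):
    """List the required column names that are absent from the header row."""
    present = set(header_names(path, delimiter))
    return [name for name in required if name not in present]
```

If missing_columns(train_path) returns a non-empty list, the file was probably saved with different column names or without a header row, and re-saving it with the expected names may resolve the KeyError.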
Hi, what does "Total spans in test set" mean in the tables?
By the way, I got similar results on the conll2003 data.
It seems BERT is not so effective for NER.
Hello! How can I solve this error? I can't simply get this module with pip install.
It arises while I convert the checkpoints:
Traceback (most recent call last):
File "/home/jupyter/ner-bert-master/convert_tf_checkpoint_to_pytorch.py", line 27, in
from modules.layers.bert_modeling import BertConfig, BertModel
File "/home/jupyter/ner-bert-master/modules/__init__.py", line 1, in
from .train.train import NerLearner
File "/home/jupyter/ner-bert-master/modules/train/train.py", line 9, in
from modules.models.released_models import released_models
File "/home/jupyter/ner-bert-master/modules/models/__init__.py", line 1, in
from .bert_models import BertBiLSTMCRF, BertBiLSTMAttnCRF
File "/home/jupyter/ner-bert-master/modules/models/bert_models.py", line 1, in
from modules.layers.encoders import *
File "/home/jupyter/ner-bert-master/modules/layers/encoders.py", line 3, in
from .embedders import BertEmbedder
File "/home/jupyter/ner-bert-master/modules/layers/embedders.py", line 9, in
from elmoformanylangs.modules.embedding_layer import EmbeddingLayer
ModuleNotFoundError: No module named 'elmoformanylangs'
It also arises when I import the module:
ModuleNotFoundError Traceback (most recent call last)
in
----> 1 from modules import BertNerData as NerData
~/ner-bert-master/modules/__init__.py in
----> 1 from .train.train import NerLearner
2 from .data.bert_data import BertNerData
3 from .models.bert_models import BertBiLSTMCRF
4
5
~/ner-bert-master/modules/train/train.py in
7 import json
8 from modules.data.bert_data import BertNerData
----> 9 from modules.models.released_models import released_models
10
11
~/ner-bert-master/modules/models/__init__.py in
----> 1 from .bert_models import BertBiLSTMCRF, BertBiLSTMAttnCRF
2
3
4 __all__ = ["BertBiLSTMCRF", "BertBiLSTMAttnCRF"]
~/ner-bert-master/modules/models/bert_models.py in
----> 1 from modules.layers.encoders import *
2 from modules.layers.decoders import *
3 from modules.layers.embedders import *
4 import abc
5 import sys
~/ner-bert-master/modules/layers/encoders.py in
1 from torch import nn
2 import torch
----> 3 from .embedders import BertEmbedder
4
5
~/ner-bert-master/modules/layers/embedders.py in
7 import json
8 from torch import nn
----> 9 from elmoformanylangs.modules.embedding_layer import EmbeddingLayer
10 from elmoformanylangs.frontend import Model
11
ModuleNotFoundError: No module named 'elmoformanylangs'
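Both tracebacks end at the top-level import in embedders.py, so the package is required even when only the BERT embedder is used. To my knowledge elmoformanylangs is installed from its GitHub repository (HIT-SCIR/ELMoForManyLangs) rather than from PyPI, which would explain why pip install alone does not find it. The sketch below shows a generic way to check for a module before importing it; has_module is a hypothetical helper and not part of the repository.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be located for import."""
    return importlib.util.find_spec(name) is not None


# Example: decide whether ELMo-based embedders can be offered at all.
ELMO_AVAILABLE = has_module("elmoformanylangs")
```

A module that guards its ELMo imports behind such a flag would still import cleanly for BERT-only use; whether the maintainers want that behaviour is their call.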
For the ATIS project, it seems you have put conll.ipynb as the example, and vice versa.
Thanks for the comprehensive examples in this repository.
Please correct me if I'm wrong but from my understanding, the benefits of this repository rely on the replacement of ELMo embeddings with BERT embeddings. This seems to work just fine resulting in good improvements. However, it doesn't seem very intuitive to me since it misses the fine-tuning of BERT for the NER task.
Have you tried anything in this direction of fine-tuning BERT for NER? Or to express it differently: Why do you need an LSTM encoder as a wrapper around BERT encodings?
current_tag = current_token.split('', 1)[-1] # replace '-' with ""
This bug will totally change the evaluation result.
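To illustrate why the separator matters: if a tag is split on a character it does not contain, the prefix stays attached and "B_PER" and "I_PER" are counted as different classes. The sketch below assumes the labels use an underscore between prefix and entity type (as the 'I_O' label mentioned elsewhere on this page suggests); the separator in the quoted line was lost in rendering, so this is an assumption, and strip_prefix is a hypothetical helper.

```python
def strip_prefix(token_label, sep):
    """Drop a leading B/I prefix by splitting once on sep; labels without sep are unchanged."""
    return token_label.split(sep, 1)[-1]


wrong = [strip_prefix(t, "-") for t in ["B_PER", "I_PER", "O"]]
right = [strip_prefix(t, "_") for t in ["B_PER", "I_PER", "O"]]
# wrong == ["B_PER", "I_PER", "O"]  (prefix kept, two classes for one entity type)
# right == ["PER", "PER", "O"]
```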
Which PyTorch version should be used?
Hi @king-menin ,
Do you support FP16 and NVIDIA Apex, as the PyTorch BERT examples do?
Thanks.
UnboundLocalError: local variable 'prev_label' referenced before assignment
Besides the BERT model, I also have my own trained model.
Loading BERT, the trained model, and the data needs 4.5G, which is very large.
In this situation it is very hard to deploy online.
So is there any way to reduce memory use?
The models are briefly introduced here. Are there any papers or blogs introducing models like BertBiLSTMAttnCRF in detail?
I am using a BERT-CRF model for a named entity recognition task, with the average of the last four layers as the input to the CRF model. But the loss increases and becomes NaN within a few batches. Has anyone met this problem before? Any suggestions will be appreciated!
Thanks so much for your super nice implementation of NER-BERT! I'm writing to report a small bug.
We use get_mean_max_metric in BERT-NER/modules/utils/plot_metrics.py to calculate the metric results, which determine whether to save the model during training.
However, in this function, [3 + m_idx] in [float(h.split("\n")[-2].split()[3 + m_idx]) for h in history] should be [2 + m_idx] in order to obtain the correct values of the target metrics.
Please fix this bug, thanks a lot!
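A likely reason for the off-by-one, offered as an assumption: the expression reads numbers from the last line of a text report, and the position of the first number depends on how many words the row label has. Older scikit-learn reports ended with an "avg / total" row (three label tokens), while newer ones end with "weighted avg" (two). The sketch below uses made-up report tails to show the effect; metric_from_report is a hypothetical helper.

```python
def metric_from_report(report, label_tokens, m_idx=0):
    """Read the m_idx-th number from the last non-empty line of a text report."""
    last_line = report.split("\n")[-2]
    return float(last_line.split()[label_tokens + m_idx])


# Hypothetical report tails; only the final row matters here.
old_style = "...\navg / total       0.91      0.90      0.92       100\n"
new_style = "...\nweighted avg      0.91      0.90      0.92       100\n"

precision_old = metric_from_report(old_style, label_tokens=3)
precision_new = metric_from_report(new_style, label_tokens=2)
shifted = metric_from_report(new_style, label_tokens=3)  # silently reads recall
```

Here precision_old and precision_new are both 0.91, while shifted is 0.90: with the wrong offset the code still runs but compares models on a different metric than intended.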
I see this message in the README: "Sorry, we are in developing. Release is coming soon :("
What does the message mean? Does it mean that some functionality is missing from the code? Is the code complete?
Thanks
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3799: character maps to <undefined>
Passing the vocab file to NerData.create() raises this Unicode error.
Something like this needs to be added to the code to make it work:
open(vocab_file,encoding='utf8').read()
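The 'charmap' codec in the message indicates that the file was opened with the platform's default encoding (typical on Windows) instead of UTF-8. A minimal sketch of reading a vocabulary file with an explicit encoding follows; read_vocab is a hypothetical helper, not the repository's loader.

```python
def read_vocab(path):
    """Read one vocabulary entry per line, decoding the file explicitly as UTF-8."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```

Passing encoding explicitly makes the result independent of the operating system's locale settings.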
I had an issue while building a function that predicts a single sentence without passing a dataloader instance.
These are the steps I followed:
# Assumed import, based on the pytorch_pretrained_bert package shown in the traceback below.
from pytorch_pretrained_bert import BertTokenizer

sentence = 'put a sentence'
bert_tokens = []
tok_map = []
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# label2idx = {"[PAD]": pad_idx, '[CLS]': 1, '[SEP]': 2, "X": 3}
# idx2label = ["[PAD]", '[CLS]', '[SEP]', "X"]
orig_tokens = sentence.split()
orig_tokens = ["[CLS]"] + orig_tokens + ["[SEP]"]
# Split every word into word pieces and record where each word ends.
for origin_token in orig_tokens:
    cur_tokens = tokenizer.tokenize(origin_token)
    bert_tokens.extend(cur_tokens)
    tok_map.append(len(bert_tokens))
input_ids = tokenizer.convert_tokens_to_ids(bert_tokens)
input_mask = [1] * len(input_ids)
# Pad everything to a fixed length of 424.
while len(input_ids) < 424:
    input_mask.append(0)
    tok_map.append(-1)
    input_ids.append(0)
input_type_ids = [0] * len(input_ids)
The problem is I couldn't figure out what batch is in order to predict using model.forward(batch)
I tried this:
batch=[[0],[0],[0]]
batch[0]=input_ids
batch[1]=input_type_ids
batch[2]=input_mask
learner.model.forward(batch)
and this is what I got:
~/ner-bert-master-last-version/ner-bert-master-last-version/modules/models/bert_models.py in forward(self, batch)
46 def forward(self, batch):
47 input_, labels_mask, input_type_ids = batch[:3]
---> 48 input_embeddings = self.embeddings(batch)
49 output, _ = self.lstm.forward(batch)
50 output, _ = self.attn(output, output, output, None)
~/.local/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/ner-bert-master-last-version/ner-bert-master-last-version/modules/layers/embedders.py in forward(self, batch)
59 token_type_ids=batch[2],
60 attention_mask=batch[1],
---> 61 output_all_encoded_layers=self.config["mode"] == "weighted")
62 if self.config["mode"] == "weighted":
63 encoded_layers = torch.stack([a * b for a, b in zip(encoded_layers, self.bert_weights)])
~/.local/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/.local/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py in forward(self, input_ids, token_type_ids, attention_mask, output_all_encoded_layers)
718 # this attention mask is more simple than the triangular masking of causal attention
719 # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
--> 720 extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
721
722 # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
AttributeError: 'list' object has no attribute 'unsqueeze'
Can you please help?
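Two things stand out in the traceback. First, the AttributeError says a plain list reached a call that expects a tensor, so each batch entry would need converting (for example with torch.tensor) and placing on the model's device. Second, the quoted frames read batch[1] as the attention mask and batch[2] as the token type ids, whereas the attempt above puts them the other way round. The sketch below builds the three padded lists in that order with a leading batch dimension of one; build_batch_lists is a hypothetical helper, and the tensor conversion is left out to keep the example free of third-party imports.

```python
def build_batch_lists(token_ids, max_len, pad_id=0):
    """Return [input_ids, input_mask, input_type_ids], each wrapped as a batch of size 1."""
    ids = list(token_ids)[:max_len]
    pad = max_len - len(ids)
    input_ids = ids + [pad_id] * pad
    input_mask = [1] * len(ids) + [0] * pad   # 1 for real tokens, 0 for padding
    input_type_ids = [0] * max_len            # single-sentence input
    # Order follows the traceback: batch[0]=ids, batch[1]=mask, batch[2]=type ids.
    return [[input_ids], [input_mask], [input_type_ids]]


batch_lists = build_batch_lists([101, 5, 6, 102], max_len=6)
```

Whether the model's forward needs further entries in the batch (such as a labels mask of its own) is not visible from the traceback, so treat this as a partial step.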
In the Jupyter notebook "conll2003 BERTBiLSTMCRF" in the "examples" folder, the result report is as follows:
I notice you put the macro-avg value "0.9221" in the "README.md" file, but it seems that the code at "https://paperswithcode.com/sota/named-entity-recognition-ner-on-conll-2003" adopts the micro-avg value as the final F1 value.
I would appreciate it very much if you could tell me why. Thanks.
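For readers comparing the two numbers: macro-averaged F1 is the unweighted mean of per-class F1 scores, while micro-averaged F1 pools true positives, false positives, and false negatives over all classes first. The sketch below computes both from per-class counts; the counts are invented for illustration and are not results from this repository.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts):
    """Mean of per-class F1; every class weighs the same."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts):
    """F1 over pooled counts; frequent classes dominate."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


# Invented (tp, fp, fn) counts: one frequent, well-predicted class and one rare, weak class.
counts = {"PER": (90, 10, 10), "MISC": (5, 5, 5)}
```

With these counts macro_f1 is about 0.70 and micro_f1 about 0.86, so the two averages can differ noticeably on imbalanced label sets, and the number used should match the one the benchmark reports.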
Hi,
When running the conversion script convert_tf_checkpoint_to_pytorch.py I get the following error when trying to convert the BERT-Base, uncased model:
Traceback (most recent call last):
File "convert_tf_checkpoint_to_pytorch.py", line 107, in <module>
convert()
File "convert_tf_checkpoint_to_pytorch.py", line 100, in convert
pointer.data = torch.from_numpy(array)
TypeError: expected np.ndarray (got numpy.ndarray)
Any ideas? Thanks
Hello.
In your example (https://github.com/sberbank-ai/ner-bert/blob/master/examples/factrueval-nmt.ipynb) you are using BIO markup.
But in the code (bert_data.py (187)):
# prev_label = ""
for idx_, (orig_token, label) in enumerate(zip(orig_tokens, labels)):
    # Fix BIO to IO as BERT proposed https://arxiv.org/pdf/1810.04805.pdf
    try:
you use IO. How do you get the original markup back after training?
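One common convention for going back, sketched here as an assumption rather than as the repository's method: treat the first tag of each run of equal entity types as B- and the rest as I-. The function name and the "B-"/"I-" prefix format are illustrative.

```python
def io_to_bio(tags):
    """Rebuild BIO tags from IO tags by marking the start of each run of equal tags."""
    bio = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            bio.append("O")
        elif tag == prev:
            bio.append("I-" + tag)
        else:
            bio.append("B-" + tag)
        prev = tag
    return bio


restored = io_to_bio(["O", "PER", "PER", "O", "LOC"])
# restored == ["O", "B-PER", "I-PER", "O", "B-LOC"]
```

The conversion is lossy in one case: two adjacent entities of the same type are indistinguishable in IO and come back as a single span.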