
Comments (47)

bheinzerling avatar bheinzerling commented on May 4, 2024 51

I have some code for preparing batches here:

https://github.com/bheinzerling/dougu/blob/2f54b14d588f17d77b7a8bca9f4e5eb38d6a2805/dougu/bert.py#L98

The important methods are subword_tokenize_to_ids and subword_tokenize, you can probably ignore the other stuff.

With this, feature extraction for each sentence, i.e. a list of tokens, is simply:

bert = dougu.bert.Bert.Model("bert-base-cased")
featurized_sentences = []
for tokens in sentences:
    features = {}
    features["bert_ids"], features["bert_mask"], features["bert_token_starts"] = bert.subword_tokenize_to_ids(tokens)
    featurized_sentences.append(features)

Then I use a custom collate function for a DataLoader that turns featurized_sentences into batches:

def collate_fn(featurized_sentences_batch):
    bert_batch = [
        torch.cat([features[key] for features in featurized_sentences_batch], dim=0)
        for key in ("bert_ids", "bert_mask", "bert_token_starts")]
    return bert_batch
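
A minimal sketch of plugging this into a DataLoader (the batch size and shuffling here are only illustrative):

from torch.utils.data import DataLoader

loader = DataLoader(featurized_sentences, batch_size=32, shuffle=True, collate_fn=collate_fn)
for bert_ids, bert_mask, bert_token_starts in loader:
    ...  # feed the batch to the tagger below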

A simple sequence tagger module would look something like this:

class SequenceTagger(torch.nn.Module):
    def __init__(self, data_parallel=True):
        super().__init__()
        bert = BertModel.from_pretrained("bert-base-cased").to(device=torch.device("cuda"))
        if data_parallel:
            self.bert = torch.nn.DataParallel(bert)
        else:
            self.bert = bert
        bert_dim = 768  # (or get the dim from BertEmbeddings)
        n_labels = 5  # need to set this for your task
        self.out = torch.nn.Linear(bert_dim, n_labels)
        ...  # dropout, log_softmax...
    def forward(self, bert_batch, true_labels):
        bert_ids, bert_mask, bert_token_starts = bert_batch
        # truncate to longest sequence length in batch (usually much smaller than 512) to save GPU RAM
        max_length = (bert_mask != 0).max(0)[0].nonzero()[-1].item()
        if max_length < bert_ids.shape[1]:
            bert_ids = bert_ids[:, :max_length]
            bert_mask = bert_mask[:, :max_length]

        segment_ids = torch.zeros_like(bert_mask)  # dummy segment IDs, since we only have one sentence
        bert_last_layer = self.bert(bert_ids, segment_ids)[0][-1]
        # select the states representing each token start, for each instance in the batch
        bert_token_reprs = [
            layer[starts.nonzero().squeeze(1)]
            for layer, starts in zip(bert_last_layer, bert_token_starts)]
        # need to pad because sentence length varies
        padded_bert_token_reprs = pad_sequence(
            bert_token_reprs, batch_first=True, padding_value=-1)
        # output/classification layer: input bert states and get log probabilities for cross entropy loss
        pred_logits = self.log_softmax(self.out(self.dropout(padded_bert_token_reprs)))
        mask = true_labels != -1  # I set label = -1 for all padding tokens elsewhere
        loss = cross_entropy(pred_logits, true_labels)
        # average/reduce the loss according to the actual number of predictions (i.e. one prediction per token)
        loss /= mask.float().sum()
        return loss

Wrote this without checking if it runs (my actual code is tied into some other things so I cannot just copy&paste it), but it should help you get started.
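
A rough training-loop sketch around this module (assuming a loader that yields both the bert_batch from the collate function above and padded true_labels with -1 at padded positions, with everything already on the GPU; the optimizer settings are only illustrative):

tagger = SequenceTagger()
optimizer = torch.optim.Adam(tagger.parameters(), lr=1e-4)
for epoch in range(n_epochs):
    for bert_batch, true_labels in train_loader:
        optimizer.zero_grad()
        loss = tagger(bert_batch, true_labels)
        loss.backward()
        optimizer.step()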


Single430 avatar Single430 commented on May 4, 2024 12
labels = ['B-PERS', 'I-PERS', 'O', 'B-LOC', 'I-LOC']
labels2id = {'B-PERS': 0, 'I-PERS': 1, 'O': 2, 'B-LOC': 3, 'I-LOC': 4}
sent = ['[CLS]', 'john', 'johan',  '##son', 'lives',  'in', 'ramat', 'gan', '.', '[SEP]']
labels = [2, 0, 1, 1, 2, 2, 3, 4, 2, 2]
attention_mask = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sentence_id = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

@AlxndrMlk


bheinzerling avatar bheinzerling commented on May 4, 2024 10

@zhaoxy92 what sequence labeling task are you doing? I've got CoNLL'03 NER running with the bert-base-cased model, and also found the same sensitivity to hyper-parameters.

The best dev F1 score I've gotten after a day of trying some parameters is 94.6, which is a bit lower than the 96.4 dev score for BERT_base reported in the paper. I guess more tuning will increase the score some more.

The best configuration for me so far is:

  • Batch size: 160 (on four P40 GPUs with 24GB RAM each). Smaller batch sizes that fit on one or two GPUs give bad results.
  • Optimizer: Adam with learning rate 1e-4. Tried BertAdam with learning rate 1e-5, but it didn't seem to converge.
  • fp16/fp32: Only fp32 works. Tried fp16 (half precision) to allow larger batch sizes, but this gave really low scores, with and without loss scaling.

Also, properly averaging the loss is important: not just loss /= batch_size. You need to take into account padding and word pieces without predictions (google-research/bert#33 (comment)). If you have a mask tensor that indicates which BERT inputs correspond to tagged tokens, then the proper averaging is loss /= mask.float().sum().
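
In code, that masked averaging looks roughly like this (a sketch; it assumes raw logits pred_logits of shape [batch, seq_len, n_labels] and labels padded with -1):

import torch.nn.functional as F

mask = true_labels != -1
loss = F.cross_entropy(pred_logits.view(-1, n_labels), true_labels.view(-1),
                       ignore_index=-1, reduction="sum")
loss = loss / mask.float().sum()  # i.e. average over real tokens only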

Another tip, truncating the input (#66) enables much larger batch sizes. Without it the largest possible batch size was 56, but with truncating 160 is possible.


kamalkraj avatar kamalkraj commented on May 4, 2024 4

https://github.com/kamalkraj/BERT-NER
Replicated results from BERT paper


nijianmo avatar nijianmo commented on May 4, 2024 2

Thanks for sharing these tips here! It helps a lot.

I tried to finetune BERT on multiple imbalanced datasets and found the results quite unstable... By an imbalanced dataset, I mean there are many more O labels than the others under the {B,I,O} tagging scheme. I tried a weighted cross-entropy loss but the performance is still not as expected. Has anyone met the same issue?

Thanks!


g-jing avatar g-jing commented on May 4, 2024 2

@nijianmo Hi, I have recently been considering using a weighted loss for the NER task. I wonder if you have tried a weighted CRF or weighted softmax in a PyTorch implementation. If so, did you get good performance? Thanks in advance.


bheinzerling avatar bheinzerling commented on May 4, 2024 2

@ramithp that was added in v2 of the paper, but wasn't present in v1, which is the version the discussion here refers to


bheinzerling avatar bheinzerling commented on May 4, 2024 2

@sougata-fiz

When I wrote that code, self.bert(bert_ids, segment_ids) returned a tuple, of which the first element contained all hidden states. I think this changed at some point. What BertModel's forward returns now is described here: https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py#L648, so you would have to make the appropriate changes.
Alternatively, you could also try the TokenClassification models, which have since been added: https://huggingface.co/transformers/v2.5.0/model_doc/auto.html#automodelfortokenclassification
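
A minimal sketch of that latter route (the number of labels is just a placeholder for your task):

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=5)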


linhlt-it-ee avatar linhlt-it-ee commented on May 4, 2024 2

@shushanxingzhe: I think you are using the label 'O' as the padding label in your code. From my point of view, you should have a separate 'PAD' label for padding instead of using the 'O' label.
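
For concreteness, a sketch of such a scheme (all names are illustrative; logits and label_ids are assumed to come from the model and the data pipeline):

import torch

labels2id = {'PAD': 0, 'O': 1, 'B-PERS': 2, 'I-PERS': 3, 'B-LOC': 4, 'I-LOC': 5}
# keep padded positions out of the loss
loss = torch.nn.functional.cross_entropy(
    logits.view(-1, len(labels2id)), label_ids.view(-1),
    ignore_index=labels2id['PAD'])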


srslynow avatar srslynow commented on May 4, 2024 1

@bheinzerling with the risk of going off topic here, would you mind sharing your code? I'd love to read and adapt it for a similar sequential classification task.


bheinzerling avatar bheinzerling commented on May 4, 2024 1

@kugwzk I didn't do any more CoNLL'03 runs since the numbers reported in the BERT paper were apparently achieved by using document context, which is different from the standard sentence-based evaluation. You can find more details here: allenai/allennlp#2067 (comment)


dangal95 avatar dangal95 commented on May 4, 2024 1

Hi all,

I am trying to train the BERT model on some data that I have. However, I am having trouble understanding how to adjust the labels following tokenization. I am trying to perform word level classification (similar to NER)

If I have the following tokenized sentence and its labels:

original_tokens = ['The', '<start>', 'eng-30-01258617-a', '<end>', 'frailty']
original_labels = [0, 2, 3, 4, 1]

Then after using the BERT tokenizer I get the following:
bert_tokens = ['[CLS]', 'the', '<start>', 'eng-30-01258617-a', '<end>', 'frail', '##ty', '[SEP]']

Also, I adjust my label array as follows:
bert_labels = [0, 2, 3, 4, 1, 1]

N.B. Tokens such as eng-30-01258617-a are not tokenized further as I included an ignore list which contains words and tokens that I do not want tokenized and I swapped them with the [unusedXXX] tokens found in the vocab.txt file.

Notice how the last word 'frailty' is transformed into ['frail', '##ty'] and the label '1' which was used for the whole word is now placed under each word piece. Is this the correct way of doing it? If you would like a more in-depth explanation of what I am trying to achieve you can read the following: https://stackoverflow.com/questions/56129165/how-to-handle-labels-when-using-the-berts-wordpiece-tokenizer

Any help would be greatly appreciated! Thanks in advance


bheinzerling avatar bheinzerling commented on May 4, 2024 1

@dangal95, adjusting the original labels is probably not the best way. A simpler method that works well is described in this issue, here #64 (comment)


weizhepei avatar weizhepei commented on May 4, 2024 1

Many thanks to @bheinzerling! For anyone who may be interested, I've implemented a NER model based on pytorch-transformers and @bheinzerling's idea, which might help you get a quick start on it. Welcome to check it out.


chnsh avatar chnsh commented on May 4, 2024 1

Thanks to #64 (comment), I could get the implementation to work - for anyone else that's struggling to reproduce the results: https://github.com/chnsh/BERT-NER-CoNLL


kamalkraj avatar kamalkraj commented on May 4, 2024 1

BERT-NER in Tensorflow 2.0
https://github.com/kamalkraj/BERT-NER-TF


imayachita avatar imayachita commented on May 4, 2024 1

[quotes @nijianmo's comment above about unstable results when fine-tuning BERT on imbalanced {B,I,O} datasets]

Hi @nijianmo, did you find any workaround for this? Thanks!


AlxndrMlk avatar AlxndrMlk commented on May 4, 2024 1

Hi everyone!

Thanks for your posts! I was wondering: could anyone post an explicit example of what properly formatted data for NER with BERT would look like? It is not entirely clear to me from the paper and the comments I've found.

Let's say we have the following sentence and labels:

sent = "John Johanson lives in Ramat Gan."
labels = ['B-PERS', 'I-PERS', 'O', 'O', 'B-LOC', 'I-LOC']

Would data that we input to the model be something like this:

sent = ['[CLS]', 'john', 'johan',  '##son', 'lives',  'in', 'ramat', 'gan', '.', '[SEP]']
labels = ['O', 'B-PERS', 'I-PERS', 'I-PERS', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O']
attention_mask = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sentence_id = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

?

Thank you!
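
For concreteness, a rough sketch of how one such example might then be padded into fixed-length model inputs (max_len is illustrative, tokenizer is assumed to be a BertTokenizer, and labels2id is a label-to-id map like the one shown earlier in the thread):

max_len = 16
input_ids = tokenizer.convert_tokens_to_ids(sent)   # sent already contains [CLS] and [SEP]
label_ids = [labels2id[l] for l in labels]
attention_mask = [1] * len(input_ids)
pad_len = max_len - len(input_ids)
input_ids += [0] * pad_len           # 0 is the [PAD] id in the standard BERT vocab
label_ids += [-1] * pad_len          # -1 so padded positions can be ignored in the loss
attention_mask += [0] * pad_len
token_type_ids = [0] * max_len       # single-sentence input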


Single430 avatar Single430 commented on May 4, 2024 1
labels = ['B-PERS', 'I-PERS', 'O', 'B-LOC', 'I-LOC']
labels2id = {'B-PERS': 0, 'I-PERS': 1, 'O': 2, 'B-LOC': 3, 'I-LOC': 4}
sent = ['[CLS]', 'john', 'johan',  '##son', 'lives',  'in', 'ramat', 'gan', '.', '[SEP]']
labels = [2, 0, 1, 1, 2, 2, 3, 4, 2, 2]
attention_mask = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sentence_id = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

@AlxndrMlk

Hello, if we have the following sentence:

sent = "Johanson lives in Ramat Gan."
labels = ['B-PERS', 'O', 'O', 'B-LOC', 'I-LOC']

Would “Johanson” be processed like this?

'johan',  '##son'  
'B-PERS'    'I-PERS'

or like this?

'johan',  '##son'  
'B-PERS'   'B-PERS'      

thank you!

The first option is right: you need to add an extra 'I-PERS' label to labels for the '##son' piece.


thomwolf avatar thomwolf commented on May 4, 2024

Well that seems like a good approach. Maybe you can find some inspiration in the code of the BertForQuestionAnswering model? It is not exactly what you are doing but maybe it can help.


zhaoxy92 avatar zhaoxy92 commented on May 4, 2024

Thanks. It worked. However, an interesting issue with BERT is that it's highly sensitive to the learning rate, which makes it very difficult to combine with other models.


zhaoxy92 avatar zhaoxy92 commented on May 4, 2024

I am also working on CoNLL03. Similar results as you got.


rremani avatar rremani commented on May 4, 2024

@bheinzerling Thanks a lot for the starter, got awesome results!


kugwzk avatar kugwzk commented on May 4, 2024

Hi @bheinzerling,
I used batch size=16 and lr=2e-5, and got dev F1=0.951 and test F1=0.914, which is lower than ELMo. What is your result now?


kugwzk avatar kugwzk commented on May 4, 2024

Hmmm... I think they should state that in the paper. And do you know where to find out that they used document context?


bheinzerling avatar bheinzerling commented on May 4, 2024

That's what the folks over at allennlp said. I don't know where they got this information, maybe personal communication with one of the BERT authors?


kugwzk avatar kugwzk commented on May 4, 2024

Anyway, thank you very much for telling me that.


JianLiu91 avatar JianLiu91 commented on May 4, 2024

https://github.com/JianLiu91/bert_ner gives a solution that is very easy to understand.
However, I still wonder whether it is the best practice.


stale avatar stale commented on May 4, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


g-jing avatar g-jing commented on May 4, 2024

[quotes @bheinzerling's batch-preparation and SequenceTagger code from above]

I did not realize there was a subword_tokenize method until I saw your post. I had spent a lot of time writing that method myself.


ramithp avatar ramithp commented on May 4, 2024

That's what the folks over at allennlp said. I don't know where they got this information, maybe personal communication with one of the BERT authors?

Just adding a bit of clarification since I revisited the paper after reading that comment.

From the BERT Paper Section 5.3 (https://arxiv.org/pdf/1810.04805.pdf)
In this section, we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003). In the input to BERT, we use a case-preserving WordPiece model, and we include the maximal document context provided by the data.


ramithp avatar ramithp commented on May 4, 2024

@bheinzerling Yeah, I just realized that. No wonder I couldn't remember seeing it earlier. Thanks for confirming it. Just wanted to add that bit to the thread in case there were others that haven't read the revision.


g-jing avatar g-jing commented on May 4, 2024

@zhaoxy92 @thomwolf @bheinzerling @srslynow @rremani
Sorry for tagging all of you. I wonder how to set the weight decay for parameters outside the BERT structure, for example the CRF parameters after the BERT output. Should I set it to 0.01 or 0? Sorry again for tagging all of you, but it is kind of urgent.


srslynow avatar srslynow commented on May 4, 2024

@zhaoxy92 @thomwolf @bheinzerling @srslynow @rremani
Sorry for tagging all of you. I wonder how to set the weight decay for parameters outside the BERT structure, for example the CRF parameters after the BERT output. Should I set it to 0.01 or 0? Sorry again for tagging all of you, but it is kind of urgent.

This repository does not use a CRF for NER classification, though. Anyway, the parameters of a CRF depend on the data distribution you have. These links might be useful: https://towardsdatascience.com/conditional-random-field-tutorial-in-pytorch-ca0d04499463 and https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html


g-jing avatar g-jing commented on May 4, 2024

@srslynow Thanks for your answer! I am familiar with CRFs, but I am confused about how to set the weight decay when the CRF is connected to BERT. The authors and huggingface do not seem to have mentioned how to set the weight decay for parameters outside the BERT structure.
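
For what it's worth, weight decay in PyTorch is set per parameter group, so the non-BERT parameters can be given their own value explicitly. A minimal sketch (assuming a model object whose CRF parameters live under a "crf" name prefix; whether those get 0.01 or 0.0 is exactly the judgment call in question):

no_decay = ["bias", "LayerNorm.weight"]
grouped_params = [
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("crf") and not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("crf") and any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
    {"params": [p for n, p in model.named_parameters() if n.startswith("crf")],
     "weight_decay": 0.0},  # giving the CRF transitions no decay is a common conservative choice
]
optimizer = torch.optim.AdamW(grouped_params, lr=2e-5)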


antgr avatar antgr commented on May 4, 2024

[quotes @bheinzerling's batch-preparation and SequenceTagger code from above]

Hi, I am trying to make your code work, and here is my setup: I re-declare everything that is needed as free functions and constants.

import numpy as np
import torch
from pytorch_transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
SEP = "[SEP]"
MASK = '[MASK]'
CLS = "[CLS]"
max_len = 100
def flatten(list_of_lists):
    for list in list_of_lists:
        for item in list:
            yield item
def convert_tokens_to_ids(tokens, pad=True):
        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        ids = torch.tensor([token_ids]).to(device="cpu")
        assert ids.size(1) < max_len
        if pad:
            padded_ids = torch.zeros(1, max_len).to(ids)
            padded_ids[0, :ids.size(1)] = ids
            mask = torch.zeros(1, max_len).to(ids)
            mask[0, :ids.size(1)] = 1
            return padded_ids, mask
        else:
            return ids
    
def subword_tokenize(tokens):
        """Segment each token into subwords while keeping track of
        token boundaries.
        Parameters
        ----------
        tokens: A sequence of strings, representing input tokens.
        Returns
        -------
        A tuple consisting of:
            - A list of subwords, flanked by the special symbols required
                by Bert (CLS and SEP).
            - An array of indices into the list of subwords, indicating
                that the corresponding subword is the start of a new
                token. For example, [1, 3, 4, 7] means that the subwords
                1, 3, 4, 7 are token starts, while all other subwords
                (0, 2, 5, 6, 8...) are in or at the end of tokens.
                This list allows selecting Bert hidden states that
                represent tokens, which is necessary in sequence
                labeling.
        """
        subwords = list(map(tokenizer.tokenize, tokens))
        print ("subwords: ", subwords)
        subword_lengths = list(map(len, subwords))
        subwords = [CLS] + list(flatten(subwords)) + [SEP]
        print ("subwords: ", subwords)
        token_start_idxs = 1 + np.cumsum([0] + subword_lengths[:-1])
        return subwords, token_start_idxs

def subword_tokenize_to_ids(tokens):
        """Segment each token into subwords while keeping track of
        token boundaries and convert subwords into IDs.
        Parameters
        ----------
        tokens: A sequence of strings, representing input tokens.
        Returns
        -------
        A tuple consisting of:
            - A list of subword IDs, including IDs of the special
                symbols (CLS and SEP) required by Bert.
            - A mask indicating padding tokens.
            - An array of indices into the list of subwords. See
                doc of subword_tokenize.
        """
        subwords, token_start_idxs = subword_tokenize(tokens)
        subword_ids, mask = convert_tokens_to_ids(subwords)
        token_starts = torch.zeros(1, 100).to(subword_ids)
        token_starts[0, token_start_idxs] = 1
        return subword_ids, mask, token_starts

and then I try to add your extra code.
I try to understand the code for this simple case:

sentences = [["the", "rolerationing", "ends"], ["A", "sequence", "of", "strings", ",", "representing", "input", "tokens", "."]]

For this input,
max_length = (bert_mask != 0).max(0)[0].nonzero()[-1].item()
is 11.

Some questions:

1) Is

bert(bert_ids, segment_ids)

the same as

bert(bert_ids)?

If so, the following is not needed: segment_ids = torch.zeros_like(bert_mask)  # dummy segment IDs, since we only have one sentence
Also, I do not understand what that comment means (# dummy segment IDs, since we only have one sentence).

2) In

bert_last_layer = self.bert(bert_ids, segment_ids)[0][-1]

why do you take the last element? Here -1 looks like the last sentence in the batch, so why do we say "last layer"?
Also, for the simple example above its size is torch.Size([11, 768]). Is this what we want?


antgr avatar antgr commented on May 4, 2024

Does this development make this conversation outdated? Can you please clarify?

def convert_examples_to_features(examples,


 avatar commented on May 4, 2024

I guess so


stale avatar stale commented on May 4, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


sougata-fiz avatar sougata-fiz commented on May 4, 2024

[quotes @bheinzerling's batch-preparation and SequenceTagger code from above]

@bheinzerling
The line bert_last_layer = bert_layers[0][-1] just takes the hidden representation of the last training example in the batch. Is this intended?


chutaklee avatar chutaklee commented on May 4, 2024

@dangal95, adjusting the original labels is probably not the best way. A simpler method that works well is described in this issue, here #64 (comment)

Hi, could you explain why adjusting the original labels is not suggested? It seems quite easy and straightforward.

# reference: https://github.com/huggingface/transformers/issues/64#issuecomment-443703063
def flatten(list_of_lists):
    for list in list_of_lists:
        for item in list:
           yield item

def subword_tokenize(tokens, labels):
    assert len(tokens) == len(labels)

    subwords = list(map(tokenizer.tokenize, tokens))
    subword_lengths = list(map(len, subwords))
    subwords = [CLS] + list(flatten(subwords)) + [SEP]
    token_start_idxs = 1 + np.cumsum([0] + subword_lengths[:-1])
    bert_labels = [[label] + (sublen-1) * ["X"] for sublen, label in zip(subword_lengths, labels)]
    bert_labels = ["O"] + list(flatten(bert_labels)) + ["O"]

    assert len(subwords) == len(bert_labels)
    return subwords, token_start_idxs, bert_labels
>> tokens = tokenizer.basic_tokenizer.tokenize("John Johanson lives in Ramat Gan.")
>> print(tokens)
['john', 'johanson', 'lives', 'in', 'ramat', 'gan', '.']
>> labels = ['B-PERS', 'I-PERS', 'O', 'O', 'B-LOC', 'I-LOC', 'O']
>> subword_tokenize(tokens, labels)
(['[CLS]',   'john',   'johan',   '##son',   'lives',   'in',   'rama',   '##t',   'gan',   '.',   '[SEP]'],  
array([1, 2, 4, 5, 6, 8, 9]),  
['O', 'B-PERS', 'I-PERS', 'X', 'O', 'O', 'B-LOC', 'X', 'I-LOC', 'O', 'O'])


zhouyongjie avatar zhouyongjie commented on May 4, 2024
labels = ['B-PERS', 'I-PERS', 'O', 'B-LOC', 'I-LOC']
labels2id = {'B-PERS': 0, 'I-PERS': 1, 'O': 2, 'B-LOC': 3, 'I-LOC': 4}
sent = ['[CLS]', 'john', 'johan',  '##son', 'lives',  'in', 'ramat', 'gan', '.', '[SEP]']
labels = [2, 0, 1, 1, 2, 2, 3, 4, 2, 2]
attention_mask = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sentence_id = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

@AlxndrMlk

Hello, if we have the following sentence:

sent = "Johanson lives in Ramat Gan."
labels = ['B-PERS', 'O', 'O', 'B-LOC', 'I-LOC']

Would “Johanson” be processed like this?

'johan',  '##son'  
'B-PERS'    'I-PERS'

or like this?

'johan',  '##son'  
'B-PERS'   'B-PERS'      

thank you!


hkmztrk avatar hkmztrk commented on May 4, 2024
labels = ['B-PERS', 'I-PERS', 'O', 'B-LOC', 'I-LOC']
labels2id = {'B-PERS': 0, 'I-PERS': 1, 'O': 2, 'B-LOC': 3, 'I-LOC': 4}
sent = ['[CLS]', 'john', 'johan',  '##son', 'lives',  'in', 'ramat', 'gan', '.', '[SEP]']
labels = [2, 0, 1, 1, 2, 2, 3, 4, 2, 2]
attention_mask = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sentence_id = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Hello, I'm confused about the labels for the [CLS] and [PAD] tokens. Assume that I originally have labels [0, 1, 2, 3, 4] for the words; should I add [CLS] and [PAD] as another label? I see that in the example here [CLS] and [SEP] take label '2'. Does setting the attention mask to 0 for those positions solve this?


shushanxingzhe avatar shushanxingzhe commented on May 4, 2024

This repository shows how to add a CRF layer on top of transformers to get better performance on token classification tasks.
https://github.com/shushanxingzhe/transformers_ner


linhlt-it-ee avatar linhlt-it-ee commented on May 4, 2024

Thanks a lot @shushanxingzhe


linhlt-it-ee avatar linhlt-it-ee commented on May 4, 2024

Could someone please tell me how to use CRF decoding with padding? With the code below, I always get the error "expected seq=18 but got 13" on the next line, tags = torch.Tensor(tags):

if labels is not None:
    log_likelihood, tags = self.crf(logits, labels, attn_mask), self.crf.decode(logits, attn_mask)
    loss = 0 - log_likelihood
else:
    tags = self.crf.decode(logits, attn_mask)
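
If this is the pytorch-crf package, decode returns plain Python lists whose lengths follow the mask, so the sequences in a batch have different lengths and cannot be stacked into a tensor directly. A sketch of re-padding them first (pad_tag_id is an assumed padding tag id):

decoded = self.crf.decode(logits, attn_mask)   # list of lists with unpadded lengths
max_len = logits.size(1)
tags = torch.tensor(
    [seq + [pad_tag_id] * (max_len - len(seq)) for seq in decoded])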


zycalice avatar zycalice commented on May 4, 2024

Can we just remove the non-first subtokens during feature processing if we are treating the NER problem as a classification problem?

Example:
labels = ['B-PERS', 'I-PERS', 'O', 'B-LOC', 'I-LOC']
labels2id = {'B-PERS': 0, 'I-PERS': 1, 'O': 2, 'B-LOC': 3, 'I-LOC': 4}
sent = ['[CLS]', 'john', 'johan', '##son', 'lives', 'in', 'ramat', 'gan', '.', '[SEP]']

cleaned_sent = ['[CLS]', 'john', 'johan', 'lives', 'in', 'ramat', 'gan', '.', '[SEP]']
