Any way to combine the BIO tokens into compound words, i.e.: Detokenizing to form the

Detokenizing the words about bert-ner HOT 1 CLOSED

Rajmehta123 commented on May 28, 2024

Detokenizing the words

Comments (1)

Rajmehta123 commented on May 28, 2024

I figured out a solution to this problem.

    import itertools

    def join_tokens(tokens):
        """Join tokens, putting a space only between two alphabetic neighbours."""
        res = ''
        if tokens:
            res = tokens[0]
            for token in tokens[1:]:
                if not (token.isalpha() and res[-1].isalpha()):
                    res += token  # punctuation: attach without a space
                else:
                    res += ' ' + token  # regular word
        return res

    def collapse(ner_result):
        """Merge BIO-tagged (token, tag) pairs into [text, entity] pairs."""
        collapsed_result = []

        current_entity_tokens = []
        current_entity = None

        # Iterate over the tagged tokens
        for token, tag in ner_result:

            if tag.startswith("B-"):
                # If a previous entity is in the buffer, store it in the result list
                if current_entity is not None:
                    collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

                current_entity = tag[2:]
                # The new entity has only one token so far
                current_entity_tokens = [token]

            # If the entity continues, add the token to the buffer
            elif current_entity is not None and tag == "I-" + current_entity:
                current_entity_tokens.append(token)
            else:
                # The entity ended: flush the buffer only if it holds an entity
                if current_entity is not None:
                    collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
                # tag[2:] is '' for an "O" tag
                collapsed_result.append([token, tag[2:]])

                current_entity_tokens = []
                current_entity = None

        # The last entity may still be in the buffer, so add it to the result
        if current_entity is not None:
            collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

        # Remove duplicates; note that sorting discards the original token order
        collapsed_result = sorted(collapsed_result)
        collapsed_result = [k for k, _ in itertools.groupby(collapsed_result)]

        return collapsed_result

Update
This solves most cases, but there will always be outliers.
For example, the tags for the phrase "U.S. Securities and Exchange Commission" are ['U.S.', 'B-ORG'] ['Securities', 'I-ORG'] ['and', 'I-ORG'] ['Exchange', 'I-ORG'] ['Commission', 'I-ORG'], and running collapse on them turns the phrase into "U.S.Securities and Exchange Commission". The space is lost because join_tokens only inserts one between two alphabetic tokens, and "U.S." ends in a period.
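The outlier can be reproduced with join_tokens on its own. The snippet below repeats the function from the solution above so that it runs standalone; the token list is the one from the example.

```python
def join_tokens(tokens):
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            # A space is added only when both neighbours are alphabetic
            if not (token.isalpha() and res[-1].isalpha()):
                res += token
            else:
                res += ' ' + token
    return res

tokens = ["U.S.", "Securities", "and", "Exchange", "Commission"]
# "U.S." ends with ".", so "Securities" is glued on without a space
print(join_tokens(tokens))  # U.S.Securities and Exchange Commission
```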

So the complete solution is to track which word of the original sentence each token came from, by building a lookup table (LUT) for the original sentence:

    text = "U.S. Securities and Exchange Commission"
    # tokenize() stands for whichever tokenizer produced the NER input
    lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]

    # lut = [("U",0), (".",0), ("S",0), (".",0), ("Securities",1), ("and",2), ("Exchange",3), ("Commission",4)]

Now, given a token's index, you know exactly which word it came from. You can simply concatenate tokens that belong to the same word and add a space whenever a token belongs to a different word. The NER result would then look something like:

[["U","B-ORG", 0], [".","I-ORG", 0], ["S", "I-ORG", 0], [".","I-ORG", 0], ['Securities', 'I-ORG', 1], ['and', 'I-ORG', 2], ['Exchange', 'I-ORG',3], ['Commission', 'I-ORG', 4]]
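Here is a minimal sketch of that LUT-based join, not a definitive implementation. `simple_tokenize`, `build_lut` and `join_by_word_index` are hypothetical names introduced for illustration, and the regex tokenizer is only an assumed stand-in for the real one. The tags are hard-coded to match the example rather than produced by a model.

```python
import re

def simple_tokenize(word):
    # Assumed stand-in tokenizer: runs of word characters, or single
    # punctuation characters. Replace with the tokenizer actually used.
    return re.findall(r"\w+|[^\w\s]", word)

def build_lut(text):
    # Pair every token with the index of the whitespace-separated word
    # it was produced from.
    return [(token, ix)
            for ix, word in enumerate(text.split())
            for token in simple_tokenize(word)]

def join_by_word_index(tagged):
    # tagged: (token, tag, word_index) triples belonging to one entity.
    pieces = []
    prev_ix = None
    for token, _tag, ix in tagged:
        # Insert a space only when the token starts a new original word
        if prev_ix is not None and ix != prev_ix:
            pieces.append(" ")
        pieces.append(token)
        prev_ix = ix
    return "".join(pieces)

text = "U.S. Securities and Exchange Commission"
lut = build_lut(text)

# Hard-coded tags for the example: one B-ORG followed by I-ORG tokens
tags = ["B-ORG"] + ["I-ORG"] * (len(lut) - 1)
tagged = [(token, tag, ix) for (token, ix), tag in zip(lut, tags)]

print(join_by_word_index(tagged))  # U.S. Securities and Exchange Commission
```

Because spacing is decided by the word index and not by the characters themselves, "U", ".", "S", "." are rejoined as "U.S." and the space before "Securities" is preserved.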
