Rajmehta123 avatar Rajmehta123 commented on May 28, 2024

I figured out a solution to this problem.

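    import itertools  # used by collapse() below to drop duplicates

    def join_tokens(tokens):
        # Join tokens with spaces, but glue a token directly onto the text so far
        # when the token is not purely alphabetic or the text does not end in a letter.
        res = ''
        if tokens:
            res = tokens[0]
            for token in tokens[1:]:
                if not (token.isalpha() and res[-1].isalpha()):
                    res += token  # punctuation
                else:
                    res += ' ' + token  # regular word
        return res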
    
    def collapse(ner_result):
        # List with the result
        collapsed_result = []
    
    
        current_entity_tokens = []
        current_entity = None
    
        # Iterate over the tagged tokens
        for token, tag in ner_result:
    
            if tag.startswith("B-"):
                # ... if we have a previous entity in the buffer, store it in the result list
                if current_entity is not None:
                    collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
    
                current_entity = tag[2:]
                # The new entity has so far only one token
                current_entity_tokens = [token]
    
            # If the entity continues ...
            elif current_entity is not None and tag == "I-" + current_entity:
                # Just add the token to the buffer
                current_entity_tokens.append(token)
            else:
                # Any other tag ends the current entity: flush the buffer if it holds one
                if current_entity is not None:
                    collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
                # Keep the token on its own; tag[2:] is '' for an "O" tag
                collapsed_result.append([token, tag[2:]])

                current_entity_tokens = []
                current_entity = None
    
        # The last entity is still in the buffer, so add it to the result
        # ... but only if there was an entity at all
        if current_entity is not None:
            collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            # Sort and drop exact duplicates. Note that sorting discards the sentence
            # order, and that this only runs when the input ends inside an entity.
            collapsed_result = sorted(collapsed_result)
            collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))
    
        return collapsed_result

Update
This solves most cases, but there will always be outliers.
For example, the tags for the sentence "U.S. Securities and Exchange Commission" are ['U.S.', 'B-ORG'] ['Securities', 'I-ORG'] ['and', 'I-ORG'] ['Exchange', 'I-ORG'] ['Commission', 'I-ORG']. Running collapse on them changes the sentence into "U.S.Securities and Exchange Commission": because "U.S." ends in a period, join_tokens takes the punctuation branch and appends "Securities" without a space.

So the complete solution is to track which original word each token came from, by building a lookup table (LUT) for the original sentence:

    text = "U.S. Securities and Exchange Commission"
    # tokenize() stands for whatever tokenizer produced the NER input tokens
    lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]

    # lut = [("U", 0), (".", 0), ("S", 0), (".", 0), ("Securities", 1), ("and", 2), ("Exchange", 3), ("Commission", 4)]
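
The `tokenize` call is not defined anywhere in this comment; presumably it is the tokenizer that fed the NER model. As a minimal runnable sketch, assuming a hypothetical regex-based stand-in called `simple_tokenize` (not part of bert-ner) that splits punctuation off from words, the table could be built like this:

```python
import re

def simple_tokenize(word):
    # Hypothetical stand-in for the real tokenizer: a run of word characters
    # becomes one token, and every other non-space character is its own token.
    return re.findall(r"\w+|[^\w\s]", word)

def build_lut(text):
    # Pair every token with the index of the whitespace-separated word it came from.
    return [(token, ix)
            for ix, word in enumerate(text.split())
            for token in simple_tokenize(word)]

lut = build_lut("U.S. Securities and Exchange Commission")
print(lut)
```

With a real subword tokenizer the tokens would differ, but the word index attached to each token would be built the same way.
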
Now, given a token's index, you know exactly which word it came from, so you can concatenate tokens that belong to the same word and add a space only when a token belongs to a different word. The NER result would then look something like:

[["U","B-ORG", 0], [".","I-ORG", 0], ["S", "I-ORG", 0], [".","I-ORG", 0], ['Securities', 'I-ORG', 1], ['and', 'I-ORG', 2], ['Exchange', 'I-ORG',3], ['Commission', 'I-ORG', 4]]
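
Joining on the word index could then look roughly like the sketch below. The name `join_by_word` is made up for illustration, and the function assumes each item is a `[token, tag, word_index]` triple as in the list above; it only rebuilds the text of one entity and does not replace the full collapse logic.

```python
def join_by_word(tagged_tokens):
    """Rebuild the surface text from [token, tag, word_index] triples."""
    text = ""
    previous_word = None
    for token, _tag, word_ix in tagged_tokens:
        # Insert a space only when the token starts a new original word.
        if previous_word is not None and word_ix != previous_word:
            text += " "
        text += token
        previous_word = word_ix
    return text

ner_result = [
    ["U", "B-ORG", 0], [".", "I-ORG", 0], ["S", "I-ORG", 0], [".", "I-ORG", 0],
    ["Securities", "I-ORG", 1], ["and", "I-ORG", 2],
    ["Exchange", "I-ORG", 3], ["Commission", "I-ORG", 4],
]
print(join_by_word(ner_result))  # U.S. Securities and Exchange Commission
```

Because spacing now depends only on word boundaries, the isalpha heuristic in join_tokens is no longer needed for this case.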

from bert-ner.
