Comments (1)
I figured out a solution to this problem.
import itertools

def join_tokens(tokens):
    # Join tokens, inserting a space only between two alphabetic tokens
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # punctuation: attach without a space
            else:
                res += ' ' + token  # regular word
    return res

def collapse(ner_result):
    # List with the result
    collapsed_result = []
    current_entity_tokens = []
    current_entity = None
    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]
        # If the entity continues ...
        elif current_entity is not None and tag == "I-" + current_entity:
            # Just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            # Flush the buffered entity (if there is one), then store this token on its own
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])
            current_entity_tokens = []
            current_entity = None
    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was some entity at all
    if current_entity is not None:
        collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
    # Sort the entries and drop exact duplicates
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))
    return collapsed_result
Update
This solves most cases, but there will always be outliers.
For example, the tags for the phrase "U.S. Securities and Exchange Commission" are ['U.S.', 'B-ORG'] ['Securities', 'I-ORG'] ['and', 'I-ORG'] ['Exchange', 'I-ORG'] ['Commission', 'I-ORG']
Running collapse on them turns the phrase into "U.S.Securities and Exchange Commission", because join_tokens only inserts a space between two alphabetic tokens and 'U.S.' ends with a period.
So the complete solution is to track which word each token came from, by building a lookup table (LUT) for the original sentence. Thus:
text = "U.S. Securities and Exchange Commission"
lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]
# lut = [("U",0), (".",0), ("S",0), (".",0), ("Securities",1), ("and",2), ("Exchange",3), ("Commission",4)]
Now, given a token's index, you know exactly which word it came from, so you can concatenate tokens that belong to the same word and add a space only when a token belongs to a different word. The NER result would then look something like:
[["U","B-ORG", 0], [".","I-ORG", 0], ["S", "I-ORG", 0], [".","I-ORG", 0], ['Securities', 'I-ORG', 1], ['and', 'I-ORG', 2], ['Exchange', 'I-ORG',3], ['Commission', 'I-ORG', 4]]
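The LUT idea above could be sketched roughly as follows. This is only an illustration under a few assumptions: `tokenize` here is a simple regex stand-in for whatever tokenizer the model really uses, `build_lut` and `collapse_with_lut` are hypothetical names that are not part of bert-ner, the tagger is assumed to return one tag per token in the same order, and tokens outside an entity are simply skipped.

```python
import re

def tokenize(word):
    # Hypothetical stand-in tokenizer: runs of word characters,
    # plus each punctuation character as its own token.
    return re.findall(r"\w+|[^\w\s]", word)

def build_lut(text):
    # Pair every token with the index of the whitespace-separated word it came from.
    return [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]

def collapse_with_lut(tagged):
    # tagged: list of [token, tag, word_index]; returns [[entity_text, label], ...]
    result = []
    entity_text, label, last_ix = "", None, None
    for token, tag, ix in tagged:
        if label is not None and tag == "I-" + label:
            # Same entity: add a space only when the source word changes.
            entity_text += token if ix == last_ix else " " + token
        else:
            # Any other tag closes the entity currently in the buffer.
            if label is not None:
                result.append([entity_text, label])
            if tag.startswith("B-"):
                entity_text, label = token, tag[2:]
            else:
                entity_text, label = "", None  # outside an entity, or a stray I- tag
        last_ix = ix
    if label is not None:
        result.append([entity_text, label])
    return result

text = "U.S. Securities and Exchange Commission"
lut = build_lut(text)
tags = ["B-ORG"] + ["I-ORG"] * (len(lut) - 1)  # assumed tagger output
tagged = [[token, tag, ix] for (token, ix), tag in zip(lut, tags)]
print(collapse_with_lut(tagged))
# [['U.S. Securities and Exchange Commission', 'ORG']]
```

Because spacing is decided by the word index rather than by whether a token is alphabetic, "U.S." keeps its periods attached and the space before "Securities" is preserved.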
from bert-ner.