Comments (17)
Thanks, looks great. I have merged your PR and will create some tests for this later today (and I'll close this issue after). I can quickly walk you through it maybe next week; the tests are all in the `tests` folder and run using `pytest` through GitHub workflows (see the `.github` folder).
from rel.
Okay, so I have an idea to "fix" this fairly easily, but I ran into another problem while exploring my solution. More on that later.
I thought of using the field `sent_idx` (https://github.com/informagi/REL/blob/master/REL/mention_detection.py#L174). Since every mention has that field, we could calculate the offset per mention based on the total sentence length up to that point. My solution looks something like this:
input = "Paris, Paris. Paris."
# then mentions look something like:
[{"mention": "Paris", ... "sent_idx": 0, "pos": 0}, {"mention": "Paris", ... "sent_idx": 0, "pos": 7}, {"mention": "Paris", ... "sent_idx": 1, "pos": 0}]
# so we can calculate an offset based on sentence lengths, something like:
sentence_lengths = {m["sent_idx"] + 1: len(m["sentence"]) for m in mentions}
# the "+ 1" is necessary because we do not want any offset with sent_idx = 0, so everything needs to shift over by 1 index
# e.g. for sent_idx = 1, we want offset = len(sent_idx 0)
cumulative_offsets = dict(zip(sentence_lengths.keys(), numpy.cumsum(list(sentence_lengths.values())).tolist()))
# after adding {0: 0}, this results in something like:
{0: 0, 1: 13, 2: 19}
So once you have these offsets, you can write something like the following for the for-loop from utils.py#L91:
if include_offset:
    len_until_now = cumulative_offsets[ment["sent_idx"]]
    offset = len_until_now + text[len_until_now:].find(sent)
    start_pos = offset + ment["pos"]
    ...
This also works properly for both examples; it is effectively the same as your proposition, @hideaki-j.
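For concreteness, here is the whole idea as a runnable sketch. The mention dicts are hand-made stand-ins for REL's real output, and `itertools.accumulate` stands in for `numpy.cumsum` so the snippet needs no third-party packages:

```python
from itertools import accumulate  # stands in for numpy.cumsum here

text = "Paris, Paris. Paris."
# Hand-made stand-ins for the mention dicts REL produces:
mentions = [
    {"mention": "Paris", "sentence": "Paris, Paris.", "sent_idx": 0, "pos": 0},
    {"mention": "Paris", "sentence": "Paris, Paris.", "sent_idx": 0, "pos": 7},
    {"mention": "Paris", "sentence": "Paris.", "sent_idx": 1, "pos": 0},
]

# Sentence lengths keyed by sent_idx + 1, so that sentence 0 contributes
# to the offset of sentence 1, and so on.
sentence_lengths = {m["sent_idx"] + 1: len(m["sentence"]) for m in mentions}
cumulative_offsets = {0: 0}
cumulative_offsets.update(
    dict(zip(sentence_lengths.keys(), accumulate(sentence_lengths.values())))
)
print(cumulative_offsets)  # {0: 0, 1: 13, 2: 19}

# Per-mention start positions in the full text:
start_positions = []
for ment in mentions:
    len_until_now = cumulative_offsets[ment["sent_idx"]]
    offset = len_until_now + text[len_until_now:].find(ment["sentence"])
    start_positions.append(offset + ment["pos"])
print(start_positions)  # [0, 7, 14]
```

These positions match a plain regex scan of the input, so the offsets line up for this example.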
Now for the new problem: because the text is split up and parsed into Flair `Sentence` objects, the extra whitespace between sentences is lost. This is partly resolved by the line `offset = len_until_now + text[len_until_now:].find(sent)`, where the cumulative length is summed with an additional index increase from `.find`. However, if the amount of whitespace between sentences is significant, this doesn't work. For example, the input:
example = "Paris. Paris. Paris."
#> outputs:
[[0, 5, "Paris"...], [20, 5, "Paris"...], [20, 5, "Paris"...]]
Which of course is not correct. If we want to solve this, we also need to keep track of the whitespace between sentences, which I think is actually not trivial, as sentence segmentation is handled by a different library (`segtok`) that does not seem to have a built-in way of keeping track of it.
Of course this new problem is a crafted example that you probably won't find in the wild, so we can choose to live with it. What do you think, @hideaki-j?
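To make the failure mode concrete, here is a small pure-Python demonstration. It is a simplified illustration of why repeated sentences plus lost whitespace are tricky, not REL's exact code path; the list of sentences is a hand-made stand-in for segtok's output:

```python
# Input with a lot of whitespace between the first two sentences.
text = "Paris.        Paris. Paris."
sentences = ["Paris.", "Paris.", "Paris."]  # stand-in for segtok output

# Naive approach: look each sentence up independently. Every "Paris." is
# found at index 0, so all three sentences get the same (wrong) offset.
naive = [text.find(s) for s in sentences]
print(naive)  # [0, 0, 0]

# Cursor-based approach: start each search after the end of the previous
# sentence. This also absorbs any amount of inter-sentence whitespace.
offsets, cursor = [], 0
for s in sentences:
    start = text.find(s, cursor)
    offsets.append(start)
    cursor = start + len(s)
print(offsets)  # [0, 14, 21]
```

The cursor-based variant at the bottom is essentially what a fix needs: searching from the end of the previous sentence keeps repeated sentences apart and compensates for whitespace lost during splitting.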
---
Hi guys,
As I can't push my changes to Hideaki's repo, here's my proposed change. The root cause is actually in `find_mentions`, so let's correct that with the following code for `find_mentions` and `process_results` respectively. Note that this way we don't have to create an additional dictionary, saving us an extra loop.
Let me know if I missed anything!
Detailed explanation
In `find_mentions` we now add an offset if we are using Flair. This is a three-step process:
- Compute the cumulative sentence length.
- Compute the offset based on the raw text and the cumulative sentence length. This is required because during splitting, some whitespace between sentences may be removed.
- Add the offset to the start position within the respective sentence.
def find_mentions(self, dataset, tagger=None):
    """
    Responsible for finding mentions given a set of documents in a batch-wise manner. More specifically,
    it returns the mention, its left/right context and a set of candidates.
    :return: Dictionary with mentions per document.
    """
    if tagger is None:
        raise Exception(
            "No NER tagger is set, but you are attempting to perform Mention Detection.."
        )
    # Verify if Flair, else ngram or custom.
    is_flair = isinstance(tagger, SequenceTagger)
    dataset_sentences_raw, processed_sentences, splits = self.split_text(
        dataset, is_flair
    )
    results = {}
    total_ment = 0
    if is_flair:
        tagger.predict(processed_sentences)
    for i, doc in enumerate(dataset_sentences_raw):
        contents = dataset_sentences_raw[doc]
        raw_text = dataset[doc][0]
        sentences_doc = [v[0] for v in contents.values()]
        sentences = processed_sentences[splits[i] : splits[i + 1]]
        result_doc = []
        cum_sent_length = 0
        offset = 0
        for (idx_sent, (sentence, ground_truth_sentence)), snt in zip(
            contents.items(), sentences
        ):
            for entity in (
                snt.get_spans("ner")
                if is_flair
                else tagger.predict(snt, processed_sentences)
            ):
                text, start_pos, end_pos, conf, tag = (
                    entity.text,
                    entity.start_pos,
                    entity.end_pos,
                    entity.score,
                    entity.tag,
                )
                total_ment += 1
                m = self.preprocess_mention(text)
                cands = self.get_candidates(m)
                if len(cands) == 0:
                    continue
                # Re-create ngram as 'text' is at times changed by Flair (e.g. double spaces are removed).
                ngram = sentence[start_pos:end_pos]
                left_ctxt, right_ctxt = self.get_ctxt(
                    start_pos, end_pos, idx_sent, sentence, sentences_doc
                )
                # Only include offset if using Flair.
                if is_flair:
                    offset = raw_text.find(sentence, cum_sent_length)
                res = {
                    "mention": m,
                    "context": (left_ctxt, right_ctxt),
                    "candidates": cands,
                    "gold": ["NONE"],
                    "pos": start_pos + offset,
                    "sent_idx": idx_sent,
                    "ngram": ngram,
                    "end_pos": end_pos + offset,
                    "sentence": sentence,
                    "conf_md": conf,
                    "tag": tag,
                }
                result_doc.append(res)
            cum_sent_length += len(sentence) + (offset - cum_sent_length)
        results[doc] = result_doc
    return results, total_ment
def process_results(
    mentions_dataset,
    predictions,
    processed,
    include_offset=False,
):
    """
    Function that can be used to process the End-to-End results.
    :return: dictionary with results and document as key.
    """
    res = {}
    for doc in mentions_dataset:
        if doc not in predictions:
            # No mentions found, we return empty list.
            continue
        pred_doc = predictions[doc]
        ment_doc = mentions_dataset[doc]
        text = processed[doc][0]
        res_doc = []
        for pred, ment in zip(pred_doc, ment_doc):
            sent = ment["sentence"]
            idx = ment["sent_idx"]
            start_pos = ment["pos"]
            mention_length = int(ment["end_pos"] - ment["pos"])
            if pred["prediction"] != "NIL":
                temp = (
                    start_pos,
                    mention_length,
                    ment["ngram"],
                    pred["prediction"],
                    pred["conf_ed"],
                    ment["conf_md"] if "conf_md" in ment else 0.0,
                    ment["tag"] if "tag" in ment else "NULL",
                )
                res_doc.append(temp)
        res[doc] = res_doc
    return res
---
You're right @hideaki-j, the version on Gem accumulates the offsets in a different way, so it doesn't happen there. I played around a bit more this morning; I think I have it fixed completely now with some slight tweaks:
# in utils.py - process_results
sents = {m["sent_idx"]: m["sentence"] for m in ment_doc}
cum_offsets = {0: 0}
# e.g. with 4 sentences, loop through 1, 2, 3 and skip 0
for idx in range(1, max(sents.keys()) + 1):
    # current sentence text
    cur_sent = sents[idx]
    # length of previous sentence
    prev_sent_len = len(sents[idx - 1])
    # number of chars up until now: previous sentence offset + length of previous sentence
    until_now = cum_offsets[idx - 1] + prev_sent_len
    # total offset is until_now + whitespace compensation through .find(cur_sent)
    cum_offsets[idx] = until_now + text[until_now:].find(cur_sent)
...
if include_offset:
    offset = cum_offsets[ment["sent_idx"]]
    start_pos = offset + ment["pos"]
    ...
In our test examples, this works as intended. Could you verify, @hideaki-j? If it works for you as well, I will open a separate PR for this.
Ideally I'd like to solve this a bit more neatly; it feels a bit hacky like this... But it will do for now, I'll put a note on the project board :-)
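Here is the snippet as a self-contained run on a heavy-whitespace input. The mention dicts are hand-made stand-ins, and note that the loop assumes every `sent_idx` up to the maximum appears in `sents`:

```python
# Hypothetical input and mention dicts standing in for REL's ment_doc.
text = "Paris.        Paris. Paris."
ment_doc = [
    {"sent_idx": 0, "sentence": "Paris.", "pos": 0},
    {"sent_idx": 1, "sentence": "Paris.", "pos": 0},
    {"sent_idx": 2, "sentence": "Paris.", "pos": 0},
]

sents = {m["sent_idx"]: m["sentence"] for m in ment_doc}
cum_offsets = {0: 0}
for idx in range(1, max(sents.keys()) + 1):
    cur_sent = sents[idx]                      # current sentence text
    prev_sent_len = len(sents[idx - 1])        # length of previous sentence
    # chars up until now: previous offset + previous sentence length
    until_now = cum_offsets[idx - 1] + prev_sent_len
    # whitespace compensation through .find(cur_sent)
    cum_offsets[idx] = until_now + text[until_now:].find(cur_sent)

starts = [cum_offsets[m["sent_idx"]] + m["pos"] for m in ment_doc]
print(starts)  # [0, 14, 21]
```

The `.find` step absorbs the extra whitespace, so the three sentences land at 0, 14 and 21 despite the long gap after the first one.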
---
Not all sentences have a mention.
You're right, I missed that... In that case I agree with your idea of creating the offset dictionary earlier in the pipeline (for example in `find_mentions`). Let me know how you get on! And thanks for the feedback.
---
Nice Mick, thanks :) One thing that needs to change is the line `cum_sent_length += len(sentence)`, since it doesn't deal properly with e.g. the "much whitespace" example ("Paris. Paris. Paris."). Easily fixed however by changing it to `cum_sent_length += len(sentence) + (offset - cum_sent_length)`.
Also, the offset calculation should be outside of the `for entity in ...` loop. In its current place, you will get an `UnboundLocalError` by trying to reference the `offset` variable for a sentence that didn't have any mentions (e.g. "Hello. Hello.").
---
Updated code! Better safe than sorry, although the 'many whitespaces' example is hopefully something that'll never occur xD.
---
Thank you all!
One minor change: in the current code, `(offset - cum_sent_length)` can become negative when a sentence does not contain any entity. This is easily fixed by moving
if is_flair:
    offset = raw_text.find(sentence, cum_sent_length)
from inside the entity loop (`for entity in ...`) up to the sentence loop (`for (idx_sent, (sentence, ground_truth_sentence)), snt in zip(...)`).
The fixed code looks like this:
def find_mentions(self, dataset, tagger=None):
    """
    Responsible for finding mentions given a set of documents in a batch-wise manner. More specifically,
    it returns the mention, its left/right context and a set of candidates.
    :return: Dictionary with mentions per document.
    """
    if tagger is None:
        raise Exception(
            "No NER tagger is set, but you are attempting to perform Mention Detection.."
        )
    # Verify if Flair, else ngram or custom.
    is_flair = isinstance(tagger, SequenceTagger)
    dataset_sentences_raw, processed_sentences, splits = self.split_text(
        dataset, is_flair
    )
    results = {}
    total_ment = 0
    if is_flair:
        tagger.predict(processed_sentences)
    for i, doc in enumerate(dataset_sentences_raw):
        contents = dataset_sentences_raw[doc]
        raw_text = dataset[doc][0]
        print('contents', contents)
        print('raw_text', raw_text)
        sentences_doc = [v[0] for v in contents.values()]
        sentences = processed_sentences[splits[i] : splits[i + 1]]
        result_doc = []
        cum_sent_length = 0
        offset = 0
        for (idx_sent, (sentence, ground_truth_sentence)), snt in zip(
            contents.items(), sentences
        ):
            # Only include offset if using Flair.
            if is_flair:
                offset = raw_text.find(sentence, cum_sent_length)
            for entity in (
                snt.get_spans("ner")
                if is_flair
                else tagger.predict(snt, processed_sentences)
            ):
                text, start_pos, end_pos, conf, tag = (
                    entity.text,
                    entity.start_pos,
                    entity.end_pos,
                    entity.score,
                    entity.tag,
                )
                total_ment += 1
                m = self.preprocess_mention(text)
                cands = self.get_candidates(m)
                if len(cands) == 0:
                    continue
                # Re-create ngram as 'text' is at times changed by Flair (e.g. double spaces are removed).
                ngram = sentence[start_pos:end_pos]
                left_ctxt, right_ctxt = self.get_ctxt(
                    start_pos, end_pos, idx_sent, sentence, sentences_doc
                )
                res = {
                    "mention": m,
                    "context": (left_ctxt, right_ctxt),
                    "candidates": cands,
                    "gold": ["NONE"],
                    "pos": start_pos + offset,
                    "sent_idx": idx_sent,
                    "ngram": ngram,
                    "end_pos": end_pos + offset,
                    "sentence": sentence,
                    "conf_md": conf,
                    "tag": tag,
                }
                result_doc.append(res)
            cum_sent_length += len(sentence) + (offset - cum_sent_length)
        results[doc] = result_doc
    return results, total_ment
I tested it on (maybe) all of the examples mentioned in our discussion and confirmed that it works.
All test cases are shown below.
# Case 1
text = "Hi. I need to book a vacation to Long Beach between August 25 and September 3. Departure is from Paris\tWould you like to depart Paris on September 6th?\tPreferably by the 3rd. Is there anything available?\tSorry, looks like there is nothing available from Paris to Long Beach on September 3rd.\tI'm not quite sure I understand, is there anything available leaving Long Beach to go to Paris between August 25 and September 3rd?\tWould you like to depart Paris on September 6th?\tNo. I would like to leave Long Beach around the 25th of August to go to Paris for some reason. What is so confusing about that!?\tYou can leave Long Beach, USA and go to Paris, France on Tuesday, August 30th. Will I book this?\tFinally! No, don't book yet, I would like to know more about the hotel. Is there free breakfast?\tThere is free wifi.\tLook buddy, is there free breakfast or not? Tell me, am I gonna get eggs, toast, cereal, etc? You know? The good stuff?\tThere is free wifi at the hotel. \tWhat is the price of this toastless package?\tThis package costs 2191.55USD.\tIs this the cheapest option?\tYes, this is the cheapest option from Long Beach to Paris.\tAnd the hotel has how many stars?\tMuse Hotel has 2.0 stars.\tOk I will book this one.\tGreat. You will leave Paris on September 7"
# Case 2
text = "Paris, Paris. Paris."
# Case 3
text = "Paris. Paris. Paris."
# Case 4
text = 'Pars. Paris. Paris. Paris.'
# Case 5
text = 'Paris. Paris. JParis. Paris. ' # JParis is the entity but this is not in Wikipedia, so REL cannot detect it.
If you do not have any more ideas for this correction, I will create a pull request.
---
text = "Hello. Hello. Hello"
This case works in my environment.
Regarding the unit test, I agree with you.
(I am still figuring out how I can create a unit test in GitHub, though.)
---
I can add some tests later, good idea.
---
Closed through #50
---
Issue
Currently, REL works for example 2 but not for example 1.
Examples
- The explanation below is based on example 1.
- Currently, the second "Paris" comes out at position 199 (see the first code block), though the right answer is 128 (see the second code block).
# Input text
dialogue = "Hi. I need to book a vacation to Long Beach between August 25 and September 3. Departure is from Paris\tWould you like to depart Paris on September 6th?\tPreferably by the 3rd. Is there anything available?\tSorry, looks like there is nothing available from Paris to Long Beach on September 3rd.\tI'm not quite sure I understand, is there anything available leaving Long Beach to go to Paris between August 25 and September 3rd?\tWould you like to depart Paris on September 6th?\tNo. I would like to leave Long Beach around the 25th of August to go to Paris for some reason. What is so confusing about that!?\tYou can leave Long Beach, USA and go to Paris, France on Tuesday, August 30th. Will I book this?\tFinally! No, don't book yet, I would like to know more about the hotel. Is there free breakfast?\tThere is free wifi.\tLook buddy, is there free breakfast or not? Tell me, am I gonna get eggs, toast, cereal, etc? You know? The good stuff?\tThere is free wifi at the hotel. \tWhat is the price of this toastless package?\tThis package costs 2191.55USD.\tIs this the cheapest option?\tYes, this is the cheapest option from Long Beach to Paris.\tAnd the hotel has how many stars?\tMuse Hotel has 2.0 stars.\tOk I will book this one.\tGreat. You will leave Paris on September 7"
# Entity Linking
el(dialogue)
######## Output ########
# [[33, 10, 'Long Beach', 'Long_Beach,_California', ... ],
# [97, 5, 'Paris', 'Paris', 0.8168372931596612, 0.9900407195091248, 'LOC'],
# ---> [199, 5, 'Paris', 'Paris', 0.8369923447610185, 0.9948850274085999, 'LOC'],
# [271, 5, 'Paris', 'Paris', 0.7787875371965208, 0.9929811954498291, 'LOC'],
# ...
# Find all positions of "Paris"
print([m.span() for m in re.finditer('Paris', dialogue)])
######## Output ########
# [(97, 102),
# ---> (128, 133),
# (254, 259), ...
Cause
Summary
- See code block [4]. In some cases, including example 1, mentions appear several times in one sentence (e.g., "Paris, Paris."). In that case, we want `text[offset:].find(sent)` to be 0 from the second mention in the same sentence onward. But that does not work as intended currently, because of `offset += len(sent)`. (For more details, please see the Details section below.)
- However, if we delete `offset += len(sent)`, the code does not work for example 2 (i.e., "Paris. Paris. Paris. Paris.").
Details
Let's think about the below example.
dialogue = "Paris, Paris. Paris."
There are three mentions of "Paris" and two sentences, "Paris, Paris." and "Paris.", here.
The current situation and the expected answer are presented below; the positions of the second and third mentions are wrong.
Current situation
dialogue = 'Paris, Paris. Paris.'
el(dialogue) # Entity Linking
######## Output ########
# [[0, 5, 'Paris', 'Paris', 0.9057319011744249, 0.9962735176086426, 'LOC'],
# [19, 5, 'Paris', 'Paris', 0.9029048880752896, 0.9977388381958008, 'LOC'],
# [24, 5, 'Paris', 'Paris', 0.9029048880752896, 0.9808972477912903, 'LOC']]
Expected answer
mention = 'Paris'
print([m.span() for m in re.finditer(mention, dialogue)])
######## Output ########
# [(0, 5), (7, 12), (14, 19)]
Explanation
Let's think about the for loop in [4].
First Loop
- `ment`: "Paris"
- `sent`: "Paris, Paris."
`offset` becomes 0 after L97 is processed. Then `offset` becomes `len(sent)`, here 13.
Second Loop
- `ment`: "Paris"
- `sent`: "Paris, Paris."
Note that `ment` "Paris" indicates the second "Paris" in `sent`, not the first one.
However, `offset` is already 13, which means the `find` in `text[offset:].find(sent)` searches from the second sentence onward.
Of course you cannot find "Paris, Paris." in the second sentence, so `text[offset:].find(sent)` becomes -1 and `offset` becomes 12 (13 + (-1); please see L97).
The start position is then calculated in L100:
start_pos = offset + ment["pos"]
where `offset` is 12 and `ment["pos"]` is 7 (*1), so `start_pos` becomes 19. This is exactly the wrong value shown in the Current situation section above.
(*1) `ment["pos"]` indicates the mention's start position within its sentence. Here, the second "Paris" starts at position 7 of the sentence "Paris, Paris.".
How to fix
I cannot come up with a good idea.
One "easy" way to solve this situation is to use a separate function to calculate the start position, like:
def start_pos_itr(ment_doc, text):
    """
    Responsible for obtaining the start position for each mention.
    :return:
        start_pos: start position
    :args:
        ment_doc: start position, end position, sentence for each mention.
            Obtained from "mentions_dataset[doc]".
            E.g., [{'pos': 0, 'end_pos': 5, 'sentence': 'Paris xxx yyy zzz...'},...
        text: Input text for REL (text_doc).
            Obtained from "processed[doc][0]".
            Note that text consists of one or several sentence(s).
    """
    offset, prev_pos = 0, -1
    for ment in ment_doc:  # For each mention
        sent, pos = ment["sentence"], ment["pos"]
        if pos <= prev_pos:  # "pos <= prev_pos" means the sentence has just moved on to the next one.
            offset += len(prev_sent)
        offset = text[offset:].find(sent) + offset
        start_pos = offset + pos
        prev_pos, prev_sent = pos, sent  # Save the previous sentence and position
        yield start_pos
The key part is:
if pos <= prev_pos:  # "pos <= prev_pos" means the sentence has just moved on to the next one.
    offset += len(prev_sent)
The basic idea is the same as using `offset += len(sent)`; the new part is the if statement above. The idea is as follows:
- If there are several mentions in the same sentence, the current position becomes bigger than the previous position on each iteration.
- If the current position is smaller than or equal to the previous position, that means a new sentence has started, so we can do `offset += len(prev_sent)`.
In my own environment, this function works for both examples 1 and 2.
However, this fix makes the code more complicated, so I am wondering if there is an easier way to solve the current issue.
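For concreteness, here is `start_pos_itr` applied to the "Paris, Paris. Paris." example (function body copied from above; the mention dicts are hand-made stand-ins):

```python
def start_pos_itr(ment_doc, text):
    """Yield the absolute start position for each mention (copied from above)."""
    offset, prev_pos = 0, -1
    for ment in ment_doc:
        sent, pos = ment["sentence"], ment["pos"]
        if pos <= prev_pos:  # sentence boundary crossed
            offset += len(prev_sent)
        offset = text[offset:].find(sent) + offset
        prev_pos, prev_sent = pos, sent
        yield offset + pos

text = "Paris, Paris. Paris."
ment_doc = [
    {"sentence": "Paris, Paris.", "pos": 0},
    {"sentence": "Paris, Paris.", "pos": 7},
    {"sentence": "Paris.", "pos": 0},
]
print(list(start_pos_itr(ment_doc, text)))  # [0, 7, 14]
```

This yields the expected [0, 7, 14] rather than the buggy [0, 19, 24].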
References
[4] https://github.com/informagi/REL/blob/master/REL/utils.py#L91
---
Thank you for your detailed explanation of your idea! It is a very clear and clever solution! I agree with your correction @KDercksen .
Of course this new problem is a crafted example that you probably won't find in the wild,
I ran into this problem in my local environment, but when I used the REL API, I could not reproduce the same error. I tried your example, but I could not find any problem there either. It might be that some other function(s) in REL "correct" the problem you mentioned, but I have not identified them yet.
---
Thank you for your proposal! I tested your idea. It worked for example 2 (i.e., 'Paris. Paris. Paris. Paris.') and example 3 (the whitespace example), but it did not work for example 1 in my environment.
Details
The error message is as follows.
KeyError Traceback (most recent call last)
<ipython-input-67-a3f30ce27db0> in <module>
----> 1 process_results(mentions_dataset, processed)
<ipython-input-65-1faef6f7f268> in process_results(mentions_dataset, processed, include_offset)
18 cur_sent = sents[idx]
19 # length of previous sentence
---> 20 prev_sent_len = len(sents[idx-1])
21 # number of chars up until now: previous sentence offset + length of previous sentence
22 until_now = cum_offsets[idx-1] + prev_sent_len
KeyError: 0
where `idx - 1` is 0.
Cause
Not all sentences have a mention.
In example 1, only `sent_idx` 1, 2, 5, ... have a mention (that means `sent_idx` 0 does not contain any mention).
# Input text is the same as example 1.
for i in range(len(mentions_dataset['API_DOC'])):
    print(mentions_dataset['API_DOC'][i]['sent_idx'], end=', ')
######## Output ########
# 1, 2, 2, 5, 5, 6, 6, 7, 8, 8, 10, 10, 10, 10, 24, 24, 29,
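The KeyError can be reproduced in isolation (the mention dict is a hand-made stand-in for REL's output):

```python
# Sentence 0 carries no mention, so the sents dict built from mentions
# only has no key 0 -- looking up sents[idx - 1] then fails.
ment_doc = [{"sent_idx": 1, "sentence": "Paris."}]
sents = {m["sent_idx"]: m["sentence"] for m in ment_doc}

missing = None
try:
    for idx in range(1, max(sents.keys()) + 1):
        prev_sent_len = len(sents[idx - 1])  # KeyError when idx - 1 == 0
except KeyError as err:
    missing = err.args[0]
print("KeyError:", missing)  # KeyError: 0
```

Any offset dictionary therefore has to be built from the full sentence list, not just from the sentences that happen to carry mentions.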
How to fix...?
- I like your idea because it is very simple and intuitive, so I would like to use the `sent_idx` field if possible.
- We also have the `processed` data as an argument to `process_results`, and this contains the whole input text. If we used `find()`, we could identify the "offset". ...However, we do not want to use `find()` anymore here, right?
- My idea is to create `sents` in an earlier step (e.g., in the for loop of the `find_mentions` function) and pass it to `process_results` as an argument. If you agree with my idea, I will fix the program based on it and test it.
What do you think @KDercksen ?
---
Yes, let's finalize this! Looks good to me
---
Looks good, one final test case would be one without any entities:
text = "Hello. Hello. Hello"
Perhaps we should create a unit test based on this? Seems like something that could easily be broken given some incorrect MD module.
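A pytest-style sketch of such a test, written against a small pure helper rather than the full pipeline (the helper name and the mention dicts are hypothetical; a real test would run the MD module end-to-end as in the tests folder):

```python
import re

def absolute_start_positions(text, ment_doc):
    # Hypothetical helper: cursor-based offsets as discussed in this thread.
    positions, cursor, prev_idx, sent_start = [], 0, None, 0
    for ment in ment_doc:
        if ment["sent_idx"] != prev_idx:  # advance cursor at sentence change
            sent_start = text.find(ment["sentence"], cursor)
            cursor = sent_start + len(ment["sentence"])
            prev_idx = ment["sent_idx"]
        positions.append(sent_start + ment["pos"])
    return positions

def test_no_entities():
    # "Hello. Hello. Hello" yields no mentions; must not raise.
    assert absolute_start_positions("Hello. Hello. Hello", []) == []

def test_repeated_mentions():
    text = "Paris, Paris. Paris."
    ment_doc = [
        {"sent_idx": 0, "sentence": "Paris, Paris.", "pos": 0},
        {"sent_idx": 0, "sentence": "Paris, Paris.", "pos": 7},
        {"sent_idx": 1, "sentence": "Paris.", "pos": 0},
    ]
    expected = [m.start() for m in re.finditer("Paris", text)]
    assert absolute_start_positions(text, ment_doc) == expected
```

Comparing against `re.finditer` keeps the expected values honest without hard-coding them.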
---
Regarding creating the GitHub unit test (?), it seems that I need more time to learn how this works.
(I am still figuring out where I can find the current test implementation in REL. Is it implemented in test_ed_pipeline.py?)
I have written a draft of the pull request for this issue. Should I submit it right now, or only after implementing the (GitHub?) unit test?
Thank you @KDercksen. I will submit my pull request. If you think this is not a good decision (i.e., that it is better to submit after your implementation of the unit test), please reject my pull request.
---