Comments (17)
Thanks, looks great. I have merged your PR and will create some tests for this later today (and I'll close this issue after). I can quickly walk you through it maybe next week; the tests are all in the `tests` folder and run using `pytest` through GitHub workflows (see the `.github` folder).
from rel.
Okay, so I have an idea to "fix" this fairly easily, but I ran into another problem while exploring my solution. More on that later.
I thought of using the field `sent_idx` (https://github.com/informagi/REL/blob/master/REL/mention_detection.py#L174). Since every mention has that field, we could calculate the offset per mention based on the total sentence length up to that point. My solution looks something like this:
input = "Paris, Paris. Paris."
# then mentions look something like:
[{"mention": "Paris", ... "sent_idx": 0, "pos": 0}, {"mention": "Paris", ... "sent_idx": 0, "pos": 7}, {"mention": "Paris", ... "sent_idx": 1, "pos": 0}]
# so we can calculate an offset based on sentence lengths, something like:
sentence_lengths = {m["sent_idx"] + 1: len(m["sentence"]) for m in mentions}
# the "+ 1" is necessary because we do not want any offset with sent_idx = 0, so everything needs to shift over by 1 index
# e.g. for sent_idx = 1, we want offset = len(sent_idx 0)
cumulative_offsets = dict(zip(sentence_lengths.keys(), numpy.cumsum(list(sentence_lengths.values())).tolist()))
# after adding {0: 0}, this results in something like:
{0: 0, 1: 13, 2: 19}
So once you have these offsets, you can write something like the following for the for-loop from utils.py#L91:
if include_offset:
    len_until_now = cumulative_offsets[ment["sent_idx"]]
    offset = len_until_now + text[len_until_now:].find(sent)
    start_pos = offset + ment["pos"]
    ...
This also works properly for both examples; it is effectively the same as your proposition, @hideaki-j.
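For concreteness, here is the whole idea as a runnable sketch. The mention dicts are hand-made stand-ins for REL's real output, and `itertools.accumulate` stands in for `numpy.cumsum` so the snippet needs no third-party packages:

```python
from itertools import accumulate  # stands in for numpy.cumsum here

text = "Paris, Paris. Paris."
# Hand-made stand-ins for the mention dicts REL produces:
mentions = [
    {"mention": "Paris", "sentence": "Paris, Paris.", "sent_idx": 0, "pos": 0},
    {"mention": "Paris", "sentence": "Paris, Paris.", "sent_idx": 0, "pos": 7},
    {"mention": "Paris", "sentence": "Paris.", "sent_idx": 1, "pos": 0},
]

# Sentence lengths keyed by sent_idx + 1, so that sentence 0 contributes
# to the offset of sentence 1, and so on.
sentence_lengths = {m["sent_idx"] + 1: len(m["sentence"]) for m in mentions}
cumulative_offsets = {0: 0}
cumulative_offsets.update(
    dict(zip(sentence_lengths.keys(), accumulate(sentence_lengths.values())))
)
print(cumulative_offsets)  # {0: 0, 1: 13, 2: 19}

# Per-mention start positions in the full text:
start_positions = []
for ment in mentions:
    len_until_now = cumulative_offsets[ment["sent_idx"]]
    offset = len_until_now + text[len_until_now:].find(ment["sentence"])
    start_positions.append(offset + ment["pos"])
print(start_positions)  # [0, 7, 14]
```

These positions match a plain regex scan of the input, so the offsets line up for this example.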
Now for the new problem: because the text is split up and parsed into Flair `Sentence` objects, the extra whitespace between sentences is lost. This is partly resolved by the line `offset = len_until_now + text[len_until_now:].find(sent)`, where the cumulative length is summed with an additional index increase from `.find`. However, if the amount of whitespace between sentences is significant, this doesn't work. For example, the input:
example = "Paris. Paris. Paris."
#> outputs:
[[0, 5, "Paris"...], [20, 5, "Paris"...], [20, 5, "Paris"...]]
Which of course is not correct. If we want to solve this, we also need to keep track of the whitespace between sentences, which I think is actually not trivial, as sentence segmentation is handled by a different library (`segtok`) that does not seem to have a built-in way of keeping track of it.
Of course this new problem is a crafted example that you probably won't find in the wild, so we can choose to live with it. What do you think, @hideaki-j?
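To make the failure mode concrete, here is a small pure-Python demonstration. It is a simplified illustration of why repeated sentences plus lost whitespace are tricky, not REL's exact code path; the list of sentences is a hand-made stand-in for segtok's output:

```python
# Input with a lot of whitespace between the first two sentences.
text = "Paris.        Paris. Paris."
sentences = ["Paris.", "Paris.", "Paris."]  # stand-in for segtok output

# Naive approach: look each sentence up independently. Every "Paris." is
# found at index 0, so all three sentences get the same (wrong) offset.
naive = [text.find(s) for s in sentences]
print(naive)  # [0, 0, 0]

# Cursor-based approach: start each search after the end of the previous
# sentence. This also absorbs any amount of inter-sentence whitespace.
offsets, cursor = [], 0
for s in sentences:
    start = text.find(s, cursor)
    offsets.append(start)
    cursor = start + len(s)
print(offsets)  # [0, 14, 21]
```

The cursor-based variant at the bottom is essentially what a fix needs: searching from the end of the previous sentence keeps repeated sentences apart and compensates for whitespace lost during splitting.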
---
Hi guys,
As I can't push my changes to Hideaki's repo, here's my proposed change. The root cause is actually in `find_mentions`, so let's correct that with the following code for `find_mentions` and `process_results` respectively. Note that this way we don't have to create an additional dictionary, saving us an extra loop.
Let me know if I missed anything!
Detailed explanation
In `find_mentions` we now add an offset if we are using Flair. This is a three-step process:
- Compute the cumulative sentence length.
- Compute the offset based on the raw text and the cumulative sentence length. This is required because during splitting, some whitespace between sentences may be removed.
- Add the offset to the start position within the respective sentence.
def find_mentions(self, dataset, tagger=None):
    """
    Responsible for finding mentions given a set of documents in a batch-wise manner. More specifically,
    it returns the mention, its left/right context and a set of candidates.
    :return: Dictionary with mentions per document.
    """
    if tagger is None:
        raise Exception(
            "No NER tagger is set, but you are attempting to perform Mention Detection.."
        )
    # Verify if Flair, else ngram or custom.
    is_flair = isinstance(tagger, SequenceTagger)
    dataset_sentences_raw, processed_sentences, splits = self.split_text(
        dataset, is_flair
    )
    results = {}
    total_ment = 0
    if is_flair:
        tagger.predict(processed_sentences)
    for i, doc in enumerate(dataset_sentences_raw):
        contents = dataset_sentences_raw[doc]
        raw_text = dataset[doc][0]
        sentences_doc = [v[0] for v in contents.values()]
        sentences = processed_sentences[splits[i] : splits[i + 1]]
        result_doc = []
        cum_sent_length = 0
        offset = 0
        for (idx_sent, (sentence, ground_truth_sentence)), snt in zip(
            contents.items(), sentences
        ):
            for entity in (
                snt.get_spans("ner")
                if is_flair
                else tagger.predict(snt, processed_sentences)
            ):
                text, start_pos, end_pos, conf, tag = (
                    entity.text,
                    entity.start_pos,
                    entity.end_pos,
                    entity.score,
                    entity.tag,
                )
                total_ment += 1
                m = self.preprocess_mention(text)
                cands = self.get_candidates(m)
                if len(cands) == 0:
                    continue
                # Re-create ngram as 'text' is at times changed by Flair (e.g. double spaces are removed).
                ngram = sentence[start_pos:end_pos]
                left_ctxt, right_ctxt = self.get_ctxt(
                    start_pos, end_pos, idx_sent, sentence, sentences_doc
                )
                # Only include offset if using Flair.
                if is_flair:
                    offset = raw_text.find(sentence, cum_sent_length)
                res = {
                    "mention": m,
                    "context": (left_ctxt, right_ctxt),
                    "candidates": cands,
                    "gold": ["NONE"],
                    "pos": start_pos + offset,
                    "sent_idx": idx_sent,
                    "ngram": ngram,
                    "end_pos": end_pos + offset,
                    "sentence": sentence,
                    "conf_md": conf,
                    "tag": tag,
                }
                result_doc.append(res)
            cum_sent_length += len(sentence) + (offset - cum_sent_length)
        results[doc] = result_doc
    return results, total_ment
def process_results(
    mentions_dataset,
    predictions,
    processed,
    include_offset=False,
):
    """
    Function that can be used to process the End-to-End results.
    :return: dictionary with results and document as key.
    """
    res = {}
    for doc in mentions_dataset:
        if doc not in predictions:
            # No mentions found, we return empty list.
            continue
        pred_doc = predictions[doc]
        ment_doc = mentions_dataset[doc]
        text = processed[doc][0]
        res_doc = []
        for pred, ment in zip(pred_doc, ment_doc):
            sent = ment["sentence"]
            idx = ment["sent_idx"]
            start_pos = ment["pos"]
            mention_length = int(ment["end_pos"] - ment["pos"])
            if pred["prediction"] != "NIL":
                temp = (
                    start_pos,
                    mention_length,
                    ment["ngram"],
                    pred["prediction"],
                    pred["conf_ed"],
                    ment["conf_md"] if "conf_md" in ment else 0.0,
                    ment["tag"] if "tag" in ment else "NULL",
                )
                res_doc.append(temp)
        res[doc] = res_doc
    return res
---
You're right @hideaki-j, the version on Gem accumulates the offsets in a different way, so it doesn't happen there. I played around a bit more this morning; I think I have it fixed completely now with some slight tweaks:
# in utils.py - process_results
sents = {m["sent_idx"]: m["sentence"] for m in ment_doc}
cum_offsets = {0: 0}
# e.g. with 4 sentences, loop through 1, 2, 3 and skip 0
for idx in range(1, max(sents.keys()) + 1):
    # current sentence text
    cur_sent = sents[idx]
    # length of previous sentence
    prev_sent_len = len(sents[idx - 1])
    # number of chars up until now: previous sentence offset + length of previous sentence
    until_now = cum_offsets[idx - 1] + prev_sent_len
    # total offset is until_now + whitespace compensation through .find(cur_sent)
    cum_offsets[idx] = until_now + text[until_now:].find(cur_sent)
...
if include_offset:
    offset = cum_offsets[ment["sent_idx"]]
    start_pos = offset + ment["pos"]
    ...
In our test examples, this works as intended. Could you verify, @hideaki-j? If it works for you as well, I will open a separate PR for this.
Ideally I'd like to solve this a bit more neatly; it feels a bit hacky like this... But it will do for now, I'll put a note on the project board :-)
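Here is the snippet as a self-contained run on a heavy-whitespace input. The mention dicts are hand-made stand-ins, and note that the loop assumes every `sent_idx` up to the maximum appears in `sents`:

```python
# Hypothetical input and mention dicts standing in for REL's ment_doc.
text = "Paris.        Paris. Paris."
ment_doc = [
    {"sent_idx": 0, "sentence": "Paris.", "pos": 0},
    {"sent_idx": 1, "sentence": "Paris.", "pos": 0},
    {"sent_idx": 2, "sentence": "Paris.", "pos": 0},
]

sents = {m["sent_idx"]: m["sentence"] for m in ment_doc}
cum_offsets = {0: 0}
for idx in range(1, max(sents.keys()) + 1):
    cur_sent = sents[idx]                      # current sentence text
    prev_sent_len = len(sents[idx - 1])        # length of previous sentence
    # chars up until now: previous offset + previous sentence length
    until_now = cum_offsets[idx - 1] + prev_sent_len
    # whitespace compensation through .find(cur_sent)
    cum_offsets[idx] = until_now + text[until_now:].find(cur_sent)

starts = [cum_offsets[m["sent_idx"]] + m["pos"] for m in ment_doc]
print(starts)  # [0, 14, 21]
```

The `.find` step absorbs the extra whitespace, so the three sentences land at 0, 14 and 21 despite the long gap after the first one.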
---
Not all sentences have a mention.
You're right, I missed that... In that case I agree with your idea of creating the offset dictionary earlier in the pipeline (for example in `find_mentions`). Let me know how you get on! And thanks for the feedback.
---
Nice Mick, thanks :) One thing that needs to change is the line `cum_sent_length += len(sentence)`, since it doesn't deal properly with e.g. the "much whitespace" example ("Paris. Paris. Paris."). Easily fixed however by changing it to `cum_sent_length += len(sentence) + (offset - cum_sent_length)`.
Also, the offset calculation should be outside of the `for entity in ...` loop. In its current place, you will get an `UnboundLocalError` by trying to reference the `offset` variable for a sentence that didn't have any mentions (e.g. "Hello. Hello.").
---
Updated code! Better safe than sorry, although the 'many whitespaces' example is hopefully something that'll never occur xD.
---
Thank you all!
One minor change: in the current code, `(offset - cum_sent_length)` can become negative when a sentence does not contain any entity. This is easily fixed by moving
if is_flair:
    offset = raw_text.find(sentence, cum_sent_length)
from inside the entity loop (`for entity in ...`) up to the sentence loop (`for (idx_sent, (sentence, ground_truth_sentence)), snt in zip(...)`).
The fixed code looks like this:
def find_mentions(self, dataset, tagger=None):
    """
    Responsible for finding mentions given a set of documents in a batch-wise manner. More specifically,
    it returns the mention, its left/right context and a set of candidates.
    :return: Dictionary with mentions per document.
    """
    if tagger is None:
        raise Exception(
            "No NER tagger is set, but you are attempting to perform Mention Detection.."
        )
    # Verify if Flair, else ngram or custom.
    is_flair = isinstance(tagger, SequenceTagger)
    dataset_sentences_raw, processed_sentences, splits = self.split_text(
        dataset, is_flair
    )
    results = {}
    total_ment = 0
    if is_flair:
        tagger.predict(processed_sentences)
    for i, doc in enumerate(dataset_sentences_raw):
        contents = dataset_sentences_raw[doc]
        raw_text = dataset[doc][0]
        print('contents', contents)
        print('raw_text', raw_text)
        sentences_doc = [v[0] for v in contents.values()]
        sentences = processed_sentences[splits[i] : splits[i + 1]]
        result_doc = []
        cum_sent_length = 0
        offset = 0
        for (idx_sent, (sentence, ground_truth_sentence)), snt in zip(
            contents.items(), sentences
        ):
            # Only include offset if using Flair.
            if is_flair:
                offset = raw_text.find(sentence, cum_sent_length)
            for entity in (
                snt.get_spans("ner")
                if is_flair
                else tagger.predict(snt, processed_sentences)
            ):
                text, start_pos, end_pos, conf, tag = (
                    entity.text,
                    entity.start_pos,
                    entity.end_pos,
                    entity.score,
                    entity.tag,
                )
                total_ment += 1
                m = self.preprocess_mention(text)
                cands = self.get_candidates(m)
                if len(cands) == 0:
                    continue
                # Re-create ngram as 'text' is at times changed by Flair (e.g. double spaces are removed).
                ngram = sentence[start_pos:end_pos]
                left_ctxt, right_ctxt = self.get_ctxt(
                    start_pos, end_pos, idx_sent, sentence, sentences_doc
                )
                res = {
                    "mention": m,
                    "context": (left_ctxt, right_ctxt),
                    "candidates": cands,
                    "gold": ["NONE"],
                    "pos": start_pos + offset,
                    "sent_idx": idx_sent,
                    "ngram": ngram,
                    "end_pos": end_pos + offset,
                    "sentence": sentence,
                    "conf_md": conf,
                    "tag": tag,
                }
                result_doc.append(res)
            cum_sent_length += len(sentence) + (offset - cum_sent_length)
        results[doc] = result_doc
    return results, total_ment
I tested it on (maybe) all of the examples mentioned in our discussion and confirmed that it works.
All test cases are shown below.
# Case 1
text = "Hi. I need to book a vacation to Long Beach between August 25 and September 3. Departure is from Paris\tWould you like to depart Paris on September 6th?\tPreferably by the 3rd. Is there anything available?\tSorry, looks like there is nothing available from Paris to Long Beach on September 3rd.\tI'm not quite sure I understand, is there anything available leaving Long Beach to go to Paris between August 25 and September 3rd?\tWould you like to depart Paris on September 6th?\tNo. I would like to leave Long Beach around the 25th of August to go to Paris for some reason. What is so confusing about that!?\tYou can leave Long Beach, USA and go to Paris, France on Tuesday, August 30th. Will I book this?\tFinally! No, don't book yet, I would like to know more about the hotel. Is there free breakfast?\tThere is free wifi.\tLook buddy, is there free breakfast or not? Tell me, am I gonna get eggs, toast, cereal, etc? You know? The good stuff?\tThere is free wifi at the hotel. \tWhat is the price of this toastless package?\tThis package costs 2191.55USD.\tIs this the cheapest option?\tYes, this is the cheapest option from Long Beach to Paris.\tAnd the hotel has how many stars?\tMuse Hotel has 2.0 stars.\tOk I will book this one.\tGreat. You will leave Paris on September 7"
# Case 2
text = "Paris, Paris. Paris."
# Case 3
text = "Paris. Paris. Paris."
# Case 4
text = 'Pars. Paris. Paris. Paris.'
# Case 5
text = 'Paris. Paris. JParis. Paris. ' # JParis is the entity but this is not in Wikipedia, so REL cannot detect it.
If you do not have any more ideas for this correction, I will create a pull request.
---
text = "Hello. Hello. Hello"
This case works in my environment.
Regarding the unit test, I agree with you.
(I am still figuring out how I can create a unit test in GitHub, though.)
---
I can add some tests later, good idea.
---
Closed through #50
---
Issue
Currently, REL works for example 2 but not for example 1.
Examples
- The explanation below is based on example 1.
- Currently, the second "Paris" comes out at position 199 (see the first code block), though the right answer is 128 (see the second code block).
# Input text
dialogue = "Hi. I need to book a vacation to Long Beach between August 25 and September 3. Departure is from Paris\tWould you like to depart Paris on September 6th?\tPreferably by the 3rd. Is there anything available?\tSorry, looks like there is nothing available from Paris to Long Beach on September 3rd.\tI'm not quite sure I understand, is there anything available leaving Long Beach to go to Paris between August 25 and September 3rd?\tWould you like to depart Paris on September 6th?\tNo. I would like to leave Long Beach around the 25th of August to go to Paris for some reason. What is so confusing about that!?\tYou can leave Long Beach, USA and go to Paris, France on Tuesday, August 30th. Will I book this?\tFinally! No, don't book yet, I would like to know more about the hotel. Is there free breakfast?\tThere is free wifi.\tLook buddy, is there free breakfast or not? Tell me, am I gonna get eggs, toast, cereal, etc? You know? The good stuff?\tThere is free wifi at the hotel. \tWhat is the price of this toastless package?\tThis package costs 2191.55USD.\tIs this the cheapest option?\tYes, this is the cheapest option from Long Beach to Paris.\tAnd the hotel has how many stars?\tMuse Hotel has 2.0 stars.\tOk I will book this one.\tGreat. You will leave Paris on September 7"
# Entity Linking
el(dialogue)
######## Output ########
# [[33, 10, 'Long Beach', 'Long_Beach,_California', ... ],
# [97, 5, 'Paris', 'Paris', 0.8168372931596612, 0.9900407195091248, 'LOC'],
# ---> [199, 5, 'Paris', 'Paris', 0.8369923447610185, 0.9948850274085999, 'LOC'],
# [271, 5, 'Paris', 'Paris', 0.7787875371965208, 0.9929811954498291, 'LOC'],
# ...
# Find all positions of "Paris"
print([m.span() for m in re.finditer('Paris', dialogue)])
######## Output ########
# [(97, 102),
# ---> (128, 133),
# (254, 259), ...
Cause
Summary
- See code block [4]. In some cases, including example 1, mentions appear several times in one sentence (e.g., "Paris, Paris."). In that case, we want `text[offset:].find(sent)` to be 0 from the second mention in the same sentence onward. But that does not work as intended currently, because of `offset += len(sent)`. (For more details, please see the Details section below.)
- However, if we delete `offset += len(sent)`, the code does not work for example 2 (i.e., "Paris. Paris. Paris. Paris.").
Details
Let's think about the below example.
dialogue = "Paris, Paris. Paris."
There are three mentions of "Paris" and two sentences, "Paris, Paris." and "Paris.", here.
The current situation and the expected answer are presented below; the positions of the second and third mentions are wrong.
Current situation
dialogue = 'Paris, Paris. Paris.'
el(dialogue) # Entity Linking
######## Output ########
# [[0, 5, 'Paris', 'Paris', 0.9057319011744249, 0.9962735176086426, 'LOC'],
# [19, 5, 'Paris', 'Paris', 0.9029048880752896, 0.9977388381958008, 'LOC'],
# [24, 5, 'Paris', 'Paris', 0.9029048880752896, 0.9808972477912903, 'LOC']]
Expected answer
mention = 'Paris'
print([m.span() for m in re.finditer(mention, dialogue)])
######## Output ########
# [(0, 5), (7, 12), (14, 19)]
Explanation
Let's think about the for loop in [4].
First Loop
- `ment`: "Paris"
- `sent`: "Paris, Paris."
`offset` becomes 0 after L97 is processed. Then `offset` becomes `len(sent)`, here 13.
Second Loop
- `ment`: "Paris"
- `sent`: "Paris, Paris."
Note that `ment` "Paris" indicates the second "Paris" in `sent`, not the first one.
However, `offset` is already 13, which means the `find` in `text[offset:].find(sent)` searches from the second sentence onward.
Of course you cannot find "Paris, Paris." in the second sentence, so `text[offset:].find(sent)` becomes -1 and `offset` becomes 12 (13 + (-1); please see L97).
The start position is then calculated in L100:
start_pos = offset + ment["pos"]
where `offset` is 12 and `ment["pos"]` is 7 (*1), so `start_pos` becomes 19. This is exactly the wrong value shown in the Current situation section above.
(*1) `ment["pos"]` indicates the mention's start position within its sentence. Here, the second "Paris" starts at position 7 of the sentence "Paris, Paris.".
How to fix
I cannot come up with a good idea.
One "easy" way to solve this situation is to use a separate function to calculate the start position, like:
def start_pos_itr(ment_doc, text):
    """
    Responsible for obtaining the start position for each mention.
    :return:
        start_pos: start position
    :args:
        ment_doc: start position, end position, sentence for each mention.
            Obtained from "mentions_dataset[doc]".
            E.g., [{'pos': 0, 'end_pos': 5, 'sentence': 'Paris xxx yyy zzz...'},...
        text: Input text for REL (text_doc).
            Obtained from "processed[doc][0]".
            Note that text consists of one or several sentence(s).
    """
    offset, prev_pos = 0, -1
    for ment in ment_doc:  # For each mention
        sent, pos = ment["sentence"], ment["pos"]
        if pos <= prev_pos:  # "pos <= prev_pos" means the sentence has just moved on to the next one.
            offset += len(prev_sent)
        offset = text[offset:].find(sent) + offset
        start_pos = offset + pos
        prev_pos, prev_sent = pos, sent  # Save the previous sentence and position
        yield start_pos
The key part is:
if pos <= prev_pos:  # "pos <= prev_pos" means the sentence has just moved on to the next one.
    offset += len(prev_sent)
The basic idea is the same as using `offset += len(sent)`; the new part is the if statement above. The idea is as follows:
- If there are several mentions in the same sentence, the current position becomes bigger than the previous position on each iteration.
- If the current position is smaller than or equal to the previous position, that means a new sentence has started, so we can do `offset += len(prev_sent)`.
In my own environment, this function works for both examples 1 and 2.
However, this fix makes the code more complicated, so I am wondering if there is an easier way to solve the current issue.
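For concreteness, here is `start_pos_itr` applied to the "Paris, Paris. Paris." example (function body copied from above; the mention dicts are hand-made stand-ins):

```python
def start_pos_itr(ment_doc, text):
    """Yield the absolute start position for each mention (copied from above)."""
    offset, prev_pos = 0, -1
    for ment in ment_doc:
        sent, pos = ment["sentence"], ment["pos"]
        if pos <= prev_pos:  # sentence boundary crossed
            offset += len(prev_sent)
        offset = text[offset:].find(sent) + offset
        prev_pos, prev_sent = pos, sent
        yield offset + pos

text = "Paris, Paris. Paris."
ment_doc = [
    {"sentence": "Paris, Paris.", "pos": 0},
    {"sentence": "Paris, Paris.", "pos": 7},
    {"sentence": "Paris.", "pos": 0},
]
print(list(start_pos_itr(ment_doc, text)))  # [0, 7, 14]
```

This yields the expected [0, 7, 14] rather than the buggy [0, 19, 24].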
References
[4] https://github.com/informagi/REL/blob/master/REL/utils.py#L91
---
Thank you for your detailed explanation of your idea! It is a very clear and clever solution! I agree with your correction @KDercksen .
Of course this new problem is a crafted example that you probably won't find in the wild,
I ran into this problem in my local environment, but when I used the REL API, I could not reproduce the same error. I tried your example, but I could not find any problem there either. It might be that some other function(s) in REL "correct" the problem you mentioned, but I have not identified them yet.
---
Thank you for your proposal! I tested your idea. It worked for example 2 (i.e., 'Paris. Paris. Paris. Paris.') and example 3 (the whitespace example), but it did not work for example 1 in my environment.
Details
The error message is as follows.
KeyError Traceback (most recent call last)
<ipython-input-67-a3f30ce27db0> in <module>
----> 1 process_results(mentions_dataset, processed)
<ipython-input-65-1faef6f7f268> in process_results(mentions_dataset, processed, include_offset)
18 cur_sent = sents[idx]
19 # length of previous sentence
---> 20 prev_sent_len = len(sents[idx-1])
21 # number of chars up until now: previous sentence offset + length of previous sentence
22 until_now = cum_offsets[idx-1] + prev_sent_len
KeyError: 0
where `idx - 1` is 0.
Cause
Not all sentences have a mention.
In example 1, only `sent_idx` 1, 2, 5, ... have a mention (that means `sent_idx` 0 does not contain any mention).
# Input text is the same as example 1.
for i in range(len(mentions_dataset['API_DOC'])):
    print(mentions_dataset['API_DOC'][i]['sent_idx'], end=', ')
######## Output ########
# 1, 2, 2, 5, 5, 6, 6, 7, 8, 8, 10, 10, 10, 10, 24, 24, 29,
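The KeyError can be reproduced in isolation (the mention dict is a hand-made stand-in for REL's output):

```python
# Sentence 0 carries no mention, so the sents dict built from mentions
# only has no key 0 -- looking up sents[idx - 1] then fails.
ment_doc = [{"sent_idx": 1, "sentence": "Paris."}]
sents = {m["sent_idx"]: m["sentence"] for m in ment_doc}

missing = None
try:
    for idx in range(1, max(sents.keys()) + 1):
        prev_sent_len = len(sents[idx - 1])  # KeyError when idx - 1 == 0
except KeyError as err:
    missing = err.args[0]
print("KeyError:", missing)  # KeyError: 0
```

Any offset dictionary therefore has to be built from the full sentence list, not just from the sentences that happen to carry mentions.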
How to fix...?
- I like your idea because it is very simple and intuitive, so I would like to use the `sent_idx` field if possible.
- We also have the `processed` data as an argument to `process_results`, and this contains the whole input text. If we used `find()`, we could identify the "offset". ...However, we do not want to use `find()` anymore here, right?
- My idea is to create `sents` in an earlier step (e.g., in the for loop of the `find_mentions` function) and pass it to `process_results` as an argument. If you agree with my idea, I will fix the program based on it and test it.
What do you think @KDercksen ?
---
Yes, let's finalize this! Looks good to me
---
Looks good, one final test case would be one without any entities:
text = "Hello. Hello. Hello"
Perhaps we should create a unit test based on this? Seems like something that could easily be broken given some incorrect MD module.
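A pytest-style sketch of such a test, written against a small pure helper rather than the full pipeline (the helper name and the mention dicts are hypothetical; a real test would run the MD module end-to-end as in the tests folder):

```python
import re

def absolute_start_positions(text, ment_doc):
    # Hypothetical helper: cursor-based offsets as discussed in this thread.
    positions, cursor, prev_idx, sent_start = [], 0, None, 0
    for ment in ment_doc:
        if ment["sent_idx"] != prev_idx:  # advance cursor at sentence change
            sent_start = text.find(ment["sentence"], cursor)
            cursor = sent_start + len(ment["sentence"])
            prev_idx = ment["sent_idx"]
        positions.append(sent_start + ment["pos"])
    return positions

def test_no_entities():
    # "Hello. Hello. Hello" yields no mentions; must not raise.
    assert absolute_start_positions("Hello. Hello. Hello", []) == []

def test_repeated_mentions():
    text = "Paris, Paris. Paris."
    ment_doc = [
        {"sent_idx": 0, "sentence": "Paris, Paris.", "pos": 0},
        {"sent_idx": 0, "sentence": "Paris, Paris.", "pos": 7},
        {"sent_idx": 1, "sentence": "Paris.", "pos": 0},
    ]
    expected = [m.start() for m in re.finditer("Paris", text)]
    assert absolute_start_positions(text, ment_doc) == expected
```

Comparing against `re.finditer` keeps the expected values honest without hard-coding them.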
---
Regarding creating the GitHub unit test (?), it seems that I need more time to learn how this works.
(I am still figuring out where I can find the current test implementation in REL. Is it implemented in test_ed_pipeline.py?)
I have written a draft of the pull request for this issue. Should I submit it right now, or only after implementing the (GitHub?) unit test?
Thank you @KDercksen. I will submit my pull request. If you think this is not a good decision (i.e., that it is better to submit after your implementation of the unit test), please reject my pull request.
---