
Comments (10)

liuph0119 commented on June 3, 2024

I think I've found the "bug".

In the _single_stage_predict method, the sizes of input_ids and offset_maps are wrong in the for loop at line 497. If long texts get split into multiple shorter texts (say a batch originally holds 128 texts and becomes 200 after splitting), line 455 processes them in batches of self._batch_size. By line 497, input_ids and offset_maps only contain the last batch (200 - 128 = 72 items), whereas start_ids_list and end_ids_list were concatenated across all batches and still have length 200, so the four variables zipped at line 497 no longer have matching sizes.

Fix: at line 494, next to input_ids = encoded_inputs["input_ids"], add a line offset_maps = encoded_inputs["offset_mapping"], then change lines 497-508 to:

        # iterate over per-text predictions together with the (now complete) encoded inputs
        for start_ids, end_ids, ids, offset_map in zip(
                start_ids_list, end_ids_list, input_ids.tolist(), offset_maps.tolist()):
            # strip trailing padding (token id 0) from the input ids
            for i in reversed(range(len(ids))):
                if ids[i] != 0:
                    ids = ids[:i]
                    break
            span_list = get_span(start_ids, end_ids, with_prob=True)
            sentence_id, prob = get_id_and_prob(span_list, offset_map)
            sentence_ids.append(sentence_id)
            probs.append(prob)
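
For anyone who wants to see the size mismatch without loading the model, here is a minimal, self-contained sketch (the numbers and variable names follow the description above; the real loop lives in _single_stage_predict):

from math import ceil

n_fragments, batch_size = 200, 128          # 128 original texts split into 200 fragments
fragments = list(range(n_fragments))        # stand-ins for the encoded short texts

start_ids_list, end_ids_list = [], []
for b in range(ceil(n_fragments / batch_size)):
    batch = fragments[b * batch_size:(b + 1) * batch_size]
    input_ids = batch                       # overwritten each iteration -> only the last 72 survive
    start_ids_list.extend(batch)            # concatenated across batches -> grows to 200
    end_ids_list.extend(batch)

print(len(input_ids), len(start_ids_list))  # 72 vs. 200: zip() then silently drops 128 results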


jiaohuix commented on June 3, 2024

@Gladiator566 I added a check in _auto_joiner that skips the empty case, placed after the else on line 558 and before the for loop. The error is gone now, but I'm not sure whether anything gets silently dropped. You can give it a try:

                    else:
                        if len(short_results) <= v:  # added: guard against missing results for this fragment
                            continue                 # added
                        for i in range(len(short_results[v])):
                            if 'start' not in short_results[v][i] or 'end' not in short_results[v][i]:
                                continue
                            short_results[v][i]['start'] += offset
                            short_results[v][i]['end'] += offset
                        offset += len(short_inputs[v])
                        single_results.extend(short_results[v])
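
To make the failure mode concrete, here is a tiny, simplified illustration (names mirror _auto_joiner, the data is made up): if the model returns no entry for a trailing fragment, indexing short_results[v] raises the IndexError, and the guard above simply skips that fragment:

short_inputs = ["fragment A", "fragment B", "fragment C"]
short_results = [
    [{"text": "A", "start": 0, "end": 1}],
    [{"text": "B", "start": 0, "end": 1}],
]  # no entry for the third fragment

single_results, offset = [], 0
for v in range(len(short_inputs)):
    if len(short_results) <= v:   # the added guard: skip fragments without a result entry
        continue
    for r in short_results[v]:
        if "start" in r and "end" in r:
            r["start"] += offset
            r["end"] += offset
    offset += len(short_inputs[v])
    single_results.extend(short_results[v])

print(single_results)  # two results survive; the missing fragment is silently dropped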


zjms commented on June 3, 2024

I ran into this too...
Chinese seems fine; with English data, when the input text is too long, I suspect the max_predict_len value is the problem.
uie_predictor.py", line 418
max_predict_len = self._max_seq_len - len(max(prompts)) - 3
For relation extraction, the prompt length should be the length of "subject of predicate", but when I debug it, the length here only seems to be the subject's length.

Have you managed to fix it?
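
One thing worth noting about the quoted line: max() on a list of strings compares lexicographically, so len(max(prompts)) is not necessarily the length of the longest prompt. A quick check with made-up prompts in the "subject of predicate" form (not taken from the library):

prompts = ["works at", "Steve Jobs of works at"]  # hypothetical prompts for illustration only
print(len(max(prompts)))                          # 8  -> "works at" wins the lexicographic comparison
print(max(len(p) for p in prompts))               # 22 -> length of the actual longest prompt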


zjms commented on June 3, 2024

My earlier understanding seems to have been a bit off.
For long texts, simply truncating and dropping the tail works fine.
So it should indeed be a batch-length problem.


liuph0119 commented on June 3, 2024

It should be caused by how overly long English inputs are segmented. I've verified that the approach below fixes the index-out-of-range error I was hitting when predicting on long English texts.

For long English inputs, you can split the long text following the approach shown at https://stackoverflow.com/questions/51952833/how-to-split-string-to-substrings-with-given-length-but-not-breaking-sentences, and then rework _auto_splitter like this:

            if not split_sentence:
                sens = [text]
            else:
                sens = cut_english_sent(text)
            for sen in sens:
                # changed: wrap each sentence into chunks of at most max_text_len characters
                temp_text_list = list(get_sentences(sen, max_text_len))
                short_input_texts.extend(temp_text_list)
                short_idx = cnt_short
                # changed: advance the counter by the number of chunks actually produced
                cnt_short += len(temp_text_list)
                temp_text_id = [short_idx + i for i in range(cnt_short - short_idx)]
                if cnt_org not in input_mapping.keys():
                    input_mapping[cnt_org] = temp_text_id
                else:
                    input_mapping[cnt_org].extend(temp_text_id)

Of course, I haven't dug into exactly why the previous splitting logic triggers the index-out-of-range error; anyone interested can debug it and take a look.
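
For reference, a minimal sketch of the get_sentences helper used above, assuming a textwrap-based wrapper along the lines of the answers in the linked StackOverflow thread (the real helper may differ):

import textwrap

def get_sentences(text, max_text_len):
    # Pack words into chunks of at most max_text_len characters; by default
    # textwrap.wrap only breaks a word if it is longer than the limit itself.
    return textwrap.wrap(text, width=max_text_len)

print(get_sentences("UIE splits long English inputs into shorter fragments for prediction.", 30))
# ['UIE splits long English inputs', 'into shorter fragments for', 'prediction.']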


liuph0119 commented on June 3, 2024

Also, when running relation inference on English, there is a "bug" in the get_id_and_prob method as well. For example, in sentiment extraction the result can come out as:

 'Sentiment classification [negative, positive]': [{'end': 45,
                                                    'probability': 0.9998571872711182,
                                                    'start': 37,
                                                    'text': ''}]

The cause: the offsets in offset_map are shifted by the prompt's token count, when they actually need to be shifted by the prompt's character length.

def get_id_and_prob(spans, offset_map):
    prompt_length = 0
    prompt_char_length = 0
    for i in range(1, len(offset_map)):
        if offset_map[i] != [0, 0]:
            prompt_length += 1  # number of tokens in the prompt
            prompt_char_length = offset_map[i][-1]  # character length of the prompt so far
        else:
            break

    # shift the prompt-token offsets by the prompt's *character* length, not its token count
    for i in range(1, prompt_length + 1):
        offset_map[i][0] -= (prompt_char_length + 1)
        offset_map[i][1] -= (prompt_char_length + 1)

    sentence_id = []
    prob = []
    for start, end in spans:
        prob.append(start[1] * end[1])
        sentence_id.append((offset_map[start[0]][0], offset_map[end[0]][1]))
    return sentence_id, prob
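
A worked example of why the character length matters, using the prompt from the output above (character offsets only; token boundaries are simplified):

prompt = "Sentiment classification [negative, positive]"
prompt_char_length = len(prompt)                   # 45 characters, but far fewer tokens
# Suppose the predicted span covers "negative", i.e. characters [26, 34) of the prompt.
start = 26 - (prompt_char_length + 1)              # -20
end = 34 - (prompt_char_length + 1)                # -12
# Downstream, _convert_ids_to_results sees end < 0 and maps the span back into the prompt:
start += len(prompt) + 1
end += len(prompt) + 1
print(prompt[start:end])                           # 'negative' instead of the empty string ''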


litterairplane commented on June 3, 2024

The 'text': '' problem also shows up frequently for me when extracting entities. Looking for a solution.


litterairplane commented on June 3, 2024

The 'text': '' problem also shows up frequently for me when extracting entities.
In uie_predictor.py, adding a start == end check at this spot to break out of the loop fixes it:

for i in range(len(sentence_id)):
    start, end = sentence_id[i]
    if start < 0 and end >= 0:
        continue
    if end < 0:
        start += (len(prompt) + 1)
        end += (len(prompt) + 1)
        result = {"text": prompt[start:end],
                  "probability": prob[i]}
        result_list.append(result)
        if start == end:  # added: stop once an empty span inside the prompt is hit
            break
    else:
        result = {
            "text": text[start:end],

jiaohuix commented on June 3, 2024

So, does anyone know how this error should be fixed?

  File "D:\ZJH\Projects\deploy\src\uie\predictor.py", line 234, in __call__
    results = self._multi_stage_predict(texts)
  File "D:\ZJH\Projects\\deploy\src\uie\predictor.py", line 296, in _multi_stage_predict
    result_list = self._single_stage_predict(examples)
  File "D:\ZJH\Projects\\deploy\src\uie\predictor.py", line 548, in _single_stage_predict
    results = self._auto_joiner(results, short_input_texts,
  File "D:\ZJH\Projects\deploy\src\uie\predictor.py", line 598, in _auto_joiner
    for i in range(len(short_results[v])):
IndexError: list index out of range


Gladiator566 commented on June 3, 2024

I get this error when extracting entities from long Chinese texts, and none of the fixes in this issue seem to work for me.

