Comments (10)
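I think I found the "bug". In the _single_stage_predict method, input_ids and offset_maps have the wrong size when the for loop at line 497 runs. If long texts get split into multiple texts (for example, a batch originally holds 128 texts and splitting turns them into 200), the code at line 455 processes them in batches of self._batch_size. However, input_ids and offset_maps at line 497 only hold the data of the last batch (i.e. 200 - 128 = 72 entries), whereas start_ids_list and end_ids_list are concatenated across batches and therefore still have length 200. As a result, the four variables used at line 497 have different sizes.
Fix: right after input_ids = encoded_inputs["input_ids"] at line 494, add the line offset_maps = encoded_inputs["offset_mapping"]. Then change lines 497-508 to: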
for start_ids, end_ids, ids, offset_map in zip(start_ids_list, end_ids_list, input_ids.tolist(), offset_maps.tolist()):
    for i in reversed(range(len(ids))):
        if ids[i] != 0:
            ids = ids[:i]
            break
    span_list = get_span(start_ids, end_ids, with_prob=True)
    sentence_id, prob = get_id_and_prob(span_list, offset_map)
    sentence_ids.append(sentence_id)
    probs.append(prob)
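The size mismatch described above can be reproduced without the model. The following is a minimal sketch, not the predictor's actual code: iter_batches, short_texts and the "pred(...)" strings are made-up stand-ins, and only the numbers 128, 200 and 72 come from the comment. It shows that when a loop uses zip, as the snippet above does, pairing a concatenated list with a variable that only holds the last batch silently truncates and misaligns the data.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 200 short texts after splitting, processed in batches of 128
# (numbers taken from the comment; everything else is illustrative).
short_texts = [f"text-{i}" for i in range(200)]
batch_size = 128

predictions = []          # concatenated across batches, like start_ids_list
last_batch_inputs = None  # overwritten on every iteration (the buggy pattern)
for batch in iter_batches(short_texts, batch_size):
    predictions.extend(f"pred({t})" for t in batch)
    last_batch_inputs = batch

# zip stops at the shorter sequence, so no exception is raised here.
buggy_pairs = list(zip(predictions, last_batch_inputs))
# Pairing with the full, un-batched inputs keeps everything aligned.
fixed_pairs = list(zip(predictions, short_texts))

print(len(buggy_pairs), len(fixed_pairs))
print(buggy_pairs[0])
```

In the buggy pairing, the prediction for text-0 is matched with the input text-128, and only 72 results are produced for 200 short texts. A result list that is shorter than the input mapping expects would be consistent with an index error further downstream, though that link is an assumption here.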
from uie_pytorch.
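@Gladiator566 I added a check in _auto_joiner that skips the empty case. It goes after the else at line 558 and before the for loop. It no longer raises an error, but I'm not sure whether anything gets dropped. You can try: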
else:
    if len(short_results) <= v:  # added
        continue  # added
    for i in range(len(short_results[v])):
        if 'start' not in short_results[v][i] or 'end' not in short_results[v][i]:
            continue
        short_results[v][i]['start'] += offset
        short_results[v][i]['end'] += offset
    offset += len(short_inputs[v])
    single_results.extend(short_results[v])
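To see what the guard does and what it may drop, here is a self-contained sketch of the joining step. join_short_results is a hypothetical simplification written for this thread, not the real _auto_joiner; it only assumes that each short result is a list of dicts with optional 'start' and 'end' keys and that input_mapping maps an original text index to its chunk indices.

```python
def join_short_results(short_results, short_inputs, input_mapping):
    """Merge per-chunk spans back into one result list per original text."""
    joined = []
    for doc_idx in sorted(input_mapping):
        offset = 0
        doc_results = []
        for v in input_mapping[doc_idx]:
            if v >= len(short_results):
                # Guard from the comment above: no prediction for this chunk.
                continue
            for span in short_results[v]:
                shifted = dict(span)  # copy so the input is not mutated
                if "start" in shifted and "end" in shifted:
                    shifted["start"] += offset
                    shifted["end"] += offset
                doc_results.append(shifted)
            offset += len(short_inputs[v])
        joined.append(doc_results)
    return joined


short_inputs = ["abcde", "fghij", "klmno"]
short_results = [
    [{"text": "b", "start": 1, "end": 2}],
    [{"text": "g", "start": 1, "end": 2}],
]  # the result for chunk 2 is missing
input_mapping = {0: [0, 1, 2]}

joined = join_short_results(short_results, short_inputs, input_mapping)
print(joined)
```

The guard avoids the exception, but any spans that should have come from the missing chunk are lost without a warning, so it hides the underlying length mismatch rather than fixing it.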
from uie_pytorch.
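I ran into this too...
Chinese seems to work. With English data and overly long input text, I suspect max_predict_len is computed incorrectly.
uie_predictor.py", line 418
max_predict_len = self._max_seq_len - len(max(prompts)) - 3
For relation extraction, the prompt length should be the length of "subject of predicate", but when I debugged, the length here seemed to be only that of the subject.
Have you managed to fix it?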
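One thing worth checking in the quoted line is plain Python behavior: max() on a list of strings returns the lexicographically greatest string, not the longest one. The sketch below uses made-up prompts and an assumed max_seq_len of 512; whether this is what actually causes the error in this thread is not confirmed.

```python
# Hypothetical prompts; max_seq_len = 512 is an assumed value.
prompts = ["zoo", "founder of Example Corporation"]
max_seq_len = 512

lexicographic = max(prompts)     # compares strings character by character
longest = max(prompts, key=len)  # compares by length

budget_lexicographic = max_seq_len - len(lexicographic) - 3
budget_longest = max_seq_len - len(longest) - 3

print(repr(lexicographic), budget_lexicographic)
print(repr(longest), budget_longest)
```

With these prompts, the first expression picks "zoo", so the text budget is computed from a 3-character prompt even though a much longer prompt is in the same list. Note also that len() counts characters, which for English text differs from the number of tokens.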
from uie_pytorch.
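My earlier understanding seems to have been somewhat off.
When handling long text, simply discarding the trailing part makes it work.
So it should be a batch-length problem.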
from uie_pytorch.
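This is most likely caused by how overly long English input is split into segments. I have verified that the approach below fixes the index-out-of-range error I was getting when predicting on long English text.
For long English input, consider splitting the text with the approach shown in https://stackoverflow.com/questions/51952833/how-to-split-string-to-substrings-with-given-length-but-not-breaking-sentences and then adjusting _auto_splitter as follows: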
if not split_sentence:
    sens = [text]
else:
    sens = cut_english_sent(text)
for sen in sens:
    # the next line was changed
    temp_text_list = list(get_sentences(sen, max_text_len))
    short_input_texts.extend(temp_text_list)
    short_idx = cnt_short
    # the next line was changed
    cnt_short += len(temp_text_list)
    temp_text_id = [short_idx + i for i in range(cnt_short - short_idx)]
    if cnt_org not in input_mapping.keys():
        input_mapping[cnt_org] = temp_text_id
    else:
        input_mapping[cnt_org].extend(temp_text_id)
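The key point of the change above is that the chunk counter advances by the number of chunks actually produced. A runnable sketch of that bookkeeping follows. get_sentences and cut_english_sent are not shown in the thread, so this example substitutes the standard library's textwrap.wrap as a stand-in splitter; split_on_word_boundaries and auto_split are hypothetical names.

```python
import textwrap


def split_on_word_boundaries(text, max_text_len):
    """Stand-in splitter: break on whitespace, at most max_text_len chars each."""
    return textwrap.wrap(text, width=max_text_len) or [""]


def auto_split(input_texts, max_text_len):
    """Split each text and record which short texts belong to which original."""
    input_mapping = {}
    short_input_texts = []
    for doc_idx, text in enumerate(input_texts):
        chunks = split_on_word_boundaries(text, max_text_len)
        first = len(short_input_texts)
        short_input_texts.extend(chunks)
        # Index range is derived from the real chunk count, not from
        # ceil(len(text) / max_text_len).
        input_mapping[doc_idx] = list(range(first, first + len(chunks)))
    return short_input_texts, input_mapping


texts = ["the quick brown fox jumps over the lazy dog", "short"]
short_texts, mapping = auto_split(texts, max_text_len=10)
print(short_texts)
print(mapping)
```

One caveat of this stand-in: textwrap.wrap drops the whitespace at chunk boundaries, so summing chunk lengths no longer reproduces character offsets in the original text. Any code that shifts 'start' and 'end' by accumulated chunk lengths would need to account for that.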
As for exactly why the previous splitting approach raises the index-out-of-range error, I did not dig into it. Anyone interested can debug it.
from uie_pytorch.
In addition, when running English relation inference, the get_id_and_prob method also has a "bug". For example, in sentiment extraction the result may be:
'Sentiment classification [negative, positive]': [{'end': 45,
'probability': 0.9998571872711182,
'start': 37,
'text': ''}]
Cause: what gets subtracted from offset_map is the number of tokens in the prompt, when it should be the prompt's length in characters.
def get_id_and_prob(spans, offset_map):
    prompt_length = 0
    prompt_char_length = 0
    for i in range(1, len(offset_map)):
        if offset_map[i] != [0, 0]:
            prompt_length += 1  # number of tokens in the prompt
            prompt_char_length = offset_map[i][-1]
        else:
            break
    for i in range(1, prompt_length + 1):
        offset_map[i][0] -= (prompt_char_length + 1)
        offset_map[i][1] -= (prompt_char_length + 1)
    sentence_id = []
    prob = []
    for start, end in spans:
        prob.append(start[1] * end[1])
        sentence_id.append((offset_map[start[0]][0], offset_map[end[0]][1]))
    return sentence_id, prob
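A small worked example may make the difference concrete. The offset_map below is hand-written for an imaginary three-word English prompt ("good or bad", 11 characters) followed by a two-token text; it is not real tokenizer output, and shift_prompt_offsets is a hypothetical helper that isolates just the shifting step.

```python
import copy


def shift_prompt_offsets(offset_map, use_char_length):
    """Shift the prompt tokens' offsets so they become negative markers."""
    offset_map = copy.deepcopy(offset_map)  # keep the caller's list intact
    prompt_tokens = 0
    prompt_chars = 0
    for i in range(1, len(offset_map)):
        if offset_map[i] == [0, 0]:  # separator after the prompt
            break
        prompt_tokens += 1
        prompt_chars = offset_map[i][1]  # end offset of the last prompt token
    shift = (prompt_chars if use_char_length else prompt_tokens) + 1
    for i in range(1, prompt_tokens + 1):
        offset_map[i][0] -= shift
        offset_map[i][1] -= shift
    return offset_map


# [CLS], "good", "or", "bad", [SEP], two text tokens, [SEP] (illustrative).
offset_map = [[0, 0], [0, 4], [5, 7], [8, 11], [0, 0], [0, 5], [6, 10], [0, 0]]

by_tokens = shift_prompt_offsets(offset_map, use_char_length=False)
by_chars = shift_prompt_offsets(offset_map, use_char_length=True)

print(by_tokens[1:4])
print(by_chars[1:4])
```

Shifting by the token count (3 + 1) leaves some prompt offsets non-negative, so a prompt span can be mistaken for a span in the text. Shifting by the character length (11 + 1) makes every prompt offset negative. For character-level tokenization the two counts coincide, which may be why the problem shows up mainly with English input.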
from uie_pytorch.
I also run into this 'text': '' problem frequently when extracting entities. Is there a solution?
from uie_pytorch.
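"I also run into this 'text': '' problem frequently when extracting entities. Is there a solution?"
In the uie_predictor code, adding a start == end check at this position to terminate the loop is enough: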
for i in range(len(sentence_id)):
    start, end = sentence_id[i]
    if start < 0 and end >= 0:
        continue
    if end < 0:
        start += (len(prompt) + 1)
        end += (len(prompt) + 1)
        result = {"text": prompt[start:end],
                  "probability": prob[i]}
        result_list.append(result)
        if start == end:
            break
    else:
        result = {
            "text": text[start:end],
from uie_pytorch.
So, folks, how should this error be resolved?
File "D:\ZJH\Projects\deploy\src\uie\predictor.py", line 234, in __call__
results = self._multi_stage_predict(texts)
File "D:\ZJH\Projects\\deploy\src\uie\predictor.py", line 296, in _multi_stage_predict
result_list = self._single_stage_predict(examples)
File "D:\ZJH\Projects\\deploy\src\uie\predictor.py", line 548, in _single_stage_predict
results = self._auto_joiner(results, short_input_texts,
File "D:\ZJH\Projects\deploy\src\uie\predictor.py", line 598, in _auto_joiner
for i in range(len(short_results[v])):
IndexError: list index out of range
from uie_pytorch.
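I get this error when extracting entities from long Chinese text, and none of the fixes suggested in this issue so far seem to resolve it.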
from uie_pytorch.