2020ccf-ner's People

Contributors

babermuyu

2020ccf-ner's Issues

out of memory

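Beginner question:
When I run the project locally, my computer crashes and shuts down after a short while. I assumed my hardware was too weak to run it,
so I moved the project to Kaggle and ran it there.
Why do I still get an out of memory error?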

Missing train_data

Sorry to bother you. When I run the code I get: No such file or directory: '/Users//2020CCF-NER-main/data/ccf2020/train_data/'
Do I need to place the data under the ccf2020 folder myself?

lattice start/end do not match the text

I have a question.
Here is a sample generated by the code:
"text": "《别告诉我你懂PPT》《不懂项目管理还敢拼职场》《让营销更性感》的作者李治(Liz),《不懂项目管理,还敢拼职场》及《别告诉我你懂PPT》的作者"", "entities": [], "lattice": [["告诉", 2, 3], ["项目", 14, 15], ["管理", 16, 17], ["职场", 21, 22], ["营销", 26, 27], ["性感", 29, 30], ["作者", 33, 34], ["项目", 46, 47], ["管理", 48, 49], ["职场", 54, 55], ["告诉", 60, 61], ["作者", 70, 71]]}

After text is passed through bert_tokenizer, the result is:
[101, 517, 1166, 1440, 6401, 2769, 872, 2743, 8842, 518, 517, 679, 2743, 7555, 4680, 5052, 4415, 6820, 3140, 2894, 5466, 1767, 518, 517, 6375, 5852, 7218, 3291, 2595, 2697, 518, 4638, 868, 5442, 3330, 3780, 8020, 9341, 8253, 8021, 8024, 517, 679, 2743, 7555, 4680, 5052, 4415, 8024, 6820, 3140, 2894, 5466, 1767, 518, 1350, 517, 1166, 1440, 6401, 2769, 872, 2743, 8842, 518, 4638, 868, 5442, 107, 102]

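I noticed that the lattice start and end values do not line up with text_ids. For example, for 项目 (14, 15), the tokens at positions 14 and 15 of text_ids are not 项目. Does this affect the results?

(The cause is that the whole word PPT is tokenized into a single id.)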
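The mismatch described above can be reproduced with a small sketch. It assumes the lattice indices are 0-based, inclusive character offsets into text, and it uses a hypothetical toy tokenizer (`toy_tokenize`) that merges a run of ASCII letters or digits into one token, which only approximates what a real BERT tokenizer does. `remap_lattice` and the `offset` parameter (for the leading [CLS] token) are likewise illustrative names, not part of the repository.

```python
def toy_tokenize(text):
    """Toy tokenizer: a run of ASCII letters/digits becomes one token,
    every other character is its own token.
    Returns the tokens and a list mapping each char index to its token index."""
    tokens, char_to_token = [], []
    i = 0
    while i < len(text):
        j = i + 1
        if text[i].isascii() and text[i].isalnum():
            while j < len(text) and text[j].isascii() and text[j].isalnum():
                j += 1
        tokens.append(text[i:j])
        char_to_token.extend([len(tokens) - 1] * (j - i))
        i = j
    return tokens, char_to_token


def remap_lattice(lattice, char_to_token, offset=1):
    """Convert character-level [word, start, end] spans to token-level spans.
    offset=1 accounts for a [CLS] token placed before the text."""
    return [[word, char_to_token[start] + offset, char_to_token[end] + offset]
            for word, start, end in lattice]


text = "《别告诉我你懂PPT》《不懂项目管理"
lattice = [["告诉", 2, 3], ["项目", 14, 15]]

tokens, char_to_token = toy_tokenize(text)
remapped = remap_lattice(lattice, char_to_token)
print(remapped)
```

With this toy tokenization, 项目 moves from character positions 14 and 15 to token positions 13 and 14, which is where ids 7555 and 4680 sit in the text_ids list above. With a real Hugging Face fast tokenizer, the character-to-token mapping could instead be derived from the offsets returned when `return_offsets_mapping=True` is passed.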

PU learning

Could you tell me roughly which part of the code implements PU learning?

AttributeError: 'str' object has no attribute 'detach'

In encoder.py, vec is of type str, so calling vec.detach() raises an error. Should detach() be removed here?

    def get_bert_vec(self, text, text_mask, text_pos=None):
        if text_pos is None:
            _, _, text_vecs = self.bert(text, text_mask)
        else:
            _, _, text_vecs = self.bert(text, text_mask, position_ids=text_pos)
        text_vecs = list(text_vecs)
        if self.detach_ptm_flag:
            for i, vec in enumerate(text_vecs):
                text_vecs[i] = vec.detach()
        return text_vecs
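One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the installed transformers version returns a dict-like output object instead of a tuple, then tuple-unpacking it yields its keys (strings) rather than tensors. The sketch below reproduces that effect with a plain OrderedDict standing in for the model output. `FakeTensor` and the key names are hypothetical stand-ins chosen for illustration.

```python
from collections import OrderedDict


class FakeTensor:
    """Hypothetical stand-in for a tensor; only detach() is modelled."""

    def detach(self):
        return self


# Stand-in for a dict-like model output (an assumption about the library version).
output = OrderedDict(
    last_hidden_state=FakeTensor(),
    pooler_output=FakeTensor(),
    hidden_states=(FakeTensor(), FakeTensor()),
)

# Unpacking a mapping iterates over its KEYS, so the third name is bound to a str.
_, _, unpacked = output
chars = list(unpacked)  # list() on a str gives single characters, each a str

# Indexing by key returns the actual values, which do support detach().
text_vecs = [vec.detach() for vec in output["hidden_states"]]

print(type(unpacked).__name__, chars[:3], len(text_vecs))
```

If this is the cause, removing detach() would only hide the error, since text_vecs would then be a list of characters. Reading the hidden states by key, or asking the model for a tuple (for example by passing `return_dict=False`, if the installed version supports it), would keep the original logic intact.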

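How char_word_mask and part_size are computed when TextEncoder processes FLAT input

May I ask what the basis is for the somewhat unusual way the code computes word_mask, part_size, and so on (in the collate_fn_test method of NERModelFitting.py)?
If the goal is simply to build standard FLAT input, the result looks wrong to me. This unusual mask also clearly masks out all of the ordinary text tokens during the model's Transformer computation.
So, if I may ask: is there a reason for handling it this way, did it just happen to produce a good score, or is the uploaded code not the final, correct version?
I would appreciate any guidance. Thank you.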

