Comments (3)
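- The language model was trained on Chinese Wikipedia and People's Daily data. For this model, person names, company names, and place names are out-of-vocabulary words, so it is very likely to flag them as suspected errors.
- Approach, a few suggestions:
- a) Use a CRF model that performs well at recognizing person, company, and place names to identify these proper nouns, then filter them out.
- b) For domain-specific correction scenarios, such as trademark infringement protection or speech recognition correction, add a lexicon of canonical terms (原体词表) and only detect errors against that lexicon, ignoring everything else. This can improve both precision and efficiency considerably.
- c) Use an RNN-style deep model with sequence features. The model itself learns semantic representations of words, and for a supervised model, proper nouns such as person names rarely contain typos in the first place. This kind of model should work well, but I only have a limited amount of such data and have not run experiments to compare the results.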
from pycorrector.
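A minimal sketch of suggestion a), assuming the suspected-error spans and the named-entity spans are already available as character offsets. The function names, the example sentence, and the spans below are hypothetical; this does not use pycorrector's API or any particular CRF library, it only shows the filtering step.

```python
from typing import List, Tuple

# A span is a half-open character range [start, end) in the sentence.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def filter_entity_errors(suspects: List[Span], entities: List[Span]) -> List[Span]:
    """Keep only suspected errors that do not overlap any named entity."""
    return [s for s in suspects if not any(overlaps(s, e) for e in entities)]


# Hypothetical inputs: in practice `suspects` would come from the
# language-model detector and `entities` from a CRF-based NER model.
sentence = "张伟在阿里巴巴工作,他的工做很忙"
suspects = [(0, 2), (3, 7), (12, 14)]  # 张伟, 阿里巴巴, 工做
entities = [(0, 2), (3, 7)]            # 张伟 (person), 阿里巴巴 (company)

kept = filter_entity_errors(suspects, entities)
print([sentence[s:e] for s, e in kept])
```

Only the span that falls outside every entity is passed on to the correction step, so out-of-vocabulary names are no longer reported as errors.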
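Thanks!
Regarding the second suggestion, what exactly does the lexicon of canonical terms (原体词表) refer to? Could you give an example?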
from pycorrector.
For example, a lexicon of the trademark brand names being protected.
from pycorrector.
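As a rough illustration of lexicon-only correction, here is a sketch that scans the text for near-misses of protected brand names and ignores everything else. The lexicon contents, function names, and the one-character mismatch rule are assumptions made for this example; it is not how pycorrector implements anything, and it only handles same-length substitution errors.

```python
from typing import List, Tuple

# Hypothetical lexicon of protected brand names (canonical forms).
BRAND_LEXICON = ["阿里巴巴", "华为技术"]


def find_lexicon_errors(
    text: str, lexicon: List[str], max_mismatch: int = 1
) -> List[Tuple[int, str, str]]:
    """Return (start, wrong, canonical) for windows that nearly match a term."""
    hits = []
    for term in lexicon:
        n = len(term)
        if n < 3:
            # Very short terms produce too many false positives; skip them.
            continue
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            # Position-wise (Hamming) comparison with the canonical term.
            mismatches = sum(1 for a, b in zip(window, term) if a != b)
            if 0 < mismatches <= max_mismatch:
                hits.append((i, window, term))
    return hits


def correct_with_lexicon(text: str, lexicon: List[str]) -> str:
    """Replace every near-miss with its canonical term."""
    chars = list(text)
    for start, wrong, canonical in find_lexicon_errors(text, lexicon):
        chars[start:start + len(wrong)] = canonical
    return "".join(chars)


text = "我在阿里巴吧上班"
print(find_lexicon_errors(text, BRAND_LEXICON))
print(correct_with_lexicon(text, BRAND_LEXICON))
```

Because only windows close to a lexicon entry are ever examined, ordinary words and unrelated names are never touched, which is where the gains in precision and speed would come from.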