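seanlee97 / xmnlp
xmnlp: provides Chinese word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, text correction, text-to-pinyin conversion, text summarization, radical extraction, sentence representation, and text similarity computation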
License: Apache License 2.0
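Hello, this tool has been very helpful to me; thank you for sharing it. I am interested in the correction feature: is it possible to supply a custom dictionary of out-of-vocabulary words? In use, I found that some domain-specific terms are wrongly "corrected".
This line https://github.com/SeanLee97/xmnlp/blob/master/xmnlp/checker/__init__.py#L44 causes a NameError in Python 3.7
As the title says: the pretrained models currently in use have some bias and cannot be fine-tuned. It would help a lot if the model training code were open-sourced.
Fixed. Python 2.7 uses the more efficient cPickle for model persistence. Even so, Python 3 is recommended, since it performs better.
Does the author have anything like developer documentation, something fairly detailed that covers the overall model architecture, the main functions, and so on? Figures such as recall and precision would be greatly appreciated too. If such documentation exists, would you be willing to share it?
The correction quality in version 0.1.8 seems worse than in 0.1.7. Does anyone have version 0.1.7 and could send me a copy? [email protected]. Thanks.
What is causing this? On Windows 7 I only get: UserWarning: Unsupported Windows version (7). ONNX Runtime supports Windows 10 and above, only.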
warnings.warn('Unsupported Windows version (%s). ONNX Runtime supports Windows 10 and above, only.' %
Lazy load checker...
But on Windows 10 it turns into the following:
__init__.py 22
from . import multiarray
multiarray.py 12
from . import overrides
overrides.py 7
from numpy.core._multiarray_umath import (
ImportError:
PyCapsule_Import could not import module "datetime"
xmnlp测试.py 1
import xmnlp
__init__.py 15
from xmnlp import config
__init__.py 6
from xmnlp.utils import load_stopword
__init__.py 12
import numpy as np
__init__.py 140
from . import core
__init__.py 48
raise ImportError(msg)
ImportError:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: PyCapsule_Import could not import module "datetime"
The error is as follows:
ERROR: Could not find a version that satisfies the requirement scikit-learn (from xmnlp) (from versions: none)
ERROR: No matching distribution found for scikit-learn
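Hello,
I would like to ask: can the text correction feature only replace one character with another? Can it handle errors involving missing or extra characters?
Thank you very much for sharing. I am studying this project and am particularly interested in the sentiment analysis. I have a few questions: what method do you use for feature selection in sentiment analysis, and how is the final sentiment score computed? Thanks again. Also, is there a QQ group or similar for discussing this project?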
My environment: Windows 10 (Chinese edition), Python 3.6
Error message from the examples:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 35: illegal multibyte sequence
My fix:
dag.py line 159: with open(fname, 'r', encoding='utf-8') as f:
Could the author please fix this when time permits?
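The fix described above amounts to passing an explicit encoding instead of relying on the platform default (GBK on a Chinese-locale Windows install). A minimal sketch of the idea follows; `read_lines` and the temporary file are hypothetical stand-ins for illustration, not xmnlp code.

```python
import os
import tempfile


def read_lines(fname):
    """Read a UTF-8 text file regardless of the platform's default encoding."""
    # Without encoding='utf-8', open() falls back to the locale encoding
    # by default, which is GBK on a Chinese-locale Windows install.
    with open(fname, 'r', encoding='utf-8') as f:
        return [line.rstrip('\n') for line in f]


# Demo: write a small UTF-8 file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dict.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('人工智能 5 nw\n机器学习 5\n')
    lines = read_lines(path)

print(lines)
```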
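Also, for text correction: besides the edit distance over Chinese characters, would it be more reasonable to add an edit distance over pinyin before scoring with the bi-gram model?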
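The suggestion can be illustrated with a plain Levenshtein distance applied once to characters and once to pinyin syllables. This is only a sketch of the idea, not how xmnlp's checker works; the `PINYIN` table is a tiny hand-written, hypothetical mapping (tones omitted), and a real system would obtain pinyin from a converter.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]


# Hypothetical toy pinyin table (tones omitted), for illustration only.
PINYIN = {'天': 'tian', '气': 'qi', '汽': 'qi'}


def pinyin_distance(w1, w2):
    """Edit distance over pinyin syllables instead of characters."""
    p1 = [PINYIN[c] for c in w1]
    p2 = [PINYIN[c] for c in w2]
    return edit_distance(p1, p2)


char_dist = edit_distance('天汽', '天气')
pin_dist = pinyin_distance('天汽', '天气')
print(char_dist, pin_dist)
```

In this toy case the two words differ by one character but are identical in pinyin, which is the kind of signal a homophone-aware candidate ranking could use.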
Traceback (most recent call last):
  File "normal_keywords.py", line 49, in <module>
    keywords_list += xmnlp.seg(text)
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/xmnlp/lexical/__init__.py", line 53, in seg
    load_lexical()
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/xmnlp/lexical/__init__.py", line 46, in load_lexical
    lexical = LexicalDecoder(
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/xmnlp/lexical/lexical_model.py", line 45, in __init__
    self.lexical_model = LexicalModel(os.path.join(model_dir, 'lexical.onnx'))
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/xmnlp/base_model.py", line 11, in __init__
    self.sess = ort.InferenceSession(model_path, providers=['CPUExecutionProvider'])
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/onnxruntime/capi/session.py", line 158, in __init__
    self._load_model(providers or [])
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/onnxruntime/capi/session.py", line 177, in _load_model
    self._sess.load_model(providers)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : This is an invalid model. Error in Node:Embedding-Token/NotEqual : No Op registered for Equal with domain_version of 13
def calc_proba(self, gram):
    x = self.bi[tuple(gram)]
    y = self.uni[gram[0]]
    return float((x + 1)) / (y + len(self.uni.keys())**2)
Is this code doing smoothing? Why does it use y + len(self.uni.keys())**2 rather than y + len(self.uni.keys())?
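For comparison, textbook add-one (Laplace) smoothing of a conditional bigram probability P(w2 | w1) adds the vocabulary size V, not V squared, to the denominator. Whether the squared term in the quoted code is intentional is not stated in the thread. The sketch below uses a hypothetical `BigramModel` class that mirrors the `uni` and `bi` names from the snippet; it is not xmnlp's implementation.

```python
from collections import Counter


class BigramModel:
    """Toy bigram model with textbook add-one smoothing (not xmnlp's code)."""

    def __init__(self, tokens):
        self.uni = Counter(tokens)                  # unigram counts
        self.bi = Counter(zip(tokens, tokens[1:]))  # bigram counts

    def calc_proba(self, gram):
        x = self.bi[tuple(gram)]    # count(w1, w2); 0 if unseen
        y = self.uni[gram[0]]       # count(w1)
        vocab_size = len(self.uni)  # V
        return (x + 1) / (y + vocab_size)


model = BigramModel(['a', 'b', 'a', 'c', 'a', 'b'])
p_ab = model.calc_proba(('a', 'b'))
# 'a' is always followed by another token here, so its smoothed
# probabilities over all possible next words sum to 1.
total = sum(model.calc_proba(('a', w)) for w in model.uni)
print(p_ab, total)
```

With V in the denominator the smoothed values for a given context form a proper distribution; with V squared they would sum to less than 1, although their relative ranking for a fixed context word is unchanged.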
text='7月1日,世预赛亚洲区12强赛抽签举行,**队分在B组。同组对手是日本、澳大利亚、沙特、阿曼、越南。体育博主潘伟力在个人微博上表示,国足应把目标定在小组第二,第三意义不大。'
xmnlp.seg(text)
['7月1日', ',', '世', '预赛', '亚洲区', '12', '强赛', '抽签', '举行', ',', '**队', '分', '在', 'B', '组', '。', '同', '组', '对手', '是', '日本', '、', '澳大利亚', '、', '沙特', '、', '阿曼', '、', '越南', '。', '体育博主', '潘伟力', '在', '个人', '微博', '上', '表示', ',', '国', '足', '应', '把', '目标', '定', '在', '小组', '第二', ',', '第三意义', '不大', '。']
I retrained on examples/corpus/checker.txt and replaced xmnlp/checker/checker.pickle.3 with the generated models/checker.pickle.3, but running examples/checker.py still does not apply any corrections:
error: """这理风景绣丽,而且天汽不错,我的心情各外舒畅!"""
correct:"""这理风景绣丽,而且天汽不错,我的心情各外舒畅!"""
FileNotFoundError: [Errno 2] No such file or directory: 'D:/桌面/cnn+bilstm/xmnlp-onnx-models-v5/xmnlp-models\lexical\trans.npy'
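One way to narrow down an error like this is to check, before loading, whether the expected files actually exist under the model directory. The sketch below is a generic check, not part of xmnlp; `missing_model_files` is a hypothetical helper, and the only expected path used is the one named in the error message.

```python
import tempfile
from pathlib import Path


def missing_model_files(model_dir, expected):
    """Return the expected relative paths that do not exist under model_dir."""
    root = Path(model_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


# Demo with a temporary directory standing in for the model folder.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_model_files(tmp, ['lexical/trans.npy'])
    (Path(tmp) / 'lexical').mkdir()
    (Path(tmp) / 'lexical' / 'trans.npy').touch()
    after = missing_model_files(tmp, ['lexical/trans.npy'])

print(before, after)
```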
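Hello, your code has been a great help with the pinyin correction I have been working on recently, and the correction quality is very good; thank you for sharing. However, I would like to bring the running time down to 200 ms. Do you have any suggestions?
Could you give a sample of the format of the pinyin training file?
It would be great to be able to train models for our own scenarios, especially for correction and entity recognition.
Hello, I configured the model using method 2: xmnlp.set_model('/path/to/xmnlp-models')
Error: AttributeError: module 'xmnlp' has no attribute 'set_model'
How can I resolve this?
Where does the typo-correction model come from, and how can I train my own? Otherwise the corrections cover too little.
I saw a Java project that uses version 1.4 of an xmnlp package, but I cannot find it in the public Maven repository or through Baidu or Google. Where can I find it?
Does the current version support coarse-grained and fine-grained segmentation?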
The onnxruntime call in base_model.py is wrong:
`# -*- coding: utf-8 -*-
from abc import ABCMeta, abstractmethod
import onnxruntime as ort

class BaseModel(metaclass=ABCMeta):
    def __init__(self, model_path: str):
        self.sess = ort.InferenceSession(model_path, providors=['CPUExecutionProvider'])

    @abstractmethod
    def predict(self):
        raise NotImplementedError`
It should be providers, not providors.
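With the keyword spelled correctly, the class might look like the sketch below. The `session_factory` parameter, `EchoModel`, and `fake_session` are hypothetical additions made only so the example can run without a real .onnx file; when no factory is passed, it falls back to `onnxruntime.InferenceSession`, which accepts a `providers` keyword.

```python
from abc import ABCMeta, abstractmethod


class BaseModel(metaclass=ABCMeta):
    def __init__(self, model_path: str, session_factory=None):
        # session_factory is a hypothetical hook added for this sketch.
        if session_factory is None:
            import onnxruntime as ort
            session_factory = ort.InferenceSession
        # Note the keyword: providers, not providors.
        self.sess = session_factory(model_path,
                                    providers=['CPUExecutionProvider'])

    @abstractmethod
    def predict(self):
        raise NotImplementedError


class EchoModel(BaseModel):
    """Trivial subclass used only to exercise the constructor."""

    def predict(self):
        return self.sess


def fake_session(model_path, providers=None):
    # Stand-in for ort.InferenceSession that records its arguments.
    return {'model_path': model_path, 'providers': providers}


model = EchoModel('lexical.onnx', session_factory=fake_session)
print(model.predict())
```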
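userdict.txt: if an entry contains only the word, without a frequency or part-of-speech tag, will seg and textrank run into the same problem as jieba?
Also, are the part-of-speech (pos) tags the same as jieba's, e.g. v = verb, ns = place name?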
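How are pinyin and Chinese characters fed into the neural network model?
For example, given “国”: guo2 and the radical 囗, how are they converted into a vector representation?
Abbreviated bond names such as “02进出04” and special terms such as “5G” get split apart during segmentation.
I would like to use this resource in my research. How should I cite it?
Could you briefly describe the approach behind the detection and correction models?
Are the radicals extracted according to the annotations in the Xinhua Dictionary?
As in the title.
Thank you very much.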
Segmentation results:
记住钥匙放在厨房餐桌上 ->记住 / 钥匙 / 放在 / 厨房 / 餐桌上
"记住钥匙放在厨房桌子上" ->记住 / 钥匙 / 放在 / 厨房 / 桌子 / 上
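Presumably 餐桌 is not in the dictionary. I added 餐桌 to the userdict in examples, but it had no effect. How do I add dictionary entries? What does the "5 nw" after a word mean, and what options are available?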
人工智能 5 nw
机器学习 5
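The two sample lines suggest a whitespace-separated layout of word, a number, and an optional tag. Reading the number as a frequency and the last field as a part-of-speech tag is an assumption inferred from these samples, not something the thread confirms. Under that assumption, a hypothetical parser (not xmnlp's own loader) could look like this:

```python
def parse_userdict_line(line):
    """Split a user-dictionary line into (word, freq, tag).

    Assumed layout: word [frequency [tag]], separated by whitespace.
    Missing fields are returned as None.
    """
    parts = line.split()
    if not parts:
        return None
    word = parts[0]
    freq = int(parts[1]) if len(parts) > 1 else None
    tag = parts[2] if len(parts) > 2 else None
    return word, freq, tag


entries = [parse_userdict_line(l) for l in ['人工智能 5 nw', '机器学习 5']]
print(entries)
```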
In checker.py, line 50, assert len(mask_id) == 1 raises an AssertionError. How can this be resolved?
module 'xmnlp' has no attribute 'tag_parallel'