seanlee97 / xmnlp

1.2K 29.0 188.0 117.19 MB

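xmnlp: provides Chinese word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, text correction, text-to-pinyin conversion, text summarization, radical lookup, sentence embeddings, and text similarity computation.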

License: Apache License 2.0

Python 98.80% Dockerfile 1.20%

pinyin nlp spell-checker radical sentiment-analysis ner lexical-analysis segmentation postagging sentence-embeddings

xmnlp's Issues

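Custom dictionary for text correction

Hello, this tool has been very helpful to me; thank you for sharing it. I am interested in the correction feature. Is it possible to configure a custom dictionary of out-of-vocabulary words for correction? In my use I found that certain domain-specific terms are wrongly "corrected".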

python 3.7 NameError: name 'unicode' is not defined

This line https://github.com/SeanLee97/xmnlp/blob/master/xmnlp/checker/__init__.py#L44 causes NameError in python3.7
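
Python 3 has no `unicode` builtin, which is why the name lookup fails there. A common workaround is a small compatibility alias; the sketch below only illustrates the idea, and the names `text_type` and `to_text` are hypothetical, not taken from xmnlp.

```python
import sys

# Python 3 removed the `unicode` builtin; `str` is already a Unicode string there.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only evaluated on Python 2)


def to_text(value, encoding="utf-8"):
    """Return `value` as a Unicode string, decoding bytes if necessary."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)
```

Code that previously called `unicode(...)` could then call `to_text(...)` on both interpreter versions.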

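Any plans to open-source the model training code?

As the title says. The pretrained models currently shipped show some bias and cannot be fine-tuned. Open-source training code would make this easy to address.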

Slow model loading on Python 2

Fixed. On Python 2.7 the more efficient cPickle is now used for model persistence. Python 3 is still recommended, since it performs better.
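
The usual pattern for preferring cPickle on Python 2 while staying compatible with Python 3 looks roughly like this. It is a sketch rather than the project's actual code; `save_model` and `load_model` are hypothetical helper names.

```python
try:
    import cPickle as pickle  # Python 2: C-accelerated implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically


def save_model(obj, path):
    """Serialize `obj` to `path` with a protocol readable by Python 2 and 3."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=2)


def load_model(path):
    """Load an object previously written by `save_model`."""
    with open(path, "rb") as f:
        return pickle.load(f)
```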

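Developer documentation

Does the author have anything like developer documentation, something reasonably detailed that covers the overall model architecture, how the functions fit together, and so on? Figures such as recall and precision would also be hugely appreciated. If such material exists, would you be willing to share it?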

Correction quality in version 0.1.8

Correction in 0.1.8 seems worse than in 0.1.7. Does anyone still have version 0.1.7 and could send me a copy? [email protected]. Thanks.

Original error was: PyCapsule_Import could not import module "datetime"

What is causing this? On Windows 7 I only get: UserWarning: Unsupported Windows version (7). ONNX Runtime supports Windows 10 and above, only.
warnings.warn('Unsupported Windows version (%s). ONNX Runtime supports Windows 10 and above, only.' %
Lazy load checker...

On Windows 10, however, it turns into the following:

__init__.py 22
from . import multiarray

multiarray.py 12
from . import overrides

overrides.py 7
from numpy.core._multiarray_umath import (

ImportError:
PyCapsule_Import could not import module "datetime"

xmnlp测试.py 1
import xmnlp

__init__.py 15
from xmnlp import config

__init__.py 6
from xmnlp.utils import load_stopword

__init__.py 12
import numpy as np

__init__.py 140
from . import core

__init__.py 48
raise ImportError(msg)

ImportError:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

The Python version is: Python3.7 from "D:\LLD\python3\python.exe"
The NumPy version is: "1.19.5"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: PyCapsule_Import could not import module "datetime"

Why does installation fail with an error?

The error is as follows:
ERROR: Could not find a version that satisfies the requirement scikit-learn (from xmnlp) (from versions: none)
ERROR: No matching distribution found for scikit-learn

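Text correction

Hello,

Can the text correction feature only replace one character with another? Can it also handle errors involving missing or extra characters?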

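Sentiment analysis

Thank you very much for sharing this. I am studying the project and am especially interested in sentiment analysis. Two questions: what method did you use for feature selection, and how is the final sentiment score computed? Thanks again. Also, does the project have a QQ group or similar channel for discussion?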

Tian-ge was here

python3: error loading userdict.txt

My environment: Windows 10 (Chinese edition), Python 3.6

Error message from the examples:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 35: illegal multibyte sequence

My fix:
dag.py line 159: with open(fname, 'r',encoding='utf-8') as f:
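
On a Chinese-locale Windows system, `open()` without an `encoding` argument falls back to the locale encoding (GBK), so a UTF-8 dictionary file fails to decode. The sketch below shows the explicit-encoding pattern the fix relies on; `load_userdict` is a hypothetical helper, not the function in dag.py.

```python
import io


def load_userdict(fname):
    """Read a user dictionary as UTF-8 regardless of the OS default encoding."""
    with io.open(fname, "r", encoding="utf-8") as f:
        # Skip blank lines and strip trailing newlines.
        return [line.strip() for line in f if line.strip()]
```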

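The training corpus shared via netdisk is no longer available

Could the author please fix the link when time permits?
Also, regarding text correction: besides the edit distance over Chinese characters, wouldn't it be more sound to also factor in the edit distance over pinyin before scoring candidates with the bi-gram model?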
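
To illustrate the suggestion, the sketch below compares character-level and pinyin-level edit distance for a homophone typo. It is not xmnlp's implementation; the `PINYIN` table is a hand-written toy mapping assumed purely for the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Toy pinyin table (an assumption for illustration only).
PINYIN = {"天": "tian", "气": "qi", "汽": "qi"}


def to_pinyin(word):
    """Map each character to its syllable using the toy table."""
    return [PINYIN[ch] for ch in word]


char_dist = levenshtein("天汽", "天气")  # one character differs
pinyin_dist = levenshtein(to_pinyin("天汽"), to_pinyin("天气"))  # same syllables
```

Here the character distance is 1 while the syllable distance is 0, so a pinyin-aware candidate generator would rank 天气 as a very close replacement for the typo 天汽.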

The new version raises an error on Linux

Traceback (most recent call last):
  File "normal_keywords.py", line 49, in <module>
    keywords_list += xmnlp.seg(text)
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/xmnlp/lexical/__init__.py", line 53, in seg
    load_lexical()
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/xmnlp/lexical/__init__.py", line 46, in load_lexical
    lexical = LexicalDecoder(
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/xmnlp/lexical/lexical_model.py", line 45, in __init__
    self.lexical_model = LexicalModel(os.path.join(model_dir, 'lexical.onnx'))
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/xmnlp/base_model.py", line 11, in __init__
    self.sess = ort.InferenceSession(model_path, providers=['CPUExecutionProvider'])
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/onnxruntime/capi/session.py", line 158, in __init__
    self._load_model(providers or [])
  File "/home/ubuser/anaconda3/envs/gpu/lib/python3.8/site-packages/onnxruntime/capi/session.py", line 177, in _load_model
    self._sess.load_model(providers)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : This is an invalid model. Error in Node:Embedding-Token/NotEqual : No Op registered for Equal with domain_version of 13
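
A message of the form "No Op registered for Equal with domain_version of 13" generally indicates that the model uses ONNX opset 13 while the installed onnxruntime does not implement it, which is typical of an outdated onnxruntime. A reasonable first diagnostic step is to print the installed versions; the helper name `installed_version` below is hypothetical.

```python
from importlib import metadata  # available on Python 3.8+


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("onnxruntime", "xmnlp"):
    print(name, installed_version(name))
```

If the reported onnxruntime version turns out to be old, upgrading it with pip is the usual next thing to try.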

A question about checker.py

    def calc_proba(self, gram):
        x = self.bi[tuple(gram)]
        y = self.uni[gram[0]]
        return float((x + 1)) / (y + len(self.uni.keys())**2)

This code is doing smoothing, right? Why does it use y + len(self.uni.keys())**2 rather than y + len(self.uni.keys())?
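
For reference, the two denominators can be compared on a toy corpus. This is a sketch of textbook add-one smoothing and says nothing about the author's intent; the corpus and function names are made up for the example. Adding V (the vocabulary size) gives a conditional distribution P(w2 | w1) that sums to one, whereas V**2 is the number of possible bigrams and is the term normally added when smoothing a joint bigram probability.

```python
from collections import Counter

tokens = ["a", "b", "a", "c", "a", "b"]
bigrams = list(zip(tokens, tokens[1:]))
bi = Counter(bigrams)
# Count each word as the first element of a bigram so the counts are consistent.
uni = Counter(w1 for w1, _ in bigrams)
vocab = sorted(set(tokens))
V = len(vocab)


def proba_v(w1, w2):
    """Add-one smoothing of P(w2 | w1): the denominator grows by V."""
    return (bi[(w1, w2)] + 1) / (uni[w1] + V)


def proba_v2(w1, w2):
    """Variant matching the quoted snippet: the denominator grows by V**2."""
    return (bi[(w1, w2)] + 1) / (uni[w1] + V ** 2)


total_v = sum(proba_v("a", w) for w in vocab)    # sums to 1 over all w2
total_v2 = sum(proba_v2("a", w) for w in vocab)  # sums to less than 1
```

With V**2 the scores are smaller and no longer normalized per history word, although they can still be used to compare candidates with one another.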

xmnlp.seg(text) results are not very good

text='7月1日，世预赛亚洲区12强赛抽签举行，**队分在B组。同组对手是日本、澳大利亚、沙特、阿曼、越南。体育博主潘伟力在个人微博上表示，国足应把目标定在小组第二，第三意义不大。'
xmnlp.seg(text)
['7月1日', '，', '世', '预赛', '亚洲区', '12', '强赛', '抽签', '举行', '，', '**队', '分', '在', 'B', '组', '。', '同', '组', '对手', '是', '日本', '、', '澳大利亚', '、', '沙特', '、', '阿曼', '、', '越南', '。', '体育博主', '潘伟力', '在', '个人', '微博', '上', '表示', '，', '国', '足', '应', '把', '目标', '定', '在', '小组', '第二', '，', '第三意义', '不大', '。']

The retrained correction model has no effect

I retrained on examples/corpus/checker.txt and replaced xmnlp/checker/checker.pickle.3 with the generated models/checker.pickle.3, but running examples/checker.py produces no corrections:
error: """这理风景绣丽，而且天汽不错，我的心情各外舒畅!"""
correct:"""这理风景绣丽，而且天汽不错，我的心情各外舒畅!"""

Runtime error [FileNotFoundError]

FileNotFoundError: [Errno 2] No such file or directory: 'D:/桌面/cnn+bilstm/xmnlp-onnx-models-v5/xmnlp-models\lexical\trans.npy'
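
One possible cause is that the configured model directory is not the folder that directly contains `lexical/`. A quick check such as the following can confirm whether the expected file is present before xmnlp is called. It is a sketch: `missing_model_files` is a hypothetical helper and the directory is a placeholder; only `lexical/trans.npy` comes from the error message.

```python
import os


def missing_model_files(model_dir, relative_paths):
    """Return the expected files that do not exist under `model_dir`."""
    model_dir = os.path.normpath(model_dir)
    return [rel for rel in relative_paths
            if not os.path.isfile(os.path.join(model_dir, rel))]


# `lexical/trans.npy` is taken from the error message; the directory is a placeholder.
expected = [os.path.join("lexical", "trans.npy")]
print(missing_model_files("/path/to/xmnlp-models", expected))
```

An empty list means every listed file was found under the given directory.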

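Reducing correction time

Hello, your code has been a great help with the pinyin correction work I have been doing recently, and the pinyin correction results are very good. Thank you for sharing it. However, I would like to bring the running time down to 200 ms. Do you have any suggestions?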

Is the Baidu Netdisk extraction code for xmnlp-onnx-models-v5.zip (v0.5.1) wrong? I cannot retrieve the file.

What format should the pinyin training file be in?

Could you provide a sample of the pinyin training file format?

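Is there any chance the model training code will be open-sourced?

It would be great to be able to train models for our own scenarios, especially for correction and entity recognition.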

AttributeError: module 'xmnlp' has no attribute 'set_model'

Hello, I configured the model using method 2: xmnlp.set_model('/path/to/xmnlp-models')
Error: AttributeError: module 'xmnlp' has no attribute 'set_model'
How can I resolve this?

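How to train the misspelled-word correction model

Where does the correction model come from, and how can I train my own? As it stands, the correction coverage is too limited.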

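Where can I download the Java package?

I saw a Java project that uses version 1.4 of the xmnlp package, but I cannot find it in the Maven central repository or through Baidu or Google. Where can I get it?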

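Does Chinese word segmentation support coarse-grained and fine-grained modes?

Does the segmenter in the current version offer coarse-grained and fine-grained segmentation?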

Code erratum

The call to onnxruntime in base_model.py is wrong:

    # -*- coding: utf-8 -*-

    from abc import ABCMeta, abstractmethod

    import onnxruntime as ort


    class BaseModel(metaclass=ABCMeta):

        def __init__(self, model_path: str):
            self.sess = ort.InferenceSession(model_path, providors=['CPUExecutionProvider'])

        @abstractmethod
        def predict(self):
            raise NotImplementedError

It should be providers, not providors.
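
With the keyword spelled `providers`, which is the keyword `onnxruntime.InferenceSession` accepts, the class would read roughly as follows. This is a sketch of the reported fix; moving the import inside `__init__` is an assumption made here so that the class definition can be loaded even where onnxruntime is not installed.

```python
from abc import ABCMeta, abstractmethod


class BaseModel(metaclass=ABCMeta):

    def __init__(self, model_path: str):
        # Imported lazily (an assumption of this sketch, not the original layout).
        import onnxruntime as ort
        # Correct keyword: `providers`, not `providors`.
        self.sess = ort.InferenceSession(model_path, providers=['CPUExecutionProvider'])

    @abstractmethod
    def predict(self):
        raise NotImplementedError
```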

人工智能 5 nw
机器学习 5

Text correction AssertionError

Line 50 of checker.py, assert len(mask_id) == 1, raises an AssertionError. How can this be resolved?

Error when using keyword in version 0.5.1

module 'xmnlp' has no attribute 'tag_parallel'