yongzhuo / macropodus

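Macropodus is a natural language processing toolkit built on an Albert+BiLSTM+CRF deep learning architecture. It supports Chinese word segmentation, part-of-speech tagging, named entity recognition, new word discovery, keyword extraction, text summarization, text similarity, a scientific calculator, conversion between Chinese numerals and Arabic (or Roman) numerals, conversion between Traditional and Simplified Chinese, and pinyin conversion. Feature abbreviations: CWS (Chinese word segmentation), POS (part-of-speech tagging), NER (named entity recognition), Find (new word discovery), Keyword (keyword extraction), Summarize (text summarization), Sim (text similarity), Calculate (scientific calculator), Chi2num (Chinese number to Arabic number).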

Home Page: https://blog.csdn.net/rensihui

License: MIT License

Python 100.00%

nlp macropodus albert segnment cws ner newword keyword text-summarization calulator

macropodus's Issues

When installing the Macropodus library, pip reports that it requires tqdm == 4.31.1, but other libraries need a newer tqdm. How should this be resolved?

Installation produces warnings:

Installing collected packages: tqdm
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.65.0
    Uninstalling tqdm-4.65.0:
      Successfully uninstalled tqdm-4.65.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.12.0 requires tqdm>=4.62.1, but you have tqdm 4.31.1 which is incompatible.
huggingface-hub 0.15.1 requires tqdm>=4.42.1, but you have tqdm 4.31.1 which is incompatible.
papermill 2.4.0 requires tqdm>=4.32.2, but you have tqdm 4.31.1 which is incompatible.
ydata-profiling 4.1.2 requires tqdm<4.65,>=4.48.2, but you have tqdm 4.31.1 which is incompatible.

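Although the installation completes, running code then fails: the other libraries that depend on tqdm require a newer version, so the code cannot be compiled and executed properly.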
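
To see at a glance which version ranges the other packages demand, the conflict lines in the log above can be parsed. The following is only a minimal sketch: the regular expression and the `parse_conflicts` helper are assumptions written for this log format, not part of pip or Macropodus, and the sketch summarizes the conflict without resolving it.

```python
import re

# Matches pip resolver lines such as:
#   "datasets 2.12.0 requires tqdm>=4.62.1, but you have tqdm 4.31.1 which is incompatible."
CONFLICT_RE = re.compile(
    r"^(?P<pkg>\S+) (?P<pkg_version>\S+) requires (?P<dep>[A-Za-z0-9_.\-]+)(?P<spec>\S*), "
    r"but you have (?P=dep) (?P<installed>\S+) which is incompatible\.$"
)


def parse_conflicts(log_text):
    """Return (package, dependency, required_spec, installed_version) tuples."""
    conflicts = []
    for line in log_text.splitlines():
        match = CONFLICT_RE.match(line.strip())
        if match:
            conflicts.append(
                (match["pkg"], match["dep"], match["spec"], match["installed"])
            )
    return conflicts


# Conflict lines copied from the log in this issue.
LOG = """\
datasets 2.12.0 requires tqdm>=4.62.1, but you have tqdm 4.31.1 which is incompatible.
huggingface-hub 0.15.1 requires tqdm>=4.42.1, but you have tqdm 4.31.1 which is incompatible.
papermill 2.4.0 requires tqdm>=4.32.2, but you have tqdm 4.31.1 which is incompatible.
ydata-profiling 4.1.2 requires tqdm<4.65,>=4.48.2, but you have tqdm 4.31.1 which is incompatible.
"""

conflicts = parse_conflicts(LOG)
for pkg, dep, spec, installed in conflicts:
    print(f"{pkg} needs {dep}{spec} (installed: {installed})")
```

Read together, the four requirements in this log are only satisfied by a tqdm version of at least 4.62.1 and below 4.65, which the 4.31.1 pin falls outside of.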

Error installing scikit-learn==0.19.1

error: Command "g++ -pthread -B /home/xiazhichao/.conda/envs/text/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/home/xiazhichao/.conda/envs/text/lib/python3.7/site-packages/numpy/core/include -I/home/xiazhichao/.conda/envs/text/lib/python3.7/site-packages/numpy/core/include -I/home/xiazhichao/.conda/envs/text/include/python3.7m -c sklearn/cluster/_dbscan_inner.cpp -o build/temp.linux-x86_64-3.7/sklearn/cluster/_dbscan_inner.o -MMD -MF build/temp.linux-x86_64-3.7/sklearn/cluster/_dbscan_inner.o.d" failed with exit status 1
----------------------------------------
Rolling back uninstall of scikit-learn
Moving to /home/xiazhichao/.conda/envs/text/lib/python3.7/site-packages/scikit_learn-0.23.2.dist-info/
from /home/xiazhichao/.conda/envs/text/lib/python3.7/site-packages/~cikit_learn-0.23.2.dist-info
Moving to /home/xiazhichao/.conda/envs/text/lib/python3.7/site-packages/scikit_learn.libs/
from /home/xiazhichao/.conda/envs/text/lib/python3.7/site-packages/~cikit_learn.libs
Moving to /home/xiazhichao/.conda/envs/text/lib/python3.7/site-packages/sklearn/
from /home/xiazhichao/.conda/envs/text/lib/python3.7/site-packages/~klearn
ERROR: Command errored out with exit status 1: /home/xiazhichao/.conda/envs/text/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/var/tmp/pip-install-o1v2c3xe/scikit-learn/setup.py'"'"'; file='"'"'/var/tmp/pip-install-o1v2c3xe/scikit-learn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /var/tmp/pip-record-xocia0vr/install-record.txt --single-version-externally-managed --compile --install-headers /home/xiazhichao/.conda/envs/text/include/python3.7m/scikit-learn Check the logs for full command output.

macropodus 0.0.7 depends on networkx 2.4, which conflicts with numpy

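In networkx 2.4, networkx/readwrite/graphml.py references np.int. NumPy deprecated this alias in 1.20 and removed it in a later release, so code must use the builtin int, or np.int64 or np.int32 to specify the precision. The error is: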
AttributeError: module 'numpy' has no attribute 'int'.
np.int was a deprecated alias for the builtin int. To avoid this error in existing code, use int by itself. Doing this will not modify any behavior and is safe. When replacing np.int, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
networkx 3.1 fixes this dependency problem. Should I patch the source of networkx 2.4 or upgrade to networkx 3.1? (Other modules depend on numpy, so I cannot downgrade numpy.)
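
Besides the two options raised above, a third workaround sometimes used is to restore the missing alias at runtime before the older library is imported. The sketch below is an assumption-laden illustration, not a fix confirmed by the maintainer: `restore_alias` is a hypothetical helper, and a stub module stands in for NumPy so the example runs without NumPy installed.

```python
import types


def restore_alias(module, name, target):
    """Set module.<name> = target only if the attribute is missing.

    Returns True if the alias was added, False if it already existed.
    """
    if not hasattr(module, name):
        setattr(module, name, target)
        return True
    return False


# Hypothetical stand-in for a NumPy release that no longer provides `int`.
fake_numpy = types.ModuleType("fake_numpy")

added = restore_alias(fake_numpy, "int", int)
print(added)                  # the alias was missing, so it is added
print(fake_numpy.int is int)  # the alias now points at the builtin int
```

In a real environment the same helper would be called as `restore_alias(np, "int", int)` after `import numpy as np` and before importing networkx. This only hides the incompatibility, so upgrading networkx remains the cleaner route if Macropodus works with the newer version.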

Installation problem

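This library looks very capable, but my PyCharm environment already has numpy and pandas installed. After pip install macropodus, it downloads the numpy and pandas packages again and then fails with "failed building wheel XXXXXX". Could you suggest how to approach this?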

AttributeError: module 'macropodus' has no attribute 'postag'

I downloaded tag_albert_people_1998 and copied it over macropodus/data/model in the installation directory.
But macropodus.postag still cannot be used. What is the reason?
AttributeError: module 'macropodus' has no attribute 'postag'
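
When a module-level function is reported missing, it can help to check which names the imported module actually exposes before calling them. The following is a minimal sketch; `find_missing` is a hypothetical helper, and the standard-library `math` module is used as the target so the example runs without Macropodus installed.

```python
import importlib


def find_missing(module_name, names):
    """Import `module_name` and return the names it does not expose."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# `math` has sqrt but no postag, so only postag is reported as missing.
missing = find_missing("math", ["sqrt", "postag"])
print(missing)
```

Calling `find_missing("macropodus", ["postag"])` in the affected environment would show whether the installed package defines the attribute at all, which separates a packaging or import problem from a missing model file.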

Part-of-speech tagging reports that the method does not exist

res_postag = macropodus.postag(summary)

AttributeError: module 'macropodus' has no attribute 'postag'

The model files have already been copied over.

Keyword returns no result

Hello, I have the following text:

content = '每日商报讯 还有不到一个月的时间，就是春节了，每年这段时间都是年宵花销售的旺季，不少市民会前往市场选购花卉和绿植，准备将家里好好捯饬一番，今年年宵花市场表现如何？有没有新的品种供大家选择？昨天下午，记者也替大家提前探了探路。\n今年蝴蝶兰价格有所下降\n越临近春节 价格越高\n下午2:30左右，记者来到了吴山花鸟城，一进入市场，姹紫嫣红的花草让人眼前一亮，每家商铺都被塞得满满当当，基本只留下了一人通行的空位，其中，以蝴蝶兰为代表的年宵花更是稳稳占据了中心位置。\n和往年人头攒动的景象相比，今年前来市场选购的市民并不多，基本上每家店铺门口都站着两三位询价的客户。\n虽然人流量少了，但下单率并不低，在一家名为“花为媒”的商铺里，老板赵先生正在和他的妻子核对订单，今年是他从事花卉经营的第15个年头，赵老板说，以往看的人多，买的人少，现在这种情况正好相反，来到店里的人虽然不多，但基本都会买一些花卉回家。\n至于大家最关心的价格，赵老板表示，今年蝴蝶兰的价格便宜了不少，“比如，现在12株的‘紫气东来’售价为450元，但是在去年，同样的产品却要卖到600多元一盆。”\n在“名人花苑”，蝴蝶兰摆满了整整三排货架，品种也和往年的差不多，工作人员祝大姐正在整理新到的货，她告诉记者，蝴蝶兰、大花蕙兰都是较为传统的年宵花，因为价格实惠，好养活所以客户的接受度一直也比较高。\n记者提出想要一些特别一些的年宵花时，祝大姐特别推荐了今年的新品“喜炮”，和它的名字相同，“喜炮”的外观和小炮仗类似，只不过是一半红色一半黄色，看着就十分养眼。\n“这个品种去年市场上都没有，是我们今年新进的，寓意很好。”不过祝大姐坦言，价格也是有些“小贵”，一盆半人高的“喜炮”开价在1200元。\n除了大型的年宵花，傲梅、水仙等日常花卉也成为了很多市民的选购对象，正在看花的孙女士就挑走了一盆长寿花，“快要过年了，买一盆鲜花带回办公室，可以提前营造一下新年氛围。”\n不少商家还表示，现在正是选购年宵花的最好时机，越临近春节价格越高，一盆的差价能够达到五六十元，想要入手的市民要趁早。\n年宵花购买主力逐渐年轻化\n花色不再局限于传统\n在采访中，记者发现今年市场蝴蝶兰价格普遍偏低，至于其中的原因，记者也咨询了相关行业内的人士。\n浙江传化生物技术有限公司总经理倪惠珠就表示，“上半年由于受到疫情的影响，很多经销商不敢贸然大批量进货，对于花卉基地来说，出货速度就变低了，年宵花以应季花品为主，过了春节时间段，就卖不动了，所以就会出现抛售的现象，导致价格降低。”\n她所在的传化农业园区一直以过硬的品质和优美的造型所著称，并受到了众多消费者的认可。对于近年来的年宵花市场的变化，倪惠珠有着深刻的认识。\n“这几年，随着市场对于年宵花认可度的提高，消费者对品质和品种的需求也相应提高了，疫情更是驱使着我们研发出一批又一批的优质花卉。”\n传化的研发中心主任荣松告诉记者，随着购买主力逐渐年轻化，花色花型配比也有了很大不同，“早些年出货的花品颜色以深、重为主，越喜庆越受欢迎，但现在随着生活品质和审美观念的提升，花卉颜色不再局限于传统的大红大紫，而是朝着个性化和多元化延伸。”\n今年公司更是一口气上线了包括蝴蝶兰、大花蕙兰在内的40多个新品种。\n比如，“西宾王子”就是他们推出的具有代表性的新品，“这款花卉的特点就是色彩对比强烈，橘黄色的花上点缀着红色线条，一下就能抓住大家的眼球。”\n除了年宵花自身品种和品质提升，近些年，传化也在慢慢调整自己的产品结构，这一举措也使得他们在疫情中很快“转危为安”，今年他们推出了“送货上门”的新举措，同时通过线上、线下多元并举的销售模式来增强优质客户黏性和提升复购率。'

macropodus.keyword(content)
# []  returns an empty list

I tried removing the line breaks, but the result is the same.

Python 3.8.5 Macropodus '0.0.7' Ubuntu 20.04.1 LTS

A question

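For named entity recognition, did you train the model with albert, and which dataset did you use? Would it be feasible to train it on my own data? I am new to NLP and would appreciate your guidance.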

Model download link is dead

Great work. Is there a download link for the NER model?

Installation fails: no matching version of nlg-yongzhuo can be found

Hello, I ran pip install -i https://pypi.tuna.tsinghua.edu.cn/simple macropodus in the terminal.

The installation fails with the following error:

Collecting nlg-yongzhuo==0.0.4 (from macropodus)
ERROR: Could not find a version that satisfies the requirement nlg-yongzhuo==0.0.4 (from macropodus) (from versions: 0.0.2)
ERROR: No matching distribution found for nlg-yongzhuo==0.0.4 (from macropodus)

Question about the implementation logic of compute_aggregation, the function that computes the aggregation (cohesion) value in new word discovery

Hello, in the compute_aggregation function:

            len_word = len(word)
            twl_n = self.total_words_len[len_word] # total frequency of all n-grams with ngram=n
            words_freq = [self.words_count.get(wd, 1) for wd in word]
            probability_word = value / twl_n
            probability_chars = reduce(mul,([wf for wf in words_freq])) / (twl_1**(len(word)))
            pmi = math.log(probability_word / probability_chars, 2)
            # AMI = PMI / length_word. Penalize function words (avoid candidates that start or end with "的", "得", "了")
            word_aggregation = pmi/(len_word**len_word) if (word[0] in self.empty_words or word[-1] in self.empty_words) \
                                                        else pmi/len_word # pmi / len_word / len_word
            self.aggregation[word] = round(word_aggregation, self.round)

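Why does your code separately compute the frequency distribution of a candidate word among the n-grams of its own length and the frequency distribution of its individual characters (1-grams)? And why is the PMI divided by len_word after it is computed?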

I would like to know the reasoning behind this processing.

Thank you in advance for your answer.
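
For readers following the question, here is a minimal standalone sketch of what the quoted fragment appears to compute: the probability of the candidate among n-grams of its own length, the product of its characters' unigram probabilities, the log ratio of the two (PMI), and a division by the word length. The function name, its parameters, and the toy counts are assumptions made for illustration. `math.log2` replaces `math.log(x, 2)`, and the extra penalty branch for function words is omitted.

```python
import math
from functools import reduce
from operator import mul


def length_normalized_pmi(word, word_count, ngram_total, char_counts, char_total):
    """PMI of `word` against its characters, divided by the word length."""
    # P(word): frequency of the candidate among all n-grams of the same length.
    p_word = word_count / ngram_total
    # P(c1) * ... * P(cn): chance of the characters co-occurring independently.
    # Unseen characters default to a count of 1, as in the quoted fragment.
    char_freqs = [char_counts.get(ch, 1) for ch in word]
    p_chars = reduce(mul, char_freqs) / (char_total ** len(word))
    pmi = math.log2(p_word / p_chars)
    return pmi / len(word)


# Toy counts chosen as powers of two (hypothetical, not taken from any corpus).
score = length_normalized_pmi(
    word="ab",
    word_count=8,
    ngram_total=64,
    char_counts={"a": 16, "b": 16},
    char_total=128,
)
print(score)  # 1.5, since log2(0.125 / 0.015625) = 3 and 3 / 2 = 1.5
```

One possible reading, offered as an interpretation rather than the author's stated rationale: the product of n character probabilities shrinks as n grows, so raw PMI tends to favor longer candidates, and dividing by the length makes scores for words of different lengths more comparable.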

About the dataset

To train on my own dataset, should I build it in the format of train.json under data/train/corpus/?
Also, did you train only on the data in train.json, or on the whole ChineseNER corpus plus your own data?

The netdisk link is dead

Hello, the Baidu Netdisk link is dead. Could you share the named entity recognition and part-of-speech tagging models again?