
jionlp's Issues

jio.summary.extract_summary raises an error

Traceback (most recent call last):
  File "split_test.py", line 59, in <module>
  File "D:\software\anaconda\envs\torch13\lib\site-packages\jionlp\algorithm\summary\extract_summary.py", line 140, in __call__
    sen_segs[3] = len([w for w in sen_segs_weights if w != 0]) / len(sen_segs_weights)
ZeroDivisionError: division by zero

Time semantic parsing cannot handle this case

When filing an issue, be sure to state the following information clearly, otherwise it cannot be answered!
Description

Please describe your issue here

  1. jionlp version: v1.3.27
  2. Log:
    (screenshot: Snipaste_2021-09-25_12-52-55)
  3. Code & Text:
09-01 20:01 至 12-01 18:07

Expected behavior

Start time: 2021-09-01 20:01:00
End time: 2021-12-01 18:07:00

“中午” and “中午的” are parsed differently

Description

  1. Version:
  • python version: 3.7.4
  • jionlp version: 1.3.53
  2. Code & Text:
import jionlp as jio
from datetime import datetime


res = jio.parse_time('中午的两点一刻定一个闹铃', time_base=datetime.now())
print(res)
res = jio.parse_time('中午两点一刻定一个闹铃', time_base=datetime.now())
print(res)

  3. Log:
> {'type': 'time_point', 'definition': 'accurate', 'time': ['2022-05-13 02:15:00', '2022-05-13 02:15:59']}
> {'type': 'time_point', 'definition': 'accurate', 'time': ['2022-05-13 14:15:00', '2022-05-13 14:15:59']}

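Expectation

The two calls should return the same result, but they differ: with “的” the time is parsed as 02:15, without it as 14:15.

Please star the repository (the ⭐ in the upper right)

Starred ⭐, great work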

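Make the help message printed to the console by default optional

Describe what the feature is for; related material is welcome
Disable the default help message output, e.g. jio.disable_help()

Does this feature fix a defect in the project? If so, describe the defect
As soon as the package is imported, the console always prints:

`jio.help()` is provided to search how to use jio functions.
Or browse `https://github.com/dongrixinyu/JioNLP` to get help.

These lines are printed on every run, which makes it hard to call the tool from the command line and chain it with other non-Python programs.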
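
Until such a switch exists, one possible workaround is to discard standard output while the import runs. This is a minimal sketch: `quiet_import` is a hypothetical helper, and it only helps if the banner is written to stdout with `print()`. That is an assumption, since the source does not show how the banner is emitted.

```python
import contextlib
import importlib
import io


def quiet_import(module_name):
    """Import a module while discarding anything it prints to stdout."""
    with contextlib.redirect_stdout(io.StringIO()):
        return importlib.import_module(module_name)


# Hypothetical usage; assumes the banner goes to stdout via print():
# jio = quiet_import("jionlp")
```

Output printed after the import is unaffected, so the script's own results can still be piped to other programs.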

ValueError: the given string `早` is illegal

Please describe the bug or the function you expect

  • Function name:
    parse_time

Please input the text and code

(*** Be sure to state exactly which text sample caused the error! ***)


# copy and paste here

```市政协十三届四次会议举行第二次全体会议|漳州新闻网讯(记者 张志鹏)1月3日下午,市政协十三届四次会议举行第二次全体会议,听取委员大会发言。10位政协委员分别代表有关专委会、**党派、人民团体作大会发言。市政协十三届四次会议执行主席张祯锦、柳建聪、黄井南、杨胜华、吴芳华、陈跃鸿、何伟燕、卢力、李扬真及值日常委在主席台就座。市领导邵玉龙、刘远、阮开森、张慧德、李珊珊、谢毅泰、张翼腾、兰万安、陈水树、吴卫红应邀出席会议并在主席台就座,在漳省政协委员、市直及驻漳有关单位领导、漳州异地商会和异地漳州商会会长等应邀列席会议,听取委员发言。10位委员就全市经济社会发展及民生改善等领域提出意见和建议。刘志明代表漳州市政协经济委员会发言,提出关于加快漳州智能制造发展的建议;王金泉代表九三学社漳州市委员会发言,提出关于进一步激活我市“夜间经济”的建议;蔡晓洁代表民盟漳州市委员会发言,提出打造龙头深化融合、推动职教高质量发展的建议;张建国代表农工党漳州市委员会发言,提出关于进一步推进我市县域紧密型医共体建设的建议;颜小燕代表民进漳州市委员会发言,提出持续优化营商环境、加快漳州台资工业发展的建议;苏美华代表漳州市政协教科卫体委员会发言,提出关于把**女排漳州体训基地建设成城市新形象标志性区域的建议;刘丽贞代表漳州市政协提案委员会发言,提出关于试行“时间银行”互助养老模式的建议;杨栋代表漳州市政协特邀(二)界发言,提出关于加强漳州古城运营与管理的几点建议;陆銮眉代表漳州市政协农业和农村委员会发言,提出关于扶持我市休闲食品制造产业的建议;陈婉儿代表共青团漳州市委员会发言,提出关于进一步推进漳台青年交流融合的建议。会上还有15个单位分别围绕优化钢铁产业布局推动高质量发展;发起“太平洋海上丝绸之路”联合申遗倡议;破解医养结合瓶颈;把握新机遇、提振民营企业发展信心;加快推进漳州儒学遗址保护与合理利用;推进0-3岁儿童早期发展项目工作;加强食品“三小”监管工作;推动我市小流域治理;推进厦门湾南岸交通环境建设;推动漳州工业高质量发展的路径与对策;落实好支持漳州民营企业发展政策增强实体经济竞争力;加强漳州市青少年科技教育;落实“两岸一家亲”理念、提升两岸婚姻家庭的服务水平;加强涉侨文物的保护与活化利用助力“一带一路”建设;推进文明城市“妈妈小屋”建设、保障妇女合法权益等方面作书面发言。 http://www.zznews.cn/news/system/2020/01/04/030187660.shtml|新闻|||20200104

## Please input the bug info and traceback


Traceback (most recent call last):
  File "D:/论文代码/yidong/test.py", line 32, in <module>
    time = jio.parse_time(line)
  File "D:\compiler\anaconda\lib\site-packages\jionlp\gadget\time_parser.py", line 605, in __call__
    _, second_full_time_handler, _, blur_time = self.parse_time_point(
  File "D:\compiler\anaconda\lib\site-packages\jionlp\gadget\time_parser.py", line 985, in parse_time_point
    cur_hms_func(cur_hms_string)
  File "D:\compiler\anaconda\lib\site-packages\jionlp\gadget\time_parser.py", line 3146, in normalize_blur_hour
    raise ValueError('the given string `{}` is illegal'.format(time_string))
ValueError: the given string `早` is illegal

[BUG] Context analysis for “前两个月”

Description

  1. Version:
  • python version: 3.8.12
  • jionlp version: 1.3.34
  2. Code & Text:
import jionlp as jio
import time

print(f't = {time.time()}')
res = jio.parse_time('查询销售部门前两个月的业绩', strict=False,)
print(res)

>>> t = 1639467272.000813
>>> {'type': 'time_span', 'definition': 'accurate', 'time': ['2021-01-01 00:00:00', '2021-02-28 23:59:59']}

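Expectation

This query should be equivalent to “查询销售部门过去两个月的业绩” (query the sales department's performance over the past two months), but it currently seems to be parsed as “the first two months of the current year”. Context analysis should be added here to check whether a year is specified earlier in the text.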

Handling of “每个工作日” (every workday) and “每个周末” (every weekend)

Description

  1. Version:
  • python version: 3.8
  • jionlp version: 1.3.47
  2. Code & Text:
每个周末九点
{
	"definition": "accurate",
	"time": [
		"2022-03-06 09:00:00",
		"2022-03-06 09:59:59"
	],
	"type": "time_point"
}

每个工作日九点
{
	"definition": "accurate",
	"time": [
		"2022-03-03 09:00:00",
		"2022-03-03 09:59:59"
	],
	"type": "time_point"
}
  3. Log:

Expectation

For “每个工作日” and “每个周末”, the expected return value is a time period, not a precise time point.

On reflection, this may not be easy to solve; I wonder whether there is a better approach.

“4月23” and “4月23之后” cannot be parsed correctly

Description

  1. Version:
    python: 3.9.12
    jionlp: 1.3.53

  2. Code & Text:
    jio.ner.extract_time("4月23之后", time_base=time.time(), with_parsing=True)
    jio.parse_time("4月23之后", ret_future=True, time_base=time.time())

  3. Log:
    Only the month (April) is parsed; the 23rd is lost. The same happens with “5点之后”, which is parsed only as 5 o'clock: the keyword “之后” (after) is dropped. “之前” (before) does not have this problem, but its time range includes 5:00:00~5:59:59, whereas it should normally exclude that hour.

Exception: Http请求失败,状态码:403,错误信息: {"message":"HMAC signature cannot be verified, a valid date or x-date header is required for HMAC Authentication"}

When using the iFLYTEK (讯飞) API to process multiple records, the following occurs:
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/back_translation.py", line 164, in iter_api_by_language
tmp = mt_api(text, from_lang=chinese_lang, to_lang=foreign_lang)
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/translation_api.py", line 43, in wrapper
f = func(self, *args, **kargs)
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/translation_api.py", line 53, in wrapper
f = func(self, *args, **kargs)
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/translation_api.py", line 89, in wrapper
raise Exception(err)
Exception: Http请求失败,状态码:403,错误信息:
{"message":"HMAC signature cannot be verified, a valid date or x-date header is required for HMAC Authentication"}
2020-10-09 13:48:27 ERROR wrapper: Http请求失败,状态码:403,错误信息:
{"message":"HMAC signature cannot be verified, a valid date or x-date header is required for HMAC Authentication"}
Traceback (most recent call last):
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/translation_api.py", line 70, in wrapper
f = func(self, *args, **kargs)
File "/home/lmf/anaconda3/envs/tf/lib/python3.7/site-packages/jionlp/gadget/back_translation/translation_api.py", line 670, in call
raise Exception(exception_string)
Exception: Http请求失败,状态码:403,错误信息:
{"message":"HMAC signature cannot be verified, a valid date or x-date header is required for HMAC Authentication"}

Periodic time expressions

Description

  1. Version:
  • python version: 3.7
  • jionlp version: 1.3.41
  2. Code & Text:
每周四三点和张三在徐家汇开会

  3. Log:

The text cannot be parsed; the result is 'time': [None, None]


**Expectation**

> Expected: an accurate 'delta': {'day': 7}, 'point': {'time': [time point]}


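Is it possible to opt out of lunar calendar conversion manually?

Describe what the feature is for; related material is welcome
Provide a parameter for parse_time() that disables lunar calendar conversion.

Does this feature fix a defect in the project? If so, describe the defect
This relates to the problems raised in #43 and #48. Lunar and solar dates are easily confused, yet the "X月X" form is actually quite common.
So if users are sure their text contains no lunar dates that need conversion, an extra parameter to parse_time() could make it treat X月X as a solar (Gregorian) date.

# Current behavior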
jio.parse_time("四月十三")
{'type': 'time_point',
 'definition': 'accurate',
 'time': ['2022-05-13 00:00:00', '2022-05-13 23:59:59']}

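Describe how you expect the feature to be implemented and the final effect

# Expected behavior
jio.parse_time("四月十三",  lunar_date=False) # lunar_date defaults to True, so existing results are unaffected
{'type': 'time_point',
 'definition': 'accurate',
 'time': ['2022-04-13 00:00:00', '2022-04-13 23:59:59']}

Please star the repository (the ⭐ in the upper right)
Starred, this is a really great package!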

The back-translation API no longer works

import jionlp as jio
google_api = jio.GoogleApi()
baidu_api = jio.BaiduApi(
    [{'appid': '',
      'secretKey': ''}], gap_time=0.5
)
apis = [baidu_api, google_api]
back_trans = jio.BackTranslation(mt_apis=apis)
text = '饿了么凌晨发文将推出新功能,用户可选择是否愿意多等外卖员 5 分钟,你愿意多等这 5 分钟吗?'
print(baidu_api(text))  # single call through the API
result = back_trans(text)
print(result)
It raises the error below; I have hidden the appid and secret here.
`jio.help()` is provided to search how to use jio functions.
Traceback (most recent call last):
  File "E:/python/code/nlp/学术论文分类/LGB推特情感分析.py", line 123, in <module>
    print(baidu_api(text)) # 使用接口做单次调用
  File "E:\install\pythonEnv\lib\site-packages\jionlp\textaug\back_translation\translation_api.py", line 43, in wrapper
    from_lang = kargs['from_lang']
KeyError: 'from_lang'

Time parsing problem

Description

  1. jionlp version: xxxxxx (can be checked via jionlp.__version__)
  2. Log:
The quarter-hour unit “刻” is not recognized in times, e.g. 三点一刻 (a quarter past three), 三点三刻 (a quarter to four)
  3. Code & Text:
Input: 今天下午三点一刻过来写作业. The output is “今天下午三点”.

Expected behavior

今天下午三点一刻

Extracting multiple time ranges

text = '周一到周三早上九点到晚上十点的日程'
res1 = jio.ner.extract_time(text, time_base=time.time())
res1:[{'text': '周三早上九点到晚上十点', 'offset': [3, 14], 'type': 'time_span', 'detail': {'type': 'time_span', 'definition': 'accurate', 'time': ['2021-10-13 09:00:00', '2021-10-13 22:00:00']}}]

Description: only one date is extracted from the date range.
Expected: multiple dates should be extracted.

Time regex should support the “x月x” pattern

Describe what the feature is for; related material is welcome
text2 = "1月1至2月10的天气真好"
res = extract_time(text2, with_parsing=True)
print(res)

The result is:
[{'text': '1至2月', 'offset': [2, 6], 'type': 'time_span', 'detail': {'type': 'time_span', 'definition': 'accurate', 'time': ['2022-01-01 00:00:00', '2022-02-28 23:59:59']}}, {'text': '10的天', 'offset': [6, 10], 'type': 'time_delta', 'detail': {'type': 'time_delta', 'definition': 'accurate', 'time': {'day': 10.0}}}]

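Describe how you expect the feature to be implemented and the final effect
Add the pattern “x月x” to the time regex, without requiring a trailing 日/号.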
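
As a stopgap on the caller's side, the text could be normalized before extraction so that every "x月x" carries an explicit 日. This is a minimal sketch under stated assumptions: `add_day_suffix` is a hypothetical helper that is not part of jionlp, and how jionlp parses the rewritten text is not verified here.

```python
import re

# Matches "<month>月<day>" where the day is not already followed by 日/号
# (or by another digit, which would mean a longer number was cut short).
_MONTH_DAY = re.compile(r"(\d{1,2})月(\d{1,2})(?![\d日号])")


def add_day_suffix(text):
    """Rewrite 'x月x' as 'x月x日' so the day part carries an explicit unit."""
    return _MONTH_DAY.sub(r"\1月\2日", text)


print(add_day_suffix("1月1至2月10的天气真好"))
# → 1月1日至2月10日的天气真好
```

Dates that already end in 日 or 号 are left untouched by the negative lookahead.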

[BUG] URL regex fails to match

Description
The URL regex fails to match.

Please describe your issue here
Calling remove_url sometimes leaves the URL in place.

  1. Version:

  2. Code & Text:

sent1 = "抖音知识分享 https://v.douyin.com/RtKFFah/ 复制Ci鏈接,打开Dou音搜索,直接观看視頻"
sent2 = "抖音知识分享https://v.douyin.com/RtKFFah/复制Ci鏈接,打开Dou音搜索,直接观看視頻"
print("1", jionlp.remove_url(sent1))
print("2", jionlp.remove_url(sent2))
  3. Log:
1.抖音知识分享 https://v.douyin.com/RtKFFah/ 复制Ci鏈接,打开Dou音搜索,直接观看視頻
2.抖音知识分享复制Ci鏈接,打开Dou音搜索,直接观看視頻

Expectation

The correct output in both cases is 【抖音知识分享复制Ci鏈接,打开Dou音搜索,直接观看視頻】.
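
For comparison, a URL pattern that does not depend on the surrounding characters can be sketched as below. This is not jionlp's regex; `strip_urls` is a hypothetical stand-alone helper that restricts the URL body to ASCII characters, so the match ends at the first Chinese character or space.

```python
import re

# ASCII-only URL body: the match stops at the first character that cannot
# appear in a URL, such as a Chinese character or a space.
_URL = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def strip_urls(text):
    """Remove http(s) URLs regardless of what surrounds them."""
    return _URL.sub("", text)


sent2 = "抖音知识分享https://v.douyin.com/RtKFFah/复制Ci鏈接,打开Dou音搜索,直接观看視頻"
print(strip_urls(sent2))
```

The sketch leaves the spaces around a removed URL in place; collapsing them is a separate choice for the caller.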


[BUG] Handling of “和” in time semantic parsing

Description

  1. Version:
  • python version: 3.8
  • jionlp version: 1.3.47
  2. Code & Text:
For “每周一9点和14点” (every Monday at 9:00 and 14:00), the result is:
[
	{
		"detail": {
			"definition": "accurate",
			"time": {
				"delta": {
					"day": 7
				},
				"point": {
					"string": "周一9点",
					"time": [
						"2022-02-28 09:00:00",
						"2022-02-28 09:59:59"
					]
				}
			},
			"type": "time_period"
		},
		"offset": [
			0,
			5
		],
		"text": "每周一9点",
		"type": "time_period"
	},
	{
		"detail": {
			"definition": "accurate",
			"time": [
				"2022-03-05 14:00:00",
				"2022-03-05 14:59:59"
			],
			"type": "time_point"
		},
		"offset": [
			6,
			9
		],
		"text": "14点",
		"type": "time_point"
	}
]
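  3. Log:
The parsing of “和14点” looks incorrect. Could this be fixed, or is there a workaround? Thanks!

Expectation

Return the correct parsing result.
See: Time semantic parsing, on parsing the character “和”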

entity2tag() has some limitations

entity2tag() could be improved. From the source code, tagging proceeds in offset order, so it goes wrong when the entities in ner_entities are not sorted by offset.

For example:
before:
[{'text': '胡静静', 'offset': [0, 3], 'type': 'Person'},{'text': '水利局', 'offset': [4, 7], 'type': 'Orgnization'}]
after:
ner_entities =[{'text': '水利局', 'offset': [4, 7], 'type': 'Orgnization'},{'text': '胡静静', 'offset': [0, 3], 'type': 'Person'}]
The final result then becomes:
['O', 'O', 'O', 'O', 'B-Orgnization', 'I-Orgnization', 'E-Orgnization', 'O', 'O', 'O']
Thank you very much for the tool; it has helped me a lot.
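
Until the function handles unordered input itself, the caller can sort the entities by start offset first. This is a minimal sketch; `sort_entities` is a hypothetical helper, and the dict layout is the one shown in the example above.

```python
def sort_entities(entities):
    """Return entities ordered by start offset (the input list is left unchanged)."""
    return sorted(entities, key=lambda ent: ent["offset"][0])


ner_entities = [
    {"text": "水利局", "offset": [4, 7], "type": "Orgnization"},
    {"text": "胡静静", "offset": [0, 3], "type": "Person"},
]
ordered = sort_entities(ner_entities)
# ordered[0] is now the 'Person' entity at offset [0, 3];
# pass `ordered` to entity2tag() instead of the unsorted list.
```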

  • Function name:
    entity2tag()

Please input the text and code

# copy and paste here

Please input the bug info and traceback

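Is there a way to install on python>3.9?

Installation still fails with a pkuseg error. Is there a way around it? I don't use any of the word-segmentation features, and my environment already has many packages installed, so rebuilding it is inconvenient.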

Specifying the time type in parse_time has no effect

Description

  1. Version:
  • python version: 3.7.4
  • jionlp version: 1.3.53
  2. Code & Text:
res = jio.parse_time('请修改每天18点的提醒', time_base=datetime.now(), time_type='time_point')
print(res)
  3. Log:

Expectation
Expected: {'type': 'time_point', 'definition': 'accurate', 'time': ['2022-05-11 18:00:00', '2022-05-11 18:59:59'], 'string': '请修改18点的提醒'}
Actual: {'type': 'time_period', 'definition': 'accurate', 'time': {'delta': {'day': 1}, 'point': {'time': ['2022-05-11 18:00:00', '2022-05-11 18:59:59'], 'string': '请修改18点的提醒'}}}

If the result is not ideal, please describe your expectation

time_type was set to time_point, but the result does not follow that setting.


Address parsing is inaccurate

{
"province":"黑龙江省",
"city":"齐齐哈尔市",
"county":"龙江县",
"detail":"省富裕县富裕镇三社区五委2组",
"full_location":"黑龙江省齐齐哈尔市龙江县省富裕县富裕镇三社区五委2组",
"orig_location":"黑龙江省富裕县富裕镇三社区五委2组"
}

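Money string extraction does not seem to support colloquial expressions

Please describe the bug or the function you expect

For example, colloquial amounts such as 十块五 or 八块五毛钱 do not seem to be supported.
On reflection, recognizing these expressions could easily conflict with other measure words; I am not sure whether there is a suitable solution.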

Time parsing problem

Description

Two cases:
1. For “明天上午8点到9点开会” (meeting tomorrow from 8 to 9 am), the 9点 is parsed as 9 o'clock today.
2. For “下午3点开会,提前20分钟提醒我” (meeting at 3 pm, remind me 20 minutes early), the trailing 20分钟 is also parsed as 3 o'clock.

Installation fails on python 3.9.10

python version: 3.9.10
pip version: 22.0.4
Operating system: windows10
Package index: Tsinghua mirror (https://pypi.tuna.tsinghua.edu.cn/simple)

Error log:
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting jionlp
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/54/81/72112e67f4de08db3b701e36f69318c79540f67916fc6ab26c91995725fd/jionlp-1.3.47-py2.py3-none-any.whl (19.0 MB)
Collecting pkuseg
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/64/3a/090a533c7f0682d653633cfd2d33e9aab3e671379fb199aeb7fa9bd3c34a/pkuseg-0.0.25.tar.gz (48.8 MB)
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: jieba in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from jionlp) (0.42.1)
Requirement already satisfied: numpy in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from jionlp) (1.22.2)
Requirement already satisfied: requests in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from jionlp) (2.27.1)
Requirement already satisfied: zipfile36 in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from jionlp) (0.1.3)
Requirement already satisfied: cython in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from pkuseg->jionlp) (0.29.28)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from requests->jionlp) (1.26.8)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from requests->jionlp) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5 in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from requests->jionlp) (3.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages (from requests->jionlp) (2.0.12)
Using legacy 'setup.py install' for pkuseg, since package 'wheel' is not installed.
Installing collected packages: pkuseg, jionlp
Running setup.py install for pkuseg: started
Running setup.py install for pkuseg: finished with status 'error'
error: subprocess-exited-with-error

Running setup.py install for pkuseg did not run successfully.
exit code: 1

[63 lines of output]
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.9
creating build\lib.win-amd64-3.9\pkuseg
copying pkuseg\config.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\data.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\download.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\gradient.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\model.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\optimizer.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\res_summarize.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\scorer.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\trainer.py -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\__init__.py -> build\lib.win-amd64-3.9\pkuseg
creating build\lib.win-amd64-3.9\pkuseg\dicts
copying pkuseg\dicts\__init__.py -> build\lib.win-amd64-3.9\pkuseg\dicts
creating build\lib.win-amd64-3.9\pkuseg\models
copying pkuseg\models\__init__.py -> build\lib.win-amd64-3.9\pkuseg\models
creating build\lib.win-amd64-3.9\pkuseg\postag
copying pkuseg\postag\model.py -> build\lib.win-amd64-3.9\pkuseg\postag
copying pkuseg\postag\__init__.py -> build\lib.win-amd64-3.9\pkuseg\postag
creating build\lib.win-amd64-3.9\pkuseg\models\default
copying pkuseg\models\default\__init__.py -> build\lib.win-amd64-3.9\pkuseg\models\default
copying pkuseg\feature_extractor.pyx -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\inference.pyx -> build\lib.win-amd64-3.9\pkuseg
copying pkuseg\dicts\default.pkl -> build\lib.win-amd64-3.9\pkuseg\dicts
copying pkuseg\postag\feature_extractor.pyx -> build\lib.win-amd64-3.9\pkuseg\postag
copying pkuseg\models\default\features.pkl -> build\lib.win-amd64-3.9\pkuseg\models\default
copying pkuseg\models\default\weights.npz -> build\lib.win-amd64-3.9\pkuseg\models\default
running build_ext
skipping 'pkuseg\inference.cpp' Cython extension (up-to-date)
cythoning pkuseg/feature_extractor.pyx to pkuseg\feature_extractor.c
cythoning pkuseg/postag/feature_extractor.pyx to pkuseg/postag\feature_extractor.c
building 'pkuseg.inference' extension
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
creating build\temp.win-amd64-3.9\Release\pkuseg
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\JieBrother\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\include -IC:\Users\JieBrother\AppData\Local\Programs\Python\Python39\include -IC:\Users\JieBrother\AppData\Local\Programs\Python\Python39\include -IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt /EHsc /Tppkuseg\inference.cpp /Fobuild\temp.win-amd64-3.9\Release\pkuseg\inference.obj
inference.cpp
c:\users\jiebrother\appdata\local\programs\python\python39\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
pkuseg\inference.cpp(3118): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
pkuseg\inference.cpp(4284): warning C4244: '=': conversion from 'npy_intp' to 'int', possible loss of data
pkuseg\inference.cpp(4285): warning C4244: '=': conversion from 'npy_intp' to 'int', possible loss of data
pkuseg\inference.cpp(5108): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
pkuseg\inference.cpp(6219): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
pkuseg\inference.cpp(6807): warning C4244: 'argument': conversion from 'Py_ssize_t' to 'int', possible loss of data
pkuseg\inference.cpp(23619): error C2039: 'tp_print': is not a member of '_typeobject'
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/object.h(193): note: see declaration of '_typeobject'
pkuseg\inference.cpp(23624): error C2039: 'tp_print': is not a member of '_typeobject'
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/object.h(193): note: see declaration of '_typeobject'
pkuseg\inference.cpp(23639): error C2039: 'tp_print': is not a member of '_typeobject'
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/object.h(193): note: see declaration of '_typeobject'
pkuseg\inference.cpp(23652): error C2039: 'tp_print': is not a member of '_typeobject'
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/object.h(193): note: see declaration of '_typeobject'
pkuseg\inference.cpp(24323): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
pkuseg\inference.cpp(24339): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/unicodeobject.h(446): note: see declaration of '_PyUnicode_get_wstr_length'
pkuseg\inference.cpp(26222): warning C4996: 'PyUnicode_FromUnicode': deprecated in 3.3
c:\users\jiebrother\appdata\local\programs\python\python39\include\cpython/unicodeobject.h(551): note: see declaration of 'PyUnicode_FromUnicode'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

Encountered error while trying to install package.

pkuseg

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

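Question about the key phrase extraction example

[image]
As shown in the image, when extracting key phrases with the demo, why is the output not: ['俄罗斯克里姆林宫', '邀请金正恩访俄', '举行会谈', '朝方转交普京', '最高司令官金正恩']
Or do some parameters need to be adjusted to get the output above?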

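Address recognition: fuzzy matching

Thanks to the author for such a great tool. I have run into a problem with address recognition: does the tool currently support fuzzy matching? For example, 北京*** is not recognized; only the exact form 北京市 works. How should this be handled?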

This issue is only here to say thanks

Please describe the bug or the function you expect

  • Function name:
    Great Job!

Please input the text and code

# copy and paste here

Please input the bug info and traceback

Thank you for your work. I am using the library now and hope to contribute to it in the future!

The regex in remove_exception_char does not work

ASCII_EXCEPTION_PATTERN = '[^\x09-\x0d\x20-\x7e\xa0£¥©®°±×÷]'
UNICODE_EXCEPTION_PATTERN = '[^‐-”•…‰※℃℉Ⅰ-ⅹ①-⒛\u3000-】〔-〞㈠-㈩一-龥﹐-﹫!-~¢£¥]'
EXCEPTION_PATTERN = ASCII_EXCEPTION_PATTERN[:-1] + UNICODE_EXCEPTION_PATTERN[2:]

print(EXCEPTION_PATTERN)
 -~ £¥©®°±×÷‐-”•…‰※℃℉Ⅰ-ⅹ①-⒛ -】〔-〞㈠-㈩一-龥﹐-﹫!-~¢£¥]

When the method is called, the abnormal characters in the text are not removed.
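
One way to narrow this down is to inspect the pattern with `repr()` and test one half in isolation. The sketch below uses only the ASCII pattern quoted above; the sample string is made up for illustration, and it does not reproduce or diagnose the combined pattern.

```python
import re

ASCII_EXCEPTION_PATTERN = '[^\x09-\x0d\x20-\x7e\xa0£¥©®°±×÷]'

# repr() escapes control characters, so the pattern stays readable
# where a plain print() would emit raw tabs and carriage returns.
print(repr(ASCII_EXCEPTION_PATTERN))

# NUL (\x00) and DEL (\x7f) fall outside every allowed range,
# so the negated class matches them and they are removed.
cleaned = re.sub(ASCII_EXCEPTION_PATTERN, '', 'abc\x00def\x7f')
```

Repeating the same check with `EXCEPTION_PATTERN` on the failing text would show whether the concatenated pattern behaves differently.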

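parse_time: the default time.time() argument is not refreshed per call

Description

As a beginner I wrote a reminder bot, but found that with relative times such as 明天 (tomorrow), 后天 (the day after tomorrow), 几秒后 (in a few seconds) or 几分钟后 (in a few minutes), the base time is always the moment the program started,
not the time of the call. (A reduced example follows.) (I am not sure whether this is a bug or incorrect usage.)

  1. Version:
  • python version: 3.9.12
  • jionlp version: 1.3.53
  2. Code & Text:
e.g.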
import time
import jionlp
import re


def analyse(text: str):
    match_rule = r"(?P<time>(.*)?)(提醒我|[和对跟]我说)(?P<something>(.*))"
    result = re.match(pattern=match_rule, string=text)
    print(text)
    if result is not None:
        print(jionlp.parse_time(result.groupdict()['time']))
    else:
        print("匹配失败")
    print('*' * 50)
    time.sleep(2)


text_list = [
    "1秒后提醒我做吃饭",
    "1秒后提醒我做吃饭",
    "1秒后提醒我做吃饭",
    "1秒后提醒我做吃饭",
]

for text in text_list:
    analyse(text=text)

print(jionlp.__version__)
  3. Log:
1秒后提醒我做吃饭
{'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-22 15:12:12', '2022-04-22 15:12:13']}
**************************************************
1秒后提醒我做吃饭
{'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-22 15:12:12', '2022-04-22 15:12:13']}
**************************************************
1秒后提醒我做吃饭
{'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-22 15:12:12', '2022-04-22 15:12:13']}
**************************************************
1秒后提醒我做吃饭
{'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-22 15:12:12', '2022-04-22 15:12:13']}
**************************************************
1.3.53

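Process finished with exit code 0

Expectation

If the result is not ideal, please describe your expectation
Each call should use the time of the call as its base time; in the example above, the seconds should be 12, 14, 16 and 18 respectively.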
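
The symptom is consistent with a well-known Python behavior: a default argument value is evaluated once, when the function is defined, not on every call. The sketch below illustrates that behavior with two hypothetical functions; whether jionlp's signature actually looks like the first one is an assumption.

```python
import time


def stamp_frozen(now=time.time()):
    # The default is evaluated once, when the function is defined.
    return now


def stamp_fresh(now=None):
    # Resolve the default at call time instead.
    return time.time() if now is None else now
```

On the caller's side, passing the base time explicitly on each call, for example `jionlp.parse_time(text, time_base=time.time())`, avoids relying on the default. The `time_base` parameter appears in other issues on this page.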

Please star the repository (the ⭐ in the upper right) (Done, many thanks!)

Time semantic parsing

Description

“大年初一” (the first day of the Lunar New Year) is parsed incorrectly.

[image]

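Feedback on money amount extraction and parsing

Version:
Python 3.9
jionlp-py39 1.3.45

Problem description:
When extracting money amounts from text such as “2.2本计划投资3541.07万元2.3本项目……”, the extracted string is “3541.07万元2”, and the parsed result is:
{'num': '200000.20', 'case': '元', 'definition': 'accurate'}
In other words, when “万元” has a number before it and another number right after it, the parsed result is wrong. It looks as if the preceding digits are summed: 3+5+4+1+0+7=20, giving 200000.2.
[image]

Result from the online version:
[image]

This is mainly feedback on the result I got in this case, in the hope that it helps future iterations. Thanks to the author :)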

News location recognition does not run locally

  1. jionlp version: 1.3.39
  2. Log:
➜  JioNLP git:(master) ✗ python3.9 index.py
`jio.help()` is provided to search how to use jio functions.
Traceback (most recent call last):
  File "python/JioNLP/index.py", line 6, in <module>
    print(jio.recognize_location(text))
  File "python/JioNLP/jionlp/gadget/location_recognizer.py", line 381, in __call__
    self._prepare()
  File "python/JioNLP/jionlp/gadget/location_recognizer.py", line 111, in _prepare
    self.pkuseg = pkuseg.pkuseg(postag=True)
TypeError: __init__() got an unexpected keyword argument 'postag'
  3. Code & Text:
import jionlp as jio
text = '海洋一号D星。中新网北京6月11日电(郭超凯)记者从**国家航天局获悉,6月11日2时31分,在牛家村,**在太原卫星发射中心用长征二号丙运载火箭成功发射海洋一号D星。该星将与海洋一号C星组成**首个海洋民用业务卫星星座。相比于美国,海洋一号D星是**第四颗海洋水色系列卫星,是国家民用空间基础设施规划的首批海洋业务卫星之一。'
res = jio.recognize_location(text)
print(res)

Expected behavior

The code should produce the same result as the documented example.

Error during back-translation

Please describe the bug or the function you expect

  • Function name:
    jio.BackTranslation

Please input the text and code

# copy and paste here
国家卫生健康委今天5月25日通报5月24日024时31个省自治区直辖市和**生产建设兵团报告新增新冠肺炎确诊病例15例其中境外输入病例13例本土病例2例新增无症状感染者18例其中境外输入16例本土2例截至5月24日24时现有确诊病例319例截至5月24日各地累计报告接种新冠病毒疫苗527253万剂次

Please input the bug info and traceback

Traceback (most recent call last):
File "back_transformation_xwlb_data.py", line 41, in
back_transformation_xwlb_data(bt, data_path, save_path, gap)
File "back_transformation_xwlb_data.py", line 22, in back_transformation_xwlb_data
res = back_translation.back_translation(line)
File "/source/code/zhaoyhy/AI/src/DataAugment/back_translation.py", line 57, in back_translation
return self.back_translation_api(text)
File "/usr/python3.8/lib/python3.8/site-packages/jionlp/textaug/back_translation/back_translation.py", line 115, in call
back_tran_result = self.filter_results(text, back_tran_result)
File "/usr/python3.8/lib/python3.8/site-packages/jionlp/textaug/back_translation/back_translation.py", line 197, in filter_results
back_tran_results = [line for line in back_tran_results
File "/usr/python3.8/lib/python3.8/site-packages/jionlp/textaug/back_translation/back_translation.py", line 198, in
if _length_filter(text, line)]
File "/usr/python3.8/lib/python3.8/site-packages/jionlp/textaug/back_translation/back_translation.py", line 192, in _length_filter
if (orig_len / tran_len) < 1 / 3 or (orig_len / tran_len) > 3:
ZeroDivisionError: division by zero
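
The traceback points at a length-ratio check that divides by the length of the translated text, which is zero when a translation comes back empty. A guarded version of such a check could look like the sketch below. `length_ratio_ok` is a hypothetical stand-alone function, not jionlp's `_length_filter`, and rejecting empty results is an assumption about the desired behavior.

```python
def length_ratio_ok(orig_text, tran_text, max_ratio=3.0):
    """Return True when the two lengths differ by at most a factor of max_ratio."""
    orig_len, tran_len = len(orig_text), len(tran_text)
    if orig_len == 0 or tran_len == 0:
        # An empty translation (e.g. a failed API call) is rejected
        # rather than dividing by zero.
        return False
    ratio = orig_len / tran_len
    return 1 / max_ratio <= ratio <= max_ratio
```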

Error in location_parser

Description
Using the address parsing feature with the text “上门西湖区蒋村花园小区管局农贸市场高高兴兴” raises TypeError: sequence item 0: expected str instance, NoneType found

Line 279 of location_parser.py needs to be changed to key_name = ''.join( [str(prov), str(city), str(county)])
Resolved
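
Note that `str(None)` yields the literal text `'None'`, so the fix above puts that word into the key whenever a level is missing. A variant that maps missing levels to an empty string is sketched below. `build_key` is a hypothetical helper, and whether a key containing `'None'` matters depends on how `key_name` is used downstream, which the source does not show.

```python
def build_key(prov, city, county):
    """Concatenate the address levels, treating None as an empty string."""
    return "".join(part or "" for part in (prov, city, county))


# With str(): '浙江省' + 'None' + '西湖区'
with_str = "".join([str("浙江省"), str(None), str("西湖区")])
# With the helper: '浙江省' + '' + '西湖区'
with_helper = build_key("浙江省", None, "西湖区")
```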

Data augmentation: homophone substitution bug

Please describe the bug or the function you expect

  • Function name:
    dictionary_loader.py -> chinese_char_dictionary_loader

Please input the text and code

jio.homophone_substitution("北京市")

Please input the bug info and traceback

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/ks/vz0z2zk13hx0t6h_pgy1bpfh0000gn/T/jieba.cache
Loading model cost 0.602 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-703a6a7940fb>", line 1, in <module>
    runfile('/Users/mrx/Documents/work/lance/gov_nlp/repo/legal_instrument/corpus/augement.py', wdir='/Users/mrx/Documents/work/lance/gov_nlp/repo/legal_instrument/corpus')
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/mrx/Documents/work/lance/gov_nlp/repo/legal_instrument/corpus/augement.py", line 217, in <module>
    jio.homophone_substitution('北京市')
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/textaug/homophone_substitution.py", line 108, in __call__
    self._prepare(homo_ratio=homo_ratio, seed=seed)
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/textaug/homophone_substitution.py", line 68, in _prepare
    self._construct_word_pinyin_dict()
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/textaug/homophone_substitution.py", line 80, in _construct_word_pinyin_dict
    word_pinyin = self.pinyin(word, formater='detail')
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/gadget/pinyin.py", line 164, in __call__
    self._prepare()
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/gadget/pinyin.py", line 79, in _prepare
    self.pinyin_char = pinyin_char_loader()
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/dictionary/dictionary_loader.py", line 424, in pinyin_char_loader
    char_dict = chinese_char_dictionary_loader()
  File "/Users/mrx/anaconda3/lib/python3.7/site-packages/jionlp/dictionary/dictionary_loader.py", line 245, in chinese_char_dictionary_loader
    assert len(segs) == 8
AssertionError

Version:
jionlp==1.3.15

Other problems:
word_distribution.zip does not ship with its extracted text file; it has to be unzipped manually.

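Preloading at initialization

If this system is integrated into an online service, it would be best to explicitly preload the modules that will be used when the backend starts, to guarantee response speed. How can this be done?
Something like jieba.initialize() provided by jieba, which loads the model directly so that later calls do not need to load it again and respond faster.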
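
The tracebacks elsewhere on this page show `__call__` invoking `self._prepare()`, which suggests resources are loaded lazily on first use. Under that assumption, one possible approach is to call each needed function once at start-up with a throwaway input. `warm_up` below is a hypothetical helper, not a jionlp API.

```python
def warm_up(calls):
    """Run each (function, sample_input) pair once so lazy loading happens now."""
    failures = []
    for func, sample in calls:
        try:
            func(sample)
        except Exception as exc:  # keep starting up even if one warm-up fails
            failures.append((getattr(func, "__name__", repr(func)), exc))
    return failures


# Hypothetical usage at service start-up:
# import jionlp as jio
# warm_up([(jio.parse_time, "明天"), (jio.recognize_location, "北京")])
```

The returned list lets the service log which warm-up calls failed without aborting start-up.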
