deepcs233 / jieba_fast
Use C API and SWIG to speed up jieba (an efficient Chinese word segmentation library)
License: MIT License
Could you change the log level in jieba_fast/jieba_fast/__init__.py to WARNING? (The original DEBUG output is noisy.)
default_logger.setLevel(logging.WARNING)
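Until such a change lands, the level can be raised from user code without patching the library. A minimal sketch, assuming jieba_fast's `default_logger` is an ordinary stdlib `logging.Logger` obtained via `logging.getLogger` (as in upstream jieba); the logger name used here is a guess:

```python
import logging

# Assumption: like upstream jieba, jieba_fast builds its default_logger with
# logging.getLogger(...), so its level can be raised from the caller's side.
# With the real library installed this would simply be:
#     import jieba_fast
#     jieba_fast.default_logger.setLevel(logging.WARNING)
logger = logging.getLogger("jieba_fast")  # hypothetical logger name
logger.setLevel(logging.WARNING)

# DEBUG records are now filtered out; WARNING and above still pass.
assert not logger.isEnabledFor(logging.DEBUG)
assert logger.isEnabledFor(logging.WARNING)
```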
Test code:
import time

import jieba
import jieba_fast

text_proc = text.replace('\n', ' ')  # `text` holds the raw corpus, loaded elsewhere
# running jieba.cut()
jieba_words = [word for word in jieba.cut(text_proc, HMM=True)]
# running jieba_fast.cut()
jb_fast_words = [word for word in jieba_fast.cut(text_proc, HMM=True)]
print("Test set: 西遊記")
print("Number of unique jieba words: {}".format(len(set(jieba_words))))
print("Number of unique jieba_fast words: {}".format(len(set(jb_fast_words))))
print("Number of unique words in intersection: {}".format(len(set(jieba_words) & set(jb_fast_words))))
print("Number of unique words in union: {}".format(len(set(jieba_words) | set(jb_fast_words))))
print("IOU (intersection over union): {}".format(len(set(jieba_words) & set(jb_fast_words)) / len(set(jieba_words) | set(jb_fast_words))))
Test results:
Test set: 西遊記
Number of unique jieba words: 31095
Number of unique jieba_fast words: 41684
Number of unique words in intersection: 22078
Number of unique words in union: 50701
IOU (intersection over union): 0.43545492199364905
Test data: https://github.com/deepcs233/jieba_fast/files/1795904/jttw_1-50.txt
I am training Word2Vec; the code is roughly:
import logging

import pymongo
import jieba
from gensim.models import Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

db = pymongo.MongoClient().weike.content_lectures

class Document:
    def __iter__(self):
        for t in db.find(no_cursor_timeout=True):
            yield jieba.lcut(t['content'])

word2vec = Word2Vec(Document(), size=128, window=10, min_count=5, sg=1, negative=10, workers=4, iter=10)
word2vec.save('weike.word2vec')
This code keeps memory consumption consistently low.
However, if jieba is swapped for your jieba_fast, memory climbs steadily until the process is finally killed. I have reproduced this on both Ubuntu and CentOS.
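One way to narrow down where the memory goes, without the whole MongoDB pipeline, is to meter a segmentation loop with the stdlib `tracemalloc` and compare the two libraries. `str.split` below is only a stand-in for `jieba.lcut` / `jieba_fast.lcut`:

```python
import tracemalloc

def memory_growth(segment, docs):
    """Net Python-level allocation growth after running `segment` over docs.

    `segment` stands in for jieba.lcut / jieba_fast.lcut; swapping the two
    in here should show whether jieba_fast retains memory on every call.
    """
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    for doc in docs:
        segment(doc)  # result discarded, as in the streaming Word2Vec loop
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - baseline

# A stateless segmenter should show near-zero growth over thousands of docs.
growth = memory_growth(str.split, ("一 段 文本 " * 50 for _ in range(2000)))
print(growth)
```

Note that `tracemalloc` only sees Python-level allocations; leaks inside the C extension would need a native tool such as valgrind, but a flat curve here at least rules out the Python side.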
import jieba_fast as jieba
jieba.add_word('希尔顿')
print(list(jieba.cut('希尔顿酒店健身中心')))
The output still contains 希尔顿酒店 as a single word.
Is add_word simply not implemented?
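Upstream jieba computes a sufficient default frequency when `add_word` is called without `freq`; if jieba_fast's C path does not, passing an explicit large frequency (`jieba.add_word('希尔顿', freq=10000)`, or tuning with `jieba.suggest_freq(('希尔顿', '酒店'), tune=True)`) is a plausible workaround. The toy unigram DP below (not jieba's actual code) illustrates why the stored frequency decides whether the short word beats the longer dictionary entry:

```python
import math

# Toy unigram DP segmenter -- NOT jieba's real code -- illustrating why the
# frequency stored for an added word matters: the best path maximizes the
# sum of log(freq[w] / total) over its words.
def best_cut(sentence, freq):
    total = sum(freq.values())
    best = {0: (0.0, [])}  # position -> (score, words so far)
    for i in range(len(sentence)):
        if i not in best:
            continue
        score, path = best[i]
        for j in range(i + 1, len(sentence) + 1):
            word = sentence[i:j]
            if word in freq:
                cand = (score + math.log(freq[word] / total), path + [word])
                if j not in best or cand[0] > best[j][0]:
                    best[j] = cand
    return best[len(sentence)][1]

freq = {"希尔顿酒店": 50, "希尔顿": 1, "酒店": 100, "健身": 80, "中心": 90}
print(best_cut("希尔顿酒店健身中心", freq))  # the longer dictionary entry wins
freq["希尔顿"] = 10000                       # like add_word('希尔顿', freq=10000)
print(best_cut("希尔顿酒店健身中心", freq))  # now the split path wins
```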
Current last upload date Dec 21, 2018
It reports an error: the system cannot find the specified path: 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\PlatformSDK\lib
How do I solve this?
Python is version 3.6.
My program dynamically loads different configuration files depending on runtime conditions; some keywords may also be removed from a configuration file while others are added.
I do not know Python or jieba well; any pointers would be appreciated.
The first call to jieba_fast.lcut builds the trie and writes it to /tmp/jieba.cache.
What happens if two threads call jieba_fast.lcut at the same time? And what if two processes do?
Also, what is the difference between jieba.set_dictionary and jieba.load_userdict?
I tried to install it today and it suddenly stopped working; please help.
Windows 10 64 Bit
Python 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)] on win32
pip output (building from source):
Collecting jieba-fast
Using cached jieba_fast-0.53.tar.gz (7.5 MB)
Using legacy 'setup.py install' for jieba-fast, since package 'wheel' is not installed.
Installing collected packages: jieba-fast
Running setup.py install for jieba-fast ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\kiwirafe\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\kiwirafe\\AppData\\Local\\Temp\\pip-install-rj2rjw1v\\jieba-fast\\setup.py'"'"'; __file__='"'"'C:\\Users\\kiwirafe\\AppData\\Local\\Temp\\pip-install-rj2rjw1v\\jieba-fast\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\kiwirafe\AppData\Local\Temp\pip-record-wu54fq9o\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\kiwirafe\appdata\local\programs\python\python38\Include\jieba-fast'
cwd: C:\Users\kiwirafe\AppData\Local\Temp\pip-install-rj2rjw1v\jieba-fast\
Complete output (55 lines):
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.8
creating build\lib.win-amd64-3.8\jieba_fast
copying jieba_fast\jieba_fast_functions_py2.py -> build\lib.win-amd64-3.8\jieba_fast
copying jieba_fast\jieba_fast_functions_py3.py -> build\lib.win-amd64-3.8\jieba_fast
copying jieba_fast\_compat.py -> build\lib.win-amd64-3.8\jieba_fast
copying jieba_fast\__init__.py -> build\lib.win-amd64-3.8\jieba_fast
copying jieba_fast\__main__.py -> build\lib.win-amd64-3.8\jieba_fast
copying jieba_fast\dict.txt -> build\lib.win-amd64-3.8\jieba_fast
copying jieba_fast\_compat.pyc -> build\lib.win-amd64-3.8\jieba_fast
copying jieba_fast\__init__.pyc -> build\lib.win-amd64-3.8\jieba_fast
creating build\lib.win-amd64-3.8\jieba_fast\finalseg
copying jieba_fast\finalseg\jieba_fast_functions_py2.py -> build\lib.win-amd64-3.8\jieba_fast\finalseg
copying jieba_fast\finalseg\jieba_fast_functions_py3.py -> build\lib.win-amd64-3.8\jieba_fast\finalseg
copying jieba_fast\finalseg\prob_emit.p -> build\lib.win-amd64-3.8\jieba_fast\finalseg
copying jieba_fast\finalseg\prob_emit.py -> build\lib.win-amd64-3.8\jieba_fast\finalseg
copying jieba_fast\finalseg\prob_emit.pyc -> build\lib.win-amd64-3.8\jieba_fast\finalseg
copying jieba_fast\finalseg\prob_start.p -> build\lib.win-amd64-3.8\jieba_fast\finalseg
copying jieba_fast\finalseg\prob_start.py -> build\lib.win-amd64-3.8\jieba_fast\finalseg
When the sentence being segmented is too long, a segmentation fault occurs.
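Until the crash itself is fixed, a common mitigation is to bound the input size per `cut()` call by splitting long text at sentence punctuation first. The `max_len` of 2000 below is an arbitrary guess rather than a documented limit, and the lambda in the demo merely stands in for `jieba_fast.cut`:

```python
import re

def cut_in_chunks(cut, text, max_len=2000):
    """Feed `cut` (e.g. jieba_fast.cut) pieces of bounded size, breaking at
    sentence punctuation so chunk boundaries don't fall inside words.

    A single punctuation-free run longer than max_len still goes through
    whole; this is a best-effort guard, not a hard cap.
    """
    pieces = re.split(r'([。!?;!?;\n])', text)  # keep the delimiters
    buf = ""
    for piece in pieces:
        if buf and len(buf) + len(piece) > max_len:
            yield from cut(buf)
            buf = ""
        buf += piece
    if buf:
        yield from cut(buf)

# demo with a character-level stand-in segmenter: no text is lost or reordered
joined = "".join(cut_in_chunks(lambda s: iter(s), "句子一。句子二。第三句!", max_len=4))
print(joined == "句子一。句子二。第三句!")
```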
Hello. While using jieba_fast I found that with a custom dictionary, jieba_fast's segmentation differs from jieba's. The system is Ubuntu 18.04, with the following library versions:
jieba 0.39
jieba-fast 0.53
Reproduction code:
import jieba
import jieba_fast
sentence = '从一开始的不被看好到逆袭,拼多多只用了3年的时间。虽然上周的优惠券漏洞让拼多多被薅了千万羊毛,但却并未对其股价造成不利影响。25日拼多多市值报318.38亿美元,超过京东313.51亿的市值。拼多多股价周四大涨。截至收盘时,拼多多股价收于28.74美元,上涨2.07美元,涨幅达7.76%。而京东股价收于22.1美元,上涨0.13美元,涨幅为0.59%,成为**第二大电商平台。拼多多股价在过去的半年里波动强烈:IPO定价于19美元,首日收盘价就达到26.70美元,涨幅达40.5%;之后不断下跌至17.22美元,然后就一路上涨,创造目前历史最高价30.48美元,但随后再次下跌,达到历史最低价16.53美元;此后公司股票在22美元附近震荡。拼多多股价波动强烈拼多多近日遭遇两大利空,一方面,拼多多被曝出现重大漏洞,引来大批用户“薅羊毛”,导致公司声誉和资金遭到重大损失。另一方面,拼多多股票的禁售期将于1月22日结束。届时,将有大量拼多多股东二级市场出售股票进行套现。虽然有很多人不喜欢拼多多,但是不得不说拼多多近来发展确实不错,而京东今年则比较坎坷,但在物流和服务方面还是口碑不错的。此前刘强东曾称京东和拼多多的商业模式不同。黄峥则回应称要多向电商前辈学习。'
a1 = jieba.lcut(sentence, HMM=False)
b1 = jieba_fast.lcut(sentence, HMM=False)
print(a1==b1)
jieba.set_dictionary('./word_list_nnlm_128.txt')
a2 = jieba.lcut(sentence, HMM=False)
jieba_fast.set_dictionary('./word_list_nnlm_128.txt')
b2 = jieba_fast.lcut(sentence, HMM=False)
print(a2==b2)
print(a2)
print(b2)
Output:
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 3.847 seconds.
Prefix dict has been built succesfully.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 2.129 seconds.
Prefix dict has been built succesfully.
Building prefix dict from /home/zeng/Code/work/jixin/PublicMonitoring/NoteBook/word_list_nnlm_128.txt ...
Loading model from cache /tmp/jieba.u2fcc826b59ac43bb4127b88239445c58.cache
True
Loading model cost 8.175 seconds.
Prefix dict has been built succesfully.
Building prefix dict from /home/zeng/Code/work/jixin/PublicMonitoring/NoteBook/word_list_nnlm_128.txt ...
Loading model from cache /tmp/jieba.u2fcc826b59ac43bb4127b88239445c58.cache
Loading model cost 6.771 seconds.
Prefix dict has been built succesfully.
False
['从', '一', '开始', '的', '不', '被', '看好', '到', '逆袭', ',', '拼', '多多', '只', '用', '了', '3', '年', '的', '时间', '。', '虽然', '上周', '的', '优惠', '券', '漏洞', '让', '拼', '多多', '被', '薅', '了', '千万', '羊毛', ',', '但却', '并未', '对其', '股价', '造成', '不利', '影响', '。', '25', '日', '拼', '多多', '市值', '报', '318', '.', '38', '亿', '美元', ',', '超过', '京东', '313', '.', '51', '亿', '的', '市值', '。', '拼', '多多', '股价', '周四', '大涨', '。', '截至', '收盘', '时', ',', '拼', '多多', '股价', '收', '于', '28', '.', '74', '美元', ',', '上涨', '2', '.', '07', '美元', ',', '涨幅', '达', '7', '.', '76', '%', '。', '而', '京东', '股价', '收', '于', '22', '.', '1', '美元', ',', '上涨', '0', '.', '13', '美元', ',', '涨幅', '为', '0', '.', '59', '%', ',', '成为', '**', '第二', '大', '电', '商', '平台', '。', '拼', '多多', '股价', '在', '过去', '的', '半年', '里', '波动', '强烈', ':', 'IPO', '定价', '于', '19', '美元', ',', '首', '日', '收盘', '价', '就', '达到', '26', '.', '70', '美元', ',', '涨幅', '达', '40', '.', '5', '%', ';', '之后', '不断', '下跌', '至', '17', '.', '22', '美元', ',', '然后', '就', '一路上', '涨', ',', '创造', '目前', '历史', '最高', '价', '30', '.', '48', '美元', ',', '但', '随后', '再次', '下跌', ',', '达到', '历史', '最低价', '16', '.', '53', '美元', ';', '此后', '公司', '股票', '在', '22', '美元', '附近', '震荡', '。', '拼', '多多', '股价', '波动', '强烈', '拼', '多多', '近日', '遭遇', '两大', '利空', ',', '一方面', ',', '拼', '多多', '被曝', '出现', '重大', '漏洞', ',', '引来', '大批', '用户', '“', '薅', '羊毛', '”', ',', '导致', '公司', '声誉', '和', '资金', '遭到', '重大', '损失', '。', '另一方面', ',', '拼', '多多', '股票', '的', '禁售', '期', '将于', '1', '月', '22', '日', '结束', '。', '届时', ',', '将', '有', '大量', '拼', '多多', '股东', '二级', '市场', '出售', '股票', '进行', '套现', '。', '虽然', '有', '很多', '人', '不', '喜欢', '拼', '多多', ',', '但是', '不得不', '说', '拼', '多多', '近来', '发展', '确实', '不错', ',', '而', '京东', '今年', '则', '比较', '坎坷', ',', '但', '在', '物流', '和服', '务', '方面', '还是', '口碑', '不错', '的', '。', '此前', '刘强', '东', '曾', '称', '京东', '和', '拼', '多多', '的', '商业', '模式', '不同', '。', '黄', '峥', '则', '回应', '称', '要', '多', '向', '电', '商', '前辈', '学习', '。']
['从', '一', '开始', '的', '不', '被', '看好', '到', '逆袭', ',', '拼', '多多', '只', '用', '了', '3', '年', '的', '时间', '。', '虽然', '上周', '的', '优惠', '券', '漏洞', '让', '拼', '多多', '被', '薅', '了', '千万', '羊毛', ',', '但却', '并未', '对其', '股价', '造成', '不利', '影响', '。', '25', '日', '拼', '多多', '市值', '报', '318', '.', '38', '亿', '美元', ',', '超过', '京东', '313', '.', '51', '亿', '的', '市值', '。', '拼', '多多', '股价', '周四', '大涨', '。', '截至', '收盘', '时', ',', '拼', '多多', '股价', '收', '于', '28', '.', '74', '美元', ',', '上涨', '2', '.', '07', '美元', ',', '涨幅', '达', '7', '.', '76', '%', '。', '而', '京东', '股价', '收', '于', '22', '.', '1', '美元', ',', '上涨', '0', '.', '13', '美元', ',', '涨幅', '为', '0', '.', '59', '%', ',', '成为', '**', '第二', '大', '电', '商', '平台', '。', '拼', '多多', '股价', '在', '过去', '的', '半年', '里', '波动', '强烈', ':', 'IPO', '定价', '于', '19', '美元', ',', '首', '日', '收盘', '价', '就', '达到', '26', '.', '70', '美元', ',', '涨幅', '达', '40', '.', '5', '%', ';', '之后', '不断', '下', '跌至', '17', '.', '22', '美元', ',', '然', '后就', '一路', '上涨', ',', '创造', '目前', '历史', '最', '高价', '30', '.', '48', '美元', ',', '但', '随后', '再次', '下跌', ',', '达到', '历史', '最低价', '16', '.', '53', '美元', ';', '此后', '公司', '股票', '在', '22', '美元', '附近', '震荡', '。', '拼', '多多', '股价', '波动', '强烈', '拼', '多多', '近日', '遭遇', '两大', '利空', ',', '一方面', ',', '拼', '多多', '被曝', '出现', '重大', '漏洞', ',', '引来', '大批', '用户', '“', '薅', '羊毛', '”', ',', '导致', '公司', '声誉', '和', '资金', '遭到', '重大', '损失', '。', '另一方面', ',', '拼', '多多', '股票', '的', '禁售', '期', '将于', '1', '月', '22', '日', '结束', '。', '届时', ',', '将', '有', '大量', '拼', '多多', '股东', '二级', '市场', '出售', '股票', '进行', '套现', '。', '虽然', '有', '很', '多人', '不', '喜欢', '拼', '多多', ',', '但是', '不得不', '说', '拼', '多多', '近来', '发展', '确实', '不错', ',', '而', '京东', '今年', '则', '比较', '坎坷', ',', '但', '在', '物流', '和', '服务', '方面', '还是', '口碑', '不错', '的', '。', '此前', '刘强', '东', '曾', '称', '京东', '和', '拼', '多多', '的', '商业', '模式', '不同', '。', '黄', '峥', '则', '回应', '称', '要', '多', '向', '电', '商', '前辈', '学习', '。']
The dictionary file comes from an NNLM Chinese pretrained word-embedding model on TensorFlow Hub; I set every word's frequency to 1 and saved them to word_list_nnlm_128.txt, which I have uploaded to Baidu Netdisk.
Link: https://pan.baidu.com/s/1C-z2mJl6y8qRZEFO1_cqeA
Access code: s9ia
Overall, the segmentations from jieba-fast and jieba do not differ much. However, I trained a sentiment classifier on top of the original jieba's output, and (perhaps because my model is not robust) the two libraries' segmentations of this sentence yield clearly different predictions: jieba gives 0.9 positive, jieba-fast 0.6 positive.
Your current compiled release contains a .pyd file, _jieba_fast_functions_py3.cp35-win_amd64.pyd.
If I drop it into site-packages as-is, importing fails:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\7q\Anaconda3\lib\site-packages\jieba_fast\__init__.py", line 16, in <module>
    from . import finalseg
  File "C:\Users\7q\Anaconda3\lib\site-packages\jieba_fast\finalseg\__init__.py", line 12, in <module>
    import _jieba_fast_functions_py3 as _jieba_fast_functions
ModuleNotFoundError: No module named '_jieba_fast_functions_py3'

After renaming the file to _jieba_fast_functions_py3.pyd and importing again, I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\7q\Anaconda3\lib\site-packages\jieba_fast\__init__.py", line 16, in <module>
    from . import finalseg
  File "C:\Users\7q\Anaconda3\lib\site-packages\jieba_fast\finalseg\__init__.py", line 12, in <module>
    import _jieba_fast_functions_py3 as _jieba_fast_functions
ImportError: DLL load failed: The specified module could not be found.

I am not familiar with this installation method (I have always taken the lazy Anaconda route). Did I do something wrong, or is this a problem with the Windows build?
Thanks!
It seems only the cut functionality was rewritten; in my tests textrank shows no speedup.
The original Python subpackages are included in full, so in principle it is safe to replace import jieba.
One side effect: using a subpackage triggers an import of jieba, so both jieba_fast's and jieba's default_logger are active, and the load messages are printed twice in the console.
Environment: macOS 10.13.3, Python 3.6.1
# running jieba.cut()
jieba_words = [word for word in jieba.cut(text, HMM=True) if 1 < len(word) <= 4]

# running jieba_fast.cut()
jb_fast_words = []
for word in jieba_fast.cut(text, HMM=True):
    if 1 < len(word) <= 4:
        jb_fast_words.append(word)
jieba.cut() runs fine, but jieba_fast.cut() raises the following error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 0: invalid start byte
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
<ipython-input-37-ab60a3468933> in <module>()
9 jb_fast_words = []
10
---> 11 for word in jieba_fast.cut(text, HMM=True):
12 if len(word) > 1 and len(word) <= 4:
13 jb_fast_words.append(word)
~/anaconda/envs/python3/lib/python3.6/site-packages/jieba_fast/__init__.py in cut(self, sentence, cut_all, HMM)
306 continue
307 if re_han.match(blk):
--> 308 for word in cut_block(blk):
309 yield word
310 else:
~/anaconda/envs/python3/lib/python3.6/site-packages/jieba_fast/__init__.py in __cut_DAG(self, sentence)
271 elif not self.FREQ.get(buf):
272 recognized = finalseg.cut(buf)
--> 273 for t in recognized:
274 yield t
275 else:
~/anaconda/envs/python3/lib/python3.6/site-packages/jieba_fast/finalseg/__init__.py in cut(sentence)
95 for blk in blocks:
96 if re_han.match(blk):
---> 97 for word in __cut(blk):
98 if word not in Force_Split_Words:
99 yield word
~/anaconda/envs/python3/lib/python3.6/site-packages/jieba_fast/finalseg/__init__.py in __cut(sentence)
67 def __cut(sentence):
68 global emit_P
---> 69 prob, pos_list = _jieba_fast_functions._viterbi(sentence, 'BMES', start_P, trans_P, emit_P)
70 begin, nexti = 0, 0
71 for i, char in enumerate(sentence):
SystemError: <built-in function _viterbi> returned a result with an error set
Attached is the text (UTF-8, chapters 1-50 of 西遊記). Thanks!
jttw_1-50.txt
Installed via pip3; my code mainly relies on posseg.
Timed with the command-line time utility while processing 4,691 short documents, and my program's accuracy dropped by about 0.01.
In other words, when using posseg.cut for segmentation and POS tagging, the custom dictionary has no effect at all, whereas the original Python jieba does not have this problem.
In the original jieba, the final step of the dynamic programming in calc() takes max over (score, end_index) tuples, so when several segmentation paths tie on score, the word with the larger end index is cut out. The implementation here differs slightly.
I do not know whether you still maintain this project, but if you see this report, please take a look. Thanks!
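For reference, the tie-breaking described above comes from upstream jieba's calc() maximizing over (score, end_index) tuples, which Python's max compares lexicographically. A self-contained sketch of that DP (simplified from upstream jieba, not jieba_fast's C code):

```python
from math import log

def calc(sentence, DAG, freq, total):
    """Right-to-left DP over the word DAG, as in upstream jieba's calc().

    route[idx] = (best score from idx to the end, end index x of the chosen
    word). Because the candidates are (score, x) tuples, max() falls back to
    comparing x on equal scores, so the word with the larger end index wins
    ties -- the behaviour the C reimplementation reportedly drops.
    """
    n = len(sentence)
    route = {n: (0.0, 0)}
    logtotal = log(total)
    for idx in range(n - 1, -1, -1):
        route[idx] = max(
            (log(freq.get(sentence[idx:x + 1]) or 1) - logtotal + route[x + 1][0], x)
            for x in DAG[idx]
        )
    return route

# tiny dictionary: "ab" outweighs "a" + "b", so position 0 picks end index 1
route = calc("ab", {0: [0, 1], 1: [1]}, {"a": 10, "b": 10, "ab": 100}, 120)
print(route[0])
# the tie-break itself, in isolation:
print(max((1.0, 0), (1.0, 2)))  # equal scores -> larger end index wins
```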
When segmenting the residential-compound name 和家欣苑, jieba produces:
['和', '家', '欣苑']
while jieba_fast produces:
['和家欣苑']
>>> import jieba
>>> jieba.lcut('和家欣苑')
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\asus\AppData\Local\Temp\jieba.cache
Loading model cost 0.999 seconds.
Prefix dict has been built succesfully.
['和', '家', '欣苑']
>>> import jieba_fast
>>> jieba_fast.lcut('和家欣苑')
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\asus\AppData\Local\Temp\jieba.cache
Loading model cost 1.000 seconds.
Prefix dict has been built succesfully.
['和家欣苑']
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/8t/z__z7fgj5rnfxbvmysdv7_rw0000gn/T/jieba.cache
Loading model cost 0.919 seconds.
Prefix dict has been built succesfully.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/macos/PycharmProjects/tensorflow/NLPTools/jiebaParallel.py", line 18, in <module>
words = "/ ".join(jieba.lcut(content))  # default precise mode
File "/Users/macos/anaconda3/envs/tensorflow/lib/python3.6/site-packages/jieba_fast/__init__.py", line 340, in lcut
return list(self.cut(*args, **kwargs))
File "/Users/macos/anaconda3/envs/tensorflow/lib/python3.6/site-packages/jieba_fast/__init__.py", line 308, in cut
for word in cut_block(blk):
File "/Users/macos/anaconda3/envs/tensorflow/lib/python3.6/site-packages/jieba_fast/__init__.py", line 273, in __cut_DAG
for t in recognized:
File "/Users/macos/anaconda3/envs/tensorflow/lib/python3.6/site-packages/jieba_fast/finalseg/__init__.py", line 97, in cut
for word in __cut(blk):
File "/Users/macos/anaconda3/envs/tensorflow/lib/python3.6/site-packages/jieba_fast/finalseg/__init__.py", line 69, in __cut
prob, pos_list = _jieba_fast_functions._viterbi(sentence, 'BMES', start_P, trans_P, emit_P)
SystemError: <built-in function _viterbi> returned a result with an error set
The file being segmented has over 25 million words across more than 220,000 lines. Is it failing because the file is too large?