#Note

Jseg currently has many problems on Python 2, and development is suspended for now.

If you need it, please refer to the Python 3 version: Jseg3

#Jseg

A modified version of the Jieba segmenter

##Synopsis

Equipped with emoticon detection

Emoticons are kept as single tokens instead of being segmented into sequences of meaningless punctuation marks.

Trained on the Sinica Corpus

Results are more accurate when dealing with Traditional Chinese (F1-score = 0.91).

Using Brill Tagger

The tagger is trained on the Sinica Treebank, which raises the accuracy of POS tagging.
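The README does not say how the F1-score above was computed. As a rough illustration only, the sketch below shows the usual word-level F1 for segmentation: each word is mapped to its character span, and a predicted word counts as correct only if the same span appears in the gold segmentation. The function names `spans` and `segmentation_f1` are made up for this example and are not part of Jseg; the code is written for Python 3.

```python
def spans(words):
    """Map a list of words to the set of (start, end) character offsets."""
    result, pos = set(), 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


def segmentation_f1(gold, predicted):
    """Word-level F1 between a gold and a predicted segmentation.

    Both arguments are lists of words covering the same text.
    """
    gold_spans, pred_spans = spans(gold), spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted segmentations of the same string.
gold = ["台灣", "大學", "語言學", "研究所"]
predicted = ["台灣大學", "語言學", "研究所"]
print(segmentation_f1(gold, predicted))
```

Here two of the three predicted words match gold spans, so precision is 2/3 and recall is 2/4.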

##Installation

```
(sudo) pip install git+https://github.com/amigcamel/Jseg.git
(sudo) pip install setuptools==9.1
(sudo) pip install -I nltk==2.0.4
(sudo) pip install --upgrade setuptools
```

##Usage

```
from jseg.jieba import Jieba
jieba = Jieba()
```

Here's a sample text:

```
sample = '''台灣大學語言學研究所LOPE實驗室超強
Taco門神超罩
Amber 和 Emily 是雙胞胎
Yvonne 不是小老鼠
期末要爆炸啦！ ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣
'''
```

Segmentation:

```
result = jieba.seg(sample)
```

Print out:

```
print result.text
```

And the result:

```
台灣/Nca 大學/Ncb 語言學/Nad 研究所/Ncb LOPE/FW 實驗室/Ncb 超強/VH11
Taco/FW 門神/Nad 超罩/VH14
Amber/FW 和/Caa Emily/FW 是/V_11 雙胞胎/DM
Yvonne/FW 不/Dc 是/V_11 小老鼠/Nab
期末/Ng 要/Dbab 爆炸/VH11 啦/Tc ！/PUNCTUATION ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣/EMOTICON
```

~~You can print out the result with colored POS tagging:~~

Print out without POS tagging:

    print result.nopos

Result:

    台灣 大學 語言學 研究所 LOPE 實驗室 超強
    Taco 門神 超罩
    Amber 和 Emily 是 雙胞胎
    Yvonne 不 是 小老鼠
    期末 要 爆炸 啦 ！ ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣
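If you need the tagged output as data rather than as a string, one option is to split the `word/TAG` tokens yourself. The sketch below assumes the tagged text has the format shown earlier: tokens separated by whitespace, one input line per output line, and the tag after the last `/`. The helper `parse_tagged` is hypothetical and not part of Jseg; the code is written for Python 3 and works on any string in that format.

```python
def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs, one list per line."""
    lines = []
    for line in text.splitlines():
        pairs = []
        for token in line.split():
            # Split on the LAST slash so a word that itself contains '/'
            # keeps its slash and only the tag is cut off.
            word, sep, tag = token.rpartition("/")
            if not sep:
                # No slash at all: keep the token and leave the tag empty.
                word, tag = token, None
            pairs.append((word, tag))
        if pairs:
            lines.append(pairs)
    return lines


# A shortened copy of the tagged output shown earlier.
tagged = "Taco/FW 門神/Nad 超罩/VH14\n啦/Tc ！/PUNCTUATION ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣/EMOTICON"
for pairs in parse_tagged(tagged):
    print(pairs)
```

Splitting on the last slash keeps the emoticon intact as a single word with the tag `EMOTICON`.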

~~If you want the result to be in a list, set mode to list:~~

###Add a user-defined dictionary

    jieba.add_guaranteed_wordlist(lst)

`lst` should be a list of unicode strings, e.g., `[u'蟹老闆', u'張他口', u'愛米粒', u'劉阿吉']`
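For longer word lists it may be more convenient to keep the words in a file. The sketch below assumes a UTF-8 text file with one word per line; the helper `load_wordlist` and the file name `userdict.txt` are hypothetical and not part of Jseg. It only builds the list; the call to `add_guaranteed_wordlist` is left as a comment.

```python
import io


def load_wordlist(path):
    """Read one word per line from a UTF-8 file, skipping blanks and duplicates."""
    seen, words = set(), []
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip()
            if word and word not in seen:
                seen.add(word)
                words.append(word)
    return words


# Assumed usage with the call documented above:
# lst = load_wordlist("userdict.txt")   # hypothetical file name
# jieba.add_guaranteed_wordlist(lst)
```

`io.open` with an explicit encoding returns unicode strings on both Python 2 and Python 3, which matches the "list of unicode strings" requirement.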
