#Note
If needed, see the Python 3 version: Jseg3
A modified version of the Jieba segmenter
- Equipped with emoticon detection
- Trained on the Sinica Corpus
- Uses the Brill Tagger
Emoticons will not be segmented as sequences of meaningless punctuation.
Results are more accurate when dealing with Traditional Chinese (F1-score = 0.91).
The model is trained on the Sinica Treebank, which improves the accuracy of POS tagging.
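To illustrate the emoticon handling described above, here is a minimal sketch of one way to keep emoticons intact before segmentation. It is not Jseg's actual implementation: the `EMOTICONS` list and the `split_keep_emoticons` helper are hypothetical names introduced only for this example.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical emoticon inventory; a real detector would use a far larger list.
EMOTICONS = [u'╰(〒皿〒)╯', u'(>_<)', u':-)']


def split_keep_emoticons(text, emoticons=EMOTICONS):
    """Split text into chunks so that each known emoticon is its own chunk."""
    # Longest emoticons first, so a longer one is preferred over a shorter prefix.
    ordered = sorted(emoticons, key=len, reverse=True)
    # The capturing group makes re.split keep the matched emoticons in the output.
    pattern = u'(' + u'|'.join(re.escape(e) for e in ordered) + u')'
    return [chunk for chunk in re.split(pattern, text) if chunk]
```

Each non-emoticon chunk could then be passed to the segmenter, while emoticon chunks are emitted as single tokens.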
To print the result without POS tags:
print result.nopos
Result:
台灣 大學 語言學 研究所 LOPE 實驗室 超強
Taco 門神 超罩
Amber 和 Emily 是 雙胞胎
Yvonne 不 是 小老鼠
期末 要 爆炸 啦 ! ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣
If you want the result to be in a list, set mode to list:
###Add a user-defined dictionary
jieba.add_guaranteed_wordlist(lst)
lst should be a list of unicode strings, e.g., [u'蟹老闆', u'張他口', u'愛米粒', u'劉阿吉']
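If the words are kept in a file, the list can be built by decoding each line to unicode first. The following is a minimal sketch, assuming a UTF-8 text file with one word per line; `load_wordlist` and the file name `userdict.txt` are hypothetical and not part of Jseg.

```python
# -*- coding: utf-8 -*-
import io


def load_wordlist(path):
    """Read one word per line from a UTF-8 file and return unicode strings."""
    # io.open decodes to unicode on both Python 2 and Python 3.
    with io.open(path, encoding='utf-8') as f:
        words = [line.strip() for line in f]
    # Drop blank lines so only real words are returned.
    return [w for w in words if w]


# Assumed usage, with the call shown in this section:
# lst = load_wordlist('userdict.txt')  # hypothetical file name
# jieba.add_guaranteed_wordlist(lst)
```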