# Note

Jseg currently has many problems on Python 2, so development is suspended for now.

If you need it, please refer to the Python 3 version: Jseg3


Jseg

A modified version of the Jieba segmenter

Synopsis

  • Equipped with emoticon detection
  • Emoticons are not segmented into sequences of meaningless punctuation marks.

  • Trained on the Sinica Corpus
  • Results are more accurate on Traditional Chinese text (F1-score = 0.91).

  • Uses the Brill tagger
  • The tagger is trained on the Sinica Treebank, which raises the accuracy of POS tagging.

Installation

```
(sudo) pip install git+https://github.com/amigcamel/Jseg.git
(sudo) pip install setuptools==9.1
(sudo) pip install -I nltk==2.0.4
(sudo) pip install --upgrade setuptools
```

Usage

```
from jseg.jieba import Jieba
jieba = Jieba()
```

Here's a sample text:

```
sample = '''台灣大學語言學研究所LOPE實驗室超強
Taco門神超罩
Amber 和 Emily 是雙胞胎
Yvonne 不是小老鼠
期末要爆炸啦! ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣
'''
```

Segmentation:

```
result = jieba.seg(sample)
```

Print out:

```
print result.text
```

And the result:

```
台灣/Nca 大學/Ncb 語言學/Nad 研究所/Ncb LOPE/FW 實驗室/Ncb 超強/VH11
Taco/FW 門神/Nad 超罩/VH14
Amber/FW 和/Caa Emily/FW 是/V_11 雙胞胎/DM
Yvonne/FW 不/Dc 是/V_11 小老鼠/Nab
期末/Ng 要/Dbab 爆炸/VH11 啦/Tc !/PUNCTUATION ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣/EMOTICON
```

~~You can print out the result with colored POS tagging:~~

Print out without POS tagging:

    print result.nopos

Result:

    台灣 大學 語言學 研究所 LOPE 實驗室 超強
    Taco 門神 超罩
    Amber 和 Emily 是 雙胞胎
    Yvonne 不 是 小老鼠
    期末 要 爆炸 啦 ! ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣
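
The tagged output shown earlier is plain text in `word/TAG` form, so it can be post-processed without Jseg. The following is a minimal sketch written for Python 3, not part of Jseg itself: `parse_tagged` is a hypothetical helper name, and it assumes that tokens are separated by whitespace and that the tag follows the last `/` in each token.

```python
def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs, one list per input line."""
    lines = []
    for line in text.splitlines():
        pairs = []
        for token in line.split():
            # Split on the LAST slash so a token that itself contains '/'
            # (as some emoticons might) keeps its full surface form.
            word, sep, tag = token.rpartition("/")
            if not sep:
                # No slash found: keep the token and leave the tag unknown.
                word, tag = token, None
            pairs.append((word, tag))
        if pairs:
            lines.append(pairs)
    return lines


# Two lines copied from the tagged result above.
tagged = "Taco/FW 門神/Nad 超罩/VH14\nYvonne/FW 不/Dc 是/V_11 小老鼠/Nab"
parsed = parse_tagged(tagged)
print(parsed[0])
```

An emoticon that contains whitespace would be split into several tokens by this sketch, so it is only suitable for output like the sample shown here.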

If you want the result to be in a list, set mode to list:

### Add a user-defined dictionary

    jieba.add_guaranteed_wordlist(lst)

`lst` should be a list of unicode strings, e.g., `[u'蟹老闆', u'張他口', u'愛米粒', u'劉阿吉']`
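
If the words live in a text file rather than in code, a list like this can be built first. The following is a minimal sketch under stated assumptions: `load_wordlist` is a hypothetical helper (not part of Jseg), the input is assumed to hold one word per line in UTF-8, and an in-memory stream stands in for a real file so the block runs without Jseg installed.

```python
import io


def load_wordlist(stream):
    """Return the non-empty, whitespace-stripped lines of a text stream as a list."""
    words = []
    for line in stream:
        word = line.strip()
        if word:  # skip blank lines
            words.append(word)
    return words


# In-memory stand-in for a file opened with io.open(path, encoding="utf-8"),
# which yields unicode strings on both Python 2 and Python 3.
demo = io.StringIO(u"蟹老闆\n張他口\n\n愛米粒\n劉阿吉\n")
words = load_wordlist(demo)
print(words)

# With Jseg installed, the list would then be passed on as shown above:
# jieba.add_guaranteed_wordlist(words)
```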
