
jseg's Introduction

Jseg

A modified version of Jieba

All credit goes to fxsjy/jieba.
Find more on: https://github.com/fxsjy/jieba

Synopsis

  1. Emoticon detection
    Emoticons are kept intact instead of being segmented into sequences of meaningless punctuation marks.

  2. Trained on the Sinica Corpus
    Results are more accurate for Traditional Chinese (F1-score = 0.91).

  3. Brill tagger
    The tagger is trained on the Sinica Treebank, which improves the accuracy of POS tagging.

Environment

  • Python 2.7+
  • Python 3.3+

Installation

pip install -U jseg

Usage

from jseg import Jieba
j = Jieba()

Add a user-defined dictionary

j.add_guaranteed_wordlist(lst)

Here's a sample text:

sample = '期末要爆炸啦! ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣'

Segmentation with POS (part-of-speech)

j.seg(sample, pos=True)

jseg's People

Contributors

amigcamel, henryyang42


jseg's Issues

Error in seg after add_guaranteed_wordlist

UnboundLocalError                         Traceback (most recent call last)
<ipython-input-27-3b13136d5e16> in <module>()
----> 1 j.seg(sample, pos=True)

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/jseg/jieba.py in seg(self, text, pos)
    277         gws = sorted(gws, key=lambda x: len(x), reverse=True)  # 長詞優先
    278         for gw in gws:
--> 279             if gw in sentence:
    280                 text = sentence.replace(gw, self._gw[gw])
    281

UnboundLocalError: local variable 'sentence' referenced before assignment

This error occurs when seg is called after adding a user-defined dictionary.
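The traceback suggests that `sentence` is read before anything is assigned to it, while the function argument is named `text`. A minimal sketch of the same longest-word-first replacement, written so that only one variable is used, is given here. The function name `protect_guaranteed_words` and the `placeholders` mapping (standing in for `self._gw`) are hypothetical; this is not the project's confirmed fix.

```python
# -*- coding: utf-8 -*-
def protect_guaranteed_words(text, placeholders):
    """Replace each guaranteed word found in text with its placeholder.

    `placeholders` is a hypothetical stand-in for jseg's internal `self._gw`.
    """
    # Longest words first, so a short word cannot split a longer one.
    for gw in sorted(placeholders, key=len, reverse=True):
        if gw in text:
            # Read from and assign to the same variable on every pass.
            text = text.replace(gw, placeholders[gw])
    return text


placeholders = {u"期末": u"_gw0_", u"期末考": u"_gw1_"}
protected = protect_guaranteed_words(u"期末考要爆炸啦", placeholders)
```

Because the loop reads and writes `text` only, the function also works when no guaranteed word occurs in the input.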

POS label "NN"

I was wondering how I should interpret the POS label "NN", as in the following:

宜蘭縣(Nca) 中道(NN) 小學(Ncb) 四年級(NN) 學生(Nab) 陳卉(NN) 溱(NN) 則(Dbb) 挑戰(VC2) 水墨畫(NN) ,(NN) 以(P11) 作品(Nac) 〈(NN) 猴子(Nab) 〉(NN) 獲得(VJ3) 特優(NN) 。(NN) 她(Nhaa) 表示(VE2) ,(NN) 最(Dfa) 難畫(NN) 的(DE) 是(V_11) 猴子(Nab) 的(DE) 毛(Nab) ,(NN) 最好(Dbb) 畫(VC31) 的(DE) 是(V_11) 果實(Nab) 。(NN) 她(Nhaa) 畫(VC31) 了(Di) 一(Neu) 隻(DM) 母猴(NN) 帶(VC32) 著(Di) 兩(Neu) 隻(DM) 小(VH13) 猴子(Nab) ,(NN) 第一(Neu) 次(Nfa) 參賽(VA4) 就(Dd) 獲得(VJ3) 特優(NN) 大獎(NN) ,(NN) 很(Dfa) 開心(VH21) 。(NN)

"NN" seems to label punctuation marks and some nouns, but it doesn't seem to be in the official CKIP tagset? I'm using version 0.0.4. Please advise.
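To check which tokens receive "NN", the tagged output can be split back into word and tag pairs. The sketch below assumes the `word(TAG)` format shown above, with tokens separated by spaces; `parse_tagged` is a hypothetical helper, not part of jseg.

```python
# -*- coding: utf-8 -*-
from collections import Counter


def parse_tagged(line):
    """Split 'word(TAG) word(TAG) ...' into (word, tag) pairs.

    Assumes every token is well formed and ends with '(TAG)'.
    """
    pairs = []
    for token in line.split():
        # Drop the trailing ')' and split on the last '(' so that
        # words that are themselves punctuation are handled.
        word, _, tag = token[:-1].rpartition("(")
        pairs.append((word, tag))
    return pairs


tagged = u"宜蘭縣(Nca) 中道(NN) 小學(Ncb) 水墨畫(NN) ,(NN) 獲得(VJ3) 特優(NN) 。(NN)"
pairs = parse_tagged(tagged)
nn_words = [word for word, tag in pairs if tag == "NN"]
tag_counts = Counter(tag for _, tag in pairs)
```

Listing `nn_words` for a larger sample makes it easier to see whether "NN" covers only punctuation and unrecognized nouns or other categories as well.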

Add lazy load mechanism

When Jieba is initialized, the dictionary is loaded immediately:
j = Jieba()

A lazy-loading mechanism should be added so that the dictionary is loaded only when it is first needed.
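One common way to defer the load is to keep the dictionary behind a property that loads it on first access. The sketch below illustrates the idea only; `LazySegmenter`, its `loader` argument, and `load_default_dictionary` are hypothetical names and do not reflect jseg's actual internals.

```python
# -*- coding: utf-8 -*-
class LazySegmenter(object):
    """Hypothetical stand-in for Jieba that defers dictionary loading."""

    def __init__(self, loader):
        self._loader = loader  # callable that returns the dictionary
        self._dic = None       # nothing is loaded at construction time

    @property
    def dic(self):
        # Load on first access, then reuse the cached result.
        if self._dic is None:
            self._dic = self._loader()
        return self._dic


calls = []


def load_default_dictionary():
    """Toy loader that records how many times it was called."""
    calls.append(1)
    return {u"期末": 10, u"爆炸": 5}


seg = LazySegmenter(load_default_dictionary)
loads_after_init = len(calls)   # no load has happened yet
freq = seg.dic[u"期末"]          # first access triggers the load
seg.dic                         # second access reuses the cache
loads_after_use = len(calls)
```

With this layout, creating the object is cheap and the loading cost is paid once, on the first call that needs the dictionary.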

UnicodeDecodeError

I want to use the Jieba segmenter for Chinese, but it raises this UnicodeDecodeError:

from jseg.jieba import Jieba
j = Jieba()

DEBUG:jseg.jieba:loading default dictionary
Traceback (most recent call last):
  File "<pyshell#58>", line 1, in <module>
    j = Jieba()
  File "C:\Python34\jseg\jieba.py", line 89, in __init__
    self._gen_trie()
  File "C:\Python34\jseg\jieba.py", line 122, in _gen_trie
    dic = self._load_dic()
  File "C:\Python34\jseg\jieba.py", line 101, in _load_dic
    raw = tf.read()
  File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 44: character maps to <undefined>
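The traceback shows the file being decoded with cp1252, the default locale encoding on this Windows setup, which suggests the dictionary was opened without an explicit encoding. A minimal sketch of reading a text file with a fixed encoding follows. It assumes the dictionary file is UTF-8 encoded, which the source does not state; `read_dictionary` and the temporary file are hypothetical and only illustrate the technique.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_dictionary(path):
    """Read a text file as UTF-8 regardless of the platform default."""
    # io.open accepts `encoding` on both Python 2.7 and Python 3.
    with io.open(path, encoding="utf-8") as f:
        return f.read()


# Demo with a throwaway file standing in for the real dictionary.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"期末 10\n爆炸 5\n")
raw = read_dictionary(path)
os.remove(path)
```

Passing the encoding explicitly keeps the result the same on Windows, Linux, and macOS, since the locale default is no longer consulted.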
