
tatomecab's People

Contributors

jiru

tatomecab's Issues

Generate better readings for numbers

Currently, MeCab's readings for numbers are almost completely wrong, so we just remove them. We should instead try to generate correct readings.

A Python implementation that sounds good.

Also attach the number to its counter, if there is one, so that the result looks better, e.g. 100回{ひゃっ|||かい} instead of 100{ひゃっ}回{かい}.
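
As a starting point, generating readings for plain integers could be sketched as follows. The helper name `number_reading`, the supported range (0 to 9999) and the choice of ぜろ for zero are assumptions made for illustration. The sketch covers the sound changes inside hundreds and thousands, but not the changes triggered by a following counter (such as ひゃっ in 100回).

```python
# Hypothetical helper, not part of tatomecab: hiragana reading of an integer.
DIGITS = ["", "いち", "に", "さん", "よん", "ご", "ろく", "なな", "はち", "きゅう"]
# Irregular forms; every other digit is simply digit + ひゃく / せん.
HUNDREDS = {1: "ひゃく", 3: "さんびゃく", 6: "ろっぴゃく", 8: "はっぴゃく"}
THOUSANDS = {1: "せん", 3: "さんぜん", 8: "はっせん"}


def number_reading(n):
    """Return a hiragana reading for 0 <= n <= 9999."""
    if not 0 <= n <= 9999:
        raise ValueError("only 0..9999 is supported in this sketch")
    if n == 0:
        return "ぜろ"  # assumption: れい would be equally valid
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if thousands:
        parts.append(THOUSANDS.get(thousands, DIGITS[thousands] + "せん"))
    if hundreds:
        parts.append(HUNDREDS.get(hundreds, DIGITS[hundreds] + "ひゃく"))
    if tens:
        parts.append("じゅう" if tens == 1 else DIGITS[tens] + "じゅう")
    parts.append(DIGITS[ones])
    return "".join(parts)
```

For example, `number_reading(3600)` gives さんぜんろっぴゃく. Counter-specific forms would need an extra lookup keyed on the counter.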

Support for Python 3

We need webserver.py and tatomecab.py to support Python 3 so that they can integrate better with warifuri.

Support 々

Details about 々 by tommy_san:

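If there is a single 々, it should be enough to replace it with the preceding kanji and look that up in KANJIDIC. For example, 人々 (ひとびと) becomes 人 followed by 人.
Occasionally, as in 一歩々々 (いっぽいっぽ) or 絶え々々 (たえだえ), 々々 repeats the preceding two characters.
There are also cases such as 前々々回 (ぜんぜんぜんかい), where several 々 in a row repeat the single preceding character several times, and cases like 南無阿弥陀仏々々々々々々 (なむあみだぶつなむあみだぶつ), but the NAIST dictionary contains no such examples, so we do not need to consider them for now.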
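
The two cases described above could be sketched like this. The function name `expand_iteration_marks` is hypothetical, and runs of three or more 々 are deliberately left untouched, following the remark that they can be ignored for now.

```python
# Hypothetical helper, not part of tatomecab.
MARK = "々"


def expand_iteration_marks(word):
    """Replace 々 with the character(s) it repeats.

    A single 々 repeats the previous character; a run of exactly two
    repeats the previous two characters. Longer runs are kept as-is.
    """
    out = []
    i = 0
    while i < len(word):
        if word[i] != MARK:
            out.append(word[i])
            i += 1
            continue
        # Measure the length of the run of consecutive 々.
        j = i
        while j < len(word) and word[j] == MARK:
            j += 1
        run = j - i
        if run == 1 and out:
            out.append(out[-1])
        elif run == 2 and len(out) >= 2:
            out.extend(out[-2:])
        else:
            out.extend(MARK * run)  # unsupported case: leave unchanged
        i = j
    return "".join(out)
```

With this, 人々 becomes 人人 and 一歩々々 becomes 一歩一歩, so each character can then be looked up in KANJIDIC individually.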

Random result when there are several ways of splitting furigana

At least the following words are split randomly when relying only on KANJIDIC:

大君 おおき|み おお|きみ
川原 かわ|ら か|わら
河合 か|わい かわ|い
平気 へ|いき へい|き
南波平 みなみ|なみ|ひら みな|みなみ|ひら
入江 い|りえ いり|え
好き嫌い すき|き|ら|い す|き|きら|い
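
To illustrate how such ambiguity arises, here is a minimal sketch that enumerates every way of distributing a reading over the characters of a word and returns the candidates in a fixed order. The function `all_splits` and the tiny reading table are hypothetical; the table is not real KANJIDIC data, and this is not necessarily how tatomecab searches for splits.

```python
# Hypothetical sketch: enumerate all furigana splits deterministically.
def all_splits(word, reading, table):
    """Return every split of `reading` over the characters of `word`.

    `table` maps a character to the set of readings allowed for it.
    The result is sorted, so the order does not depend on set iteration.
    """
    if not word:
        # A split is only valid if the whole reading was consumed.
        return [[]] if not reading else []
    results = []
    for candidate in table.get(word[0], ()):
        if reading.startswith(candidate):
            for rest in all_splits(word[1:], reading[len(candidate):], table):
                results.append([candidate] + rest)
    return sorted(results)


# Illustrative readings only.
TABLE = {"川": {"かわ", "か"}, "原": {"ら", "わら"}}
splits = all_splits("川原", "かわら", TABLE)
```

Here `splits` holds both か|わら and かわ|ら. Picking one of them with an explicit rule (for example a preference for more common readings) rather than taking whichever comes first would make the output reproducible.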

Support readings with katakana

Currently, no distinction is made between hiragana and katakana in readings, but some readings require katakana:

  • 王蟲 オー|ム
  • 飲茶 ヤム|チャ
  • Letters:
    A エイ
    B ビー
    C シー
    D ディー
    E イー
    F エフ
    G ジー
    H エイチ
    H エッチ
    I アイ
    J ジェイ
    K ケイ
    L エル
    M エム
    N エヌ
    O オー
    P ピー
    Q キュー
    R アール
    S エス
    T ティー
    U ユー
    V ブイ
    W ダブリュー
    X エックス
    Y ワイ
    Z ゼット
  • Digits as English:
    0 ゼロ
    1 ワン
    2 ツー
    3 スリー
    4 フォー
    5 ファイブ
    6 シックス
    7 セブン
    8 エイト
    9 ナイン
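
If readings are stored as hiragana internally, one way to emit katakana for entries like these is a code-point shift, since the hiragana block (U+3041 to U+3096) maps onto katakana at an offset of 0x60. The helper name `hira_to_kata` is an assumption for illustration.

```python
# Hypothetical helper, not part of tatomecab.
def hira_to_kata(text):
    """Convert hiragana to katakana, leaving other characters untouched."""
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c
        for c in text
    )
```

For instance, `hira_to_kata("やむちゃ")` returns ヤムチャ. The long vowel mark ー is outside the hiragana range and is passed through unchanged. Deciding which words need katakana still requires a separate list such as the one above.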

Regex module from Python 3.2 fails

Python’s re module produces unexpected results, which makes warifuri fail. Python 3.5 works fine. The fundamental problem can be shown by this:

$ python
Python 3.5.0 (default, Sep 20 2015, 11:56:03) 
[GCC 5.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> print("%4X" % ord('人'))
4EBA
>>> re.findall(r'[^\u4E00-\u9FD5]', '人')
[]
>>> re.findall(r'[\u4E00-\u9FD5]', '人')
['人']

$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.findall(r'[^\u4E00-\u9FD5]', '人') # doesn’t work as expected
['人']
>>> re.findall(r'[\u4E00-\u9FD5]', '人') # doesn’t work as expected
[]
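
The difference comes from the fact that the re module only started to interpret `\u` escapes inside patterns in Python 3.3. With a raw string on 3.2, the pattern is read as a class of literal characters instead of a code-point range. A possible workaround, sketched here with hypothetical constant names, is to use a non-raw string so that the Python parser itself expands the escapes before re sees the pattern:

```python
import re

# Non-raw strings: "\u4E00" is already a single character when re compiles it,
# so this does not rely on re understanding \u escapes.
KANJI = "[\u4E00-\u9FD5]"
NOT_KANJI = "[^\u4E00-\u9FD5]"

kanji_found = re.findall(KANJI, "人")
others_found = re.findall(NOT_KANJI, "人")
```

Any other backslash in such a pattern then needs to be doubled, since the string is no longer raw.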

Add readings for various length of okurigana

Build a list of special okurigana variants from the NAIST dictionary, such as 落(おと)す for 落(お)とす, and add them as jukujikun so that 見落す gets split correctly.

Verbs are not the only type of words to consider:

  • 晴(はれ)やか
  • 代(かわ)り

ビタミンE not split

This one should be split using the optimistic path.

$ echo ビタミンE | mecab -
ビタミンE 名詞,一般,*,*,*,*,ビタミンE,ビタミンイー,ビタミンイー,,
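
The expected split can be illustrated with a small sketch. Both function names are hypothetical, and this is not a description of the optimistic path itself. It takes the reading from the eighth feature field, as in the output above, and peels off the part of the surface that is identical to its own reading.

```python
# Hypothetical helpers, not part of tatomecab.
def surface_and_reading(line):
    """Extract the surface form and reading from one line of MeCab output."""
    surface, features = line.split(None, 1)
    return surface, features.split(",")[7]


def split_shared_prefix(surface, reading):
    """Split off the leading kana that the surface and reading share."""
    i = 0
    while i < min(len(surface), len(reading)) and surface[i] == reading[i]:
        i += 1
    pairs = []
    if i:
        pairs.append((surface[:i], reading[:i]))
    if surface[i:]:
        pairs.append((surface[i:], reading[i:]))
    return pairs


line = "ビタミンE 名詞,一般,*,*,*,*,ビタミンE,ビタミンイー,ビタミンイー,,"
pairs = split_shared_prefix(*surface_and_reading(line))
```

Here `pairs` separates ビタミン, which needs no furigana, from E with the reading イー.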
