tatomecab's Introduction

Tatomecab

A wrapper around mecab for the Tatoeba project.

Tatomecab is a set of tools that provide Japanese sentences with furigana.

tatomecab.py

A library that wraps MeCab and adds a few features (such as parsing the markers set by warifuri). It can also be run from the command line for quick testing, like mecab:

$ echo 振り仮名をつけろう | ./tatomecab.py
振	ふ
り	None
仮	が
名	な
を	None
つけろ	None
う	None
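
The output above can be consumed from another script. The following is a minimal sketch, assuming the two columns are separated by a tab and that a missing reading is printed as the literal string None, as the example suggests. The function name parse_tatomecab_output is hypothetical and not part of tatomecab.

```python
def parse_tatomecab_output(output):
    """Turn `text<TAB>reading` lines into (text, reading) pairs.

    Assumption: columns are tab-separated and "None" means "no reading".
    """
    pairs = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        text, _, reading = line.partition("\t")
        pairs.append((text, None if reading in ("", "None") else reading))
    return pairs


# Sample copied from the command-line example above.
sample = "振\tふ\nり\tNone\n仮\tが\n名\tな\nを\tNone\nつけろ\tNone\nう\tNone\n"
pairs = parse_tatomecab_output(sample)
```

Each pair keeps the surface text and either its reading or None, which is enough to decide where a ruby annotation is needed.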

webserver.py

Exposes the tatomecab library as a web service.

$ curl http://127.0.0.1:8842/furigana -G --data-urlencode str=振り仮名をつけろう
# Actual URL is http://127.0.0.1:8842/furigana?str=%E6%8C%AF%E3%82%8A%E4%BB%AE%E5%90%8D%E3%82%92%E3%81%A4%E3%81%91%E3%82%8D%E3%81%86
<?xml version="1.0" encoding="UTF-8"?>
<root>
<parse>
<token>
  <reading furigana="ふ"><![CDATA[振]]></reading>
  <![CDATA[り]]>
  <reading furigana="が"><![CDATA[仮]]></reading>
  <reading furigana="な"><![CDATA[名]]></reading>
</token>
<token><![CDATA[を]]></token>
<token><![CDATA[つけろ]]></token>
<token><![CDATA[う]]></token>
</parse>
</root>
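
A client can read this response with the standard library. The sketch below assumes the structure shown above: token elements that contain either plain text or a mix of plain text and reading elements carrying a furigana attribute. The sample document is an assumption built from the command-line example, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed sample response, built from the command-line example.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<root>
<parse>
<token>
  <reading furigana="ふ"><![CDATA[振]]></reading>
  <![CDATA[り]]>
  <reading furigana="が"><![CDATA[仮]]></reading>
  <reading furigana="な"><![CDATA[名]]></reading>
</token>
<token><![CDATA[を]]></token>
</parse>
</root>
""".encode("utf-8")


def token_segments(token):
    """Return (text, furigana) pairs for one token; furigana may be None."""
    segments = []
    if token.text and token.text.strip():
        segments.append((token.text.strip(), None))
    for reading in token.findall("reading"):
        segments.append((reading.text or "", reading.get("furigana")))
        # Text that follows a <reading> is stored in its tail.
        if reading.tail and reading.tail.strip():
            segments.append((reading.tail.strip(), None))
    return segments


def parse_furigana_xml(xml_bytes):
    """Parse a response body into one list of segments per token."""
    root = ET.fromstring(xml_bytes)
    return [token_segments(token) for token in root.iter("token")]


tokens = parse_furigana_xml(SAMPLE)
```

Plain text between reading elements ends up in the tail of the preceding element, which is why the sketch checks both text and tail.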

Warifuri

Warifuri is a script that edits the MeCab dictionary to insert markers in the reading field, so that each furigana is mapped to the character(s) it belongs to, enabling proper mono ruby and group ruby.
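
To illustrate how such markers could be consumed, here is a minimal sketch. It assumes the marker is the | character seen in the examples on this page, that a marked reading has one segment per character, and that an empty segment means the character shares the reading of the preceding group. The function name align_reading is hypothetical, and none of this is confirmed to be warifuri's actual format.

```python
def align_reading(text, marked_reading, marker="|"):
    """Pair characters with reading segments (assumed marker format).

    Returns a list of (base, reading) groups. If the number of segments does
    not match the number of characters, fall back to a single group ruby.
    """
    parts = marked_reading.split(marker)
    if len(parts) != len(text):
        return [(text, marked_reading.replace(marker, ""))]
    groups = []
    for char, part in zip(text, parts):
        if part == "" and groups:
            # Empty segment: extend the previous group (group ruby).
            base, reading = groups[-1]
            groups[-1] = (base + char, reading)
        else:
            groups.append((char, part))
    return groups


mono = align_reading("飲茶", "ヤム|チャ")
grouped = align_reading("100回", "ひゃっ|||かい")
fallback = align_reading("今日", "きょう")
```

With these assumptions, 飲茶 yields one group per kanji (mono ruby), 100回 yields the groups 100 and 回, and an unmarked reading stays a single group.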

tatomecab's Issues

Support readings in katakana

Currently no difference is made between hiragana and katakana in readings, while some readings require katakana:

  • 王蟲 オー|ム
  • 飲茶 ヤム|チャ
  • Letters:
    A エイ
    B ビー
    C シー
    D ディー
    E イー
    F エフ
    G ジー
    H エイチ
    H エッチ
    I アイ
    J ジェイ
    K ケイ
    L エル
    M エム
    N エヌ
    O オー
    P ピー
    Q キュー
    R アール
    S エス
    T ティー
    U ユー
    V ブイ
    W ダブリュー
    X エックス
    Y ワイ
    Z ゼット
  • Digits as English:
    0 ゼロ
    1 ワン
    2 ツー
    3 スリー
    4 フォー
    5 ファイブ
    6 シックス
    7 セブン
    8 エイト
    9 ナイン
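
One possible way to keep the distinction is to store the reading in its original script and only normalize it to hiragana when comparing. The sketch below relies on the fixed offset of 0x60 between the katakana block (U+30A1 to U+30F6) and the hiragana block in Unicode. The function names are hypothetical and this is not how tatomecab currently works.

```python
KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6
KANA_OFFSET = 0x60  # distance between a katakana and its hiragana counterpart


def is_katakana(char):
    return KATAKANA_START <= ord(char) <= KATAKANA_END


def katakana_to_hiragana(text):
    """Convert katakana to hiragana; other characters (such as ー) are kept."""
    return "".join(
        chr(ord(c) - KANA_OFFSET) if is_katakana(c) else c for c in text
    )


def has_katakana(reading):
    """True if the reading must be displayed in katakana, at least in part."""
    return any(is_katakana(c) for c in reading)


normalized = katakana_to_hiragana("ヤムチャ")
with_long_vowel = katakana_to_hiragana("オーム")
```

The long vowel mark ー has no hiragana counterpart in that range, so it is left unchanged.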

Random result when there are several ways of splitting furigana

At least these are split randomly when relying only on kanjidic:

大君 おおき|み おお|きみ
川原 かわ|ら か|わら
河合 か|わい かわ|い
平気 へ|いき へい|き
南波平 みなみ|なみ|ひら みな|みなみ|ひら
入江 い|りえ いり|え
好き嫌い すき|き|ら|い す|き|きら|い
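
The ambiguity can be detected by enumerating every split that a per-character reading table allows. The sketch below is only an illustration: the function name is hypothetical, and the toy table is built from the two splits of 川原 listed above rather than taken from kanjidic.

```python
def enumerate_splits(word, reading, readings_by_char):
    """Return every way to cut `reading` into one piece per character."""
    if not word:
        return [[]] if not reading else []
    results = []
    for candidate in readings_by_char.get(word[0], ()):
        if reading.startswith(candidate):
            rest = enumerate_splits(
                word[1:], reading[len(candidate):], readings_by_char
            )
            results.extend([candidate] + tail for tail in rest)
    return results


# Toy table (illustrative, not real kanjidic data).
TOY_READINGS = {"川": ["かわ", "か"], "原": ["ら", "わら"]}

splits = enumerate_splits("川原", "かわら", TOY_READINGS)
is_ambiguous = len(splits) > 1
```

When more than one split comes back, the word needs an extra rule or a manual entry instead of an arbitrary pick.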

Generate better readings for numbers

Currently mecab readings for numbers are almost completely wrong so we just remove them. We should instead try to generate correct readings.

A Python implementation that sounds good.

Also attach the number to its counter word, if there is one, so that it looks better, i.e. 100回{ひゃっ|||かい} instead of 100{ひゃっ}回{かい}.
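
As a rough idea of what generating readings involves, here is a minimal sketch for integers from 1 to 9999. It is not the implementation mentioned above. It encodes the usual sound changes for hundreds and thousands, but it does not handle the changes that occur before a counter (such as ひゃっかい for 100回) or numbers of 10000 and above. The function name is hypothetical.

```python
DIGITS = ["", "いち", "に", "さん", "よん", "ご", "ろく", "なな", "はち", "きゅう"]
# Irregular forms; every other multiple is digit + ひゃく / せん.
HUNDREDS = {1: "ひゃく", 3: "さんびゃく", 6: "ろっぴゃく", 8: "はっぴゃく"}
THOUSANDS = {1: "せん", 3: "さんぜん", 8: "はっせん"}


def number_reading(n):
    """Return a hiragana reading for an integer between 1 and 9999."""
    if not 1 <= n <= 9999:
        raise ValueError("only 1..9999 is supported in this sketch")
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if thousands:
        parts.append(THOUSANDS.get(thousands, DIGITS[thousands] + "せん"))
    if hundreds:
        parts.append(HUNDREDS.get(hundreds, DIGITS[hundreds] + "ひゃく"))
    if tens:
        # 10 is じゅう, not いちじゅう.
        parts.append(("" if tens == 1 else DIGITS[tens]) + "じゅう")
    if ones:
        parts.append(DIGITS[ones])
    return "".join(parts)


reading_100 = number_reading(100)
reading_3600 = number_reading(3600)
```

A complete solution would also need counter-specific rules, since the final sound of the number often changes depending on the counter that follows.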

Add readings for various length of okurigana

Build a list of special okurigana forms, like 落(おと)す for 落(お)とす, from NAIST and add them as jukujikun, so that 見落す gets split correctly.

Verbs are not the only type of words to consider:

  • 晴(はれ)やか
  • 代(かわ)り

Support for Python 3

We need webserver.py and tatomecab.py to support Python 3 so that they can integrate better with warifuri.

Regex module from Python 3.2 fails

The re module of Python 3.2 produces unexpected results, which makes warifuri fail. Python 3.5 works fine. The fundamental problem can be shown by this:

$ python
Python 3.5.0 (default, Sep 20 2015, 11:56:03) 
[GCC 5.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> print("%4X" % ord('人'))
4EBA
>>> re.findall(r'[^\u4E00-\u9FD5]', '人')
[]
>>> re.findall(r'[\u4E00-\u9FD5]', '人')
['人']

$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.findall(r'[^\u4E00-\u9FD5]', '人') # doesn’t work as expected
['人']
>>> re.findall(r'[\u4E00-\u9FD5]', '人') # doesn’t work as expected
[]
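
The re module only interprets \uXXXX escapes inside patterns since Python 3.3. With a raw string on 3.2, the escape reaches re unprocessed and is not read as a code point. A possible workaround, sketched below, is to use a regular (non-raw) string literal so that Python itself resolves the escapes before re sees the pattern. The constant names are hypothetical.

```python
import re

# Non-raw literals: the string parser turns \u4E00 and \u9FD5 into the actual
# characters, so the pattern does not depend on re's own escape handling.
KANJI = re.compile("[\u4E00-\u9FD5]")
NOT_KANJI = re.compile("[^\u4E00-\u9FD5]")

kanji_found = KANJI.findall("人")
non_kanji_found = NOT_KANJI.findall("人a")
```

This form should behave the same on both interpreter versions, as long as the pattern contains no other backslash sequences that need to stay raw.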

ビタミンE not split

This one should be split using the optimistic path.

$ echo ビタミンE | mecab -
ビタミンE 名詞,一般,*,*,*,*,ビタミンE,ビタミンイー,ビタミンイー,,

Support 々

Details about 々 by tommy_san:

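If there is a single 々, it should be enough to replace it with the preceding kanji and look that up in KANJIDIC. For example, 人々 (ひとびと) becomes 人 followed by 人.
Occasionally, 々々 repeats the two preceding characters, as in 一歩々々 (いっぽいっぽ) and 絶え々々 (たえだえ).
In addition, there are cases where several 々 in a row repeat the single preceding character multiple times, as in 前々々回 (ぜんぜんぜんかい), and cases like 南無阿弥陀仏々々々々々々 (なむあみだぶつなむあみだぶつ). The NAIST dictionary contains no such examples, so they can be left aside for now.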
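
The first two rules can be sketched as a preprocessing step that expands the marks before the per-character lookup. The function name is hypothetical. Runs of three or more 々 are left untouched, in line with the suggestion to ignore those cases for now.

```python
import re

_ITERATION_RUN = re.compile("々+")


def expand_iteration_marks(word):
    """Replace 々 with the character(s) it repeats.

    A single 々 repeats the preceding character; 々々 repeats the two
    preceding characters. Longer runs are returned unchanged.
    """
    out = []
    pos = 0
    for match in _ITERATION_RUN.finditer(word):
        out.append(word[pos:match.start()])
        prefix = "".join(out)
        count = len(match.group())
        if count <= 2 and len(prefix) >= count:
            out.append(prefix[-count:])
        else:
            out.append(match.group())  # unsupported case, keep as is
        pos = match.end()
    out.append(word[pos:])
    return "".join(out)


hitobito = expand_iteration_marks("人々")
ippoippo = expand_iteration_marks("一歩々々")
```

The expanded form is only meant for looking up readings; the furigana would still have to be mapped back onto the original characters, including 々.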
