tatoeba / tatomecab

A wrapper around mecab for the Tatoeba project (https://tatoeba.org/)

License: GNU Affero General Public License v3.0

Shell 2.55% Python 97.45%

tatomecab's Issues

Generate better readings for numbers

Currently, mecab readings for numbers are almost completely wrong, so we just remove them. We should instead try to generate correct readings.

A Python implementation that sounds good.

Also attach the number to the counter that follows it, if any, so that the result looks better, e.g. 100回｛ひゃっ｜｜｜かい｝ instead of 100｛ひゃっ｝回｛かい｝.
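As a rough illustration of what generating readings could look like, here is a minimal sketch, not the implementation referred to above. The names `number_reading` and `format_group` are hypothetical, only 0 to 9999 is covered, and sound changes triggered by a following counter (such as ひゃく becoming ひゃっ before 回) are not handled. `format_group` only shows how a number and its counter could share one reading group in the notation used in this issue.

```python
# Hypothetical helpers; a sketch covering 0..9999 only.
DIGITS = ["", "いち", "に", "さん", "よん", "ご", "ろく", "なな", "はち", "きゅう"]
# Irregular forms (rendaku and gemination) for hundreds and thousands.
HUNDREDS = {1: "ひゃく", 3: "さんびゃく", 6: "ろっぴゃく", 8: "はっぴゃく"}
THOUSANDS = {1: "せん", 3: "さんぜん", 8: "はっせん"}


def number_reading(n):
    """Return a hiragana reading for an integer between 0 and 9999."""
    if not 0 <= n <= 9999:
        raise ValueError("only 0..9999 is handled in this sketch")
    if n == 0:
        return "ゼロ"
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if thousands:
        parts.append(THOUSANDS.get(thousands, DIGITS[thousands] + "せん"))
    if hundreds:
        parts.append(HUNDREDS.get(hundreds, DIGITS[hundreds] + "ひゃく"))
    if tens:
        parts.append("じゅう" if tens == 1 else DIGITS[tens] + "じゅう")
    if ones:
        parts.append(DIGITS[ones])
    return "".join(parts)


def format_group(text, readings):
    """Join one reading per character into a single furigana group."""
    if len(text) != len(readings):
        raise ValueError("need exactly one reading (possibly empty) per character")
    return text + "｛" + "｜".join(readings) + "｝"


print(number_reading(3600))
print(format_group("100回", ["ひゃっ", "", "", "かい"]))
```

Keeping the irregular forms in small lookup tables makes it easy to add counter-specific exceptions later without touching the decomposition logic.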

Support for Python 3

We need webserver.py and tatomecab.py to support Python 3 so that they can integrate better with warifuri.

Spaces removed

Spaces are removed by mecab. As a result, autogenerated transcriptions are invalid.

$ echo "空白！ 空白！" | ./tatomecab.py 
空白  くうはく
！ None
空白  くうはく
！ None
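One possible way around this is to re-align the tokens with the original text and emit whatever mecab skipped as reading-less tokens. The sketch below assumes the tokens are available as `(surface, reading)` pairs in input order; `reinsert_gaps` is a hypothetical name and not part of tatomecab.

```python
def reinsert_gaps(text, tokens):
    """Re-insert text dropped by the tokenizer (e.g. spaces) between tokens.

    `tokens` is a list of (surface, reading) pairs in the order they
    appear in `text`. Skipped spans are emitted with a reading of None.
    """
    result = []
    pos = 0
    for surface, reading in tokens:
        start = text.find(surface, pos)
        if start < 0:
            raise ValueError("token %r not found in the remaining text" % surface)
        if start > pos:
            # Something (typically whitespace) was dropped before this token.
            result.append((text[pos:start], None))
        result.append((surface, reading))
        pos = start + len(surface)
    if pos < len(text):
        # Trailing text after the last token.
        result.append((text[pos:], None))
    return result


tokens = [("空白", "くうはく"), ("！", None), ("空白", "くうはく"), ("！", None)]
for surface, reading in reinsert_gaps("空白！ 空白！", tokens):
    print(repr(surface), reading)
```

With this approach, concatenating the surfaces of the result always gives back the original sentence, which is the property the transcription needs.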

Warifuri removes reading of the dot token

.,8,8,9,記号,句点,*,*,*,*,.,.,.,,

becomes

.,8,8,9,記号,句点,*,*,*,*,.,,.,,

after warifuri has parsed it.
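To pinpoint which field is lost, the two lines can be compared field by field. This is a small diagnostic sketch; `changed_fields` is a hypothetical helper, and the claim that index 11 is the reading field assumes the usual IPADIC/NAIST dictionary column layout.

```python
import csv


def changed_fields(before, after):
    """Return the 0-based indexes of CSV fields that differ between two lines."""
    row_before = next(csv.reader([before]))
    row_after = next(csv.reader([after]))
    if len(row_before) != len(row_after):
        raise ValueError("lines have a different number of fields")
    return [i for i, (a, b) in enumerate(zip(row_before, row_after)) if a != b]


before = ".,8,8,9,記号,句点,*,*,*,*,.,.,.,,"
after = ".,8,8,9,記号,句点,*,*,*,*,.,,.,,"
# Index 11 is assumed to be the reading column.
print(changed_fields(before, after))
```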

Support 々

Details about 々 by tommy_san:

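If there is a single 々, it should be enough to replace it with the preceding kanji and look that up in KANJIDIC. For example, 人々 (ひとびと) becomes 人 followed by 人.
Occasionally 々々 repeats the two preceding characters, as in 一歩々々 (いっぽいっぽ) and 絶え々々 (たえだえ).
There are also cases where 々 is stacked to repeat the single preceding character several times, as in 前々々回 (ぜんぜんぜんかい), and cases like 南無阿弥陀仏々々々々々々 (なむあみだぶつなむあみだぶつ), but the NAIST dictionary contains no such entries, so they can be ignored for now.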
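The single-character case described above could be sketched as follows. `expand_iteration_marks` is a hypothetical name; the function replaces each 々 with the closest preceding character that is not itself 々, so it handles 人々 and 前々々回 but deliberately not the two-character repetition in 一歩々々.

```python
ITERATION_MARK = "々"


def expand_iteration_marks(word):
    """Replace each 々 with the nearest preceding non-々 character.

    A leading 々 has nothing to repeat and is left untouched.
    """
    out = []
    last_base = None
    for ch in word:
        if ch == ITERATION_MARK and last_base is not None:
            out.append(last_base)
        else:
            out.append(ch)
            if ch != ITERATION_MARK:
                last_base = ch
    return "".join(out)


print(expand_iteration_marks("人々"))
print(expand_iteration_marks("前々々回"))
```

The expanded string would only be used for the dictionary lookup; the displayed text should keep the original 々.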

Random result when there are several ways of splitting furigana

At least these are split randomly when relying only on kanjidic:

大君 おおき｜み おお｜きみ
川原 かわ|ら か|わら
河合 か|わい かわ|い
平気 へ｜いき へい｜き
南波平 みなみ|なみ|ひら みな|みなみ|ひら
入江 い|りえ いり|え
好き嫌い すき|き|ら|い す|き|きら|い
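One way to make the ambiguity visible and reproducible is to enumerate every possible split in a fixed order instead of stopping at whichever one is found first. The sketch below assumes a mapping from each character to its candidate readings; `all_splits` and the sample `READINGS` table are hypothetical and not taken from kanjidic or warifuri.

```python
def all_splits(word, reading, readings):
    """Return every way to split `reading` across the characters of `word`.

    `readings` maps a character to an iterable of candidate readings.
    Candidates are tried in sorted order, so the result is deterministic.
    """
    if not word:
        return [[]] if not reading else []
    results = []
    for candidate in sorted(readings.get(word[0], ())):
        if reading.startswith(candidate):
            for rest in all_splits(word[1:], reading[len(candidate):], readings):
                results.append([candidate] + rest)
    return results


# Hypothetical excerpt of per-character readings, for illustration only.
READINGS = {"川": {"かわ", "か"}, "原": {"ら", "わら", "はら"}}

splits = all_splits("川原", "かわら", READINGS)
print(["|".join(s) for s in splits])
```

When more than one split comes back, the caller can apply an explicit tie-breaking rule or flag the word for review rather than depending on iteration order.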

Support readings with katakana

Currently no difference is made between hiragana and katakana in readings, while some readings require katakana:

王蟲 オー|ム
飲茶 ヤム|チャ
Letters:
Ａ エイ
Ｂ ビー
Ｃ シー
Ｄ ディー
Ｅ イー
Ｆ エフ
Ｇ ジー
Ｈ エイチ
Ｈ エッチ
Ｉ アイ
Ｊ ジェイ
Ｋ ケイ
Ｌ エル
Ｍ エム
Ｎ エヌ
Ｏ オー
Ｐ ピー
Ｑ キュー
Ｒ アール
Ｓ エス
Ｔ ティー
Ｕ ユー
Ｖ ブイ
Ｗ ダブリュー
Ｘ エックス
Ｙ ワイ
Ｚ ゼット
Digits as English:
０ ゼロ
１ ワン
２ ツー
３ スリー
４ フォー
５ ファイブ
６ シックス
７ セブン
８ エイト
９ ナイン
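A possible approach is to compare readings in a script-insensitive way while storing them in their original script. The sketch below converts katakana to hiragana by shifting code points; `to_hiragana` and `same_reading` are hypothetical names. The long vowel mark ー lies outside the converted range and is left as is.

```python
# Katakana ァ (U+30A1) to ヶ (U+30F6) sit exactly 0x60 above their hiragana
# counterparts ぁ (U+3041) to ゖ (U+3096).
KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def to_hiragana(text):
    """Convert katakana to hiragana; other characters are unchanged."""
    return text.translate(KATA_TO_HIRA)


def same_reading(a, b):
    """Compare two readings while ignoring the hiragana/katakana distinction."""
    return to_hiragana(a) == to_hiragana(b)


print(to_hiragana("ヤムチャ"))
print(same_reading("エイ", "えい"))
```

Matching on the normalized form while emitting the reading exactly as it appears in the source would keep ヤム|チャ in katakana in the final transcription.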

Regex module from Python 3.2 fails

The re module in Python 3.2 produces unexpected results, which makes warifuri fail. Python 3.5 works fine. The fundamental problem can be shown like this:

$ python
Python 3.5.0 (default, Sep 20 2015, 11:56:03) 
[GCC 5.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> print("%4X" % ord('人'))
4EBA
>>> re.findall(r'[^\u4E00-\u9FD5]', '人')
[]
>>> re.findall(r'[\u4E00-\u9FD5]', '人')
['人']

$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.findall(r'[^\u4E00-\u9FD5]', '人') # doesn’t work as expected
['人']
>>> re.findall(r'[\u4E00-\u9FD5]', '人') # doesn’t work as expected
[]
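The difference is consistent with the re module only learning to interpret `\uXXXX` escapes itself in Python 3.3; before that, a raw-string pattern leaves `\u` for re to read as a plain `u`. A minimal workaround sketch, assuming that this is indeed the cause, is to use a normal (non-raw) string so that the Python parser resolves the escapes before re sees the pattern. The names `KANJI` and `NOT_KANJI` are illustrative.

```python
import re

# Non-raw strings: the string literal already contains the actual
# characters U+4E00 and U+9FD5, so re does not need to understand \u.
KANJI = re.compile("[\u4E00-\u9FD5]")
NOT_KANJI = re.compile("[^\u4E00-\u9FD5]")

print(KANJI.findall("人"))
print(NOT_KANJI.findall("人"))
```

This only works because the pattern contains no other backslash sequences; a pattern mixing `\u` escapes with sequences such as `\d` would need those backslashes doubled.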

Add readings for various length of okurigana

Build a list of special okurigana forms such as 落（おと）す for 落（お）とす from NAIST and add them as jukujikun, so that 見落す gets split correctly.

Verbs are not the only type of words to consider:

晴（はれ）やか
代（かわ）り
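The alternative forms could be derived mechanically by moving leading kana from the okurigana into the kanji reading. The sketch below is one way to do that, not the NAIST-based list proposed above; `okurigana_variants` is a hypothetical name, and it always leaves at least one kana as okurigana.

```python
def okurigana_variants(kanji_reading, okurigana):
    """Return (reading, okurigana) pairs with shorter okurigana.

    Each variant moves one more leading kana of the okurigana into the
    kanji reading, always leaving at least one kana outside.
    """
    variants = []
    for i in range(1, len(okurigana)):
        variants.append((kanji_reading + okurigana[:i], okurigana[i:]))
    return variants


print(okurigana_variants("お", "とす"))      # 落（お）とす
print(okurigana_variants("は", "れやか"))    # 晴（は）れやか
```

Generated variants would still need filtering against forms that actually occur, since not every shortened okurigana is in real use.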

ビタミンＥ not split

This one should be split using the optimistic path.

$ echo ビタミンＥ | mecab -
ビタミンＥ 名詞,一般,*,*,*,*,ビタミンＥ,ビタミンイー,ビタミンイー,,
