mecab_controller's Introduction

Mecab controller

Mecab controller is a simple wrapper around mecab (AUR). It was created primarily for use in AJT Japanese, an Anki add-on that generates furigana for Japanese text. It was originally based on code from Japanese support.

Usage with AJT Japanese

This repository is already included with AJT Japanese. You don't need to do anything extra.

Standalone usage

>>> import mecab_controller
>>> mecab = mecab_controller.MecabController()
>>> print(mecab.reading('昨日すき焼きを食べました'))
昨日[きのう]すき 焼[や]きを 食[た]べました

$ python -m mecab_controller 昨日すき焼きを食べました
昨日[きのう]すき 焼[や]きを 食[た]べました

mecab_controller's People

Contributors

handlerug, homocomputeris, tatsumoto-ren

mecab_controller's Issues

garbled characters for some kanji

mecab_controller outputs garbled characters for some kanji.
For the expression "粗末な家に住んでいる" from the examples, I got the following in the debugger at basic_mecab_controller.py:110:

(Pdb) outs
b'\xe7\xb2<ajt__component_separator>\xe7\xb2<ajt__component_separator>\xa5\xd2\xa5\xab\xa5\xac\xa5\xdf<ajt__component_separator>\xcc\xbe\xbb\xec<ajt__component_separator><ajt__node_separator>\x97<ajt__node_separator>\xe6\x9c\xab<ajt__node_separator>\xe3\x81\xaa<ajt__node_separator>\xe5\xae\xb6<ajt__node_separator>\xe3\x81\xab<ajt__node_separator>\xe4\xbd\x8f<ajt__node_separator>\xe3\x82\x93\xe3\x81\xa7\xe3\x81\x84\xe3\x82\x8b<ajt__node_separator><ajt__footer>'
(Pdb) outs.rstrip(b'\r\n').decode('utf-8', 'replace')
'�<ajt__component_separator>�<ajt__component_separator>�ҥ�����<ajt__component_separator>̾��<ajt__component_separator><ajt__node_separator>�<ajt__node_separator>末<ajt__node_separator>な<ajt__node_separator>家<ajt__node_separator>に<ajt__node_separator>住<ajt__node_separator>んでいる<ajt__node_separator><ajt__footer>'

I took the command that would be run from self._mecab_cmd and ran it in my terminal:

echo '粗末な家に住んでいる' | /opt/homebrew/bin/mecab --dicdir=/opt/homebrew/lib/mecab/dic/ipadic --rcfile=/Users/MY_PATH_TO_MECAB_CONTROLLER/mecab_controller/support/mecabrc --userdic=/Users/MY_PATH_TO_MECAB_CONTROLLER/mecab_controller/support/user_dic.dic --input-buffer-size=819200
        ̾,,*,*,*,*,,ҥ,ҥ
        ,,*,*,*,*,*
末      ̾,̾,ȿ,*,*,*,*
な      ̾,,*,*,*,*,*
家      ̾,,*,*,*,*,*
に      ̾,,*,*,*,*,*
住      ̾,,*,*,*,*,*
んでいる        ̾,,*,*,*,*,*
EOS

Platform: Apple Silicon Mac
OS: macOS 14.1
mecab and mecab-ipadic were installed with Homebrew.

I'm not too familiar with mecab. Is this the output it is supposed to produce when it can't parse a token?
If so, the section gets split into components at mecab_controller.py:57, and the value ends up as ['�', '�', '�ҥ�����', '̾��', '']. Should this case be handled in the following error block?

except ValueError:
    # unknown to mecab, gave the same word back
    word, headword, katakana_reading = components * 3
    part_of_speech, inflection = None, None
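
One way to inspect the byte string from the debugger is to try decoding its pieces with more than one encoding. The sketch below is an inference from the bytes shown above and is not confirmed anywhere in this thread: the feature fields fail to decode as UTF-8 but decode cleanly as EUC-JP, which suggests (without proving) that the dictionary in use was compiled with an EUC-JP charset while the input was sent as UTF-8. The helper name try_decode is hypothetical.

```python
from typing import Optional

# Byte fields copied from the pdb output above.
reading_bytes = b'\xa5\xd2\xa5\xab\xa5\xac\xa5\xdf'
pos_bytes = b'\xcc\xbe\xbb\xec'


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


for label, data in (("reading", reading_bytes), ("part of speech", pos_bytes)):
    print(label, "| utf-8:", try_decode(data, "utf-8"),
          "| euc_jp:", try_decode(data, "euc_jp"))

# The first kanji of the input is three bytes in UTF-8. In the pdb output
# the first two bytes (\xe7\xb2) and the third (\x97) ended up in separate
# nodes, which is what one would expect if the tokenizer read the input as
# two-byte EUC-JP characters. This is an assumption about the cause.
print('粗'.encode('utf-8'))
```

If this reading is right, the ValueError branch would not be the right place to handle it, since the garbling happens before the output is split into components. Checking the charset of the installed dictionary would be a reasonable first step.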

Hiragana conversion issue

Hello, I came here after using a great Anki plugin of yours called PitchAccent.
When converting the pitch pattern to hiragana, I noticed that the long vowel mark ー is not handled properly.
It turns out that converting katakana to hiragana is not that easy, because hiragana has two ways to lengthen a vowel. If we simply replaced the "ー" character based on the preceding vowel, we would get words like せんせえ (when the original data is written as センセー).
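
The problem can be reproduced with a minimal sketch. It assumes the usual fixed code point offset between the katakana and hiragana Unicode blocks; the function names and the vowel table are illustrative and are not taken from the add-on.

```python
# Katakana ァ..ヶ (U+30A1..U+30F6) sit exactly 0x60 above hiragana ぁ..ゖ.
KATAKANA_START, KATAKANA_END, OFFSET = 0x30A1, 0x30F6, 0x60

# Which vowel each hiragana ends in (illustrative table, plain kana only).
VOWEL_ROWS = {
    'あ': 'あかがさざただなはばぱまやらわゃぁ',
    'い': 'いきぎしじちぢにひびぴみりぃ',
    'う': 'うくぐすずつづぬふぶぷむゆるゅぅ',
    'え': 'えけげせぜてでねへべぺめれぇ',
    'お': 'おこごそぞとどのほぼぽもよろをょぉ',
}
VOWEL_OF = {kana: vowel for vowel, row in VOWEL_ROWS.items() for kana in row}


def kata_to_hira(text: str) -> str:
    """Shift katakana letters to hiragana; the long vowel mark ー is left as-is."""
    return ''.join(
        chr(ord(ch) - OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


def hira_to_kata(text: str) -> str:
    """The reverse direction is a plain one-to-one shift."""
    return ''.join(
        chr(ord(ch) + OFFSET)
        if KATAKANA_START <= ord(ch) + OFFSET <= KATAKANA_END else ch
        for ch in text
    )


def expand_long_vowel_naively(hiragana: str) -> str:
    """Replace each ー with the vowel of the preceding kana (the naive approach)."""
    out = []
    for ch in hiragana:
        if ch == 'ー' and out and out[-1] in VOWEL_OF:
            out.append(VOWEL_OF[out[-1]])
        else:
            out.append(ch)
    return ''.join(out)


print(expand_long_vowel_naively(kata_to_hira('センセー')))  # せんせえ, not せんせい
print(hira_to_kata('せんせい'))  # センセイ
```

The naive expansion cannot know that the word is conventionally spelled せんせい, because that information is absent from センセー.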

It would be best to reverse the conversion workflow: store the accents in hiragana originally, and then converting to katakana would be deterministic, right?
For that you would need the original data in hiragana, but from what I've seen the accent_dict data contains only katakana fields. Perhaps you cut out the hiragana fields?

I prefer to use hiragana in the pitch pattern so that I can use it in place of the vocab kana field in Anki.
If it's too hard, never mind.
Thanks for your hard work. よろしくお願いいたします。

Add ipadic paths for macOS if installed via Homebrew on Apple silicon

The ipadic path in the code for installations made with brew install mecab-ipadic is correct only for Intel-based Macs. On Apple silicon, the default is /opt/homebrew/lib/mecab/dic/ipadic/.

Likewise, if mecab itself was installed with brew install mecab, then installing mecab-ipadic-neologd by following its install instructions (as of today there is no Homebrew formula for it) leaves the dictionary at:

  • /usr/local/lib/mecab/dic/mecab-ipadic-neologd/ if on an Intel-based Mac
  • /opt/homebrew/lib/mecab/dic/mecab-ipadic-neologd/ if on an Apple silicon based Mac.

Neither of these paths is handled in the code.
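
A lookup that covers these locations could be sketched as follows. This is not the repository's actual code: the names CANDIDATE_DICDIRS and find_dicdir are hypothetical, and the Intel ipadic path is an assumption inferred from the neologd path above, since the issue does not spell it out.

```python
from pathlib import Path
from typing import Iterable, Optional

# Candidate dictionary directories, checked in order.
CANDIDATE_DICDIRS = (
    "/opt/homebrew/lib/mecab/dic/ipadic",                # Apple silicon, ipadic
    "/usr/local/lib/mecab/dic/ipadic",                   # Intel, ipadic (assumed)
    "/opt/homebrew/lib/mecab/dic/mecab-ipadic-neologd",  # Apple silicon, neologd
    "/usr/local/lib/mecab/dic/mecab-ipadic-neologd",     # Intel, neologd
)


def find_dicdir(candidates: Iterable[str] = CANDIDATE_DICDIRS) -> Optional[Path]:
    """Return the first candidate that exists as a directory, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None


dicdir = find_dicdir()
print(dicdir if dicdir is not None else "no known dictionary directory found")
```

The order of the candidates decides which dictionary wins when several are installed, so a real implementation would probably want that order to be configurable.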
