miurahr / pykakasi


Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.

Home Page: https://codeberg.org/miurahr/pykakasi

License: GNU General Public License v3.0

Python 100.00%

natural-language-processing japanese python transliterator transliterate-japanese

pykakasi's Introduction

Pykakasi

Overview

pykakasi is a Python Natural Language Processing (NLP) library to transliterate hiragana, katakana and kanji (Japanese text) into rōmaji (Latin/Roman alphabet). It can handle characters in NFC form.

Its algorithms are based on the kakasi library, which is written in C.

Install (from PyPI): pip install pykakasi
Install (from conda-forge): conda install -c conda-forge pykakasi
Documentation available on readthedocs

Give Up GitHub

This project has given up GitHub. (See Software Freedom Conservancy's Give Up GitHub site for details)

You can now find this project at https://codeberg.org/miurahr/pykakasi instead.

Any use of this project's code by GitHub Copilot, past or present, is done without our permission. We do not consent to GitHub's use of this project's code in Copilot.

Join us; you can Give Up GitHub too!

pykakasi's Issues

Documentation is outdated and needs an update

The API description is still based on v1.2 and does not cover the v2.0 interface.

Conversion of Kanji leads to no characters on Python 2

Executing the sample code on the front page with Python 3 leads to the converted text:
kana Kanji Majiri Bun

Executing the sample code on the front page with Python 2 leads to the converted text:
kana jiri

No exception or warning is displayed. Is this intended behavior?

Need documentation

kakasi cli does not show help

Describe the bug

I tried to use kakasi cli and show the help but an error occurred.

$ kakasi --help
pykakasi: version 2.0.4 on Python 3.7.5 [CPython Clang 10.0.1 (clang-1001.0.46.4)]
Python implementation of kakasi

Traceback (most recent call last):
  File "/Users/hnishi/.pyenv/versions/3.7.5/bin/kakasi", line 108, in <module>
    sys.exit(main())
  File "/Users/hnishi/.pyenv/versions/3.7.5/bin/kakasi", line 79, in main
    usage()
  File "/Users/hnishi/.pyenv/versions/3.7.5/bin/kakasi", line 25, in usage
    print("{}".format(pykakasi.__copyright__))
AttributeError: module 'pykakasi' has no attribute '__copyright__'

Related issue
(if exist)
none

To Reproduce

run the following commands in a shell.

pip install pykakasi
kakasi --help

Expected behavior

show help of command line options and usage without errors

Environment (please complete the following information):

OS: macOS Catalina version 10.15.6
Python 3.7.5
pykakasi version: 2.0.4

Test data(please attach in the report):
none

Additional context
none

276 "kanji" are not converted if the input text has same/similar looking hanzi mixed in, also the converter does not complain.

Describe the bug
First of all, thanks for having this project! Without Your work I could not do my project probably at all.

The issue:
Some Japanese writers probably typed the look-alike Chinese variant of a kanji by accident, or mixed in simplified Chinese characters, or some CJK unification conversion happened.
The issue is: 見 and 見 are not the same in Unicode. In the full list I've sent, some characters are clearly the simplified Chinese versions of the kanji characters; however, I think these characters should be converted just the same.

Related issue
(if exist)

To Reproduce
Steps to reproduce the behavior:
(example)
I use the following code to convert Japanese text to romaji:

import pykakasi

kakasi = pykakasi.kakasi()
kakasi.setMode("H", "a")  # Hiragana to ascii, default: no conversion
kakasi.setMode("K", "a")  # Katakana to ascii, default: no conversion
kakasi.setMode("J", "a")  # Japanese to ascii, default: no conversion
kakasi.setMode("r", "Hepburn")  # default: use Hepburn Roman table
kakasi.setMode("s", True)  # add space, default: no separator
kakasi.setMode("C", False)  # no capitalization
text = "⽢⾃々〻ゞ业东丝丢两丨为丽么乐习书产亿们众优伙伟传伤你侧俱值內兰关兴兹军冻击别剧办动劳卖卡卢厉厌发变吗吧启呃员呜呢响哎哟唸啊啦喂喔嗎嗫嗯团场增处备头夺奶她妆妈妳姬实对寻尔带应废开张强怀态总恶战戾护报拋拥择损捥搔搞敌斩时晚暧极查标样欢步歲每污沟淚渴溫满灵热爱爸爹狱玛环现盘矿确离种竞笔类紧緖红约级纯纸线细终绊经结给绝续绮缔缚缠缲罗职胜脋脫舰艳蓝蔷薰虽蟬补见觉說计认让议记许讹诂诉词诛话该详语说诵请诺谁谈谍谛谢负贯贵贷费贽赶跃踠踩踬轨轮轻辉边达过运还这进远连选遗銮錬针钱铛银错镜镮长门闭问间闻阁队阳险隐难预领颗颜风飞饭马驶驾验骗骷髅髙鲜鸟鸠黑金北葉立切行見"
assert(kakasi.getConverter().do(text).replace(' ', '') == text) # it is true

Expected behavior
The characters converted to latin letters.

Environment (please complete the following information):

OS: macOS 10.13.6 (17G65)
Python 3.7.4
pykakasi version: (not specified)


pip install fails

(pykakasi)miurahr@:~/projects/pykakasi-test$ pip install pykakasi
Collecting pykakasi
/home/miurahr/.pyenv/versions/pykakasi/lib/python2.6/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Downloading pykakasi-0.23.tar.gz (1.0MB)
    100% |████████████████████████████████| 1.0MB 178kB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/tmp/pip-build-0fGo1u/pykakasi/setup.py", line 8, in <module>
        import nose
    ImportError: No module named nose

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-0fGo1u/pykakasi

Syllabic n (ん) is written as n' before vowels and y in kunrei-siki

Describe the bug
Kunrei-shiki romanization requires that syllabic n (ん) be written as n' before vowels and y.

Related issue
#107

Expected behavior
Add ' after n when syllabic n (ん) comes before a vowel or y.
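The requested rule can be sketched independently of pykakasi. The helper below is hypothetical (`join_morae` is not a pykakasi function) and assumes the text has already been split into romanized morae, with syllabic ん given as a bare "n".

```python
VOWELS_AND_Y = set("aiueoy")


def join_morae(morae):
    """Join romanized morae, writing syllabic n as n' before a vowel or y."""
    out = []
    for i, mora in enumerate(morae):
        out.append(mora)
        # Look at the first letter of the following mora, if there is one.
        nxt = morae[i + 1] if i + 1 < len(morae) else ""
        if mora == "n" and nxt[:1] in VOWELS_AND_Y:
            out.append("'")
    return "".join(out)


print(join_morae(["ko", "n", "ya", "ku"]))   # こんやく
print(join_morae(["ko", "nya", "ku"]))       # こにゃく
print(join_morae(["ko", "n", "nya", "ku"]))  # こんにゃく
```

With the apostrophe in place, こんやく and こにゃく no longer collapse to the same spelling.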

Problem while converting Hiragana to Katakana

While converting Hiragana (not Kanji) to Katakana, a bug occurs, so I have a question.

original : 私がこの子を助けなきゃいけないってことだよね
pykakasi : ワタシガコノコヲタスケナキゃいけないってことだよね
katakana : ワタシガコノコヲタスケナキャイケナイッテコトダヨネ

code :

import pykakasi
import codecs
import MeCab

kakasi = pykakasi.kakasi()
kakasi.setMode("J","K") 
conv = kakasi.getConverter()


_symbols = []
_tagger = MeCab.Tagger()
print(_tagger)

def _yomi(mecab_result):
    tokens = []
    yomis = []
    for line in mecab_result.split("\n")[:-1]:
        s = line.split("\t")
        if len(s) == 1:
            break
        token, rest = s
        rest = rest.split(",")
        tokens.append(token)
        yomi = rest[7] if len(rest) > 7 else None
        yomi = None if yomi == "*" else yomi
        yomis.append(yomi)
        
    return tokens, yomis
  
def add_symbols(text):
    for c in text:
        if not c in _symbols:
            _symbols.append(c)

def j2k(text):
    tokens, yomis = _yomi(_tagger.parse(text))
    return "".join(
        yomis[idx] if yomis[idx] is not None else tokens[idx]
        for idx in range(len(tokens)))

conv = kakasi.getConverter()
text2 = conv.do(j2k('私がこの子を助けなきゃいけないってことだよね	'))
print(text2)

Is there anything wrong with the code or text?

Fullwidth colon \uff1a causes Exception on conv.do()

Similar behaviour to issue #46 but for the character Fullwidth colon \uff1a.
Possibly more characters affected.

KeyError: 'pykakasi/hepburnhira2.pickle'

I get this error when doing a trivial test like what's listed here:
https://pypi.python.org/pypi/pykakasi

This is on a --user install by the way.

Installing from the git repository fails with an error

(venv) :~/projects/test-pykakasi$ pip install git+https://github.com/miurahr/pykakasi
Collecting git+https://github.com/miurahr/pykakasi
  Cloning https://github.com/miurahr/pykakasi to /tmp/pip-ow0_xkl1-build
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-ow0_xkl1-build/setup.py", line 9, in <module>
        import pykakasi.genkanwadict as genkanwadict
      File "/tmp/pip-ow0_xkl1-build/pykakasi/__init__.py", line 1, in <module>
        from .a2 import a2
      File "/tmp/pip-ow0_xkl1-build/pykakasi/a2.py", line 7, in <module>
        from six import unichr
    ModuleNotFoundError: No module named 'six'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-ow0_xkl1-build/

python3 support

Issue with ー

I've had a few issues where words using the ー character have it converted into a hyphen rather than extending the previous character's sound.

For example:

デッデー is becoming dedde- instead of deddee.
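As a stop-gap, the hyphen can be rewritten after conversion. This is only a sketch: `extend_long_vowels` is a hypothetical post-processing helper, and it assumes every "-" that directly follows a vowel came from ー, which is not true for text that contains real hyphens.

```python
VOWELS = "aiueo"


def extend_long_vowels(romaji):
    """Replace each '-' that directly follows a vowel with a copy of that vowel."""
    out = []
    for ch in romaji:
        if ch == "-" and out and out[-1] in VOWELS:
            out.append(out[-1])
        else:
            out.append(ch)
    return "".join(out)


print(extend_long_vowels("dedde-"))
```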

Handle Unicode IVS

Describe the bug
It cannot handle Unicode IVS properly. Issue recorded at unihandecode project.

Related issue
miurahr/unihandecode#35

Issues on Katakana half-width

This is a useful module for converting Japanese to Romaji.
However, I tested it with half-width kana, and it seems that it could not convert them to Romaji. Please tell me if this is not an issue.
Thank you so much.

Need flag to open new db

Traceback (most recent call last):
  File "/root/kakasi-test.py", line 14, in <module>
    conv = kakasi.getConverter()
  File "/usr/local/lib/python2.7/dist-packages/pykakasi/kakasi.py", line 119, in getConverter
    self._conv["J"] = J2a(method = self._option["r"])
  File "/usr/local/lib/python2.7/dist-packages/pykakasi/j2a.py", line 38, in __init__
    self._jconv = J2H()
  File "/usr/local/lib/python2.7/dist-packages/pykakasi/j2h.py", line 46, in __init__
    self._kanwa = kanwa()
  File "/usr/local/lib/python2.7/dist-packages/pykakasi/kanwa.py", line 36, in __init__
    self._kanwadict = dbm.open(dictpath,'r')
  File "/usr/lib/python2.7/anydbm.py", line 79, in open
    raise error, "need 'c' or 'n' flag to open new db"
anydbm.error: need 'c' or 'n' flag to open new db

Use six for python 2 compatibility

hepburnhira2.pickle is not found?

Hi,

I am trying to convert hiragana or kanji into romaji... was trying this:

import pykakasi
j = pykakasi.H2a()
Then I got this error:
IOError: [Errno 2] No such file or directory: '/Users/ronalds/.virtualenvs/testFlask/lib/python2.7/site-packages/pykakasi/hepburnhira2.pickle'

Do you know why I am missing this pickle file?

Thanks for your help!

Problem with kanji + っ ?

Describe the bug
I am a beginner at both Japanese and Python, so I am likely making a mistake, but I may have found a bug.

Related issue
None.

To Reproduce

import pykakasi
text= '思った 言ったら 行って'
kakasi = pykakasi.kakasi()
kakasi.setMode("H","a") # Hiragana to ascii, default: no conversion
kakasi.setMode("K","a") # Katakana to ascii, default: no conversion
kakasi.setMode("J","a") # Japanese to ascii, default: no conversion
kakasi.setMode("r","Hepburn") # default: use Hepburn Roman table
kakasi.setMode("s", True) # add space, default: no separator
kakasi.setMode("C", False) # capitalize, default: no capitalize
conv = kakasi.getConverter()
print(conv.do(text))
omotsu ta  itsutsu tara  itsu te

Expected behavior
omotta ittara itte

Environment (please complete the following information):

OS: Windows 10
Python 3.8
pykakasi version: 2.0.4

Test data(please attach in the report):
None

Additional context
None

cannot convert character "々" with specific words.

I found bugs with specific words when converting to Hiragana.

from pykakasi import kakasi

K = kakasi()
K.setMode("J","H")
conv = K.getConverter()
print(conv.do("月々"))
print(conv.do("毎月々"))
print(conv.do("佐々木"))
print(conv.do("中佐々木"))
print(conv.do("代々木"))
print(conv.do("次代々木"))

つきづき
まいつき々
ささき
ちゅうさ々き
よよぎ
じだい々き

After updating to v1.1, "゛ー" causes an abnormal termination

$ python
Python 3.7.3 (default, Apr  8 2019, 13:30:19) 
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pykakasi
>>> pykakasi.__version__
'1.1'
>>> kakasi = pykakasi.kakasi()
>>> 
>>> kakasi.setMode("H", "a")
>>> kakasi.setMode("K", "a")
>>> kakasi.setMode("J", "a")
>>> 
>>> kakasi.setMode("r", "Hepburn")
>>> 
>>> kakasi.setMode("s", True)
>>> kakasi.setMode("C", True)
>>> 
>>> conv = kakasi.getConverter()
>>> conv.do("バー")
'baa'
>>> conv.do("ハ゛ー")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "******************/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pykakasi/kakasi.py", line 147, in do
    chunk = chunk + chunk[-1]
IndexError: string index out of range
>>>

Hiragana for Number Counters

Describe the bug

My use case: I am using pykakasi to help generate Furigana from a given text or paragraph by using the Hiragana output.

Some counters, such as "十歳" are converted as {"orig": "十", "hira": "じゅう"}, {"orig": "歳", "hira": "とし"} instead of さい.

I am not sure if this is a bug with kakasi or whether there is a way for me to customise pykakasi to always convert 歳 -> さい if there are numbers before it.

To Reproduce
Steps to reproduce the behavior:

Use the input text: "ほかの十人は同じ施設の関係者で、六人は十歳未満の子どもでした。"

歳 is mapped as とし instead of さい。

Environment (please complete the following information):

OS: MacOS
Python 3.9.1
pykakasi version: 2.0.4
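One possible workaround is to post-process the token list. The sketch below is an assumption-laden illustration, not a pykakasi feature: `fix_age_counter` is a hypothetical helper, and it only covers the single counter 歳 directly after a numeral, using tokens in the {"orig", "hira"} shape quoted above.

```python
KANJI_NUMERALS = set("〇一二三四五六七八九十百千万")


def fix_age_counter(tokens):
    """Return a copy of tokens in which 歳 right after a numeral reads さい."""
    fixed = []
    for i, tok in enumerate(tokens):
        tok = dict(tok)  # copy so the caller's data is left untouched
        prev = tokens[i - 1]["orig"] if i > 0 else ""
        last = prev[-1:]
        if tok["orig"] == "歳" and (last in KANJI_NUMERALS or last.isdigit()):
            tok["hira"] = "さい"
        fixed.append(tok)
    return fixed


tokens = [{"orig": "十", "hira": "じゅう"}, {"orig": "歳", "hira": "とし"}]
print(fix_age_counter(tokens))
```

A rule table keyed by counter kanji would generalize this, but each reading would have to be checked by hand.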

Implement wakati mode

Propose to have a structured return form from converter instead string with delimiters

Is it possible to get a structured result from the conversion to romaji instead of a string with delimiters? This matters because someone (like me) may need a reverse mapping between the romaji and the source (hiragana, kanji, etc. sequences), which is impossible to build after conversion today. A converter method or optional argument that returns a dict() instead of a string would be very useful. I propose the source pattern as the key and the converted text as the value.
Thanks!
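To illustrate what such a reverse mapping could look like, here is a minimal sketch. It assumes results shaped as a list of dicts with "orig" and "hepburn" keys (the shape that convert() output takes in other reports on this page); `build_reverse_index` is a hypothetical helper and the sample data is hard-coded so that the block runs without pykakasi.

```python
from collections import defaultdict


def build_reverse_index(items):
    """Map each romaji string to the list of source strings that produced it."""
    index = defaultdict(list)
    for item in items:
        index[item["hepburn"]].append(item["orig"])
    return dict(index)


items = [
    {"orig": "こんやく", "hepburn": "konyaku"},
    {"orig": "こにゃく", "hepburn": "konyaku"},
    {"orig": "こんにゃく", "hepburn": "konnyaku"},
]
print(build_reverse_index(items))
```

Values are lists because several sources can share one romaji spelling.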

Greek characters support

Russian characters conversion

"ー" becomes "???" when ("K", "H" ) mode

when I set kakasi.setMode("K","H")
and input "じゃーん" it returns "じゃ???ん"

use semidbm rather than anydbm

semidbm is a pure python, cross platform dbm engine.

https://github.com/jamesls/semidbm

Characters 々〇 cause Exception on conv.do()

Traceback (most recent call last):

  File "<ipython-input-1-4ab2ca517509>", line 1, in <module>
    runfile('D:/syosetsu-dl/kigou_conversion_issue.py', wdir='D:/syosetsu-dl')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "D:/syosetsu-dl/kigou_conversion_issue.py", line 118, in <module>
    test_faulty_characters(conv)

  File "D:/syosetsu-dl/kigou_conversion_issue.py", line 18, in test_faulty_characters
    transscripted_string = conv.do(string)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pykakasi\kakasi.py", line 235, in do
    otext = otext + self._conv["E"].convert(text[i])

TypeError: must be str, not NoneType

This error happens when mode "E" is set to "a" and one attempts to convert a string containing "々" or "〇".

Attached you can find a small script demonstrating the issue.
kigou_conversion_issue.txt

Symbols conversion

Japanese names in half-width kana to romaji.

Describe the bug
Hi,

I am trying to translate Japanese names to Romaji. I used the code written under. As you can see I get the same output for hira, kana, hepburn, kunrei and passport.
Is there anything wrong with my text or code?

To Reproduce

import pykakasi
kks = pykakasi.kakasi()
text='ｿｳｿﾞｸﾆﾝ'
result = kks.convert(text)
print(result)


[{'orig': 'ｿｳｿﾞｸﾆﾝ', 'hira': 'ｿｳｿﾞｸﾆﾝ', 'kana': 'ｿｳｿﾞｸﾆﾝ', 'hepburn': 'ｿｳｿﾞｸﾆﾝ', 'kunrei': 'ｿｳｿﾞｸﾆﾝ', 'passport': 'ｿｳｿﾞｸﾆﾝ'}]
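A possible workaround is to fold half-width kana into full-width kana before conversion, using only the standard library. This is a sketch: `to_fullwidth_kana` is a hypothetical helper, and NFKC also rewrites other compatibility characters (for example full-width Latin letters become ASCII), which may or may not be wanted.

```python
import unicodedata


def to_fullwidth_kana(text):
    """Apply NFKC, which folds half-width katakana and their voicing marks."""
    return unicodedata.normalize("NFKC", text)


print(to_fullwidth_kana("ｿｳｿﾞｸﾆﾝ"))
```

The normalized string can then be passed to the converter in place of the original.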

Release script for v0.80 broken

Installing from pykakasi-0.80-py2.py3-none-any.whl is broken because a dictionary file is empty.

(venv):~/projects/test-pykakasi$ ls -l venv/lib/python3.6/site-packages/pykakasi/kanwadict3.db/
-rw-r--r-- 1 miurahr miurahr 8  Mar 29 18:36 data

JIS X0213 characters (old IBM/NEC extensions) generate key error

report from #68

Unicode definition is

U+4F60	kCantonese	nei5
U+4F60	kDefinition	you, second person pronoun
U+4F60	kHanyuPinlu	ni3(10944)
U+4F60	kHanyuPinyin	10137.050:nǐ
U+4F60	kJapaneseKun	NANJI
U+4F60	kJapaneseOn	JI NI
U+4F60	kKorean	NI
U+4F60	kMandarin	NI3
U+4F60	kVietnamese	nể
U+4F60	kXHC1983	0828.020:nǐ

and it is mapped to IBM kanji FA61

Infinite loop after running for a while

Describe the bug
If I convert a lot of texts using pykakasi, the program freezes after a while.
I have found that the problem is caused by

pykakasi/src/pykakasi/scripts.py

Line 152 in 7d2179e

if length > 0:

If the condition of this if statement is not satisfied, x is not increased and the while loop never stops.
To reproduce

kks = pykakasi.kakasi()
kks.convert('ﾞっ、')
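The general remedy for this kind of hang is to make sure the scan position always advances. The sketch below is not the code in scripts.py; `scan` and `never_matches` are hypothetical names that only show the pattern.

```python
def scan(text, match_length):
    """Split text into chunks; match_length(text, pos) may return 0."""
    chunks = []
    pos = 0
    while pos < len(text):
        length = match_length(text, pos)
        if length <= 0:
            # No match: pass one character through so the loop still advances.
            length = 1
        chunks.append(text[pos:pos + length])
        pos += length
    return chunks


def never_matches(text, pos):
    """Worst case for the loop: a matcher that never consumes anything."""
    return 0


print(scan("ﾞっ、", never_matches))
```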

Inconsistent behavior between chinese kanji and extended kana

The expected failure of the test test_kakasi_extended_kana() illustrates an inconsistent behavior.
There are issues in the code paths that handle unsupported characters other than CJK Unified Ideographs (Han).

kakasi command line -v option no longer works.

Describe the bug
Command line kakasi replacement doesn't support -v command line option.
This same issue breaks the -h option as well.

To Reproduce
Steps to reproduce the behavior:
(example)

Prepare test data attached as 'file' in current directory.
Run following code with python3.

$ pip3 install pykakasi
Collecting pykakasi
  Using cached https://files.pythonhosted.org/packages/bb/59/e09e7b0e0b5aaaefa8ea6dced1fc1e60987e4663d9c4aca0a9a95a9e0ecd/pykakasi-2.0.1-py3-none-any.whl
Requirement already satisfied: klepto in ./.local/lib/python3.7/site-packages (from pykakasi) (0.2.0)
Requirement already satisfied: dill>=0.3.3 in ./.local/lib/python3.7/site-packages (from klepto->pykakasi) (0.3.3)
Requirement already satisfied: pox>=0.2.9 in ./.local/lib/python3.7/site-packages (from klepto->pykakasi) (0.2.9)
Installing collected packages: pykakasi
Successfully installed pykakasi-2.0.1
$ kakasi -v
Traceback (most recent call last):
  File "/home/dennyvandenberg/.local/bin/kakasi", line 99, in <module>
    sys.exit(main())
  File "/home/dennyvandenberg/.local/bin/kakasi", line 66, in main
    show_version()
  File "/home/dennyvandenberg/.local/bin/kakasi", line 11, in show_version
    print("{}: version {}".format(os.path.basename(sys.argv[0]),  pykakasi.__version__))
AttributeError: module 'pykakasi' has no attribute '__version__'

Expected behavior
For the command to print out the version number.

Environment (please complete the following information):

OS: Chromebook running virtual linux
Python: 3.7.3
pykakasi version: 2.0.1
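A command-line script can guard against a missing attribute rather than crash. This is a minimal sketch with a stand-in module object; `describe_version` is a hypothetical helper, not part of the kakasi script.

```python
import types


def describe_version(module):
    """Format a version line, falling back when __version__ is absent."""
    version = getattr(module, "__version__", "unknown")
    return "{}: version {}".format(module.__name__, version)


stub = types.ModuleType("stub_package")  # stand-in module without __version__
print(describe_version(stub))
stub.__version__ = "2.0.1"
print(describe_version(stub))
```

The same getattr fallback would apply to other optional attributes such as `__copyright__`.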

Issues on Unicode normalize forms

The library should take differences between Unicode normalization forms into account.
jaconv.normalize may help.
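To show why the normalization form matters, the following sketch uses only the standard library `unicodedata` module (not jaconv). The same visible kana can be one code point or two, and a dictionary keyed on the composed form will not find the decomposed one.

```python
import unicodedata

decomposed = "か\u3099"  # か followed by the combining voiced sound mark
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(composed == "\u304c")  # U+304C is the precomposed が
```

Normalizing input to NFC before lookup is one way to make both spellings behave the same.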

KeyError raised for several non-standard forms of Japanese characters

Reported in #68: U+3402 raises a KeyError.

According to the Unicode Standard, it is:

U+3402	kDefinition	(J) non-standard form of U+559C 喜, to like, love, enjoy; a joyful thing

FileNotFoundError: [Errno 2] No such file or directory: '/usr/lib/python3.6/site-packages/pykakasi/hepburnhira2.pickle'

Hello,

When I try the demo in the readme.md, it gives me this error:

In [9]: conv = kakasi.getConverter()
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-9-a04f71e77951> in <module>()
----> 1 conv = kakasi.getConverter()

/usr/lib/python3.6/site-packages/pykakasi/kakasi.py in getConverter(self)
     99         if self._mode["H"] == "a":
    100             from .h2a import H2a
--> 101             self._conv["H"] = H2a(method = self._option["r"])
    102         elif self._mode["H"] == "K":
    103             from .h2k import H2K

/usr/lib/python3.6/site-packages/pykakasi/h2a.py in __init__(self, method)
     39     def __init__(self, method="Hepburn"):
     40         if method == "Hepburn":
---> 41             self._kanadict = jisyo('hepburnhira2.pickle')
     42         elif method == "Passport":
     43             self._kanadict = jisyo('passporthira2.pickle')

/usr/lib/python3.6/site-packages/pykakasi/jisyo.py in __init__(self, pklname)
     13 
     14     def __init__(self, pklname):
---> 15         dict_pkl = open(resource_filename(__name__, pklname), 'rb')
     16         self._dict = load(dict_pkl)
     17 

FileNotFoundError: [Errno 2] No such file or directory: '/usr/lib/python3.6/site-packages/pykakasi/hepburnhira2.pickle'

Is there anything that I've missed?

Thanks!

Improve test coverage

Test coverage is going down because there are not enough test cases for kakasi.py. As of January 2019, coverage of that file is 84%.
https://coveralls.io/builds/20923762/source?filename=pykakasi/kakasi.py

Add New API documentation

Is your feature request related to a problem? Please describe.

There is no description of the new API in the README.

Describe the solution you'd like
It would be better to add usage examples to the README.

Describe alternatives you've considered

It would also be good to write a manual.

Sometimes it does not recognize punctuation as word separation

u"由来し、この" should be recognized as two words, u"ゆらいし、" and "この",
but current master outputs the continuous string u"ゆらいし、この".
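The expected segmentation can be sketched with a regular expression that ends a chunk after punctuation. `split_after_punctuation` is a hypothetical helper that works on already-converted text; it does not reflect how pykakasi segments words internally.

```python
import re


def split_after_punctuation(text, marks="、。"):
    """Split text so that each chunk ends after a run of punctuation marks."""
    cls = re.escape(marks)
    # A chunk is non-punctuation plus trailing marks, or a leading run of marks.
    pattern = "[^{0}]+[{0}]*|[{0}]+".format(cls)
    return re.findall(pattern, text)


print(split_after_punctuation("ゆらいし、この"))
```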

Test failed on Windows

from https://ci.appveyor.com/project/miurahr/pykakasi/build/1.0.27/job/kb83k9v66xjqwq0r

======================================================================
ERROR: test_mkdict (tests.test_genkanwadict.TestGenkanwadict)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\pykakasi\tests\test_genkanwadict.py", line 28, in test_mkdict
    self.kanwa.mkdict(src, dst)
  File "C:\projects\pykakasi\pykakasi\genkanwadict\mkkanwa.py", line 41, in mkdict
    dump((dic, max_key_len), open(dst, 'wb'), protocol=2)
IOError: [Errno 2] No such file or directory: '/tmp\\test_kanadict.pickle'

======================================================================
ERROR: test_mkkanwa (tests.test_genkanwadict.TestGenkanwadict)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\pykakasi\tests\test_genkanwadict.py", line 42, in test_mkkanwa
    self.kanwa.run(src, dst)
  File "C:\projects\pykakasi\pykakasi\genkanwadict\mkkanwa.py", line 21, in run
    self.kanwaout(dst)
  File "C:\projects\pykakasi\pykakasi\genkanwadict\mkkanwa.py", line 71, in kanwaout
    dic = dbm.open(out, 'c')
  File "c:\python27\Lib\anydbm.py", line 85, in open
    return mod.open(file, flag, mode)
  File "c:\python27\Lib\dbhash.py", line 18, in open
    return bsddb.hashopen(file, flag, mode)
  File "c:\python27\Lib\bsddb\__init__.py", line 364, in hashopen
    d.open(file, db.DB_HASH, flags, mode)
DBNoSuchFileError: (2, 'No such file or directory')

----------------------------------------------------------------------
Ran 29 tests in 0.140s
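The path '/tmp\\test_kanadict.pickle' in the first error suggests a hard-coded "/tmp" prefix, which does not exist on Windows. A portable test can ask the standard library for a temporary directory instead. This is a sketch, not the project's actual test code; `pickle_roundtrip` is a hypothetical helper.

```python
import os
import pickle
import tempfile


def pickle_roundtrip(obj, filename):
    """Dump obj to a file in a fresh temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        dst = os.path.join(tmpdir, filename)  # no hard-coded "/tmp"
        with open(dst, "wb") as fh:
            pickle.dump(obj, fh, protocol=2)
        with open(dst, "rb") as fh:
            return pickle.load(fh)


print(pickle_roundtrip(({"あ": "a"}, 1), "test_kanadict.pickle"))
```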

を becomes "wo" in romanization, but Hepburn-shiki requires "o"

Describe the bug
を becomes "wo" in romanization, but Hepburn-shiki requires "o".

Related issue
#107

Expected behavior

を becomes 'o' in kunrei-shiki romanization.

The convert function of the new api misses last part of a text

Using the new api in version 2.0, the convert function doesn't take into account the last part of a text if no final punctuation is present and the text doesn't end with a kanji.

To Reproduce
Code to reproduce the behavior:

import pykakasi

text = 'お茶にお煎餅、よく合いますね'

kakasi = pykakasi.kakasi()
result = kakasi.convert(text)

print(text)
print("".join([item['orig'] for item in result]))

Expected behavior

Both outputs should read
"お茶にお煎餅、よく合いますね"

Both "こんやく" and "こにゃく" are converted into "konyaku"

Describe the bug
I'm not sure whether this should be called a bug, but I'm filing this issue to discuss it. (I will call it a bug within this issue.)

The bug is that two different Japanese words are converted into the same rōmaji.
For example, "こんやく" and "こにゃく" are both converted into "konyaku".
Also, "こんにゃく" is converted into "konnyaku", but a human converting this rōmaji back to hiragana could read it as either "こんにゃく" or "こんやく".

If pykakasi's goal is 'conversion to rōmaji' = 'depending on how it is read, the original hiragana can be inferred', the current behavior is acceptable as a specification.
But if the goal is 'conversion to rōmaji' = 'the original hiragana can be uniquely determined', I think the current behavior can be called a bug.

To Reproduce
Run following code with python3 in interactive mode.
(Of course, this also happens in script mode.)

>>> import pykakasi
>>> kks = pykakasi.kakasi()
>>>
>>> kks.convert("こんやく")
[{'orig': 'こんやく', 'hira': 'こんやく', 'kana': 'コンヤク', 'hepburn': 'konyaku', 'kunrei': 'konyaku', 'passport': 'konyaku'}]
>>>
>>> kks.convert("こにゃく")
[{'orig': 'こにゃく', 'hira': 'こにゃく', 'kana': 'コニャク', 'hepburn': 'konyaku', 'kunrei': 'konyaku', 'passport': 'konyaku'}]
>>>
>>> kks.convert("こんにゃく")
[{'orig': 'こんにゃく', 'hira': 'こんにゃく', 'kana': 'コンニャク', 'hepburn': 'konnyaku', 'kunrei': 'konnyaku', 'passport': 'konnyaku'}]

Expected behavior
こにゃく should be "konyaku"
こんやく should be "konnyaku"
こんにゃく should be "konnnyaku"

Environment (please complete the following information):

OS: [ docker alpine3.11 OS on 'macOS Big Sur' host ]
Python(CPython) [3.8.3]
pykakasi version: pykakasi==2.0.1 via pip install

Comment for fixing
"こんぶ" can be converted into "konbu" or "konnbu" (both can be correctly converted back into "こんぶ").
I think each has its advantages.
"konbu": it is shorter.
"konnbu": the conversion rule is clear; "ん" is always converted into "nn".

Instances of the kakasi() class share the _conv and _mode fields

Hi, I'm very happy to use the kakasi module in my current project HCE and in applied Japanese projects, but I have some notes. The most important is:

Because the _conv and _mode fields are defined outside of __init__, they are shared between two or more instances of the class. This makes it difficult to use two or more converter instances for different purposes, for example one to split text into tokens and one to convert to romaji...
Unless there are serious arguments against it, please move the definitions of all internal fields into __init__ to get conventional per-instance state.
Thanks.
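The effect described here is standard Python behavior for mutable class attributes, and it can be reproduced without pykakasi. The two classes below are hypothetical stand-ins, not the library's code.

```python
class SharedState:
    _mode = {}  # class attribute: one dict shared by every instance

    def set_mode(self, key, value):
        self._mode[key] = value


class PerInstanceState:
    def __init__(self):
        self._mode = {}  # instance attribute: a fresh dict per instance

    def set_mode(self, key, value):
        self._mode[key] = value


a, b = SharedState(), SharedState()
a.set_mode("J", "a")
print(b._mode)  # {'J': 'a'}: b sees the setting made on a

c, d = PerInstanceState(), PerInstanceState()
c.set_mode("J", "a")
print(d._mode)  # {}: d is unaffected
```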

市立 becomes イチリツ

text = "市立"
kakasi = kakasi()
kakasi.setMode('J', 'K')
kakasi.setMode('H', 'K')
kakasi.setMode('K', 'K')
conv = kakasi.getConverter()
result = conv.do(text)
print(result)

above code's output is
イチリツ

I am expecting
シリツ

Consideration for missing definitions in dictionary

KeyError with specific character

A bad character causes KeyError in pykakasi.

from pykakasi import kakasi


def text_convert():
    bad_char = "\ue098"  # U+E098, a private-use character (ord() == 57496)
    print(ord(bad_char))
    kks = kakasi()
    kks.setMode("J", "H")
    convert = kks.getConverter()

    text = convert.do(bad_char)
    print(text)


if __name__ == "__main__":
    text_convert()

results

57496
Traceback (most recent call last):
  File "/tmp/pyk.py", line 16, in <module>
    text_convert()
  File "/tmp/pyk.py", line 11, in text_convert
    text = convert.do(bad_char)
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/pykakasi/kakasi.py", line 146, in do                                                                     
    (t, l1) = self._conv[mode].convert(text[i:w])
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/pykakasi/j2.py", line 80, in convert_H                                                                   
    table = self._kanwa.load(text[0])
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/pykakasi/kanwa.py", line 40, in load                                                                     
    self._jisyo_table[key] = loads(decompress(self._kanwadict[key]))
  File "/home/picosushi/pyenv/py3/lib/python3.7/site-packages/semidbm/db.py", line 93, in __getitem__                                                                  
    offset, size = self._index[key]
KeyError: b'e098'

NOTE: This issue comes from the Japanese Stack Overflow question "python - KeyError occurs when replacing strings with pykakasi".
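A common way to handle characters that have no dictionary entry is to pass them through unchanged instead of raising. The sketch below uses a toy per-character table; `convert_with_fallback` is a hypothetical helper, and pykakasi's real lookup (multi-character entries in a dbm file) is more involved.

```python
def convert_with_fallback(text, table):
    """Convert characters via table; characters without an entry pass through."""
    return "".join(table.get(ch, ch) for ch in text)


table = {"見": "み"}  # toy table, for illustration only
converted = convert_with_fallback("見\ue098", table)
print(converted == "み\ue098")
```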
