suminb / hanja


Hangul and Hanja library

Python 100.00%

python hanja hangul nlp

hanja's Issues

Some Hanja remain unconverted

Hello,

While using the hanja library, I found a Hanja character that does not get converted.

>>> input_text = '女fjdks南減朴a로롤로롤로 '
>>> hanja.translate(input_text, 'substitution')  # Hanja -> Hangul substitution
'女fjdks남감박a로롤로롤로 '

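As shown above, the character 女 (계집 녀) is not converted.

The data comes from the news dataset of the Modu Corpus (모두의 말뭉치),
and my development environment is ubuntu 18.03, python 3.8.3.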

Installation with pip install fails

Hello! I'm opening an issue about installing the hanja library.

Installing with pip install hanja produces the following error.

Collecting hanja
  Using cached hanja-0.14.1.tar.gz (121 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error       

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      <string>:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
      Traceback (most recent call last):    
        File "C:\Users\mclub4\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
          main()
        File "C:\Users\mclub4\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\mclub4\AppData\Local\Programs\Python\Python311\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)      
                 ^^^^^^^^^^^^^^^^^^^^^      
        File "C:\Users\mclub4\AppData\Local\Temp\pip-build-env-wgg3us39\overlay\Lib\site-packages\setuptools\build_meta.py", line 327, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=[])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\mclub4\AppData\Local\Temp\pip-build-env-wgg3us39\overlay\Lib\site-packages\setuptools\build_meta.py", line 297, in _get_build_requires
          self.run_setup()
        File "C:\Users\mclub4\AppData\Local\Temp\pip-build-env-wgg3us39\overlay\Lib\site-packages\setuptools\build_meta.py", line 497, in run_setup
          super().run_setup(setup_script=setup_script)
        File "C:\Users\mclub4\AppData\Local\Temp\pip-build-env-wgg3us39\overlay\Lib\site-packages\setuptools\build_meta.py", line 313, in run_setup
          exec(code, locals())
        File "<string>", line 17, in <module>
      FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'        
      [end of output]

The same error occurs not only for me but also for my colleagues in other environments. Please take a look!

Please release a new version on PyPI

The version currently on PyPI raises the following error.

  File "translit.py", line 36, in tra
    input = hanja.translate(input, 'substitution')
  File "/private/tmp/.env/lib/python2.7/site-packages/hanja/hanja.py", line 46, in translate
    split_hanja(text)))
  File "/private/tmp/.env/lib/python2.7/site-packages/hanja/hanja.py", line 45, in <lambda>
    return ''.join(map(lambda w: translate_word(w, mode),
  File "/private/tmp/.env/lib/python2.7/site-packages/hanja/hanja.py", line 54, in translate_word
    tw = ''.join(map(translate_syllable, u' '+word[:-1], word))
  File "/private/tmp/.env/lib/python2.7/site-packages/hanja/hanja.py", line 14, in translate_syllable
    return dooeum(previous, hanja_table[current])
  File "/private/tmp/.env/lib/python2.7/site-packages/hanja/hangul.py", line 30, in dooeum
    p, c = Hangul.separate(previous), Hangul.separate(current)
NameError: global name 'Hangul' is not defined

Consolidate mode, format_string parameters

It is possible to specify a translation mode when calling translate().

hanja.translate('大韓民國은 民主共和國이다.', 'substitution')
hanja.translate('大韓民國은 民主共和國이다.', 'combination-text')

It is also possible to define a custom translation mode by supplying the format_string parameter.

hanja.translate('大韓民國은 民主共和國이다.', 'combination-text', format_string='{hanja} {hangul}')

In such cases, the mode parameter serves no purpose.

I would like to revise translate() so that it takes only a format_string parameter, with pre-defined format strings provided for the existing translation modes (substitution, combination-text, combination-html). It would look like this:

hanja.translate('大韓民國은 民主共和國이다.', Mode.substitution)
hanja.translate('大韓民國은 民主共和國이다.', Mode.combination_text)
hanja.translate('大韓民國은 民主共和國이다.', '{hanja} <{hangul}>')
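A minimal sketch of how the proposed pre-defined modes could look, assuming each mode is simply a named format string with {hanja} and {hangul} placeholders. The Mode class, its values, and the render helper are hypothetical and not part of the library; the format strings are inferred from the outputs quoted in these issues.

```python
class Mode:
    """Hypothetical pre-defined format strings, one per existing mode."""
    substitution = "{hangul}"
    combination_text = "{hanja}({hangul})"
    combination_html = (
        '<span class="hanja">{hanja}</span>'
        '<span class="hangul">({hangul})</span>'
    )


def render(hanja_word, hangul_word, format_string):
    """Format one already-translated word with a mode or a custom string."""
    return format_string.format(hanja=hanja_word, hangul=hangul_word)
```

With this shape, a custom string such as '{hanja} <{hangul}>' goes through exactly the same path as a pre-defined mode, which is the point of dropping the separate mode parameter.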

The hanja module stopped working correctly after the update to 0.14.0.

It looks like files such as impl.py were left out when the package was uploaded to PyPI.

>>> hanja.split_hanja("대한민국은 한자로 표현하면 大韓民國이다.")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.10/site-packages/hanja/__init__.py", line 38, in load_and_call
    mod = __import__(import_path)
ModuleNotFoundError: No module named 'hanja.impl'

Compiling with pyinstaller does not work if "--collect-all hanja" is not used

Hello! Unfortunately, I am not an advanced python coder so I'm not sure if I can provide the best feedback but here is what I encountered.

Compile my app using pyinstaller
pyinstaller -F -w main.py
Run the application using the main.exe
The hanja-related function does not work. The app does not crash and does not throw any error at all. It just does nothing.
I go back to my IDE and test it. Everything is working perfectly fine without any errors.
I start to use additional commands with pyinstaller. I try --hidden-import, it does not work. Then, I try --collect-all and it works.
I get confused. I have no idea why it didn't work previously but it worked after.

This is the only bit of hanja I use in my code:
converted_source = hanja.translate(source_value, 'substitution')

I do not know what info/files/logs to share with you to provide more information, please let me know if you want more specific details. It might be a rare issue, I might be the only person encountering this type of problem.

Edit: I'm using python 3.11 and a venv. Also, I am able to reproduce this problem 100% of the time.

Unrecognized Hanja

How can I get characters such as 𤍠 (U+24360) and 𨽾 (U+28F7E) handled?
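Both characters lie outside the Basic Multilingual Plane, so in Python they need the eight-digit escapes '\U00024360' and '\U00028F7E'. The sketch below only checks whether a code point falls in one of the main CJK ideograph blocks. The is_cjk_ideograph helper is hypothetical, and whether the library's table has readings for such characters is a separate question that this sketch does not answer.

```python
# Code point ranges of the main CJK ideograph blocks in Unicode.
CJK_BLOCKS = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)


def is_cjk_ideograph(ch):
    """Return True if the single character `ch` is in one of CJK_BLOCKS."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_BLOCKS)
```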

Being aware of some hanjas' phonetic changes

Some hanjas like 金/讀/畵 can be pronounced in different ways. The current behavior can produce incorrect results in some cases e.g.:

Input: 金日成綜合大學은 平壤에 있는 朝鮮民主主義人民共和國의 國立大學이다.
Expected output: 김일성종합대학은 평양에 있는 조선민주주의인민공화국의 국립대학이다.
Actual output: 금일성종합대학은 평양에 있는 조선민주주의인민공화국의 국립대학이다.

Hanja	Word 1	Word 2
金	金剛經 (금강경)	金浦國際空港 (김포국제공항)
讀	讀書 (독서)	句讀點 (구두점)
畵	畵龍點睛 (화룡점정)	企畵 (기획)
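One way to handle such cases is to consult a word-level table before falling back to per-character readings, preferring the longest match. The sketch below assumes two small hypothetical tables (WORD_READINGS and CHAR_READINGS) filled with the examples from this issue. It is not how the library works today.

```python
# Hypothetical word-level readings that take priority over single characters.
WORD_READINGS = {"金日成": "김일성", "金浦": "김포", "句讀點": "구두점", "企畵": "기획"}
# Hypothetical default per-character readings used as a fallback.
CHAR_READINGS = {"金": "금", "讀": "독", "畵": "화", "剛": "강", "經": "경", "書": "서"}


def read_with_words(text):
    """Translate `text`, trying the longest word-level match at each position."""
    out = []
    i = 0
    max_len = max(map(len, WORD_READINGS))
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in WORD_READINGS:
                out.append(WORD_READINGS[chunk])
                i += size
                break
        else:
            # No word matched here: fall back to the single-character reading.
            ch = text[i]
            out.append(CHAR_READINGS.get(ch, ch))
            i += 1
    return "".join(out)
```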

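Initial sound law (두음법칙) error

Hello,

龍潭 is converted to 龍담 instead of 용담.
麗川 is converted to 麗천 instead of 여천.

This looks like a problem with the initial sound law (두음법칙).
I would appreciate it if you could look into it.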

Documentation

Do some documentation!!

Two versions of the same Chinese character

Hi. It seems that the same Chinese character can have two versions, which look slightly different and also have different unicode values. And only one version is recognized as hanja.

For example, 李 has two versions, unicode 674e and unicode f9e1. Only the first version passes as hanja:

My guess, from looking at 李, 金, 宅, is that all unicode values f900-fa60 in the unicode tables (http://www.tamasoft.co.jp/en/general-info/unicode.html) suffer the same problem.

Would it be possible to include unicode values f900-fa60 to be recognized by hanja?

Thank you!
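A possible workaround on the caller's side, assuming the input may be normalized before translation: CJK compatibility ideographs such as U+F9E1 have a canonical mapping to their unified counterparts, so Unicode normalization folds them. The helper name below is hypothetical, and a few code points in the compatibility block have no such mapping and stay unchanged.

```python
import unicodedata


def fold_compatibility_ideographs(text):
    """Map CJK compatibility ideographs to unified ideographs via NFC."""
    return unicodedata.normalize("NFC", text)
```

NFC also composes other sequences (for example, conjoining Hangul jamo into syllables), so this changes more than just the ideographs.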

you need PyYAML in install_requires in setup.py for pypi package

pip install error

Hello, I tried to install the package today but got an error. Is there a problem with the package?

Collecting hanja
  Using cached hanja-0.14.1.tar.gz (121 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/private/var/folders/d5/3zwq438d2mn5d_92hlxqzcg00000gn/T/pip-install-5mq29t6d/hanja_536dcc2e774245a1998596ebf5b3c5d5/setup.py", line 17, in <module>
          with open("requirements.txt") as f:
      FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Combination Modes

We currently have a combination mode in which Hanja characters are kept as-is and followed by their Hangul reading in parentheses. Each class of characters is wrapped in a different <span> tag to distinguish their semantics.

>>> hanja.translate(u'大韓民國은 民主共和國이다.', 'combination')
<span class="hanja">大韓民國</span><span class="hangul">(대한민국)</span>은 <span class="hanja">民主共和國</span><span class="hangul">(민주공화국)</span>이다.

However, I thought it would be useful to provide a text-only combination mode, assuming not everyone uses this library to produce HTML.

>>> hanja.translate(u'大韓民國은 民主共和國이다.', 'combination-html')
<span class="hanja">大韓民國</span><span class="hangul">(대한민국)</span>은 <span class="hanja">民主共和國</span><span class="hangul">(민주공화국)</span>이다.

>>> hanja.translate(u'大韓民國은 民主共和國이다.', 'combination-text')
大韓民國(대한민국)은 民主共和國(민주공화국)이다.

Backward compatibility may be preserved by making the legacy combination mode fall back to combination-html.
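The fallback could be as small as an alias table consulted before dispatching on the mode. This is a sketch under that assumption; LEGACY_MODE_ALIASES and normalize_mode are hypothetical names.

```python
# Hypothetical alias table: legacy mode names mapped to their replacements.
LEGACY_MODE_ALIASES = {"combination": "combination-html"}


def normalize_mode(mode):
    """Return the current name for `mode`, resolving legacy aliases."""
    return LEGACY_MODE_ALIASES.get(mode, mode)
```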

Please, support python 3.x version.

It does not work properly in Python 3.5.
Does it work only on Python 2.x?
Please support Python 3.x.

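License

I am building an extension app/program that does something similar to this library. I am interested in using (borrowing) table.yml, which organizes Hanja readings very thoroughly, and I would like to suggest adding a license that states whether I and many others may legitimately do so.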
But, of course, "[y]ou're under no obligation to choose a license."

hanja-0.13.1, No such file or directory: 'requirements.txt'

C:\temp>py -m pip install hanja
Collecting hanja
  Using cached hanja-0.13.1.tar.gz (119 kB)
    ERROR: Command errored out with exit status 1:
     command: 'C:\Programs\Python3864\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\usrme\\AppData\\Local\\Temp\\pip-install-2zu2r50b\\hanja\\setup.py'"'"'; __file__='"'"'C:\\Users\\usrme\\AppData\\Local\\Temp\\pip-install-2zu2r50b\\hanja\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\usrme\AppData\Local\Temp\pip-install-2zu2r50b\hanja\pip-egg-info'
         cwd: C:\Users\usrme\AppData\Local\Temp\pip-install-2zu2r50b\hanja\
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\usrme\AppData\Local\Temp\pip-install-2zu2r50b\hanja\setup.py", line 17, in <module>
        with open("requirements.txt") as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Installing 0.13.1 with pip fails with the error above because requirements.txt is missing.

The file is not included in hanja-0.13.1.tar.gz.
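The tracebacks show setup.py opening requirements.txt at line 17, so the build fails whenever the source distribution omits that file. One defensive option, sketched below, is to fall back to a hard-coded list when the file is absent. The read_requirements helper and its fallback value are assumptions for illustration (PyYAML is named as a dependency in another issue here). Shipping the file in the sdist, for example through an include line in MANIFEST.in, would be the other route.

```python
import os


def read_requirements(path="requirements.txt", fallback=("pyyaml",)):
    """Return requirement lines from `path`, or `fallback` if it is missing."""
    if not os.path.exists(path):
        return list(fallback)
    with open(path) as f:
        # Skip blank lines and comment lines.
        return [line.strip() for line in f
                if line.strip() and not line.startswith("#")]
```

In setup.py the result would be passed as install_requires=read_requirements().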

Readings of Hanja with two or more pronunciations

Hello, thank you for making such a great package. I am getting a lot of use out of it.

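I mainly use it when reading Chinese-language books: to improve readability and reading speed,
I attach a reading guide line beneath the original text.
(Since I started doing this, my reading speed has doubled... thank you!)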

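When a Hanja has two or more pronunciations, it is often read incorrectly.
I think this is a fundamental limitation of Hanja as a system rather than a problem with the package.
For example, reading '适合' with Hanja gives '괄합', whereas the correct reading is '적합'.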

The Naver Hanja dictionary lists two readings, 빠를 괄 and 적합할 적,
but in most cases the first reading is the one that is output.
This happens often enough that it is a little disappointing.

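Personally, I squeezed everything I could out of my two months of Python experience
and built a user dictionary for my own use.
Adding an entry such as 适:적 to it overrides the package's internal dictionary and takes priority.
But my skills are so limited that the code is messy and inefficient...
It would be great if the package itself provided a user dictionary feature.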
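A user dictionary that takes priority over the built-in table can be layered without copying or modifying that table, for example with collections.ChainMap. In the sketch below, BUILTIN_READINGS is a two-entry stand-in for the package's internal table and make_reader is a hypothetical helper; neither reflects the library's actual API.

```python
from collections import ChainMap

# Stand-in for the package's internal table (hypothetical contents).
BUILTIN_READINGS = {"适": "괄", "合": "합"}


def make_reader(user_readings):
    """Build a reader in which `user_readings` shadow the built-in readings."""
    table = ChainMap(user_readings, BUILTIN_READINGS)

    def read(text):
        # Characters without a reading are passed through unchanged.
        return "".join(table.get(ch, ch) for ch in text)

    return read
```

Lookups hit the user mapping first and the built-in table second, so the built-in data is never mutated.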

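The other issue is the initial sound law (두음법칙).
Chinese has no spaces between words, so when Hanja is applied to a Chinese document,
the initial sound law is applied only at the very start of a sentence and nowhere else.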

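For example, 里面 (이면), as in 이면세계, always comes out as 리면.
If there were a way to correct this as well
(for example, by supporting a pronunciation dictionary that takes priority over the initial sound law function),
I think the package would become much more useful.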
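For reference, the core of the initial sound law can be sketched with plain Hangul syllable arithmetic: a word-initial ㄹ becomes ㅇ before ㅑ, ㅕ, ㅖ, ㅛ, ㅠ, ㅣ and ㄴ otherwise, and a word-initial ㄴ becomes ㅇ before ㅕ, ㅛ, ㅠ, ㅣ. This is a simplified illustration and not the library's dooeum function. It assumes the caller already knows the word boundaries, and it ignores the rule's exceptions.

```python
HANGUL_BASE = 0xAC00
LEAD_N, LEAD_R, LEAD_NULL = 2, 5, 11       # indices of ㄴ, ㄹ, ㅇ
R_TO_NULL_VOWELS = {2, 6, 7, 12, 17, 20}   # ㅑ ㅕ ㅖ ㅛ ㅠ ㅣ
N_TO_NULL_VOWELS = {6, 12, 17, 20}         # ㅕ ㅛ ㅠ ㅣ


def initial_sound_law(syllable):
    """Apply the rule to one precomposed Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < 11172:
        return syllable  # not a precomposed Hangul syllable
    lead, rest = divmod(offset, 588)  # 588 = 21 vowels * 28 finals
    vowel = rest // 28
    if lead == LEAD_R:
        lead = LEAD_NULL if vowel in R_TO_NULL_VOWELS else LEAD_N
    elif lead == LEAD_N and vowel in N_TO_NULL_VOWELS:
        lead = LEAD_NULL
    return chr(HANGUL_BASE + lead * 588 + rest)


def apply_to_words(words):
    """Apply the rule to the first syllable of each pre-segmented word."""
    return [initial_sound_law(w[0]) + w[1:] if w else w for w in words]
```

The hard part for unspaced Chinese text is the segmentation itself, which is why a word-level pronunciation dictionary is the more practical request.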

Thank you again for sharing such a good module.

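Some Simplified Chinese characters are converted to Hangul incorrectly

I downloaded the library and have been testing it,
and I am filing this issue because some Simplified Chinese characters are converted to Hangul incorrectly.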

China's Liaoning Province, 辽宁省 (요녕성), is converted to "요저성".
China's Guangdong Province, 广东省 (광동성), is converted to "엄동성".

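This seems to happen because the traditional character 宁 (쌓을 저) is identical to the simplified 宁,
and the traditional character 广 (집 엄) is identical to the simplified 广.
Would it be possible to scan the whole string, decide whether it is traditional or simplified, and then convert?
Alternatively, could the traditional/simplified choice be passed in as a parameter?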
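Both suggestions can be combined: accept an explicit script parameter, and when it is omitted, guess from the presence of characters that exist only as simplified forms. The sketch below is a rough heuristic under that assumption. All tables are tiny hypothetical samples built from the examples in this issue, and read_by_script is not a library function.

```python
# Hypothetical sample of characters that exist only as simplified forms.
SIMPLIFIED_ONLY = set("辽东")
# Hypothetical readings; 宁 and 广 are read differently depending on the script.
TRADITIONAL = {"宁": "저", "广": "엄", "省": "성"}
SIMPLIFIED = {"宁": "녕", "广": "광", "辽": "요", "东": "동", "省": "성"}


def looks_simplified(text):
    """Guess the script: any simplified-only character marks the text."""
    return any(ch in SIMPLIFIED_ONLY for ch in text)


def read_by_script(text, script=None):
    """Translate `text`, picking the table by `script` or by detection."""
    if script is None:
        script = "simplified" if looks_simplified(text) else "traditional"
    table = SIMPLIFIED if script == "simplified" else TRADITIONAL
    return "".join(table.get(ch, ch) for ch in text)
```

Detection fails on short strings that contain only ambiguous characters such as 宁 alone, so the explicit parameter remains necessary.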

Combination mode translation

The current combination translation mode (혼용 모드 변환) is designed for a particular web application. Consider a different interface that accepts a function for generating custom output.
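Such an interface might take a callable that receives each Hanja run together with its Hangul reading and returns the text to emit. The sketch below assumes the input has already been split into segments. render_segments and the segment shape are hypothetical.

```python
def render_segments(segments, formatter):
    """Join segments, passing each (hanja, hangul) pair to `formatter`.

    `segments` mixes plain strings with (hanja, hangul) tuples.
    """
    parts = []
    for seg in segments:
        if isinstance(seg, tuple):
            parts.append(formatter(*seg))
        else:
            parts.append(seg)
    return "".join(parts)
```

A caller could then produce plain text, HTML spans, or ruby markup by swapping the function, without the library knowing about any of those formats.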

I tried to test it, but it throws an error.

ImportError: cannot import name 'hangul' from partially initialized module 'hanja' (most likely due to a circular import
