Jamotools

A library for splitting Korean text into Jamo and vectorizing it.

Install

pip install jamotools

Unicode of Korean

According to version 9.0.0 of the Unicode Consortium's character database, the Hangul (Korean) blocks and characters in Unicode are as follows.

  • Hangul Jamo: 1100 ~ 11FF
  • WON SIGN in Currency Symbols: 20A9
  • HANGUL SINGLE DOT TONE MARK and HANGUL DOUBLE DOT TONE MARK in CJK Symbols and Punctuation: 302E ~ 302F
  • Hangul Compatibility Jamo : 3130 ~ 318F
  • Hangul in Enclosed CJK Letters and Months: 3200 ~ 321E, 3260 ~ 327F
  • Hangul Jamo Extended-A : A960 ~ A97F
  • Hangul Syllables : AC00 ~ D7AF
  • Hangul Jamo Extended-B : D7B0 ~ D7FF
  • Halfwidth Hangul variants in Halfwidth and Fullwidth Forms: FFA0 ~ FFDC
  • FULLWIDTH WON SIGN in Halfwidth and Fullwidth Forms: FFE6
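
As a rough illustration of how these ranges can be used, the sketch below looks up which Hangul block a character falls in. It is plain Python 3 and does not use jamotools; the names `HANGUL_RANGES` and `hangul_block` are made up for this example, and only the multi-character blocks from the list above are included.

```python
# Hangul-related ranges copied from the list above (inclusive, hexadecimal).
# HANGUL_RANGES and hangul_block are illustrative names, not jamotools API.
HANGUL_RANGES = [
    (0x1100, 0x11FF, "Hangul Jamo"),
    (0x3130, 0x318F, "Hangul Compatibility Jamo"),
    (0xA960, 0xA97F, "Hangul Jamo Extended-A"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0xD7B0, 0xD7FF, "Hangul Jamo Extended-B"),
    (0xFFA0, 0xFFDC, "Halfwidth Hangul variants"),
]


def hangul_block(char):
    """Return the name of the Hangul block containing char, or None."""
    code = ord(char)
    for start, end, name in HANGUL_RANGES:
        if start <= code <= end:
            return name
    return None


print(hangul_block("안"))  # Hangul Syllables
print(hangul_block("ㅏ"))  # Hangul Compatibility Jamo
print(hangul_block("a"))   # None
```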

Jamo

Hangul is composed of basic letters called Jamo. Unicode defines several kinds of Jamo, including old Hangul letters that are no longer in use. Jamotools supports only the modern Hangul Jamo ranges listed below.

  • Hangul Jamo: Consists of Choseong (initial consonants), Jungseong (medial vowels), and Jongseong (final consonants). The block is divided into modern Hangul and old Hangul that is no longer in use. Jamotools supports the modern Hangul Jamo ranges.
    • 1100 ~ 1112 (Choseong)
    • 1161 ~ 1175 (Jungseong)
    • 11A8 ~ 11C2 (Jongseong)
  • Hangul Compatibility Jamo: Jamo provided for compatibility with the Korean character standard KS X 1001. This block does not distinguish Choseong, Jungseong, and Jongseong.
    • 3131 ~ 3163 (modern Hangul Jamo area)
  • Halfwidth Hangul variants: The half-width Hangul area. It contains only modern Hangul Jamo. Ordinary Hangul text uses full-width characters.
    • FFA1 ~ FFDC
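
To show how a precomposed syllable relates to the Choseong, Jungseong, and Jongseong ranges above, here is a minimal sketch of the standard Unicode syllable arithmetic (19 initials, 21 medials, 27 finals plus "no final"). It is not taken from the jamotools source, and `decompose_syllable` is a name chosen only for this example.

```python
# Constants from the Unicode Hangul syllable composition arithmetic.
S_BASE = 0xAC00  # first precomposed syllable
L_BASE = 0x1100  # first modern Choseong
V_BASE = 0x1161  # first modern Jungseong
T_BASE = 0x11A7  # one before the first Jongseong; index 0 means "no final"
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def decompose_syllable(syllable):
    """Split one precomposed Hangul syllable into conjoining jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l_index = index // (V_COUNT * T_COUNT)
    v_index = (index % (V_COUNT * T_COUNT)) // T_COUNT
    t_index = index % T_COUNT
    jamos = [chr(L_BASE + l_index), chr(V_BASE + v_index)]
    if t_index:  # append a Jongseong only when the syllable has one
        jamos.append(chr(T_BASE + t_index))
    return tuple(jamos)


print([hex(ord(c)) for c in decompose_syllable("안")])
# ['0x110b', '0x1161', '0x11ab']
```

The result uses the Hangul Jamo block (1100 ~ 11FF), not Hangul Compatibility Jamo, so the characters print differently from the `ㅇ`, `ㅏ`, `ㄴ` forms typed on a keyboard.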

Manipulating Korean Jamo

The API for splitting syllables and joining jamos into syllables is based on hangul-utils.

  • split_syllables: Converts a string of syllables to a string of jamos. The Unicode jamo type of the output can be selected.
  • join_jamos: Converts a string of jamos to a string of syllables.
  • normalize_to_compat_jamo: Normalizes a string of jamos to a string of Hangul Compatibility Jamo.
>>> import jamotools
>>> print(jamotools.split_syllable_char(u"안"))
('ㅇ', 'ㅏ', 'ㄴ')

>>> print(jamotools.split_syllables(u"안녕하세요"))
ㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛ

>>> sentence = (u"앞 집 팥죽은 붉은 팥 풋팥죽이고, 뒷집 콩죽은 햇콩 단콩 콩죽.우리 집 "
...             u"깨죽은 검은 깨 깨죽인데 사람들은 햇콩 단콩 콩죽 깨죽 죽먹기를 싫어하더라.")
>>> s = jamotools.split_syllables(sentence)
>>> print(s)
ㅇㅏㅍ ㅈㅣㅂ ㅍㅏㅌㅈㅜㄱㅇㅡㄴ ㅂㅜㄺㅇㅡㄴ ㅍㅏㅌ ㅍㅜㅅㅍㅏㅌㅈㅜㄱㅇㅣㄱㅗ,
ㄷㅟㅅㅈㅣㅂ ㅋㅗㅇㅈㅜㄱㅇㅡㄴ ㅎㅐㅅㅋㅗㅇ ㄷㅏㄴㅋㅗㅇ ㅋㅗㅇㅈㅜㄱ.ㅇㅜㄹㅣ
ㅈㅣㅂ ㄲㅐㅈㅜㄱㅇㅡㄴ ㄱㅓㅁㅇㅡㄴ ㄲㅐ ㄲㅐㅈㅜㄱㅇㅣㄴㄷㅔ ㅅㅏㄹㅏㅁㄷㅡㄹㅇㅡㄴ
ㅎㅐㅅㅋㅗㅇ ㄷㅏㄴㅋㅗㅇ ㅋㅗㅇㅈㅜㄱ ㄲㅐㅈㅜㄱ ㅈㅜㄱㅁㅓㄱㄱㅣㄹㅡㄹ
ㅅㅣㅀㅇㅓㅎㅏㄷㅓㄹㅏ.

>>> sentence2 = jamotools.join_jamos(s)
>>> print(sentence2)
앞 집 팥죽은 붉은 팥 풋팥죽이고, 뒷집 콩죽은 햇콩 단콩 콩죽.우리 집 깨죽은 검은 깨
깨죽인데 사람들은 햇콩 단콩 콩죽 깨죽 죽먹기를 싫어하더라.

>>> print(sentence == sentence2)
True

Jamotools' API can manipulate Hangul Jamo from multiple Unicode areas. It also provides additional functions for manipulating Korean jamo.

>>> sentence = u"자모"

>>> jamos1 = jamotools.split_syllables(sentence, jamo_type="JAMO")
>>> print([hex(ord(c)) for c in jamos1])
['0x110c', '0x1161', '0x1106', '0x1169']
>>> sentence1 = jamotools.join_jamos(jamos1)
>>> print(sentence1)
자모

>>> jamos2 = jamotools.split_syllables(sentence, jamo_type="COMPAT")
>>> print([hex(ord(c)) for c in jamos2])
['0x3148', '0x314f', '0x3141', '0x3157']
>>> sentence2 = jamotools.join_jamos(jamos2)
>>> print(sentence2)
자모

>>> jamos3 = jamotools.split_syllables(sentence, jamo_type="HALFWIDTH")
>>> print([hex(ord(c)) for c in jamos3])
['0xffb8', '0xffc2', '0xffb1', '0xffcc']
>>> sentence3 = jamotools.join_jamos(jamos3)
>>> print(sentence3)
자모

>>> print(sentence == sentence1 == sentence2 == sentence3)
True

>>> normalize1 = jamotools.normalize_to_compat_jamo(jamos1)
>>> normalize2 = jamotools.normalize_to_compat_jamo(jamos2)
>>> normalize3 = jamotools.normalize_to_compat_jamo(jamos3)
>>> print(jamos1 == jamos2 == jamos3)
False
>>> print(normalize1 == normalize2 == normalize3)
True
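
The idea behind normalizing different jamo areas to one form can be sketched without the library. The following is only an approximation under stated assumptions, not the implementation of `normalize_to_compat_jamo`: `to_compat_jamo` and its lookup strings are hypothetical names, it covers only the modern ranges listed earlier, and it relies on the `<narrow>` compatibility mappings exposed by Python's `unicodedata` module for the half-width letters.

```python
import unicodedata

# Compatibility Jamo in the same order as the modern Choseong (1100 ~ 1112)
# and Jongseong (11A8 ~ 11C2) code points. Illustrative tables only.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JONGSEONG = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"


def to_compat_jamo(text):
    """Map modern conjoining and half-width jamo to Compatibility Jamo."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x1100 <= code <= 0x1112:      # Choseong
            out.append(CHOSEONG[code - 0x1100])
        elif 0x1161 <= code <= 0x1175:    # Jungseong map onto 314F ~ 3163
            out.append(chr(0x314F + code - 0x1161))
        elif 0x11A8 <= code <= 0x11C2:    # Jongseong
            out.append(JONGSEONG[code - 0x11A8])
        elif 0xFFA1 <= code <= 0xFFDC:    # half-width Hangul letters
            decomp = unicodedata.decomposition(ch)  # e.g. "<narrow> 3148"
            if decomp.startswith("<narrow>"):
                out.append(chr(int(decomp.split()[1], 16)))
            else:
                out.append(ch)            # unassigned code point, keep as-is
        else:
            out.append(ch)                # anything else passes through
    return "".join(out)


jamo = "\u110c\u1161\u1106\u1169"       # "자모" as Hangul Jamo
halfwidth = "\uffb8\uffc2\uffb1\uffcc"  # "자모" as half-width jamo
print(to_compat_jamo(jamo) == to_compat_jamo(halfwidth) == "ㅈㅏㅁㅗ")  # True
```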

Vectorize Korean Jamo

Jamotools supports vectorization according to a RULE. Each RULE defines how a sentence is split into Jamo and which types of symbols are converted. This can be used for character-level Korean text processing.

  • Vectorizationer: Class that vectorizes text according to a RULE and pads the result.
>>> v = jamotools.Vectorizationer(rule=jamotools.rules.RULE_1,
...                               max_length=None,
...                               prefix_padding_size=0)
>>> print(v.vectorize(u"안녕"))
[13, 21, 45,  4, 27, 62]
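
For readers new to character-level vectorization, the toy sketch below shows the general idea of mapping each jamo to an integer id and padding to a fixed length. It does not reproduce RULE_1: the vocabulary, the `PAD` and `UNK` ids, and the function names are all hypothetical, so the numbers differ from the output above.

```python
PAD, UNK = 0, 1  # hypothetical reserved ids, not the values jamotools uses


def build_vocab(symbols):
    """Assign consecutive ids, starting after the reserved ones."""
    return {symbol: index + 2 for index, symbol in enumerate(symbols)}


def vectorize(jamos, vocab, max_length=None):
    """Map each jamo to its id, then truncate or right-pad to max_length."""
    ids = [vocab.get(ch, UNK) for ch in jamos]
    if max_length is not None:
        ids = ids[:max_length] + [PAD] * (max_length - len(ids))
    return ids


vocab = build_vocab("ㄱㄴㅇㅏㅕ")  # tiny illustrative vocabulary
print(vectorize("ㅇㅏㄴㄴㅕㅇ", vocab, max_length=8))
# [4, 5, 3, 3, 6, 4, 0, 0]
```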

Custom RULE

A custom RULE class can be added to Jamotools with the following steps.

  1. Create a custom RULE class that inherits from RuleBase (e.g. Rule2) in rules.py, in the same way as Rule1.
  2. Add a constant for the custom RULE, like RULE_1.
  3. Modify the get_rule function to return the custom RULE class.

The custom RULE can then be used in the same way as RULE_1.

>>> v = jamotools.Vectorizationer(rule=jamotools.rules.RULE_2,
...                               max_length=None,
...                               prefix_padding_size=0)

jamotools's Issues

Maximum evaluation

I'm trying to use this library to create a dataset for ML. My question is: what is the largest number that the built-in vectorization function can produce? Maybe 62, but I am not sure.
