Open Korean Text Processor - An Open-source Korean Text Processor

License: Apache License 2.0

Scala 75.01% Java 24.99%

korean korean-text-processing natural-language-processing text-processing tokenizer korean-tokenizer

open-korean-text's Introduction

open-korean-text

Open-source Korean Text Processor (Official Fork of twitter-korean-text)

A Scala library for processing Korean text, with a Java wrapper. open-korean-text currently provides Korean normalization, tokenization, stemming, and phrase extraction. It handles long documents as well as short tweets. To take part in development, please join our community at Google Forum; everyone is welcome, from beginners who want to learn how to use the library to people who want to contribute code.

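Detailed guide to installing and modifying the library

The goal of open-korean-text is to extract index terms from big data and similar sources through simple Korean processing. It does not aim to be a full-fledged morphological analyzer.

open-korean-text supports four features: normalization, tokenization, stemming, and phrase extraction.

Normalization (입니닼ㅋㅋ -> 입니다 ㅋㅋ, 샤릉해 -> 사랑해)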

한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ -> 한국어를 처리하는 예시입니다 ㅋㅋ

Tokenization

한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하는Verb, 예시Noun, 입니다Adjective(이다), ㅋㅋKoreanParticle

Stemming (입니다 -> 이다)

한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하다Verb, 예시Noun, 이다Adjective, ㅋㅋKoreanParticle

Phrase extraction

한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어, 처리, 예시, 처리하는 예시

Introductory Presentation: Google Slides

Web API Service

open-korean-text-api
This API service is hosted on Heroku (Domain: https://open-korean-text.herokuapp.com/) and currently provides normalization, tokenization, stemming, and phrase extraction.

The services and their usage are listed below.
The action for each service is normalize, tokenize, stem, or extractPhrases, and the query parameter is text.

Service	Usage
Normalization	https://open-korean-text-api.herokuapp.com/normalize?text=오픈코리안텍스트
Tokenization	https://open-korean-text-api.herokuapp.com/tokenize?text=오픈코리안텍스트
Stemming	https://open-korean-text-api.herokuapp.com/stem?text=오픈코리안텍스트
Phrase extraction	https://open-korean-text-api.herokuapp.com/extractPhrases?text=오픈코리안텍스트
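
The usage table above can be exercised from any HTTP client. Below is a minimal sketch in Java that relies only on what the table shows: the base URL, the four action names, and the text query parameter. The class name OktApiUrl and its build helper are hypothetical, and the sketch only constructs the request URL (percent-encoding the Korean text as UTF-8); it does not call the service, which may or may not still be online.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of open-korean-text): builds request URLs
// for the Web API described in the usage table.
final class OktApiUrl {
    // Base URL copied from the usage table; not verified to be live.
    static final String BASE = "https://open-korean-text-api.herokuapp.com/";

    // action is one of: normalize, tokenize, stem, extractPhrases
    static String build(String action, String text) {
        try {
            // Korean text must be percent-encoded as UTF-8 in the query string.
            return BASE + action + "?text=" + URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("tokenize", "오픈코리안텍스트"));
    }
}
```

URLEncoder uses form encoding, so a space in the text becomes a plus sign, which is acceptable inside a query string.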

Semantic Versioning

1.0.2 (Major.Minor.Patch)

Major: API change
Minor: Processor behavior change
Patch: Bug fixes without a behavior change
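
As an illustration of how these three meanings could guide an upgrade decision, here is a small sketch in Java. The class OktVersionChange and its methods are hypothetical and are not part of the library; the sketch simply compares two Major.Minor.Patch strings and reports which of the meanings above applies.

```java
// Hypothetical helper (not part of open-korean-text): names the kind of
// change between two versions using the Major.Minor.Patch meanings above.
final class OktVersionChange {
    // Splits "Major.Minor.Patch" into three integers.
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected Major.Minor.Patch: " + version);
        }
        return new int[] {
            Integer.parseInt(parts[0]),
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2])
        };
    }

    // Reports the most significant component that differs.
    static String describe(String from, String to) {
        int[] a = parse(from);
        int[] b = parse(to);
        if (a[0] != b[0]) return "API change";
        if (a[1] != b[1]) return "Processor behavior change";
        if (a[2] != b[2]) return "Bug fixes without a behavior change";
        return "No change";
    }

    public static void main(String[] args) {
        System.out.println(describe("1.0.2", "2.1.0"));
    }
}
```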

API

Maven

To include this in your Maven-based JVM project, add the following lines to your pom.xml:

  <dependency>
    <groupId>org.openkoreantext</groupId>
    <artifactId>open-korean-text</artifactId>
    <version>2.1.0</version>
  </dependency>

Maven Repository: http://mvnrepository.com/artifact/org.openkoreantext/open-korean-text

Support for other languages.

Type	Language	Contributor
Wrapper	.net/C#	modamoda
Wrapper	Node JS	Ch0p
Wrapper	Node JS	Youngrok Kim
Wrapper	Python	Jaepil Jeong
Wrapper	Clojure	Seonho Kim
Wrapper	Ruby for Java Version	jun85664396
Wrapper	Ruby for Scala Version	Jaehyun Shin
Porting	Python	Baeg-il Kim
Package	Python Korean NLP	KoNLPy
Package	Elastic Search	socurites
Package	Elastic Search	Jaehyun Shin
Package	JavaScript (browser-compatible)	Grégoire Geis

Get the source

Clone the git repo and build using Maven.

git clone https://github.com/open-korean-text/open-korean-text.git
cd open-korean-text
mvn compile

Open 'pom.xml' from your favorite IDE.

Basic Usage

You can find these examples in the examples folder.

Running Tests

mvn test runs all of our unit tests.

Contribution

Refer to the general contribution guide. We will add this project-specific contribution guide later.

Detailed guide to installing and modifying the library

Performance

Tested on Intel i7 2.3 GHz

Initial loading time: 2~4 sec

Average time to parse a chunk (eojeol, a space-delimited word unit): 0.12 ms

Tweets (Avg length ~50 chars)

Tweets	100K	200K	300K	400K	500K	600K	700K	800K	900K	1M
Time in Seconds	57.59	112.09	165.05	218.11	270.54	328.52	381.09	439.71	492.94	542.12

Average per tweet: 0.54212 ms
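
The per-tweet figure follows directly from the table: 542.12 seconds for 1M tweets is 542.12 × 1000 / 1,000,000 = 0.54212 ms. The sketch below, in Java, repeats that arithmetic for every column so the scaling can be checked. The class name OktBenchmark is hypothetical, and the numbers are copied from the table rather than measured again.

```java
// Hypothetical helper: recomputes the average time per tweet from the
// benchmark table (values copied from the table, not re-measured).
final class OktBenchmark {
    // Seconds reported for 100K, 200K, ..., 1M tweets.
    static final double[] SECONDS = {
        57.59, 112.09, 165.05, 218.11, 270.54,
        328.52, 381.09, 439.71, 492.94, 542.12
    };

    // Average milliseconds per tweet for the column at the given index.
    static double msPerTweet(int column) {
        long tweets = (column + 1) * 100_000L;
        return SECONDS[column] * 1000.0 / tweets;
    }

    public static void main(String[] args) {
        for (int column = 0; column < SECONDS.length; column++) {
            System.out.printf("%4dK tweets: %.5f ms per tweet%n",
                    (column + 1) * 100, msPerTweet(column));
        }
    }
}
```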

Benchmark test by KoNLPy

From http://konlpy.org/ko/v0.4.3/morph/#pos-tagging-with-konlpy

Author

Will Hohyon Ryu (유호현): https://github.com/nlpenguin | https://twitter.com/NLPenguin

Admin Staff

Mingyu Kim (김민규): https://github.com/MechanicKim

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0


open-korean-text's Issues

Support top n parsings.

Add ability to prioritize/constrain `tokenize` results

Our application uses OpenKoreanTextProcessorJava.tokenize to segment Korean text for word-by-word translation into another language.

At the moment, OKT returns the tokenization that best matches words in its dictionary. Sometimes we have no translation for those words, while a finer-grained tokenization would yield tokens that we can translate.

As an example, tokenize returns a single token for 평창올림픽, but (평창, 올림픽) is also a valid tokenization. We don't have a translation available for 평창올림픽.

In this case, we'd like to be able to rule out 평창올림픽 as a valid token. This could be done by:

Adding the ability to give a set of words a penalty in TokenizerProfile
Adding the ability to remove words from the dictionary

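Is there an algorithm for reassembling morphemes into a sentence?

Is there an algorithm for reassembling morphemes into a sentence? I'd like to use this for TTS, but the problem is recombining the morphemes that come out of TTS back into a sentence.

Thank you.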

Fix incorrect stems

붇다 -> 부다
졸르 -> 조르
일르 -> 이르
골르 -> 고르
...

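Question about the corpus license

The Elastic group says it can officially promote elasticsearch-analysis-openkoreantext.

"We can promote it officially on the Elastic website, but only if it is under the Apache 2 license. The engine usually is, but the corpus dictionary often has a different license, which makes it impossible. Thank you."

That was their message. Does the corpus have its own license, or does it follow the same license as the source code?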

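Dictionary path not updated

Hello, while using OpenKoreanText I found some code that appears not to have been updated, so I'm posting it here. (I haven't used GitHub much, so I'm not sure this is the right place.)
Line 61 of scala/org/openkoreantext/processor/tools/DeduplicateAndSortDictionaries.scala reads:
val outputFolder = "src/main/resources/com/twitter/penguin/korean/util/"
It seems to have been missed during the migration, and I think it should be fixed.

Thank you.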

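Could an Elasticsearch plugin project be created?

I'm planning to develop an Elasticsearch analysis plugin.

There is a related forked repository in open-korean-text, but it does not appear to be kept up to date with new versions or to accept PRs. After some searching, I found several repositories that fork the same repository and use it with small modifications.
Creating yet another repository for my own needs, or forking once more, doesn't seem ideal either.

So how about creating a project under open-korean-text and maintaining it there?

In fact, providers that offer Elasticsearch as a cloud service, such as AWS Elasticsearch, compose.com, and elastic.co, give an option to add language analysis plugins. Plugins for languages such as Chinese and Japanese are provided by default, whereas support for Korean plugins is nonexistent.

Running the project through the Organization should make ongoing maintenance easier, and it would also put us in a better position to ask cloud providers that offer Elasticsearch as a service to provide the plugin as an option.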

Get a list of stems instead of one?

Currently, stemming returns one stem, but some conjugated forms can map to more than one stem. For example, 갈 거예요 could come from either 가다 or 갈다 depending on the context. Another example is #66.

Is there a way to return a list of possible stems instead of just one? If not, could you point me in the right direction for implementing it myself? I can put in a PR once I'm done.

Incorrect stem for 외로워

I noticed another stem issue: "외로워" doesn't return "외롭다" as expected.

Start an API Java Service

Probably using Heroku

maven cli exec error on running examples

When I run the examples (examples/JavaOpenKoreanTextProcessorExample) from the command line, the following error occurs. The error dump is below.

(I know that running this code in IntelliJ is recommended, and I succeeded in running the example there, but I want to use this code in a CLI environment.)

$open-korean-text / examples > mvn exec:java -DexecmainClass="JavaOpenKoreanTextProcessorExample"
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Korean Text Examples 0.0.1
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ example ---
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.335 s
[INFO] Finished at: 2018-07-17T15:42:14+09:00
[INFO] Final Memory: 22M/964M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project example: The parameters 'mainClass' for goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java are missing or invalid -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginParameterException

Here is the environment of my machine.

$ uname -a
Linux red 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ mvn --version
Apache Maven 3.3.9
Maven home: /usr/share/maven
Java version: 1.8.0_171, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.4.0-116-generic", arch: "amd64", family: "unix"

heroku application error

fix it, please.

Tokenizing result mismatch

Java analysis example: [한국어(Noun: 0, 3), 를(Josa: 3, 1), 처리(Noun: 5, 2), 하는(Verb(하다): 7, 2), 예시(Noun: 10, 2),
// 입니다(Adjective(이다): 12, 3), ㅋㅋㅋ(KoreanParticle: 15, 3), #한국어(Hashtag: 19, 4)]

Both in the example and in the actual tokenize output,
"입니다" is analyzed as "입니다(Adjective(이다): 12, 3)".

Judging by the example on the main page,
it seems it should be analyzed as "입Adjective, 니다Eomi".

When using KoreanStemmer,
a NoSuchElementException: key not found error occurs.

It occurs when the tokenized result above (입니다(Adjective(이다): 12, 3)) is passed as the argument.
Seq<KoreanToken> result = stem(tokens)

Normalizing Issue

Hello.
While using the normalization feature, I found an unintended text change.
For example, a word such as "자한당" is changed to "자한다".
The rule that rewrites "한당" => "한다" probably also changes the noun "자한당" unintentionally.

Update examples folder

examples/pom.xml

dependency and plugin version

examples/src/main/java/JavaOpenKoreanTextProcessExample.java

remove 'Stemming'

examples/src/main/scala/ScalaOpenKoreanTextExample.java

remove 'Stemming'

Fix 'Twitter*' -> 'Open*'

#extractPhrases behaves incorrectly

val tokens: Seq[KoreanToken] = OpenKoreanTextProcessor.tokenize(normalized)
println(tokens)
// List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))

// Phrase extraction
val phrases: Seq[KoreanPhrase] = OpenKoreanTextProcessor.extractPhrases(tokens, filterSpam = true, enableHashtags = true)
println(phrases)

When the code above is run:

Expected

// List(한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4))

Actual

/// List(한국어(Noun: 0, 3), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4))

Running #extractPhrases with v1.1 gives a result that differs from the example in the manual. Could you tell me which result is correct?

Error handling '~라면서'

While updating and testing another library to match the OpenKoreanTokenizer update, I found the following error.

The test sentence is:

이들은 "현역 국회의원인 시점에 자서전을 내면서 부끄러운 범죄 사실을 버젓이 써 놓고 사과 한마디 없다는 것은 더 기막히다"라면서 "대선 후보가 아닌 정상적인 사고를 가진 한 인간으로서도 자질 부족인 홍 후보의 사퇴를 촉구한다"고 밝혔다.

The problematic structure here is the "~"라면서 part.
When tested on openkoreantext.org as well, the results are as follows.

Succeeds

그는 "ABCD"라고 말을 외쳤다
그는 "ABCD"라는 말을 외쳤다

Fails

그는 "ABCD"라면서 말을 외쳤다.
그는 "ABCD"라면서도 말을 외쳤다.

The tests had no problems up to version 1.3, but this started happening in 2.0.3.
It is listed in the dictionary as an 'Eomi' (ending) entry, so why does this happen?

Improve code coverall coverage.

Uncommon stem for 주세요

Stems 주세요 as 줄다 instead of 주다.

Memory Leak

OpenKoreanTextProcessorJava.normalize and OpenKoreanTextProcessorJava.tokenize leak around 35 to 40 MB of memory when called from Java. Could you suggest how to release memory after each call?

Stem of "몰라요"

I noticed that the stem we obtained for the word "몰라요" is "몰르다".
Shouldn't it be "모르다" instead?

Improve coverage on OpenKoreanTextProcessor.scala

Improve coverall coverage

https://coveralls.io/github/open-korean-text/open-korean-text?branch=master

We are at 98% in test coverage. Let's make it 100%.

...koreantext/processor/tokenizer/KoreanChunker.scala - @MechanicKim
...reantext/processor/tokenizer/KoreanTokenizer.scala - @hewonjeong
...enkoreantext/processor/stemmer/KoreanStemmer.scala - @ovekyc
.../org/openkoreantext/processor/util/KoreanPos.scala - @KyuaKwon
...text/processor/util/KoreanDictionaryProvider.scala - @hewonjeong
...antext/processor/normalizer/KoreanNormalizer.scala - @ovekyc
...essor/phrase_extractor/KoreanPhraseExtractor.scala - @KyuaKwon

Sample PR: #28

If you have any questions, please leave a comment.

Error in loading class on ElasticSearch Plugin

Could not initialize class org.openkoreantext.processor.tokenizer.KoreanChunker$

ES Version 2.3.4

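Question about how to use the analyzer

@nlpenguin

Hello. We have adopted open-korean-text in ES and are using it.
(ES: 2.4.x, open-korean-text: 1.1)

I have a question that came up during use, so I'm posting it here.
(It seems posting to the Google group requires permission, so I can't post there and am posting here first. I'll move it later when I'm able to.)

The query is "닭도리탕", and
the analysis result is: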

{
   "tokens": [
      {
         "token": "닭도리탕",
         "start_offset": 0,
         "end_offset": 4,
         "type": "Noun",
         "position": 0
      }
   ]
}

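That is how it is analyzed.

닭도리탕 is a compound word that joins "닭" and "도리탕".
I'd like to see "닭/도리/탕/도리탕/닭도리탕" together in the analysis result. (Even if 닭볶음탕 is the proper term...)
Both words are, of course, in the dictionary.

Is there a way to do this?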

tokenizing issue with "여행가고싶어"

While using the tokenizer, I ran into the following problem, so I'm opening this issue.

https://open-korean-text-api.herokuapp.com/tokenize?text=%EC%97%AC%ED%96%89%EA%B0%80%EA%B3%A0%EC%8B%B6%EC%96%B4

I noticed that wikipedia_title_nouns.txt does not contain <여행>, while nouns.txt does.
I'm not sure whether the problem is caused by the priority of the dictionary files.
In a case like this, I'd appreciate advice on how to get a tokenization such as 여행/가고/싶어.

@nlpenguin Thank you in advance. ^^;;

Setup Maven Central publish pipeline

java.lang.VerifyError: Verifier rejected class org.openkoreantext.processor.OpenKoreanTextProcessor

Getting the above error on running the code:

CharSequence inputData = OpenKoreanTextProcessorJava.normalize(text);

Complete Stacktrace:

Caused by: java.lang.VerifyError: Verifier rejected class org.openkoreantext.processor.OpenKoreanTextProcessor$: java.lang.Object org.openkoreantext.processor.OpenKoreanTextProcessor$.$deserializeLambda$(java.lang.invoke.SerializedLambda) failed to verify: java.lang.Object org.openkoreantext.processor.OpenKoreanTextProcessor$.$deserializeLambda$(java.lang.invoke.SerializedLambda): [0x0] Call site #49 bootstrap method argument 2 is not a reference (declaration of 'org.openkoreantext.processor.OpenKoreanTextProcessor$' appears in prtlw==/split_lib_dependencies_apk.apk)
	at org.openkoreantext.processor.OpenKoreanTextProcessor.normalize(Unknown Source:0)
	at org.openkoreantext.processor.OpenKoreanTextProcessorJava.normalize(OpenKoreanTextProcessorJava.java:45)

PS: Running the code on Android, after building a JAR from java library using open-korean-text package.

Error in handling personal names as proper nouns

Issue
When tokenizing a sentence that contains a name registered in given_names.txt that can also be read as noun + josa, such as "종은" or "영은", the result is a three-character noun rather than noun + josa.

반영은 뭐죠 => {"tokens":["반영은(Noun: 0, 3)","뭐(Noun: 4, 1)","죠(Josa: 5, 1)"],"token_strings":["반영은","뭐","죠"]}

Debugging shows that the issue stems from isName and isKoreanNameVariation, which are called from the findTopCandidates function in KoreanTokenizer.scala.

Opinion
Since the accuracy of identifying Korean names is hard to verify, it seems safer either to exclude this handling altogether or to make it optional.

"Failed to fetch" error on website

This pertains to the website you guys host (with the code).
Whenever I input text, or use the default text provided ("한국어를 처리하는 예시입니닼ㅋㅋㅋ"), an error message pops up that says, "Fetch failed! TypeError: Failed to fetch".

240 I want to add user-dictionary. is it possible??

Hello.
I use the Twitter analyzer with Python.

I want to create a user dictionary.
That is,
when I tokenize "섬유탈취제",
it currently produces tokens like this:
"tokens": [
"섬유(Noun: 0, 2)",
"탈취(Noun: 2, 2)",
"제(Noun: 4, 1)"
],
but I want it to produce tokens like this:
"tokens": [
"섬유탈취제(Noun or Compound)"
],

Is it possible?
If so, how can I do this?

Consider a pure Java 8 version

Enable Web service leveraging the existing web service.

https://github.com/openkoreantext/open-korean-text-web

Question related to nouns dictionary size

Hello! First of all big thanks for continuing the twitter-korean-text project. I'm excited to see what's next :)

I am currently comparing two Korean text analyzers, open-korean-text and mecab-ko. One of the biggest advantages of open-korean-text is its offset and length values, especially in combination with stemming, since the offset and length stay identical to the original string. My use case requires reconstructing the text at a later stage, which these values make possible.
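
To make the offset and length idea concrete, here is a minimal sketch in Java. It does not use the library: the (offset, length) pairs are copied from the token dump quoted in the tokenizing mismatch issue on this page, and the input sentence is inferred from those offsets, so both are illustrative assumptions. The class OktOffsets is hypothetical, and the sketch assumes offsets count Java String positions, which holds for the Hangul characters used here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of open-korean-text): recovers the original
// surface form of each token from its offset and length.
final class OktOffsets {
    // Each span is {offset, length}, measured in Java String positions.
    static List<String> surfaces(String text, int[][] spans) {
        List<String> out = new ArrayList<>();
        for (int[] span : spans) {
            int offset = span[0];
            int length = span[1];
            out.add(text.substring(offset, offset + length));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sentence inferred from the offsets in the quoted token dump.
        String text = "한국어를 처리하는 예시입니다ㅋㅋㅋ #한국어";
        int[][] spans = {
            {0, 3}, {3, 1}, {5, 2}, {7, 2}, {10, 2}, {12, 3}, {15, 3}, {19, 4}
        };
        System.out.println(surfaces(text, spans));
    }
}
```

Because a stemmed token keeps the offset and length of the text it came from, slicing the original string this way returns the unstemmed surface form.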

My question, however, is about the size of the noun dictionary. The mecab-ko dictionary currently has over 200,000 nouns, while open-korean-text has around 30,000. Is there anything keeping me from simply importing all the nouns from mecab-ko into open-korean-text, aside from a potential performance hit?

I'd like to contribute to this project in the future, but as a Korean learner I'm afraid I can't really contribute much in the actual analyzing of the language, the dictionary however I can do. I'd like to use the project for a Korean learning hobby project which would require up-to-date celebrity names, game character names etc.

Thanks again and I hope to hear from you!

normalizer CODA_N_EXCPETION issue

private[this] val CODA_N_EXCPETION = "은는운인텐근른픈닌든던".toSet
These exceptions cause some problems.
case ( -으, -느, -우, -이, -테, -그, -르, -프, -니, -드, -더 ) + ㄴ + 인(데,지,가)

example
먹인가
expected 먹이인가, but got 먹인가.
버근가
expected 버그인가, but got 버근가.

About wrappers

Hello, I have been maintaining node-twitter-korean-text, and with the move to open-korean-text I'd like to ask which direction the update should take.

1. Use the existing repository node-twitter-korean-text (renamed to node-open-korean-text)
2. Use the forked repository open-korean-text-wrapper-node-2

For option 2, I haven't been granted permission, so I can't work on it,
but if you decide on whichever direction helps the community, I'll follow it. :)

EDIT: If we go with option 2, how about creating a GitHub Organization rather than using a regular account named openkoreantext?

Update READ.ME

openkoreantext

Enable Travis

unexpected tokenization for "8시즌"

Hello, I just noticed an unexpected tokenization result:

"8시즌" => [('8시', 'Number'), ('즌', 'Foreign')]

Even after I added "시즌" using OpenKoreanTextProcessorJava.addNounsToDictionary(), the result doesn't change.

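Why doesn't OpenKoreanTextProcessor expose KoreanDictionaryProvider.addWordsToDictionary directly?

OpenKoreanTextProcessor exposes addNounsToDictionary, but is there a reason addWordsToDictionary is not exposed directly? For example, could exposing POS tags other than nouns confuse the algorithm?

Currently the dictionaries are bundled in the project's resources, so adding a user-defined dictionary seems to require a rebuild. In open-korean-text-4clj I'm thinking of allowing a user dictionary path to be specified so that other POS tags can be added as well. Is there anything I should take into account?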

Feature Request: Ability to trigger load of resources without calling the tokenize function

Open Korean Text currently loads resources lazily on the first call. This is fine for many applications, but in web applications it can make the first request slow (loading the resources can take ~4 seconds). A workaround is to call the tokenizer during web app startup, but it would be nicer if the OpenKoreanText API had a method to trigger this resource loading.
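
The workaround described above amounts to forcing a lazy value once at startup. The Java sketch below illustrates that pattern in isolation, under stated assumptions: OktWarmUp and its Lazy holder are hypothetical stand-ins, and the loader lambda is a placeholder for the slow dictionary load. With the real library, the warm-up call would instead be a throwaway tokenize call on a short string, as the issue suggests.

```java
import java.util.function.Supplier;

// Hypothetical illustration (not library code): a lazily loaded resource
// that can be forced to load during application startup.
final class OktWarmUp {
    static final class Lazy<T> {
        private final Supplier<T> loader;
        private T value;
        private boolean loaded;

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        // Loads on the first call only; later calls return the cached value.
        synchronized T get() {
            if (!loaded) {
                value = loader.get();
                loaded = true;
            }
            return value;
        }

        synchronized boolean isLoaded() {
            return loaded;
        }
    }

    public static void main(String[] args) {
        // Placeholder loader; stands in for the slow resource load.
        Lazy<String> resources = new Lazy<>(() -> "dictionary");
        resources.get(); // warm-up call made once at startup
        System.out.println(resources.isLoaded());
    }
}
```

Calling get() during startup moves the loading cost out of the first user request, which is the effect the feature request asks the API to offer directly.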

NoSuchElementException thrown in KoreanStemmer.scala

While using the stemmer with real-life data, we get the following exception thrown:

Caused by: java.util.NoSuchElementException: key not found: ì‹¶ì�€ë�°
	at scala.collection.MapLike.default(MapLike.scala:235)
	at scala.collection.MapLike.default$(MapLike.scala:234)
	at scala.collection.AbstractMap.default(Map.scala:63)
	at scala.collection.MapLike.apply(MapLike.scala:144)
	at scala.collection.MapLike.apply$(MapLike.scala:143)
	at scala.collection.AbstractMap.apply(Map.scala:63)
	at org.openkoreantext.processor.stemmer.KoreanStemmer$.$anonfun$stem$2(KoreanStemmer.scala:44)

I am not sure what causes this issue. Can you help investigate further?