
kuromoji's Introduction

Kuromoji

Kuromoji is an easy-to-use, self-contained Japanese morphological analyzer that supports:

  • Word segmentation. Segmenting text into words (or morphemes)
  • Part-of-speech tagging. Assigning word categories (nouns, verbs, particles, adjectives, etc.)
  • Lemmatization. Getting dictionary forms for inflected verbs and adjectives
  • Readings. Extracting readings for kanji

Several other features are supported. Please consult each dictionary's Token class for details.

Using Kuromoji

The example below shows how to use the Kuromoji morphological analyzer in its simplest form: segmenting text into tokens and outputting the features for each token.

package com.atilika.kuromoji.example;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class KuromojiExample {
    public static void main(String[] args) {
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        for (Token token : tokens) {
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());
        }
    }
}

Make sure you add the dependency below to your pom.xml before building your project.

<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-ipadic</artifactId>
  <version>0.9.0</version>
</dependency>

When running the above program, you will get the following output:

お   接頭詞,名詞接続,*,*,*,*,お,オ,オ
寿司  名詞,一般,*,*,*,*,寿司,スシ,スシ
が   助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ  動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい  助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
。   記号,句点,*,*,*,*,。,。,。

See the documentation for the com.atilika.kuromoji.ipadic.Token class for more information on the per-token features available.
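
In addition to getAllFeatures(), individual features can be read through dedicated getters. The sketch below assumes the getter names used by the kuromoji-ipadic Token class (for example getPartOfSpeechLevel1(), getBaseForm() and getReading()); check the Javadoc for the version you use to confirm the exact names.

package com.atilika.kuromoji.example;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class KuromojiFeatureExample {
    public static void main(String[] args) {
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        for (Token token : tokens) {
            // Surface form, coarse part of speech, dictionary (base) form and reading
            System.out.println(token.getSurface() + "\t"
                + token.getPartOfSpeechLevel1() + "\t"
                + token.getBaseForm() + "\t"
                + token.getReading());
        }
    }
}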

Supported dictionaries

Kuromoji currently supports the following dictionaries:

  • kuromoji-ipadic
  • kuromoji-ipadic-neologd
  • kuromoji-jumandic
  • kuromoji-naist-jdic
  • kuromoji-unidic
  • kuromoji-unidic-kanaaccent
  • kuromoji-unidic-neologd

Question: So which of these dictionaries should I use?

Answer: That depends on your application. Yes, we know - it's a boring answer... :)

If you are not sure about which dictionary you should use, kuromoji-ipadic is a good starting point for many applications.

See the getters in the per-dictionary Token classes for some more information on available token features - or consult the technical dictionary documentation elsewhere. (We plan on adding better guidance on choosing a dictionary.)

Maven coordinates and user classes

Each dictionary has its own Maven coordinates, and a Tokenizer and a Token class similar to those in the example above. These classes live in a dedicated package namespace indicated by the dictionary type.

The sections below list fully qualified class names and the Maven coordinates for each dictionary supported.
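
Apart from the package names, usage is the same across dictionaries. Below is a brief sketch using the kuromoji-jumandic classes listed further down; the getSurface() and getAllFeatures() accessors are assumed to be available on every dictionary's Token class, as in the kuromoji-ipadic example above.

import com.atilika.kuromoji.jumandic.Token;
import com.atilika.kuromoji.jumandic.Tokenizer;
import java.util.List;

public class JumandicExample {
    public static void main(String[] args) {
        // Identical to the kuromoji-ipadic example above, except for the imports
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        for (Token token : tokens) {
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());
        }
    }
}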

kuromoji-ipadic

  • com.atilika.kuromoji.ipadic.Tokenizer
  • com.atilika.kuromoji.ipadic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-ipadic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-ipadic-neologd

  • com.atilika.kuromoji.ipadic.neologd.Tokenizer
  • com.atilika.kuromoji.ipadic.neologd.Token

This dictionary will be available from Maven Central in a future version.

kuromoji-jumandic

  • com.atilika.kuromoji.jumandic.Tokenizer
  • com.atilika.kuromoji.jumandic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-jumandic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-naist-jdic

  • com.atilika.kuromoji.naist.jdic.Tokenizer
  • com.atilika.kuromoji.naist.jdic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-naist-jdic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic

  • com.atilika.kuromoji.unidic.Tokenizer
  • com.atilika.kuromoji.unidic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-unidic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic-kanaaccent

  • com.atilika.kuromoji.unidic.kanaaccent.Tokenizer
  • com.atilika.kuromoji.unidic.kanaaccent.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-unidic-kanaaccent</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic-neologd

  • com.atilika.kuromoji.unidic.neologd.Tokenizer
  • com.atilika.kuromoji.unidic.neologd.Token

This dictionary will be available from Maven Central in a future version.

Building Kuromoji from source code

Released versions of Kuromoji are available from Maven Central.

If you want to build Kuromoji from source code, run the following command:

$ mvn clean package

This will download all source dictionary data and build Kuromoji with all dictionaries. The following jars will then be available:

kuromoji-core/target/kuromoji-core-1.0-SNAPSHOT.jar
kuromoji-ipadic/target/kuromoji-ipadic-1.0-SNAPSHOT.jar
kuromoji-ipadic-neologd/target/kuromoji-ipadic-neologd-1.0-SNAPSHOT.jar
kuromoji-jumandic/target/kuromoji-jumandic-1.0-SNAPSHOT.jar
kuromoji-naist-jdic/target/kuromoji-naist-jdic-1.0-SNAPSHOT.jar
kuromoji-unidic/target/kuromoji-unidic-1.0-SNAPSHOT.jar
kuromoji-unidic-kanaaccent/target/kuromoji-unidic-kanaaccent-1.0-SNAPSHOT.jar
kuromoji-unidic-neologd/target/kuromoji-unidic-neologd-1.0-SNAPSHOT.jar

The following additional build options are available (an example invocation follows the list):

  • -DskipCompileDictionary Do not recompile the dictionaries
  • -DskipDownloadDictionary Do not download source dictionaries
  • -DbenchmarkTokenizers Profile each tokenizer during the package phase using content from Japanese Wikipedia
  • -DskipDownloadWikipedia Prevent the compressed version of the Japanese Wikipedia (~765 MB) from being downloaded during profiling, for example when it has already been downloaded
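
For example, to rebuild without fetching the source dictionaries again, or to run the tokenizer benchmarks against an already-downloaded Wikipedia dump, the options above can be combined as follows (illustrative invocations only):

$ mvn clean package -DskipDownloadDictionary
$ mvn clean package -DbenchmarkTokenizers -DskipDownloadWikipedia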

License

Kuromoji is licensed under the Apache License, Version 2.0. See LICENSE.md for details.

This software also includes a binary and/or source version of data from various 3rd party dictionaries. See NOTICE.md for these details.

Contributing

Please open up issues if you have a feature request. We also welcome contributions through pull requests.

You will retain copyright to your own contributions, but you need to license them using the Apache License, Version 2.0. All contributors will be mentioned in the CONTRIBUTORS.md file.

About us

We are a small team of experienced software engineers based in Tokyo who offer technology and advice in the fields of search, natural language processing, and big data analytics.

Please feel free to contact us at [email protected] if you have any questions or need help.

kuromoji's People

Contributors

akkikiki, cmoen, emmanuellegedin, gautela, gerryhocks


kuromoji's Issues

Optimization opportunity in the fst usage.

Please take this report with a big pinch of salt: I am not even a kuromoji user and I did not profile the code thoroughly.

In ViterbiBuilder, kuromoji uses an FST to search for all possible prefixes of a given string that are within a dictionary (encoded as the FST).
The successive calls to lookup, however, restart from the root node of the FST. It would be advisable to get all of the prefixes in a single traversal of the FST.

The headroom is valuable, but not massive. Around 15% of the time is spent in Fst.lookup. One can hope to cut this bit in half.

Android runtime exception when creating new Tokenizer using kuromoji-ipadic

I implemented some test code on Android but I get a runtime exception when trying to create a new Tokenizer:

 java.lang.RuntimeException: Could not load dictionaries.

I'm using the kuromoji-ipadic package (com.atilika.kuromoji:kuromoji-ipadic:0.9.0).
I've traced through the stack a bit and found this:

Caused by: java.lang.IllegalArgumentException: capacity < 0: -4
     at java.nio.ByteBuffer.allocate(ByteBuffer.java:54)
     at com.atilika.kuromoji.io.IntegerArrayIO.readArray(IntegerArrayIO.java:38)
     at com.atilika.kuromoji.buffer.WordIdMap.<init>(WordIdMap.java:35)
     at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:168)
     at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
     at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:219)
     at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77) 
     at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:74) 
     at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:59) 
     at com.atilika.kuromoji.ipadic.Tokenizer$Builder.build(Tokenizer.java:203) 

Which is crashing in IntegerArrayIO at:

public class IntegerArrayIO {

    private static final int INT_BYTES = Integer.SIZE / Byte.SIZE;

    public static int[] readArray(InputStream input) throws IOException {
        DataInputStream dataInput = new DataInputStream(input);
        int length = dataInput.readInt(); // length is returning -1

        ByteBuffer tmpBuffer = ByteBuffer.allocate(length * INT_BYTES); // -1 * 4 = -4, crashes with "capacity < 0:-4"
        ReadableByteChannel channel = Channels.newChannel(dataInput);
        channel.read(tmpBuffer);

I realize the library probably wasn't originally intended for Android, but I'd really like to try it and do some testing to see if it's viable for use on mobile as well.

The code that triggers the original exception is just the construction of the Tokenizer via the builder:

        Tokenizer tokenizer = new Tokenizer.Builder().build();

The IntegerArrayIO class's readArray() method seems to work fine until trying to load the word ID map:

    public WordIdMap(InputStream input) throws IOException {
        indices = IntegerArrayIO.readArray(input);
        wordIds = IntegerArrayIO.readArray(input); // crashes here
    }

The above code is triggered by TokenInfoDictionary#setup():

        wordIdMap = new WordIdMap(resolver.resolve(TARGETMAP_FILENAME));

Here is the full stack trace (minus the parts that reference my code, which I think are irrelevant):

java.lang.RuntimeException: Could not load dictionaries.
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:231)
   at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77)
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:74)
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:59)
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.build(Tokenizer.java:203)

(... my sources omitted for brevity ...)

 Caused by: java.lang.IllegalArgumentException: capacity < 0: -4
   at java.nio.ByteBuffer.allocate(ByteBuffer.java:54)
   at com.atilika.kuromoji.io.IntegerArrayIO.readArray(IntegerArrayIO.java:38)
   at com.atilika.kuromoji.buffer.WordIdMap.<init>(WordIdMap.java:35)
   at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:168)
   at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:219)
   at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77) 
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:74) 
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:59) 
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.build(Tokenizer.java:203)

and the sample code I was trying to test with:

        StringBuilder stringBuilder = new StringBuilder();
        Tokenizer tokenizer = new Tokenizer.Builder().build();
        for (Token token : tokenizer.tokenize(text)) {
            stringBuilder.append(token.getSurface());
            stringBuilder.append(" ");
        }

When I test kuromoji on my desktop it seems to work fine on the local JVM, but on Android it crashes.
I'll continue to dig deeper and see if I can find out what's going on. If there's anything else you need let me know.

Thanks! (あざーす!)

Next release?

Hello.
I would like to know when the next release will be.
The last release is from 2015, and I have seen that development has continued recently.

Thank you

Best regards.
Marin.

How to use Kuromoji in Gradle?

   // https://mvnrepository.com/artifact/org.atilika.kuromoji/kuromoji
    compile group: 'org.atilika.kuromoji', name: 'kuromoji', version: '0.9.0'

and it causes an error:

Error:Failed to resolve: org.atilika.kuromoji:kuromoji:0.9.0
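
A likely fix, judging from the Maven coordinates listed in the README section above (the org.atilika.kuromoji group ID is outdated; current artifacts are published under com.atilika.kuromoji), is to depend on the new coordinates. A hedged Gradle sketch:

// build.gradle -- assumes mavenCentral() is declared in the repositories block
compile group: 'com.atilika.kuromoji', name: 'kuromoji-ipadic', version: '0.9.0'
// or, equivalently:
// compile 'com.atilika.kuromoji:kuromoji-ipadic:0.9.0'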

Integration with solr

Hi,
I am new to Solr. I have downloaded kuromoji and placed it in solr-5.3.0\server\lib, and added the relevant analyzer configuration in solr-5.3.0\server\solr\configsets\basic_configs\conf.
Now if I do a search, will it treat each search term as Japanese, or do I need to specify which text should be treated as Japanese?

GC overhead limit exceeded when compiling IPADIC NEologd

I am getting a GC overhead limit exceeded message when trying to run

mvn clean package

on Mac OS X (2012 MacBook Pro).

The GC overhead limit exceeded message apparently means the system is spending 98% of its time doing GC and only 2% or less doing useful work, which suggests memory may not actually be the problem. However, I am not sure how to test increasing the memory, so I can't say for certain.
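
One way to test this (a hedged suggestion: the dictionary compiler runs inside the Maven JVM via the exec-maven-plugin java goal shown in the log below, so heap settings passed through MAVEN_OPTS should apply) is to raise the heap before building:

$ export MAVEN_OPTS="-Xmx4g -XX:-UseGCOverheadLimit"
$ mvn clean package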


[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ kuromoji-ipadic ---
[INFO] Building jar: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic/target/kuromoji-ipadic-1.0-SNAPSHOT.jar
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Kuromoji IPADIC NEologd 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ kuromoji-ipadic-neologd ---
[INFO] Deleting /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/target
[INFO]
[INFO] --- maven-resources-plugin:2.7:copy-resources (copy-license-resources) @ kuromoji-ipadic-neologd ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 2 resources
[INFO] Copying 1 resource
[INFO]
[INFO] --- maven-antrun-plugin:1.6:run (download-dictionary) @ kuromoji-ipadic-neologd ---
[INFO] Executing tasks

main:
     [echo] Downloading dictionary
   [delete] Deleting directory /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary
    [mkdir] Created dir: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary
      [get] Getting: http://atilika.com/releases/mecab-ipadic-neologd/mecab-ipadic-2.7.0-20070801-neologd-20150925.tar.gz
      [get] To: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary/mecab-ipadic-2.7.0-20070801-neologd-20150925.tar.gz
    [untar] Expanding: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary/mecab-ipadic-2.7.0-20070801-neologd-20150925.tar.gz into /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary
[INFO] Executed tasks
[INFO]
[INFO] --- maven-compiler-plugin:3.3:compile (compile-dictionary-compiler) @ kuromoji-ipadic-neologd ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 6 source files to /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/target/classes
[INFO] /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/src/main/java/com/atilika/kuromoji/ipadic/neologd/Tokenizer.java: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/src/main/java/com/atilika/kuromoji/ipadic/neologd/Tokenizer.java uses unchecked or unsafe operations.
[INFO] /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/src/main/java/com/atilika/kuromoji/ipadic/neologd/Tokenizer.java: Recompile with -Xlint:unchecked for details.
[INFO]
[INFO] >>> exec-maven-plugin:1.2.1:java (run-dictionary-compiler) > validate @ kuromoji-ipadic-neologd >>>
[INFO]
[INFO] <<< exec-maven-plugin:1.2.1:java (run-dictionary-compiler) < validate @ kuromoji-ipadic-neologd <<<
[INFO]
[INFO] --- exec-maven-plugin:1.2.1:java (run-dictionary-compiler) @ kuromoji-ipadic-neologd ---
[KUROMOJI] 22:30:21: dictionary compiler
[KUROMOJI] 22:30:21:
[KUROMOJI] 22:30:21: input directory: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary/mecab-ipadic-2.7.0-20070801-neologd-20150925
[KUROMOJI] 22:30:21: output directory: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/src/main/resources/com/atilika/kuromoji/ipadic/neologd
[KUROMOJI] 22:30:21: input encoding: utf-8
[KUROMOJI] 22:30:21:
[KUROMOJI] 22:30:21: compiling tokeninfo dict...
[KUROMOJI] 22:30:21:     analyzing dictionary features
[KUROMOJI] 22:30:27:     reading tokeninfo
[KUROMOJI] 22:30:52:     compiling fst... [WARNING]
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.ArrayList.iterator(ArrayList.java:814)
    at java.util.AbstractList.hashCode(AbstractList.java:540)
    at com.atilika.kuromoji.fst.State.hashCode(State.java:127)
    at com.atilika.kuromoji.fst.Builder.findEquivalentState(Builder.java:243)
    at com.atilika.kuromoji.fst.Builder.freezeAndPointToNewState(Builder.java:179)
    at com.atilika.kuromoji.fst.Builder.createDictionaryCommon(Builder.java:143)
    at com.atilika.kuromoji.fst.Builder.build(Builder.java:119)
    at com.atilika.kuromoji.compile.FSTCompiler.compile(FSTCompiler.java:44)
    at com.atilika.kuromoji.compile.DictionaryCompilerBase.buildTokenInfoDictionary(DictionaryCompilerBase.java:70)
    at com.atilika.kuromoji.compile.DictionaryCompilerBase.build(DictionaryCompilerBase.java:37)
    at com.atilika.kuromoji.compile.DictionaryCompilerBase.build(DictionaryCompilerBase.java:172)
    at com.atilika.kuromoji.ipadic.neologd.compile.DictionaryCompiler.main(DictionaryCompiler.java:33)
    ... 6 more
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Kuromoji ........................................... SUCCESS [  0.186 s]
[INFO] Kuromoji Core ...................................... SUCCESS [  7.223 s]
[INFO] Kuromoji IPADIC .................................... SUCCESS [ 44.321 s]
[INFO] Kuromoji IPADIC NEologd ............................ FAILURE [03:53 min]
[INFO] Kuromoji JUMAN DIC ................................. SKIPPED
[INFO] Kuromoji NAIST-jdic ................................ SKIPPED
[INFO] Kuromoji UniDic .................................... SKIPPED
[INFO] Kuromoji UniDic Kana Accent ........................ SKIPPED
[INFO] Kuromoji UniDic NEologd ............................ SKIPPED
[INFO] Kuromoji Benchmark ................................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:45 min
[INFO] Finished at: 2015-10-07T22:33:55+09:00
[INFO] Final Memory: 15M/1834M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:java (run-dictionary-compiler) on project kuromoji-ipadic-neologd: An exception occured while executing the Java class. null: InvocationTargetException: GC overhead limit exceeded -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :kuromoji-ipadic-neologd

java.lang.RuntimeException: Could not load dictionaries. Caused by: java.io.IOException: Classpath resource not found: fst.bin

kuromoji-ipadic-1.0-SNAPSHOT.jar

09-27 11:22:12.361 E/AndroidRuntime( 1075): java.lang.RuntimeException: Could not load dictionaries.

09-27 11:22:12.361 E/AndroidRuntime( 1075): at ant.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at com.atilika.kuromoji.TokenizerBase.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at ans.(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at ans.(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at aoy.(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at aoy.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at aoz.(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at amw.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at amh.run(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at java.lang.Thread.run(Thread.java:818)

09-27 11:22:12.361 E/AndroidRuntime( 1075): Caused by: java.io.IOException: Classpath resource not found: fst.bin

09-27 11:22:12.361 E/AndroidRuntime( 1075): at anz.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at ann.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): ... 10 more

The exception is thrown in Tokenizer loadDictionaries(), at this.fst = FST.newInstance(this.resolver);

fst.bin is present in wear-app.apk under /com/atilika/kuromoji/ipadic

apk structure
-resources.arsc
-classes.dex
-AndroidManifest.xml
----res/
----META-INF/
----assets/
----com/atilika/kuromoji/ipadic/*.bin

Question: how to obtain multiple parsings?

MeCab has a -N flag with which a user can specify the top-N results to get back. On http://www.atilika.org/ the Viterbi algorithm's output graph shows all possible morphemes, along with the cost of each path, so I'm sure it's possible to get the top, say, five results, but is there a simpler way to get this, the equivalent of mecab -N 5? I'm using UniDic. Thank you 🙇!

Tokenizing text in Hiragana character set

Tokenizing a sentence "寿司が美味しい。" produces the following tokens:
<寿司>,<が>,<美味しい>,<。>

Tokenizing the same sentence written only in hiragana exhibits identical behavior, which is great.
<すし>,<が>,<おいしい>,<。>

However, for some other words, tokenization behavior depends on the input character set.

For example, for "大学生":

The word is correctly tokenized into <大学生> when the input is written in kanji.

When the input is written in hiragana, "だいがくせい", the same word produces the following tokens:
<だい>,<が>,<くせ>,<い>

Is this a known issue? Is there any configuration I could tweak so that the two cases behave the same way regardless of the input character set?

Thanks in advance for your help!
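
There may be no tokenizer option that changes this directly, since it depends on the dictionary entries and their costs. One possible workaround, sketched below using the user-dictionary CSV format and the Builder#userDictionary call that appear elsewhere on this page (the entry itself is hypothetical), is to add the hiragana spelling to a user dictionary:

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class HiraganaUserDictionaryExample {
    public static void main(String[] args) throws Exception {
        // surface,segmentation,reading,part-of-speech (same CSV format as the other user-dictionary examples on this page)
        String userDictionary = "だいがくせい,だいがくせい,ダイガクセイ,カスタム名詞";
        InputStream input = new ByteArrayInputStream(userDictionary.getBytes(StandardCharsets.UTF_8));

        Tokenizer.Builder builder = new Tokenizer.Builder();
        builder.userDictionary(input);
        Tokenizer tokenizer = builder.build();

        List<Token> tokens = tokenizer.tokenize("だいがくせい");
        for (Token token : tokens) {
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());
        }
    }
}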

Tokenizer is not serializable for Apache Spark

On Apache Spark, instances must be serializable for parallel processing, but kuromoji tokenizers are not, so they have to be initialized each time.
If tokenizers were serializable, we could decrease processing time.
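
Until the tokenizer itself is made serializable, a common workaround (a sketch only, not part of the kuromoji API) is to create one tokenizer lazily per JVM through a holder class, so that Spark only serializes the closure and each executor builds its own tokenizer on first use:

import com.atilika.kuromoji.ipadic.Tokenizer;

// Lazily creates a single Tokenizer per JVM (e.g. per Spark executor).
// The closure only references this class, so no non-serializable state
// has to be shipped to the workers.
public final class TokenizerHolder {

    private static Tokenizer tokenizer;

    private TokenizerHolder() {
    }

    public static synchronized Tokenizer get() {
        if (tokenizer == null) {
            tokenizer = new Tokenizer();
        }
        return tokenizer;
    }
}

Inside a map function you would then call TokenizerHolder.get().tokenize(line) instead of capturing a Tokenizer instance in the closure.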

Normalized surface in user dictionary.

In the current implementation of the ipadic user dictionary, there seems to be no functionality to normalize the surface form.
Is this right?

I think this functionality is very useful and needed in common situations.

So, I have a plan to extend the user dictionary to handle normalizing a word's surface form while keeping the current specification of the user dictionary resource format.

What do you think about this?

Compound word with nakaguro in it

Thanks for the library.

I was testing compound words with the nakaguro character in them and noticed that the compound word 'コカ・コーラ' is tokenized to a single term <コカ・コーラ> in Search mode, whereas another such word, 'アイス・キューブ', tokenizes to its components <アイス>, <キューブ>. Does the former produce a single token because it's a trademark, or could this be a bug? Ultimately, I'd like to find documents that contain <コカ・コーラ> using the search term <コーラ>.

Thanks in advance for your help!

How to enable discardPunctuation in Kuromoji Java

Hi,

I can remove punctuation with the Kuromoji ES plugin by setting "discard_punctuation": "true"

I'm wondering how I can get the same result with Kuromoji-Java.

For example, in Kuromoji-ES, 「浅草」駅 will be tokenized as

{
  "tokens" : [
    {
      "token" : "浅草",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "駅",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    }
  ]
}

Is there an equivalent function in Kuromoji-Java to do this?
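
There may not be a dedicated option for this in the builder; a simple workaround (a sketch only, relying on the IPADIC part-of-speech tags shown elsewhere on this page, where punctuation and brackets are tagged 記号) is to drop those tokens after tokenization:

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class DiscardPunctuationExample {
    public static void main(String[] args) {
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("「浅草」駅");
        for (Token token : tokens) {
            // Skip tokens whose top-level part of speech is 記号 (symbols/punctuation)
            if ("記号".equals(token.getPartOfSpeechLevel1())) {
                continue;
            }
            System.out.println(token.getSurface());
        }
    }
}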

The tokenizing performance of mixed language

The performance of the library on pure Japanese text is fine. But when I try to tokenize text with mixed languages (English and Japanese), when the language recognizer detects it as Japanese, the library filters out some English words which could be key information in the text. How can I deal with these cases by tuning or changing some parameters? You can try the following plain text first. The English phrase "[ 1 ] SINGAPORE 2" will be filtered, for example.

"ELECTRONIC e チケットお客様控 TICKET ITINERARY/RECEIPT 国際線自動 チェックイン 機用2次元 バーコード For International Self Service Unit ・搭乗手続き時、又は、出入国審査時に提示を求められた場合には、出入国に必要な全ての書類又は滞在先住所等の情報、 e チケットお客様控 ( Itine- rary/Receipt ) 、及びパスポート等の公的書類をご提示ください。コードシェア便の搭乗手続きは、運航会社で承ります。 ・ e チケットお客様控えは、旅程変更又は払い戻しの際に必要となる場合がありますので、ご旅行終了までお待ちください。 ・ Please present all necessary country specific travel documentation or data such as staying address , Itinerary/Receipt , and positive identification such as passport , when you are requested to do so at check-in , or at Immigration/Customs . ・ Please retain Itinerary/Receipt throughout your journey . Itinerary/Receipt may be required in case of itinerary change or refund . 搭乗者名: CHINOMI/KENJI MR PASSENGER NAME 航空券番号: 2052402171368 予約番号: YIFRBL 発行日: 24JAN16 TICKET NUMBER RESERVATION CODE DATE OF ISSUE OF ISSUE ISS . OFFICE CODE PLACE 発行所: SINGAPORE - NH SKY WEB SG R 発行店舗コード: 32393852 旅程表 ITINERARY 都市 /空港 ターミナル 便名 日付 曜日 時間 クラス 運賃種別 予約状況 手荷物 有効期限 CITY/AIRPORT TERMINAL FLIGHT NO . DATE DAY TIME CLASS FARE BASIS STATUS BAGGAGE INVALID BEFORE/AFTER 出発 DEPARTURE 出発 DEPARTURE [ 1 ] SINGAPORE 2 NH844 27JAN16 WED 2220 W ( Y ) WRCS0 OK 2PC 27JAN/27JAN 到着 ARRIVAL 座席 SEAT 到着 ARRIVAL 運航航空会社 OPERATING CARRIER 備考 REMARKS TOKYO ( HANEDA ) INT 28JAN16 THU 0600 ALL NIPPON AIRWAYS 出発 DEPARTURE 出発 DEPARTURE [ 2 ] TOKYO ( HANEDA ) SURFACE 到着 ARRIVAL 座席 SEAT 到着 ARRIVAL 運航航空会社 OPERATING CARRIER 備考 REMARKS TOKYO ( NARITA ) 出発 DEPARTURE 出発 DEPARTURE [ 3 ] TOKYO ( NARITA ) 1 NH801 30JAN16 SAT 1805 V ( Y ) VRCS0 OK 2PC 30JAN/30JAN 到着 ARRIVAL 座席 SEAT 到着 ARRIVAL 運航航空会社 OPERATING CARRIER 備考 REMARKS SINGAPORE 2 31JAN16 SUN 0040 ALL NIPPON AIRWAYS 全日本空輸株式会社 ALL NIPPON AIRWAYS CO . , LTD . PAGE 1 / 2 PRINTED 24JAN16 ELECTRONIC e チケットお客様控 TICKET ITINERARY/RECEIPT 国際線自動 チェックイン 機用2次元 バーコード For International Self Service Unit 搭乗者名: CHINOMI/KENJI MR PASSENGER NAME 航空券番号: 2052402171368 予約番号: YIFRBL 発行日: 24JAN16 TICKET NUMBER RESERVATION CODE DATE OF ISSUE OF ISSUE ISS . OFFICE CODE PLACE 発行所: SINGAPORE - NH SKY WEB SG R 発行店舗コード: 32393852 運賃/航空券情報 FARE/TICKET INFORMATION 運賃額: 支払運賃額: FARE SGD830.00 EQUIV . FARE PAID 税金・料金等合計: 航空会社手数料: TAXES/FEES/CHARGES/AIRLINE CHARGES TOTAL SGD74.90 AIRLINE SERVICE CHARGE SGD0.00 ツアーコード: TOTAL ( AIRLINE SERVICE CHARGE is not included . ) SGD904.90 TOUR CODE 支払手段: FORM OF PAYMENT CCCAXXXXXXXXXXXX3426**/XX-XX S 811289 制限事項: FLT/CNX/CHG RESTRICTED CHECK FARE RULE ENDORSEMENTS/RESTRICTIONS 運賃詳細: SIN NH TYO Q14.23 259.84NH SIN316.79NUC590.86END ROE1.404690 FARE CALCULATION 税金・料金等 詳細: SGD8.80YQ/ SGD19.90SG/ SGD6.10OP/ SGD8.00OO/ SGD25.70SW/ SGD6.40OI/ TAXES/FEES/CHARGES/ AIRLINE CHARGES DETAILS シンガポールの空港から出発する場合、上記金額には OP TAX ( Aviation Levy ) が含まれています。 OP tax ( Aviation Levy ) is included on a ticket when departing from the airport in Singapore . 原券: 交換券: ORIGINAL ISSUE ISSUED IN EXCHANGE FOR ご注意及び契約条件 /TICKET NOTICE ・運送やその他のサービスは、各運送人の運送約款に従います。運送約款については発行運送人にご確認ください。なお、 ANA の運送による日本国内区間のみの旅行であって国際運送の一環ではない場合、 ANAの 国内旅客運送約款が適用となります。 ・旅客が出発国以外の国に最終到達地又は寄港地を有する旅行を行なう場合は、その旅客の旅程全体 ( 同一国内の区間を含む ) についてモントリオール条約又はその前身のワルソー条約 ( その改正を含む ) の適用を 受けることがあります。その旅客に対し適用となる条約 ( 適用タリフに含まれる特別運送契約を含む ) が、運送人の責任を制限することがあります。詳細は、各運送人へお問い合わせください。 ・エアゾール、花火、可燃性液体などの危険物は航空機へ持込はできません。これら制限の詳細は航空会社へお問い合わせください。 ・このお客様控とともに、航空券の一部を成し、かつ「契約条件及びその他重要事項」を含む、一連のご案内書をお受け取りになります。これらのご案内書をお受け取りになられたことを必ずご確認いただき、 もしお受け取りになられていない場合には、旅行開始前に次の URL : https : //www . ana . co . jp/other/int/meta/0192.html ? 
CONNECTION_KIND\u003djp\u0026LANG\u003dj で入手いただくか、又は、発行運送人若しくは旅行会社へ 連絡ください。 ・このお客様控は、モントリオール条約及びワルソー条約第3条でいう「航空券」の一部をなします。ただし、航空会社が第3条の要件を満たす別の書類を旅客へ渡す場合を除きます。 ・ ANA のコンピュータシステムに保管されている eチケット情報と e チケットお客様控の情報に相違がある場合、コンピュータシステム上の e チケット情報を有効と致します。 ・ Carriage and other services provided by the carrier are subject to conditions of carriage , which are hereby incorporated by reference . These conditions may be obtained from the issuing carrier . Please note that if you travel on ANA \u0027s domestic sector flights within Japan only , without any international connecting flights , ANA \u0027s Conditions of Carriage for Passengers and Baggage for domestic flights will apply . ・ Passengers on a journey involving an ultimate destination or a stop in a country other than the country of departure are advised that international treaties known as the Montreal Convention , or its predecessor , the Warsaw Convention , including its amendments ( the Warsaw Convention System ) , may apply to the entire journey , including any portion thereof within a country . For such passengers , the applicable treaty , including special contracts of carriage embodied in any applicable tariffs , governs and may limit the liability of the carrier . Check with your carrier for more information . ・ The carriage of certain hazardous materials , like aerosols , fireworks , and flammable liquids , aboard the aircraft is forbidden . If you do not understand these restrictions , further information may be obtained from your airline . ・ Further information may be obtained from the carrier . With this ticket you will receive a set of notices which forms part of the ticket and contains the " Conditions of Contract and Other Important Notices " . Please make sure that you have received these notices , and if not , obtain copies prior to the commencement of your journey at the following URL : https : //www . ana . co . jp/other/int/meta/0192.html ? CONNECTION_KIND\u003djp\u0026LANG\u003de , or contact the issuing airline or travel agent . ・ This Itinerary/Receipt constitutes the " passenger ticket " for the purposes of Article 3 of the Montreal Convention and the Warsaw Convention , except where the carrier delivers to the passenger another document complying with the requirements of Article 3. ・ Ticketing information contained in ANA \u0027s computer system shall prevail should any discrepancy occur between the Itinerary/Receipt held by the customer and the ticketing information in our computer system . 全日本空輸株式会社 ALL NIPPON AIRWAYS CO . , LTD . PAGE 2 / 2 PRINTED 24JAN16"

Unidic design flaw

Unidic's lex data doesn't have enough information for the Viterbi algorithm to distinguish words with the same readings and same word types in context. So お父さん is always interpreted as お・ちち・さん, instead of お・とう・さん as it should be.

父,5142,5142,3860,名詞,普通名詞,一般,*,*,*,チチ,父,父,チチ,父,チチ,和,*,*,*,*

父,5142,5142,4656,名詞,普通名詞,一般,*,*,*,トウ,父,父,トー,父,トー,和,*,*,*,*

They're otherwise identical, but the ちち reading has a lower cost, so it always wins when the word is in the kanji form. Basically, unidic's segment features don't have a way to distinguish these. It's easy to write a script that looks for segments that are identical in surface form and feature list and see what problematic matches there are.

This is basically impossible to fix on kuromoji's side without adding a list of segments that act differently than their features indicate, which would be ridiculous. On the other hand, one of kuromoji's implicit goals is to not be worse than other morphological analyzers, so this is a problem worth posting about.

I added a bunch of お父 etc. entries to my user dictionary to gloss over this problem by prepending the お・御. (for unidic-kanaaccent STAGING)

おとう,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,おとう,オトー,おとう,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*
お父,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,お父,オトー,お父,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*
御父,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,御父,オトー,御父,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*

おかあ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,おかあ,オカー,おかあ,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*
お母,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,お母,オカー,お母,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*
御母,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,御母,オカー,御母,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*

おにい,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,おにい,オニー,おにい,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*
お兄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,お兄,オニー,お兄,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*
御兄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,御兄,オニー,御兄,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*

おねえ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,おねえ,オネー,おねえ,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
お姉,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,お姉,オネー,お姉,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
御姉,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,御姉,オネー,御姉,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*

お姐,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姐,お姐,オネー,お姐,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
御姐,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姐,御姐,オネー,御姐,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*

おばあ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,おばあ,オバー,おばあ,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*
お婆,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,お婆,オバー,お婆,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*
御婆,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,御婆,オバー,御婆,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*


おじい,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,おじい,オジー,おじい,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*
お爺,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,お爺,オジー,お爺,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*
御爺,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,御爺,オジー,御爺,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*

(weights are for illustration, I think they're too high to catch in all intended cases)

Kuromoji_tokenizer: sort clause does not seem to work for some specific character combinations

Query:

{
  "query": {
    "bool": {
    }
  },
  "sort": [
    {
      "attribute.sortable": {
        "order": "asc"
      }
    }
  ]
}

Results:

"hits": [
  {
    "_index": "example_1",
    "_type": "example_1",
    "_id": "A2Ff26qFaV",
    "_score": null,
    "_source": {
      "attributes": {
        "attribute": "サヨ",
      }
    },
    "sort": [
      "サヨ"
    ]
  },
  {
    "_index": "example_2",
    "_type": "example_2",
    "_id": "A2Ff26qFaV",
    "_score": null,
    "_source": {
      "attributes": {
        "attribute": "シヨ",
      }
    },
    "sort": [
      "シ"
    ]
  }
]

The sort works on the characters in the attribute field for the example_1 doc but not for the example_2 doc.

Observed this in 3 instances in total for these strings:

  • シヨ
  • ヲシ
  • シヲ

Longer string in Katakana has low priority

Tested with kuromoji-core-1.0-SNAPSHOT and kuromoji-ipadic-1.0-SNAPSHOT.
(build from master at 2017/3/8)

When the user dictionary is

くろも,くろも,くろも,カスタム名詞
ろ,ろ,ろ,カスタム名詞

, the string "くろもじ" is tokenized into

くろも カスタム名詞,*,*,*,*,*,*,くろも,*
じ 助動詞,*,*,*,不変化型,基本形,じ,ジ,ジ

which is fine.

When the user dictionary is

クロモ,クロモ,クロモ,カスタム名詞
ロ,ロ,ロ,カスタム名詞

, the string "クロモジ" is tokenized into

ク 名詞,一般,*,*,*,*,ク,ク,ク
ロ カスタム名詞,*,*,*,*,*,*,ロ,*
モ *,*,*,*,*,*,*,*,*
ジ *,*,*,*,*,*,*,*,*

which is not fine.

I expected the following:

クロモ カスタム名詞,*,*,*,*,*,*,クロモ,*
ジ *,*,*,*,*,*,*,*,*

What should I do to get the expected result?

sample code I used:

public static void main(String[] args) {
  // Hiragana case (works as expected):
  // String target = "くろもじ";
  // List<String> dictionaryList = Arrays.asList("くろも,くろも,くろも,カスタム名詞", "ろ,ろ,ろ,カスタム名詞");
  // Katakana case (unexpected result):
  String target = "クロモジ";
  List<String> dictionaryList = Arrays.asList("クロモ,クロモ,クロモ,カスタム名詞", "ロ,ロ,ロ,カスタム名詞");
  String dictionary = String.join(System.lineSeparator(), dictionaryList);
  Builder builder = new Tokenizer.Builder();
  try {
    InputStream inputStream = new ByteArrayInputStream(dictionary.getBytes("utf-8"));
    builder.userDictionary(inputStream);
  } catch (Exception e) {
    e.printStackTrace();
  }
  Tokenizer tokenizer = builder.build();
  List<Token> tokens = tokenizer.tokenize(target);
  tokens.stream().forEach(token -> System.out.println(token.getSurface() + "\t" + token.getAllFeatures()));
}

Possible Issue with tokenization when English+Japanese are adjacent in text

Text => Dior化粧品等の輸入総代理店で, which, indexed with the default Kuromoji analyzer, produces the following tokens:

dior
start: 0 end: 4 pos: 0
化粧
start: 4 end: 6 pos: 1
品等
start: 6 end: 8 pos: 2
輸入
start: 9 end: 11 pos: 4
総
start: 11 end: 12 pos: 5
代理
start: 12 end: 14 pos: 6
店
start: 14 end: 15 pos: 7

However, we noticed that when a user searched for the term Dior化粧品, it did not produce a match (using the same analyzer settings). The reason is that the search term is tokenized as follows:

dior
start: 0 end: 4 pos: 0
化粧
start: 4 end: 6 pos: 1
品  
start: 6 end: 7 pos: 2

Since the word for cosmetics is the Japanese term 化粧品, it seems like the search term got analyzed correctly, but the piece of text produced an unexpected bigram sequence of 化粧 and 品等.

I'm not sure whether this is a valid issue due to the mix of English and Japanese in the text, or whether my Japanese fundamentals are off here.

Kuromoji POS Train

Hi,

Can we train Kuromoji POS? If yes, please tell me the format of the data needed as input.

Thanks

Very odd tokenization of a sentence

Tokenizing "色々やらなきゃならんことがたくさんあるんだ" in the command line version of Kuromoji (using ipadic) results in

色々
やら
なき
ゃならんことがたくさんあるんだ

The same output is observed in the online demo available at http://www.atilika.org/

Unidic Tokenization on Romaji Words

Tested with version 0.9.0.

I know this is for Japanese, but it would be nice if some romaji words were tokenized consistently.

The string "hello golf2" is tokenized into:

  • hello
  • golf
  • 2

which is fine. But when I tokenize "golf2 hello" using com.atilika.kuromoji.unidic.Tokenizer (also unidic.kanaaccent, but not the other tokenizers), I get:

  • g
  • o
  • l
  • f
  • 2
  • hello

It would be nice if the second case behaved like the first. In the meantime, I might handle this with a user dictionary.

Segmentation wrong when a token contains square brackets?

Looks like the segmenter does not work properly if there are square brackets, e.g.:

[   名詞,サ変接続,*,*,*,*,*,*,*
滧 名詞,一般,*,*,*,*,*,*,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
]。    名詞,サ変接続,*,*,*,*,*,*,*

or

「 記号,括弧開,*,*,*,*,「,「,「
国宝  名詞,一般,*,*,*,*,国宝,コクホウ,コクホー
五 名詞,数,*,*,*,*,五,ゴ,ゴ
城 名詞,一般,*,*,*,*,城,シロ,シロ
」[    名詞,サ変接続,*,*,*,*,*,*,*

Internals documentation and academic papers?

Is there any description of how kuromoji works? E.g. an overview of what each class does, how they work together. And/or academic papers on what it is doing? (E.g. Is it behaving identically to MeCab, ChaSen or Juman, and if not, what innovations is it using and why? What design trade-offs are there?)

(If neither is available, this issue is a request for that kind of documentation; if they are then it is a request for them to be linked to from the README.md file. Thanks!)

http://www.atilika.org showcases the outdated maven artifact repository information

I'm not sure where to report this issue, or whether this is the proper place to report it...

See the Maven artifact repository information section:
http://www.atilika.org

It seems that:

  1. The groupid of org.atilika.kuromoji is outdated and the latest one is com.atilika.kuromoji.
  2. kuromoji currently has 7 dictionaries, but the above website showcases a version which has only one dictionary.
  3. The link for this package is broken :)

===== (Since this seems to be a Japanese company, the report was also written in Japanese; English translation below.)
I'm not quite sure where this issue should be reported, or whether this is the appropriate place...

Please see the Maven artifact repository information section:
http://www.atilika.org

It appears to be in the following state:

  1. The group ID org.atilika.kuromoji is outdated; the current ID is com.atilika.kuromoji.
  2. The current kuromoji has seven dictionaries, but the website shows a version that has only one dictionary.
  3. The link for this package no longer works. :)

How to increase heap size other than MAVEN_OPTS

I am trying to build kuromoji with NEologd on CircleCI.
I tried to increase the heap size by setting the environment variable MAVEN_OPTS="-Xmx4096m -XX:-UseGCOverheadLimit",
but an OutOfMemoryError occurs:

...
[KUROMOJI] 02:40:21:     reading tokeninfo
[WARNING] 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:297)
    at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
...

I debugged using the -X option and then found:

env.MAVEN_OPTS= -Xmx4096m -XX:-UseGCOverheadLimit -Xmx512m

-Xmx512m appears at the end, so I think my -Xmx4096m was overridden by it.
I don't understand the underlying issue of why -Xmx512m is already set, but I would like to know whether there is another way to increase the heap size.

Can you give me any ideas on how to avoid this issue? Thanks.

Configuring with Maven

Hi folks,

I'm aware this isn't quite specific to Kuromoji exactly, however, it is related to getting Kuromoji functioning so I hope this question is ok here.

The Setup

I'm not 100% familiar with Java and have followed some guides to spin up a quick Maven project.

I'm using simple CMD commands to get started as described on maven getting started.

I've included the dependencies in the pom.xml but noticed that no Kuromoji-related files were inside the packaged .jar. After some digging, I learned about maven-assembly-plugin and now seem to be collecting all the needed bits and pieces.

Issue

This is my output to CMD from the example code posted in the README.md

?       ???,????,*,*,*,*,?,?,?
??      ??,??,*,*,*,*,??,??,??
?       ??,???,??,*,*,*,?,?,?
??      ??,??,*,*,??,???,???,??,??
??      ???,*,*,*,?????,???,??,??,??
?       ??,??,*,*,*,*,?,?,?

Has anyone come across this kind of thing before?

I think this issue is down to me not being very familiar with Java.
My next step is to move away from using CMD and set up the project in Eclipse (although their website is currently down 😦 )

Obtain furigana?

Hi, the documentation says kuromoji can extract the readings for kanji and shows an example in which the reading for each token is extracted. However, is it possible to extract what part of the reading corresponds to each kanji?

For example, given a token with the contents:

"寿司" -> 寿 = ス, 司 = シ

Kanji penalty and other penalty

Hi, what do the kanji penalty and other penalty do, and when would I need to use this feature? Thanks in advance. If there is any explanation or documentation about this that I haven't found, can someone link me to it?

Kuromoji on Android

Hello and thank you for your library.

I tried to use Kuromoji on Android (actually it's a bit overkill for me; I only want to convert Japanese text to romaji for pronunciation).
I encountered this error:

java.lang.IllegalArgumentException: capacity < 0: -4
  at java.nio.ByteBuffer.allocate(ByteBuffer.java:54)
  at com.atilika.kuromoji.io.IntegerArrayIO.readArray(IntegerArrayIO.java:38)
  at com.atilika.kuromoji.buffer.WordIdMap.<init>(WordIdMap.java:35)
  at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:168)
  at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
  at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:219)

I guessed that I hit a memory limit and that this is not the library I'm looking for.

Can you confirm? And do you have a better idea for extracting romaji from Japanese text?

Thanks a lot.

Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project kuromoji-benchmark: There are test failures.

Another out-of-memory issue, despite the fact that I ran

export MAVEN_OPTS=-Xmx3g
-------------------------------------------------------------------------------
Test set: com.atilika.kuromoji.benchmark.SimpleBenchmarkTest
-------------------------------------------------------------------------------
Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 67.952 sec <<< FAILURE!
testSimpleBenchmark(com.atilika.kuromoji.benchmark.SimpleBenchmarkTest)  Time elapsed: 5.8 sec  <<< ERROR!
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
    at com.atilika.kuromoji.io.ByteBufferIO.read(ByteBufferIO.java:39)
    at com.atilika.kuromoji.buffer.TokenInfoBuffer.<init>(TokenInfoBuffer.java:39)
    at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:165)
    at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
    at com.atilika.kuromoji.TokenizerBase$Builder.loadDictionaries(TokenizerBase.java:289)
    at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77)
    at com.atilika.kuromoji.unidic.neologd.Tokenizer.<init>(Tokenizer.java:67)
    at com.atilika.kuromoji.unidic.neologd.Tokenizer.<init>(Tokenizer.java:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at java.lang.Class.newInstance(Class.java:374)
    at com.atilika.kuromoji.benchmark.SimpleBenchmarkTest.tokenizeForName(SimpleBenchmarkTest.java:92)
    at com.atilika.kuromoji.benchmark.SimpleBenchmarkTest.testSimpleBenchmark(SimpleBenchmarkTest.java:48)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
