
kuromoji's Introduction

Kuromoji

Kuromoji is an easy-to-use, self-contained Japanese morphological analyzer that supports:

  • Word segmentation. Segmenting text into words (or morphemes)
  • Part-of-speech tagging. Assigning word categories (nouns, verbs, particles, adjectives, etc.)
  • Lemmatization. Getting dictionary forms for inflected verbs and adjectives
  • Readings. Extracting readings for kanji

Several other features are supported. Please consult each dictionary's Token class for details.

Using Kuromoji

The example below shows how to use the Kuromoji morphological analyzer in its simplest form: segmenting text into tokens and outputting the features for each token.

package com.atilika.kuromoji.example;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class KuromojiExample {
    public static void main(String[] args) {
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        for (Token token : tokens) {
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());
        }
    }
}

Make sure you add the dependency below to your pom.xml before building your project.

<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-ipadic</artifactId>
  <version>0.9.0</version>
</dependency>

When running the above program, you will get the following output:

お   接頭詞,名詞接続,*,*,*,*,お,オ,オ
寿司  名詞,一般,*,*,*,*,寿司,スシ,スシ
が   助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ  動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい  助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
。   記号,句点,*,*,*,*,。,。,。

See the documentation for the com.atilika.kuromoji.ipadic.Token class for more information on the per-token features available.
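
In addition to getAllFeatures(), individual features can be read through dedicated getters. The sketch below assumes the getter names used by the kuromoji-ipadic Token class (for example getPartOfSpeechLevel1(), getBaseForm() and getReading()); check the Javadoc for the version you use to confirm the exact names.

package com.atilika.kuromoji.example;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class KuromojiFeatureExample {
    public static void main(String[] args) {
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        for (Token token : tokens) {
            // Surface form, coarse part of speech, dictionary (base) form and reading
            System.out.println(token.getSurface() + "\t"
                + token.getPartOfSpeechLevel1() + "\t"
                + token.getBaseForm() + "\t"
                + token.getReading());
        }
    }
}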

Supported dictionaries

Kuromoji currently supports the following dictionaries:

  • kuromoji-ipadic
  • kuromoji-ipadic-neologd
  • kuromoji-jumandic
  • kuromoji-naist-jdic
  • kuromoji-unidic
  • kuromoji-unidic-kanaaccent
  • kuromoji-unidic-neologd

Question: So which of these dictionaries should I use?

Answer: That depends on your application. Yes, we know - it's a boring answer... :)

If you are not sure about which dictionary you should use, kuromoji-ipadic is a good starting point for many applications.

See the getters in the per-dictionary Token classes for some more information on available token features - or consult the technical dictionary documentation elsewhere. (We plan on adding better guidance on choosing a dictionary.)

Maven coordinates and user classes

Each dictionary has its own Maven coordinates, and a Tokenizer and a Token class similar to those in the example above. These classes live in a dedicated package namespace indicated by the dictionary type.

The sections below list fully qualified class names and the Maven coordinates for each dictionary supported.
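
Apart from the package names, usage is the same across dictionaries. Below is a brief sketch using the kuromoji-jumandic classes listed further down; the getSurface() and getAllFeatures() accessors are assumed to be available on every dictionary's Token class, as in the kuromoji-ipadic example above.

import com.atilika.kuromoji.jumandic.Token;
import com.atilika.kuromoji.jumandic.Tokenizer;
import java.util.List;

public class JumandicExample {
    public static void main(String[] args) {
        // Identical to the kuromoji-ipadic example above, except for the imports
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        for (Token token : tokens) {
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());
        }
    }
}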

kuromoji-ipadic

  • com.atilika.kuromoji.ipadic.Tokenizer
  • com.atilika.kuromoji.ipadic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-ipadic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-ipadic-neologd

  • com.atilika.kuromoji.ipadic.neologd.Tokenizer
  • com.atilika.kuromoji.ipadic.neologd.Token

This dictionary will be available from Maven Central in a future version.

kuromoji-jumandic

  • com.atilika.kuromoji.jumandic.Tokenizer
  • com.atilika.kuromoji.jumandic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-jumandic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-naist-jdic

  • com.atilika.kuromoji.naist.jdic.Tokenizer
  • com.atilika.kuromoji.naist.jdic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-naist-jdic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic

  • com.atilika.kuromoji.unidic.Tokenizer
  • com.atilika.kuromoji.unidic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-unidic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic-kanaaccent

  • com.atilika.kuromoji.unidic.kanaaccent.Tokenizer
  • com.atilika.kuromoji.unidic.kanaaccent.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-unidic-kanaaccent</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic-neologd

  • com.atilika.kuromoji.unidic.neologd.Tokenizer
  • com.atilika.kuromoji.unidic.neologd.Token

This dictionary will be available from Maven Central in a future version.

Building Kuromoji from source code

Released versions of Kuromoji are available from Maven Central.

If you want to build Kuromoji from source code, run the following command:

$ mvn clean package

This will download all source dictionary data and build Kuromoji with all dictionaries. The following jars will then be available:

kuromoji-core/target/kuromoji-core-1.0-SNAPSHOT.jar
kuromoji-ipadic/target/kuromoji-ipadic-1.0-SNAPSHOT.jar
kuromoji-ipadic-neologd/target/kuromoji-ipadic-neologd-1.0-SNAPSHOT.jar
kuromoji-jumandic/target/kuromoji-jumandic-1.0-SNAPSHOT.jar
kuromoji-naist-jdic/target/kuromoji-naist-jdic-1.0-SNAPSHOT.jar
kuromoji-unidic/target/kuromoji-unidic-1.0-SNAPSHOT.jar
kuromoji-unidic-kanaaccent/target/kuromoji-unidic-kanaaccent-1.0-SNAPSHOT.jar
kuromoji-unidic-neologd/target/kuromoji-unidic-neologd-1.0-SNAPSHOT.jar

The following additional build options are available (an example invocation follows the list):

  • -DskipCompileDictionary Do not recompile the dictionaries
  • -DskipDownloadDictionary Do not download source dictionaries
  • -DbenchmarkTokenizers Profile each tokenizer during the package phase using content from Japanese Wikipedia
  • -DskipDownloadWikipedia Prevent the compressed version of the Japanese Wikipedia (~765 MB) from being downloaded during profiling, for example when it has already been downloaded
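
For example, to rebuild without fetching the source dictionaries again, or to run the tokenizer benchmarks against an already-downloaded Wikipedia dump, the options above can be combined as follows (illustrative invocations only):

$ mvn clean package -DskipDownloadDictionary
$ mvn clean package -DbenchmarkTokenizers -DskipDownloadWikipedia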

License

Kuromoji is licensed under the Apache License, Version 2.0. See LICENSE.md for details.

This software also includes a binary and/or source version of data from various 3rd party dictionaries. See NOTICE.md for these details.

Contributing

Please open up issues if you have a feature request. We also welcome contributions through pull requests.

You will retain copyright to your own contributions, but you need to license them using the Apache License, Version 2.0. All contributors will be mentioned in the CONTRIBUTORS.md file.

About us

We are a small team of experienced software engineers based in Tokyo who offer technology and advice in the fields of search, natural language processing, and big data analytics.

Please feel free to contact us at [email protected] if you have any questions or need help.

kuromoji's People

Contributors

akkikiki, cmoen, emmanuellegedin, gautela, gerryhocks


kuromoji's Issues

Optimization opportunity in the fst usage.

Please take this report with a big pinch of salt: I am not even a kuromoji user and I did not profile the code thoroughly.

In ViterbiBuilder, kuromoji uses an FST to search for all possible prefixes of a given string that are within a dictionary (encoded as the FST).
The successive calls to lookup, however, restart from the root node of the FST. It would be advisable to get all of the prefixes in a single traversal of the FST.

The headroom is valuable, but not massive. Around 15% of the time is spent in Fst.lookup. One can hope to cut this bit in half.

Android runtime exception when creating new Tokenizer using kuromoji-ipadic

I implemented some test code on Android but I get a runtime exception when trying to create a new Tokenizer:

 java.lang.RuntimeException: Could not load dictionaries.

I'm using the kuromoji-ipadic package (com.atilika.kuromoji:kuromoji-ipadic:0.9.0).
I've traced through the stack a bit and found this:

Caused by: java.lang.IllegalArgumentException: capacity < 0: -4
     at java.nio.ByteBuffer.allocate(ByteBuffer.java:54)
     at com.atilika.kuromoji.io.IntegerArrayIO.readArray(IntegerArrayIO.java:38)
     at com.atilika.kuromoji.buffer.WordIdMap.<init>(WordIdMap.java:35)
     at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:168)
     at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
     at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:219)
     at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77) 
     at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:74) 
     at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:59) 
     at com.atilika.kuromoji.ipadic.Tokenizer$Builder.build(Tokenizer.java:203) 

Which is crashing in IntegerArrayIO at:

public class IntegerArrayIO {

    private static final int INT_BYTES = Integer.SIZE / Byte.SIZE;

    public static int[] readArray(InputStream input) throws IOException {
        DataInputStream dataInput = new DataInputStream(input);
        int length = dataInput.readInt(); // length is returning -1

        ByteBuffer tmpBuffer = ByteBuffer.allocate(length * INT_BYTES); // -1 * 4 = -4, crashes with "capacity < 0:-4"
        ReadableByteChannel channel = Channels.newChannel(dataInput);
        channel.read(tmpBuffer);

I realize the library probably wasn't originally intended for Android, but I'd really like to try it and do some testing to see if it's viable for use on mobile as well.

The code that triggers the original exception is just the construction of the Tokenizer via the builder:

        Tokenizer tokenizer = new Tokenizer.Builder().build();

The IntegerArrayIO class's readArray() method seems to work fine until trying to load the word ID map:

    public WordIdMap(InputStream input) throws IOException {
        indices = IntegerArrayIO.readArray(input);
        wordIds = IntegerArrayIO.readArray(input); // crashes here
    }

The above code is triggered by TokenInfoDictionary#setup():

        wordIdMap = new WordIdMap(resolver.resolve(TARGETMAP_FILENAME));

Here is the full stack trace (minus the parts that reference my code, which I think are irrelevant):

java.lang.RuntimeException: Could not load dictionaries.
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:231)
   at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77)
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:74)
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:59)
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.build(Tokenizer.java:203)

(... my sources omitted for brevity ...)

 Caused by: java.lang.IllegalArgumentException: capacity < 0: -4
   at java.nio.ByteBuffer.allocate(ByteBuffer.java:54)
   at com.atilika.kuromoji.io.IntegerArrayIO.readArray(IntegerArrayIO.java:38)
   at com.atilika.kuromoji.buffer.WordIdMap.<init>(WordIdMap.java:35)
   at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:168)
   at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:219)
   at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77) 
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:74) 
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:59) 
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.build(Tokenizer.java:203)

and the sample code I was trying to test with:

        StringBuilder stringBuilder = new StringBuilder();
        Tokenizer tokenizer = new Tokenizer.Builder().build();
        for (Token token : tokenizer.tokenize(text)) {
            stringBuilder.append(token.getSurface());
            stringBuilder.append(" ");
        }

When I test kuromoji on my desktop it seems to work fine on the local JVM, but on Android it crashes.
I'll continue to dig deeper and see if I can find out what's going on. If there's anything else you need let me know.

Thanks! (あざーす!)

Next release?

Hello.
I would like to know when the next release will be.
The last release is from 2015, and I have seen that development has continued recently.

Thank you

Best regards.
Marin.

How to use Kuromoji in Gradle?

   // https://mvnrepository.com/artifact/org.atilika.kuromoji/kuromoji
    compile group: 'org.atilika.kuromoji', name: 'kuromoji', version: '0.9.0'

and it causes an error:

Error:Failed to resolve: org.atilika.kuromoji:kuromoji:0.9.0
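
A likely fix, judging from the Maven coordinates listed in the README section above (the org.atilika.kuromoji group ID is outdated; current artifacts are published under com.atilika.kuromoji), is to depend on the new coordinates. A hedged Gradle sketch:

// build.gradle -- assumes mavenCentral() is declared in the repositories block
compile group: 'com.atilika.kuromoji', name: 'kuromoji-ipadic', version: '0.9.0'
// or, equivalently:
// compile 'com.atilika.kuromoji:kuromoji-ipadic:0.9.0'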

Integration with solr

Hi,
I am new to Solr. I have downloaded kuromoji and placed it in solr-5.3.0\server\lib, and added the relevant analyzer configuration in solr-5.3.0\server\solr\configsets\basic_configs\conf.
Now if I do a search, will it treat each search term as Japanese, or do I need to specify which text should be treated as Japanese?

GC overhead limit exceeded when compiling IPADIC NEologd

I am getting a GC overhead limit exceeded message when trying to run

mvn clean package

on Mac OS X (2012 MacBook Pro).

The GC overhead limit exceeded message apparently means the system is spending 98% of its time doing GC and only 2% or less doing useful work, which suggests memory may not actually be the problem. However, I am not sure how to test increasing the memory, so I can't say for certain.
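
One way to test this (a hedged suggestion: the dictionary compiler runs inside the Maven JVM via the exec-maven-plugin java goal shown in the log below, so heap settings passed through MAVEN_OPTS should apply) is to raise the heap before building:

$ export MAVEN_OPTS="-Xmx4g -XX:-UseGCOverheadLimit"
$ mvn clean package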


[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ kuromoji-ipadic ---
[INFO] Building jar: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic/target/kuromoji-ipadic-1.0-SNAPSHOT.jar
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Kuromoji IPADIC NEologd 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ kuromoji-ipadic-neologd ---
[INFO] Deleting /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/target
[INFO]
[INFO] --- maven-resources-plugin:2.7:copy-resources (copy-license-resources) @ kuromoji-ipadic-neologd ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 2 resources
[INFO] Copying 1 resource
[INFO]
[INFO] --- maven-antrun-plugin:1.6:run (download-dictionary) @ kuromoji-ipadic-neologd ---
[INFO] Executing tasks

main:
     [echo] Downloading dictionary
   [delete] Deleting directory /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary
    [mkdir] Created dir: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary
      [get] Getting: http://atilika.com/releases/mecab-ipadic-neologd/mecab-ipadic-2.7.0-20070801-neologd-20150925.tar.gz
      [get] To: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary/mecab-ipadic-2.7.0-20070801-neologd-20150925.tar.gz
    [untar] Expanding: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary/mecab-ipadic-2.7.0-20070801-neologd-20150925.tar.gz into /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary
[INFO] Executed tasks
[INFO]
[INFO] --- maven-compiler-plugin:3.3:compile (compile-dictionary-compiler) @ kuromoji-ipadic-neologd ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 6 source files to /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/target/classes
[INFO] /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/src/main/java/com/atilika/kuromoji/ipadic/neologd/Tokenizer.java: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/src/main/java/com/atilika/kuromoji/ipadic/neologd/Tokenizer.java uses unchecked or unsafe operations.
[INFO] /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/src/main/java/com/atilika/kuromoji/ipadic/neologd/Tokenizer.java: Recompile with -Xlint:unchecked for details.
[INFO]
[INFO] >>> exec-maven-plugin:1.2.1:java (run-dictionary-compiler) > validate @ kuromoji-ipadic-neologd >>>
[INFO]
[INFO] <<< exec-maven-plugin:1.2.1:java (run-dictionary-compiler) < validate @ kuromoji-ipadic-neologd <<<
[INFO]
[INFO] --- exec-maven-plugin:1.2.1:java (run-dictionary-compiler) @ kuromoji-ipadic-neologd ---
[KUROMOJI] 22:30:21: dictionary compiler
[KUROMOJI] 22:30:21:
[KUROMOJI] 22:30:21: input directory: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/dictionary/mecab-ipadic-2.7.0-20070801-neologd-20150925
[KUROMOJI] 22:30:21: output directory: /Users/zwoc/Workspace/deep-learning/narou/select/preproc/kuromoji/kuromoji-ipadic-neologd/src/main/resources/com/atilika/kuromoji/ipadic/neologd
[KUROMOJI] 22:30:21: input encoding: utf-8
[KUROMOJI] 22:30:21:
[KUROMOJI] 22:30:21: compiling tokeninfo dict...
[KUROMOJI] 22:30:21:     analyzing dictionary features
[KUROMOJI] 22:30:27:     reading tokeninfo
[KUROMOJI] 22:30:52:     compiling fst... [WARNING]
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.ArrayList.iterator(ArrayList.java:814)
    at java.util.AbstractList.hashCode(AbstractList.java:540)
    at com.atilika.kuromoji.fst.State.hashCode(State.java:127)
    at com.atilika.kuromoji.fst.Builder.findEquivalentState(Builder.java:243)
    at com.atilika.kuromoji.fst.Builder.freezeAndPointToNewState(Builder.java:179)
    at com.atilika.kuromoji.fst.Builder.createDictionaryCommon(Builder.java:143)
    at com.atilika.kuromoji.fst.Builder.build(Builder.java:119)
    at com.atilika.kuromoji.compile.FSTCompiler.compile(FSTCompiler.java:44)
    at com.atilika.kuromoji.compile.DictionaryCompilerBase.buildTokenInfoDictionary(DictionaryCompilerBase.java:70)
    at com.atilika.kuromoji.compile.DictionaryCompilerBase.build(DictionaryCompilerBase.java:37)
    at com.atilika.kuromoji.compile.DictionaryCompilerBase.build(DictionaryCompilerBase.java:172)
    at com.atilika.kuromoji.ipadic.neologd.compile.DictionaryCompiler.main(DictionaryCompiler.java:33)
    ... 6 more
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Kuromoji ........................................... SUCCESS [  0.186 s]
[INFO] Kuromoji Core ...................................... SUCCESS [  7.223 s]
[INFO] Kuromoji IPADIC .................................... SUCCESS [ 44.321 s]
[INFO] Kuromoji IPADIC NEologd ............................ FAILURE [03:53 min]
[INFO] Kuromoji JUMAN DIC ................................. SKIPPED
[INFO] Kuromoji NAIST-jdic ................................ SKIPPED
[INFO] Kuromoji UniDic .................................... SKIPPED
[INFO] Kuromoji UniDic Kana Accent ........................ SKIPPED
[INFO] Kuromoji UniDic NEologd ............................ SKIPPED
[INFO] Kuromoji Benchmark ................................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:45 min
[INFO] Finished at: 2015-10-07T22:33:55+09:00
[INFO] Final Memory: 15M/1834M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:java (run-dictionary-compiler) on project kuromoji-ipadic-neologd: An exception occured while executing the Java class. null: InvocationTargetException: GC overhead limit exceeded -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :kuromoji-ipadic-neologd

java.lang.RuntimeException: Could not load dictionaries. Caused by: java.io.IOException: Classpath resource not found: fst.bin

kuromoji-ipadic-1.0-SNAPSHOT.jar

09-27 11:22:12.361 E/AndroidRuntime( 1075): java.lang.RuntimeException: Could not load dictionaries.

09-27 11:22:12.361 E/AndroidRuntime( 1075): at ant.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at com.atilika.kuromoji.TokenizerBase.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at ans.(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at ans.(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at aoy.(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at aoy.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at aoz.(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at amw.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at amh.run(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at java.lang.Thread.run(Thread.java:818)

09-27 11:22:12.361 E/AndroidRuntime( 1075): Caused by: java.io.IOException: Classpath resource not found: fst.bin

09-27 11:22:12.361 E/AndroidRuntime( 1075): at anz.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): at ann.a(Unknown Source)

09-27 11:22:12.361 E/AndroidRuntime( 1075): ... 10 more

The exception is thrown in Tokenizer loadDictionaries(), at this.fst = FST.newInstance(this.resolver);

fst.bin is present in wear-app.apk under /com/atilika/kuromoji/ipadic

apk structure
-resources.arsc
-classes.dex
-AndroidManifest.xml
----res/
----META-INF/
----assets/
----com/atilika/kuromoji/ipadic/*.bin

Question: how to obtain multiple parsings?

MeCab has a -N flag with which a user can specify the top-N results to get back. On http://www.atilika.org/ the Viterbi algorithm's output graph shows all possible morphemes, along with the cost of each path, so I'm sure it's possible to get the top, say, five results, but is there a simpler way to get this, the equivalent of mecab -N 5? I'm using UniDic. Thank you 🙇!

Tokenizing text in Hiragana character set

Tokenizing a sentence "寿司が美味しい。" produces the following tokens:
<寿司>,<が>,<美味しい>,<。>

Tokenizing the same sentence written only in hiragana exhibits identical behavior, which is great.
<すし>,<が>,<おいしい>,<。>

However, for some other words, tokenization behavior depends on the input character set.

For example, for "大学生":

The word is correctly tokenized into <大学生> when the input is written in kanji.

When the input is written in hiragana, "だいがくせい", the same word produces the following tokens:
<だい>,<が>,<くせ>,<い>

Is this a known issue? Is there any configuration I could tweak so that the two cases behave the same way regardless of the input character set?

Thanks in advance for your help!
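
There may be no tokenizer option that changes this directly, since it depends on the dictionary entries and their costs. One possible workaround, sketched below using the user-dictionary CSV format and the Builder#userDictionary call that appear elsewhere on this page (the entry itself is hypothetical), is to add the hiragana spelling to a user dictionary:

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class HiraganaUserDictionaryExample {
    public static void main(String[] args) throws Exception {
        // surface,segmentation,reading,part-of-speech (same CSV format as the other user-dictionary examples on this page)
        String userDictionary = "だいがくせい,だいがくせい,ダイガクセイ,カスタム名詞";
        InputStream input = new ByteArrayInputStream(userDictionary.getBytes(StandardCharsets.UTF_8));

        Tokenizer.Builder builder = new Tokenizer.Builder();
        builder.userDictionary(input);
        Tokenizer tokenizer = builder.build();

        List<Token> tokens = tokenizer.tokenize("だいがくせい");
        for (Token token : tokens) {
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());
        }
    }
}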

Tokenizer is not serializable for Apache Spark

On Apache Spark, instances must be serializable for parallel processing, but kuromoji tokenizers are not, so they have to be initialized each time.
If tokenizers were serializable, we could decrease processing time.
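
Until the tokenizer itself is made serializable, a common workaround (a sketch only, not part of the kuromoji API) is to create one tokenizer lazily per JVM through a holder class, so that Spark only serializes the closure and each executor builds its own tokenizer on first use:

import com.atilika.kuromoji.ipadic.Tokenizer;

// Lazily creates a single Tokenizer per JVM (e.g. per Spark executor).
// The closure only references this class, so no non-serializable state
// has to be shipped to the workers.
public final class TokenizerHolder {

    private static Tokenizer tokenizer;

    private TokenizerHolder() {
    }

    public static synchronized Tokenizer get() {
        if (tokenizer == null) {
            tokenizer = new Tokenizer();
        }
        return tokenizer;
    }
}

Inside a map function you would then call TokenizerHolder.get().tokenize(line) instead of capturing a Tokenizer instance in the closure.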

Normalized surface in user dictionary.

In the current implementation of the ipadic user dictionary, there seems to be no functionality to normalize the surface form.
Is this right?

I think this functionality is very useful and needed in common situations.

So, I have a plan to extend the user dictionary to handle normalizing a word's surface form while keeping the current specification of the user dictionary resource format.

What do you think about this?

Compound word with nakaguro in it

Thanks for the library.

I was testing compound words with the nakaguro character in them and noticed that the compound word 'コカ・コーラ' is tokenized to a single term <コカ・コーラ> in Search mode, whereas another such word, 'アイス・キューブ', tokenizes to its components <アイス>, <キューブ>. Does the former produce a single token because it's a trademark, or could this be a bug? Ultimately, I'd like to find documents that contain <コカ・コーラ> using the search term <コーラ>.

Thanks in advance for your help!

How to enable discardPunctuation in Kuromoji Java

Hi,

I can remove punctuation with the Kuromoji ES plugin by setting "discard_punctuation": "true"

I'm wondering how I can get the same result with Kuromoji-Java.

For example, in Kuromoji-ES, 「浅草」駅 will be tokenized as

{
  "tokens" : [
    {
      "token" : "浅草",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "駅",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    }
  ]
}

Is there an equivalent function in Kuromoji-Java to do this?
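
There may not be a dedicated option for this in the builder; a simple workaround (a sketch only, relying on the IPADIC part-of-speech tags shown elsewhere on this page, where punctuation and brackets are tagged 記号) is to drop those tokens after tokenization:

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class DiscardPunctuationExample {
    public static void main(String[] args) {
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("「浅草」駅");
        for (Token token : tokens) {
            // Skip tokens whose top-level part of speech is 記号 (symbols/punctuation)
            if ("記号".equals(token.getPartOfSpeechLevel1())) {
                continue;
            }
            System.out.println(token.getSurface());
        }
    }
}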

The tokenizing performance of mixed language

The performance of the library on pure Japanese text is fine. But when I try to tokenize text with mixed languages (English and Japanese), when the language recognizer detects it as Japanese, the library filters out some English words which could be key information in the text. How can I deal with these cases by tuning or changing some parameters? You can try the following plain text first. The English phrase "[ 1 ] SINGAPORE 2" will be filtered, for example.

"ELECTRONIC e チケットお客様控 TICKET ITINERARY/RECEIPT 国際線自動 チェックイン 機用2次元 バーコード For International Self Service Unit ・搭乗手続き時、又は、出入国審査時に提示を求められた場合には、出入国に必要な全ての書類又は滞在先住所等の情報、 e チケットお客様控 ( Itine- rary/Receipt ) 、及びパスポート等の公的書類をご提示ください。コードシェア便の搭乗手続きは、運航会社で承ります。 ・ e チケットお客様控えは、旅程変更又は払い戻しの際に必要となる場合がありますので、ご旅行終了までお待ちください。 ・ Please present all necessary country specific travel documentation or data such as staying address , Itinerary/Receipt , and positive identification such as passport , when you are requested to do so at check-in , or at Immigration/Customs . ・ Please retain Itinerary/Receipt throughout your journey . Itinerary/Receipt may be required in case of itinerary change or refund . 搭乗者名: CHINOMI/KENJI MR PASSENGER NAME 航空券番号: 2052402171368 予約番号: YIFRBL 発行日: 24JAN16 TICKET NUMBER RESERVATION CODE DATE OF ISSUE OF ISSUE ISS . OFFICE CODE PLACE 発行所: SINGAPORE - NH SKY WEB SG R 発行店舗コード: 32393852 旅程表 ITINERARY 都市 /空港 ターミナル 便名 日付 曜日 時間 クラス 運賃種別 予約状況 手荷物 有効期限 CITY/AIRPORT TERMINAL FLIGHT NO . DATE DAY TIME CLASS FARE BASIS STATUS BAGGAGE INVALID BEFORE/AFTER 出発 DEPARTURE 出発 DEPARTURE [ 1 ] SINGAPORE 2 NH844 27JAN16 WED 2220 W ( Y ) WRCS0 OK 2PC 27JAN/27JAN 到着 ARRIVAL 座席 SEAT 到着 ARRIVAL 運航航空会社 OPERATING CARRIER 備考 REMARKS TOKYO ( HANEDA ) INT 28JAN16 THU 0600 ALL NIPPON AIRWAYS 出発 DEPARTURE 出発 DEPARTURE [ 2 ] TOKYO ( HANEDA ) SURFACE 到着 ARRIVAL 座席 SEAT 到着 ARRIVAL 運航航空会社 OPERATING CARRIER 備考 REMARKS TOKYO ( NARITA ) 出発 DEPARTURE 出発 DEPARTURE [ 3 ] TOKYO ( NARITA ) 1 NH801 30JAN16 SAT 1805 V ( Y ) VRCS0 OK 2PC 30JAN/30JAN 到着 ARRIVAL 座席 SEAT 到着 ARRIVAL 運航航空会社 OPERATING CARRIER 備考 REMARKS SINGAPORE 2 31JAN16 SUN 0040 ALL NIPPON AIRWAYS 全日本空輸株式会社 ALL NIPPON AIRWAYS CO . , LTD . PAGE 1 / 2 PRINTED 24JAN16 ELECTRONIC e チケットお客様控 TICKET ITINERARY/RECEIPT 国際線自動 チェックイン 機用2次元 バーコード For International Self Service Unit 搭乗者名: CHINOMI/KENJI MR PASSENGER NAME 航空券番号: 2052402171368 予約番号: YIFRBL 発行日: 24JAN16 TICKET NUMBER RESERVATION CODE DATE OF ISSUE OF ISSUE ISS . OFFICE CODE PLACE 発行所: SINGAPORE - NH SKY WEB SG R 発行店舗コード: 32393852 運賃/航空券情報 FARE/TICKET INFORMATION 運賃額: 支払運賃額: FARE SGD830.00 EQUIV . FARE PAID 税金・料金等合計: 航空会社手数料: TAXES/FEES/CHARGES/AIRLINE CHARGES TOTAL SGD74.90 AIRLINE SERVICE CHARGE SGD0.00 ツアーコード: TOTAL ( AIRLINE SERVICE CHARGE is not included . ) SGD904.90 TOUR CODE 支払手段: FORM OF PAYMENT CCCAXXXXXXXXXXXX3426**/XX-XX S 811289 制限事項: FLT/CNX/CHG RESTRICTED CHECK FARE RULE ENDORSEMENTS/RESTRICTIONS 運賃詳細: SIN NH TYO Q14.23 259.84NH SIN316.79NUC590.86END ROE1.404690 FARE CALCULATION 税金・料金等 詳細: SGD8.80YQ/ SGD19.90SG/ SGD6.10OP/ SGD8.00OO/ SGD25.70SW/ SGD6.40OI/ TAXES/FEES/CHARGES/ AIRLINE CHARGES DETAILS シンガポールの空港から出発する場合、上記金額には OP TAX ( Aviation Levy ) が含まれています。 OP tax ( Aviation Levy ) is included on a ticket when departing from the airport in Singapore . 原券: 交換券: ORIGINAL ISSUE ISSUED IN EXCHANGE FOR ご注意及び契約条件 /TICKET NOTICE ・運送やその他のサービスは、各運送人の運送約款に従います。運送約款については発行運送人にご確認ください。なお、 ANA の運送による日本国内区間のみの旅行であって国際運送の一環ではない場合、 ANAの 国内旅客運送約款が適用となります。 ・旅客が出発国以外の国に最終到達地又は寄港地を有する旅行を行なう場合は、その旅客の旅程全体 ( 同一国内の区間を含む ) についてモントリオール条約又はその前身のワルソー条約 ( その改正を含む ) の適用を 受けることがあります。その旅客に対し適用となる条約 ( 適用タリフに含まれる特別運送契約を含む ) が、運送人の責任を制限することがあります。詳細は、各運送人へお問い合わせください。 ・エアゾール、花火、可燃性液体などの危険物は航空機へ持込はできません。これら制限の詳細は航空会社へお問い合わせください。 ・このお客様控とともに、航空券の一部を成し、かつ「契約条件及びその他重要事項」を含む、一連のご案内書をお受け取りになります。これらのご案内書をお受け取りになられたことを必ずご確認いただき、 もしお受け取りになられていない場合には、旅行開始前に次の URL : https : //www . ana . co . jp/other/int/meta/0192.html ? 
CONNECTION_KIND\u003djp\u0026LANG\u003dj で入手いただくか、又は、発行運送人若しくは旅行会社へ 連絡ください。 ・このお客様控は、モントリオール条約及びワルソー条約第3条でいう「航空券」の一部をなします。ただし、航空会社が第3条の要件を満たす別の書類を旅客へ渡す場合を除きます。 ・ ANA のコンピュータシステムに保管されている eチケット情報と e チケットお客様控の情報に相違がある場合、コンピュータシステム上の e チケット情報を有効と致します。 ・ Carriage and other services provided by the carrier are subject to conditions of carriage , which are hereby incorporated by reference . These conditions may be obtained from the issuing carrier . Please note that if you travel on ANA \u0027s domestic sector flights within Japan only , without any international connecting flights , ANA \u0027s Conditions of Carriage for Passengers and Baggage for domestic flights will apply . ・ Passengers on a journey involving an ultimate destination or a stop in a country other than the country of departure are advised that international treaties known as the Montreal Convention , or its predecessor , the Warsaw Convention , including its amendments ( the Warsaw Convention System ) , may apply to the entire journey , including any portion thereof within a country . For such passengers , the applicable treaty , including special contracts of carriage embodied in any applicable tariffs , governs and may limit the liability of the carrier . Check with your carrier for more information . ・ The carriage of certain hazardous materials , like aerosols , fireworks , and flammable liquids , aboard the aircraft is forbidden . If you do not understand these restrictions , further information may be obtained from your airline . ・ Further information may be obtained from the carrier . With this ticket you will receive a set of notices which forms part of the ticket and contains the " Conditions of Contract and Other Important Notices " . Please make sure that you have received these notices , and if not , obtain copies prior to the commencement of your journey at the following URL : https : //www . ana . co . jp/other/int/meta/0192.html ? CONNECTION_KIND\u003djp\u0026LANG\u003de , or contact the issuing airline or travel agent . ・ This Itinerary/Receipt constitutes the " passenger ticket " for the purposes of Article 3 of the Montreal Convention and the Warsaw Convention , except where the carrier delivers to the passenger another document complying with the requirements of Article 3. ・ Ticketing information contained in ANA \u0027s computer system shall prevail should any discrepancy occur between the Itinerary/Receipt held by the customer and the ticketing information in our computer system . 全日本空輸株式会社 ALL NIPPON AIRWAYS CO . , LTD . PAGE 2 / 2 PRINTED 24JAN16"

Unidic design flaw

Unidic's lex data doesn't have enough information for the Viterbi algorithm to distinguish words with the same readings and same word types in context. So お父さん is always interpreted as お・ちち・さん, instead of お・とう・さん as it should be.

父,5142,5142,3860,名詞,普通名詞,一般,*,*,*,チチ,父,父,チチ,父,チチ,和,*,*,*,*

父,5142,5142,4656,名詞,普通名詞,一般,*,*,*,トウ,父,父,トー,父,トー,和,*,*,*,*

They're otherwise identical, but the ちち reading has a lower cost, so it always wins when the word is in the kanji form. Basically, unidic's segment features don't have a way to distinguish these. It's easy to write a script that looks for segments that are identical in surface form and feature list and see what problematic matches there are.

This is basically impossible to fix on kuromoji's side without adding a list of segments that act differently than their features indicate, which would be ridiculous. On the other hand, one of kuromoji's implicit goals is to not be worse than other morphological analyzers, so this is a problem worth posting about.

I added a bunch of お父 etc. entries to my user dictionary to gloss over this problem by prepending the お・御. (for unidic-kanaaccent STAGING)

おとう,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,おとう,オトー,おとう,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*
お父,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,お父,オトー,お父,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*
御父,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,御父,オトー,御父,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*

おかあ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,おかあ,オカー,おかあ,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*
お母,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,お母,オカー,お母,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*
御母,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,御母,オカー,御母,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*

おにい,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,おにい,オニー,おにい,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*
お兄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,お兄,オニー,お兄,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*
御兄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,御兄,オニー,御兄,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*

おねえ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,おねえ,オネー,おねえ,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
お姉,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,お姉,オネー,お姉,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
御姉,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,御姉,オネー,御姉,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*

お姐,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姐,お姐,オネー,お姐,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
御姐,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姐,御姐,オネー,御姐,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*

おばあ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,おばあ,オバー,おばあ,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*
お婆,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,お婆,オバー,お婆,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*
御婆,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,御婆,オバー,御婆,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*


おじい,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,おじい,オジー,おじい,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*
お爺,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,お爺,オジー,お爺,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*
御爺,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,御爺,オジー,御爺,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*

(weights are for illustration, I think they're too high to catch in all intended cases)

Kuromoji_tokenizer: sort clause does not seem to work for some specific character combinations

Query:

{
  "query": {
    "bool": {
    }
  },
  "sort": [
    {
      "attribute.sortable": {
        "order": "asc"
      }
    }
  ]
}

Results:

"hits": [
  {
    "_index": "example_1",
    "_type": "example_1",
    "_id": "A2Ff26qFaV",
    "_score": null,
    "_source": {
      "attributes": {
        "attribute": "サヨ",
      }
    },
    "sort": [
      "サヨ"
    ]
  },
  {
    "_index": "example_2",
    "_type": "example_2",
    "_id": "A2Ff26qFaV",
    "_score": null,
    "_source": {
      "attributes": {
        "attribute": "シヨ",
      }
    },
    "sort": [
      "シ"
    ]
  }
]

The sort works on the characters in the attribute field for the example_1 doc but not for the example_2 doc.

Observed this in 3 instances in total for these strings:

  • シヨ
  • ヲシ
  • シヲ

Longer string in Katakana has low priority

Tested with kuromoji-core-1.0-SNAPSHOT and kuromoji-ipadic-1.0-SNAPSHOT.
(build from master at 2017/3/8)

When the user dictionary is

くろも,くろも,くろも,カスタム名詞
ろ,ろ,ろ,カスタム名詞

, the string "くろもじ" is tokenized into

くろも カスタム名詞,*,*,*,*,*,*,くろも,*
じ 助動詞,*,*,*,不変化型,基本形,じ,ジ,ジ

which is fine.

When the user dictionary is

クロモ,クロモ,クロモ,カスタム名詞
ロ,ロ,ロ,カスタム名詞

, the string "クロモジ" is tokenized into

ク 名詞,一般,*,*,*,*,ク,ク,ク
ロ カスタム名詞,*,*,*,*,*,*,ロ,*
モ *,*,*,*,*,*,*,*,*
ジ *,*,*,*,*,*,*,*,*

which is not fine.

I expected the following:

クロモ カスタム名詞,*,*,*,*,*,*,クロモ,*
ジ *,*,*,*,*,*,*,*,*

What should I do to get the expected result?

sample code I used:

public static void main(String[] args) {
  // Hiragana case (works as expected):
  // String target = "くろもじ";
  // List<String> dictionaryList = Arrays.asList("くろも,くろも,くろも,カスタム名詞", "ろ,ろ,ろ,カスタム名詞");
  // Katakana case (unexpected result):
  String target = "クロモジ";
  List<String> dictionaryList = Arrays.asList("クロモ,クロモ,クロモ,カスタム名詞", "ロ,ロ,ロ,カスタム名詞");
  String dictionary = String.join(System.lineSeparator(), dictionaryList);
  Builder builder = new Tokenizer.Builder();
  try {
    InputStream inputStream = new ByteArrayInputStream(dictionary.getBytes("utf-8"));
    builder.userDictionary(inputStream);
  } catch (Exception e) {
    e.printStackTrace();
  }
  Tokenizer tokenizer = builder.build();
  List<Token> tokens = tokenizer.tokenize(target);
  tokens.stream().forEach(token -> System.out.println(token.getSurface() + "\t" + token.getAllFeatures()));
}

Possible Issue with tokenization when English+Japanese are adjacent in text

Text => Dior化粧品等の輸入総代理店で, which, indexed with the default Kuromoji analyzer, produces the following tokens:

dior
start: 0 end: 4 pos: 0
化粧
start: 4 end: 6 pos: 1
品等
start: 6 end: 8 pos: 2
輸入
start: 9 end: 11 pos: 4
総
start: 11 end: 12 pos: 5
代理
start: 12 end: 14 pos: 6
店
start: 14 end: 15 pos: 7

However, we noticed that when a user searched for the term Dior化粧品, it did not produce a match (using the same analyzer settings). The reason is that the search term is tokenized as follows:

dior
start: 0 end: 4 pos: 0
化粧
start: 4 end: 6 pos: 1
品  
start: 6 end: 7 pos: 2

Since the word for cosmetics is the Japanese term 化粧品, it seems like the search term got analyzed correctly, but the piece of text produced an unexpected bigram sequence of 化粧 and 品等.

I'm not sure whether this is a valid issue due to the mix of English and Japanese in the text, or whether my Japanese fundamentals are off here.

Kuromoji POS Train

Hi,

Can we train Kuromoji POS? If yes, please tell me the format of the data needed as input.

Thanks

Very odd tokenization of a sentence

Tokenizing "色々やらなきゃならんことがたくさんあるんだ" in the command line version of Kuromoji (using ipadic) results in

色々
やら
なき
ゃならんことがたくさんあるんだ

The same output is observed in the online demo available at http://www.atilika.org/

Unidic Tokenization on Romaji Words

Tested with version 0.9.0.

I know this is for Japanese, but it would be nice if some romaji words were tokenized consistently.

The string "hello golf2" is tokenized into:

  • hello
  • golf
  • 2

which is fine. But when I tokenize "golf2 hello" using com.atilika.kuromoji.unidic.Tokenizer (also unidic.kanaaccent, but not the other tokenizers), I get:

  • g
  • o
  • l
  • f
  • 2
  • hello

It would be nice if the second case behaved like the first. In the meantime, I might handle this with a user dictionary.

Segmentation wrong when a token contains square brackets?

Looks like the segmenter does not work properly if there are square brackets, e.g.:

[   名詞,サ変接続,*,*,*,*,*,*,*
滧 名詞,一般,*,*,*,*,*,*,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
]。    名詞,サ変接続,*,*,*,*,*,*,*

or

「 記号,括弧開,*,*,*,*,「,「,「
国宝  名詞,一般,*,*,*,*,国宝,コクホウ,コクホー
五 名詞,数,*,*,*,*,五,ゴ,ゴ
城 名詞,一般,*,*,*,*,城,シロ,シロ
」[    名詞,サ変接続,*,*,*,*,*,*,*

Internals documentation and academic papers?

Is there any description of how kuromoji works? E.g. an overview of what each class does, how they work together. And/or academic papers on what it is doing? (E.g. Is it behaving identically to MeCab, ChaSen or Juman, and if not, what innovations is it using and why? What design trade-offs are there?)

(If neither is available, this issue is a request for that kind of documentation; if they are then it is a request for them to be linked to from the README.md file. Thanks!)

http://www.atilika.org showcases the outdated maven artifact repository information

I'm not sure where to report this issue, or whether this is the proper place to report it...

See the Maven artifact repository information section:
http://www.atilika.org

It seems that:

  1. The groupid of org.atilika.kuromoji is outdated and the latest one is com.atilika.kuromoji.
  2. kuromoji currently has 7 dictionaries, but the above website showcases a version which has only one dictionary.
  3. The link for this package is broken :)

===== (Since this seems to be a Japanese company, the report was also written in Japanese; English translation below.)
I'm not quite sure where this issue should be reported, or whether this is the appropriate place...

Please see the Maven artifact repository information section:
http://www.atilika.org

It appears to be in the following state:

  1. The group ID org.atilika.kuromoji is outdated; the current ID is com.atilika.kuromoji.
  2. The current kuromoji has seven dictionaries, but the website shows a version that has only one dictionary.
  3. The link for this package no longer works. :)

How to increase heap size other than MAVEN_OPTS

I am trying to build kuromoji with NEologd on CircleCI.
I tried to increase the heap size by setting the environment variable MAVEN_OPTS="-Xmx4096m -XX:-UseGCOverheadLimit",
but an OutOfMemoryError occurs:

...
[KUROMOJI] 02:40:21:     reading tokeninfo
[WARNING] 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:297)
    at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
...

I debugged using the -X option and then found:

env.MAVEN_OPTS= -Xmx4096m -XX:-UseGCOverheadLimit -Xmx512m

-Xmx512m appears at the end, so I think my -Xmx4096m was overridden by it.
I don't understand the underlying issue of why -Xmx512m is already set, but I would like to know whether there is another way to increase the heap size.

Can you give me any ideas on how to avoid this issue? Thanks.

Configuring with Maven

Hi folks,

I'm aware this isn't quite specific to Kuromoji exactly, however, it is related to getting Kuromoji functioning so I hope this question is ok here.

The Setup

I'm not 100% familiar with Java and have followed some guides to spin up a quick Maven project.

I'm using simple CMD commands to get started as described on maven getting started.

I've included the dependencies in the pom.xml but noticed that no Kuromoji-related files were inside the packaged .jar. After some digging, I learned about maven-assembly-plugin and now seem to be collecting all the needed bits and pieces.

Issue

This is my output to CMD from the example code posted in the README.md

?       ???,????,*,*,*,*,?,?,?
??      ??,??,*,*,*,*,??,??,??
?       ??,???,??,*,*,*,?,?,?
??      ??,??,*,*,??,???,???,??,??
??      ???,*,*,*,?????,???,??,??,??
?       ??,??,*,*,*,*,?,?,?

Has anyone come across this kind of thing before?

I think this issue is down to me not being very familiar with Java.
My next step is to move away from using CMD and set up the project in Eclipse (although their website is currently down 😦 )

Obtain furigana?

Hi, the documentation says kuromoji can extract the readings for kanji and shows an example in which the reading for each token is extracted. However, is it possible to extract what part of the reading corresponds to each kanji?

For example, given a token with the contents:

"寿司" -> 寿 = ス, 司 = シ

Kanji penalty and other penalty

Hi, what do the kanji penalty and other penalty do, and when would I need to use this feature? Thanks in advance. If there is any explanation or documentation about this that I haven't found, can someone link me to it?

Kuromoji on Android

Hello and thank you for your library.

I tried to use Kuromoji on Android (actually it's a bit overkill for me; I only want to convert Japanese text to romaji for pronunciation).
I encountered this error:

java.lang.IllegalArgumentException: capacity < 0: -4
  at java.nio.ByteBuffer.allocate(ByteBuffer.java:54)
  at com.atilika.kuromoji.io.IntegerArrayIO.readArray(IntegerArrayIO.java:38)
  at com.atilika.kuromoji.buffer.WordIdMap.<init>(WordIdMap.java:35)
  at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:168)
  at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
  at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:219)

I guessed that I hit a memory limit and that this is not the library I'm looking for.

Can you confirm? And do you have a better idea for extracting romaji from Japanese text?

Thanks a lot.

Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project kuromoji-benchmark: There are test failures.

Another out-of-memory issue, despite the fact that I ran

export MAVEN_OPTS=-Xmx3g
-------------------------------------------------------------------------------
Test set: com.atilika.kuromoji.benchmark.SimpleBenchmarkTest
-------------------------------------------------------------------------------
Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 67.952 sec <<< FAILURE!
testSimpleBenchmark(com.atilika.kuromoji.benchmark.SimpleBenchmarkTest)  Time elapsed: 5.8 sec  <<< ERROR!
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
    at com.atilika.kuromoji.io.ByteBufferIO.read(ByteBufferIO.java:39)
    at com.atilika.kuromoji.buffer.TokenInfoBuffer.<init>(TokenInfoBuffer.java:39)
    at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:165)
    at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
    at com.atilika.kuromoji.TokenizerBase$Builder.loadDictionaries(TokenizerBase.java:289)
    at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77)
    at com.atilika.kuromoji.unidic.neologd.Tokenizer.<init>(Tokenizer.java:67)
    at com.atilika.kuromoji.unidic.neologd.Tokenizer.<init>(Tokenizer.java:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at java.lang.Class.newInstance(Class.java:374)
    at com.atilika.kuromoji.benchmark.SimpleBenchmarkTest.tokenizeForName(SimpleBenchmarkTest.java:92)
    at com.atilika.kuromoji.benchmark.SimpleBenchmarkTest.testSimpleBenchmark(SimpleBenchmarkTest.java:48)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
