lindera's Introduction

Lindera

License: MIT · Chat: https://gitter.im/lindera-morphology/lindera · Crates.io: https://crates.io/crates/lindera

A morphological analysis library in Rust. This project is a fork of kuromoji-rs.

Lindera aims to build a library which is easy to install and provides concise APIs for various Rust applications.

The following is required to build:

  • Rust >= 1.46.0

Tokenization examples

Basic tokenization

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "0.31.0", features = ["ipadic"] }

This example covers the basic usage of Lindera.

It will:

  • Create a tokenizer in normal mode
  • Tokenize the input text
  • Output the tokens

use lindera::{
    DictionaryConfig, DictionaryKind, LinderaResult, Mode, Tokenizer, TokenizerConfig,
};

fn main() -> LinderaResult<()> {
    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    let config = TokenizerConfig {
        dictionary,
        user_dictionary: None,
        mode: Mode::Normal,
    };

    // create tokenizer
    let tokenizer = Tokenizer::from_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --features=ipadic --example=ipadic_basic_example

You can see the result as follows:

関西国際空港
限定
トートバッグ
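
Each token also carries byte offsets (and dictionary details, as shown in the analysis example later in this README). A minimal sketch extending the output loop above, assuming the byte_start and byte_end fields used in that example; field names may differ between Lindera versions:

for token in tokens {
    println!("{}\t{}..{}", token.text, token.byte_start, token.byte_end);
}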

Tokenization with user dictionary

You can provide user dictionary entries along with the default system dictionary. The user dictionary should be a CSV file in the following format:

<surface>,<part_of_speech>,<reading>

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "0.31.0", features = ["ipadic"] }

For example:

% cat ./resources/ipadic_simple_userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ

With a user dictionary, the Tokenizer is created as follows:

use std::path::PathBuf;

use lindera::{
    DictionaryConfig, DictionaryKind, LinderaResult, Mode, Tokenizer, TokenizerConfig,
    UserDictionaryConfig,
};

fn main() -> LinderaResult<()> {
    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    let user_dictionary = Some(UserDictionaryConfig {
        kind: DictionaryKind::IPADIC,
        path: PathBuf::from("./resources/ipadic_simple_userdic.csv"),
    });

    let config = TokenizerConfig {
        dictionary,
        user_dictionary,
        mode: Mode::Normal,
    };

    let tokenizer = Tokenizer::from_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run with cargo run --example:

% cargo run --features=ipadic --example=ipadic_userdic_example
東京スカイツリー
の
最寄り駅
は
とうきょうスカイツリー駅
です

Analysis examples

Basic analysis

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "0.31.0", features = ["ipadic", "filter"] }

This example covers the basic usage of the Lindera analysis framework.

It will:

  • Apply character filter for Unicode normalization (NFKC)
  • Tokenize the input text with IPADIC
  • Apply token filters (Japanese compound word filter, Japanese number filter, and Japanese stop-tags filter)

use std::collections::HashSet;

use lindera::{
    Analyzer, BoxCharacterFilter, BoxTokenFilter, DictionaryConfig, DictionaryKind,
    JapaneseCompoundWordTokenFilter, JapaneseCompoundWordTokenFilterConfig,
    JapaneseIterationMarkCharacterFilter, JapaneseIterationMarkCharacterFilterConfig,
    JapaneseNumberTokenFilter, JapaneseNumberTokenFilterConfig,
    JapaneseStopTagsTokenFilter, JapaneseStopTagsTokenFilterConfig, LinderaResult, Mode,
    Tokenizer, TokenizerConfig, UnicodeNormalizeCharacterFilter,
    UnicodeNormalizeCharacterFilterConfig, UnicodeNormalizeKind,
};

fn main() -> LinderaResult<()> {
    let mut character_filters: Vec<BoxCharacterFilter> = Vec::new();

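    // Character filter: Unicode normalization (NFKC)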
    let unicode_normalize_character_filter_config =
            UnicodeNormalizeCharacterFilterConfig::new(UnicodeNormalizeKind::NFKC);
    let unicode_normalize_character_filter =
        UnicodeNormalizeCharacterFilter::new(unicode_normalize_character_filter_config);
    character_filters.push(BoxCharacterFilter::from(unicode_normalize_character_filter));

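    // Character filter: normalize Japanese iteration marks (e.g. 々, ゝ)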
    let japanese_iteration_mark_character_filter_config =
        JapaneseIterationMarkCharacterFilterConfig::new(true, true);
    let japanese_iteration_mark_character_filter = JapaneseIterationMarkCharacterFilter::new(
        japanese_iteration_mark_character_filter_config,
    );
    character_filters.push(BoxCharacterFilter::from(
        japanese_iteration_mark_character_filter,
    ));

    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    let config = TokenizerConfig {
        dictionary,
        user_dictionary: None,
        mode: Mode::Normal,
    };

    let tokenizer = Tokenizer::from_config(config)?;

    let mut token_filters: Vec<BoxTokenFilter> = Vec::new();

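    // Token filter: concatenate consecutive number tokens (名詞,数) into a single compound token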
    let japanese_compound_word_token_filter_config =
        JapaneseCompoundWordTokenFilterConfig::new(
            DictionaryKind::IPADIC,
            HashSet::from_iter(vec!["名詞,数".to_string()]),
            Some("名詞,数".to_string()),
        )?;
    let japanese_compound_word_token_filter =
        JapaneseCompoundWordTokenFilter::new(japanese_compound_word_token_filter_config);
    token_filters.push(BoxTokenFilter::from(japanese_compound_word_token_filter));

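    // Token filter: normalize Japanese number tokens tagged 名詞,数 (e.g. kanji numerals to Arabic digits)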
    let japanese_number_token_filter_config =
        JapaneseNumberTokenFilterConfig::new(Some(HashSet::from_iter(vec![
            "名詞,数".to_string()
        ])));
    let japanese_number_token_filter =
        JapaneseNumberTokenFilter::new(japanese_number_token_filter_config);
    token_filters.push(BoxTokenFilter::from(japanese_number_token_filter));

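    // Token filter: drop tokens whose part-of-speech tags match the stop-tag set below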
    let japanese_stop_tags_token_filter_config =
        JapaneseStopTagsTokenFilterConfig::new(HashSet::from_iter(vec![
            "接続詞".to_string(),
            "助詞".to_string(),
            "助詞,格助詞".to_string(),
            "助詞,格助詞,一般".to_string(),
            "助詞,格助詞,引用".to_string(),
            "助詞,格助詞,連語".to_string(),
            "助詞,係助詞".to_string(),
            "助詞,副助詞".to_string(),
            "助詞,間投助詞".to_string(),
            "助詞,並立助詞".to_string(),
            "助詞,終助詞".to_string(),
            "助詞,副助詞/並立助詞/終助詞".to_string(),
            "助詞,連体化".to_string(),
            "助詞,副詞化".to_string(),
            "助詞,特殊".to_string(),
            "助動詞".to_string(),
            "記号".to_string(),
            "記号,一般".to_string(),
            "記号,読点".to_string(),
            "記号,句点".to_string(),
            "記号,空白".to_string(),
            "記号,括弧閉".to_string(),
            "その他,間投".to_string(),
            "フィラー".to_string(),
            "非言語音".to_string(),
        ]));
    let japanese_stop_tags_token_filter =
        JapaneseStopTagsTokenFilter::new(japanese_stop_tags_token_filter_config);
    token_filters.push(BoxTokenFilter::from(japanese_stop_tags_token_filter));

    let analyzer = Analyzer::new(character_filters, tokenizer, token_filters);

    let mut text =
        "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。".to_string();
    println!("text: {}", text);

    // analyze the text (character filters -> tokenizer -> token filters)
    let tokens = analyzer.analyze(&mut text)?;

    // output the tokens
    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.text, token.byte_start, token.byte_end, token.details
        );
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --features=ipadic,filter --example=analysis_example

You can see the result as follows:

text: Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。
token: Lindera, start: 0, end: 21, details: Some(["UNK"])
token: 形態素, start: 24, end: 33, details: Some(["名詞", "一般", "*", "*", "*", "*", "形態素", "ケイタイソ", "ケイタイソ"])
token: 解析, start: 33, end: 39, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "解析", "カイセキ", "カイセキ"])
token: エンジン, start: 39, end: 54, details: Some(["名詞", "一般", "*", "*", "*", "*", "エンジン", "エンジン", "エンジン"])
token: ユーザ, start: 0, end: 26, details: Some(["名詞", "一般", "*", "*", "*", "*", "ユーザー", "ユーザー", "ユーザー"])
token: 辞書, start: 26, end: 32, details: Some(["名詞", "一般", "*", "*", "*", "*", "辞書", "ジショ", "ジショ"])
token: 利用, start: 35, end: 41, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "利用", "リヨウ", "リヨー"])
token: 可能, start: 41, end: 47, details: Some(["名詞", "形容動詞語幹", "*", "*", "*", "*", "可能", "カノウ", "カノー"])

API reference

The API reference is available. Please see the following URL:

https://docs.rs/lindera

lindera's Issues

Change the project name again

The name Mokuzu has a similar pronunciation to mozc, so I want to avoid confusion.
Since this project is a fork of kuromoji-rs, change the name to be derived from kuromoji.

Automate release tasks

Update the workflows:

  • regression.yml: run tests on three platforms (Linux/Windows/macOS) on each push and pull request.
  • periodic.yml: run tests on the stable/beta/nightly versions of Rust periodically.
  • release.yml: when a tag is created, create a GitHub release and publish to crates.io.

Build error in benches

   Compiling lindera v0.5.1 (/Users/johtani/IdeaProjects/rust-workspace/lindera-workspace/lindera/lindera)
error[E0599]: no function or associated item named `default_normal` found for struct `lindera::tokenizer::Tokenizer` in the current scope
 --> lindera/benches/bench.rs:8:40
  |
8 |         let mut tokenizer = Tokenizer::default_normal();
  |                                        ^^^^^^^^^^^^^^ function or associated item not found in `lindera::tokenizer::Tokenizer`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0599`.
error: could not compile `lindera`.

Lindera doesn’t build

Currently, we can’t import lindera in the latest version. It doesn’t build, and since the change has been pushed as a minor version, it probably broke every project relying on lindera.

...
    Checking lindera-decompress v0.13.5
    Checking bstr v0.2.17
    Checking lindera-core v0.13.5
    Checking csv v1.1.6
   Compiling character_converter v2.1.0
    Checking lindera-unidic-builder v0.13.5
    Checking lindera-ipadic-builder v0.13.5
    Checking lindera-dictionary v0.13.5
    Checking lindera-ko-dic-builder v0.13.5
    Checking lindera-cc-cedict-builder v0.13.5
   Compiling lindera-ipadic v0.13.5
    Checking lindera v0.13.5
error[E0599]: no variant or associated item named `DictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
  --> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:64:40
   |
64 |             _ => Err(LinderaErrorKind::DictionaryTypeError
   |                                        ^^^^^^^^^^^^^^^^^^^
   |                                        |
   |                                        variant or associated item not found in `lindera_core::error::LinderaErrorKind`
   |                                        help: there is a variant with a similar name: `DictionaryLoadError`

error[E0599]: no variant or associated item named `UserDictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
  --> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:84:40
   |
84 |             _ => Err(LinderaErrorKind::UserDictionaryTypeError
   |                                        ^^^^^^^^^^^^^^^^^^^^^^^ variant or associated item not found in `lindera_core::error::LinderaErrorKind`

For more information about this error, try `rustc --explain E0599`.
error: could not compile `lindera` due to 2 previous errors

You can use this repository to reproduce the issue: https://github.com/meilisearch/charabia at sha 82c9f3b

Downloading and decompressing dictionaries takes a lot of time

Hey @mosuka,

We were facing compilation slowdowns at Meilisearch recently and investigated; we found that lindera-ipadic was taking a lot of time, probably downloading the mecab-ipadic-2.7.0-20070801.tar.gz tarball from SourceForge.

If you want to look at the time it takes on our side, you can just execute the below command and open the generated HTML report.

rustup update
cargo +nightly build --timings

As you can see, the CPU is idle for a long time during the build.

Reconsider default LZMA dependency without any option to avoid it

Issue

PR #139, introduced in v0.9.0, makes LZMA (rust-lzma or lzma-rs) a mandatory dependency.
This forces all users to install the external library liblzma to be able to compile Lindera.

In comparison, v0.8.1 only requires adding lindera to the project's Cargo.toml.

Context

At Meilisearch we plan to use Lindera to tokenize Japanese text, but we don't want to ask our users to install external libraries manually; we want to keep Meilisearch easy to install and easy to use.

Potential solutions

  • reconsider #139
  • choose a compression library that doesn't require a manually installed system library (vendored or pure Rust)
  • provide a feature flag to choose the compression method (sketched below)
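
A hypothetical sketch of the feature-flag option in Cargo.toml terms; the feature and dependency names here are illustrative assumptions, not actual lindera features:

[features]
default = ["compress-lzma"]
compress-lzma = ["lzma-rs"]   # pure-Rust LZMA, no system liblzma required
compress-none = []            # skip dictionary compression entirely

[dependencies]
lzma-rs = { version = "0.2", optional = true }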

Thanks for maintaining Lindera 😊

Move lindera-cli to another repository

Move lindera-cli to another repository.
Currently, the lindera-cli package is managed in the lindera repository as a member of the workspace.
Keep the lindera repository for library crates only, and move binary crates like lindera-cli to a separate repository.

Unable to download UniDic from clrd.ninjal.ac.jp

error: failed to run custom build command for `lindera-unidic v0.13.5 (/home/minoru/github.com/lindera-morphology/lindera/lindera-unidic)`

Caused by:
  process didn't exit successfully: `/home/minoru/github.com/lindera-morphology/lindera/target/debug/build/lindera-unidic-0a9382db4954e5bf/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=Cargo.toml

  --- stderr
  Error: Transport(Transport { kind: ConnectionFailed, message: Some("tls connection init failed"), url: Some(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("clrd.ninjal.ac.jp")), port: None, path: "/unidic_archive/cwj/2.1.2/unidic-mecab-2.1.2_src.zip", query: None, fragment: None }), source: Some(Custom { kind: InvalidData, error: InvalidCertificateData("invalid peer certificate: UnknownIssuer") }) })

Create a user dictionary package

Separate the user dictionary functionality currently contained in the lindera-ipadic-builder package into its own package.

For example: lindera-user-dic-builder

Can't build lindera-ipadic on Raspberry Pi 4B

Lindera-ipadic is a requirement of the zola static website generator written in Rust.

During the zola build, it fails while building lindera-ipadic with this error:
memory allocation of 805306368 bytes failed
error: could not compile lindera-ipadic.

Environment: Raspberry Pi 4B, 4GB memory, debian.

I have tried to give it more contiguous memory by rebooting and trying again with a fresh system and no user apps running. Even then, the system apparently can't give it 800MB (!) of presumably contiguous memory. free -mh shows 2.7GB free, but not contiguous, I imagine.

Zola developers have asked me to report this to you. They do not think lindera-ipadic requires 800MB to build.

Thanks.

Support user dictionary

Currently, Lindera does not support user dictionaries. Rebuilding the system dictionary to register a new term in the morphological dictionary is too much of a burden for light users.
So we're going to support a simple user dictionary, as Kuromoji does.

Prepare a trait for implementing each dictionary builder

Functions are currently duplicated across the dictionary builder packages.
To ease maintenance, we will prepare traits and implement a dictionary builder structure for each dictionary builder in its own package; a sketch follows.
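
A minimal sketch of what such a trait could look like; the trait and method names are assumptions for illustration, not the actual lindera API:

use std::path::Path;

/// Hypothetical shared interface for the per-dictionary builder packages
/// (lindera-ipadic-builder, lindera-unidic-builder, and so on).
pub trait DictionaryBuilder {
    type Error;

    /// Build a system dictionary from the source files in `input_dir`,
    /// writing the compiled binary dictionary into `output_dir`.
    fn build_dictionary(&self, input_dir: &Path, output_dir: &Path) -> Result<(), Self::Error>;

    /// Build a user dictionary from a single CSV file.
    fn build_user_dictionary(&self, input_file: &Path, output_dir: &Path) -> Result<(), Self::Error>;
}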

Publishing on crates.io

Publish on crates.io.
However, cargo publish failed with the following error:

error: api errors (status 200 OK): max upload size is: 10485760

lindera-ipadic randomly fails during build

When compiling lindera, we frequently hit a build error:

 error: failed to run custom build command for `lindera-ipadic v0.10.0`

Caused by:
  process didn't exit successfully: `D:\a\milli\milli\target\release\build\lindera-ipadic-caf28ea0e76b9e29\build-script-build` (exit code: 1)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=Cargo.toml

  --- stderr
  Error: Custom { kind: UnexpectedEof, error: TarError { desc: "failed to iterate over archive", io: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" } } }

It seems to be related to the dictionaries.

Any idea what the reason could be? The Google Drive download, perhaps? 🤔

Add GitHub Actions Integration

Add GitHub Actions integration like mosuka/bayard#94, along with some refactoring as follows:

  • Add GitHub Actions Integration
  • Make the output format an enum
  • Make the tokenize mode an enum (both sketched below)
  • Optimize build script
  • Update Dockerfile
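
A minimal sketch of the two enum refactorings mentioned above; the variant names are assumptions for illustration, not the actual lindera-cli definitions:

/// Hypothetical output formats for the CLI.
enum OutputFormat {
    Mecab,
    Wakati,
    Json,
}

/// Hypothetical tokenization modes (the lindera crate exposes a similar
/// Mode, as used in the examples above).
enum TokenizeMode {
    Normal,
    Decompose,
}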
