lindera's Introduction

Lindera

License: MIT · Chat: https://gitter.im/lindera-morphology/lindera · Crates.io: https://crates.io/crates/lindera

A morphological analysis library in Rust. This project is a fork of kuromoji-rs.

Lindera aims to build a library which is easy to install and provides concise APIs for various Rust applications.

The following is required to build:

  • Rust >= 1.46.0

Tokenization examples

Basic tokenization

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "0.31.0", features = ["ipadic"] }

This example covers the basic usage of Lindera.

It will:

  • Create a tokenizer in normal mode
  • Tokenize the input text
  • Output the tokens

use lindera::{
    DictionaryConfig, DictionaryKind, LinderaResult, Mode, Tokenizer, TokenizerConfig,
};

fn main() -> LinderaResult<()> {
    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    let config = TokenizerConfig {
        dictionary,
        user_dictionary: None,
        mode: Mode::Normal,
    };

    // create tokenizer
    let tokenizer = Tokenizer::from_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --features=ipadic --example=ipadic_basic_example

You can see the result as follows:

関西国際空港
限定
トートバッグ
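
Each token also carries byte offsets (and dictionary details, as shown in the analysis example later in this README). A minimal sketch extending the output loop above, assuming the byte_start and byte_end fields used in that example; field names may differ between Lindera versions:

for token in tokens {
    println!("{}\t{}..{}", token.text, token.byte_start, token.byte_end);
}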

Tokenization with user dictionary

You can provide user dictionary entries along with the default system dictionary. The user dictionary should be a CSV file in the following format:

<surface>,<part_of_speech>,<reading>

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "0.31.0", features = ["ipadic"] }

For example:

% cat ./resources/ipadic_simple_userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ

With a user dictionary, the Tokenizer is created as follows:

use std::path::PathBuf;

use lindera::{
    DictionaryConfig, DictionaryKind, LinderaResult, Mode, Tokenizer, TokenizerConfig,
    UserDictionaryConfig,
};

fn main() -> LinderaResult<()> {
    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    let user_dictionary = Some(UserDictionaryConfig {
        kind: DictionaryKind::IPADIC,
        path: PathBuf::from("./resources/ipadic_simple_userdic.csv"),
    });

    let config = TokenizerConfig {
        dictionary,
        user_dictionary,
        mode: Mode::Normal,
    };

    let tokenizer = Tokenizer::from_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run with cargo run --example:

% cargo run --features=ipadic --example=ipadic_userdic_example
東京スカイツリー
の
最寄り駅
は
とうきょうスカイツリー駅
です

Analysis examples

Basic analysis

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "0.31.0", features = ["ipadic", "filter"] }

This example covers the basic usage of the Lindera analysis framework.

It will:

  • Apply character filter for Unicode normalization (NFKC)
  • Tokenize the input text with IPADIC
  • Apply token filters (Japanese compound word filter, Japanese number filter, and Japanese stop-tags filter)

use std::collections::HashSet;

use lindera::{
    Analyzer, BoxCharacterFilter, BoxTokenFilter, DictionaryConfig, DictionaryKind,
    JapaneseCompoundWordTokenFilter, JapaneseCompoundWordTokenFilterConfig,
    JapaneseIterationMarkCharacterFilter, JapaneseIterationMarkCharacterFilterConfig,
    JapaneseNumberTokenFilter, JapaneseNumberTokenFilterConfig,
    JapaneseStopTagsTokenFilter, JapaneseStopTagsTokenFilterConfig, LinderaResult, Mode,
    Tokenizer, TokenizerConfig, UnicodeNormalizeCharacterFilter,
    UnicodeNormalizeCharacterFilterConfig, UnicodeNormalizeKind,
};

fn main() -> LinderaResult<()> {
    let mut character_filters: Vec<BoxCharacterFilter> = Vec::new();

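    // Character filter: Unicode normalization (NFKC)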
    let unicode_normalize_character_filter_config =
            UnicodeNormalizeCharacterFilterConfig::new(UnicodeNormalizeKind::NFKC);
    let unicode_normalize_character_filter =
        UnicodeNormalizeCharacterFilter::new(unicode_normalize_character_filter_config);
    character_filters.push(BoxCharacterFilter::from(unicode_normalize_character_filter));

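    // Character filter: normalize Japanese iteration marks (e.g. 々, ゝ)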
    let japanese_iteration_mark_character_filter_config =
        JapaneseIterationMarkCharacterFilterConfig::new(true, true);
    let japanese_iteration_mark_character_filter = JapaneseIterationMarkCharacterFilter::new(
        japanese_iteration_mark_character_filter_config,
    );
    character_filters.push(BoxCharacterFilter::from(
        japanese_iteration_mark_character_filter,
    ));

    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    let config = TokenizerConfig {
        dictionary,
        user_dictionary: None,
        mode: Mode::Normal,
    };

    let tokenizer = Tokenizer::from_config(config)?;

    let mut token_filters: Vec<BoxTokenFilter> = Vec::new();

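    // Token filter: concatenate consecutive number tokens (名詞,数) into a single compound token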
    let japanese_compound_word_token_filter_config =
        JapaneseCompoundWordTokenFilterConfig::new(
            DictionaryKind::IPADIC,
            HashSet::from_iter(vec!["名詞,数".to_string()]),
            Some("名詞,数".to_string()),
        )?;
    let japanese_compound_word_token_filter =
        JapaneseCompoundWordTokenFilter::new(japanese_compound_word_token_filter_config);
    token_filters.push(BoxTokenFilter::from(japanese_compound_word_token_filter));

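    // Token filter: normalize Japanese number tokens tagged 名詞,数 (e.g. kanji numerals to Arabic digits)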
    let japanese_number_token_filter_config =
        JapaneseNumberTokenFilterConfig::new(Some(HashSet::from_iter(vec![
            "名詞,数".to_string()
        ])));
    let japanese_number_token_filter =
        JapaneseNumberTokenFilter::new(japanese_number_token_filter_config);
    token_filters.push(BoxTokenFilter::from(japanese_number_token_filter));

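    // Token filter: drop tokens whose part-of-speech tags match the stop-tag set below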
    let japanese_stop_tags_token_filter_config =
        JapaneseStopTagsTokenFilterConfig::new(HashSet::from_iter(vec![
            "接続詞".to_string(),
            "助詞".to_string(),
            "助詞,格助詞".to_string(),
            "助詞,格助詞,一般".to_string(),
            "助詞,格助詞,引用".to_string(),
            "助詞,格助詞,連語".to_string(),
            "助詞,係助詞".to_string(),
            "助詞,副助詞".to_string(),
            "助詞,間投助詞".to_string(),
            "助詞,並立助詞".to_string(),
            "助詞,終助詞".to_string(),
            "助詞,副助詞/並立助詞/終助詞".to_string(),
            "助詞,連体化".to_string(),
            "助詞,副詞化".to_string(),
            "助詞,特殊".to_string(),
            "助動詞".to_string(),
            "記号".to_string(),
            "記号,一般".to_string(),
            "記号,読点".to_string(),
            "記号,句点".to_string(),
            "記号,空白".to_string(),
            "記号,括弧閉".to_string(),
            "その他,間投".to_string(),
            "フィラー".to_string(),
            "非言語音".to_string(),
        ]));
    let japanese_stop_tags_token_filter =
        JapaneseStopTagsTokenFilter::new(japanese_stop_tags_token_filter_config);
    token_filters.push(BoxTokenFilter::from(japanese_stop_tags_token_filter));

    let analyzer = Analyzer::new(character_filters, tokenizer, token_filters);

    let mut text =
        "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。".to_string();
    println!("text: {}", text);

    // analyze the text (character filters -> tokenizer -> token filters)
    let tokens = analyzer.analyze(&mut text)?;

    // output the tokens
    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.text, token.byte_start, token.byte_end, token.details
        );
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --features=ipadic,filter --example=analysis_example

You can see the result as follows:

text: Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。
token: Lindera, start: 0, end: 21, details: Some(["UNK"])
token: 形態素, start: 24, end: 33, details: Some(["名詞", "一般", "*", "*", "*", "*", "形態素", "ケイタイソ", "ケイタイソ"])
token: 解析, start: 33, end: 39, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "解析", "カイセキ", "カイセキ"])
token: エンジン, start: 39, end: 54, details: Some(["名詞", "一般", "*", "*", "*", "*", "エンジン", "エンジン", "エンジン"])
token: ユーザ, start: 0, end: 26, details: Some(["名詞", "一般", "*", "*", "*", "*", "ユーザー", "ユーザー", "ユーザー"])
token: 辞書, start: 26, end: 32, details: Some(["名詞", "一般", "*", "*", "*", "*", "辞書", "ジショ", "ジショ"])
token: 利用, start: 35, end: 41, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "利用", "リヨウ", "リヨー"])
token: 可能, start: 41, end: 47, details: Some(["名詞", "形容動詞語幹", "*", "*", "*", "*", "可能", "カノウ", "カノー"])

API reference

The API reference is available. Please see the following URL:

https://docs.rs/lindera

lindera's Issues

Change the project name again

The name Mokuzu has a similar pronunciation to mozc, so I want to avoid confusion.
Since this project is a fork of kuromoji-rs, change the name to be derived from kuromoji.

Automate release tasks

Update the workflows:

  • regression.yml: run tests on three platforms (Linux/Windows/macOS) on each push and pull request.
  • periodic.yml: run tests on the stable/beta/nightly versions of Rust periodically.
  • release.yml: when a tag is created, create a GitHub release and publish to crates.io.

Build error in benches

   Compiling lindera v0.5.1 (/Users/johtani/IdeaProjects/rust-workspace/lindera-workspace/lindera/lindera)
error[E0599]: no function or associated item named `default_normal` found for struct `lindera::tokenizer::Tokenizer` in the current scope
 --> lindera/benches/bench.rs:8:40
  |
8 |         let mut tokenizer = Tokenizer::default_normal();
  |                                        ^^^^^^^^^^^^^^ function or associated item not found in `lindera::tokenizer::Tokenizer`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0599`.
error: could not compile `lindera`.

Lindera doesn’t build

Currently, we can’t import lindera in the latest version. It doesn’t build, and since the change has been pushed as a minor version, it probably broke every project relying on lindera.

...
    Checking lindera-decompress v0.13.5
    Checking bstr v0.2.17
    Checking lindera-core v0.13.5
    Checking csv v1.1.6
   Compiling character_converter v2.1.0
    Checking lindera-unidic-builder v0.13.5
    Checking lindera-ipadic-builder v0.13.5
    Checking lindera-dictionary v0.13.5
    Checking lindera-ko-dic-builder v0.13.5
    Checking lindera-cc-cedict-builder v0.13.5
   Compiling lindera-ipadic v0.13.5
    Checking lindera v0.13.5
error[E0599]: no variant or associated item named `DictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
  --> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:64:40
   |
64 |             _ => Err(LinderaErrorKind::DictionaryTypeError
   |                                        ^^^^^^^^^^^^^^^^^^^
   |                                        |
   |                                        variant or associated item not found in `lindera_core::error::LinderaErrorKind`
   |                                        help: there is a variant with a similar name: `DictionaryLoadError`

error[E0599]: no variant or associated item named `UserDictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
  --> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:84:40
   |
84 |             _ => Err(LinderaErrorKind::UserDictionaryTypeError
   |                                        ^^^^^^^^^^^^^^^^^^^^^^^ variant or associated item not found in `lindera_core::error::LinderaErrorKind`

For more information about this error, try `rustc --explain E0599`.
error: could not compile `lindera` due to 2 previous errors

You can use this repository to reproduce the issue: https://github.com/meilisearch/charabia at sha 82c9f3b

Downloading and decompressing dictionaries takes a lot of time

Hey @mosuka,

We were facing compilation slowdowns at Meilisearch recently and investigated; we found that lindera-ipadic was taking a lot of time, probably downloading the mecab-ipadic-2.7.0-20070801.tar.gz tarball from SourceForge.

If you want to look at the time it takes on our side, you can just execute the below command and open the generated HTML report.

rustup update
cargo +nightly build --timings

As you can see, the CPU is idle for a long time during the build.

Reconsider default LZMA dependency without any option to avoid it

Issue

PR #139, introduced in v0.9.0, makes LZMA (rust-lzma or lzma-rs) a mandatory dependency.
This forces all users to install the external library liblzma to be able to compile Lindera.

In comparison, v0.8.1 only requires adding lindera to the project's Cargo.toml.

Context

At Meilisearch we plan to use Lindera to tokenize Japanese text, but we don't want to ask our users to install external libraries manually; we want to keep Meilisearch easy to install and easy to use.

Potential solutions

  • reconsider #139
  • choose a compression library that doesn't require a manually installed system library (vendored or pure Rust)
  • provide a feature flag to choose the compression method (sketched below)
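
A hypothetical sketch of the feature-flag option in Cargo.toml terms; the feature and dependency names here are illustrative assumptions, not actual lindera features:

[features]
default = ["compress-lzma"]
compress-lzma = ["lzma-rs"]   # pure-Rust LZMA, no system liblzma required
compress-none = []            # skip dictionary compression entirely

[dependencies]
lzma-rs = { version = "0.2", optional = true }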

Thanks for maintaining Lindera 😊

Move lindera-cli to another repository

Move lindera-cli to another repository.
Currently, the lindera-cli package is managed in the lindera repository as a member of the workspace.
Keep the lindera repository for library crates only, and move binary crates like lindera-cli to a separate repository.

Unable to download UniDic from clrd.ninjal.ac.jp

error: failed to run custom build command for `lindera-unidic v0.13.5 (/home/minoru/github.com/lindera-morphology/lindera/lindera-unidic)`

Caused by:
  process didn't exit successfully: `/home/minoru/github.com/lindera-morphology/lindera/target/debug/build/lindera-unidic-0a9382db4954e5bf/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=Cargo.toml

  --- stderr
  Error: Transport(Transport { kind: ConnectionFailed, message: Some("tls connection init failed"), url: Some(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("clrd.ninjal.ac.jp")), port: None, path: "/unidic_archive/cwj/2.1.2/unidic-mecab-2.1.2_src.zip", query: None, fragment: None }), source: Some(Custom { kind: InvalidData, error: InvalidCertificateData("invalid peer certificate: UnknownIssuer") }) })

Create a user dictionary package

Separate the user dictionary functionality currently contained in the lindera-ipadic-builder package into its own package.

For example: lindera-user-dic-builder

Can't build lindera-ipadic on Raspberry Pi 4B

Lindera-ipadic is a requirement of the zola static website generator written in Rust.

During the zola build, it fails while building lindera-ipadic with this error:
memory allocation of 805306368 bytes failed
error: could not compile lindera-ipadic.

Environment: Raspberry Pi 4B, 4GB memory, debian.

I have tried to give it more contiguous memory by rebooting and trying again with a fresh system and no user apps running. Even then, the system apparently can't give it 800MB (!) of presumably contiguous memory. free -mh shows 2.7GB free, but not contiguous, I imagine.

Zola developers have asked me to report this to you. They do not think lindera-ipadic requires 800MB to build.

Thanks.

Support user dictionary

Currently, Lindera does not support user dictionaries. Rebuilding the system dictionary to register a new term in the morphological dictionary is too much of a burden for light users.
So we're going to support a simple user dictionary, as Kuromoji does.

Prepare a trait for implementing each dictionary builder

Functions are currently duplicated across the dictionary builder packages.
To ease maintenance, we will prepare traits and implement a dictionary builder structure for each dictionary builder in its own package; a sketch follows.
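
A minimal sketch of what such a trait could look like; the trait and method names are assumptions for illustration, not the actual lindera API:

use std::path::Path;

/// Hypothetical shared interface for the per-dictionary builder packages
/// (lindera-ipadic-builder, lindera-unidic-builder, and so on).
pub trait DictionaryBuilder {
    type Error;

    /// Build a system dictionary from the source files in `input_dir`,
    /// writing the compiled binary dictionary into `output_dir`.
    fn build_dictionary(&self, input_dir: &Path, output_dir: &Path) -> Result<(), Self::Error>;

    /// Build a user dictionary from a single CSV file.
    fn build_user_dictionary(&self, input_file: &Path, output_dir: &Path) -> Result<(), Self::Error>;
}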

Publishing on crates.io

Publish on crates.io.
However, cargo publish failed with the following error:

error: api errors (status 200 OK): max upload size is: 10485760

lindera-ipadic randomly fails during build

When compiling lindera, we frequently hit a build error:

 error: failed to run custom build command for `lindera-ipadic v0.10.0`

Caused by:
  process didn't exit successfully: `D:\a\milli\milli\target\release\build\lindera-ipadic-caf28ea0e76b9e29\build-script-build` (exit code: 1)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=Cargo.toml

  --- stderr
  Error: Custom { kind: UnexpectedEof, error: TarError { desc: "failed to iterate over archive", io: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" } } }

It seems to be related to the dictionaries.

Any idea what the reason could be? The Google Drive download, perhaps? 🤔

Add GitHub Actions Integration

Add GitHub Actions integration like mosuka/bayard#94, along with some refactoring as follows:

  • Add GitHub Actions Integration
  • Make the output format an enum
  • Make the tokenize mode an enum (both sketched below)
  • Optimize build script
  • Update Dockerfile
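
A minimal sketch of the two enum refactorings mentioned above; the variant names are assumptions for illustration, not the actual lindera-cli definitions:

/// Hypothetical output formats for the CLI.
enum OutputFormat {
    Mecab,
    Wakati,
    Json,
}

/// Hypothetical tokenization modes (the lindera crate exposes a similar
/// Mode, as used in the examples above).
enum TokenizeMode {
    Normal,
    Decompose,
}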
