daac-tools / vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
Home Page: https://docs.rs/vibrato
License: Apache License 2.0
First, see #31 (comment).
Here, I consider two options: changing the csv crate usage to take a BufRead argument, or using BufRead through all arguments.
Is your feature request related to a problem? Please describe.
Current models do not carry a version number. There is a risk that a model built with a different version will be loaded incorrectly.
Describe the solution you'd like
Embed a version number at the head of each model and verify it on loading.
Describe alternatives you've considered
NA
Additional context
NA
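The proposed check (embedding a version number at the head of the model and verifying it on load) could look like the existing magic-number validation. A minimal std-only sketch; the constant names and version value here are illustrative, not Vibrato's actual format:

```rust
use std::io::{Error, ErrorKind, Read, Result, Write};

// Hypothetical constants; Vibrato's real magic bytes and versioning may differ.
const MODEL_MAGIC: [u8; 7] = *b"vibrato";
const MODEL_VERSION: u32 = 1;

// Write the magic bytes followed by a little-endian version number.
fn write_header<W: Write>(mut wtr: W) -> Result<()> {
    wtr.write_all(&MODEL_MAGIC)?;
    wtr.write_all(&MODEL_VERSION.to_le_bytes())?;
    Ok(())
}

// Verify both the magic bytes and the embedded version before decoding.
fn verify_header<R: Read>(mut rdr: R) -> Result<()> {
    let mut magic = [0u8; MODEL_MAGIC.len()];
    rdr.read_exact(&mut magic)?;
    if magic != MODEL_MAGIC {
        return Err(Error::new(ErrorKind::InvalidData, "magic number mismatch"));
    }
    let mut ver = [0u8; 4];
    rdr.read_exact(&mut ver)?;
    if u32::from_le_bytes(ver) != MODEL_VERSION {
        return Err(Error::new(ErrorKind::InvalidData, "model version mismatch"));
    }
    Ok(())
}

fn main() -> Result<()> {
    let mut buf = Vec::new();
    write_header(&mut buf)?;
    assert!(verify_header(&buf[..]).is_ok());
    Ok(())
}
```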
Currently, the Worker struct is defined as follows:
struct Worker<'a> {
    ...
}
where 'a is a lifetime parameter of the Tokenizer. By this definition, the Worker can refer to the Tokenizer automatically for every tokenization.
This definition causes a problem when creating wrappers for other programming languages that use garbage collection (GC): the Tokenizer cannot be dropped while the Worker is alive, but there is no way to impose this constraint on a GC.
To solve this problem, we need to remove the lifetime parameter and give the Worker struct to the Tokenizer for every tokenization.
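The change could be sketched as follows. This is a simplified illustration, not Vibrato's actual types: the Worker owns only its mutable working state, and the caller hands it back to the Tokenizer on each call, so neither object borrows the other:

```rust
// Simplified sketch: instead of Worker<'a> holding &'a Tokenizer,
// the worker owns only its per-call state, and the tokenizer is
// passed the worker explicitly for every tokenization.
struct Tokenizer {} // dictionary, settings, ... (omitted)

struct Worker {
    tokens: Vec<String>, // illustrative working state
}

impl Tokenizer {
    fn new_worker(&self) -> Worker {
        Worker { tokens: Vec::new() }
    }

    // The worker borrows the tokenizer only for the duration of the call,
    // so a GC-managed wrapper can hold the two objects independently.
    fn tokenize(&self, worker: &mut Worker, text: &str) {
        worker.tokens = text.split_whitespace().map(|s| s.to_owned()).collect();
    }
}

fn main() {
    let tok = Tokenizer {};
    let mut worker = tok.new_worker();
    tok.tokenize(&mut worker, "hello world");
    assert_eq!(worker.tokens.len(), 2);
}
```

With this shape, dropping the Tokenizer while a Worker is alive is simply a compile-time borrow error at the call site rather than a dangling reference a GC cannot see.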
Is your feature request related to a problem? Please describe.
Loading a large dictionary such as UniDic 2023-02 is slow in comparison to mecab.
I've downloaded the latest unidic cwj 2023-02 at https://clrd.ninjal.ac.jp/unidic/download.html#unidic_bccwj and built my own compiled vibrato dictionary using
cargo run --release -p compile -- -l unidic-cwj-202302_full/lex.csv -m unidic-cwj-202302_full/matrix.def -u unidic-cwj-202302_full/unk.def -c unidic-cwj-202302_full/char.def -o system.dic.zst
and then I tried tokenizing the example sentence from the docs
> time echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i system.dic.zst
Running `target/release/tokenize -i system.dic.zst`
Loading the dictionary...
Ready to tokenize
本 名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と 助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー 名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街 名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町 名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,"3,0",*,*,5174035466035712,18823
へ 助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう 形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ 助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。 補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS
________________________________________________________
Executed in 13.96 secs fish external
usr time 13.09 secs 0.00 micros 13.09 secs
sys time 0.86 secs 0.00 micros 0.86 secs
but it takes around 14 seconds to load the dictionary.
In comparison, mecab is nearly instant:
> time echo "本とカレーの街神保町へようこそ。" | mecab --dicdir="unidic-cwj-202302_full"
本 名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と 助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー 名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街 名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町 名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,"3,0",*,*,5174035466035712,18823
へ 助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう 形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ 助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。 補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS
________________________________________________________
Executed in 28.32 millis fish external
usr time 0.00 millis 0.00 micros 0.00 millis
sys time 31.25 millis 0.00 micros 31.25 millis
I looked at the code, and it seems like all the time is spent deserializing bincode into the DictionaryInner struct, in particular when it runs the read_common function:
fn read_common<R>(mut rdr: R) -> Result<DictionaryInner>
where
    R: Read,
{
    let mut magic = [0; MODEL_MAGIC.len()];
    rdr.read_exact(&mut magic)?;
    if magic != MODEL_MAGIC {
        return Err(VibratoError::invalid_argument(
            "rdr",
            "The magic number of the input model mismatches.",
        ));
    }
    let config = common::bincode_config();
    let data = bincode::decode_from_std_read(&mut rdr, config)?;
    Ok(data)
}
The call let data = bincode::decode_from_std_read(&mut rdr, config)?; takes most of that time, so bincode deserialization appears to be the bottleneck.
How is mecab able to return results so quickly despite not loading everything into memory like vibrato does? mecab seems to use almost no memory, whereas vibrato takes about 1 GB of memory to cache everything before it can tokenize.
Describe the solution you'd like
Could we use a faster serde framework like rkyv? According to its benchmarks, it's a lot faster than bincode.
The rkyv docs say:
It’s similar to other zero-copy deserialization frameworks such as Cap’n Proto and FlatBuffers. However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot. Additionally, rkyv is designed to have little to no overhead, and in most cases will perform exactly the same as native types.
I'm not sure if there's any other way to speed it up. Could we somehow parallelize deserialization?
Describe alternatives you've considered
Apparently bincode is slow for structs that use Vec and byte slices, and the recommendation is to use serde_bytes
The feature fields in structs such as
pub struct UnkEntry {
    pub cate_id: u16,
    pub left_id: u16,
    pub right_id: u16,
    pub word_cost: i16,
    pub feature: String,
}
pub struct WordFeatures {
    features: Vec<String>,
}
are stored as strings; maybe they could be stored as Vec<u8> instead?
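As a std-only illustration of that idea (this is not Vibrato's implementation), the feature strings could be flattened into one byte buffer plus an offset table, which serializes as a single byte slice rather than many small strings:

```rust
// Illustrative sketch: flatten many feature strings into one Vec<u8>
// plus offsets. A single contiguous byte buffer deserializes much
// faster than many individually allocated Strings, which is the
// motivation behind serde_bytes-style storage.
struct PackedFeatures {
    bytes: Vec<u8>,
    offsets: Vec<u32>, // offsets[i]..offsets[i + 1] delimits feature i
}

impl PackedFeatures {
    fn from_strings(features: &[&str]) -> Self {
        let mut bytes = Vec::new();
        let mut offsets = vec![0u32];
        for f in features {
            bytes.extend_from_slice(f.as_bytes());
            offsets.push(bytes.len() as u32);
        }
        Self { bytes, offsets }
    }

    fn get(&self, i: usize) -> &str {
        let (s, e) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        std::str::from_utf8(&self.bytes[s..e]).unwrap()
    }
}

fn main() {
    let packed = PackedFeatures::from_strings(&["名詞,普通名詞", "助詞,格助詞"]);
    assert_eq!(packed.get(1), "助詞,格助詞");
}
```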
Additional context
I'm using vibrato version 0.5.1
And here are the compiled dictionary sizes
> du -sh system.dic.zst
291M system.dic.zst
> du -sh system.dic
988M system.dic
In v0.3.1, compiled dictionaries from neologd-ipadic have not been distributed because the licenses of Neologd and IPADIC conflict.
To solve this, we will employ NAIST-jdic instead of IPADIC.
In v0.3.1, compiled dictionaries from JumanDic have not been distributed because the lexicon file is in an unexpected CSV format.
More precisely, the compile command produces the following error message:
Error: InvalidFormat(InvalidFormatError { arg: "lex.csv", msg: "A csv row of lexicon must have five items at least, \"\\n\"" })
We need to modify the code to compile this file.
Is your feature request related to a problem? Please describe.
It's harder to support lower-end hardware (with limited memory), particularly with bigger dictionaries.
Describe the solution you'd like
I would like the option to use a memory map referring to an uncompressed dictionary, since storage is usually cheaper than memory. The application I have in mind does not need extreme performance, so the I/O penalty would be acceptable. If the dictionary gets processed by Vibrato into something else, it would also be nice to be able to serialize that to a file and memory-map it as well. The fst crate offers something like this: https://docs.rs/fst/latest/fst/#example-stream-to-a-file-and-memory-map-it-for-searching
Describe alternatives you've considered
None that I'm aware of.
Additional context
None
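Until such a feature exists, the idea can be sketched with std only: keep the dictionary on disk and read just the record you need, instead of caching the whole file in memory. This is not Vibrato's API; in practice a crate such as memmap2 would map the file and hand out slices without explicit seeks:

```rust
use std::fs::File;
use std::io::{Read, Result, Seek, SeekFrom, Write};

// Read one fixed-size record from a file on demand. This trades I/O
// for memory, which is the same trade-off a memory map makes (the OS
// pages data in lazily instead of the program seeking explicitly).
fn read_record(file: &mut File, record_size: u64, index: u64) -> Result<Vec<u8>> {
    let mut buf = vec![0u8; record_size as usize];
    file.seek(SeekFrom::Start(index * record_size))?;
    file.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> Result<()> {
    // Build a tiny stand-in "dictionary" of three 4-byte records.
    let path = std::env::temp_dir().join("vibrato_mmap_sketch.bin");
    {
        let mut f = File::create(&path)?;
        f.write_all(b"AAAABBBBCCCC")?;
    }
    let mut f = File::open(&path)?;
    let rec = read_record(&mut f, 4, 1)?;
    assert_eq!(rec, b"BBBB");
    std::fs::remove_file(&path)?;
    Ok(())
}
```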
Is your feature request related to a problem? Please describe.
I'd like to know the difference between Vibrato and Vaporetto regarding performance for general tokenization, as shown here:
https://github.com/daac-tools/vibrato#fast-tokenization
Describe the solution you'd like
Could you add Vaporetto to the following image?
https://github.com/daac-tools/vibrato#fast-tokenization
Describe alternatives you've considered
Nothing
Additional context
Nothing
In v0.3.1, user dictionaries can be given in a text CSV format. However, if the dictionary file is huge, the loading time can be a problem. This issue suggests adding a feature to compile user dictionaries into a binary format. This feature will help install Neologd entries.
See #43 (comment)
Currently, Vibrato provides scripts to download and compile external resources. However, those scripts are risky for users because they may download large amounts of unintended data.
Proposal: distribute precompiled dictionaries through Assets.
The main concern is then licensing: the KFTT corpus used in the current version follows CC BY-SA 3.0, which conflicts with the licenses of IPADIC and UniDic.
One solution would be to use public-domain texts such as those from Aozora Bunko.
When loading the compiled IPA dictionary built following https://github.com/daac-tools/vibrato#1-dictionary-preparation, the error below occurred.
malloc: can't allocate region
*** mach_vm_map(size=8776504127295307776, flags: 100) failed (error code=3)
malloc: *** set a breakpoint in malloc_error_break to debug
memory allocation of 8776504127295305000 bytes failed
Bug description
Text containing more than 65535 characters cannot be tokenized.
Steps to reproduce
(Please write every command, including git clone, so that anyone can reproduce the same situation.)
I ran the following code; the problem reproduces 100% of the time when text exceeds 65535 characters:
let mut worker = tokenizer.new_worker();
worker.reset_sentence(&text);
worker.tokenize();
The following message was output:
thread 'main' panicked at 'assertion failed: input.len() <= 0xFFFF', /Users/saitoukosuke/.cargo/registry/src/github.com-1ecc6299db9ec823/vibrato-0.4.0/src/dictionary/lexicon/map/trie.rs:53:9
Expected result
(Please describe the expected output clearly and concisely.)
I would like tokenization to work even for texts with more than 65535 characters.
Removing the check at https://github.com/daac-tools/vibrato/blob/main/vibrato/src/dictionary/lexicon/map/trie.rs#L53 would presumably make it work, but if there is a reason for the limit, the current specification is fine.
Your environment
In v0.3.1, the benchmark results in the README are outdated: Vibrato has been updated, other dictionaries have been distributed, and Lindera has been accelerated.
Currently, member functions of a struct are sometimes placed in different files. For example, Dictionary is placed in dictionary.rs, but Dictionary::from_readers is placed in dictionary/builder.rs. This makes finding functions difficult during development. However, simply moving the implementation of Dictionary::from_readers into dictionary.rs would enlarge the file.
Proposal: define a new Builder structure in dictionary/builder.rs so that the member functions of each struct always stay in the same file.
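The proposal resembles the standard builder pattern. A minimal sketch (the names and fields are illustrative, not Vibrato's actual API):

```rust
// Builder-pattern sketch: construction-time logic lives on the builder
// (conceptually in dictionary/builder.rs), while Dictionary keeps only
// its own methods (conceptually in dictionary.rs).
struct Dictionary {
    entries: Vec<String>,
}

struct DictionaryBuilder {
    entries: Vec<String>,
}

impl DictionaryBuilder {
    fn new() -> Self {
        Self { entries: Vec::new() }
    }

    // Construction-time state stays with the builder, in its own file.
    fn entry(mut self, surface: &str) -> Self {
        self.entries.push(surface.to_owned());
        self
    }

    fn build(self) -> Dictionary {
        Dictionary { entries: self.entries }
    }
}

fn main() {
    let dict = DictionaryBuilder::new().entry("本").entry("街").build();
    assert_eq!(dict.entries.len(), 2);
}
```

With this split, every method of Dictionary lives in dictionary.rs and every method of the builder lives in dictionary/builder.rs, without either file growing past its concern.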
Is your feature request related to a problem? Please describe.
MeCab has the flag -N, which outputs the N best results:
mecab --help
...
-N, --nbest=INT output N best results (default 1)
However, I couldn't find in the docs or the source code how to do this with Vibrato.
Describe the solution you'd like
Allow support for providing the N best results
Describe alternatives you've considered
N/A
Additional context
Vibrato 0.5.1
Currently, Vibrato does not support texts longer than 65535 characters.
The limit is specified here:
Line 16 in de25020
This limit is long enough for a typical sentence, but it should be raised to a larger value, such as 2^32-1, to increase robustness in real-world use.
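The 65535 figure corresponds to storing character positions as u16; widening the position type to u32 is what raises the limit to 2^32-1. A small sketch of the arithmetic (the function names are illustrative):

```rust
// Sketch of where the 65535 limit comes from: if character positions
// are stored as u16, the maximum representable text length is u16::MAX.
// Widening the position type to u32 raises the limit to 2^32 - 1.
fn max_len_u16() -> usize {
    u16::MAX as usize // 65535
}

fn max_len_u32() -> usize {
    u32::MAX as usize // 4294967295 on 64-bit targets
}

fn fits(len: usize, max: usize) -> bool {
    len <= max
}

fn main() {
    assert!(!fits(70_000, max_len_u16())); // triggers the current assertion
    assert!(fits(70_000, max_len_u32())); // fine once positions are u32
}
```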
This issue proposes maintaining the command line tools related to training in one workspace and putting the description of training under that workspace (like prepare), to keep an easy description in the top-level readme.
For many light users, the primary concern is how to get started with tokenization using Vibrato, so the top-level readme should keep an easy description. The description of 3. Training in Section Basic usage is for advanced users, because understanding it requires knowing the configuration of dictionary files in MeCab.
Also, maintaining multiple workspaces increases the cost of updating dependencies in Cargo.toml.
We keep the command line tools train, dictgen, evaluate, and split in one workspace (for example, named train) and write their usage in the readme under this workspace.