daac-tools / vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
Home Page: https://docs.rs/vibrato
License: Apache License 2.0
First, see #31 (comment).
Here, I consider two options: changing the csv crate usage to take a BufRead argument, or using BufRead through all arguments.
Is your feature request related to a problem? Please describe.
Current models do not carry a version number. There is a risk that a model built with a different version will be loaded incorrectly.
Describe the solution you'd like
Embed a version number at the head of each model and verify it on loading.
Describe alternatives you've considered
NA
Additional context
NA
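The proposed check (embedding a version number at the head of the model and verifying it on load) could look like the existing magic-number validation. A minimal std-only sketch; the constant names and version value here are illustrative, not Vibrato's actual format:

```rust
use std::io::{Error, ErrorKind, Read, Result, Write};

// Hypothetical constants; Vibrato's real magic bytes and versioning may differ.
const MODEL_MAGIC: [u8; 7] = *b"vibrato";
const MODEL_VERSION: u32 = 1;

// Write the magic bytes followed by a little-endian version number.
fn write_header<W: Write>(mut wtr: W) -> Result<()> {
    wtr.write_all(&MODEL_MAGIC)?;
    wtr.write_all(&MODEL_VERSION.to_le_bytes())?;
    Ok(())
}

// Verify both the magic bytes and the embedded version before decoding.
fn verify_header<R: Read>(mut rdr: R) -> Result<()> {
    let mut magic = [0u8; MODEL_MAGIC.len()];
    rdr.read_exact(&mut magic)?;
    if magic != MODEL_MAGIC {
        return Err(Error::new(ErrorKind::InvalidData, "magic number mismatch"));
    }
    let mut ver = [0u8; 4];
    rdr.read_exact(&mut ver)?;
    if u32::from_le_bytes(ver) != MODEL_VERSION {
        return Err(Error::new(ErrorKind::InvalidData, "model version mismatch"));
    }
    Ok(())
}

fn main() -> Result<()> {
    let mut buf = Vec::new();
    write_header(&mut buf)?;
    assert!(verify_header(&buf[..]).is_ok());
    Ok(())
}
```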
Currently, the Worker struct is defined as follows:
struct Worker<'a> {
    ...
}
where 'a is a lifetime parameter of the Tokenizer. By this definition, the Worker can refer to the Tokenizer automatically for every tokenization.
This definition causes a problem when creating wrappers for other programming languages that use garbage collection (GC): the Tokenizer cannot be dropped while the Worker is alive, but there is no way to impose this constraint on a GC.
To solve this problem, we need to remove the lifetime parameter and give the Worker struct to the Tokenizer for every tokenization.
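The change could be sketched as follows. This is a simplified illustration, not Vibrato's actual types: the Worker owns only its mutable working state, and the caller hands it back to the Tokenizer on each call, so neither object borrows the other:

```rust
// Simplified sketch: instead of Worker<'a> holding &'a Tokenizer,
// the worker owns only its per-call state, and the tokenizer is
// passed the worker explicitly for every tokenization.
struct Tokenizer {} // dictionary, settings, ... (omitted)

struct Worker {
    tokens: Vec<String>, // illustrative working state
}

impl Tokenizer {
    fn new_worker(&self) -> Worker {
        Worker { tokens: Vec::new() }
    }

    // The worker borrows the tokenizer only for the duration of the call,
    // so a GC-managed wrapper can hold the two objects independently.
    fn tokenize(&self, worker: &mut Worker, text: &str) {
        worker.tokens = text.split_whitespace().map(|s| s.to_owned()).collect();
    }
}

fn main() {
    let tok = Tokenizer {};
    let mut worker = tok.new_worker();
    tok.tokenize(&mut worker, "hello world");
    assert_eq!(worker.tokens.len(), 2);
}
```

With this shape, dropping the Tokenizer while a Worker is alive is simply a compile-time borrow error at the call site rather than a dangling reference a GC cannot see.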
Is your feature request related to a problem? Please describe.
Loading a large dictionary such as UniDic 2023-02 is slow in comparison to mecab.
I've downloaded the latest unidic cwj 2023-02 at https://clrd.ninjal.ac.jp/unidic/download.html#unidic_bccwj and built my own compiled vibrato dictionary using
cargo run --release -p compile -- -l unidic-cwj-202302_full/lex.csv -m unidic-cwj-202302_full/matrix.def -u unidic-cwj-202302_full/unk.def -c unidic-cwj-202302_full/char.def -o system.dic.zst
and then I tried tokenizing the example sentence from the docs
> time echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i system.dic.zst
Running `target/release/tokenize -i system.dic.zst`
Loading the dictionary...
Ready to tokenize
本 名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と 助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー 名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街 名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町 名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,"3,0",*,*,5174035466035712,18823
へ 助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう 形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ 助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。 補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS
________________________________________________________
Executed in 13.96 secs fish external
usr time 13.09 secs 0.00 micros 13.09 secs
sys time 0.86 secs 0.00 micros 0.86 secs
but it takes around 14 seconds to load the dictionary.
In comparison, mecab is nearly instant:
> time echo "本とカレーの街神保町へようこそ。" | mecab --dicdir="unidic-cwj-202302_full"
本 名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と 助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー 名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街 名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町 名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,"3,0",*,*,5174035466035712,18823
へ 助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう 形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ 助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。 補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS
________________________________________________________
Executed in 28.32 millis fish external
usr time 0.00 millis 0.00 micros 0.00 millis
sys time 31.25 millis 0.00 micros 31.25 millis
I looked at the code, and it seems like all the time is spent deserializing bincode into the DictionaryInner struct, in particular when it runs the read_common function:
fn read_common<R>(mut rdr: R) -> Result<DictionaryInner>
where
    R: Read,
{
    let mut magic = [0; MODEL_MAGIC.len()];
    rdr.read_exact(&mut magic)?;
    if magic != MODEL_MAGIC {
        return Err(VibratoError::invalid_argument(
            "rdr",
            "The magic number of the input model mismatches.",
        ));
    }
    let config = common::bincode_config();
    let data = bincode::decode_from_std_read(&mut rdr, config)?;
    Ok(data)
}
The call let data = bincode::decode_from_std_read(&mut rdr, config)?; takes most of that time, so bincode deserialization appears to be the bottleneck.
How is mecab able to return results so quickly despite not loading everything into memory like vibrato does? mecab seems to use almost no memory, whereas vibrato takes about 1 GB of memory to cache everything before it can tokenize.
Describe the solution you'd like
Could we use a faster serde framework like rkyv? According to its benchmarks, it's a lot faster than bincode.
The rkyv docs say:
It’s similar to other zero-copy deserialization frameworks such as Cap’n Proto and FlatBuffers. However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot. Additionally, rkyv is designed to have little to no overhead, and in most cases will perform exactly the same as native types.
I'm not sure if there's any other way to speed it up. Could we somehow parallelize deserialization?
Describe alternatives you've considered
Apparently bincode is slow for structs that use Vec and byte slices, and the recommendation is to use serde_bytes
The feature fields in structs such as
pub struct UnkEntry {
    pub cate_id: u16,
    pub left_id: u16,
    pub right_id: u16,
    pub word_cost: i16,
    pub feature: String,
}
pub struct WordFeatures {
    features: Vec<String>,
}
are stored as strings; maybe they could be stored as Vec<u8> instead?
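As a std-only illustration of that idea (this is not Vibrato's implementation), the feature strings could be flattened into one byte buffer plus an offset table, which serializes as a single byte slice rather than many small strings:

```rust
// Illustrative sketch: flatten many feature strings into one Vec<u8>
// plus offsets. A single contiguous byte buffer deserializes much
// faster than many individually allocated Strings, which is the
// motivation behind serde_bytes-style storage.
struct PackedFeatures {
    bytes: Vec<u8>,
    offsets: Vec<u32>, // offsets[i]..offsets[i + 1] delimits feature i
}

impl PackedFeatures {
    fn from_strings(features: &[&str]) -> Self {
        let mut bytes = Vec::new();
        let mut offsets = vec![0u32];
        for f in features {
            bytes.extend_from_slice(f.as_bytes());
            offsets.push(bytes.len() as u32);
        }
        Self { bytes, offsets }
    }

    fn get(&self, i: usize) -> &str {
        let (s, e) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        std::str::from_utf8(&self.bytes[s..e]).unwrap()
    }
}

fn main() {
    let packed = PackedFeatures::from_strings(&["名詞,普通名詞", "助詞,格助詞"]);
    assert_eq!(packed.get(1), "助詞,格助詞");
}
```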
Additional context
I'm using vibrato version 0.5.1
And here are the compiled dictionary sizes
> du -sh system.dic.zst
291M system.dic.zst
> du -sh system.dic
988M system.dic
In v0.3.1, compiled dictionaries from neologd-ipadic have not been distributed because the licenses of Neologd and IPADIC conflict.
To solve this, we will employ NAIST-jdic instead of IPADIC.
In v0.3.1, compiled dictionaries from JumanDic have not been distributed because the lexicon file is in an unexpected CSV format.
More precisely, the compile command produces the following error message:
Error: InvalidFormat(InvalidFormatError { arg: "lex.csv", msg: "A csv row of lexicon must have five items at least, \"\\n\"" })
We need to modify the code to compile this file.
Is your feature request related to a problem? Please describe.
It's harder to support lower-end hardware (with limited memory), particularly with bigger dictionaries.
Describe the solution you'd like
I would like the option to use a memory map referring to an uncompressed dictionary, since storage is usually cheaper than memory. The application I have in mind does not need extreme performance, so the I/O penalty would be acceptable. If the dictionary gets processed by Vibrato into something else, it would also be nice to be able to serialize that to a file and memory-map it as well. The fst crate offers something like this: https://docs.rs/fst/latest/fst/#example-stream-to-a-file-and-memory-map-it-for-searching
Describe alternatives you've considered
None that I'm aware of.
Additional context
None
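Until such a feature exists, the idea can be sketched with std only: keep the dictionary on disk and read just the record you need, instead of caching the whole file in memory. This is not Vibrato's API; in practice a crate such as memmap2 would map the file and hand out slices without explicit seeks:

```rust
use std::fs::File;
use std::io::{Read, Result, Seek, SeekFrom, Write};

// Read one fixed-size record from a file on demand. This trades I/O
// for memory, which is the same trade-off a memory map makes (the OS
// pages data in lazily instead of the program seeking explicitly).
fn read_record(file: &mut File, record_size: u64, index: u64) -> Result<Vec<u8>> {
    let mut buf = vec![0u8; record_size as usize];
    file.seek(SeekFrom::Start(index * record_size))?;
    file.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> Result<()> {
    // Build a tiny stand-in "dictionary" of three 4-byte records.
    let path = std::env::temp_dir().join("vibrato_mmap_sketch.bin");
    {
        let mut f = File::create(&path)?;
        f.write_all(b"AAAABBBBCCCC")?;
    }
    let mut f = File::open(&path)?;
    let rec = read_record(&mut f, 4, 1)?;
    assert_eq!(rec, b"BBBB");
    std::fs::remove_file(&path)?;
    Ok(())
}
```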
Is your feature request related to a problem? Please describe.
I'd like to know the difference between Vibrato and Vaporetto regarding performance for general tokenization, as shown here:
https://github.com/daac-tools/vibrato#fast-tokenization
Describe the solution you'd like
Could you add Vaporetto to the following image?
https://github.com/daac-tools/vibrato#fast-tokenization
Describe alternatives you've considered
Nothing
Additional context
Nothing
In v0.3.1, user dictionaries can be given in a text CSV format. However, if the dictionary file is huge, the loading time can be a problem. This issue suggests adding a feature to compile user dictionaries into a binary format. This feature will help install Neologd entries.
See #43 (comment)
Currently, Vibrato provides scripts to download and compile external resources. However, those scripts are risky for users because they may download large amounts of unintended data.
Proposal: distribute precompiled dictionaries through Assets.
The main concern is then licensing: the KFTT corpus used in the current version follows CC BY-SA 3.0, which conflicts with the licenses of IPADIC and UniDic.
One solution would be to use public-domain texts such as those from Aozora Bunko.
When loading the compiled IPA dictionary built following https://github.com/daac-tools/vibrato#1-dictionary-preparation, the error below occurred.
malloc: can't allocate region
*** mach_vm_map(size=8776504127295307776, flags: 100) failed (error code=3)
malloc: *** set a breakpoint in malloc_error_break to debug
memory allocation of 8776504127295305000 bytes failed
Bug description
Text containing more than 65535 characters cannot be tokenized.
Steps to reproduce
(Please write every command, including git clone, so that anyone can reproduce the same situation.)
I ran the following code; the problem reproduces 100% of the time when text exceeds 65535 characters:
let mut worker = tokenizer.new_worker();
worker.reset_sentence(&text);
worker.tokenize();
The following message was output:
thread 'main' panicked at 'assertion failed: input.len() <= 0xFFFF', /Users/saitoukosuke/.cargo/registry/src/github.com-1ecc6299db9ec823/vibrato-0.4.0/src/dictionary/lexicon/map/trie.rs:53:9
Expected result
(Please describe the expected output clearly and concisely.)
I would like tokenization to work even for texts with more than 65535 characters.
Removing the check at https://github.com/daac-tools/vibrato/blob/main/vibrato/src/dictionary/lexicon/map/trie.rs#L53 would presumably make it work, but if there is a reason for the limit, the current specification is fine.
Your environment
In v0.3.1, the benchmark results in the README are outdated: Vibrato has been updated, other dictionaries have been distributed, and Lindera has been accelerated.
Currently, member functions of a struct are sometimes placed in different files. For example, Dictionary is placed in dictionary.rs, but Dictionary::from_readers is placed in dictionary/builder.rs. This makes finding functions difficult during development. However, simply moving the implementation of Dictionary::from_readers into dictionary.rs would enlarge the file.
Proposal: define a new Builder structure in dictionary/builder.rs so that the member functions of each struct always stay in the same file.
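The proposal resembles the standard builder pattern. A minimal sketch (the names and fields are illustrative, not Vibrato's actual API):

```rust
// Builder-pattern sketch: construction-time logic lives on the builder
// (conceptually in dictionary/builder.rs), while Dictionary keeps only
// its own methods (conceptually in dictionary.rs).
struct Dictionary {
    entries: Vec<String>,
}

struct DictionaryBuilder {
    entries: Vec<String>,
}

impl DictionaryBuilder {
    fn new() -> Self {
        Self { entries: Vec::new() }
    }

    // Construction-time state stays with the builder, in its own file.
    fn entry(mut self, surface: &str) -> Self {
        self.entries.push(surface.to_owned());
        self
    }

    fn build(self) -> Dictionary {
        Dictionary { entries: self.entries }
    }
}

fn main() {
    let dict = DictionaryBuilder::new().entry("本").entry("街").build();
    assert_eq!(dict.entries.len(), 2);
}
```

With this split, every method of Dictionary lives in dictionary.rs and every method of the builder lives in dictionary/builder.rs, without either file growing past its concern.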
Is your feature request related to a problem? Please describe.
MeCab has the flag -N, which outputs the N best results:
mecab --help
...
-N, --nbest=INT output N best results (default 1)
However, I couldn't find in the docs or the source code how to do this with Vibrato.
Describe the solution you'd like
Allow support for providing the N best results
Describe alternatives you've considered
N/A
Additional context
Vibrato 0.5.1
Currently, Vibrato does not support texts longer than 65535 characters.
The limit is specified here:
Line 16 in de25020
This limit is long enough for a typical sentence, but it should be raised to a larger value, such as 2^32-1, to increase robustness in real-world use.
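The 65535 figure corresponds to storing character positions as u16; widening the position type to u32 is what raises the limit to 2^32-1. A small sketch of the arithmetic (the function names are illustrative):

```rust
// Sketch of where the 65535 limit comes from: if character positions
// are stored as u16, the maximum representable text length is u16::MAX.
// Widening the position type to u32 raises the limit to 2^32 - 1.
fn max_len_u16() -> usize {
    u16::MAX as usize // 65535
}

fn max_len_u32() -> usize {
    u32::MAX as usize // 4294967295 on 64-bit targets
}

fn fits(len: usize, max: usize) -> bool {
    len <= max
}

fn main() {
    assert!(!fits(70_000, max_len_u16())); // triggers the current assertion
    assert!(fits(70_000, max_len_u32())); // fine once positions are u32
}
```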
This issue proposes maintaining the command line tools related to training in one workspace and putting the description of training under that workspace (like prepare), to keep an easy description in the top-level readme.
For many light users, the primary concern is how to get started with tokenization using Vibrato, so the top-level readme should keep an easy description. The description of 3. Training in Section Basic usage is for advanced users, because understanding it requires knowing the configuration of dictionary files in MeCab.
Also, maintaining multiple workspaces increases the cost of updating dependencies in Cargo.toml.
We keep the command line tools train, dictgen, evaluate, and split in one workspace (for example, named train) and write their usage in the readme under this workspace.