charabia's Introduction

Charabia

Library used by Meilisearch to tokenize queries and documents

Role

The tokenizer’s role is to take a sentence or phrase and split it into smaller units of language, called tokens. It finds and retrieves all the words in a string based on the language’s particularities.

Details

Charabia provides a simple API to segment, normalize, or tokenize (segment + normalize) a text of a specific language by detecting its Script/Language and choosing the specialized pipeline for it.

Supported languages

Charabia is multilingual, featuring optimized support for:

| Script / Language | Specialized segmentation | Specialized normalization | Segmentation performance level | Tokenization performance level |
| --- | --- | --- | --- | --- |
| Latin | ✅ CamelCase segmentation | compatibility decomposition + lowercase + nonspacing-marks removal + Ð vs Đ spoofing normalization | 🟩 ~23 MiB/sec | 🟨 ~9 MiB/sec |
| Greek | | compatibility decomposition + lowercase + final sigma normalization | 🟩 ~27 MiB/sec | 🟨 ~8 MiB/sec |
| Cyrillic - Georgian | | compatibility decomposition + lowercase | 🟩 ~27 MiB/sec | 🟨 ~9 MiB/sec |
| Chinese CMN 🇨🇳 | jieba | compatibility decomposition + kvariant conversion | 🟨 ~10 MiB/sec | 🟧 ~5 MiB/sec |
| Hebrew 🇮🇱 | | compatibility decomposition + nonspacing-marks removal | 🟩 ~33 MiB/sec | 🟨 ~11 MiB/sec |
| Arabic | ال segmentation | compatibility decomposition + nonspacing-marks removal + [Tatweel, Alef, Yeh, and Taa Marbuta normalization] | 🟩 ~36 MiB/sec | 🟨 ~11 MiB/sec |
| Japanese 🇯🇵 | lindera IPA-dict | compatibility decomposition | 🟧 ~3 MiB/sec | 🟧 ~3 MiB/sec |
| Korean 🇰🇷 | lindera KO-dict | compatibility decomposition | 🟥 ~2 MiB/sec | 🟥 ~2 MiB/sec |
| Thai 🇹🇭 | dictionary based | compatibility decomposition + nonspacing-marks removal | 🟩 ~22 MiB/sec | 🟨 ~11 MiB/sec |
| Khmer 🇰🇭 | ✅ dictionary based | compatibility decomposition | 🟧 ~7 MiB/sec | 🟧 ~5 MiB/sec |

We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our GitHub repository.

If you have a particular need that charabia does not support, please share it in the product repository by creating a dedicated discussion.

About Performance level

Performance levels are based on the throughput (MiB/sec) of the tokenizer, computed on a Scaleway Elastic Metal server EM-A410X-SSD (CPU: Intel Xeon E5 1650, RAM: 64 GB) using jemalloc:

  • 0️⃣⬛️: 0 -> 1 MiB/sec
  • 1️⃣🟥: 1 -> 3 MiB/sec
  • 2️⃣🟧: 3 -> 8 MiB/sec
  • 3️⃣🟨: 8 -> 20 MiB/sec
  • 4️⃣🟩: 20 -> 50 MiB/sec
  • 5️⃣🟪: 50 MiB/sec or more

Examples

Tokenization

use charabia::Tokenize;

let orig = "Thé quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// tokenize the text.
let mut tokens = orig.tokenize();

let token = tokens.next().unwrap();
// the lemma of the token is normalized: `Thé` became `the`.
assert_eq!(token.lemma(), "the");
// token is classified as a word
assert!(token.is_word());

let token = tokens.next().unwrap();
assert_eq!(token.lemma(), " ");
// token is classified as a separator
assert!(token.is_separator());
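
For reusable configuration, the same pipeline can also be driven through a TokenizerBuilder. The following is a minimal sketch assuming the builder's default settings (stop words and other options can be configured on the builder before building):

use charabia::TokenizerBuilder;

// Sketch: build a reusable Tokenizer instead of calling `.tokenize()` on the
// &str directly.
let mut builder = TokenizerBuilder::new();
let tokenizer = builder.build();

let mut tokens = tokenizer.tokenize("Thé quick fox");
assert_eq!(tokens.next().unwrap().lemma(), "the");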

Segmentation

use charabia::Segment;

let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// segment the text.
let mut segments = orig.segment_str();

assert_eq!(segments.next(), Some("The"));
assert_eq!(segments.next(), Some(" "));
assert_eq!(segments.next(), Some("quick"));

charabia's Issues

Handle words containing non-separating dots and commas in Latin tokenization

Summary

Handle S.O.S as one word (S.O.S) instead of three (S, O, S) or numbers like 3.5 as one word (3.5) instead of two (3, 5).

Explanation

The current tokenizer considers any . or , as a hard separator, meaning that the two separated words are not considered to be part of the same context.
But there are exceptions for some words, like numbers, that are separated by . or , yet should be considered as one and only one word.

We should modify the current Latin tokenizer to handle this case.
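
An illustrative rule (not charabia's implementation) could treat a dot as non-separating when it is surrounded by alphanumeric characters:

// Naive sketch: a '.' surrounded by alphanumeric characters is part of the word,
// so "3.5" and "S.O.S" stay whole, while a trailing '.' remains a separator.
fn is_separating_dot(prev: Option<char>, next: Option<char>) -> bool {
    !(matches!(prev, Some(p) if p.is_alphanumeric()) && matches!(next, Some(n) if n.is_alphanumeric()))
}

assert!(!is_separating_dot(Some('3'), Some('5'))); // "3.5" -> one word
assert!(is_separating_dot(Some('d'), Some(' ')));  // "end. Next" -> separator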

Handle non-breakable spaces

The tokenizer must handle non-breakable spaces.

For example, it should handle the following examples this way:

  • 3 456 678 where the non-breakable space is not considered as a separator
  • Альфа where ь is not considered as a space, so a separator

Related to meilisearch/meilisearch#1335 cf @shekhirin comment

Implement a Japanese specialized Normalizer

Today, there is no specialized normalizer for the Japanese Language.

drawback

Meilisearch is unable to find the hiragana version of a word with a katakana query; for instance, ダメ is also spelled 駄目 or だめ.

Technical approach

Create a new Japanese normalizer that unifies hiragana and katakana equivalences.

Interesting libraries

  • wana_kana seems promising to convert everything into Hiragana
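
As a minimal illustration of the kind of unification this normalizer could perform, the offset trick below maps the main Katakana block onto Hiragana using the fixed distance between the two Unicode blocks. This is a naive sketch, not a complete solution and not necessarily what wana_kana does:

// Naive sketch: map Katakana (U+30A1..=U+30F6) onto Hiragana (U+3041..=U+3096)
// by subtracting the fixed 0x60 offset; everything else is left untouched.
// A real normalizer would also handle edge cases like ヴ or the prolonged sound mark.
fn katakana_to_hiragana(c: char) -> char {
    match c {
        'ァ'..='ヶ' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
        _ => c,
    }
}

let unified: String = "ダメ".chars().map(katakana_to_hiragana).collect();
assert_eq!(unified, "だめ");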

Files expected to be modified

Misc

related to product#532

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Latin script: Segmenter should split camelCased words

Today, Meilisearch splits snake_case, SCREAMING_CASE, and kebab-case properly but doesn't split PascalCase or camelCase.

drawback

Meilisearch doesn't completely support code documentation.

enhancement

Make Latin Segmenter split camelCased/PascalCase words:

  • "camelCase" -> ["camel", "Case"]
  • "PascalCase" -> ["Pascal", "Case"]
  • "IJsland" -> ["IJsland"] (Language trap)
  • "CASE" -> ["CASE"] (another trap)

Files expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Handle multi languages in the same attribute

The Tokenizer currently uses the whatlang library, which detects the language of the attribute probabilistically.

The Tokenizer must be able to detect several languages in the same attribute.

Also, maybe it would be a better idea to let the user decide the language?

Arabic script: Implement specialized Segmenter

Currently, the Arabic Script is segmented on whitespaces and punctuation.

Drawback

Following the dedicated discussion on Arabic Language support and the linked issues, agglutinated words are not segmented; for example, in this comment:

the agglutinated word الشجرة => The Tree is a combination of الـ and شجرة
الـ is equivalent to The and it's always connected (not space separated) to the next word.

Enhancement

We should find a specialized segmenter for the Arabic Script, or else, a dictionary to implement our own segmenter inspired by the Thaï Segmenter.


Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Classify tokens after Segmentation instead of Normalization

Classifying tokens after Segmentation instead of after Normalization in the tokenization pipeline would enhance the precision of the stop_words classification.
Today, stop words need to be normalized to be properly classified; however, the normalization is more or less lossy and can classify unexpected stop words.
For instance, in French, maïs (corn in 🇬🇧) is normalized as mais (but in 🇬🇧), so maïs will be classified as a stop word if the stop word list contains mais.
This would not happen if the classifier were called before the normalizer.

Technical approach

Invert the normalization step and the classification step in the tokenization process
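
A self-contained illustration of why the ordering matters (this is not the pipeline code itself):

use std::collections::HashSet;

// Classifying on the raw segment avoids matching "maïs" against the stop word
// "mais"; the lossy normalization happens afterwards.
let stop_words: HashSet<&str> = ["mais"].into_iter().collect();
let segment = "maïs";
let is_stop_word = stop_words.contains(segment); // false: classified before normalization
let lemma: String = segment.chars().map(|c| if c == 'ï' { 'i' } else { c }).collect();
assert!(!is_stop_word);
assert_eq!(lemma, "mais");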

Add Actual Tokenizer state

Export the actual Meilisearch Tokenizer into this repository to start with a compatible version of it.
The goal is to iterate on this identical state and test it on Meilisearch incrementally instead of delivering a final version.

Refactor normalizers

Today, creating a normalizer is much harder than creating a segmenter, mainly because of the char map, a field required to manage highlights.

Technical Approach

Refactor the Normalizer trait to implement a normalize_str and a normalize_char method that take a Cow<str> as a parameter and return a Cow<str>. All the char map creation should be done in a function calling these two methods.
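
A sketch of the proposed shape (a loose interpretation of this issue, not the final API):

use std::borrow::Cow;

// The char map needed for highlighting would be built by a shared helper that
// calls these methods, so individual normalizers no longer have to build it.
trait Normalizer {
    // Normalize a whole lemma, borrowing it when nothing changes.
    fn normalize_str<'o>(&self, lemma: Cow<'o, str>) -> Cow<'o, str>;
    // Normalize a single character; `None` means the character is removed.
    fn normalize_char(&self, c: char) -> Option<char>;
}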

Make Latin Segmenter split on `'`

In French, some determiners and adverbs are fused with words that begin with a vowel using the ' character:

  • l'aventure
  • d'avantage
  • qu'il
  • ...

By default, the Latin segmenter doesn't split them.

Implement Pinyin normalizer

Today Meilisearch normalizes Chinese characters by converting traditional characters into simplified ones.

drawback

This normalization process doesn't seem to enhance the recall of Meilisearch.

enhancement

Following the official discussion about Chinese support in Meilisearch, it is more relevant to normalize Chinese characters by transliterating them into a Phonological version.
In order to have accurate phonology for Mandarin, we should normalize Chinese characters into Pinyin using the pinyin crate.
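
For illustration, assuming the pinyin crate's ToPinyin trait, the transliteration itself could look like the sketch below (not the final normalizer):

use pinyin::ToPinyin;

// Sketch: print the plain (toneless) Pinyin of each Han character;
// characters without a Pinyin reading are skipped here.
for p in "中文".to_pinyin() {
    if let Some(p) = p {
        print!("{} ", p.plain());
    }
}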

Files expected to be modified

Misc

related to product#503

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Tokenizer for Ja/Ko

Hello~
I'm currently testing the tokenizer with Japanese/Korean, but it seems it is not working correctly.

Is there some working plan for this?

Thanks.

Upgrade Whatlang dependency

Whatlang introduced new Languages and Scripts in the newer version.
We should upgrade our dependency to the latest version.

Decompose Japanese compound words

Summary

The morphological dictionary that Lindera includes by default is IPADIC.
IPADIC includes many compound words. For example, 関西国際空港 (Kansai International Airport).
However, if you index in the default mode, the word 関西国際空港 (Kansai International Airport) will be indexed as the single term 関西国際空港, and you will not be able to search for the keyword 空港 (Airport).
So, Lindera has a function to decompose such compound words.
This is a feature similar to Kuromoji's search mode.

`num_graphemes_from_bytes` does not work when used for a prefix of a raw Token

The Issue

The output of num_graphemes_from_bytes is wrong when:

  • num_bytes is smaller than the length of the string
  • the token does not have the char_map initialized - possibly since the Token was created outside of Tokenizer or because the unicode segmenter was not run.

It should return num_bytes back since each character is assumed to occupy one byte. Instead, it returns the length of the underlying string.

Context

This bug was introduced by me in #59 😆

See also: meilisearch/milli#426 (comment)

Publish tokenizer to crates.io

We should automate this push in a CI (triggered on each release for example)

  • Publish manually the first version
  • Add meili-bot as an Owner
  • Automate using CI

⚠️ Should be done by a core-engine team member

Enhance Chinese normalizer by unifying `Z`, `Simplified`, and `Semantic` variants

Following the official discussion about Chinese support in Meilisearch, it is relevant to normalize Chinese characters by unifying Z, Simplified, and Semantic variants before transliterating them into Pinyin.

To learn more about each variant, you can read the dedicated report on unicode.org.

There are several dictionaries listing variations that we can use; I suggest using the kvariants dictionary made by hfhchan (see the related documentation in the same repo).

Technical approach

Import and rework the dictionary into a key-value binding of each variant; then, in the Chinese normalizer, convert the provided character before transliterating it into Pinyin.
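
A sketch of the key-value binding described above; the two entries are illustrative variant pairs chosen for the example, not taken from the kvariants file:

use once_cell::sync::Lazy;
use std::collections::HashMap;

// Sketch: map each known variant to its unified form, then look characters up
// before the Pinyin transliteration step.
static KVARIANTS: Lazy<HashMap<char, char>> =
    Lazy::new(|| HashMap::from([('裡', '裏'), ('爲', '為')]));

fn unify_variant(c: char) -> char {
    *KVARIANTS.get(&c).unwrap_or(&c)
}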

Files expected to be modified

Misc

related to meilisearch/product#503

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

readme Hebrew segmentation link points to jieba

As the title says: in the README, clicking on "unicode-segmentation" in the Hebrew row takes the user to the jieba repo.

I assume the correct link would be the same as Latin's "unicode-segmentation."

Implement an efficient `Nonspacing Mark` Normalizer

In the Information Retrieval (IR) context, removing Nonspacing Marks like diacritics is a good way to increase recall without losing much precision, like in Latin, Arabic, or Hebrew.

Technical Approach

Implement a new Normalizer, named NonspacingMarkNormalizer, that removes the nonspacing marks from a provided token (find a naive implementation with the exhaustive list in the Misc section).
Because there are a lot of sparse character ranges to match, it would be inefficient to build a big if-forest to check whether a character is a nonspacing mark.
Therefore, I suggest trying several alternatives to the naive implementation in a small local project; a sketch of the range lookup follows the crate list below.

Interesting Rust Crates

  • hyperfine: a small command-line tool to benchmark several binaries
  • roaring-rs: a bitmap data structure that has an efficient contains method
  • once_cell: a good library to create lazy statics, already used in the repository
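
To make the idea concrete, here is a naive sketch of the range lookup; the two ranges below (combining diacritical marks and some Hebrew points) are examples only, not the exhaustive list mentioned above:

// Naive sketch: binary-search a sorted list of nonspacing-mark codepoint ranges
// instead of a long chain of `if`s.
const NONSPACING_MARK_RANGES: &[(u32, u32)] = &[(0x0300, 0x036F), (0x0591, 0x05BD)];

fn is_nonspacing_mark(c: char) -> bool {
    let c = c as u32;
    NONSPACING_MARK_RANGES
        .binary_search_by(|&(start, end)| {
            if c < start {
                std::cmp::Ordering::Greater
            } else if c > end {
                std::cmp::Ordering::Less
            } else {
                std::cmp::Ordering::Equal
            }
        })
        .is_ok()
}

fn remove_nonspacing_marks(token: &str) -> String {
    token.chars().filter(|c| !is_nonspacing_mark(*c)).collect()
}

assert_eq!(remove_nonspacing_marks("e\u{0301}"), "e");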

Misc

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Korean support

Hello. I’m going to submit a PR for Korean support, please review.

Implement Jyutping normalizer

Today Meilisearch normalizes Chinese characters by converting traditional characters into simplified ones.

drawback

This normalization process doesn't seem to enhance the recall of Meilisearch.

enhancement

Following the official discussion about Chinese support in Meilisearch, it is more relevant to normalize Chinese characters by transliterating them into a Phonological version.
In order to have accurate phonology for Cantonese, we should normalize Chinese characters into Jyutping using the kCantonese dictionary of the Unihan database.
We should find an efficient way to normalize characters, so the dictionary may need to be reformatted.

Files expected to be modified

Misc

related to product#503
original source of the dictionary: unihan.zip in https://unicode.org/Public/UNIDATA/

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Explain the name of the repo in the README

Following @CaroFG's idea, we could explain the name of the repo in the README since some people find it "offensive".

Here are some explanations Many gave on Twitter:

we chose the name of this repository in the same mood as discord or meili, giving the name of the problem we want to solve.
Personally, I don’t feel like it’s an offensive word, but more a funny pun with “char”.
Moreover, other tokenizers don’t always have an understandable name, for instance lindera maintained by @minoru_osuka or even jieba.
I hope my explanation was clear enough and I hope the name will not discourage you from using or even contributing to the project! 😊

Requirement or advice on Chinese word segmentation

Describe the requirement
I expect the Chinese input text to be split into all possible words.
For example:
[screenshot]

The behavior of the current version
[screenshot]

Optimization advice
I notice that you use the default jieba configuration, and this causes some highlighting or search errors in Chinese word segmentation. So, could you use the cut_all method from the jieba library for Chinese word segmentation?
[screenshot]

Additional text or screenshots
[screenshot]

I await your reply, thanks @ManyTheFish

Reimplement Japanese Segmenter

Reimplement Japanese segmenter using Lindera.

TODO list

  • Read CONTRIBUTING.md about Segmenter implementation
  • Lindera loads dictionaries at initialization
    • Ensure that Lindera is not initialized at each tokenization
    • Add a feature flag for Japanese
    • Use a custom config to initialize Lindera (better segmentation for search usage)

TokenizerConfig { mode: Mode::Decompose(Penalty::default()), ..TokenizerConfig::default() }

  • test segmenter

関西国際空港限定トートバッグ すもももももももものうち should give ["関西", "国際", "空港", "限定", "トートバッグ", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]

  • Add benchmarks
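
A rough sketch of how the segmenter could wire the config above into Lindera; the import paths and the with_config constructor are assumptions based on the Lindera API of that time, not the final charabia code:

use lindera::tokenizer::{Tokenizer, TokenizerConfig};
use lindera_core::viterbi::{Mode, Penalty};

// Sketch only: build Lindera once (not at each tokenization) with the
// decompose mode quoted above, then segment the test sentence.
let config = TokenizerConfig {
    mode: Mode::Decompose(Penalty::default()),
    ..TokenizerConfig::default()
};
let tokenizer = Tokenizer::with_config(config).expect("failed to load the dictionary");
let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ").expect("segmentation failed");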

Disable HMM feature of Jieba

Today, we are using the Hidden Markov Model algorithm (HMM) provided by the cut method of Jieba to segment unknown Chinese words in the Chinese segmenter.

drawback

Following the subdiscussion in the official discussion about Chinese support in Meilisearch, it seems that the HMM feature of Jieba is not relevant in the context of a search engine. This feature creates longer words and inconsistencies in the segmentation, which reduces the recall of Meilisearch without significantly raising the precision.

enhancement

Deactivate the HMM feature in Chinese segmentation.
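
For illustration, with the jieba-rs crate the HMM behaviour is toggled by the second argument of cut (a sketch, not charabia's segmenter code):

use jieba_rs::Jieba;

// Sketch: the second argument of `cut` enables HMM for unknown words;
// passing `false` disables it.
let jieba = Jieba::new();
let with_hmm = jieba.cut("他来到了网易杭研大厦", true);
let without_hmm = jieba.cut("他来到了网易杭研大厦", false);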

Files expected to be modified

Misc

related to product#503

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Add an allowlist to the tokenizer builder

Today, Charabia automatically detects the Language of the provided text and chooses the best tokenization pipeline accordingly.

drawback

Sometimes the detection is not accurate, mainly when the provided text is short, and the user can't manually specify the Languages contained in the provided text.

enhancement

Add a new setting to the TokenizerBuilder forcing the detection to choose from a subset of Languages, and, when there is no choice left, skip the detection and pick the specialized pipeline directly.
Whatlang, the library used to detect the Language, provides a way to set a subset of Languages that can be detected with the Detector::with_allowlist method.

Technical approach:

  1. add an optional allowlist parameter to the method detect of the Detect trait in detection/mod.rs
  2. add a segment_with_allowlist and a segment_str_with_allowlist with an additional allowlist parameter to the Segment trait in segmenter/mod.rs
  3. add an allowlist method to the TokenizerBuilder struct in tokenizer.rs

The allowlist should be a hashmap of Script -> [Languages]
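
A minimal sketch of the whatlang side; the charabia builder and segmenter methods are exactly what this issue proposes to add, so only the detector is shown:

use whatlang::{Detector, Lang};

// Sketch: restrict detection to a subset of Languages.
let detector = Detector::with_allowlist(vec![Lang::Eng, Lang::Fra]);
let lang = detector.detect_lang("Le rapide renard brun");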

Files expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Move the FST based Segmenter in a standalone file

For the Thaï segmenter, we tried a finite-state-transducer (FST) based segmenter.
This segmenter has really good performance and the dictionaries encoded as FSTs are smaller than raw txt/csv/tsv dictionaries.
For now, the segmenter is in the Thaï segmenter file (segmenter/thai.rs), and, in order to reuse it for other Languages, it would be better to move this segmenter to its own file.
A new struct FstSegmenter may be created wrapping all the iterative segmentation logic.
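
A sketch of what the standalone FstSegmenter could look like; the struct name comes from this issue, while the greedy longest-match logic below is a naive illustration, not necessarily the Thaï segmenter's exact algorithm:

use fst::Set;

// Sketch: wrap an FST dictionary and repeatedly take the longest prefix of the
// remaining text that is present in it, falling back to a single character.
struct FstSegmenter {
    words: Set<Vec<u8>>,
}

impl FstSegmenter {
    fn segment<'o>(&self, mut text: &'o str) -> Vec<&'o str> {
        let mut segments = Vec::new();
        while !text.is_empty() {
            // default to the first character in case no dictionary word matches
            let mut end = text.chars().next().map(|c| c.len_utf8()).unwrap_or(0);
            let boundaries = text.char_indices().skip(1).map(|(i, _)| i).chain(std::iter::once(text.len()));
            for i in boundaries {
                if self.words.contains(&text[..i]) {
                    end = i;
                }
            }
            segments.push(&text[..end]);
            text = &text[end..];
        }
        segments
    }
}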

File expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Implement a Compatibility Decomposition Normalizer

Meilisearch is unable to find Canonical and Compatibility equivalences; for instance, ガギグゲゴ can't be found with the query ガギグゲゴ (the two strings differ only in their Unicode decomposition).

Technical approach

Implement a new Normalizer CompatibilityDecompositionNormalizer using the method nfkd of the unicode-normalization crate.
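
For illustration, the normalization itself boils down to the nfkd iterator of the unicode-normalization crate (a minimal sketch, not the Normalizer implementation):

use unicode_normalization::UnicodeNormalization;

// Sketch: NFKD decomposes compatibility characters, e.g. composed or half-width
// kana become a base character followed by combining marks.
let decomposed: String = "ガギグゲゴ".nfkd().collect();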

Files expected to be modified

Misc

related to product#532

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

Compile/Install Charabia on OpenBSD

I am on OpenBSD running on a Raspberry Pi 4. I am unable to install Meilisearch because cargo is not able to find charabia, so I decided to compile from source.
I downloaded the source from GitHub and ran "cargo run" in the charabia source code. I get the error:

error: failed to parse manifest at `/home/kabira/LibOpenSource/charabia/Cargo.toml`

Caused by:
  namespaced features with the `dep:` prefix are only allowed on the nightly channel and requires the `-Z namespaced-features` flag on the command-line

Any workaround suggestions would be great.

Tokenizer refactoring strategy

Implementation Branch: tokenizer-v1.0.0
Draft PR: #77

Summary

As a fast search engine, Meilisearch needs a tokenizer that is a pragmatic balance between processing time and relevancy.
The current implementation of the tokenizer lacks clarity and contains ugly hotfixes, making contributions, optimizations, and maintenance difficult.

How to find a pragmatic balance between processing time and relevancy?

First of all, we are not linguists and we don't speak or understand most of the Languages that we would want to support; this means that we can't write a tokenizer from scratch and prove whether it is relevant or not.
That's why the current implementation, and the future ones, rely on segmentation libraries like jieba, unicode-segmentation, or lindera to segment texts into words; these libraries are recommended and contributed by external contributors.
But this has some limits, and the main one is processing time: some libraries, even if they have good relevancy, don't suit our needs because the processing time is too long (👋 Jieba).

Relevancy

Because we can't measure relevancy by ourselves, we want to continue to rely on the community and external libraries.
In this perspective, we need to make the inclusion of an external library by an external contributor as easy as possible:

Code shape

  • Refactor Pipeline by removing preprocessors and making normalizers global #76
  • Refactor Analyzer in order to make new Tokenizer registration straightforward #76
  • Simplify the return value of Tokenizer (returning a Script and a &str instead of a Token) #76
  • Wrap normalizers in an iterator allowing them to yield several items from one (["l'aventure"] -> ["l", "'", "aventure"]) #76
  • Add a search mode in Segmenter returning all the word derivations (tokenizers' search modes do ngrams internally)
  • Enhance clarity by renaming some structures, functions, and files (Segmenter instead of Tokenizer, chinese_cmn.rs instead of jieba.rs) #76
  • Create a test macro allowing contributors to easily test their tokenizer and improve the trust we have in tests by ensuring that all tokenizers are equally tested

Documentation and contribution processes

  • Add documenting comments in main structures (Token, Tokenizer trait..) #76
  • Add a template of a tokenizer as a dummy example of how to add a new tokenizer #76
  • Add a template of a normalizer as a dummy example of how to add a new normalizer
  • Add a CONTRIBUTING.md explaining how to test, bench, and implement tokenizers
  • Enhance README.md
  • Create an issue triage process differentiating each tokenizer scope (detector, segmenter, normalizer, classifier) #88

Minimal requirement to have no regressions

  • Use unicode-segmentation instead of legacy tokenizer for Latin tokenization #76
  • Reimplement Chinese Segmenter (using Jieba)
  • Reimplement Japanese Segmenter (using Lindera) #89
  • Reimplement Deunicode Normalizer only on Script::Latin
  • Reimplement traditional Chinese translation preprocessor into a Normalizer only on Language::Cmn
  • Reimplement control Character remover Normalizer

Processing time

Because tokenization has an impact on Meilisearch performance, we have to measure the processing time of every new implementation and define limits that must not be exceeded for a contribution to be merged. Sometimes, we should consider implementing things ourselves instead of relying on an external library that could significantly impact Meilisearch performance.

  • Refactor benchmarks to ease benchmark creation by any contributor
  • Define hard limits, like throughput thresholds, to objectively accept or refuse a contribution
  • Add workflows that run benchmarks on the main branch #91

Publish the Meilisearch tokenizer as a crate

In order to increase visibility and external contributions, we may publish this library as a crate.

  • #51
  • Add user documentation
  • #35

crates link: https://crates.io/crates/charabia

NLP

For now, we don't plan to use NLP to tokenize in Meilisearch.
