
Comments (8)

purohit commented on May 18, 2024

What needs to be done for this? I'm considering using bleve for my website for Chinese, outspokenlanguage.com. Right now I use redis-go-search but it's not good with CJK searching.

I can easily get a cleaned dataset of all simplified Chinese (and probably Traditional) unigrams and bigrams from Google's Ngram data. But I've never worked on word searching before, so you'd have to give me some guidance.


mschoch commented on May 18, 2024

Great, we need help in this area.

The purpose of the CJK bigram filter is to form bigrams from the input tokens, but only if the tokens are CJK characters. It outputs non-CJK tokens as is, and it can optionally output the CJK unigrams as well.

I believe the main issue we have right now is that the ICU tokenizer doesn't give us exactly the same information as Lucene, so I couldn't do a straight port. In Lucene, the tokenizer has already flagged the tokens with more detailed script information: specifically ideographic, hiragana, katakana, and hangul. The ICU tokenizer only gives us alphanumeric, kana, and ideographic. Worse, it seems to report hangul as alphanumeric, and kana as ideographic.

I think ultimately we can ignore that for now and proceed anyway. What we need to do is introduce token types for KANA and IDEOGRAPHIC and set those types based on what the tokenizer gives us. Then the CJK bigram filter can operate on those token types and pass others through unchanged. Despite the problems I noted above, it will probably work pretty well for Chinese and Japanese text.
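As a rough illustration of that plan (this is not bleve's actual API; Token, TokenType, and cjkBigramFilter are names made up for the sketch), a bigram filter over typed tokens could look something like the following: it joins runs of adjacent CJK-typed tokens into overlapping character bigrams and passes everything else through untouched.

package main

import "fmt"

// Hypothetical simplified token model for the sketch; not bleve's real types.
type TokenType int

const (
    AlphaNumeric TokenType = iota
    Ideographic
    Kana
)

type Token struct {
    Term string
    Type TokenType
}

// cjkBigramFilter joins runs of adjacent single-character CJK tokens into
// overlapping character bigrams and passes non-CJK tokens through unchanged.
// Optional unigram output (which the Lucene filter supports) is omitted.
func cjkBigramFilter(input []Token) []Token {
    var out []Token
    var buf []Token // pending CJK tokens waiting to be paired

    flush := func() {
        if len(buf) == 1 {
            out = append(out, buf[0]) // lone CJK character: emit as-is
        }
        for i := 0; i+1 < len(buf); i++ {
            out = append(out, Token{
                Term: buf[i].Term + buf[i+1].Term,
                Type: buf[i].Type,
            })
        }
        buf = buf[:0]
    }

    for _, tok := range input {
        if tok.Type == Ideographic || tok.Type == Kana {
            buf = append(buf, tok)
            continue
        }
        flush()
        out = append(out, tok)
    }
    flush()
    return out
}

func main() {
    in := []Token{
        {Term: "我", Type: Ideographic},
        {Term: "喜", Type: Ideographic},
        {Term: "歡", Type: Ideographic},
        {Term: "bleve", Type: AlphaNumeric},
    }
    for _, t := range cjkBigramFilter(in) {
        fmt.Printf("%q\n", t.Term) // "我喜", "喜歡", "bleve"
    }
}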

I don't think we need the Google ngram data, although it could be handy for more sophisticated analysis.


purohit commented on May 18, 2024

Ah. Do I understand you correctly: bigram meaning character bigrams, not word bigrams? (As in, 喜歡 is a character bigram, but a word unigram). The issue of word segmentation in Chinese is harder and would require the use of a corpus like Google's ngram data (http://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation).


mschoch commented on May 18, 2024

Yes. My understanding (possibly wrong) is that Lucene/Elasticsearch index character bigrams, specifically because they don't do word segmentation well.

And looking at this more closely, it's actually even more problematic for us, as the ICU tokenizer we use is dictionary-based. The ICU docs claim that dictionary tokenization is already done for Thai, Khmer, and CJK; see the section at the end, "Details about Dictionary-Based Break Iteration": http://userguide.icu-project.org/boundaryanalysis

Based on this, it sounds like the bigram filter may not even be needed for now. If we later find that the ICU tokenizer is not good enough, then we could use a different tokenizer that just gives us each character, and index bigrams.

Does this make sense?

I'm putting together an online tool to let us experiment with the analyzers; this should make it easier to input arbitrary text and see how it would be indexed.


mschoch commented on May 18, 2024

Also, thanks for that link; it has a lot of good background information.

One of the frustrating things is that so much of this information is stale. They talk about what Lucene and ICU do, but the article is from 2011. They talk about it only doing unigrams, with bigrams coming in the future. Has that future arrived? Hard to say. :)


mschoch commented on May 18, 2024

OK, having read that link in more detail, it seems to confirm what I thought. The default behavior in Lucene/Solr/ES is to tokenize into individual characters, then use a bigram filter to form overlapping character bigrams. These character bigrams are what gets indexed.

I was thinking we'd use the ICU tokenizer since it's dictionary-based, but my assumption that that would automatically be better doesn't seem to be supported by the docs.

Probably we should introduce a tokenizer that just emits the characters individually. Then use the cjk width and bigram filters. This will more closely approximate Lucene/Solr/ES in the short term.

Longer term we can experiment to see if the dictionary/word based tokenizers are any good yet.
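For reference, here is a minimal sketch of the kind of single-character tokenizer described above (this is not bleve code; the function name and exact behavior are illustrative only). It emits each Han, Hiragana, Katakana, or Hangul rune as its own token and groups other letters and digits into words, splitting on everything else.

package main

import (
    "fmt"
    "unicode"
)

// charTokenize emits every CJK rune as its own token and groups other
// letters/digits into words, splitting on whitespace, punctuation, etc.
func charTokenize(input string) []string {
    var tokens []string
    var word []rune

    flush := func() {
        if len(word) > 0 {
            tokens = append(tokens, string(word))
            word = word[:0]
        }
    }

    for _, r := range input {
        switch {
        case unicode.Is(unicode.Han, r),
            unicode.Is(unicode.Hiragana, r),
            unicode.Is(unicode.Katakana, r),
            unicode.Is(unicode.Hangul, r):
            flush()
            tokens = append(tokens, string(r)) // one token per CJK character
        case unicode.IsLetter(r) || unicode.IsDigit(r):
            word = append(word, r)
        default:
            flush()
        }
    }
    flush()
    return tokens
}

func main() {
    fmt.Println(charTokenize("bleve 我喜歡你"))
    // [bleve 我 喜 歡 你]
}

The single-character output would then feed the width and bigram filters, approximating the Lucene/Solr/ES pipeline.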


Shugyousha commented on May 18, 2024

On Sun, Sep 7, 2014 at 5:04 PM, Marty Schoch [email protected] wrote:

Probably we should introduce a tokenizer that just emits the characters individually. Then use the cjk width and bigram filters. This will more closely approximate Lucene/Solr/ES in the short term.

It seems like Solr/Lucene uses the term "n-gram" to mean "character
n-gram" and "shingles"[1] to mean "token n-grams" (which is what is
usually referred to as just "n-grams" outside of Solr/Lucene).

The Tokenizer interface in bleve is returning a TokenStream at the
moment. Would we then just refer to the individual characters as
tokens when implementing a Tokenizer for CJK in bleve?

I wonder if it wouldn't make more sense to return 'word' tokens
directly in CJK as well which would be more consistent with how the
other languages are handled AFAICT.

I am also not sure how Lucene/Solr handle derivations like

感激的

for example which seem like they would be mangled by a character bi-gram filter.

Longer term we can experiment to see if the dictionary/word based tokenizers are any good yet.

I would be interested in helping with the analysis of Japanese.

If we decide to not use character n-grams as output for the tokenizer,
we could try to compare the unicode_word_boundary tokenizer and
something like kagome[2]/mecab[3](which would give you POS tags as
well).

[1] https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html
[2] https://github.com/ikawaha/kagome (it seems to use the mecab dictionary)
[3] http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html


mschoch commented on May 18, 2024

The Tokenizer interface in bleve is returning a TokenStream at the
moment. Would we then just refer to the individual characters as
tokens when implementing a Tokenizer for CJK in bleve?

We should have several tokenizers that behave differently. Then, at a higher level, we will assemble analyzers that correctly pair tokenizers with other filters to deliver useful functionality. Even at this level there will probably be multiple options; users should have a choice, as there may not be one single best solution.

It seems to me, from what I've read so far, that the simplest thing that can work acceptably is to tokenize CJK characters individually and then apply a few other filters (such as the CJK bigram filter) before indexing. To start down this path, I'm updating the "whitespace" tokenizer to give us this behavior (previously it just ate the characters, which made no sense). This will give acceptable behavior for a large number of languages using a simple tokenizer.
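To make the "assemble analyzers from tokenizers and filters" idea concrete, here is a toy composition sketch. Bleve's real Tokenizer, TokenFilter, and Analyzer interfaces differ; the names and signatures below are simplified stand-ins, and a real CJK analyzer would plug in the single-character tokenizer and the width and bigram filters discussed above rather than the whitespace/lowercase placeholders used here.

package main

import (
    "fmt"
    "strings"
)

// Simplified stand-ins for illustration only; not bleve's actual interfaces.
type Tokenizer func(input string) []string
type TokenFilter func(input []string) []string

type Analyzer struct {
    Tokenizer Tokenizer
    Filters   []TokenFilter
}

// Analyze runs the tokenizer, then applies each filter in order.
func (a Analyzer) Analyze(input string) []string {
    tokens := a.Tokenizer(input)
    for _, f := range a.Filters {
        tokens = f(tokens)
    }
    return tokens
}

func main() {
    // Placeholder components: a whitespace tokenizer and a lowercase filter.
    whitespace := func(s string) []string { return strings.Fields(s) }
    lowercase := func(ts []string) []string {
        out := make([]string, len(ts))
        for i, t := range ts {
            out[i] = strings.ToLower(t)
        }
        return out
    }

    a := Analyzer{Tokenizer: whitespace, Filters: []TokenFilter{lowercase}}
    fmt.Println(a.Analyze("Bleve Search")) // [bleve search]
}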

Before we expose this as a "cjk" analyzer in the registry, we still need to implement the cjk bigram filter. I'll see if I can get around to this later today.

But this is still just part of the solution, as I'll explain below:

I wonder if it wouldn't make more sense to return 'word' tokens
directly in CJK as well which would be more consistent with how the
other languages are handled AFAICT.

Yes, I think we should have options here. For some users, character tokenization won't be what they want. We already have one tokenizer, named "unicode", which attempts to do word tokenization on CJK text. It uses a dictionary-based approach, but I have no idea how well it works. What would help me is someone who knows a CJK language well creating some example test cases and reporting back how well it works.

I am also not sure how Lucene/Solr handle derivations like

感激的

My understanding (I'll test this later today when working on the bigram filter) is that if you use the default CJK analyzer in Elasticsearch, it will tokenize on the characters and then form bigrams. I'm assuming 感激的 is a 3-character word, so on some level this is wrong: the bigrams would be 感激 and 激的, which split the word. But if you read the link @purohit shared above, researchers have found that in practice this isn't a huge problem. Oftentimes the bigrams of 3-character words are related root words (感激 is itself a word here), and this can increase search recall.

I would be interested in helping with the analysis of Japanese.

That's great. Perhaps the best next step would be to research how well the 'unicode' tokenizer does on some Japanese text. The Google group may be a good place to discuss your findings.

If we decide to not use character n-grams as output for the tokenizer,
we could try to compare the unicode_word_boundary tokenizer and
something like kagome[2]/mecab[3](which would give you POS tags as
well).

Yes, I think the key is that we should offer options. Proper CJK handling is complex enough that there will not be just one solution.

I have a separate issue open to track integration with kagome: #93

I'll take a look at the other links now.

