Requiring 3 characters to perform a search works poorly for logographic corpora about stork HOT 4 CLOSED

jameslittle230 commented on May 19, 2024

Requiring 3 characters to perform a search works poorly for logographic corpora

from stork.

Comments (4)

jameslittle230 commented on May 19, 2024

Hey @DenialAdams! Thanks for writing in, hope Stork is working well for you so far despite this!

I'll have to noodle on this issue for a bit. If a corpus gets too large, Stork has a lot of trouble searching for any query shorter than three characters: the index file itself gets pretty big and the search algorithm can get kind of slow.

I think (off the top of my head, not looking at code) that this would have to involve an index regeneration so Stork indexed one- and two-character-long substrings.

Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (I probably wouldn't be able to get to this for a few weeks, to set expectations.)
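To illustrate the trade-off being discussed, here is a minimal sketch of how a minimum substring length affects what ends up in an index. It assumes, purely for illustration, that the indexed substrings are word prefixes; `prefixes` and `min_len` are hypothetical names, not Stork's actual code or configuration.

```rust
/// Returns every prefix of `word` that is at least `min_len` characters long.
/// Hypothetical helper for illustration only; not part of Stork.
fn prefixes(word: &str, min_len: usize) -> Vec<String> {
    // Work on chars rather than bytes so multi-byte text is handled correctly.
    let chars: Vec<char> = word.chars().collect();
    (min_len.max(1)..=chars.len())
        .map(|n| chars[..n].iter().collect::<String>())
        .collect()
}

fn main() {
    // With a minimum of 3, a five-letter word yields three entries.
    println!("{:?}", prefixes("stork", 3));
    // Lowering the minimum to 1 adds two more entries for the same word.
    println!("{:?}", prefixes("stork", 1));
    // A one-character word yields nothing at all with a minimum of 3.
    println!("{:?}", prefixes("我", 3));
}
```

Every word gains extra entries when the minimum drops, which is where the index size and search time concerns come from.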

As an aside, I haven't tested this at all with Chinese-language text -- I'd love to hear about any other unexpected behavior that you encounter!

Thanks again,
James

DenialAdams commented on May 19, 2024

> I think (off the top of my head, not looking at code) that this would have to involve an index regeneration so Stork indexed one- and two-character-long substrings.

Interestingly (and I only thought to try this out this morning), if I use the CLI to search (bypassing the >= 3 requirement), I do get some results with a one-character search. Well... I get exactly one result. I suspect this is because that's the only time this character appears alone, i.e. with whitespace on either side of it. A cursory look at the code seems to confirm that:

stork/src/index_versions/v3/builder/word_list_generators/mod.rs

Lines 47 to 63 in db1b958

    impl WordListGenerator for PlainTextWordListGenerator {
        fn create_word_list(
            &self,
            _config: &InputConfig,
            buffer: &str,
        ) -> Result<Contents, WordListGenerationError> {
            Ok(Contents {
                word_list: buffer
                    .split_whitespace()
                    .map(|word| AnnotatedWord {
                        word: word.to_string(),
                        ..Default::default()
                    })
                    .collect(),
            })
        }
    }

Since Chinese sentences don't use whitespace to separate words, this also might be a little bit of an issue :) but I don't know exactly what the word list is used for; I'll keep learning by reading your code (which is very readable, nice job!)
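The whitespace behavior is easy to reproduce outside Stork. The sketch below uses only the standard library's `split_whitespace`; `whitespace_words` is a hypothetical stand-in that drops Stork's `AnnotatedWord` and `Contents` types, and the sample sentences are made up for illustration.

```rust
/// Splits a buffer on whitespace only, as the snippet above does,
/// but returns plain strings instead of Stork's types.
fn whitespace_words(buffer: &str) -> Vec<String> {
    buffer.split_whitespace().map(|w| w.to_string()).collect()
}

fn main() {
    // English: each space-separated word becomes its own entry.
    println!("{:?}", whitespace_words("the quick brown fox"));
    // Chinese: each whitespace-delimited run becomes a single "word",
    // no matter how many words it actually contains.
    println!("{:?}", whitespace_words("我喜欢吃苹果 因为很甜"));
}
```

A single character only shows up as its own entry when it happens to have whitespace on both sides, which matches the single CLI result described above.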

> Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (I probably wouldn't be able to get to this for a few weeks, to set expectations.)

Hmm. This might work well for a smaller corpus but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅

A thought here: could we examine 1 and 2 character substrings but only index them if the characters fall into the range specified by the CJK Unicode Blocks?
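A rough sketch of that idea follows. It is not the prototype linked later in the thread; `is_cjk` and `short_cjk_substrings` are hypothetical names, and the check covers only two blocks (CJK Unified Ideographs and Extension A), so a real implementation would likely need more ranges.

```rust
/// Rough CJK check covering only CJK Unified Ideographs (U+4E00..=U+9FFF)
/// and CJK Unified Ideographs Extension A (U+3400..=U+4DBF).
fn is_cjk(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}')
}

/// Collects one- and two-character substrings of `word`, keeping only
/// those made up entirely of CJK characters. Duplicates are not removed.
fn short_cjk_substrings(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut out = Vec::new();
    for (i, &c) in chars.iter().enumerate() {
        if !is_cjk(c) {
            continue;
        }
        // One-character substring.
        out.push(c.to_string());
        // Two-character substring, only if the next character is also CJK.
        if let Some(&next) = chars.get(i + 1) {
            if is_cjk(next) {
                out.push([c, next].iter().collect::<String>());
            }
        }
    }
    out
}

fn main() {
    println!("{:?}", short_cjk_substrings("苹果"));
    // Latin text produces no short substrings, so non-CJK corpora
    // would not grow under this scheme.
    println!("{:?}", short_cjk_substrings("stork"));
}
```

Restricting the short substrings to CJK characters keeps the extra index entries limited to text where one or two characters can be a complete word.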

> As an aside, I haven't tested this at all with Chinese-language text -- I'd love to hear about any other unexpected behavior that you encounter!

I'll definitely keep you posted :) I don't speak (or read) Chinese, but I'm building this for my roommate, so I'll forward along his thoughts.

DenialAdams commented on May 19, 2024

> Hmm. This might work well for a smaller corpus but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅

I did some profiling and was able to easily cut 3 hours down to 1 min (PR incoming), so this is much less of a concern for me now :)

(edit: #74)

DenialAdams commented on May 19, 2024

> A thought here: could we examine 1 and 2 character substrings but only index them if the characters fall into the range specified by the CJK Unicode Blocks?

Here's a prototype for this approach:
DenialAdams@fa2b24d

Let me know what you think of it. So far it's really improved the search results, though the searches do take a little longer now (worth it to me).
