
Comments (4)

jameslittle230 commented on May 19, 2024

Hey @DenialAdams! Thanks for writing in, hope Stork is working well for you so far despite this!

I'll have to noodle on this issue for a bit. If a corpus gets too large, Stork has a lot of trouble searching for any query shorter than three characters: the index file itself gets pretty big and the search algorithm can get kind of slow.

I think (off the top of my head, not looking at code) that this would have to involve an index regeneration so Stork indexed one- and two-character-long substrings.

Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (I probably wouldn't be able to get to this for a few weeks, to set expectations.)
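To make the proposed option concrete, here is a minimal sketch of how a minimum-length setting could change what gets indexed. It assumes "substrings" means word prefixes, and the `prefixes` helper and its `min_chars` parameter are hypothetical names chosen for illustration, not part of Stork's actual configuration or code.

```rust
/// Hypothetical helper: returns every prefix of `word` that is at least
/// `min_chars` characters long. Lengths are counted in Unicode scalar
/// values (chars), not bytes, so multi-byte text is handled correctly.
fn prefixes(word: &str, min_chars: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    // Treat a minimum of 0 as 1 so that no empty prefix is produced.
    let start = min_chars.max(1);
    (start..=chars.len())
        .map(|n| chars[..n].iter().collect())
        .collect()
}

fn main() {
    // With a minimum of three characters, short queries cannot match.
    println!("{:?}", prefixes("stork", 3)); // ["sto", "stor", "stork"]
    // Lowering the minimum to one adds two more entries for this word.
    println!("{:?}", prefixes("stork", 1)); // ["s", "st", "sto", "stor", "stork"]
}
```

Under this assumption every word of three or more characters gains two extra entries when the minimum drops from three to one, which is the index-size cost mentioned above.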

As an aside, I haven't tested this at all with Chinese-language text -- I'd love to hear about any other unexpected behavior that you encounter!

Thanks again,
James

from stork.

DenialAdams commented on May 19, 2024

> I think (off the top of my head, not looking at code) that this would have to involve an index regeneration so Stork indexed one- and two-character-long substrings.

Interestingly (and I only thought to try this out this morning), if I use the CLI to search (bypassing the >= 3 requirement) I do get some results with a one-character search. Well... I get exactly one result. I suspect this is because that's the only time this character appears alone, i.e. with whitespace on either side of it. A cursory look at the code seems to confirm that:

impl WordListGenerator for PlainTextWordListGenerator {
    fn create_word_list(
        &self,
        _config: &InputConfig,
        buffer: &str,
    ) -> Result<Contents, WordListGenerationError> {
        Ok(Contents {
            word_list: buffer
                .split_whitespace()
                .map(|word| AnnotatedWord {
                    word: word.to_string(),
                    ..Default::default()
                })
                .collect(),
        })
    }
}

Since Chinese sentences don't use whitespace to separate words, this also might be a little bit of an issue :) but I don't know exactly what the word list is used for; I'll keep learning by reading your code (which is very readable, nice job!)
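A small standalone sketch shows the effect. `whitespace_words` is a hypothetical stand-in that only mirrors the `split_whitespace()` call in the snippet above; it does not use any of Stork's types, and the sample sentences are made up for illustration.

```rust
/// Hypothetical stand-in for the whitespace-based splitting shown above.
fn whitespace_words(buffer: &str) -> Vec<&str> {
    buffer.split_whitespace().collect()
}

fn main() {
    let english = "the quick brown fox";
    // A Chinese sentence written without spaces between words.
    let chinese = "我喜欢搜索引擎";

    // English text is split into four separate words.
    println!("{:?}", whitespace_words(english));
    // The whole Chinese sentence comes back as a single "word".
    println!("{:?}", whitespace_words(chinese));
}
```

Because the sentence is treated as one word, a single character inside it is only findable if the index also covers pieces of that word.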

> Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (I probably wouldn't be able to get to this for a few weeks, to set expectations.)

Hmm. This might work well for a smaller corpus but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅

A thought here: could we examine 1 and 2 character substrings but only index them if the characters fall into the range specified by the CJK Unicode Blocks?

> As an aside, I haven't tested this at all with Chinese-language text -- I'd love to hear about any other unexpected behavior that you encounter!

I'll definitely keep you posted :) I don't speak (or read) Chinese, but I'm building this for my roommate, so I'll forward along his thoughts


DenialAdams commented on May 19, 2024

> Hmm. This might work well for a smaller corpus but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅

I did some profiling and was able to easily cut 3 hours down to 1 min (PR incoming), so this is much less of a concern for me now :)

(edit: #74)


DenialAdams commented on May 19, 2024

> A thought here: could we examine 1 and 2 character substrings but only index them if the characters fall into the range specified by the CJK Unicode Blocks?

Here's a prototype for this approach:
DenialAdams@fa2b24d

Let me know what you think of it. So far it's really improved the search results, but the searches do take a little longer now (worth it to me).

