Comments (4)
Hey @DenialAdams! Thanks for writing in, hope Stork is working well for you so far despite this!
I'll have to noodle on this issue for a bit. If a corpus gets too large, Stork has a lot of trouble searching for any query less than three characters: the index file size itself gets pretty big and the search algorithm can get kind of slow.
I think (off the top of my head, not looking at code) that this would have to involve an index regeneration so Stork indexed one- and two-character-long substrings.
Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (I probably wouldn't be able to get to this for a few weeks, to set expectations.)
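As a rough illustration of why a lower minimum length grows the index, here is a minimal Rust sketch. It assumes, purely for illustration, that the index stores word prefixes; `prefixes` and `min_len` are hypothetical names, not Stork's actual API or configuration keys.

```rust
/// Returns every prefix of `word` that is at least `min_len` characters long.
/// Hypothetical helper for illustration; not taken from Stork's source.
fn prefixes(word: &str, min_len: usize) -> Vec<&str> {
    word.char_indices()
        .enumerate()
        // `n` is the zero-based character position, so the prefix ending
        // at this character is `n + 1` characters long.
        .filter(|&(n, _)| n + 1 >= min_len)
        // Slice up to the end of the current character so we never cut
        // a multi-byte UTF-8 character in half.
        .map(|(_, (i, c))| &word[..i + c.len_utf8()])
        .collect()
}
```

With `min_len` set to 3, a five-letter word contributes three entries; with `min_len` set to 1, it contributes five, and every single character in the corpus becomes a key.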
As an aside, I haven't tested this at all with Chinese-language text -- I'd love to hear about any other unexpected behavior that you encounter!
Thanks again,
James
from stork.
> I think (off the top of my head, not looking at code) that this would have to involve an index regeneration so Stork indexed one- and two-character-long substrings.
Interestingly (and I only thought to try this out this morning), if I use the CLI to search (bypassing the >= 3 requirement) I do get some results with a one-character search. Well... I get exactly one result. I suspect this is because that's the only time this character appears alone, i.e. with whitespace on either side of it. A cursory look at the code seems to confirm that:
stork/src/index_versions/v3/builder/word_list_generators/mod.rs
Lines 47 to 63 in db1b958
Since Chinese sentences don't use whitespace to separate words, this also might be a little bit of an issue :) but I don't know exactly what the word list is used for; I'll keep learning by reading your code (which is very readable, nice job!)
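To illustrate the whitespace point, here is a minimal Rust sketch. It is not the code referenced above; `whitespace_words` is a hypothetical helper that only assumes the word list is produced by splitting on whitespace.

```rust
/// Splits text into words on Unicode whitespace, the way a
/// whitespace-based word list would. Hypothetical helper for
/// illustration; not Stork's actual implementation.
fn whitespace_words(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```

Under that assumption, `whitespace_words("the quick fox")` yields three words, while an unsegmented sentence such as `"我喜欢读书"` comes back as a single "word", so a search for any one character inside it would only match if that character happened to stand alone.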
> Would it work for you if I added a configuration option that let you tell Stork to index substrings as short as one character? (I probably wouldn't be able to get to this for a few weeks, to set expectations.)
Hmm. This might work well for a smaller corpus but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅
A thought here: could we examine 1 and 2 character substrings but only index them if the characters fall into the range specified by the CJK Unicode Blocks?
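A minimal Rust sketch of that idea follows. It is not the linked prototype: `is_cjk` and `short_cjk_substrings` are hypothetical names, and the block list covers only a few common CJK ranges as an assumption rather than every CJK-related Unicode block.

```rust
/// Returns true if `c` falls in one of a few common CJK Unicode blocks.
/// The set of blocks is an illustrative assumption, not exhaustive.
fn is_cjk(c: char) -> bool {
    matches!(
        c,
        '\u{4E00}'..='\u{9FFF}'       // CJK Unified Ideographs
            | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
            | '\u{F900}'..='\u{FAFF}' // CJK Compatibility Ideographs
            | '\u{20000}'..='\u{2A6DF}' // CJK Unified Ideographs Extension B
    )
}

/// Collects every 1- and 2-character substring of `word` that consists
/// entirely of CJK characters. Non-CJK text contributes nothing, so it
/// would keep whatever minimum length the index already uses.
fn short_cjk_substrings(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut out = Vec::new();
    for (i, &c) in chars.iter().enumerate() {
        if !is_cjk(c) {
            continue;
        }
        // Single-character entry.
        out.push(c.to_string());
        // Two-character entry, only when the next character is also CJK.
        if let Some(&next) = chars.get(i + 1) {
            if is_cjk(next) {
                out.push([c, next].iter().collect());
            }
        }
    }
    out
}
```

Restricting the short entries to CJK characters keeps the extra index size proportional to the amount of CJK text, rather than adding one- and two-letter keys for every Latin-script word.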
> As an aside, I haven't tested this at all with Chinese-language text -- I'd love to hear about any other unexpected behavior that you encounter!
I'll definitely keep you posted :) I don't speak (or read) Chinese, but I'm building this for my roommate, so I'll forward along his thoughts.
from stork.
> Hmm. This might work well for a smaller corpus but my index already takes about 3 hours to build and searching takes about a second, so I guess it depends on how much more it slows things down 😅
I did some profiling and was able to easily cut 3 hours down to 1 min (PR incoming), so this is much less of a concern for me now :)
(edit: #74)
from stork.
> A thought here: could we examine 1 and 2 character substrings but only index them if the characters fall into the range specified by the CJK Unicode Blocks?
Here's a prototype for this approach:
DenialAdams@fa2b24d
Let me know what you think of it. So far it's really improved the search results, though the searches do take a little longer now (worth it to me).
from stork.