jeancroy / fuzzysearch
:mag: Fast autocomplete suggestion engine using approximate string matching
License: MIT License
Hi, like others, I can't believe more people aren't aware of this library. I love it. Thank you for working on it. I'm happy to contribute to the project if this is a bug that can be fixed, or if better docs would be helpful.
I'm trying to use FuzzySearch.prototype.add() to add new documents to my searcher after initial index creation. After calling searcher.add(), I introspect searcher.index.length and searcher.nb_indexed and can see that the document has been added to the index. But when I call searcher.search(), the expected result is missing.
Here's my test code:
// TEST
const docs = [
  { _id: 1, title: 'Item 1', domain: 'item1.com' },
  { _id: 2, title: 'Item 2', domain: 'item2.com' },
];
const keys = {
  title: 'title',
  domain: 'domain',
};
const searcher = new FuzzySearch({
  source: docs,
  keys: keys,
  identify_item: function (doc) {
    return doc._id;
  },
});

let results;

// First test search
console.log(`Results for 'title:Item', round 1`);
results = searcher.search('title:Item');
results.forEach((result) => {
  console.log(result);
});
console.log(`There are ${searcher.nb_indexed} docs`);

// Add a doc
console.log(`Adding a doc`);
const newDoc = { _id: 3, title: 'Item 3', domain: 'item3.com' };
searcher.add(newDoc, keys);
// Do I need searcher.dirty = true; here? That doesn't work either
// Sanity check internals
console.log(`Doc added, now there are ${searcher.nb_indexed} docs, let's search 'em!`);

// Second test search
console.log(`Results for 'title:Item', round 2`);
results = searcher.search('title:Item');
results.forEach((result) => {
  console.log(result);
});
Console output:
Results for 'title:Item', round 1
Object {_id: 1, title: "Item 1", domain: "item1.com"}
Object {_id: 2, title: "Item 2", domain: "item2.com"}
There are 2 docs
Adding a doc
Doc added, now there are 3 docs, let's search 'em!
Results for 'title:Item', round 2
Object {_id: 1, title: "Item 1", domain: "item1.com"}
Object {_id: 2, title: "Item 2", domain: "item2.com"}
The expected result is that the Item 3 document would also appear in the results, but it doesn't.
I noticed that there is a FuzzySearch.prototype.dirty property, and I've tried setting searcher.dirty = true; after searcher.add(), but that doesn't fix the problem; in fact, it resets searcher.nb_indexed back to 2. Maybe there is an issue where setting searcher.dirty = true; causes a refresh of the index based on the original data, and added documents are lost?
Maybe I'm just not adding the documents in the right way?
The issue doesn't seem to be related to the usage of Tagged Search or identify_item.
Any hints or pointers are much appreciated. I'm happy to hunt things down. Thanks again for building this great library.
I need a way to specify a minimum score for results, but I found thresh_include inconvenient in general, because it depends on input length. I think I need to compare the score relative to the input length?
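One workaround, assuming the searcher can be configured to return raw result objects exposing .score and .item (e.g. an output_map that returns the root result, as used elsewhere in this thread; verify against your version), is a post-filter that normalizes the score by query length:

```javascript
// Hypothetical post-filter (not a built-in option): keep a result only if its
// score per query character clears a fixed bar, so the cut-off scales with
// input length instead of being absolute like thresh_include.
function filterByRelativeScore(results, query, minScorePerChar) {
  return results
    .filter(r => r.score / query.length >= minScorePerChar)
    .map(r => r.item);
}

const results = [
  { item: 'survey', score: 12 },
  { item: 'surgeon', score: 4 },
];
console.log(filterByRelativeScore(results, 'surv', 2)); // [ 'survey' ]
```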
In the minimal setup, we have this:
setsource("movies.json");
In my case, I wanted something like:
setsource("movies.json.php?query=" + firstWordOfWhatUserTypedSoFar);
This is because movies.json in my case has 1.6M records, and that becomes too much for the JS to handle when running your fuzzy-search algorithms. I thought that if I could return a subset of movies.json by sending the very first word (or whatever the user has typed so far), I could significantly reduce the size of the JSON returned by the server.
But I guess this is not possible, or is it?
Hi Jean,
Another quick check: can I sort the results based on string length instead of alphabetical order?
Can we pass an option, or do we need to change FuzzySearch.js?
Thanks again for your work.
Small words are currently filtered out. With length 2, we filter words like "a" and "of".
However, when the small words are numbers, they may be significant, e.g. 2, 5, 10, 18...
A possible solution is to accept any length if the first character is a digit.
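That rule fits in a one-line predicate (a sketch of the proposal, not the library's current filter):

```javascript
// Sketch of the proposed rule: keep a token if it is long enough OR its
// first character is a digit, so "of" is dropped but "2" and "18" survive.
function keepToken(token, minLength) {
  return token.length >= minLength || (token[0] >= '0' && token[0] <= '9');
}

const tokens = 'chapter 2 of 18'.split(' ');
console.log(tokens.filter(t => keepToken(t, 3))); // [ 'chapter', '2', '18' ]
```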
Hello Team,
Quick question: do we have a synonym substitute that we could use?
For example, 'lat' -> 'Geo Coordinates'
--Thanks B
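As far as this thread shows, there is no built-in synonym feature, but the query can be rewritten through a synonym map before it reaches the searcher (the map and function names here are illustrative):

```javascript
// Query expansion before the fuzzy search: replace known abbreviations with
// the term that actually appears in the indexed data.
const synonyms = { lat: 'geo coordinates', lon: 'geo coordinates' };

function expandQuery(query) {
  return query
    .split(/\s+/)
    .map(w => synonyms[w.toLowerCase()] || w)
    .join(' ');
}

console.log(expandQuery('lat of Paris')); // "geo coordinates of Paris"
```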
Hi Jean,
Please take a look at the 2 screenshots
The first screenshot shows the search results for "Black Seed and Honey".
The second screenshot shows the same data set when just "honey" is searched.
When honey is searched as part of "black seed and honey", almost all of the honey entries are crowded out and pushed down.
When the word "black" is a full match, I can't complain, because both "black" and "honey" are 5 letters...
But the following entries should not end up getting more points than those in the 2nd screenshot I posted above.
What do I need to do so that the ones above are valued less than the full-word match "honey"?
Also, as the examples below demonstrate, these are very poor matches. Which param should be set to which value (in the options) so that the algorithm does not think these are great matches, and no score is given to those scattered matches?
As you can easily tell, these have nothing to do with "Black Seed and Honey", and entries like the ones below (with perfect, bingo "honey" matches) should definitely be above them.
Sometimes I think of catching the full result set of the fuzzy search and then running it through a loop (which I would write) to weed out the scattered ones and move the bingo words to the top by re-scoring them.
But if we can do this with your params already, then why get into a second loop and a second scoring pass, right?
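For what it's worth, that second pass can be small. Here is a sketch over plain objects with title and score fields (a real version would read whatever shape your searcher returns): boost entries that contain a query word as an exact whole word, then re-sort.

```javascript
// Re-ranking pass: add a bonus per query word that appears as an exact,
// whole word in the title, then sort by the adjusted score.
function boostExactWords(results, query, bonus) {
  const words = query.toLowerCase().split(/\s+/);
  return results
    .map(r => {
      const text = r.title.toLowerCase();
      const hits = words.filter(w =>
        new RegExp('\\b' + w + '\\b').test(text)
      ).length;
      return { ...r, score: r.score + bonus * hits };
    })
    .sort((a, b) => b.score - a.score);
}

const ranked = boostExactWords(
  [
    { title: 'Blackberry hues', score: 10 },
    { title: 'Black seed and honey', score: 9 },
  ],
  'black seed and honey',
  5
);
console.log(ranked[0].title); // "Black seed and honey"
```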
Please, could you specify under which conditions we can use your library?
@jeancroy Thank you for this wonderful library!!!
Indeed, it's a buried treasure that just needs some attention.
I have experienced some issues when a property is null.
Sometimes it's good to be a little less fuzzy.
Filters:
Prune results using a custom function. An example use case is to filter by a status code. This must be done before spending time computing the fuzzy score, and it must also preserve the cache.
"literal expression":
Remove the literal word from the fuzzy search, but build a filter that requires it to be present as-is in one of the fields (case-insensitive indexOf). Let the highlighter be aware of this and act accordingly.
field: "literal expression":
Same, but it must exist in this particular field (see tagged search).
literal field:
The use case is dates / status codes. A match is not required, but for this field to contribute to the score, one of the query words must be present (case-insensitive indexOf). This does not require usage of "literal", although it should respect it (for example "multi word"). Let the highlighter be aware of this and act accordingly.
Sigils #word, @word:
For all use cases, this is syntactic sugar over the tag: syntax. The difference is that it may appear in the middle of a query, and it is limited to a single word.
I would like to store the index in local storage once it has been built, for a large number of entries.
Is there a public API for retrieving / storing it?
And are there any known issues when compressing the index with ez?
Please take a look at this screenshot.
You may also play with the actual page:
https://www.islamicity.org/quransearch/lab/scrape-sunnah-com/search/fuzzy-search-client.php and selecting the 2nd option (Search Field: FUZZYSEARCH__topic). Please type "Ablution" and you will see the same thing I posted in the screenshot.
My question is...
The first match is a great bingo, but everything from the 2nd match downward is too far away from the first word. It would be better not to show such suggestions. Which settings do I need to play with so that it only shows the first one (Ablutions), since it has a huge resemblance to what the user typed, while the others, such as (Afflictions), are far, far away from it?
I think this is a hard one, since trying to make that view better may end up sacrificing some of the good suggestions.
If this is not really possible, let me ask you something else...
Why, in your demo, is the entire word Brain bolded? Shouldn't only the applicable letters be bold?
Here is the view:
Shouldn't we expect only the b, r, and a to be in bold, but not the i and n?
To set the source and initialize the fuzzyhound, we do this:
function setsource(url, keys, output) {
  $.getJSON(url).then(function (response) {
    fuzzyhound.setOptions({
      source: response,
      keys: keys,
      output_limit: 200, // Return up to N results, 0 to disable
      //output_map: output,
      output_map: function (root) {
        final_item = removeTags(root.item);
        console.log(final_item);
        return final_item;
      },
    });
  });
}
Let's assume our url is this:
blah.blah.com/blah.js
And we also have this:
$.ajaxSetup({cache: true});
This is great for efficiency, but what if blah.js changes on the server (say a spelling error has been fixed in it)? How do you make sure people who already cached blah.js in their browser get the latest and greatest blah.js?
Would the following trick work?
url = 'blah.blah.com/blah.js?version=' + getFileModificationDate("... /blah.js");
So whenever the file changes, I can change the version number in the URL passed to setsource, either programmatically in an unattended fashion or manually.
In that case, would $.ajaxSetup({cache: true}); still serve the older file?
Or is there a better way to address this intelligent caching?
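The version-parameter trick does work in principle: a different query string is a different cache key, so $.ajaxSetup({cache: true}) can stay enabled. Note that getFileModificationDate is not an existing jQuery API; the version value has to come from your own server (a build hash, a timestamp, etc.). A sketch:

```javascript
// Cache-busting sketch: append a version token to the URL so browsers treat
// a changed file as a brand-new resource. `fileVersion` stands in for
// however your server reports the file's current version.
function versionedUrl(baseUrl, fileVersion) {
  return baseUrl + '?version=' + encodeURIComponent(fileVersion);
}

console.log(versionedUrl('blah.blah.com/blah.js', '2024-01-05'));
// "blah.blah.com/blah.js?version=2024-01-05"
```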
Hi Jean
In this example: https://github.com/jeancroy/FuzzySearch#scoring-an-item you make no reference to fullname. You may want to update the Books object with "fullname" in it.
Author: "John MiddleName Doe",
should be something like this:
Author: { fullname: "John MiddleName Doe", Age: "60" },
No?
Hello, @jeancroy I LOVE your library! It's fantastic. So much better than the many other popular fuzzy search libraries.
A lot of your writing on the theory and algorithm implementation goes over my head, and I don't have time to get into it, but I love that it's there. The mere presence of extensive intellectual stuff gives me extra confidence in your library; plus, I can tell you really value the things you create and aim to produce stuff that works well. So I give your library many stars! ⭐ ⭐ ⭐ ⭐ ⭐ ⭐
I also LOVE that you have an add method; it fits my use case perfectly. My use case is adding documents (and queries, but those are shorter and easier to handle) to an index inside a desktop app that archives your browsing history so you can read it when you are offline. I recently added full-text search to this (using flexsearch and ndx), but while these are good (and serializable, to save time on restarting the app), they do not support typos. So I went searching for a fuzzy library, and after a few false starts, ended up here (actually via GitHub search, not npm search or Google search). I'm really thankful I did end up here.
Even if you can't implement any of these things (such as serialization, or remove), I am still immensely grateful for your library. Thank you for crafting such a thing of beauty and releasing it into the world for all to use!
Anyway, I'd like there to be a remove, but I know it's not your job to make it, and I am sure you are in any case super busy. But let me just describe it here for the sake of the issue.
If I added a document to the index using add(), and then later called remove() on that document, the document would not show up in search results, nor in serializations (such as they may be) of the index.
My particular motivation for this feature is that very often, a webpage (with the same URL) will add additional content (such as comments on a story), and my archiver would like to index those so that the person using the archiver can get (and search for) the most up-to-date version of the content they have seen.
Right now my solution is simply to discard all fz-search results that are older duplicates, but I am worried about the scalability of this, so I have delayed deploying it to production. This is because people may re-index a page many times as its content changes, but if I never remove those pages from the fuzzy-search index, my index will end up dominated by pages that I no longer need and that are out of date. So you can see I'm very scared of this possibility! It would be terrible!
But it's OK. If you cannot implement a remove (or even serialization; I commented on another issue, #21), I will still use your wonderful fz-search library for providing typeahead and suggestions for search queries (just like Google search), rather than on results. It will be a little sad not to be able to search with typos, but I'm sure I can figure out another way to do it somehow.
Please delete this issue if you already have this functionality (how embarrassing! 😱 )
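Until a real remove() exists, one workaround is soft deletion: keep a tombstone set of removed ids and filter them out of every result set. The index still holds the stale documents (so it keeps growing), but they never reach the user. A sketch using the _id / identify_item convention from earlier in this thread:

```javascript
// Tombstone-based soft delete (a wrapper sketch, not a library feature).
// remove(id) marks a document as gone; filter(results) hides it afterwards.
function makeSoftDelete() {
  const removed = new Set();
  return {
    remove(id) { removed.add(id); },
    filter(results) { return results.filter(doc => !removed.has(doc._id)); },
  };
}

const tombstones = makeSoftDelete();
tombstones.remove(2);
const visible = tombstones.filter([{ _id: 1 }, { _id: 2 }, { _id: 3 }]);
console.log(visible.map(d => d._id)); // [ 1, 3 ]
```

Periodically rebuilding the searcher from the still-live documents would reclaim the space the tombstones leave behind.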
Is there a way to integrate this into an existing Vue component?
Hello,
I was planning on rewriting fuzz-aldrin-plus in C++, until I found this library. It seems to be more suited to my case; however, what matters to me is performance. Is it more or less like fuzz-aldrin-plus? I would test performance myself, but I'm unable to do so for a couple of days (and I can't get this question out of my head).
Hi. First, thank you for building this promising library, and thanks for your work on fuzzaldrin-plus as well.
My data set is 1mm records (natural-language document titles), and I'm providing an auto-complete search that only needs to return 15 results. With fuzzaldrin-plus, there's a setting, maxInners, which is detailed here. I've found that with 1mm records, this fuzzaldrin-plus config works amazingly well, returning 15 results from 1mm records in 1-2ms.
let fuzzaldrinOptions = {
  debug: false,
  maxResults: 15,
  usePathScoring: false,
  maxInners: 50,
  key: 'title',
};
This is due to the maxInners option and its "high positive count mitigation" as detailed in the README. Basically, my interpretation is that maxInners tells the algorithm to abort scoring early if there are a ton of potential matches (i.e. searching for "a" across 1mm records).
I'm looking at the tuning options for FuzzySearch. Is there a similar mitigation in FuzzySearch that lets me efficiently escape from a large number of matches early? I tried playing with the thresh_relative_to_best setting, thinking that setting it higher (e.g. 0.8) would select from a smaller set of records, but that didn't change performance significantly.
Any advice you could provide on tuning for a search across a large number of records would be very helpful. Maybe FuzzySearch doesn't have that mechanism, as it looks like maxInners was added as a later performance-enhancement PR to fuzzaldrin-plus, but I figured it was worth asking.
Thanks again for your work on this library.
Hi, when the algorithm suggests a few items, how do you select one so that it is added to the textbox, or submit that selection to my server? By default, when I click/choose an item, the text box gets cleared immediately and the suggestions drop-down disappears. I would like it to behave like a typical Google search: as you type, Google makes suggestions, and as soon as you select an item from the suggestions list, you end up firing a Google search for that selection. Is there an onClick/onSelect-type hook that passes the selection to a JS function as input, so that from there I can do whatever I want?
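If you are using the typeahead.js integration from the demos, the selection hook is a custom event rather than a FuzzySearch option: recent typeahead.js versions emit typeahead:select (older ones used typeahead:selected). The submitSearch helper below is a placeholder for your own code:

```javascript
// Build the URL (or payload) to fire when a suggestion is chosen.
function submitSearch(value) {
  return '/search?q=' + encodeURIComponent(value);
}

// Wiring (browser only, with jQuery and typeahead.js loaded):
// $('#demo-query').on('typeahead:select', function (ev, suggestion) {
//   window.location.href = submitSearch(suggestion);
// });

console.log(submitSearch('fuzzy search')); // "/search?q=fuzzy%20search"
```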
Hi Jean,
I've got a 13,000-item JSON, and it works great...
but sometimes I need to add, say, 1000-2000 more items if I detect a certain pattern in the user input!
The example you give in the docs is as follows, but it only covers adding a single item.
var data = ["survey", "surgery", "insurgence"];
var searcher = new FuzzySearch({ source: data });
var query = "assurance";
var result = searcher.search(query);

var newEntry = "surgeon";
searcher.add(newEntry);
var newResult = searcher.search("surgeo");
And this example is only for a single-dimension array!
In my case, the JSON array has multiple items, and I specify which key to search too (through the final arg of the setsource function).
When adding 1000 or more items, repeating code such as var newEntry = "surgeon"; searcher.add(newEntry);
1000 times does not make sense... I also do not know how to point to a key in a more complex JSON array.
Is there a way in the API to tell the engine, "here, add this JSON file on top of the first JSON loaded!", or perhaps to completely change the JSON to another JSON in the middle of the search?
Again, the scenario is something like this:
The user types a string, and typeahead starts working based on the original JSON data set.
The user input reaches the 4th word (and because it is a long input, or my algorithm detects a certain pattern in the input), I then replace the original JSON with another data set, so that the input at that moment is searched against the new JSON data set. And if the user starts removing words (or deletes that pattern), I can load the first JSON again.
Can I do something like this?
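As far as the docs shown here go, bulk addition is just a loop over the single-item add(); there is no documented addAll(). Whether object items are indexed using the keys already configured on the searcher is an assumption to verify against your version. A sketch with a stand-in searcher so it is self-contained:

```javascript
// Bulk add as a loop over the documented single-item add().
function addAll(searcher, items) {
  for (const item of items) searcher.add(item);
  return searcher;
}

// Stand-in searcher so the sketch runs without the library:
const fakeSearcher = { docs: [], add(item) { this.docs.push(item); } };
addAll(fakeSearcher, [{ title: 'Item A' }, { title: 'Item B' }]);
console.log(fakeSearcher.docs.length); // 2
```

For the swap-the-whole-dataset scenario, re-calling setOptions({ source: otherData, keys: ... }), as the setsource helper in this thread does, looks like the intended route.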
Is there a built-in option so that the density of the matching characters (I mean the highlighted matches in red) is honored more, compared to lengthier entries? I think that would increase the relevancy automatically.
Example screenshot:
Here, I searched for Hajj... (in a 1.8MB, 13,000-item file), and the best entries (which are below) ended up around 100th or so in the suggested items.
They are pretty short, bingo-like matches, yet plenty of long ones preceded them.
How can I easily raise them to the top? Or at least near the top?
Here are the winners...
Here are the poor little ones being crushed by the winners:
Clearly, the red density on those short ones is noticeably higher.
Especially this guy:
What do you think Jean?
I have a feeling this has to do with your compare method.
The solution may be there, but if so, how do I create that comparison?
I use:
bonus_match_start: 0.6,
highlight_bridge_gap: 0
I'd have to see it in practice, but what I'm asking for could create a tremendous difference in quality, especially when main topics and sub-topics are searched, as in my case.
Hello, how can I install your project in Node.js? Thanks.
Characters that are not present in a search query get highlighted:
import FuzzySearch from 'fz-search'; // v1.0.0

const nodes = [
  {
    value: 'primer'
  },
];

const fuzzy = new FuzzySearch({
  source: nodes,
  keys: 'value',
});

const [result] = fuzzy.search('pre').map(
  item => fuzzy.highlight(item.value)
);

console.log(result);
Actual output:
<strong class="highlight">prime</strong>r
(rendered: primer with "prime" bold)
Expected output:
<strong class="highlight">pr</strong>im<strong class="highlight">e</strong>r
(rendered: primer with "pr" and "e" bold)
As you can see, the characters im aren't present in the search query, but they are still highlighted.
Is it possible to avoid highlighting of symbols that are not present in a search query?
Hi Jean, thank you for Fuzzysearch, it's excellent. My feed contains categories and products, I'm wondering if...
My feed looks like this...
[
  { "type": "P", "id": "123", "title": "Product Name", "descr": "Description", "sku": "ABC123", "price": "19.95", "image": "./images/image.jpg", "category": "Category Name" },
  ...
  { "type": "C", "id": "456", "title": "Category Name", "descr": "Description" },
  ...
]
Note, I'm using Twitter typeahead. Grateful for any advice you can give.
Phil
I need data from a database via the POST method; how can I do that?
I've just tried the following:
<script>
var fuzzyhound = new FuzzySearch();

$('#demo-query').typeahead({
  minLength: 2,
  highlight: false // let FuzzySearch handle highlight
},
{
  name: 'movies',
  source: fuzzyhound,
  templates: {
    suggestion: function (result) {
      return "<div>" + fuzzyhound.highlight(result) + "</div>";
    }
  }
});

$.ajaxSetup({ cache: true });

function setsource(url, keys, output) {
  $.post(url, { query: query, _csrf: _csrf })
    .done(function (response) {
      fuzzyhound.setOptions({
        source: response,
        keys: keys,
        output_map: output
      });
    });
}

setsource("test.php");
</script>
How can I define the keyword (query)?
Thanks
I notice there's at least one place where the library uses window.performance, which fails in Node. It's also not published to npm... But it sure looks like a very nice module for autocompletion, among other things. Any reason it doesn't support Node and/or isn't published?
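Until the library handles this itself, a small shim can supply a performance object in Node (a sketch; patching it into the library internals is left open):

```javascript
// Use window.performance in the browser; fall back to Node's
// high-resolution timer otherwise. perf.now() returns milliseconds.
const perf =
  typeof window !== 'undefined' && window.performance
    ? window.performance
    : { now: () => Number(process.hrtime.bigint()) / 1e6 };

const t0 = perf.now();
const t1 = perf.now();
console.log(t1 >= t0); // true
```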
Is it possible to integrate your library into this UI:
https://github.com/farzher/fuzzysort
The demo of that repo is here: https://rawgit.com/farzher/fuzzysort/master/test.html
His search algorithm is fast but only good for single words. Yours is superior, but his UI is excellent.
Hi Jean,
Just to see which item matched with what score, is it possible to modify the code so I can see numbers like:
one flew over the cuckoos nest (1975) @ 78 // biggest score in this case was 78
one false move (1992) @ 54 // here the score is 54, and so on...
...
last item @ 20 // here the last item's score is 20 because field_good_enough is set to 20
If I can see what scores items are matching with, I can play with field_good_enough more effectively to fine-tune the numbers.
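A sketch of that formatting, assuming the searcher can return raw result objects exposing .item and .score (my reading of the output_map option used elsewhere in this thread; verify against your version):

```javascript
// Format "title @ score" lines from raw result objects.
function withScores(results) {
  return results.map(r => r.item.title + ' @ ' + Math.round(r.score));
}

const sample = [
  { item: { title: 'one flew over the cuckoos nest (1975)' }, score: 78.2 },
  { item: { title: 'one false move (1992)' }, score: 54 },
];
console.log(withScores(sample));
// [ 'one flew over the cuckoos nest (1975) @ 78', 'one false move (1992) @ 54' ]
```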
This patch provides the ability to add documents incrementally, by using a new add method in conjunction with an identifyItem function. It does this by keeping track of the item ids in the indexmap dictionary, where the values are the positions in the index where the items live.
0001-Provide-the-ability-to-incrementally-add-to-the-inde.patch.zip
Additionally, this patch moves grunt into devDependencies instead of dependencies.
I have these items to search through.
When I search "sea", the results are in this order.
I would like to give a higher ranking to full words in the query that match; is this possible?
Here are my configurations:
output_limit: 6,
output_map: 'alias',
source: this.items,
thresh_include: 8,
minimum_match: 1,
bonus_token_order: 10,
bonus_match_start: 10,
bonus_position_decay: 0,
highlight_bridge_gap: 2,
highlight_prefix: true,
thresh_relative_to_best: 0.5
Hi...
First of all, I really want to thank you for creating this library. It helps me search for keywords based on similarity.
My question:
"Sorry for my bad grammar"
Thanks
Is it possible to let fuzzy-search search items like
United States <!--America-->
so the user can type America and match it, but only see this in the search results:
United States
This way, we can add lots of helper keys or synonyms in HTML comments to help the search, without confusing the user with those helper words.
Think of it like this:
user types "below" and you match "under" because of this:
under <!--below underneath -->
or user types "heaven" or "gardens of eden" and you match "paradise" because of this:
paradise <!--heaven, gardens of eden -->
etc..
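Rather than HTML comments, the same effect can come from a hidden synonyms field: index both fields, display only one. keys and output_map are real FuzzySearch options shown elsewhere in this thread, but the field names and exact option values here are illustrative:

```javascript
// Each entry carries a visible label and a hidden synonym string.
const countries = [
  { display: 'United States', synonyms: 'America USA' },
  { display: 'paradise', synonyms: 'heaven, gardens of eden' },
];

// Conceptual searcher setup (browser, with the library loaded):
// new FuzzySearch({ source: countries, keys: ['display', 'synonyms'],
//                   output_map: 'display' });

// The text the engine would match against is both fields combined,
// while only `display` would be shown to the user:
const searchable = countries.map(c => c.display + ' ' + c.synonyms);
console.log(searchable[0]); // "United States America USA"
```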
The classic edit-distance algorithm is O(N*M*K),
where N is the number of items, M is the average length of an item, and K is the length of the search term.
With bit parallelism we are able to shave off K,
which brings us closer to O(N*M) for a single-word query.
We still have to visit every character of every item!
We would like an index structure that can reduce N,
that is, one that pre-selects some good candidates, on which we do the slower but higher-quality scoring.
We can improve on this with an index data structure: it basically amounts to pre-generating a number of spelling errors that we can accept, and organizing those results in an efficient way.
Of the misspellings we can pre-compute, deletion is one of the easiest, in particular because it does not require knowledge of the alphabet in which the word exists.
The symmetric delete algorithm applies the delete-variation computation to both the candidate and the search term.
One way to understand symmetric delete is that we are trying to match X out of the first Y characters of the word. To do so, we generate every subset of X characters at a given delete distance.
For example, the two-delete variations of "hello" and "hills" are:
// del2("hello")
[ 'llo', 'elo', 'ell', 'hlo', 'hll', 'heo', 'hel' ]
// del2("hills")
[ 'lls', 'ils', 'ill', 'hls', 'hll', 'his', 'hil' ]
They share the hll key, as they indeed match from a 3-out-of-5 standpoint.
// del2("jello")
[ 'llo', 'elo', 'ell', 'jlo', 'jll', 'jeo', 'jel' ]
Jello and hello share the llo, elo, and ell keys; jello is thus a better match for hello than hills is.
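The two-delete lists above can be generated with a few lines of plain JavaScript (a sketch, not the library's implementation):

```javascript
// del(word, d): every distinct string obtained by deleting exactly d
// characters from word, order preserved. Duplicate letters collapse
// naturally because results go through a Set.
function del(word, d) {
  let current = new Set([word]);
  for (let i = 0; i < d; i++) {
    const next = new Set();
    for (const w of current)
      for (let j = 0; j < w.length; j++)
        next.add(w.slice(0, j) + w.slice(j + 1));
    current = next;
  }
  return [...current];
}

// "hello" and "hills" share exactly one key, the 3-out-of-5 match "hll":
const shared = del('hello', 2).filter(k => del('hills', 2).includes(k));
console.log(shared); // [ 'hll' ]
```

Running the same intersection for "jello" against "hello" yields three shared keys (llo, elo, ell), matching the ranking argument above.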
For longer words we simply take the first Y characters.
For shorter words we have fallback X-out-of-Y strategies.
We also compute a smaller-X strategy for long words. (This allows shorter search queries to match, and is one difference between spell-check and autocomplete.)
We follow these "editorial choice" principles:
A match must happen toward the start of the word
(with the start resetting at each word).
A match must include at least half of the characters
(and more is better).
We consider the following X-out-of-Y (XoY) strategies:
3o6, 3o5, 3o4, 3o3
2o4, 2o3, 2o2
1o1
Note that, for example, 2o4 includes every variation included in 2o3 and some more. So for every X, we only compute one XoY strategy, with the largest Y available.
1o1 indexes the word by its first letter, and 2o2 is not included because we only consider words longer than 3 characters.
Those rules are subject to change as we gain more feedback.
For example, 3o6 might generate too many variations.
For the search term we compute a single X-out-of-Y strategy, so that we use as many typed characters as possible.
The absolute maximum number of keys with the above strategies and an a-z alphabet:
26^3 + 26^2 + 26 = 18,278 keys.
The multi-word strategy consists of splitting the input into words and looping the key-generation strategy over each of those words. All of the collected word parts point to the same index entry.
This means we'll have false positives, where an entry can match the query using a mix-and-match of word parts.
For now we accept this deficiency because it simplifies the index structure, and since this is a pre-selector, we prefer to err toward false positives rather than false negatives.
A proper fix might involve two maps:
key => word and word => entry
This might be needed if each entry is a large corpus and the pre-selection is rendered useless by the word-part soup.
In that case, for every word of the query:
word-part => word => filter for minimum quality.
Then put those words together and retrieve entries:
word => entry => filter for minimum quality.
entry => refined score => filter for minimum quality.
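Setting the two-map refinement aside, the single-map index described above can be sketched as follows (illustrative names, and a fixed 2-delete / first-5-characters strategy rather than the full XoY table):

```javascript
// Illustrative symmetric-delete index (not the library's code). Each entry is
// indexed under the delete variations of each of its words; the search term
// generates the same variations, and the union of the hit sets becomes the
// candidate list for the expensive, higher-quality scorer.
function variations(word, d) {
  let cur = new Set([word]);
  for (let i = 0; i < d; i++) {
    const next = new Set();
    for (const w of cur)
      for (let j = 0; j < w.length; j++)
        next.add(w.slice(0, j) + w.slice(j + 1));
    cur = next;
  }
  return cur;
}

function buildIndex(entries) {
  const index = new Map(); // key -> Set of entry positions
  entries.forEach((text, id) => {
    for (const word of text.toLowerCase().split(/\s+/)) {
      for (const key of variations(word.slice(0, 5), 2)) {
        if (!index.has(key)) index.set(key, new Set());
        index.get(key).add(id);
      }
    }
  });
  return index;
}

function candidates(index, term) {
  const out = new Set();
  for (const key of variations(term.toLowerCase().slice(0, 5), 2))
    for (const id of index.get(key) || []) out.add(id);
  return [...out];
}

const idx = buildIndex(['hello world', 'hills have eyes', 'unrelated']);
console.log(candidates(idx, 'hillo')); // entries 0 and 1 are pre-selected
```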
This was inspired by the blog post below; in particular, they make very bold speed claims and the algorithm is relatively easy to understand.
http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
JavaScript comes with a good built-in hashmap implementation, and we are taking advantage of that.
The data structure is mostly flat compared to, say, a radix tree, and that can help import/export and thus offset some of the preparation time.
As implemented, the index structure has some measure of quality. It might be enough for certain uses (i.e. it may not need to be folded into FuzzySearch).
If this is implemented (and turned on), FuzzySearch will become less fuzzy. That may be a good thing; if not, it may still be a preferable trade-off compared to something like max-inners.
I've made a proof of concept of the above:
https://gist.github.com/jeancroy/9187627975a2632a8f8e4aaae17d358e
The work of integrating something like this into FuzzySearch only makes sense if it is both much faster and fast enough for our needs.
Hi Jean,
Please take a look at the screenshot, which demonstrates the issue more visually.
In my demo, I type "Belief", and many Belief entries match, but I want the one that has the most "Belief" words to match as #1.
For that, I created a "belief" item in which I put 4 "Belief" words back to back.
But it is still not #1.
Is there a setting to move that kind of match higher?