Giter Club home page Giter Club logo

pruningradixtrie's Introduction

PruningRadixTrie
MIT License

PruningRadixTrie - 1000x faster Radix trie for prefix search & auto-complete

The PruningRadixTrie is a novel data structure, derived from a radix trie - but 3 orders of magnitude faster.

A Radix Trie or Patricia Trie is a space-optimized trie (prefix tree).
A Pruning Radix trie is a novel Radix trie algorithm, that allows pruning of the Radix trie and early termination of the lookup.

In many cases, we are not interested in a complete set of all children for a given prefix, but only in the top-k most relevant terms. Especially for short prefixes, this results in a massive reduction of lookup time for the top-10 results. On the other hand, a complete result set of millions of suggestions wouldn't be helpful at all for autocompletion.

The lookup acceleration is achieved by storing in each node the maximum rank of all its children. By comparing this maximum child rank with the lowest rank of the results retrieved so far, we can heavily prune the trie and do early termination of the lookup for non-promising branches with low child ranks.


Performance

Benchmark

The Pruning Radix Trie is up to 1000x faster than an ordinary Radix Trie.

While 37 ms for an autocomplete might seem fast enough for a single user, it becomes insufficient when we have to serve thousands of users in parallel. Then autocomplete lookups in large dictionaries are only feasible when powered by something much faster than an ordinary radix trie.

While a prefix of length=1 is not very useful for the Latin alphabet, it does make sense for CJK languages. Also, there are many more application fields for a fast prefix search algorithm beyond character-wise word completion: Instead of characters, the prefix can be composed of arbitrary items, e.g. whole words in a query completion, or towns in a long routing sequence.

Dictionary

Terms.txt contains 6 million unigrams and bigrams derived from English Wikipedia, with term frequency counts used for ranking. But you can use any frequency dictionary for any language and domain of your choice.

Blog Posts

The Pruning Radix Trie โ€” a Radix trie on steroids
1000x Faster Spelling Correction algorithm
SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking

Application:

The PruningRadixTrie is perfect for auto-completion, query completion or any other prefix search in large dictionaries. While 37 ms for an auto-complete might seem fast enough for a single user, it becomes a completely different story if we have to serve thousands of users in parallel. Then autocomplete lookups in large dictionaries become only feasible when powered by something much faster than an ordinary radix trie.

Usage:

Create Object

PruningRadixtrie pruningRadixTrie = new PruningRadixtrie();

AddTerm: insert term and term frequency count into Pruning Radix Trie. Frequency counts for same term are summed up.

pruningRadixTrie.AddTerm("microsoft", 1000);

GetTopkTermsForPrefix: retrieve the top-k most relevant terms for a given prefix from the Pruning Radix Trie.

string prefix="micro";
int topK=10;
var results = pruningRadixTrie.GetTopkTermsForPrefix(prefix, topK, out long termFrequencyCountPrefix);
foreach ((string term,long termFrequencyCount) in results) Console.WriteLine(term+" "+termFrequencyCount);

ReadTermsFromFile: Deserialise the Pruning Radix Trie from disk for persistence.

pruningRadixTrie.ReadTermsFromFile("terms.txt");

WriteTermsToFile: Serialise the Pruning Radix Trie to disk for persistence.

pruningRadixTrie.WriteTermsToFile("terms.txt");

Ports

The following third party ports or reimplementations to other programming languages have not been tested by myself whether they are an exact port, error free, provide identical results or are as fast as the original algorithm.

Go
https://github.com/olympos-labs/pruning-radix-trie

Java
https://github.com/benldr/JPruningRadixTrie

Python
https://github.com/otto-de/PyPruningRadixTrie

Rust
https://github.com/peterall/pruning_radix_trie


PruningRadixTrie is contributed by SeekStorm - the high performance Search as a Service & search API

pruningradixtrie's People

Contributors

lukevp avatar wolfgarbe avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

pruningradixtrie's Issues

[BUG] UpdateMaxCounts can contaminate the sorted state

The UpdateMaxCounts function updates the termFrequencyCountChildMax field, which is also the field by which the Node.Children Lists are sorted. Since the sort call comes before the UpdateMaxCounts call in the following code, each layer is not actually guaranteed to be in sorted order.

else
{
curr.Children.Add((term, new Node(termFrequencyCount)));
//sort children descending by termFrequencyCountChildMax to start lookup with most promising branch
curr.Children.Sort((x, y) => y.Item2.termFrequencyCountChildMax.CompareTo(x.Item2.termFrequencyCountChildMax));
}
termCount++;
UpdateMaxCounts(nodeList, termFrequencyCount);
}
catch (Exception e) { Console.WriteLine("exception: " + term + " " + e.Message); }
}

And there is no sort call after this UpdateMaxCounts:

if ((common == term.Length) && (common == key.Length))
{
if (node.termFrequencyCount == 0) termCount++;
node.termFrequencyCount += termFrequencyCount;
UpdateMaxCounts(nodeList, node.termFrequencyCount);
}

In practice, this can subtly worsen performance.

Also, because UpdateMaxCounts updates termFrequencyCountChildMax in potentially all ancestor nodes, those layers would need to be sorted as well if every layer is supposed to maintain sorted order.

Store additional data

It would be helpful if we could store additional data in relation to the term like an database id or similar.
Maybe there is an easier way but storing all terms inside an dictionary inside the trie (or maybe there is a whole set of all terms statically) would be a suggestion.

And is there a specific reason why the result tuple is not typed (a separate type). If it was maybe it makes sense to introduce also a generic result where T is the type which is added when indexing.

something like

AddTerm(string term, int weight, T data)

LookupResult results = Lookup(string term, int limit, int offset)
LookupResult results = Lookup(string term, int limit, int offset)

I have ported to Java. Reference in the README?

Hi Mr Garbe,

Thanks for your brilliant work! I have ported the source code to Java, and put it in benldr/JPruningRadixTrie. You might like to reference this in your readme, so that any other Java users can avoid having to port the code themselves?

I have tested it and seems to work as desired (in any case, I will be actively using this code in months to come so will become aware of any bugs if there are any). Note I did not port the PruningRadixTrie.Benchmark code.

Thanks again for your great work,
Ben

Terms.txt file missing

Could you also provide your terms file to start the benchmark ?
Would be great

Thanks a lot for this piece of code!

Any plans to put out a nuget package?

Hi,
Thanks for this, I have a custom implementation of Tries that this could likely replace. Are there any plans to upload this as a nuget package? I appreciate it's only 2 classes of code which can easily be copied, but was just wondering.

Capable of backing a web api for autocomplete

Would have the need of an autocomplete functionality which can handle thousands of words of a given domain within a website.

So my question is if I can use your data structure in a website with high concurrency and if its writable or read only ?

Its awesome what you have done here, the only place where I found something similar is CQEngine in Java, which has also special Trie structure for prefix searches...and is build for concurre cy in mind.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.