Giter Club home page Giter Club logo

Comments (2)

BurntSushi avatar BurntSushi commented on August 22, 2024

You know your users best, but my instinct is that you might consider doing the same this library does:

  • By default, chooses the Aho-Corasick implementation that heuristics believe are best automatically.
  • Provides an option to override the automatic choice and lets the user specify which to use.

If you want an idea of the differences between the different implementations, you might do some ad hoc benchmarking:

$ git clone https://github.com/BurntSushi/aho-corasick
$ cd aho-corasick
$ cargo install -f --path aho-corasick-debug
$
$ aho-corasick-debug texts/wikipedia/titles.10000 subtitles/2018/OpenSubtitles2018.raw.sample.en --match-kind leftmost-longest --kind noncontiguous
pattern read time: 402.378µs
automaton build time: 27.103352ms
automaton heap usage: 11307832 bytes
match count: 161135
count time: 4.675598321s

$ aho-corasick-debug texts/wikipedia/titles.10000 subtitles/2018/OpenSubtitles2018.raw.sample.en --match-kind leftmost-longest --kind contiguous
pattern read time: 413.211µs
automaton build time: 21.877186ms
automaton heap usage: 2057424 bytes
match count: 161135
count time: 2.359616092s

$ aho-corasick-debug texts/wikipedia/titles.10000 subtitles/2018/OpenSubtitles2018.raw.sample.en --match-kind leftmost-longest --kind dfa
pattern read time: 383.93µs
automaton build time: 155.796865ms
automaton heap usage: 180510904 bytes
match count: 161135
count time: 1.346125399s

$ aho-corasick-debug texts/wikipedia/titles.10000 subtitles/2018/OpenSubtitles2018.raw.ru --match-kind leftmost-longest --kind noncontiguous
pattern read time: 453.172µs
automaton build time: 17.181189ms
automaton heap usage: 11307832 bytes
match count: 3700
count time: 7.22918367s

$ aho-corasick-debug texts/wikipedia/titles.10000 subtitles/2018/OpenSubtitles2018.raw.ru --match-kind leftmost-longest --kind contiguous
pattern read time: 419.158µs
automaton build time: 29.077667ms
automaton heap usage: 2057424 bytes
match count: 3700
count time: 3.204312942s

$ aho-corasick-debug texts/wikipedia/titles.10000 subtitles/2018/OpenSubtitles2018.raw.ru --match-kind leftmost-longest --kind dfa
pattern read time: 486.045µs
automaton build time: 147.695433ms
automaton heap usage: 180510904 bytes
match count: 3700
count time: 2.81868388s

The needles are 10,000 randomly selected Wikipedia titles and the haystacks are English and Russian subtitles. In more condensed format:

  • The noncontiguous NFA uses 11MB and takes 21-27ms to build. It searches in
    4.68s and 7.23s.
  • The contiguous NFA uses 2MB and takes 21-29ms to build. It searches in 2.36s
    and 3.2s.
  • The DFA uses 180MB(!) and takes 147-155ms to build. It searches in 1.35s and
    2.82s.

Notice that sometimes the DFA is a fair bit faster, but sometimes only a little faster, than the contiguous NFA. But look at the build times and heap usage.

If we increase to 100,000 English titles, the differences change yet again:

$ aho-corasick-debug texts/wikipedia/titles.100000 subtitles/2018/OpenSubtitles2018.raw.ru --match-kind leftmost-longest --kind noncontiguous
pattern read time: 2.76288ms
automaton build time: 218.309437ms
automaton heap usage: 98589200 bytes
match count: 17069
count time: 13.264365099s

$ aho-corasick-debug texts/wikipedia/titles.100000 subtitles/2018/OpenSubtitles2018.raw.ru --match-kind leftmost-longest --kind contiguous
pattern read time: 3.023829ms
automaton build time: 253.405188ms
automaton heap usage: 14535888 bytes
match count: 17069
count time: 8.674279809s

$ aho-corasick-debug texts/wikipedia/titles.100000 subtitles/2018/OpenSubtitles2018.raw.ru --match-kind leftmost-longest --kind dfa
pattern read time: 3.468222ms
automaton build time: 1.764152679s
automaton heap usage: 1574058520 bytes
match count: 17069
count time: 2.833372042s

Now the DFA is taking 1.5GB while the contiguous NFA takes a mere 14MB. That's two orders of magnitude difference. Sadly, the contiguous NFA is quite a bit slower in this case, but IMO, even if you want the faster search times, one should be very clear eyed about what it's going to cost to do it. So I feel like it's something one should explicitly have to opt into.

If you do go the route of just trusting the crate's automatic heuristics, then it will use the DFA in some circumstances. Namely, when the number of patterns is pretty small (low hundreds), then the DFA will be used. Otherwise, the contiguous NFA is used in almost all other cases.

It's possible some perf improvements to the contiguous NFA might make it even more attractive, but it already fairs pretty well.

Do also note that if you use a DFA and enable both anchored and unanchored searches, then its heap usage will double:

$ aho-corasick-debug texts/wikipedia/titles.100000 subtitles/2018/OpenSubtitles2018.raw.ru --match-kind leftmost-longest --kind dfa
pattern read time: 3.142986ms
automaton build time: 1.691781473s
automaton heap usage: 1574058520 bytes
match count: 17069
count time: 2.839759428s

$ aho-corasick-debug texts/wikipedia/titles.100000 subtitles/2018/OpenSubtitles2018.raw.ru --match-kind leftmost-longest --kind dfa --start-kind both
pattern read time: 2.542809ms
automaton build time: 2.335389008s
automaton heap usage: 3147712944 bytes
match count: 17069
count time: 2.837407089s

Finally, note that aho-corasick 1.0 isn't actually out yet. I'm not quite sure when I'm going to put it out. Hopefully within the next couple months. A lot of these changes were motivated by my work on regex-automata, which needed some of the API updates and benefits from the better memory efficiency of the contiguous NFA. I want to hold off on 1.0 until I'm more confident that what I have works well for regex-automata.

from aho-corasick.

itamarst avatar itamarst commented on August 22, 2024

Thanks for the detailed answer, and for all your work on this! I will likely follow your suggestion, after also doing some benchmarking of my own once a new release is available.

from aho-corasick.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.