Giter Club home page Giter Club logo

go-ahocorasick's Introduction

Aho-Corasick

⚠️ Use the newer aho-corasick instead. This implementation does not scale well with number of patterns. ⚠️

Build Status

Aho-Corasick string search algorithm implemented in Go.

Uses a double array trie for improved access speeds and reduced memory consumption.

Licensed under MIT License.

Documentation

Can be found here.

Usage

Use a TrieBuilder to create a Trie:

trie := NewTrieBuilder().
    AddStrings([]string{"hers", "his", "he", "she"}).
    Build()

Match something:

matches := trie.MatchString("I have never tasted a hershey bar.")
fmt.Printf("We got %d matches.\n", len(matches))

// => We got 4 matches.

Examine matches:

for _, match := range matches {
    fmt.Printf("Matched %q at offset %d.\n", match.Match(), match.Pos())
}

// => Matched "he" at offset 22.
// => Matched "hers" at offset 22.
// => Matched "she" at offset 25.
// => Matched "he" at offset 26.

For debugging you may output the trie in DOT format:

NewTrieGrapher(trie).DrawFailLinks(true).Graph("example.dot")

And convert to image, e.g.:

$ dot -Tpng -o example.png example.dot

example-trie

Building

You can use ReadStrings or ReadHex to read patterns from a file (one pattern on each line).

patterns, err := ReadStrings("patterns.txt")
if err != nil {
    log.Fatal(err)
}

trie := NewTrieBuilder().AddPatterns(patterns).Build()

Saving/Loading

Building a large trie can take some time:

chart

So you can create a trie and save to file and load it instead of recreating it each time:

err := SaveTrie(trie, "my.trie")
if err != nil {
    log.Fatal(err)
}

And later:

trie, err := LoadTrie("my.trie")
if err != nil {
    log.Fatal(err)
}

Performance

Tested on a Dell XPS (i7-6700HQ @ 2.60GHz and 16 GiB RAM).

Building

BenchmarkBuildNSF/10-8         	  200000	      11871 ns/op
BenchmarkBuildNSF/50-8         	   20000	      66552 ns/op
BenchmarkBuildNSF/100-8        	   10000	     168264 ns/op
BenchmarkBuildNSF/500-8        	    1000	    2118400 ns/op
BenchmarkBuildNSF/1000-8       	     200	    6864025 ns/op
BenchmarkBuildNSF/5000-8       	      10	  126533350 ns/op
BenchmarkBuildNSF/10000-8      	       2	  504624124 ns/op
BenchmarkBuildNSF/50000-8      	       1	11154774829 ns/op
BenchmarkBuildNSF/100000-8     	       1	45294850018 ns/op

build-chart

As you can see, building gets a bit rough above 10,000 patterns.

Matching

Using 10,000 patterns.

BenchmarkMatchIbsen/100-8         	 1000000	      1193 ns/op
BenchmarkMatchIbsen/500-8         	  300000	      5279 ns/op
BenchmarkMatchIbsen/1000-8        	  200000	      9704 ns/op
BenchmarkMatchIbsen/5000-8        	   30000	     49823 ns/op
BenchmarkMatchIbsen/10000-8       	   20000	    102436 ns/op
BenchmarkMatchIbsen/50000-8       	    3000	    490882 ns/op
BenchmarkMatchIbsen/100000-8      	    2000	    976724 ns/op

match-chard

Matching follows the input more linearly and is quite fast.

Memory usage

Haven't tested this properly, but a quick test with 10,000 patterns gave Trie with size 99830 (that is, the length of all its slices combined). This should in theory equal around 0.76MiB. I do not know the internals of golang enough to know how this is in practice.

The memory usage is (obviously) higher when actually doing matching.

go-ahocorasick's People

Contributors

bobusumisu avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.