
aho-corasick's Introduction

Aho-Corasick


Dependency

Include this dependency in your POM. Be sure to check for the latest version in Maven Central.

<dependency>
  <groupId>org.ahocorasick</groupId>
  <artifactId>ahocorasick</artifactId>
  <version>0.6.3</version>
</dependency>

Introduction

Most free-text searching is based on Lucene-like approaches, where the search text is parsed into its various components and a lookup is done for every keyword to see where it occurs. This approach works well when looking for a handful of keywords, but becomes quite slow when searching for, say, 100,000 of them (for example, when checking text against a dictionary).

The Aho-Corasick algorithm shines when looking for multiple words. Rather than chop up the search text, it uses all the keywords to build a Trie construct. The crucial Aho-Corasick components include:

  • goto
  • fail
  • output

Every character encountered is presented to a state object within the goto structure. If there is a matching state, that will be elevated to the new current state.

However, if there is no matching state, the algorithm signals a fail and falls back to states of lesser depth (i.e., shorter matches), proceeding from there until it finds a matching state or reaches the root state.

Whenever a state is reached that matches an entire keyword, it is emitted to an output set which can be read after the entire scan has completed.

The algorithm runs in O(n) time: no matter how many keywords are given, the running time grows linearly with the length of the search text (plus the number of matches found).
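The goto/fail/output interplay described above can be sketched in a few dozen lines of Java. This is a toy, stdlib-only illustration of the technique, not this library's implementation:

```java
import java.util.*;

// Toy sketch of the goto/fail/output machinery described above.
public class TinyAhoCorasick {
    static final class Node {
        final Map<Character, Node> goTo = new HashMap<>(); // goto structure
        Node fail;                                         // fail link
        final List<String> output = new ArrayList<>();     // keywords ending here
    }

    final Node root = new Node();

    TinyAhoCorasick(List<String> keywords) {
        // goto: build a trie from all keywords.
        for (String kw : keywords) {
            Node node = root;
            for (char c : kw.toCharArray())
                node = node.goTo.computeIfAbsent(c, k -> new Node());
            node.output.add(kw);
        }
        // fail: breadth-first, each node's fail link is the deepest state
        // matching a proper suffix of its path.
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.goTo.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            node.goTo.forEach((c, child) -> {
                Node f = node.fail;
                while (f != root && !f.goTo.containsKey(c)) f = f.fail;
                child.fail = f.goTo.containsKey(c) ? f.goTo.get(c) : root;
                child.output.addAll(child.fail.output); // inherit suffix matches
                queue.add(child);
            });
        }
    }

    // output: report every "keyword@startIndex" found in the text.
    List<String> parse(String text) {
        List<String> emits = new ArrayList<>();
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.goTo.containsKey(c)) state = state.fail;
            state = state.goTo.getOrDefault(c, root);
            for (String kw : state.output) emits.add(kw + "@" + (i - kw.length() + 1));
        }
        return emits;
    }

    public static void main(String[] args) {
        TinyAhoCorasick ac = new TinyAhoCorasick(List.of("hers", "his", "she", "he"));
        System.out.println(ac.parse("ushers")); // [she@1, he@2, hers@2]
    }
}
```

Running it on the README's own example ("ushers" against hers/his/she/he) reproduces the three matches listed in the Usage section below.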

The Aho-Corasick algorithm can help:

  • find words in texts to link or emphasize them;
  • add semantics to plain text; or
  • check against a dictionary to see if spelling errors were made.

See the white paper by Aho and Corasick for algorithmic details.

Usage

Set up the Trie using a builder as follows:

Trie trie = Trie.builder()
    .addKeyword("hers")
    .addKeyword("his")
    .addKeyword("she")
    .addKeyword("he")
    .build();
Collection<Emit> emits = trie.parseText("ushers");

The collection will contain Emit objects that match:

  • "she" starting at position 1, ending at position 3
  • "he" starting at position 2, ending at position 3
  • "hers" starting at position 2, ending at position 5

In situations where overlapping instances are not desired, retain the longest and left-most matches by calling ignoreOverlaps():

Trie trie = Trie.builder()
    .ignoreOverlaps()
    .addKeyword("hot")
    .addKeyword("hot chocolate")
    .build();
Collection<Emit> emits = trie.parseText("hot chocolate");

The ignoreOverlaps() method tells the Trie to remove all overlapping matches. For this it relies on the following conflict resolution rules:

  1. longer matches prevail over shorter matches; and
  2. left-most prevails over right-most.

Only one result is returned:

  • "hot chocolate" starting at position 0, ending at position 12
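The two conflict-resolution rules can be sketched in isolation as a filter over (start, end, keyword) triples. This is a toy illustration; the library performs the filtering internally:

```java
import java.util.*;

// Toy non-overlap filter: longest match wins, ties go to the left-most.
public class OverlapFilter {
    record Match(int start, int end, String keyword) {}

    static List<Match> removeOverlaps(List<Match> matches) {
        List<Match> sorted = new ArrayList<>(matches);
        // Rule 1: longer first; rule 2: left-most first on equal length.
        sorted.sort(Comparator.comparingInt((Match m) -> m.end() - m.start())
                .reversed()
                .thenComparingInt(Match::start));
        List<Match> kept = new ArrayList<>();
        for (Match m : sorted) {
            boolean overlaps = kept.stream()
                    .anyMatch(k -> m.start() <= k.end() && k.start() <= m.end());
            if (!overlaps) kept.add(m); // higher-priority matches claim their span first
        }
        kept.sort(Comparator.comparingInt(Match::start)); // report in text order
        return kept;
    }

    public static void main(String[] args) {
        // "hot chocolate" contains both keywords; only the longer one survives.
        List<Match> result = removeOverlaps(List.of(
                new Match(0, 2, "hot"),
                new Match(0, 12, "hot chocolate")));
        System.out.println(result); // only the hot chocolate match remains
    }
}
```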

To check for whole words exclusively, call onlyWholeWords() as follows:

Trie trie = Trie.builder()
    .onlyWholeWords()
    .addKeyword("sugar")
    .build();
Collection<Emit> emits = trie.parseText("sugarcane sugar canesugar");

Only one match is found, whereas without onlyWholeWords() three matches are found. The occurrences inside sugarcane and canesugar are discarded because they are partial matches.

Some text is WrItTeN in mixed case, which makes it hard to match. Instruct the Trie to convert the search text to lowercase to ease matching. The lower-casing applies to keywords as well.

Trie trie = Trie.builder()
    .ignoreCase()
    .addKeyword("casing")
    .build();
Collection<Emit> emits = trie.parseText("CaSiNg");

Normally, this match would not be found. By calling ignoreCase(), the entire search text is made lowercase before matching begins. Therefore it will find exactly one match.

It is also possible to just ask whether the text matches any of the keywords, or to return only the first match found.

Trie trie = Trie.builder().ignoreOverlaps()
        .addKeyword("ab")
        .addKeyword("cba")
        .addKeyword("ababc")
        .build();
Emit firstMatch = trie.firstMatch("ababcbab");

The value for firstMatch will be "ababc" from position 0. If you only need a yes/no answer, containsMatch() checks whether the text contains any match and returns true if so.

For a bare-bones Aho-Corasick run with a custom emit handler, use:

Trie trie = Trie.builder()
        .addKeyword("hers")
        .addKeyword("his")
        .addKeyword("she")
        .addKeyword("he")
        .build();

final List<Emit> emits = new ArrayList<>();
EmitHandler emitHandler = new EmitHandler() {

    @Override
    public void emit(Emit emit) {
        emits.add(emit);
    }
};

trie.parseText("ushers", emitHandler);

In many cases you may want to perform tasks on both the non-matching and the matching text. Such cases are better served by Trie.tokenize(), which lets you loop over the corpus and deal with matches as soon as they are encountered. Here's an example that outputs keywords as italicized HTML elements:

String speech = "The Answer to the Great Question... Of Life, " +
        "the Universe and Everything... Is... Forty-two,' said " +
        "Deep Thought, with infinite majesty and calm.";

Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()
    .addKeyword("great question")
    .addKeyword("forty-two")
    .addKeyword("deep thought")
    .build();

Collection<Token> tokens = trie.tokenize(speech);
StringBuilder html = new StringBuilder();
html.append("<html><body><p>");

for (Token token : tokens) {
    if (token.isMatch()) {
        html.append("<i>");
    }
    html.append(token.getFragment());
    if (token.isMatch()) {
        html.append("</i>");
    }
}

html.append("</p></body></html>");
System.out.println(html);

You can also emit custom outputs. This can be useful, for example, to implement a trivial named entity recognizer. In that case, use a PayloadTrie instead of a Trie as follows:

class Word {
    private final String gender;
    public Word(String gender) {
        this.gender = gender;
    }
    public String getGender() {
        return gender;
    }
}

PayloadTrie<Word> trie = PayloadTrie.<Word>builder()
    .addKeyword("hers", new Word("f"))
    .addKeyword("his", new Word("m"))
    .addKeyword("she", new Word("f"))
    .addKeyword("he", new Word("m"))
    .addKeyword("nonbinary", new Word("nb"))
    .addKeyword("transgender", new Word("tg"))
    .build();
Collection<PayloadEmit<Word>> emits = trie.parseText("ushers");

Releases

See releases for details.


aho-corasick's Issues

Unicode offset problem?

I added the following test case to TrieTest.java and it fails:

    @Test
    public void unicodeIssue() {
        String target = "LİKE THIS";
        Trie trie = new Trie().caseInsensitive().onlyWholeWords();
        trie.addKeyword("this");
        Collection<Emit> emits = trie.parseText(target);
        System.out.println(emits);
        assertEquals(1, emits.size());
        Iterator<Emit> it = emits.iterator();
        checkEmit(it.next(), 5, 8, "this");
    }

Notice the capital I with a dot. The code thinks the offsets are at 6, 9 and that actually causes trie.tokenize(target) to crash because 9 is past the end of the string. That character is two bytes wide, so it seems to be pushing the offset calculation off.
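The cause is reproducible without the library. Java strings are UTF-16, so byte width is not the issue; rather, lower-casing U+0130 yields "i" plus a combining dot (U+0307), making the lower-cased text one char longer and shifting later offsets. A minimal check (Locale.ROOT is used here for a deterministic result; the library itself lower-cases with the default locale):

```java
import java.util.Locale;

// Lower-casing "İ" (U+0130) in the root locale produces "i" + U+0307,
// so the lower-cased string is one char longer than the original.
public class DottedI {
    public static void main(String[] args) {
        String target = "LİKE THIS";
        System.out.println(target.length());                                 // 9
        System.out.println(target.toLowerCase(Locale.ROOT).length());        // 10
        System.out.println(target.toLowerCase(Locale.ROOT).indexOf("this")); // 6, not 5
    }
}
```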

Refactor the aho-corasick implementation

The current implementation has grown bloated under the various change requests. Also, the check for whole words and non-overlapping matches is done by post-processing the found emits. Because of this, the interactive algorithm run does not support checking for whole words and non-overlaps, only the full run does.

Goal:

  • have an implementation that makes a single run, similar to the original algorithm, but also handles whole words and non-overlapping matches
  • the iterator and full run both support casing, whole words and non-partials
  • make the code faster
  • have a cleaner implementation

Some ideas:

  • introduce a concept called 'emit candidate'. An emit candidate will not be emitted until it has been confirmed as an emit.
  • when non-overlaps have been requested, there will only be room for a single emit candidate, whereas the alternative allows for multiple emit candidates. New emit candidates overwrite old emit candidates. The nature of the AC algorithm ensures that the longest match will be selected.
  • when whole words have been requested, the only acceptable 'situations' before and after a match are i) is a whitespace character, ii) is the start of the file, or iii) is the end of the file. Before submitting an emit candidate, the match will be checked to comply with these conditions. If it does not comply, the algorithm must fall back to its fail state and allow for other matches to be found.
  • when the Trie returns to its root state or the file has ended, emit candidates are flushed and turned into actual emits.
  • the system ideally works with an iterator. The algorithm can be called in the old way (ie, returning a collection of emits or tokens), or by calling hasNext / next directly on its iterator.
  • stopOnHit will be removed since it holds no more value with firstMatch being the superior implementation.
  • onlyWholeWords and onlyWholeWordsWhiteSpaceSeparated must work in exactly the same way

This release will be turned into a v0.4.0.

If any of you have more or better ideas, please say so here.

Allow to specify Payload with Keyword

It'd be great if each keyword could have an associated object that is used as a payload. That way the Trie can store associated data with each keyword, and Emits can retrieve that data in addition to just the keyword and its index position.

Question about thread safety

Hi,

As far as I can gather without going too deeply into the implementation, it looks safe to build the Trie once and then search it in parallel (i.e., call trie.parseText() on the same Trie instance from multiple threads).

This is because all the "state" of the search appears to be on local variables on the stack.

This is of course ignoring the issue of memory barriers that can be dealt with by making the Trie instance volatile.

Is this assessment correct?

Thanks,
Uri
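The build-once / search-many pattern under discussion can be exercised with a stdlib stand-in (an immutable Set in place of the Trie; this illustrates the reasoning, it does not prove the library thread-safe):

```java
import java.util.*;
import java.util.concurrent.*;

// The safety argument mirrors the question: the shared structure is never
// mutated after construction, and each search keeps its state in locals.
public class SharedSearch {
    static boolean parallelContains(Set<String> keywords, String word, int tasks)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                results.add(pool.submit(() -> keywords.contains(word))); // read-only
            }
            boolean all = true;
            for (Future<Boolean> f : results) all &= f.get();
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Set<String> keywords = Set.of("hers", "his", "she", "he"); // built once
        System.out.println(parallelContains(keywords, "she", 8));  // true
    }
}
```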

New versions unavailable on Maven Central

I tried using the code, but unfortunately versions newer than 0.4.0 no longer seem to be available through the Maven Central repository.
Would it be possible to push the artifacts to Maven Central for 0.6.0 and future releases again?

Missing null check in Emit.toString()

Using version 0.3.0. I tried this code on a test file in Eclipse:

for(String line; (line = br.readLine()) != null; ) {
    Emit firstMatch = trie.firstMatch(line);
    System.out.println(firstMatch.toString());
}

and received a java.lang.NullPointerException. I changed it to the following and got expected behavior (no more NullPointerException):

for(String line; (line = br.readLine()) != null; ) {
    Emit firstMatch = trie.firstMatch(line);
    System.out.println(firstMatch);
}

I think you are missing a null dereference check somewhere in your overridden toString() implementations. Minor issue, but also easy to fix.

Maintaining additional node information

Hi,

First of all thanks for providing an implementation of the algorithm.

Wanted to check -
Is there a way to maintain additional information along with each keyword/node within the trie data structure? Basically, whenever there is a match, I would like to retrieve the corresponding match information.

For example,
Assume we have 10 documents, each having distinct words. I can create the trie by adding the words as keywords to the trie builder. However, whenever there is a match, I would like to know which document the word came from. So I need to maintain some extra information, like a document id, in each node of the trie along with the words.

Let me know if you need more information.

Thanks,
Rajan Jana

parseText() API in Trie is not thread safe

Hi Robert,

Thanks for your help. I have constructed the Trie with keywords upfront and am calling the parseText() API on it from a multithreaded environment, and I am getting a NullPointerException. Could you please help us by providing thread-safe code?

Behavior at scale

Hello,

I tried using the algorithm when scanning for 6M values, but I got a Java out of heap space error.
java.lang.OutOfMemoryError: Java heap space

What changes would be needed to support Aho-Corasick at scale?
Could augmenting the heap space of the JVM be sufficient, or is a bigger refactoring needed?
Are there alternatives (either AC or another algorithm) that could scale well, maybe distributed?

Thanks for any pointers to a possible solution.

NullPointerException under parseText - Perhaps due to non-thread safety?

Hi, I saw the following exception reported from the parseText method, but haven't been able to reproduce it in my development environment.

java.lang.NullPointerException
    at org.ahocorasick.trie.Trie.getState(Trie.java:138) ~[ahocorasick-0.2.4.jar:?]
    at org.ahocorasick.trie.Trie.parseText(Trie.java:99) ~[ahocorasick-0.2.4.jar:?]
    at myCode...

I'm wondering if it's possibly being caused by using a single Trie from multiple threads simultaneously (though hammering it from a number of threads on my machine doesn't exhibit a problem).

Should I expect calling parseText on a single Trie from multiple threads to be safe?

Stop Trie#removePartialMatches() from being O(n^2)

Hi
org.ahocorasick.trie.Trie.removePartialMatches(CharSequence, List<Emit>) is slow when the given list is large: a list of Emits to delete is built and then each one is removed (not by index) from the list, which gives O(n^2) running time.

If you are happy to move to Java 8, the fix is to change the body of removePartialMatches to:

collectedEmits.removeIf(e -> this.isPartialMatch(searchText, e));

This reduced the running time from 1,941,053 ms to 142 ms in one test that exercised this method heavily; the document was about 8 MB and produced lots of emits.
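The speedup comes from removeIf compacting the list in a single pass, instead of one O(n) remove call per element. A stdlib-only sketch of the two patterns (the predicate is a stand-in for isPartialMatch):

```java
import java.util.*;
import java.util.function.Predicate;

public class RemoveDemo {
    public static void main(String[] args) {
        Predicate<Integer> partial = n -> n % 2 == 0; // stand-in predicate

        // O(n^2) pattern described in the issue: collect, then remove one by one.
        List<Integer> slow = new ArrayList<>(List.of(1, 2, 3, 4, 5, 6));
        List<Integer> toRemove = new ArrayList<>();
        for (Integer n : slow) if (partial.test(n)) toRemove.add(n);
        for (Integer n : toRemove) slow.remove(n); // each remove scans the list

        // Single-pass alternative (Java 8+): one traversal, one compaction.
        List<Integer> fast = new ArrayList<>(List.of(1, 2, 3, 4, 5, 6));
        fast.removeIf(partial);

        System.out.println(slow.equals(fast)); // true: same result, far fewer passes
    }
}
```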

Add list of keywords to Trie

The Trie class should take an array and/or collection of keywords. Rather than:

        .addKeyword("hers")
        .addKeyword("his")
        .addKeyword("she")
        .addKeyword("he")

It should be possible to add them all at once:

        .addKeywords(new String[]{"hers", "his", "she", "he"})

Possible to emit non-matches

Apologies, maybe this is a bad use of 'issues', but is it possible to emit only the non-matches?

Example: match list is : {"foo", "bar", "blah"}

Sentence is: "this is foo xx yyy bar aaa blah"

will emit

{this is xx yy aaa}
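There is no built-in "non-match emit", but Trie.tokenize() (see the README above) returns the non-matching fragments as tokens whose isMatch() is false. The requested behaviour can be sketched with plain JDK code (whole-word splitting only, an illustration rather than the library's behaviour):

```java
import java.util.*;
import java.util.stream.Collectors;

// Keep only the whole words that are not in the keyword set.
public class NonMatches {
    static List<String> nonMatches(Set<String> keywords, String sentence) {
        return Arrays.stream(sentence.split(" "))
                .filter(word -> !keywords.contains(word))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> keywords = Set.of("foo", "bar", "blah");
        String sentence = "this is foo xx yyy bar aaa blah";
        System.out.println(nonMatches(keywords, sentence)); // [this, is, xx, yyy, aaa]
    }
}
```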

Build error

Hi Mr. Robert,

I'm trying to implement your code. I put the following dependency -

    <dependency>
        <groupId>org.ahocorasick</groupId>
        <artifactId>ahocorasick</artifactId>
        <version>0.2.4</version>
    </dependency>

I'm getting the following error on mvn install-

priom@ASUS-S551LA:~/Public/6461/aho$ mvn -e install
+ Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] The projects in the reactor contain a cyclic reference: Edge between 'Vertex{label='org.ahocorasick:ahocorasick'}' and 'Vertex{label='org.ahocorasick:ahocorasick'}' introduces to cycle in the graph org.ahocorasick:ahocorasick --> org.ahocorasick:ahocorasick
[INFO] ------------------------------------------------------------------------
[INFO] Trace
org.apache.maven.BuildFailureException: The projects in the reactor contain a cyclic reference: Edge between 'Vertex{label='org.ahocorasick:ahocorasick'}' and 'Vertex{label='org.ahocorasick:ahocorasick'}' introduces to cycle in the graph org.ahocorasick:ahocorasick --> org.ahocorasick:ahocorasick
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:295)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:138)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:362)
    at org.apache.maven.cli.compat.CompatibleMain.main(CompatibleMain.java:60)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
    at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
    at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
    at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
Caused by: hidden.org.codehaus.plexus.util.dag.CycleDetectedException: Edge between 'Vertex{label='org.ahocorasick:ahocorasick'}' and 'Vertex{label='org.ahocorasick:ahocorasick'}' introduces to cycle in the graph org.ahocorasick:ahocorasick --> org.ahocorasick:ahocorasick
    at hidden.org.codehaus.plexus.util.dag.DAG.addEdge(DAG.java:143)
    at hidden.org.codehaus.plexus.util.dag.DAG.addEdge(DAG.java:123)
    at org.apache.maven.project.ProjectSorter.(ProjectSorter.java:118)
    at org.apache.maven.execution.ReactorManager.(ReactorManager.java:99)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:288)
    ... 11 more
[INFO] ------------------------------------------------------------------------
[INFO] Total time: Less than 1 second
[INFO] Finished at: Thu Sep 25 01:00:02 EDT 2014
[INFO] Final Memory: 2M/61M
[INFO] ------------------------------------------------------------------------ 

Can you help? Can it be a version issue? I am using-

Apache Maven 2.2.1 (rdebian-14)
Java version: 1.7.0_65

Thanks, really appreciate your work!

String out of bound exception

java.lang.StringIndexOutOfBoundsException: String index out of range: 311
at java.lang.String.charAt(String.java:658)
at org.ahocorasick.trie.Trie.removePartialMatches(Trie.java:119)
at org.ahocorasick.trie.Trie.parseText(Trie.java:103)

Loading "/usr/share/dict/words" as the dictionary and searching for
String text = "aardvark aardvarks aardwolf aardwolves Aaren Aargau aargh Aarhus Aarika Aaron aaron Aaronic aaronic Aaronical Aaronite Aaronitic Aaron's-beard Aaronsburg Aaronson babelet Babelic babelike Babelisation Babelise Babelised Babelish Babelising Babelism Babelization Babelize Babelized Babelizing babels Baber babery"

If I add an additional space at end of text the exception does not happen.

/usr/share/dict/words is from words-3.0-22.fc20.noarch

Forgot to mention that the trie is being constructed as new Trie().onlyWholeWords()

Stop search once any result is found

Hi robert,

Is it possible to add a feature or another method that stops searching the trie once any keyword is found?

I have a requirement that only needs to check whether a match exists or not.

It would be more efficient with this function. I think it is easy to add, and it would make the library stronger. :)

Thanks

Add configurable option to remove overlapping matches

The basic Aho-Corasick algorithm returns all matches, whether they overlap or not. It should be possible to eliminate overlaps, so that the series of matches is interspersed in non-matching text, never in or between matches.

Example:

  • keywords: "ab", "cba", "ababc"
  • searchtext: ababcbab
  • basic result: ab@1, ab@3, ababc@4, cba@6, ab@7
  • non-overlapping result: ababc@4, ab@7

Conflict resolution:

  • longer matches prevail over shorter matches
  • if equally long, left-side matches prevail over right-side matches

Allow checks versus lowercase only

When a text is entered, it may contain combinations of upper and lower case characters. If the casing is not important to the matching, it is possible to convert the entire search text into lower case to allow for more matches to be found.

Performance Issue with Overlaps

I encountered a performance problem, documented in a Unit-Test https://github.com/almondtools/stringbench/blob/master/src/test/java/com/almondtools/stringbench/incubation/ACAhoCorasickIncubationTest.java.

I tracked down the reason: the removeOverlaps method removes overlaps in an inefficient way. The experiments show that the following section is the critical one:

for (Intervalable removeInterval : removeIntervals) {
    intervals.remove(removeInterval);
}

Doing this on an ArrayList costs O(n^2), and with an ArrayList containing millions of elements it takes a long time.

Performance concern

I have a simple microbenchmark project to measure the string matching performance with different algorithms.

It looks to me like Aho-Corasick's performance compares poorly with the others:

test scenario | String         | KMP           | KMP (precompiled) | Aho-Corasick
short text    | *78,763.767/ms | 19,118.220/ms | 29,489.667/ms     | 7,824.821/ms
long text     | *4,093.381/ms  | 329.124/ms    | 426.696/ms        | 197.690/ms

Is it because the design of my tests doesn't fit Aho-Corasick's typical usage scenario?

Question on scalability of Aho Corasick implementation.

Hi,
We are using this library for some time now.

Recently there was an OOM due to the increased size of the dictionary.
A simple test shows that memory usage is linear. (there is no attach screenshot option, sorry!)

Currently, our dictionary size is 0.75 million entries (a 33 MB CSV file; each line is a term).

Trie constructed using this dictionary takes about 1.9GB (heap entries below)

"Class","Objects","Shallow Size","Retained Size"
"java.util.TreeMap","16034426","769652448","2060340624"
"java.util.TreeMap$Entry","16036279","641451160","2060340624"
"org.ahocorasick.trie.State","15292852","489371264","2060340624"

My question is

  1. Is this amount of heap usage normal for this library, or is it something unusual?
  2. Any general advice on using the library for very big dictionaries (a few million entries)?

Thanks in advance,
Phaneendra

Improve the performance of org.ahocorasick.trie.Trie

Hi

My profiler seems to show that a bunch of time is being spent in:

java.util.concurrent.LinkedBlockingDeque#add

inside of

org.ahocorasick.trie.Trie.constructFailureStates();

LinkedList is a faster implementation of Queue than LinkedBlockingDeque; I am measuring it to be about 3-4 times faster. Could:

private void constructFailureStates() {
        final Queue<State> queue = new LinkedBlockingDeque<>();

be changed to:

private void constructFailureStates() {
        final Queue<State> queue = new LinkedList<>();

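The swap is mechanical because both classes implement java.util.Queue, and constructFailureStates runs single-threaded, so LinkedBlockingDeque's per-operation locking is pure overhead there. A minimal sketch of the drop-in nature of the change:

```java
import java.util.*;

public class QueueSwap {
    public static void main(String[] args) {
        // Was: Queue<State> queue = new LinkedBlockingDeque<>();
        // LinkedList offers the same Queue interface without locking.
        Queue<String> queue = new LinkedList<>();
        queue.add("child-1");
        queue.add("child-2");
        System.out.println(queue.poll()); // child-1: FIFO order preserved
    }
}
```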

Stack overflow error

java.lang.StackOverflowError
at java.util.ArrayList.iterator(ArrayList.java:814)
at org.ahocorasick.interval.IntervalNode.determineMedian(IntervalNode.java:43)
at org.ahocorasick.interval.IntervalNode.<init>(IntervalNode.java:17)
at org.ahocorasick.interval.IntervalNode.<init>(IntervalNode.java:36)
at org.ahocorasick.interval.IntervalNode.<init>(IntervalNode.java:36)
at org.ahocorasick.interval.IntervalNode.<init>(IntervalNode.java:36)
at org.ahocorasick.interval.IntervalNode.<init>(IntervalNode.java:36)

(rest of the trace chopped for brevity)

Here is what I am doing.

In test setup I am loading a dictionary into a trie and a set of sentences. Then I am passing each one of the sentences in a loop to the trie and after processing two sentences I get the above error.

    for (String sentence : sentences) {
      Collection<Token> tokens = trie.tokenize(sentence);
      for (Token token : tokens) {
        System.out.println("token: " + token);
      }
      System.out.println("\n");
  }

Does the trie need to be reset after processing of a sentence? If so, what is the call?

Thanks.

Use this algorithm for url matching (strings)

Can this algorithm be modified to accomplish URL pattern matching, returning the exact (best) match instead of the first match? For example, could State.java hold a String instead of a Character and compare path segments split on "/", in a way that doesn't hurt performance?

Making Trie Serializable

For instance, in the case of caching to a third-party provider like memcached (and since building a Trie is not a cheap computation), Trie should implement the Serializable interface. Could you provide this feature?
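One workaround, pending such a feature: since the expensive part is building the trie but the keyword list itself is small and stable, serialize the keywords and rebuild the trie after deserialization. A stdlib-only sketch (class and method names are illustrative, not library API):

```java
import java.io.*;
import java.util.*;

// Round-trip a keyword list through Java serialization; the byte[] is a
// stand-in for what would be handed to memcached.
public class KeywordCache {
    static byte[] serialize(List<String> keywords) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(keywords));
        }
        return bytes.toByteArray();
    }

    @SuppressWarnings("unchecked")
    static List<String> deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (List<String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> keywords = List.of("hers", "his", "she", "he");
        List<String> restored = deserialize(serialize(keywords));
        System.out.println(restored.equals(keywords)); // true
        // Rebuild step: feed each restored keyword back into Trie.builder().
    }
}
```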

More flexible tokenize() function

I was looking at the tokenize function because I need things like EmitHandler.
Then I checked the code of tokenize and realized that it is actually a stage after parseText.
So why not use a pattern like tokenize(parseText(...)), so we can enjoy all the new features offered by parseText?
Right now we already have a few parseText variants, and there may be more in the future.

Use regexp in keywords

Hello,

First of all, I would like to thank you for this project, which has really helped me in some of my own projects.

Can you advise me whether it is possible to use regexes in keywords?
If yes, how can I do that?

Thanks for the reply,

Production ready?

Hello,

I would like to ask whether this library is considered production ready, and whether it is being used somewhere?

Thanks!

Pluggable Trie Implementations

Have you tried this test case?

@Test
    public void shefgTest() {
        Trie trie = new Trie(false);
        trie.addKeyword("shefg");
        trie.addKeyword("hefg");
        trie.addKeyword("efg");
        trie.addKeyword("fg");
        Collection<Emit> emits = trie.parseText("shefge");
        System.out.println(emits);
    }

when I run the test case above, it outputs nothing.

Suggestions about the type of emits in State.java

Hi Robert,
Thanks for your help. I fixed the duplicate problem by using a static initializer. The code is shown below:

public class AhoCorasickMethod {

    private static Trie trie;

    private static Boolean alreadyExecuted = false;

    static {
        trie = MssConst.METHODS.get(MssConst.METHOD_MATCHEST);
    }

    public static void init(String keyword) {
        if (!alreadyExecuted) {
            trie.addKeyword(keyword);
            alreadyExecuted = true;
        }
    }

    public AhoCorasickMethod() {

    }

    public Collection<Emit> ahoCorasickMethod() {
        init("community");
        return trie.parseText("anjuke, communityparts");
    }
}

However, a user may add the same keyword more than once, and then ahocorasick's results will contain duplicates.
So I suggest changing the type of emits in State.java from List to Set.

private List<String> emits = null;

become

private Set<String> emits = null;

That is my suggestion; I hope it helps.

Allow to ack emits

Hi,

I was taking a look at how storeEmit works (https://github.com/robert-bor/aho-corasick/blob/master/src/main/java/org/ahocorasick/trie/Trie.java#L277) and thought it would be nice if the emitted boolean were based on the value returned by emitHandler.emit. This would allow us to do additional checks on values before acking an emit, and thus take advantage of isStopOnHit.

In my case, I have a post-condition for each keyword. I would like to stop as soon as I find a match that also satisfies the post-condition. Currently, even if I override the emit handler, I have no way of forcing a stop.

Upper/lower case issues with non-english texts

Hello,
I was using the ahocorasick library to perform some text processing on an English Wikipedia dump. As it contains several non-English words, it led me to discover an issue in how the library handles casing.

I wrote a test case to explain it. Here it is

@Test
public void caseInsensitiveTrieWithSomeUnicodeCharactersCreatesEmitsWithWrongStart(){
    // when this is lower cased, it becomes a string of length 2!
    String upperLengthOne = "İ";
    String normalI = "I";
    Trie trie = Trie.builder()
                    .caseInsensitive()
                    //.onlyWholeWords()
                    .addKeyword(upperLengthOne)
                    .build();
    // because when lower cased it becomes 2 characters, the emit gets confused and creates a string that starts at -1
    // this can cause further problems if we index into the original string with this emit start and end.
    // This happens if we make the trie builder.onlyWholeWords() in which case we get exceptions
    assertEquals(-1, trie.parseText(upperLengthOne).stream().findAny().get().getStart());
}

The whole problem happens when the lower-cased keyword has a different length than the original. I can't think of a quick, easy fix, but the least that can be done is for the library to throw an error when the lower-cased keyword's length differs from the original keyword's length, warning the user about this issue.
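The guard suggested above can be sketched in a few lines (names are illustrative, not library API; Locale.ROOT is used for determinism):

```java
import java.util.Locale;

// Reject keywords whose lower-cased form changes length, since emit offsets
// computed against the lower-cased text would then be wrong.
public class KeywordGuard {
    static void checkKeyword(String keyword) {
        if (keyword.toLowerCase(Locale.ROOT).length() != keyword.length()) {
            throw new IllegalArgumentException(
                    "keyword changes length when lower-cased: " + keyword);
        }
    }

    public static void main(String[] args) {
        checkKeyword("casing"); // fine: length unchanged
        try {
            checkKeyword("İ");  // U+0130 lower-cases to two chars
            System.out.println("not rejected");
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected");
        }
    }
}
```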

Case insensitive setting presumes keywords are lowercase?

Robert
Thanks for the code. I noticed that for case-insensitive matching, Trie.parseText() reduces the input text to lower case, but the keywords are left in their original case (as added). So for the situation where
trie.addKeyword("Alpha");
and
Collection emits = trie.parseText("Alpha");
no matches are returned when case insensitive = true

Adding
if (trieConfig.isCaseInsensitive()) {
character = Character.toLowerCase(character);
}
to addKeyword would make addKeyword and parseText symmetric and match all combinations of case.

thanks

Different results as other String matching algorithms

I used your algorithm to compare the results with my suite of algorithms. Differences occurred.

I cannot say if they are related to wrong usage. You may reproduce the problem

  • by checking out https://github.com/almondtools/stringbench (just to have the right dependencies)

  • use following unit test

    @Test
    public void testACAhoCorasick1() {
        MultiPatternMatcherBenchmark benchmark = new ACAhoCorasickBenchmark();
        benchmark.setup(createMultiPatternSample(2, 8, 8));
        benchmark.benchmarkFind();
    }
    
    private static MultiPatternSample createMultiPatternSample(int alphabet, int pattern, int patternNumber) {
        try {
            MultiPatternSample sample = new MultiPatternSample();
            sample.setPatternNumber(patternNumber);
            sample.setAlphabetSize(alphabet);
            sample.setPatternSize(pattern);
            sample.setup();
            return sample;
        } catch (IOException e) {
            throw new AssertionError(e);
        }
    }
    
  • the problem of wrong results does also appear with other sample/pattern combinations (most documented here, search for IllegalStateException)

If the problem lies in the benchmark (or in the other algorithms), feedback would be appreciated.

caseInsensitive() must be called before addKeyword()

Hi,

I noticed that if you call the caseInsensitive() configuration method after addKeyword(...), then trie.firstMatch() and anything else that returns an Emit will always return null for the keywords added before caseInsensitive() was called. This can be avoided by calling caseInsensitive() before addKeyword(...), as in your examples. Still, I think the documentation should make it clear that the order matters for caseInsensitive(), or the code should be changed to handle either ordering during the build() process. The other configuration methods, such as onlyWholeWords(), work fine regardless of when they are invoked.

For example, the following returns null for emit:

Trie trie = Trie.builder()
    .addKeyword(...)
    .caseInsensitive()
    .build();

Emit emit = trie.firstMatch(...);

But this works fine:

Trie trie = Trie.builder()
    .caseInsensitive()
    .addKeyword(...)
    .build();

Emit emit = trie.firstMatch(...);
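One way to make the ordering irrelevant, as suggested above, is to defer keyword normalization to build(). This is a sketch, not the library's actual code: the builder only records the flag and the raw keywords, and lowercasing happens once, at build time. The built result here is just a list of normalized keywords, standing in for the real Trie construction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch: a builder whose caseInsensitive() may be called before or after
// addKeyword() with the same result, because normalization is deferred.
public class OrderIndependentBuilder {
    private final List<String> keywords = new ArrayList<>();
    private boolean caseInsensitive = false;

    public OrderIndependentBuilder addKeyword(String keyword) {
        keywords.add(keyword);   // store the keyword untouched
        return this;
    }

    public OrderIndependentBuilder caseInsensitive() {
        caseInsensitive = true;  // only record the flag here
        return this;
    }

    // Normalization happens exactly once, at build time, so call order
    // cannot matter. Stand-in for constructing the real Trie.
    public List<String> build() {
        List<String> result = new ArrayList<>();
        for (String keyword : keywords) {
            result.add(caseInsensitive ? keyword.toLowerCase(Locale.ROOT) : keyword);
        }
        return result;
    }
}
```

Both call orders now produce the same normalized keyword set.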

Building repeatedly.

Are there any issues if I do this repeatedly?

boolean checkList(Collection<String> list, String text) {
    Trie trie = Trie.builder().stopOnHit().ignoreCase().addKeywords(list).build();
    return trie.containsMatch(text);
}
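Rebuilding the Trie on every call repeats the full construction work, which is proportional to the total length of the keywords. If the keyword list is stable, it is cheaper to build once and reuse. A sketch of that pattern, with a simple Set standing in for the real (likewise immutable-once-built) Trie:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch: construct the expensive structure once, in the constructor,
// and reuse it for every check. The Set is a stand-in for the real Trie.
public class KeywordChecker {
    private final Set<String> trie; // stand-in for org.ahocorasick.trie.Trie

    public KeywordChecker(List<String> keywords) {
        Set<String> built = new HashSet<>();
        for (String keyword : keywords) {
            built.add(keyword.toLowerCase(Locale.ROOT)); // mirrors ignoreCase()
        }
        this.trie = built; // expensive construction happens exactly once
    }

    public boolean containsMatch(String text) {
        // crude stand-in for trie.containsMatch(text)
        String lower = text.toLowerCase(Locale.ROOT);
        for (String keyword : trie) {
            if (lower.contains(keyword)) {
                return true;
            }
        }
        return false;
    }
}
```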

NullPointerExceptions

Hi,

I am getting frequent NullPointerExceptions inside Trie.java:

Caused by: java.lang.NullPointerException
    at org.ahocorasick.trie.Trie.getState(Trie.java:138)
    at org.ahocorasick.trie.Trie.parseText(Trie.java:99)

Looks like currentState can be null because failure() can return null?

    private State getState(State currentState, Character character) {
        State newCurrentState = currentState.nextState(character);
        while (newCurrentState == null) {
            currentState = currentState.failure();
            newCurrentState = currentState.nextState(character);
        }
        return newCurrentState;
    }
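If failure() can indeed return null (for example, a state whose failure link was never set during construction), one defensive option is to fall back to the root state instead of dereferencing null. This is a sketch of that guard, not the library's actual code; the State class below is a minimal stand-in for org.ahocorasick.trie.State.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: getState() with a null-guard on the failure link.
public class SafeTransition {

    // Minimal stand-in for the library's State.
    static class State {
        final Map<Character, State> success = new HashMap<>();
        State failure;

        State nextState(Character character) {
            return success.get(character);
        }

        State failure() {
            return failure;
        }
    }

    private final State rootState;

    SafeTransition(State rootState) {
        this.rootState = rootState;
    }

    State getState(State currentState, Character character) {
        State newCurrentState = currentState.nextState(character);
        while (newCurrentState == null) {
            State fallback = currentState.failure();
            if (fallback == null) {   // guard: never follow a null failure link
                return rootState;     // restart matching from the root
            }
            currentState = fallback;
            newCurrentState = currentState.nextState(character);
        }
        return newCurrentState;
    }
}
```

Whether returning the root is the correct recovery (versus fixing the failure links at construction time) is a design decision for the maintainers; the guard merely prevents the crash.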

Improving performance with TCharObjectHashMap

Hi,
Thanks for this excellent library! We've just started using it in all our clients for our backup service https://degoo.com and it has helped us speed up our progress calculations a lot.

When profiling I noticed that (at least in our use case) it spends quite a lot of time calling this.success.get(character) in State. In our case >50% of all execution time is spent there. My hunch is that this could be improved quite a lot by using Trove's TCharObjectHashMap. It would add an external dependency but it might be worth it if the performance enhancement is big enough. If adding an external dependency is not an option perhaps implementing something similar to TCharObjectHashMap would be a better solution. Let me know if we can be of any assistance with this.

Regards
Carl Hasselskog
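If an external dependency on Trove is not wanted, "something similar to TCharObjectHashMap" could look like the sketch below: an open-addressing map keyed on a primitive char[] array, which avoids the Character boxing and per-entry node allocations behind success.get(character). This is an illustrative minimal implementation (no removal, power-of-two table with linear probing), not the library's code or Trove's.

```java
// Sketch of a primitive-char-keyed map in the spirit of Trove's
// TCharObjectHashMap. Open addressing over parallel arrays; the table is
// resized at 50% load so a probe always terminates at a free slot.
public class CharObjectMap<V> {
    private char[] keys;
    private Object[] values;
    private boolean[] used;
    private int size;

    public CharObjectMap() {
        keys = new char[16];
        values = new Object[16];
        used = new boolean[16];
    }

    public void put(char key, V value) {
        if (size * 2 >= keys.length) {
            resize();
        }
        int i = indexOf(key);
        if (!used[i]) {
            used[i] = true;
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    public V get(char key) {
        int i = indexOf(key);
        return used[i] ? (V) values[i] : null;
    }

    // Linear probing over a power-of-two table: no boxing, no Node objects.
    private int indexOf(char key) {
        int mask = keys.length - 1;
        int i = key & mask;
        while (used[i] && keys[i] != key) {
            i = (i + 1) & mask;
        }
        return i;
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        char[] oldKeys = keys;
        Object[] oldValues = values;
        boolean[] oldUsed = used;
        keys = new char[oldKeys.length * 2];
        values = new Object[keys.length];
        used = new boolean[keys.length];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                put(oldKeys[i], (V) oldValues[i]);
            }
        }
    }
}
```

Whether this actually beats HashMap<Character, State> would need benchmarking on the real workload.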

Cater for callback handler on parseText

It must be possible to call parseText with a callback handler. In this situation, emits will not be stored in a list; instead, the callback handler will be invoked for every emit.

To guarantee backwards compatibility, the system must use a default handler that stores the emits as it does now and returns the list at the end.
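The proposal above can be sketched as follows. All names here are illustrative, not the library's API: a handler interface is notified per emit, and the backwards-compatible overload is built on top of it using a default collecting handler. The stand-in parseText just scans for substrings; the real implementation would run the Aho-Corasick automaton.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the callback-handler idea for parseText().
public class CallbackSketch {

    interface EmitHandler {
        void emit(String emit);
    }

    // Default handler: collects emits so the caller still gets a list back.
    static class CollectingEmitHandler implements EmitHandler {
        final List<String> emits = new ArrayList<>();

        @Override
        public void emit(String emit) {
            emits.add(emit);
        }
    }

    // Stand-in for Trie.parseText(text, handler): invokes the callback for
    // every keyword found instead of storing anything itself.
    static void parseText(String text, List<String> keywords, EmitHandler handler) {
        for (String keyword : keywords) {
            if (text.contains(keyword)) {
                handler.emit(keyword);   // callback instead of storing
            }
        }
    }

    // Backwards-compatible overload, built on the callback variant.
    static List<String> parseText(String text, List<String> keywords) {
        CollectingEmitHandler handler = new CollectingEmitHandler();
        parseText(text, keywords, handler);
        return handler.emits;
    }
}
```

A caller that only needs side effects (logging, counting) passes its own handler and pays no list-allocation cost; existing callers keep using the list-returning overload unchanged.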
