
aho-corasick's Introduction

aho-corasick

A library for finding occurrences of many patterns at once, with SIMD acceleration in some cases. This library provides multiple pattern search principally through an implementation of the Aho-Corasick algorithm, which builds a finite state machine for executing searches in linear time. Features include case insensitive matching, overlapping matches, fast searching via SIMD, optional full DFA construction, and search & replace in streams.

Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/aho-corasick

Usage

Run cargo add aho-corasick to automatically add this crate as a dependency in your Cargo.toml file.

Example: basic searching

This example shows how to search for occurrences of multiple patterns simultaneously. Each match includes the pattern that matched along with the byte offsets of the match.

use aho_corasick::{AhoCorasick, PatternID};

let patterns = &["apple", "maple", "Snapple"];
let haystack = "Nobody likes maple in their apple flavored Snapple.";

let ac = AhoCorasick::new(patterns).unwrap();
let mut matches = vec![];
for mat in ac.find_iter(haystack) {
    matches.push((mat.pattern(), mat.start(), mat.end()));
}
assert_eq!(matches, vec![
    (PatternID::must(1), 13, 18),
    (PatternID::must(0), 28, 33),
    (PatternID::must(2), 43, 50),
]);

Example: ASCII case insensitivity

This is like the previous example, but matches Snapple case insensitively using AhoCorasickBuilder:

use aho_corasick::{AhoCorasick, PatternID};

let patterns = &["apple", "maple", "snapple"];
let haystack = "Nobody likes maple in their apple flavored Snapple.";

let ac = AhoCorasick::builder()
    .ascii_case_insensitive(true)
    .build(patterns)
    .unwrap();
let mut matches = vec![];
for mat in ac.find_iter(haystack) {
    matches.push((mat.pattern(), mat.start(), mat.end()));
}
assert_eq!(matches, vec![
    (PatternID::must(1), 13, 18),
    (PatternID::must(0), 28, 33),
    (PatternID::must(2), 43, 50),
]);

Example: replacing matches in a stream

This example shows how to execute a search and replace on a stream without loading the entire stream into memory first.

use aho_corasick::AhoCorasick;

let patterns = &["fox", "brown", "quick"];
let replace_with = &["sloth", "grey", "slow"];

// In a real example, these might be `std::fs::File`s instead. All you need to
// do is supply a pair of `std::io::Read` and `std::io::Write` implementations.
let rdr = "The quick brown fox.";
let mut wtr = vec![];

let ac = AhoCorasick::new(patterns).unwrap();
ac.stream_replace_all(rdr.as_bytes(), &mut wtr, replace_with)
    .expect("stream_replace_all failed");
assert_eq!(b"The slow grey sloth.".to_vec(), wtr);

Example: finding the leftmost first match

In the textbook description of Aho-Corasick, its formulation is typically structured such that it reports all possible matches, even when they overlap with one another. In many cases, overlapping matches may not be desired, such as when finding all successive non-overlapping matches, as you might with a standard regular expression.

Unfortunately the "obvious" way to modify the Aho-Corasick algorithm to do this doesn't always work in the expected way, since it will report matches as soon as they are seen. For example, consider matching the regex Samwise|Sam against the text Samwise. Most regex engines (that are Perl-like, or non-POSIX) will report Samwise as a match, but the standard Aho-Corasick algorithm modified for reporting non-overlapping matches will report Sam.

A novel contribution of this library is the ability to change the match semantics of Aho-Corasick (without additional search time overhead) such that Samwise is reported instead. For example, here's the standard approach:

use aho_corasick::AhoCorasick;

let patterns = &["Samwise", "Sam"];
let haystack = "Samwise";

let ac = AhoCorasick::new(patterns).unwrap();
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("Sam", &haystack[mat.start()..mat.end()]);

And now here's the leftmost-first version, which matches how a Perl-like regex will work:

use aho_corasick::{AhoCorasick, MatchKind};

let patterns = &["Samwise", "Sam"];
let haystack = "Samwise";

let ac = AhoCorasick::builder()
    .match_kind(MatchKind::LeftmostFirst)
    .build(patterns)
    .unwrap();
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("Samwise", &haystack[mat.start()..mat.end()]);

In addition to leftmost-first semantics, this library also supports leftmost-longest semantics, which match the POSIX behavior of a regular expression alternation. See MatchKind in the docs for more details.
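For comparison, here is a minimal sketch of leftmost-longest semantics. The pattern order is reversed so that leftmost-first would report Sam, while leftmost-longest still reports the longer Samwise:

use aho_corasick::{AhoCorasick, MatchKind};

let patterns = &["Sam", "Samwise"];
let haystack = "Samwise";

let ac = AhoCorasick::builder()
    .match_kind(MatchKind::LeftmostLongest)
    .build(patterns)
    .unwrap();
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("Samwise", &haystack[mat.start()..mat.end()]);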

Minimum Rust version policy

This crate's minimum supported rustc version is 1.60.0.

The current policy is that the minimum Rust version required to use this crate can be increased in minor version updates. For example, if crate 1.0 requires Rust 1.20.0, then crate 1.0.z for all values of z will also require Rust 1.20.0 or newer. However, crate 1.y for y > 0 may require a newer minimum version of Rust.

In general, this crate will be conservative with respect to the minimum supported version of Rust.

FFI bindings

aho-corasick's Issues

Removing unused/redundant code

Hey there, I have noticed that some struct fields aren't used.
For example, https://github.com/BurntSushi/aho-corasick/blob/master/src/ahocorasick.rs#L1222
And some small things could be simplified: https://github.com/BurntSushi/aho-corasick/blob/master/src/automaton.rs#L192

Both ends of that condition call the same method.
self.prefilter() already returns an Option, so the method could be a one-liner that just returns the result of self.prefilter().
If changes like that are significant enough, I will make a PR and give my two cents.

API question regarding stream_replace_all

Hi,

First of all, cheers, very handy repo!

I wanted to ask, for the search and replace example -

  1. Why is the string to be searched passed as bytes? I see that the regular search examples take the string itself.
  2. Why is the result vector passed in? What's the harm in returning it, or better, the replaced string? (Pardon any Rust noobness on my part; this just feels like it belongs in C.)

As it is, the usage looks a bit awkward. For example, one needs to -

String::from_utf8(replaced_vector).unwrap()

to get the replaced string.

Thanks,
Eli
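A minimal sketch of such a convenience wrapper, assuming the replaced output is valid UTF-8 (the function name replace_all_to_string is illustrative, not part of the crate's API):

use std::io;

use aho_corasick::AhoCorasick;

fn replace_all_to_string<R: io::Read>(
    ac: &AhoCorasick,
    rdr: R,
    replace_with: &[&str],
) -> io::Result<String> {
    let mut wtr = Vec::new();
    // stream_replace_all works on bytes, so converting back to a String
    // requires a UTF-8 check afterwards.
    ac.stream_replace_all(rdr, &mut wtr, replace_with)?;
    String::from_utf8(wtr).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}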

Using aho-corasick with byte patterns

Hi there,

I was hoping to see if it was possible to use this crate with byte patterns (i.e. patterns of &[u8] and searching the same). It seems that right now, the crate is pretty firmly tied to patterns of type String - is there any way to use byte patterns? I can sort of hack around it with the from_utf8_unchecked function on String, but that's ... not a real solution 😺
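For what it's worth, more recent releases accept any pattern type implementing AsRef<[u8]>, so byte-slice patterns and haystacks work directly; a minimal sketch:

use aho_corasick::AhoCorasick;

let patterns: &[&[u8]] = &[b"\x00\x01", b"abc"];
let haystack: &[u8] = b"xx\x00\x01yyabc";

let ac = AhoCorasick::new(patterns).unwrap();
// "\x00\x01" matches at 2..4 and "abc" at 6..9.
let ends: Vec<usize> = ac.find_iter(haystack).map(|m| m.end()).collect();
assert_eq!(ends, vec![4, 9]);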

Error "use of undeclared trait name `IntoIterator`" when building.

I'm building rusti, which is based on this project, and got this error preventing me from installing it. It happens with the 0.3.0 version; the whole log is listed below:

/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:208:22: 208:42 error: use of undeclared trait name `IntoIterator`
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:208             where I: IntoIterator<Item=P> {
                                                                                                                         ^~~~~~~~~~~~~~~~~~~~
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:221:22: 221:42 error: use of undeclared trait name `IntoIterator`
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:221             where I: IntoIterator<Item=P> {
                                                                                                                         ^~~~~~~~~~~~~~~~~~~~
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:467:55: 467:75 error: use of undeclared trait name `IntoIterator`
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:467     fn from_iter<T>(it: T) -> AcAutomaton<S> where T: IntoIterator<Item=S> {
                                                                                                                                                          ^~~~~~~~~~~~~~~~~~~~
error: aborting due to 3 previous errors
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/named.rs:119:9: 119:21 error: failed to resolve. Use of undeclared type or module `Self`
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/named.rs:119         Self::new_in(&env::temp_dir())
                                                                                                          ^~~~~~~~~~~~
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/named.rs:119:9: 119:21 error: unresolved name `Self::new_in`
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/named.rs:119         Self::new_in(&env::temp_dir())
                                                                                                          ^~~~~~~~~~~~
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:47:9: 47:21 error: failed to resolve. Use of undeclared type or module `Self`
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:47         Self::new_in(&env::temp_dir())
                                                                                                           ^~~~~~~~~~~~
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:47:9: 47:21 error: unresolved name `Self::new_in`
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:47         Self::new_in(&env::temp_dir())
                                                                                                           ^~~~~~~~~~~~
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:65:9: 65:24 errorBuild failed, waiting for other jobs to finish...
: failed to resolve. Use of undeclared type or module `Self`
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:65         Self::shared_in(&env::temp_dir(), count)
                                                                                                           ^~~~~~~~~~~~~~~
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:65:9: 65:24 error: unresolved name `Self::shared_in`
/Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:65         Self::shared_in(&env::temp_dir(), count)
                                                                                                           ^~~~~~~~~~~~~~~
error: aborting due to 6 previous errors
Could not compile `aho-corasick`.

To learn more, run the command again with --verbose.

The relevant code is here: https://github.com/BurntSushi/aho-corasick/blob/master/src%2Flib.rs#L208

Understanding the DFA vs. NFA implementations of the algorithm

Hi,

Someone at the company that funded the Python wrapper for aho-corasick is going to give an internal talk about it, and it was suggested that they include a slide or two about why the DFA approach is faster than the NFA approach (they have existing slides showing performance gains from increasingly less naive solutions to multi-pattern matching).

Do you have any pointers to a summary/additional info about why the DFA approach is faster, especially in this particular context?

Thanks!

Add support for async/non-blocking

Unless I'm missing something, aho-corasick either requires the entire haystack, or a Read on which it blocks.

It would be nice if you could feed it bytes as they come in.

A lot of two-letter combinations break ASCII case-insensitivity

The following script tests all ASCII two-letter combinations by searching for the lowercase needle in the uppercase haystack. I would expect this script to output nothing; instead, a couple of combinations fail with 0.7.9.

The real-world test that triggered this investigation was the haystack SECRET_KEY with the needle secret_key.

extern crate aho_corasick; // 0.7.9

fn main() {
    for c in b'a'..b'z' {
        for c2 in b'a'..b'z' {
            let c = c as char;
            let c2 = c2 as char;
            let needle = format!("{}{}", c, c2).to_lowercase();
            let haystack = needle.to_uppercase();
            let ac = aho_corasick::AhoCorasickBuilder::new()
                .ascii_case_insensitive(true)
                .build(&[&needle]);
            let finds: Vec<_> = ac.find_iter(&haystack).collect();
            if finds.is_empty() {
                println!("needle = {}, haystack = {} => broken", needle, haystack);
            }
        }
    }
}

Playground

Motion to grant V1.0.0 status

I move that you version the current (or next) release as version 1.0.0. Recall the Semantic Versioning spec:

  1. Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable.

  2. Version 1.0.0 defines the public API. The way in which the version number is incremented after this release is dependent on this public API and how it changes.

Commits to this repo are now sparse. The last commit was 7 months ago. This is strong evidence of its API stability. Indeed, at this point, given the importance of this crate and how widespread its use is, changing the public API at any time is impractical anyway. The crate is functionally at v1. Officially bumping the version would just align the version number to its de facto use.

crate overhaul

I think this crate is due for overhaul. I think it has existed more or less in its current form for about three years now, but there are some things I'd like to add in order to lay the ground work for better regex optimizations. Here is what I'm thinking:

  • Switch to a more builder oriented API.
  • Get rid of the Automaton trait. Or at least, don't export it.
  • Fix the dense/sparse swap that has been driving me nuts (see #18).
  • Refine the use of memchr. It is currently quite sub-optimal on non-ASCII text.
  • Employ ideas to make the automaton more memory efficient, or at least, make the trade off more easily configurable.
  • Add support for "leftmost first" and "leftmost longest" match semantics. This is an important piece for using aho-corasick effectively in a regex engine. We will retain the existing match semantics as well. I believe the leftmost semantics will require small tweaks to how the automaton is constructed (e.g., by causing match states to die instead of automatically wrapping around to start again).
  • Don't require the automaton to hold on to the strings (see #19).
  • Try switching to criterion for benchmarks.

cc @Marwes

Feat Req: Making it easier to statically construct an AcAutomaton

Hi there!

I am interested in creating something like the equivalent of phf_codegen, but for AcAutomaton.

Use case: quickly searching CommonCrawl's dataset for emojis!

To construct the AcAutomaton struct statically, I need to create State objects, but this is not possible. I guess the fields would have to be public for my current strategy to work as well... What do you think about potential changes to make this possible?

Here is the idea of what I'd like to achieve: http://play.integer32.com/?gist=a218ca3476bfc40fb39ff6ac79da8572

thanks!
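Not an answer to compile-time construction, but a common workaround is to build the automaton once, lazily, at first use. Here is a sketch using std::sync::OnceLock (stable since Rust 1.70) and the current AhoCorasick API rather than AcAutomaton; the pattern list is a placeholder for the real emoji table:

use std::sync::OnceLock;

use aho_corasick::AhoCorasick;

static EMOJI_SEARCHER: OnceLock<AhoCorasick> = OnceLock::new();

// Builds the searcher on the first call and reuses it afterwards.
fn emoji_searcher() -> &'static AhoCorasick {
    EMOJI_SEARCHER.get_or_init(|| AhoCorasick::new(&["😀", "🚀", "🦀"]).unwrap())
}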

Build failure using Rustc 1.28.0-1.39.0 because of cfg(doctest)

933b6a7, released as part of v0.7.11, includes cfg(doctest), which wasn't stabilized until Rust 1.40.0. I get a build failure in uom using Rust 1.38.0 when running cargo clippy --all --tests. Based on the MSRV, I would expect this to still work with 1.38.0.

Reproduced locally:

$ RUSTFLAGS="-D warnings" cargo +1.38.0 clippy --all --tests
    Checking aho-corasick v0.7.12
    Checking num-traits v0.2.12
    Checking serde_json v1.0.55
error[E0658]: `cfg(doctest)` is experimental and subject to change
   --> C:\Users\mjboutin\.local\share\cargo\registry\src\github.com-1ecc6299db9ec823\aho-corasick-0.7.12\src\lib.rs:190:7
    |
190 | #[cfg(doctest)]
    |       ^^^^^^^
    |
    = note: for more information, see https://github.com/rust-lang/rust/issues/62210

error[E0658]: `cfg(doctest)` is experimental and subject to change
   --> C:\Users\mjboutin\.local\share\cargo\registry\src\github.com-1ecc6299db9ec823\aho-corasick-0.7.12\src\lib.rs:194:7
    |
194 | #[cfg(doctest)]
    |       ^^^^^^^
    |
    = note: for more information, see https://github.com/rust-lang/rust/issues/62210

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0658`.
error: Could not compile `aho-corasick`.
warning: build failed, waiting for other jobs to finish...
error: build failed

Empty pattern never matches

This is probably intended, but I can't find it in the docs. It's a difference from other searching methods, such as str::contains and Regex::find, so at the very least it should be mentioned...

(Of course I'd prefer to produce a match, but backwards compatibility might be a pain.)

big simplification to the leftmost match NFA construction

Hello @BurntSushi, I am trying to understand how the NFA states are constructed for leftmost match searching, and I cannot understand these lines from QueuedState<S>::next_match_at_depth.

aho-corasick/src/nfa.rs

Lines 979 to 981 in f8197af

let depth = nfa.state(next).depth
- nfa.state(next).get_longest_match_len().unwrap()
+ 1;

We are in a trie, so isn't the length of the longest match of a state equal to the state's depth? Won't depth always be equal to 1? Thanks.

Memory allocation failed

use aho_corasick::AhoCorasickBuilder;
use std::fs::File;
use std::io::Read;

fn main() {
    let mut f = File::open("keyword.txt").unwrap();
    let mut dat = String::new();
    f.read_to_string(&mut dat).unwrap();
    let words = dat.split("\n");
    let mut v = vec![];
    for i in words {
        v.push(i);
    }
    AhoCorasickBuilder::new()
        .ascii_case_insensitive(true)
        .build(&v);
}

memory allocation of 1629519872 bytes failed
[1] 13458 abort (core dumped) cargo run

free -h
              total        used        free      shared  buff/cache   available
Mem:          7.6Gi       4.0Gi       2.9Gi       492Mi       752Mi       2.9Gi
Swap:            0B          0B          0B

https://github.com/observerss/textfilter/blob/master/keywords

Masked searches?

Is it possible to perform masked searches with aho-corasick? My use-case is to search for byte patterns where certain bytes can be arbitrary and should not be matched, e.g. A1 ?? ?? ?? ?? 53 33 DB 3B C3 74 ?? 68 where all ?? bytes can be any value in the haystack for the match to be valid. I'm not experienced in search algorithms so I'm not sure which ones can be generalized to this case.

Will bitmap and/or path compression help?

Bitmap compression and path compression are common performance optimizations in Aho-Corasick implementations. I think this was first described in http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.9799. I'm less familiar with bitmap compression, so I'm not comfortable saying whether it can apply to this implementation.

I've implemented path compression in another Aho-Corasick library and used a simpler version of path compression described in http://deepness-lab.org/pubs/hpsr11_space.pdf. The insight is to "compress only branches whose states have a single outgoing forward transition and no incoming failure transitions." The absence of failure transitions makes it much easier to implement the compression. Reasonable memory and runtime improvements are observed for this simpler path compression.

Something to consider is that path compression is designed to throw away states as they are compressed into a superstate. This would need to play nicely with the current implementation that uses indices into the states vector.

I wonder what your thoughts are on bitmap compression and/or path compression?

For Python wrapper, is it worth switching to new NFA, or sticking to DFA?

I explicitly chose DFA for the previous version with the goal of maximum runtime performance, since that's the typical Python tradeoff for bulk data processing: live with slow setup in return for fast bulk operations.

It sounds like the new contiguous NFA is a lot faster in 1.0, though. What do you think?
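A minimal sketch of making the choice explicit with the 1.0 builder API. By default the builder picks an implementation heuristically; forcing a specific kind can make construction fail, which is one reason kind takes an Option and build returns a Result:

use aho_corasick::{AhoCorasick, AhoCorasickKind};

let patterns = &["fox", "brown", "quick"];

let nfa = AhoCorasick::builder()
    .kind(Some(AhoCorasickKind::ContiguousNFA))
    .build(patterns)
    .unwrap();
let dfa = AhoCorasick::builder()
    .kind(Some(AhoCorasickKind::DFA))
    .build(patterns)
    .unwrap();
assert!(nfa.is_match("the quick brown fox"));
assert!(dfa.is_match("the quick brown fox"));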

Duplicate match when using find_overlapping_iter() with ascii_case_insensitive

Hello, I think there may be a bug in find_overlapping_iter() with ascii_case_insensitive. I narrowed it down to the following example:

fn main() {
    use aho_corasick::AhoCorasickBuilder;

    let patterns = &["abc", "def", "abcdef"];
    let haystack = "abcdef";

    let ac = AhoCorasickBuilder::new()
        .ascii_case_insensitive(true)
        .build(patterns);
    for mat in ac.find_overlapping_iter(haystack) {
        println!("{:?}", mat);
    }
}

I'd expect a match for each of "abc", "def", and "abcdef", but the output has two entries for "def":

Match { pattern: 0, len: 3, end: 3 }
Match { pattern: 2, len: 6, end: 6 }
Match { pattern: 1, len: 3, end: 6 }
Match { pattern: 1, len: 3, end: 6 }

If I change the line to .ascii_case_insensitive(false) then it outputs what I'd expect:

Match { pattern: 0, len: 3, end: 3 }
Match { pattern: 2, len: 6, end: 6 }
Match { pattern: 1, len: 3, end: 6 }

Is this a bug or am I misunderstanding how the overlapping iter should work?

Make len() available on `Match`

It looks like Match stores a len, but that is not made available except through recomputing it as m.end() - m.start().
How about just making it available as m.len()?
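In the meantime, a small extension trait can paper over this; a sketch (the trait name is made up, not part of the crate):

use aho_corasick::Match;

trait MatchLenExt {
    fn len(&self) -> usize;
}

impl MatchLenExt for Match {
    // Recomputes the length from the reported offsets.
    fn len(&self) -> usize {
        self.end() - self.start()
    }
}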

Compile time patterns

Is it possible to create an AhoCorasick pattern using const functions at compile time?

Potential performance improvement, maybe

At https://github.com/BurntSushi/aho-corasick/blob/master/src/classes.rs#L46, the alphabet_len is calculated each time the function is called. This function is called in (what appears to me to be) a hot inner function, next_state(). It seems like this value doesn't change once the ByteClasses are constructed, so there's no need to recalculate it each time; it could be recomputed and cached only when the ByteClasses change.

(I imagine the compiler might be smart enough to figure this out? But then again maybe not.)
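To illustrate the suggestion, here is a hypothetical sketch of the caching idea using made-up types (not the crate's actual internals): compute the value once when the classes are built and store it, instead of recomputing it on every call.

// Hypothetical stand-in for the crate's ByteClasses type.
struct CachedByteClasses {
    classes: [u8; 256],
    alphabet_len: usize, // cached at construction time
}

impl CachedByteClasses {
    fn new(classes: [u8; 256]) -> CachedByteClasses {
        // One more than the largest equivalence class value.
        let alphabet_len =
            classes.iter().copied().max().map_or(0, |m| m as usize + 1);
        CachedByteClasses { classes, alphabet_len }
    }

    #[inline]
    fn alphabet_len(&self) -> usize {
        self.alphabet_len // O(1); no scan in the hot path
    }
}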

Question: comparison with a regex union

At what point would this crate be preferred over writing a regex union with the regex crate? Would that depend on a certain number of inputs, or would this algorithm always be preferable?

Why does this library return byte offsets instead of codepoint offsets?

I'm using your package to match some patterns against long pieces of text, sometimes containing non-ASCII characters. When I investigate the matches I find that the offsets correspond to byte-offsets, not character offsets. This is confusing. It would be great to get character offsets instead, or as an option.

character at position 36 is code point 8217 in this example:
The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll

So a match search for eBook correctly returns 22 as the offset, while a match search for Adventures returns 41, though it should be 39.
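A minimal sketch of converting a byte offset to a character offset after the fact, using this example's haystack (the searcher itself always reports byte offsets):

use aho_corasick::AhoCorasick;

// Counts the characters that precede a byte offset.
fn char_offset(haystack: &str, byte_offset: usize) -> usize {
    haystack[..byte_offset].chars().count()
}

fn main() {
    let haystack = "The Project Gutenberg eBook of Alice’s Adventures in Wonderland";
    let ac = AhoCorasick::new(&["Adventures"]).unwrap();
    let mat = ac.find(haystack).expect("should have a match");
    assert_eq!(mat.start(), 41); // byte offset (’ occupies three bytes in UTF-8)
    assert_eq!(char_offset(haystack, mat.start()), 39); // character offset
}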

Build errors on aarch64-apple-darwin

The issue seems to have been introduced in 45a4ee7.

aho-corasick on HEAD (3852632) is 📦 v0.7.15 via 🦀 v1.51.0
❯ git checkout 45a4ee770e45805bd79d391031eacc846262a645
Previous HEAD position was 3852632 0.7.15
HEAD is now at 45a4ee7 edition: run 'cargo fix --edition --all'

aho-corasick on HEAD (45a4ee7) is 📦 v0.7.15 via 🦀 v1.51.0
❯ cargo check --target aarch64-apple-darwin
    Updating crates.io index
    Checking aho-corasick v0.7.15 (/Users/aspen/Downloads/aho-corasick)
error[E0433]: failed to resolve: use of undeclared crate or module `packed`
 --> src/packed/teddy/mod.rs:4:9
  |
4 | pub use packed::teddy::fallback::Builder;
  |         ^^^^^^ use of undeclared crate or module `packed`

error[E0433]: failed to resolve: use of undeclared crate or module `packed`
 --> src/packed/teddy/mod.rs:6:9
  |
6 | pub use packed::teddy::fallback::Teddy;
  |         ^^^^^^ use of undeclared crate or module `packed`

error[E0433]: failed to resolve: use of undeclared crate or module `packed`
  --> src/packed/teddy/mod.rs:17:9
   |
17 |     use packed::pattern::Patterns;
   |         ^^^^^^ use of undeclared crate or module `packed`

error[E0432]: unresolved import `Match`
  --> src/packed/teddy/mod.rs:18:9
   |
18 |     use Match;
   |         ^^^^^ no external crate `Match`

error[E0432]: unresolved import `crate::packed::teddy::Teddy`
 --> src/packed/api.rs:5:34
  |
5 | use crate::packed::teddy::{self, Teddy};
  |                                  ^^^^^ no `Teddy` in `packed::teddy`

error[E0433]: failed to resolve: could not find `Builder` in `teddy`
   --> src/packed/api.rs:280:16
    |
280 |         teddy::Builder::new()
    |                ^^^^^^^ not found in `teddy`
    |
help: consider importing one of these items
    |
1   | use crate::dfa::Builder;
    |
1   | use crate::nfa::Builder;
    |
1   | use crate::packed::Builder;
    |
1   | use crate::prefilter::Builder;
    |
      and 1 other candidate

error[E0412]: cannot find type `Patterns` in this scope
  --> src/packed/teddy/mod.rs:28:33
   |
28 |         pub fn build(&self, _: &Patterns) -> Option<Teddy> {
   |                                 ^^^^^^^^ not found in this scope
   |
help: consider importing this struct
   |
17 |     use crate::packed::pattern::Patterns;
   |

error[E0412]: cannot find type `Patterns` in this scope
  --> src/packed/teddy/mod.rs:47:17
   |
47 |             _: &Patterns,
   |                 ^^^^^^^^ not found in this scope
   |
help: consider importing this struct
   |
17 |     use crate::packed::pattern::Patterns;
   |

error: aborting due to 8 previous errors

Some errors have detailed explanations: E0412, E0432, E0433.
For more information about an error, try `rustc --explain E0412`.
error: could not compile `aho-corasick`

To learn more, run the command again with --verbose.

Tracking issue for full-unicode case-insensitive matching

The docs for ascii_case_insensitive state:

NOTE: In the future, support for full Unicode case insensitivity may be added, but ASCII case insensitivity is comparatively much simpler to add.

I don't currently need this feature but it could be nice to have, hence this tracking issue.

For those interested in a work-around, see this and this.

FullAcAutomaton fails depending of the order of the patterns

let aut = AcAutomaton::new(vec!["libcore/", "libstd/"]);
let aut = FullAcAutomaton::new(aut);
let matches: Vec<_> = aut.find_overlapping("libcore/char/methods.rs").collect();
assert_eq!(matches, vec![Match {pati: 0, start: 0, end: 8}]);

let aut = AcAutomaton::new(vec!["libstd/", "libcore/"]);
let aut = FullAcAutomaton::new(aut);
let matches: Vec<_> = aut.find_overlapping("libcore/char/methods.rs").collect();
assert_eq!(matches, vec![Match {pati: 1, start: 0, end: 8}]);

The second assertion is failing.
If I comment out the line let aut = FullAcAutomaton::new(aut);, everything is OK.
This bug is a followup on:
BurntSushi/ripgrep#1079

Support non-u8 slices

Currently, patterns and haystacks only support u8 slices.
Please support other numeric types.

A u8 slice is useful for supporting both String and bytes.
However, it is too small to represent other sequences, such as word IDs.

AhoCorasick::stream_find_iter may fail if the needle is split between buffers.

This is a regression from aho_corasick 0.7.4 to 0.7.5, and the bug still happens in the most recently tested version, 0.7.13.

Here is a test-case to reproduce:

use std::io::Read;

// NOTE: The test only fails if this ends with j
pub const MAGIC: [u8; 5] = *b"1234j";


// NOTE: The test fails for values in 8188..=8191
// These values put the string to search across two calls to read, because the buffer size is 8192 by default
pub const BEGIN: usize = 8191;


#[derive(Default)]
/// This is just a structure that implements Read.
/// The read implementation simulates a file filled with 0, except for the MAGIC string at offset BEGIN
struct R {
    read: usize,
}

impl Read for R {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        //dbg!(buf.len());
        if self.read > 100000 {
            return Ok(0);
        }
        let mut from = 0;
        if self.read < BEGIN {
            from = buf.len().min(BEGIN - self.read);
            for x in 0..from {
                buf[x] = 0;
            }
            self.read += from;
        }
        if self.read >= BEGIN && self.read <= BEGIN + MAGIC.len() {
            let to = buf.len().min(BEGIN + MAGIC.len() - self.read + from);
            if to > from {
                buf[from..to]
                    .copy_from_slice(&MAGIC[self.read - BEGIN..self.read - BEGIN + to - from]);
                self.read += to - from;
                from = to;
            }
        }
        for x in from..buf.len() {
            buf[x] = 0;
            self.read += 1;
        }
        Ok(buf.len())
    }
}

fn main() -> std::io::Result<()> {
    let aut = aho_corasick::AhoCorasick::new(&[&MAGIC]);

    // While reading from a vector, it works:
    let mut buf = vec![];
    R::default().read_to_end(&mut buf)?;
    let from_whole = aut.find_iter(&buf).next().unwrap().start();

    //But using stream_find_iter fails!
    let mut file = R::default();
    let begin = aut
        .stream_find_iter(&mut file)
        .next()
        .expect("NOT FOUND!!!!")? // Panic here
        .start();
    assert_eq!(from_whole, begin);
    Ok(())
}

Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=45d7da0d4705ecafd8fd7804676bbfc8

(This was found because the cpp_macro crate uses stream_find_iter to find the metadata in a binary file, and this randomly fails.)

Enabling DFA breaks ASCII case insensitivity

It appears that enabling dfa stops ascii_case_insensitive from working - or are these two features intended to be mutually exclusive?

The below code will fail with dfa set to true, but will pass with it set to false.

extern crate aho_corasick;

use aho_corasick::AhoCorasickBuilder;

fn main() {
    let patterns = &["Samwise"];
    let haystack = "SAMWISE.abcd";

    let ac = AhoCorasickBuilder::new()
        .ascii_case_insensitive(true)
        .dfa(true)
        .build(patterns);
    let mat = ac.find(haystack).expect("should have a match");
    assert_eq!("SAMWISE", &haystack[mat.start()..mat.end()]);
}

Apologies in advance if this is in the docs and I am just being unobservant.

RFC: remove configuration knobs

This RFC is basically a mirror of BurntSushi/regex-automata#7

The idea here is that I would like to make premultiply=true, byte_classes=true the only option one can choose for DFAs. (It is currently the default for DFAs.) I think it was a mistake to expose these knobs, as disabling them doesn't really confer any benefits, but they are quite costly in terms of code size and clarity.

Are there any strenuous objections to removing these options?

Full dense automaton fails to find a match

The following program triggers no assertion failures, yet the two assertions contradict each other:

Long program, see below for more minimal example
extern crate aho_corasick;
use aho_corasick::{Automaton, AcAutomaton, Dense};

fn main() {
    let haystack: &'static [u8] = &[73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 51,
    0, 49, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 50, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 51, 0, 46, 0, 48, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 51, 0, 46, 0, 49, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 51, 0, 46, 0, 50, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 56, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 51, 0, 46, 0, 51, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 52, 0, 46, 0, 48, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 52, 0, 46, 0, 49, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 52, 0, 46, 0, 50, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 57, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 48, 0, 51, 0, 52, 0, 46, 0, 51, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 48, 0, 51, 0, 53, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48,
    0, 51, 0, 53, 0, 46, 0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48,
    0, 51, 0, 53, 0, 46, 0, 49, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48,
    0, 51, 0, 54, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 51, 0, 55,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 52, 0, 48, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 52, 0, 49, 0, 10, 0, 73, 0, 83, 0, 73,
    0, 49, 0, 48, 0, 48, 0, 52, 0, 50, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48,
    0, 48, 0, 52, 0, 51, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 52,
    0, 52, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 50, 0, 46,
    0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 52, 0, 54, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 52, 0, 55, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 48, 0, 52, 0, 57, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 48, 0, 53, 0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48,
    0, 53, 0, 49, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 57,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 50, 0, 46, 0, 49,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 50, 0, 46, 0, 50,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 50, 0, 46, 0, 51,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 51, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 52, 0, 46, 0, 48, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 52, 0, 46, 0, 49, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 52, 0, 46, 0, 50, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 52, 0, 46, 0, 51, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 53, 0, 53, 0, 10, 0, 73, 0, 83, 0, 73,
    0, 49, 0, 48, 0, 48, 0, 53, 0, 55, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48,
    0, 48, 0, 53, 0, 56, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54,
    0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 49, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 49, 0, 46, 0, 48, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 49, 0, 46, 0, 49, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 50, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 56, 0, 46, 0, 49, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 52, 0, 48, 0, 48, 0, 55, 0, 50, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 51, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 48, 0, 54, 0, 53, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48,
    0, 54, 0, 54, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 55,
    0, 46, 0, 49, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 55,
    0, 46, 0, 50, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 56,
    0, 46, 0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 57,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 57, 0, 46, 0, 48,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 57, 0, 46, 0, 49,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 54, 0, 57, 0, 46, 0, 50,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 55, 0, 48, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 55, 0, 49, 0, 10, 0, 73, 0, 83, 0, 73,
    0, 49, 0, 48, 0, 48, 0, 55, 0, 50, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48,
    0, 48, 0, 55, 0, 51, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 55,
    0, 56, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 55, 0, 57, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 56, 0, 49, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 48, 0, 56, 0, 51, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 48, 0, 56, 0, 52, 0, 46, 0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 51, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 48, 0, 56, 0, 52, 0, 46, 0, 49, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 48, 0, 56, 0, 53, 0, 46, 0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 48, 0, 56, 0, 53, 0, 46, 0, 49, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 48, 0, 56, 0, 55, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48,
    0, 57, 0, 50, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 48, 0, 57, 0, 53,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 48, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 49, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 49, 0, 49, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 49, 0, 50,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 49,
    0, 51, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48,
    0, 49, 0, 52, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48,
    0, 48, 0, 49, 0, 53, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48,
    0, 48, 0, 48, 0, 49, 0, 54, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49,
    0, 48, 0, 48, 0, 48, 0, 50, 0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48,
    0, 49, 0, 48, 0, 48, 0, 48, 0, 51, 0, 56, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 50, 0, 49, 0, 10, 0, 73, 0, 83, 0, 73,
    0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 50, 0, 50, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 50, 0, 51, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 50, 0, 52, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 50, 0, 53,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 50,
    0, 54, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48,
    0, 51, 0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48,
    0, 48, 0, 51, 0, 49, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48,
    0, 48, 0, 48, 0, 51, 0, 50, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49,
    0, 48, 0, 48, 0, 48, 0, 51, 0, 52, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48,
    0, 49, 0, 48, 0, 48, 0, 48, 0, 51, 0, 54, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 51, 0, 55, 0, 10, 0, 73, 0, 83, 0, 73,
    0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 51, 0, 57, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 52, 0, 49, 0, 10, 0, 73,
    0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 52, 0, 54, 0, 10,
    0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 52, 0, 55,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 52,
    0, 56, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48,
    0, 52, 0, 57, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48,
    0, 48, 0, 53, 0, 50, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48,
    0, 48, 0, 48, 0, 54, 0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49,
    0, 48, 0, 48, 0, 48, 0, 54, 0, 50, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48,
    0, 49, 0, 48, 0, 48, 0, 48, 0, 54, 0, 52, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 54, 0, 53, 0, 10, 0, 73, 0, 83, 0, 73,
    0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 54, 0, 55, 0, 10, 0, 73, 0, 83,
    0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 55, 0, 49, 0, 46, 0, 48,
    0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 56,
    0, 50, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48,
    0, 56, 0, 53, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48,
    0, 48, 0, 56, 0, 56, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48,
    0, 48, 0, 48, 0, 57, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48,
    0, 48, 0, 49, 0, 50, 0, 52, 0, 46, 0, 48, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 49, 0, 48, 0, 48, 0, 50, 0, 54, 0, 10, 0, 73, 0, 83, 0, 73, 0, 49,
    0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 57, 0, 48, 0, 10, 0, 73, 0, 83, 0, 73,
    0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 57, 0, 49, 0, 10, 0, 73, 0, 83,
    0, 10, 0];

    let needles: Vec<&'static [u8]> = vec![
        &[73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 53, 0, 48, 0, 54, 0, 54, 0, 57, 0],
        &[73, 0, 83, 0, 73, 0, 49, 0, 48, 0, 49, 0, 48, 0, 48, 0, 48, 0, 57, 0, 49, 0],
        &[73, 0, 83, 0, 73, 0, 49, 0, 53, 0, 49, 0, 55, 0, 57, 0, 54, 0],
    ];

    let automaton = AcAutomaton::<_, Dense>::with_transitions(&needles[..]).into_full();

    let start = 2472;
    let end = start + needles[1].len();
    assert_eq!(&haystack[start..end], needles[1]);
    assert_eq!(automaton.find(&haystack).count(), 0);
    println!("{:?}", &haystack[start..end]);
    println!("{:?}", &needles[1]);
}

I am using aho-corasick 0.6.8.

Without .into_full(), the match is found. With a Sparse rather than Dense automaton, the match is also found.

next_state?

We were using (abusing?) v0.6's exposure of next_state, get_match, and has_match in order to run on streams (running over a chunk/frame at a time). Can this API be brought back?

heap_bytes is inaccurate

I need to load a file containing 70,000 pieces of data.

  • If I comment out aho_corasick::AhoCorasick::new, the program occupies 7M of RES.
  • If I do not comment aho_corasick::AhoCorasick::new, the program will occupy 89M of RES at the beginning and 5M of RES after a period of time.

Is there a way to reduce peak memory usage to run on resource-constrained devices?

Usage of sparse and dense is backwards

Normally in math and computing, 'dense' storage means that memory is allocated for all elements, even if they are unused or zero. 'Sparse' means that only used or non-zero elements occupy space. See dense matrix vs. sparse matrix, dense hash map vs. sparse hash map, etc. Your usage is the other way around, which is confusing.

Discussion: Location for Python wrapper

Hi,

I think we briefly talked on Hacker News about a preliminary experiment I did wrapping your library for Python. After further testing, it appears meaningfully faster than the existing alternatives (pyahocorasick, cyac), so I am going to turn this into a real Python package.

There are two ways this could work:

  • Separate repository.
  • It's included in this repository. This is the approach taken by https://github.com/ritchie46/polars, which has both Rust package and corresponding Python wrapper in the same repository.

In either case I would write the code, so not asking for any of your effort in the first pass.

The benefit of the latter is that new releases of the Rust package could (probably?) automatically trigger Python releases with the latest Rust version. The downside for you is that the Python package would then be on your critical path to releases etc., so it may eventually cause extra work for you in the future.

What do you prefer?

In what circumstances is the SIMD code used?

Hi,

Thanks for the library! Already see some speedups in my prototype benchmarks.

I was curious in what circumstances SIMD will be used. Basically, I am trying to optimize purely for search speed; compile time and memory usage don't matter, so anything I can do to choose fast paths would help.

Substring Value

Hi, how do I use this crate to return a string value, so that "I like maple apples." would return "I like maple" on pati: 0, start: 0, end: 11?
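A minimal sketch of recovering text from the haystack using the reported byte offsets (the exact offsets here assume the pattern maple):

use aho_corasick::AhoCorasick;

fn main() {
    let haystack = "I like maple apples.";
    let ac = AhoCorasick::new(&["maple"]).unwrap();
    let mat = ac.find(haystack).expect("should have a match");
    // The matched substring itself:
    assert_eq!("maple", &haystack[mat.start()..mat.end()]);
    // Everything from the start of the haystack through the end of the match:
    assert_eq!("I like maple", &haystack[..mat.end()]);
}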

Question: Matches surrounded by separator characters

Hi,

I use the aho-corasick algorithm to search for keywords in a text. This example is basically what I use: https://github.com/BurntSushi/aho-corasick#example-basic-searching. Yet, there is one caveat. I need to find out whether a match (i.e. a found keyword) provided by this algorithm is surrounded by separator characters (e.g. " ", "?", "!", ".", "*", ",", ";", ":", "\n", "\r", "(", ")", "„", "“", "<", ">").

e.g.

let patterns = &["apple"];
let haystack = "I love Snapple.";

The word Snapple contains apple, so it is a match, but it is not in the form [separator-character]apple[separator-character], so I want to skip this match in my program.

The aho-corasick algorithm returns matches as byte positions, i.e. (keyword_index, start, end), and my text is UTF-8 with non-ASCII characters.

I can think of several solutions, but I don't find any of them elegant.

Does anyone know a proper Rust solution for this?
Thank you!
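One approach is a post-filter over the matches. Here is a minimal sketch, assuming the match offsets fall on char boundaries (true for ASCII keywords) and using an illustrative is_separator helper that is not part of the crate:

use aho_corasick::AhoCorasick;

fn is_separator(c: char) -> bool {
    matches!(
        c,
        ' ' | '?' | '!' | '.' | '*' | ',' | ';' | ':' | '\n' | '\r'
            | '(' | ')' | '„' | '“' | '<' | '>'
    )
}

fn main() {
    let patterns = &["apple"];
    let haystack = "I love Snapple, but an apple a day.";
    let ac = AhoCorasick::new(patterns).unwrap();
    for mat in ac.find_iter(haystack) {
        // Keep a match only if the characters immediately before and after it
        // are separators (or the match touches the start/end of the text).
        let before_ok = haystack[..mat.start()]
            .chars()
            .next_back()
            .map_or(true, is_separator);
        let after_ok = haystack[mat.end()..]
            .chars()
            .next()
            .map_or(true, is_separator);
        if before_ok && after_ok {
            println!("standalone match at {}..{}", mat.start(), mat.end());
        }
    }
}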
