Giter Club home page Giter Club logo

aho-corasick's Introduction

Aho-Corasick algorithm

public domain tree diagram

Library

This is a proof of concept implementation in c# of the Aho-Corasick algorithm for efficient O(n + m + z) pattern matching in text processing, where n is the length of the text, m is the total length of all patterns, and z is the number of matches found.

Test

The test example usage demonstrates a text replacement process of (12K x 12K) potential replacements operations. It utilizes two key reference files (each containing over 12K lines):

  • 'xREF.txt' random words for redacting and Bates replacement, the cross reference
  • 'STARGATE.dat' for processing public CIA STARGATE document metadata, format is eDiscovery standard, UTF-8

The implementation showcases efficient text processing using parallel and sequential methods, achieving significant performance in both. Note negligible differences in replacement count repeatability using parallel due to overlapping search key substrings in the test data.

The 'ReplacementDetails' struct orchestrates the replacement operation, leveraging the Aho-Corasick algorithm for pattern matching.

Helper methods facilitate loading of input and reference data, execution of search operations, and output generation, with support for parallel processing to enhance performance.

using Aho_Corasick_Helpers;
...

// Defines the structure for managing replacement operations, including file paths, processing times, and match counts.
ReplacementDetails RD = new("xREF.txt", "STARGATE.dat", "STARGATE.xREF.dat") { Start = DateTime.Now };

// Operations to load input, reference data, perform the replacement search, and generate the output.
Helpers.LoadInput(ref RD);
Helpers.LoadXREF(ref RD);
Helpers.PerformSearch(ref RD);
Helpers.WriteOutputParallel(RD);

// Mark the end of processing and display the duration and match count.
RD.End = DateTime.Now;


Console.WriteLine($"Time: {RD.Duration.Seconds}secs \nReplacements: {RD.MatchCount}");

// Time: 1secs
// Replacements: 222194

aho-corasick's People

Contributors

g33k5z avatar

Stargazers

 avatar aA avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.