Giter Club home page Giter Club logo

domainmatcher's Introduction

DomainMatcher Goals

Implement a fast algorithm which can solve the following type of matching problem.

Full

Full is the type of matcher that the input string must exactly equal to the pattern.

Substr

Substr is the type of matcher that the input string must contain the pattern as a sub-string.

Domain

Domain is the type of matcher that the input string must be a sub-domain or itself of the pattern.

Implementation detail

The DomainMatcher is divided into two parts:

  1. full and domain patterns are matched by Rabin-Karp algorithm & minimal perfect hash table;
  2. substr patterns are matched by ac automaton;

Matching problem definition:

  • a domain rule baidu.com can be seen as exact match moc.udiab and moc.udiab. when traversing the domain names in reverse order. And moc.udiab and moc.udiab. should not appear in the middle of the string.
  • a full rule baidu.com can be seen as exact match moc.udiab when traversing the domain names in reverse order. And moc.udiab should not appear in the middle of the string.
  • a substr rule baidu.com is a matching problem that check if baidu.com is substring of the given domain names. substr rules can be matched by ACAutomaton.

Through the above definition, we can merge the full and domain rules together to match. The simplest way is to store these rules in the HashMap. However, when we query, we need to calculate the hash value of the same string and its substrings. This additional overhead can be reduced by rolling hash.

We choose 32bit FNV-prime 16777619 to calculate our rolling hash.

Inspired by "Hash, displace, and compress" algorithm, we can design a minimal perfect hash table through two rounds hashes. The first round of hash is rolling hash, which we get directly from the process of traversing the string. The second round of hash is memhash.

In this way, when checking whether the rule is hit, we only need to calculate the hash and compare it once.

#[inline(always)]
fn lookup(&self, h: RollingHashType, query_string: &str) -> bool {
    let level0_idx = h & self.level0_mask;
    let seed = self.level0[level0_idx as usize] as Level1HashType;
    let level1_idx = seed.mem_hash(query_string) & self.level1_mask;
    return self.rules[self.level1[level1_idx as usize] as usize] == query_string;
}

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.