richardstartin / multi-matcher Goto Github PK

View Code? Open in Web Editor NEW

92.0 1.0 9.0 6.68 MB

simple rules engine

Home Page: https://richardstartin.github.io/posts/classifying-documents

License: Apache License 2.0

Java 96.42% Kotlin 3.58%

classification classifier decision-table decision-tree

multi-matcher's Introduction

multi-matcher

I have often needed to implement tedious classification logic in data processing projects. The requirements are often ambiguous to the extent that it would be difficult to implement them even in SQL, with aspects such as fallback and overlap. This logic often ends up expressed as large blocks of nested if statements which are hard to read or modify and perform poorly. This small project aims to make such classification logic easier, and improve performance too.

usage

Build a generic classification engine

    Classifier<Product, String> classifier = Classifier.<String, Product, String>builder(
                Schema.<String, Product, String>create()
                        .withAttribute("productType", Product::getProductType)
                        .withAttribute("issueDate", Product::getIssueDate, Comparator.naturalOrder().reversed())
                        .withAttribute("productName", Product::getProductName)
                        .withAttribute("availability", Product::getAvailability)
                        .withAttribute("discountedPrice", value -> 0.2 * value.getPrice())
                ).build(Arrays.asList(
                    MatchingConstraint.<String, String>named("rule1") 
                            .eq("productType", "silk")
                            .startsWith("productName", "luxury")
                            .gt("discountedPrice", 1000)
                            .priority(0)
                            .classification("EXPENSIVE_LUXURY_PRODUCTS")
                            .build(),
                    MatchingConstraint.<String, String>named("rule2")
                            .eq("productType", "caviar")
                            .gt("discountedPrice", 100)
                            .priority(1)
                            .classification("EXPENSIVE_LUXURY_PRODUCTS")
                            .build(),
                    MatchingConstraint.<String, String>anonymous()
                            .eq("productName", "baked beans")
                            .priority(2)
                            .classification("CHEAP_FOOD")
                            .build()
                )
            );

Classify

  Product p = getProduct();
  String classification = classifier.classification(p).orElse("UNCLASSIFIED");

multi-matcher's People

Contributors

Stargazers

Watchers

Forkers

cogito-clarus yespon schrepfler murodin andy-wagner wangxin223 kasparthommen paulbares rfarrey

multi-matcher's Issues

Investigate generating code for some matchers

Reorder priorities on construction

It simplifies the code to iterate over overlapping classifications in reverse, but this is much slower than iterating forwards. It should be possible to reorder priorities on construction for query time performance improvements.

Implement String != rules

It's not currently possible to exclude a string value from a classification.

use jdk10

Generate classifier with annotation processing

Currently attributes used for classification are registered separately from the definition of the classified objects. It should be possible to annotate methods and generate a classifier from the definition. This could include simple functionality like scaling values.

The code below would generate a classifier with a weight attribute.

class ToBeClassified {

@Classifiable(scale = 0.5)
public double getWeight() {
...
}

}

Use RoaringBitmap for keys in IntNode

Implement ruleset analyser

If each constraint is associated with only one rule, this library represents a huge waste of space, and it would be better to use a list of rules with the most selective rule first in the list. The cost of the data structure can be calculated from the rules.

Find a better way to implement range queries

After "freezing" the classifier, match queries on numeric ranges take between 1-100 microseconds for a range of numbers of segments and feature space dimensions. This comes at the cost of potentially enormous memory usage, since the bits corresponding to all entailed thresholds are duplicated onto the threshold. One alternative is to compute the entailment on the fly to reduce memory requirements, but this reduces query performance at least 10x, and gets worse with the number of thresholds.

Implement threshold rules for Comparable attributes

It should be possible to extend threshol rules from primitive types to Comparable types. For convenience this may include incomparable types supplied with a comparator. These should be range and reverse range encodeable, reducing the evaluation of several thresholds to a find operation on a TreeMap.

Implement DFA matchers

Serialization

Since they are really just trees of bitsets, it should be possible to serialise classifiers and publish them to workers colocated with shards of a data source.

Add range encoding to IntNode and LongNode

DoubleNode is range encoded, which means all the ids of all rules triggered by a value can be accessed with a single binary search, at the cost of reencoding inside the optimise method. This should also work for IntNode and LongNode which each support >= and <= unlike DoubleNode.

Build from JSON spec

Allow lower level access to classification result (avoid wrapper types)

String wildcard and equivalence class matcher

Use shift or without mismatches to match inputs against expressions like "abc*[^x]"

Explicit support for int, long and Duration types

Currently only strings and doubles can be used for writing rules.

Implement String.startsWith rules

Use a trie data structure to make "starts with" rules for string values

PrefixMatcher is inefficient

The PrefixMatcher uses a very naive algorithm to find the longest common prefix between the input and the rules.

Deploy to bintray

Fuzzy (hamming distance) matcher

Use shift-or to implement this.

Implement PBT

Filter more discriminatory attributes first

Currently attributes are ordered more or less randomly. An attribute with a binary constraint could be evaluated ahead of one that would produce no match, or one that reduces possible matches to a very small number of rules. The former case means that when no valid classificatin exists, the algorithm doesn't exit as soon as it could, and the second means that opportunities for cache efficiency are missed. Attributes should be ordered so those with a larger number of constraints are evaluated first.

Implement IP address matchers

Should support ranges, point addresses, white lists, black lists. Some kind of compressed 256 bit radix tree is probably a good choice for storing the constraints.

Dynamic selectivity / Selectivity stats

I do not know the selectivities of my Matchers beforehand - they rely on the data very much. That said, once a set of filters is running, their selectivity will only change gradually (over hours or days).

This calls for a self-balancing Classifier that would keep stats about its matchers' performance (sketch?) and sort based on that (and re-sort once in a while or on a manual input).

I could obviously do this myself, and I probably will. The question is - is something like this interesting to you, would you be interested in writing it yourself, would you accept a SelfBalancingClassifier (or whatever the name will be)?

Use code generation to expand MaskType type parameter as template

Shade, repackage and minimise RoaringBitmap and Apache commons dependencies

The library uses the containers and iterators from RoaringBitmap, and a patricia trie from Apache speculatively included for starts with and ends with rules. These should be shaded, the rest of the libraries removed, and renamed so they don't clash if users already use these libraries.

Support enums

Both classifications and schemas could benefit from generic types.