Giter Club home page Giter Club logo

multi-matcher's Introduction

multi-matcher

Build Status Coverage Status Maven Central License Javadoc Total alerts Language grade: Java

I have often needed to implement tedious classification logic in data processing projects. The requirements are often ambiguous to the extent that it would be difficult to implement them even in SQL, with aspects such as fallback and overlap. This logic often ends up expressed as large blocks of nested if statements which are hard to read or modify and perform poorly. This small project aims to make such classification logic easier, and improve performance too.

usage

Build a generic classification engine

    Classifier<Product, String> classifier = Classifier.<String, Product, String>builder(
                Schema.<String, Product, String>create()
                        .withAttribute("productType", Product::getProductType)
                        .withAttribute("issueDate", Product::getIssueDate, Comparator.naturalOrder().reversed())
                        .withAttribute("productName", Product::getProductName)
                        .withAttribute("availability", Product::getAvailability)
                        .withAttribute("discountedPrice", value -> 0.2 * value.getPrice())
                ).build(Arrays.asList(
                    MatchingConstraint.<String, String>named("rule1") 
                            .eq("productType", "silk")
                            .startsWith("productName", "luxury")
                            .gt("discountedPrice", 1000)
                            .priority(0)
                            .classification("EXPENSIVE_LUXURY_PRODUCTS")
                            .build(),
                    MatchingConstraint.<String, String>named("rule2")
                            .eq("productType", "caviar")
                            .gt("discountedPrice", 100)
                            .priority(1)
                            .classification("EXPENSIVE_LUXURY_PRODUCTS")
                            .build(),
                    MatchingConstraint.<String, String>anonymous()
                            .eq("productName", "baked beans")
                            .priority(2)
                            .classification("CHEAP_FOOD")
                            .build()
                )
            );

Classify

  Product p = getProduct();
  String classification = classifier.classification(p).orElse("UNCLASSIFIED");

multi-matcher's People

Contributors

richardstartin avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

multi-matcher's Issues

Reorder priorities on construction

It simplifies the code to iterate over overlapping classifications in reverse, but this is much slower than iterating forwards. It should be possible to reorder priorities on construction for query time performance improvements.

Generate classifier with annotation processing

Currently attributes used for classification are registered separately from the definition of the classified objects. It should be possible to annotate methods and generate a classifier from the definition. This could include simple functionality like scaling values.

The code below would generate a classifier with a weight attribute.

class ToBeClassified {

@Classifiable(scale = 0.5)
public double getWeight() {
...
}

}

Implement ruleset analyser

If each constraint is associated with only one rule, this library represents a huge waste of space, and it would be better to use a list of rules with the most selective rule first in the list. The cost of the data structure can be calculated from the rules.

Find a better way to implement range queries

After "freezing" the classifier, match queries on numeric ranges take between 1-100 microseconds for a range of numbers of segments and feature space dimensions. This comes at the cost of potentially enormous memory usage, since the bits corresponding to all entailed thresholds are duplicated onto the threshold. One alternative is to compute the entailment on the fly to reduce memory requirements, but this reduces query performance at least 10x, and gets worse with the number of thresholds.

Implement threshold rules for Comparable attributes

It should be possible to extend threshol rules from primitive types to Comparable types. For convenience this may include incomparable types supplied with a comparator. These should be range and reverse range encodeable, reducing the evaluation of several thresholds to a find operation on a TreeMap.

Serialization

Since they are really just trees of bitsets, it should be possible to serialise classifiers and publish them to workers colocated with shards of a data source.

Add range encoding to IntNode and LongNode

DoubleNode is range encoded, which means all the ids of all rules triggered by a value can be accessed with a single binary search, at the cost of reencoding inside the optimise method. This should also work for IntNode and LongNode which each support >= and <= unlike DoubleNode.

PrefixMatcher is inefficient

The PrefixMatcher uses a very naive algorithm to find the longest common prefix between the input and the rules.

Filter more discriminatory attributes first

Currently attributes are ordered more or less randomly. An attribute with a binary constraint could be evaluated ahead of one that would produce no match, or one that reduces possible matches to a very small number of rules. The former case means that when no valid classificatin exists, the algorithm doesn't exit as soon as it could, and the second means that opportunities for cache efficiency are missed. Attributes should be ordered so those with a larger number of constraints are evaluated first.

Implement IP address matchers

Should support ranges, point addresses, white lists, black lists. Some kind of compressed 256 bit radix tree is probably a good choice for storing the constraints.

Dynamic selectivity / Selectivity stats

I do not know the selectivities of my Matchers beforehand - they rely on the data very much. That said, once a set of filters is running, their selectivity will only change gradually (over hours or days).

This calls for a self-balancing Classifier that would keep stats about its matchers' performance (sketch?) and sort based on that (and re-sort once in a while or on a manual input).

I could obviously do this myself, and I probably will. The question is - is something like this interesting to you, would you be interested in writing it yourself, would you accept a SelfBalancingClassifier (or whatever the name will be)?

Support enums

Both classifications and schemas could benefit from generic types.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.