Giter Club home page Giter Club logo

Comments (5)

richardstartin avatar richardstartin commented on June 6, 2024

Hi Petr,

In the classifier, you have a list of attributes you need to match on, each with a bitset associated with each constrained value at the matcher level. As soon as you have built the classifier, you know the number of individual rules which could still fire after applying each matcher, because you can count the bits in each bitset. The idea of the selectivity heuristics is to reorder the matchers so that the most selective matchers (those which knock the most rules out when they match) go to the start of the list, and the ones which don't knock out many rules go to the end. The reasoning here is that each bitset aggregation operation is linear in the number of rules which can still fire, even if they enjoy bit level parallelism, it's better to knock rules out early.

The problem with the heuristics mechanism is that every attribute has a distribution of matcher selectivity rather than a point value. If you have a uniform distribution, averaging the cardinalities per constrained value works. It's bad if you have a zipfian distribution, and no amount of dynamism solves this. It also doesn't take in to account the actual data, as you note. Even if every constraint on an attribute knocks out 10k rules when satisfied, the heuristic does nothing if you always hit the wildcard for some dataset, so putting that attribute first is the wrong thing to do. So I see value in dynamism, so long as you accept that reordering is not a perfect mechanism and will only get you so far. If you are thinking of something other than reordering the matchers, I would be interested in your ideas.

On a side note, I have been working on replacing most of the implementation with runtime bytecode generation, to eliminate various indirection and allocation costs, so it's probably better to just fork the code, though if you come up with something that works well please let me know.

from multi-matcher.

JanecekPetr avatar JanecekPetr commented on June 6, 2024

Hello,
Sorry for such a long delay, the issue had a low priority as more performance was available to get for us in lower hanging fruit. That said, I finally got to do this. A dynamic sorting heuristic helps a little in our case for matching, but complicates the code and the API so much that we dropped it, and the gains were in low percent because of the extra overhead anyway.

Thank you for your work. I think we can safely close this issue now as something not applicable to the general library.

from multi-matcher.

richardstartin avatar richardstartin commented on June 6, 2024

Hi @JanecekPetr I would be interested to know if you are using this library, or what you were trying to use it for in any case. I don’t intend to maintain it so if you are using it I recommend forking the source code.

from multi-matcher.

JanecekPetr avatar JanecekPetr commented on June 6, 2024

We are indeed using this through a private fork (mostly adding more matchers). I'd like to give that back, but have been unsuccessful in the legal battle internally so far.
The use case is pretty much archetypal to what you designed this for: a set of user-provided rules classifying an infinite stream of documents into non-overlapping groups.
This worked better for us than any of the "standard" rule engines as it's built to do exactly what we want, not as a generic engine.

from multi-matcher.

richardstartin avatar richardstartin commented on June 6, 2024

OK thanks very much for the information, I'm happy that it's useful, and don't feel bad about not being able to contribute the enhancements (I would only need to look after them...)

from multi-matcher.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.