I do not know the selectivities of my Matchers beforehand - they rely on the data very

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Dynamic selectivity / Selectivity stats about multi-matcher HOT 5 CLOSED

JanecekPetr commented on June 6, 2024

Dynamic selectivity / Selectivity stats

from multi-matcher.

Comments (5)

richardstartin commented on June 6, 2024

Hi Petr,

In the classifier, you have a list of attributes you need to match on, each with a bitset associated with each constrained value at the matcher level. As soon as you have built the classifier, you know the number of individual rules which could still fire after applying each matcher, because you can count the bits in each bitset. The idea of the selectivity heuristics is to reorder the matchers so that the most selective matchers (those which knock the most rules out when they match) go to the start of the list, and the ones which don't knock out many rules go to the end. The reasoning here is that each bitset aggregation operation is linear in the number of rules which can still fire, even if they enjoy bit level parallelism, it's better to knock rules out early.

The problem with the heuristics mechanism is that every attribute has a distribution of matcher selectivity rather than a point value. If you have a uniform distribution, averaging the cardinalities per constrained value works. It's bad if you have a zipfian distribution, and no amount of dynamism solves this. It also doesn't take in to account the actual data, as you note. Even if every constraint on an attribute knocks out 10k rules when satisfied, the heuristic does nothing if you always hit the wildcard for some dataset, so putting that attribute first is the wrong thing to do. So I see value in dynamism, so long as you accept that reordering is not a perfect mechanism and will only get you so far. If you are thinking of something other than reordering the matchers, I would be interested in your ideas.

On a side note, I have been working on replacing most of the implementation with runtime bytecode generation, to eliminate various indirection and allocation costs, so it's probably better to just fork the code, though if you come up with something that works well please let me know.

from multi-matcher.

JanecekPetr commented on June 6, 2024

Hello,
Sorry for such a long delay, the issue had a low priority as more performance was available to get for us in lower hanging fruit. That said, I finally got to do this. A dynamic sorting heuristic helps a little in our case for matching, but complicates the code and the API so much that we dropped it, and the gains were in low percent because of the extra overhead anyway.

Thank you for your work. I think we can safely close this issue now as something not applicable to the general library.

from multi-matcher.

richardstartin commented on June 6, 2024

Hi @JanecekPetr I would be interested to know if you are using this library, or what you were trying to use it for in any case. I don’t intend to maintain it so if you are using it I recommend forking the source code.

from multi-matcher.

JanecekPetr commented on June 6, 2024

We are indeed using this through a private fork (mostly adding more matchers). I'd like to give that back, but have been unsuccessful in the legal battle internally so far.
The use case is pretty much archetypal to what you designed this for: a set of user-provided rules classifying an infinite stream of documents into non-overlapping groups.
This worked better for us than any of the "standard" rule engines as it's built to do exactly what we want, not as a generic engine.

from multi-matcher.

richardstartin commented on June 6, 2024

OK thanks very much for the information, I'm happy that it's useful, and don't feel bad about not being able to contribute the enhancements (I would only need to look after them...)

from multi-matcher.

Dynamic selectivity / Selectivity stats about multi-matcher HOT 5 CLOSED

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent