Comments (5)
Hi Petr,
In the classifier, you have a list of attributes you need to match on, each with a bitset associated with each constrained value at the matcher level. As soon as you have built the classifier, you know the number of individual rules which could still fire after applying each matcher, because you can count the bits in each bitset. The idea of the selectivity heuristics is to reorder the matchers so that the most selective matchers (those which knock the most rules out when they match) go to the start of the list, and the ones which don't knock out many rules go to the end. The reasoning here is that each bitset aggregation operation is linear in the number of rules which can still fire, even if they enjoy bit level parallelism, it's better to knock rules out early.
The problem with the heuristics mechanism is that every attribute has a distribution of matcher selectivity rather than a point value. If you have a uniform distribution, averaging the cardinalities per constrained value works. It's bad if you have a zipfian distribution, and no amount of dynamism solves this. It also doesn't take in to account the actual data, as you note. Even if every constraint on an attribute knocks out 10k rules when satisfied, the heuristic does nothing if you always hit the wildcard for some dataset, so putting that attribute first is the wrong thing to do. So I see value in dynamism, so long as you accept that reordering is not a perfect mechanism and will only get you so far. If you are thinking of something other than reordering the matchers, I would be interested in your ideas.
On a side note, I have been working on replacing most of the implementation with runtime bytecode generation, to eliminate various indirection and allocation costs, so it's probably better to just fork the code, though if you come up with something that works well please let me know.
from multi-matcher.
Hello,
Sorry for such a long delay, the issue had a low priority as more performance was available to get for us in lower hanging fruit. That said, I finally got to do this. A dynamic sorting heuristic helps a little in our case for matching, but complicates the code and the API so much that we dropped it, and the gains were in low percent because of the extra overhead anyway.
Thank you for your work. I think we can safely close this issue now as something not applicable to the general library.
from multi-matcher.
Hi @JanecekPetr I would be interested to know if you are using this library, or what you were trying to use it for in any case. I donβt intend to maintain it so if you are using it I recommend forking the source code.
from multi-matcher.
We are indeed using this through a private fork (mostly adding more matchers). I'd like to give that back, but have been unsuccessful in the legal battle internally so far.
The use case is pretty much archetypal to what you designed this for: a set of user-provided rules classifying an infinite stream of documents into non-overlapping groups.
This worked better for us than any of the "standard" rule engines as it's built to do exactly what we want, not as a generic engine.
from multi-matcher.
OK thanks very much for the information, I'm happy that it's useful, and don't feel bad about not being able to contribute the enhancements (I would only need to look after them...)
from multi-matcher.
Related Issues (20)
- Filter more discriminatory attributes first
- Serialization
- Support enums HOT 1
- Find a better way to implement range queries
- Implement DFA matchers
- Implement IP address matchers
- use jdk10
- Use RoaringBitmap for keys in IntNode HOT 1
- Deploy to bintray
- Investigate XOR style rules
- Build from JSON spec
- Investigate generating code for some matchers
- Use code generation to expand MaskType type parameter as template
- Allow lower level access to classification result (avoid wrapper types) HOT 1
- PrefixMatcher is inefficient
- Fuzzy (hamming distance) matcher
- String wildcard and equivalence class matcher
- Implement ruleset analyser
- Implement PBT
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from multi-matcher.