Comments (12)
Yes, sadly spans are not fully supported. When inspecting the query in the QueryFlattener we only extract term information, position info is not tracked.
I'll try to find some time to have a closer look, thanks.
from search-highlighter.
Thanks @nomoa. Just so I understand, you're saying the problem is basically when the Automata is generated it doesn't take into account SpanTermQuery.slop?
More specifically that the slop gets lost in QueryFlattener.flattenQuery here:
protected void flattenQuery(SpanNearQuery query, float pathBoost, Object sourceOverride,
IndexReader reader, Callback callback) {
pathBoost *= query.getBoost();
for (SpanQuery clause : query.getClauses()) {
flattenSpan(clause, pathBoost, sourceOverride, reader, callback);
}
}
Where the SpanNearQuery knows the slop but the constituent clauses do not.
Thanks for any help. I'm just trying to wrap my head around the high level algorithm so I can help wherever possible.
from search-highlighter.
Yes exactly, as you can see we only flatten the clauses but we should keep track of span info such as slop. The highlighter currently only support PhraseQuery by calling callback#Phrase methods to keep positional information. Unfortunately spans are more complex than phrase queries and supporting them does not seem trivial and it's unclear to me how we can do that yet, possibly by reusing WeightedSpanTermExtractor?
from search-highlighter.
We are thinking about modifying the flattenQuery(SpanNearQuery...) method to build an Automaton object directly. To handle nested span queries it could maybe pass around a current Automaton state node, defaulting to the init state. Once the span query automaton is built up BasicQueryWeigher.FlattenerCallback.flattened(Automaton...) could be called (instead of the flattened that takes a term byte reference) to add it to the list of automata. We're still working through the concept but I thought I'd bounce it off you to see if you had any feedback.
from search-highlighter.
Thanks for digging into this!
I must confess that I don't have much time to explore possible solutions.
Modifying flattenQuery(SpanNearQuery...) is certainly the first thing to do. But I'm not sure to follow your solution with a custom Automaton. You mean you would like to extend Automaton to store additional info?
Note that Automaton should be considered to something very similar to ByteRef, it's just a way to handle MultiTermQueries (Prefix/Fuzzy/WildcardQuery) without calling Query#rewrite.
If you need to pass more complex object to the callback I'd suggest using a dedicated class rather than extending Automaton.
At a glance I would change FlattenerCallback to allow capturing Spans states which then would allow you to create custom HitEnum subclasses to handle Spans. This would be very similar to what we do for Phrases but for Spans:
- FlattenerCallback would expose specific methods to capture the phrase states: startSpan(int slop, boolean ordered)/endSpan().
- Then BasicQueryWeigher will generate SpanNearHitEnumWrappers to properly emit span matches
Handling nested spans will certainly be non trivial e.g. FlattenerCallback#startSpan may be called multiple times before endSpan is called.
But if you have a solution that works I'd be happy to review it.
Thanks!
from search-highlighter.
I wasn't expecting we'd have to extend Automaton as there's already an example in the unit tests for TermAutomatonQuery that handles slop. See:
TestTermAutomatonQuery.java
// "comes sun" or "comes * sun"
public void testBasicSlop() throws Exception {
Directory dir = newDirectory();
RandomIndexWriter w = new RandomIndexWriter(random(), dir);
Document doc = new Document();
doc.add(newTextField("field", "here comes the sun", Field.Store.NO));
w.addDocument(doc);
doc = new Document();
doc.add(newTextField("field", "here comes sun", Field.Store.NO));
w.addDocument(doc);
doc = new Document();
doc.add(newTextField("field", "here comes the other sun", Field.Store.NO));
w.addDocument(doc);
IndexReader r = w.getReader();
IndexSearcher s = newSearcher(r);
TermAutomatonQuery q = new TermAutomatonQuery("field");
int init = q.createState();
int s1 = q.createState();
q.addTransition(init, s1, "comes");
int s2 = q.createState();
q.addAnyTransition(s1, s2);
int s3 = q.createState();
q.setAccept(s3, true);
q.addTransition(s1, s3, "sun");
q.addTransition(s2, s3, "sun");
q.finish();
assertEquals(2, s.search(q, 1).totalHits);
w.close();
r.close();
dir.close();
}
Maybe this wouldn't get us enough, but it's the road we were walking down. We'll take a peek at what you suggested today as well. Thanks for your feedback!!
from search-highlighter.
I'm confused; how does TermAutomatonQuery
relate to highlighting? Unfortunately, TermAutomatonQuery isn't a SpanQuery derivative and/or doesn't expose the offsets of matching terms; although it could be modified to do so.
from search-highlighter.
Could someone correct my understanding of how queryWeight is being used to filter terms matched outside a phrase? I see all the PostingsHitEnums for a field get grouped under a MergeHitEnumWrapper which is wrapped in nested PhraseHitEnumWrappers (one for each phrase for the field). These are then wrapped by a WeightFilteredHitEnumWrapper. When the hits come through if the query weight is not positive the PhraseHitEnumWrapper checks to see if it's a phrase hit. If so it bumps up the query weight. But I can't see what's making the queryWeight 0 to begin with. As I'm iterating through the PostingsHitEnum hits the queryWeight always seems to be 1. I was trying to use the same basic algorithm for a new SpanHitEnumWrapper class but if the next calls to wrapped always result in a queryWeight that's 1 I'm sunk. I'm missing something here but I can't see what. I'll keep digging but a gentle nudge in the right direction would certainly help the effort:)
from search-highlighter.
Alrighty, I think I figured it out. I was missing that in QueryFlattener we were passing a 0 for the weight of terms that were part of a phrase.
I think I have a partial solution implemented, piggybacking on a lot of the concepts for phrase handling. So far I seem to have it working for SpanTerm and SpanNear queries and would like to run the concept past you experts before I spend time implementing the rest.
The high level solution is to, while flattening the query, build up a list of root-level SpanValidator
trees. Then, when wrapping (BasicQueryWeigher.wrap
), for each root-level validator I wrap the current HitEnum with a SpanHitEnumWrapper
.
The SpanHitEnumWrapper
tries to validate the current hit. If there is a partial match (e.g. we have a hit on one of the SpanTerms
in a list of span near clauses) it looks forward (wrapped.next()
) until either there is a match or there is no match. For example, if the positions of the current hits are outside the bounds allowed by slop. When validating each acceptable hit is stored by that validator node.
Lastly, when SpanHitEnumWrapper.queryWeight()
is called it checks with its validator to see if the given source/position combination was recorded as a successful hit during validation:
@Override
public float queryWeight() {
float queryWeight = toReplay.queryWeight();
if (queryWeight <= 0) {
if (spanValidator.isSuccessful()) {
if (spanValidator.isSourceAtPositionValid(source(), position())) {
queryWeight = 1;
}
}
}
return queryWeight;
}
A little more detail:
I added functions to QueryFlattener.Callback to mark the start and end of the span types:
void startSpanTermQuery(SpanTermQuery query);
void endSpanTermQuery(SpanTermQuery query, int source);
void startSpanNearQuery(SpanNearQuery query);
void endSpanNearQuery(SpanNearQuery query);
These call back into BasicQueryWeigher
which maintains a stack of List
of SpanValidator
instances. When the bottom of the stack is popped off this validator is added to the list of root-level validators for the given field.
Concerns:
- It seems to be following the pattern established by phrases but with this solution we're throwing away queryWeight/boost for nested span queries. This seems like it could be more problematic than it was for phrases.
- I see when we encounter a term multiple times we update the termInfo to have a new source. I followed the pattern established in phrase handling and performed a source check for SpanTerm to see if it's a hit. I'm not sure what the impact of changing the source on termInfo would have in this case (if any).
from search-highlighter.
@mkrauklis everything you describe sounds good to me.
Concerning queryWeight if we want to do like phrases we need to use the top-level boost recorded when flattening and as you said you may have to forget nested span query boosts. So instead of returning 1 in SpanHitEnumWrapper#queryWeight() would it be possible to return the boost you could capture when you flatten the query?
Concerning the problem with duplicated terms I have no idea... I'd need to spend some time in the code to tell. Would you mind opening a PR so we can discuss implementation details?
This looks very promising, thanks!
from search-highlighter.
@nomoa I'm doing this on company time so I'm just waiting to get the approval to contribute back to this project. We've done this for other projects so it shouldn't be a problem. Just takes a few days for everything to get approved. Once that's all set I'll open a PR.
from search-highlighter.
Sorry for the delay. Got a little held up in red-tape but I was finally able to move forward with the pull request. In the mean time I implemented handling of span_or and span_multi.
See: #21
from search-highlighter.
Related Issues (20)
- elasticsearch 2.0 support HOT 9
- Notice: Lucene UnifiedHighlighter HOT 1
- Cannot compile for elasticsearch-5.1.1 / lucene-6.3.0 HOT 12
- Weird highlighting behaviour HOT 1
- Builds for ES5 HOT 2
- html encoder
- plugin install for ES 5.1.2 HOT 2
- match_phrase not highlighting when stopwords removed HOT 3
- regression: hunspell filter does not highlight correctly in 5.x HOT 2
- string_index_out_of_bounds_exception HOT 2
- 5.3 HOT 1
- 1.7 branch, and Java 1.8 HOT 3
- parse error: Invalid \uXXXX\uXXXX surrogate pair escape HOT 2
- Does the plugin work with version 6.x HOT 1
- Support for ElasticSearch 5.6.2 HOT 3
- Elasticsearch Keyword not getting highlighted HOT 8
- How to install from source? HOT 3
- Upgrade to Elasticsearch 7.16.2
- Can I use this highlighter to highlight intervals query result?
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from search-highlighter.