Giter Club home page Giter Club logo

Comments (12)

nomoa avatar nomoa commented on September 13, 2024

Yes, sadly spans are not fully supported. When inspecting the query in the QueryFlattener we only extract term information, position info is not tracked.
I'll try to find some time to have a closer look, thanks.

from search-highlighter.

mkrauklis avatar mkrauklis commented on September 13, 2024

Thanks @nomoa. Just so I understand, you're saying the problem is basically when the Automata is generated it doesn't take into account SpanTermQuery.slop?

More specifically that the slop gets lost in QueryFlattener.flattenQuery here:

    protected void flattenQuery(SpanNearQuery query, float pathBoost, Object sourceOverride,
            IndexReader reader, Callback callback) {
        pathBoost *= query.getBoost();
        for (SpanQuery clause : query.getClauses()) {
            flattenSpan(clause, pathBoost, sourceOverride, reader, callback);
        }
    }

Where the SpanNearQuery knows the slop but the constituent clauses do not.

Thanks for any help. I'm just trying to wrap my head around the high level algorithm so I can help wherever possible.

from search-highlighter.

nomoa avatar nomoa commented on September 13, 2024

Yes exactly, as you can see we only flatten the clauses but we should keep track of span info such as slop. The highlighter currently only support PhraseQuery by calling callback#Phrase methods to keep positional information. Unfortunately spans are more complex than phrase queries and supporting them does not seem trivial and it's unclear to me how we can do that yet, possibly by reusing WeightedSpanTermExtractor?

from search-highlighter.

mkrauklis avatar mkrauklis commented on September 13, 2024

We are thinking about modifying the flattenQuery(SpanNearQuery...) method to build an Automaton object directly. To handle nested span queries it could maybe pass around a current Automaton state node, defaulting to the init state. Once the span query automaton is built up BasicQueryWeigher.FlattenerCallback.flattened(Automaton...) could be called (instead of the flattened that takes a term byte reference) to add it to the list of automata. We're still working through the concept but I thought I'd bounce it off you to see if you had any feedback.

from search-highlighter.

nomoa avatar nomoa commented on September 13, 2024

Thanks for digging into this!

I must confess that I don't have much time to explore possible solutions.
Modifying flattenQuery(SpanNearQuery...) is certainly the first thing to do. But I'm not sure to follow your solution with a custom Automaton. You mean you would like to extend Automaton to store additional info?
Note that Automaton should be considered to something very similar to ByteRef, it's just a way to handle MultiTermQueries (Prefix/Fuzzy/WildcardQuery) without calling Query#rewrite.
If you need to pass more complex object to the callback I'd suggest using a dedicated class rather than extending Automaton.

At a glance I would change FlattenerCallback to allow capturing Spans states which then would allow you to create custom HitEnum subclasses to handle Spans. This would be very similar to what we do for Phrases but for Spans:

  • FlattenerCallback would expose specific methods to capture the phrase states: startSpan(int slop, boolean ordered)/endSpan().
  • Then BasicQueryWeigher will generate SpanNearHitEnumWrappers to properly emit span matches

Handling nested spans will certainly be non trivial e.g. FlattenerCallback#startSpan may be called multiple times before endSpan is called.

But if you have a solution that works I'd be happy to review it.

Thanks!

from search-highlighter.

mkrauklis avatar mkrauklis commented on September 13, 2024

I wasn't expecting we'd have to extend Automaton as there's already an example in the unit tests for TermAutomatonQuery that handles slop. See:
TestTermAutomatonQuery.java

  // "comes sun" or "comes * sun"
  public void testBasicSlop() throws Exception {
    Directory dir = newDirectory();
    RandomIndexWriter w = new RandomIndexWriter(random(), dir);
    Document doc = new Document();
    doc.add(newTextField("field", "here comes the sun", Field.Store.NO));
    w.addDocument(doc);

    doc = new Document();
    doc.add(newTextField("field", "here comes sun", Field.Store.NO));
    w.addDocument(doc);

    doc = new Document();
    doc.add(newTextField("field", "here comes the other sun", Field.Store.NO));
    w.addDocument(doc);
    IndexReader r = w.getReader();
    IndexSearcher s = newSearcher(r);

    TermAutomatonQuery q = new TermAutomatonQuery("field");
    int init = q.createState();
    int s1 = q.createState();
    q.addTransition(init, s1, "comes");
    int s2 = q.createState();
    q.addAnyTransition(s1, s2);
    int s3 = q.createState();
    q.setAccept(s3, true);
    q.addTransition(s1, s3, "sun");
    q.addTransition(s2, s3, "sun");
    q.finish();

    assertEquals(2, s.search(q, 1).totalHits);

    w.close();
    r.close();
    dir.close();
  }

Maybe this wouldn't get us enough, but it's the road we were walking down. We'll take a peek at what you suggested today as well. Thanks for your feedback!!

from search-highlighter.

dsmiley avatar dsmiley commented on September 13, 2024

I'm confused; how does TermAutomatonQuery relate to highlighting? Unfortunately, TermAutomatonQuery isn't a SpanQuery derivative and/or doesn't expose the offsets of matching terms; although it could be modified to do so.

from search-highlighter.

mkrauklis avatar mkrauklis commented on September 13, 2024

Could someone correct my understanding of how queryWeight is being used to filter terms matched outside a phrase? I see all the PostingsHitEnums for a field get grouped under a MergeHitEnumWrapper which is wrapped in nested PhraseHitEnumWrappers (one for each phrase for the field). These are then wrapped by a WeightFilteredHitEnumWrapper. When the hits come through if the query weight is not positive the PhraseHitEnumWrapper checks to see if it's a phrase hit. If so it bumps up the query weight. But I can't see what's making the queryWeight 0 to begin with. As I'm iterating through the PostingsHitEnum hits the queryWeight always seems to be 1. I was trying to use the same basic algorithm for a new SpanHitEnumWrapper class but if the next calls to wrapped always result in a queryWeight that's 1 I'm sunk. I'm missing something here but I can't see what. I'll keep digging but a gentle nudge in the right direction would certainly help the effort:)

from search-highlighter.

mkrauklis avatar mkrauklis commented on September 13, 2024

Alrighty, I think I figured it out. I was missing that in QueryFlattener we were passing a 0 for the weight of terms that were part of a phrase.

I think I have a partial solution implemented, piggybacking on a lot of the concepts for phrase handling. So far I seem to have it working for SpanTerm and SpanNear queries and would like to run the concept past you experts before I spend time implementing the rest.

The high level solution is to, while flattening the query, build up a list of root-level SpanValidator trees. Then, when wrapping (BasicQueryWeigher.wrap), for each root-level validator I wrap the current HitEnum with a SpanHitEnumWrapper.

The SpanHitEnumWrapper tries to validate the current hit. If there is a partial match (e.g. we have a hit on one of the SpanTerms in a list of span near clauses) it looks forward (wrapped.next()) until either there is a match or there is no match. For example, if the positions of the current hits are outside the bounds allowed by slop. When validating each acceptable hit is stored by that validator node.

Lastly, when SpanHitEnumWrapper.queryWeight() is called it checks with its validator to see if the given source/position combination was recorded as a successful hit during validation:

    @Override
    public float queryWeight() {
        float queryWeight = toReplay.queryWeight();

        if (queryWeight <= 0) {
            if (spanValidator.isSuccessful()) {
                if (spanValidator.isSourceAtPositionValid(source(), position())) {
                    queryWeight = 1;
                }
            }
        }

        return queryWeight;
    }

A little more detail:

I added functions to QueryFlattener.Callback to mark the start and end of the span types:

        void startSpanTermQuery(SpanTermQuery query);
        void endSpanTermQuery(SpanTermQuery query, int source);
        void startSpanNearQuery(SpanNearQuery query);
        void endSpanNearQuery(SpanNearQuery query);

These call back into BasicQueryWeigher which maintains a stack of List of SpanValidator instances. When the bottom of the stack is popped off this validator is added to the list of root-level validators for the given field.

Concerns:

  • It seems to be following the pattern established by phrases but with this solution we're throwing away queryWeight/boost for nested span queries. This seems like it could be more problematic than it was for phrases.
  • I see when we encounter a term multiple times we update the termInfo to have a new source. I followed the pattern established in phrase handling and performed a source check for SpanTerm to see if it's a hit. I'm not sure what the impact of changing the source on termInfo would have in this case (if any).

from search-highlighter.

nomoa avatar nomoa commented on September 13, 2024

@mkrauklis everything you describe sounds good to me.

Concerning queryWeight if we want to do like phrases we need to use the top-level boost recorded when flattening and as you said you may have to forget nested span query boosts. So instead of returning 1 in SpanHitEnumWrapper#queryWeight() would it be possible to return the boost you could capture when you flatten the query?

Concerning the problem with duplicated terms I have no idea... I'd need to spend some time in the code to tell. Would you mind opening a PR so we can discuss implementation details?

This looks very promising, thanks!

from search-highlighter.

mkrauklis avatar mkrauklis commented on September 13, 2024

@nomoa I'm doing this on company time so I'm just waiting to get the approval to contribute back to this project. We've done this for other projects so it shouldn't be a problem. Just takes a few days for everything to get approved. Once that's all set I'll open a PR.

from search-highlighter.

mkrauklis avatar mkrauklis commented on September 13, 2024

Sorry for the delay. Got a little held up in red-tape but I was finally able to move forward with the pull request. In the mean time I implemented handling of span_or and span_multi.

See: #21

from search-highlighter.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.