
search-highlighter's Introduction

Experimental Highlighter

Text highlighter for Java designed to be pluggable enough for easy experimentation. The idea is that it should be possible to play with how hits are weighed or how they are grouped into snippets without knowing the guts of Lucene or Elasticsearch.

Comes in three flavors:

  • Core: A jar with no dependencies containing most of the interesting logic
  • Lucene: A jar containing a bridge between the core and Lucene
  • Elasticsearch: An Elasticsearch plugin

You can read more on how it works here.

Elasticsearch value proposition

This highlighter

  • Doesn't need offsets in postings or term enums with offsets but can use either to speed itself up.
  • Can fragment like the Postings Highlighter, the Fast Vector Highlighter, or it can highlight the entire field.
  • Can combine hits using multiple different fields (aka matched_fields support).
  • Can boost matches that appear early in the document.
  • By default, boosts matches on unique query terms per fragment.

This highlighter does not (currently):

  • Support require_field_match

Elasticsearch installation

Experimental Highlighter Plugin | Elasticsearch
------------------------------- | -------------
7.10.0, master branch           | 7.10.0
7.5.1                           | 7.5.1
6.3.1.2                         | 6.3.1
5.5.2.2                         | 5.5.2
5.4.3                           | 5.4.3
5.3.2                           | 5.3.2
5.3.1                           | 5.3.1
5.3.0                           | 5.3.0
5.2.2                           | 5.2.2
5.2.1                           | 5.2.1
5.2.0                           | 5.2.0
5.1.2                           | 5.1.2
2.4.1                           | 2.4.1
2.4.0                           | 2.4.0
2.3.5, 2.3 branch               | 2.3.5
2.3.4                           | 2.3.4
2.3.3                           | 2.3.3
2.2.2, 2.2 branch               | 2.2.2
2.1.2, 2.1 branch               | 2.1.2
2.0.2, 2.0 branch               | 2.0.2
1.7.0 -> 1.7.1, 1.7 branch      | 1.7.X
1.6.0, 1.6 branch               | 1.6.X
1.5.0 -> 1.5.1, 1.5 branch      | 1.5.X
1.4.0 -> 1.4.1, 1.4 branch      | 1.4.X
0.0.11 -> 1.3.0, 1.3 branch     | 1.3.X
0.0.10                          | 1.2.X
0.0.1 -> 0.0.9                  | 1.1.X

Install it like so for Elasticsearch 5.x.x:

./bin/elasticsearch-plugin install org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:5.x.x

Install it like so for Elasticsearch 2.x.x:

./bin/plugin install org.wikimedia.search.highlighter/experimental-highlighter-elasticsearch-plugin/2.x.x

Install it like so for Elasticsearch 1.7.x:

./bin/plugin --install org.wikimedia.search.highlighter/experimental-highlighter-elasticsearch-plugin/1.7.0

Then you can use it by searching like so:

{
  "_source": false,
  "query": {
    "query_string": {
      "query": "hello world"
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "title": {
        "number_of_fragments": 1,
        "type": "experimental"
      }
    }
  }
}

Elasticsearch options

The fragmenter field defaults to scan but can also be set to sentence or none. scan produces results that look like the Fast Vector Highlighter's. sentence produces results that look like the Postings Highlighter's. none doesn't fragment at all, so it is the cleanest option if you have to highlight the whole field. Multi-valued fields will always fragment between each value, even with none. Example:

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "fragmenter": "sentence",
        "options": {
          "locale": "en_us"
        }
      }
    }
  }

If using the sentence fragmenter you should specify the locale used for sentence rules with the locale option as above.

Each fragmenter has different no_match_size strategies based on the spirit of the fragmenter.

By default fragments are weighed such that additional matches for the same query term are worth less than unique matched query terms. This can be customized with the fragment_weigher option. Setting it to sum will weigh a fragment as the sum of all its matches, just like the FVH. The default setting, exponential, weighs fragments as the sum of (base ^ match_count) * average_score, where match_count is the number of matches for that query term, average_score is the average of the scores of those matches, and base is a free parameter that defaults to 1.1. The default value of base is what provides the discount on duplicate terms. It can be changed by setting fragment_weigher like this: {"exponential": {"base": 1.01}}. Setting the base closer to 1 will make duplicate matches worth less. Setting the base between 0 and 1 will make duplicate matches worth less than single matches, which doesn't make much sense (but is possible). Similarly, setting base to a negative number or a number greater than sqrt(2) will do other, probably less than desirable, things.
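For example, to soften the discount on duplicate matches, fragment_weigher can be set in options like this (a sketch; the field name and base value are only illustrative):

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "options": {
          "fragment_weigher": {
            "exponential": {
              "base": 1.01
            }
          }
        }
      }
    }
  }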

The top_scoring option can be set to true while sorting fragments by source to return only the top scoring fragments but leave them in source order. Example:

  "highlight": {
    "fields": {
      "text": {
        "type": "experimental",
        "number_of_fragments": 2,
        "fragmenter": "sentence",
        "sort": "source",
        "options": {
           "locale": "en_us",
           "top_scoring": true
        }
      }
    }
  }

The default_similarity option defaults to true for queries with more than one term. It will weigh each matched term using Lucene's default similarity model, similarly to how the Fast Vector Highlighter weighs terms. It can be set to false to leave out that weighing. If there is only a single term in the query it will never be used.

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "options": {
          "default_similarity": false
        }
      }
    }
  }

The hit_source option can force detecting matched terms from a particular source. It can be either postings, vectors, or analyze. If set to postings but the field isn't indexed with index_options set to offsets, or set to vectors but term_vector isn't set to with_positions_offsets, then the highlighter will throw an error. Defaults to using the first option that wouldn't throw an error.

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "options": {
          "hit_source": "analyze"
        }
      }
    }
  }

The boost_before option lets you set up boosts before positions. For example, this will multiply the weight of matches before the 20th position by 5 and before the 100th position by 1.5.

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "order": "score",
        "options": {
          "boost_before": {
            "20": 5,
            "100": 1.5
          }
        }
      }
    }
  }

Note that the position is not reset between multiple values of the same field but is handled independently for each of the matched_fields. Note also that boost_before works with top_scoring.

The max_fragments_scored option lets you limit the number of fragments scored. The default is Integer.MAX_VALUE so you'll score them all. This can be used to limit the CPU cost of scoring many matches when it is likely that the first few matches will have the highest score.
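For instance, a request could cap scoring at the first ten fragments like this (a sketch; the field name and limit are only illustrative):

  "highlight": {
    "fields": {
      "text": {
        "type": "experimental",
        "options": {
          "max_fragments_scored": 10
        }
      }
    }
  }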

The matched_fields field turns on combining matches from multiple fields, just like the Fast Vector Highlighter. See the Elasticsearch documentation for more on it. The only real difference is that if hit_source is left out then each field's HitSource is determined independently, which isn't possible with the Fast Vector Highlighter as it only supports the postings hit source. Remember: for very short fields the analyze hit source will be the most efficient because no secondary data has to be loaded from disk.
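For example, hits from a field and a differently analyzed subfield can be combined like this (a sketch; the title.plain subfield is an assumed example of such a copy):

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "matched_fields": [
          "title",
          "title.plain"
        ]
      }
    }
  }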

A limitation in matched_fields: if the highlighter has to analyze the field value to find hits then you can't reuse analyzers in each matched field.

The fetch_fields option can be used to return fields next to the highlighted field. It is designed for use with object fields but has a number of limitations. Read more about it here.

The phrase_as_terms option can be set to true to highlight phrase queries (and multi phrase prefix queries) as a set of terms rather than a phrase. This defaults to false so phrase queries are restricted to full phrase matches.
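Example (a sketch; the field name is only illustrative):

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "options": {
          "phrase_as_terms": true
        }
      }
    }
  }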

The regex option lets you set regular expressions that identify hits. It can be specified as a string for a single regular expression or a list for more than one. The regex_flavor option sets the flavor of regex: the default flavor is lucene and the other option is java. It's also possible to skip matching the query entirely by setting the skip_query option to true. The regex_case_insensitive option can be set to true to make the regex case insensitive using the case rules of the locale specified by locale. Example:

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "options": {
          "regex": [
            "fo+",
            "bar|z",
            "bor?t blah"
          ],
          "regex_flavor": "lucene",
          "skip_query": true,
          "locale": "en_US",
          "regex_case_insensitive": true
        }
      }
    }
  }

If a regex match is wider than the allowed snippet size it won't be returned.

The max_determinized_states option can be used to limit the complexity explosion that comes from compiling Lucene Regular Expressions into DFAs. It defaults to 20,000 states. Increasing it allows more complex regexes to take the memory and time that they need to compile. The default allows for reasonably complex regexes.
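For example, a more complex regex can be allowed to compile by raising the limit (a sketch; the regex, field name, and state limit are only illustrative):

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "options": {
          "regex": "fo+(bar|baz)*",
          "skip_query": true,
          "max_determinized_states": 50000
        }
      }
    }
  }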

The skip_if_last_matched option can be used to entirely skip highlighting if the last field matched. This can be used to form "chains" of fields only one of which will return a match:

  "highlight": {
    "type": "experimental",
    "fields": {
      "text": {},
      "aux_text": { "options": { "skip_if_last_matched": true } },
      "title": {},
      "redirect": { "options": { "skip_if_last_matched": true } },
      "section_heading": { "options": { "skip_if_last_matched": true } },
      "category": { "options": { "skip_if_last_matched": true } }
    }
  }

The above example creates two "chains":

  • aux_text will only be highlighted if there isn't a match in text.
  • redirect will only be highlighted if there isn't a match in title.
  • section_heading will only be highlighted if there isn't a match in redirect and title.
  • category will only be highlighted if there isn't a match in section_heading, redirect, or title.

The remove_high_freq_terms_from_common_terms option can be used to highlight common terms when using the common_terms query. It defaults to true meaning common terms will not be highlighted. Setting it to false will highlight common terms in common_terms queries. Note that this behavior was added in 1.3.1, 1.4.3, and 1.5.0 and before that common terms were always highlighted by the common_terms query.

The max_expanded_terms option can be used to control how many terms the highlighter expands multi term queries into. The default is 1024 which is the same as the fvh's default. Note that the highlighter doesn't need to expand all multi term queries because it has special handling for many of them. But when it does, this is how many terms it expands them into. This was added in 1.3.1, 1.4.3, and 1.5.0 and before the value was hard coded to 100.

The return_offsets option changes the results from a highlighted string to the offsets in the highlighted field that would have been highlighted. This is useful if you need to do client side sanity checking on the highlighting. Instead of a marked up snippet you'll get a result like 0:0-5,18-22:22. The outer numbers are the start and end offsets of the snippet. The pairs of numbers separated by the ,s are the hits. The number before the - is the start offset and the number after the - is the end offset. Multi-valued fields have a single character's worth of offset between them.
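Example (a sketch; the field name is only illustrative):

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "options": {
          "return_offsets": true
        }
      }
    }
  }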

Offsets in postings or term vectors

Since adding offsets to the postings (set index_options to offsets in Elasticsearch) and creating term vectors with offsets (set term_vector to with_positions_offsets in Elasticsearch) both speed up this highlighter, you have a choice of which one to use. Unless you have a compelling reason to use term vectors, go with adding offsets to the postings because that is faster (by my tests, at least) and uses much less space.
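For example, a mapping that adds offsets to the postings might look like this (a sketch in Elasticsearch 7.x mapping syntax; the index and field names are only illustrative):

  PUT /my_index
  {
    "mappings": {
      "properties": {
        "title": {
          "type": "text",
          "index_options": "offsets"
        }
      }
    }
  }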

search-highlighter's People

Contributors

berndsi, earldouglas, ebernhardson, gehel, jsteggink, nik9000, nomoa, praveenabishek3


search-highlighter's Issues

Highlighting Not Honoring Slop

I should first note this exact bug exists with the Plain Highlighter in 1.7.x-2.4.x and is supposedly fixed in 5.x as reported here: elastic/elasticsearch#18246

The root of the problem is that although we only get hits on documents that meet the slop criteria for a span_near query, the highlighter highlights all instances of the matched terms regardless of proximity.

For example, the following request:

POST /idx/_search
{
   "highlight": {
      "pre_tags": [
         "|~|"
      ],
      "post_tags": [
         "|^|"
      ],
      "order": "score",
      "fields": {
         "fieldx": {
            "fragment_size": 1000,
            "number_of_fragments": 3,
            "type": "experimental",
            "matched_fields": [
               "fieldx",
               "fieldx.exact"
            ]
         }
      }
   },
   "query": {
      "span_near": {
         "clauses": [
            {
               "span_term": {
                  "fieldx.exact": {
                     "value": "term1"
                  }
               }
            },
            {
               "span_term": {
                  "fieldx.exact": {
                     "value": "term2"
                  }
               }
            }
         ],
         "slop": 2,
         "in_order": false
      }
   }
}

Returns the following highlight:

...
"fieldx": ["Shouldn't hit |~|term1|^| because it's not close enough to |~|term2|^|. However, the distance between the second instances of |~|term1|^| and |~|term2|^| are close enough and should get highlighted."]
...

In this example the highlights for the first instance of the search terms are incorrect.

We've traced this into PostingsHitEnum's iteration through the PostingsEnum, which is returning offsets for the incorrect terms.

version 0.0.11 incompatible with ES 1.2.2

The docs suggest search-highlighter 0.0.11 is compatible with 1.2.0.

  • I took this to mean 1.2.x

Should this work?

upon attempting installation:

[WARN ][plugins ] [Omega] failed to load plugin from [jar:file:/elasticsearch-1.2.2/plugins/experimental-highlighter-elasticsearch-plugin/experimental-highlighter-elasticsearch-plugin-0.0.11.jar!/es-plugin.properties]
org.elasticsearch.ElasticsearchException: Failed to load plugin class [org.wikimedia.highlighter.experimental.elasticsearch.plugin.ExperimentalHighlighterPlugin]
    at org.elasticsearch.plugins.PluginsService.loadPlugin(PluginsService.java:522)
    at org.elasticsearch.plugins.PluginsService.loadPluginsFromClasspath(PluginsService.java:397)
    at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:107)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:144)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:70)
    at org.elasticsearch.bootstrap.Bootstrap.main(Bootstrap.java:203)
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:32)
Caused by: org.elasticsearch.ElasticsearchException: Plugin is incompatible with the current node
    at org.elasticsearch.plugins.PluginsService.loadPlugin(PluginsService.java:515)
    ... 7 more

Does the plugin work with version 6.x

My problem is that I need the text positions of search results, and after doing some research it seems that this plugin is the only way to get them without coding.

What's the highest version of Elasticsearch this can run on so I can downgrade, and/or are there plans for making it run on 6.x?

Script Field Highlighting

This is a redirect from elastic/elasticsearch#9890.

The idea there is that, with a script field from a very long string, it would be very nice to be able to highlight only from its arbitrary value/length. The general use case is to provide a paging mechanism for very long documents for different client profiles.

Imagine naively serving a 1,000 page book, stored in a single document field, to a digital reader with search highlighting, only to have the user go to the next page. Once they have gone to the next page, then imagine that they resize the font so that the number of words/characters on screen changes. Or they use a different device with a much bigger screen that contain many more words.

In both scenarios, the client controls the size of the "page" and highlighting is completely irrelevant when not visible and potentially as expensive to perform against the entire string.

The ideal solution for this scenario would be to simply allow the highlighter to pass script fields through an appropriate analyzer, assuming it's necessary, and then run them through traditional highlighting. (The script below is kept simple to show the highlighting rather than the scripting.)

GET /books/_search
{
  "_source" : false,
  "query" : {
    "query_string" : { "query" : "jumping through hoops" }
  },
  "script_fields" : {
    "book_chunk" : {
      "script" : "doc['text'].value.substring(start, end)",
      "params" : {
        "start" : 2000,
        "end" : 4000
      }
    }
  },
  "highlight": {
    "script_fields": {
      "book_chunk": {
        "type": "experimental",
        "analyzer" : "standard"
      }
    }
  }
}

Weird highlighting behaviour

tl;dr: The point of using the experimental highlighter for me is using less storage space while enabling multiple types of analysis on fields.

Longer:

I have a set of fields. Let's call one of them text. Because I want to be able to reap the benefits of multiple analyzers, the field text has a subfield, text.raw. On text.raw I don't lowercase or stem. So I can have very broad queries like dog matching Doggy, or very precise ones where Dogs only matches dogs. And so on. The caveat is that I need to highlight results.

Running this in production, where I store every subfield, makes for an insanely big index (north of 500GB). Same with the FVH using matched_fields. So I really need the experimental highlighter's matched_fields in this case.

Here's an example of the highlight part of the request:

"text": {
    "number_of_fragments": 0,
    "matched_fields": [
        "text",
        "text.raw"
    ],
    "type": "experimental"
}

As you can see, I'm trying to tie the parent field and subfield into a single field for the purpose of highlighting.

Now: depending on how the query part of request is formed, I either get highlight results or not. Here's the isolated cause/difference in query structure that I see:

This query will work:

"query_string": {
    "use_dis_max": true,
    "query": "\"Eskimos\"",
    "fields": [
        "description",
        "text.raw"
    ]
}

And this won't:

"query_string": {
    "use_dis_max": true,
    "query": "\"Eskimos\"",
    "fields": [
        "text.raw"
    ]
}

and this won't either

"query_string": {
    "use_dis_max": true,
    "query": "\"Eskimos\"",
    "fields": [
        "description.raw",
        "text.raw"
    ]
}

Now, the single hit in this testing index is returned for every query. But only in the first case does it also include the highlighting dict. Neither of the raw fields is stored. The interesting part is that explain shows no need for the description field - it doesn't contain the word Eskimos. Also, the description field can be replaced by any field that has the stored option on.

My question: what's going on? Is there some kind of optimization in place here that needs to be force-disabled, or is this a cryptic bug?

"unknown highlighter type" in elasticsearch 1.4.3

I've now tried to use the experimental highlighter in elasticsearch 1.4.3, but I get the error "unknown highlighter type [experimental] ...".
Is elasticsearch 1.4.3 not yet supported or did I hit a bug here?

Here is a minimal scenario that gives me the error message:

PUT test_experimental_highlighter

POST test_experimental_highlighter/test_type/
{
  "field": "value"
}

GET test_experimental_highlighter/_search
{
  "query": {
    "match": {
      "field": "value"
    }
  },
  "highlight": {
    "type": "experimental",
    "fields": {
      "*": {}
    }
  }
}

Also, the plugin does not show up in the _nodes endpoint of my elasticsearch installation ...

regression: hunspell filter does not highlight correctly in 5.x

The following example works with 2.4.1, but not 5.1.2 and 5.2.2. When hunspell is used for stemming, a term that is stemmed and expanded (e.g. contract) is not highlighted correctly. I verified that the output of the analyzer with Hunspell filter is the same between 2.4.1 and 5.2.2. Maybe it has something to do with the handling of term positions?

// complete Node.js example

var async = require('async');
var es = require('elasticsearch');

var INDEX_NAME = 'test_hunspell';
var SEARCH_TERMS = 'contract';
var TEST_ANALYZER = 'hunspell';
var TEST_TEXT = '\
    8-K\n1\nd67628d8k.htm\n8-K\n8-K\nUNITED STATES\nSECURITIES AND EXCHANGE \
    COMMISSION\nWashington, D.C. 20549\nFORM 8-K\nCURRENT REPORT Pursuant\nto \
    Section 13 or 15(d) of the Securities Exchange Act of 1934\nDate of Report \
    (Date of earliest event reported): August 6, 2015\nIndependence Contract \
    Drilling, Inc.\n(Exact name of registrant as specified in its charter)\n \
    Delaware\n001-36590\n37-1653648\n(State or other jurisdiction\nof \
    incorporation)\n(Commission\nFile Number)\n(I.R.S. Employer\nIdentification \
     No.)\n11601 North Galayda Street\nHouston, TX 77086\n(Address of principal \
    executive offices)\n(281) 598-1230\n(Registrant’s telephone number, including \
    area code)\nN/A (Former name or\nformer address, if changed since last \
    report)\nCheck the appropriate box below\nif the Form 8-K filing is intended \
    to simultaneously satisfy the filing obligation of the registrant under \
    any of the following provisions (see General Instruction A.2. below):\n¨\nWritten \
    communications pursuant to Rule 425 under the Securities Act (17 CFR 230.425)\
    \n¨\nSoliciting material pursuant to Rule 14a-12 under the Exchange Act (17 \
    CFR 240.14a-12)\n¨\nPre-commencement communications pursuant to Rule 14d-2(b) \
    under the Exchange Act (17 CFR 240.14d-2(b))\n¨\nPre-commencement communications \
    pursuant to Rule 13e-4(c) under the Exchange Act (17 CFR 240.13e-4(c))\nItem \
    2.02\nResults of Operations and Financial Condition On August 6, 2015, \
    Independence\nContract Drilling, Inc. (“ICD”) issued a press release reporting \
    financial results for the second quarter and the six months ended June 30, \
    2015. A copy of the press release is being furnished as Exhibit 99.1 hereto \
    and is\nincorporated herein by reference. The information furnished pursuant \
    to Item 2.02, including Exhibit 99.1, shall not be deemed\n“filed” for purposes \
    of Section 18 of the Securities Exchange Act of 1934, as amended (the “Exchange \
    Act”), is not subject to the liabilities of that section and is not deemed \
    incorporated by reference in any filing of\nICD’s under the Exchange Act or \
    the Securities Act of 1933, as amended, unless specifically identified \
    therein as being incorporated therein by reference.\nItem 9.01\nFinancial \
    Statements and Exhibits\n(d)\nExhibits\n99.1\nPress Release dated August 6, \
    2015\nSIGNATURES\nPursuant to the requirements of the Securities Exchange \
    Act of 1934, the registrant has duly caused this report to be signed on \
    its behalf by\nthe undersigned hereunto duly authorized.\nIndependence \
    Contract Drilling, Inc.\nDate: August 6, 2015\nBy:\n/s/ Philip A. \
    Choyce\nName:\nPhilip A. Choyce\nTitle:\nSenior Vice President and Chief \
    Financial Officer\nEXHIBIT INDEX\nExhibit\nNo.\nDescription\n99.1\nPress \
    Release dated August 6, 2015';

var esClient = new es.Client({
    apiVersion: '5.0',
    hosts: [ 'localhost:9200' ],
});

async.waterfall([
    function (callback) {

        var params = {
            index: INDEX_NAME,
        };

        esClient.indices.delete(params, function (err) {

            if (err && err.response) {
                var res = JSON.parse(err.response);
                if (res.error && res.error.type === 'index_not_found_exception') {
                    return callback(null);
                }
            }

            return callback(err);
        });
    },
    function (callback) {

        var params = {
            index: INDEX_NAME,
            body: {
                mappings: {
                    default: {
                        _all: { enabled: false },
                        properties: {
                            text: {
                                analyzer: TEST_ANALYZER,
                                type: 'string',
                            },
                        },
                    },
                },
                settings: {
                    analysis: {
                        char_filter: {
                            single_quotes: {
                                type: 'mapping',
                                mappings: [
                                    '\\u0091=>\\u0027',
                                    '\\u0092=>\\u0027',
                                    '\\u2018=>\\u0027',
                                    '\\u2019=>\\u0027',
                                    '\\u201B=>\\u0027'
                                ],
                            },
                        },
                        filter: {
                            en_US_porter: {
                                type: 'stemmer',
                                language: 'english',
                            },
                            en_US_hunspell: {
                                type: 'hunspell',
                                language: 'en_US',
                                dedup: true,
                            },
                            english_stopwords: {
                                type: 'stop',
                                stopwords: '_english_',
                            },
                            word_delimiter: {
                                type: 'word_delimiter',
                                catenate_all: true,
                                generate_number_parts: false,
                                generate_word_parts: false,
                                preserve_original: false,
                                split_on_case_change: false,
                                split_on_numerics: false,
                                stem_english_possessive: true,
                            },
                        },
                        analyzer: {
                            hunspell: {
                                char_filter: [ 'single_quotes' ],
                                filter: [
                                    'lowercase',
                                    'asciifolding',
                                    'word_delimiter',
                                    'english_stopwords',
                                    'en_US_hunspell',
                                ],
                                tokenizer: 'whitespace',
                            },
                            porter: {
                                char_filter: [ 'single_quotes' ],
                                filter: [
                                    'lowercase',
                                    'asciifolding',
                                    'word_delimiter',
                                    'english_stopwords',
                                    'en_US_porter',
                                ],
                                tokenizer: 'whitespace',
                            },
                        },
                    },
                },
            },
        };

        esClient.indices.create(params, function (err) {
            return callback(err);
        });
    },
    function (callback) {

        var params = {
            index: INDEX_NAME,
            analyzer: TEST_ANALYZER,
            text: TEST_TEXT,
        };

        esClient.indices.analyze(params, function (err, res) {

            if (err) {
                return callback(err);
            }

            var tokens = [];
            for (var i = 0; i < res.tokens.length; i++) {
                tokens.push(res.tokens[i].token);
            }

            console.log('----------------------------------------------------------');
            console.log('  Indexed text using ' + TEST_ANALYZER + ' analyzer.');
            console.log('----------------------------------------------------------');
            console.log(tokens.join(' '));

            return callback(null);
        });
    },
    function (callback) {

        var params = {
            index: INDEX_NAME,
            type: 'default',
            id: 1,
            body: {
                text: TEST_TEXT,
            },
            refresh: true,
        };

        esClient.index(params, function (err) {
            return callback(err);
        });
    },
    function (callback) {

        console.log('----------------------------------------------------------');
        console.log('  No highlight returned using experimental highlighter.');
        console.log('----------------------------------------------------------');

        var params = {
            index: INDEX_NAME,
            type: 'default',
            body: {
                query: {
                    match: {
                        text: {
                            query: SEARCH_TERMS,
                        },
                    },
                },
                highlight: {
                    fields: {
                        text: {
                            type: 'experimental',
                        },
                    },
                },
            },
        };

        esClient.search(params, function (err, res) {

            if (err) {
                return callback(err);
            }

            console.log(JSON.stringify(res,null,4));

            return callback(null);
        });
    },
    function (callback) {

        console.log('----------------------------------------------------------');
        console.log('  Correctly highlighted using plain highlighter.');
        console.log('----------------------------------------------------------');

        var params = {
            index: INDEX_NAME,
            type: 'default',
            body: {
                query: {
                    match: {
                        text: {
                            query: SEARCH_TERMS,
                        },
                    },
                },
                highlight: {
                    fields: {
                        text: {},
                    },
                },
            },
        };

        esClient.search(params, function (err, res) {

            if (err) {
                return callback(err);
            }

            console.log(JSON.stringify(res,null,4));

            return callback(null);
        });
    },
],
function (err) {
    esClient.close();
    if (err) {
        console.error(JSON.stringify(JSON.parse(err.response),null,4));
        console.error(err.stack);
    }
});

plugin install for ES 5.1.2

Thanks for updating the plugin for 5.x. However, the current command does not seem to work. Has the command changed, or is the version not available yet?

Cannot compile for elasticsearch-5.1.1 / lucene-6.3.0

Hello all,

thank you for this useful plugin (I'm using it for obtaining the text offsets of a highlight, a feature which Lucene doesn't provide).

I'm trying to upgrade to elasticsearch-5.1.1 (which uses lucene-6.3.0), but I couldn't compile the plugin with Maven (it complains about missing symbols...), although I've changed the pom to reflect the new versions.

I guess there's more work to be done, but I don't really know where to start, since I have little experience with Java.

Any pointers?
Thanks!

parse error: Invalid \uXXXX\uXXXX surrogate pair escape

The highlighter creates invalid JSON on 4-byte UTF-8 code points (UTF-16 surrogate pairs).
Steps to reproduce:

curl -XPUT 'localhost:9200/index/test/1?pretty' -d '
{
  "message":"Lorem ipsum dolor sit amet, consectetur, adipisci velit \uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A\uD83D\uDC4A"
}'
curl -XGET 'localhost:9200/index/test/_search?pretty' -d '
{
  "query": { "match": { "message": "velit" } },
  "highlight": {"fields":{"message":{"type":"experimental","fragment_size": 50}}}
}'

Everything is parseable except "hits" -> "hits" -> "highlight" -> "message".
Piping the result to "jq .":
parse error: Invalid \uXXXX\uXXXX surrogate pair escape at line 23, column 177
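For anyone who wants to see the failure mode outside Elasticsearch, here is a minimal Node.js sketch (illustrative only, not the plugin's internals) of how a fragment boundary that falls between the two halves of a UTF-16 surrogate pair leaves a lone surrogate behind, which is not valid in JSON text:

```javascript
// A lone high surrogate is any code unit in 0xD800-0xDBFF with no low
// surrogate following it; JSON consumers like jq reject the escaped form.

function isHighSurrogate(code) {
    return code >= 0xD800 && code <= 0xDBFF;
}

// Back a fragment boundary off by one if it would split a surrogate pair.
function safeBoundary(text, offset) {
    if (offset > 0 && isHighSurrogate(text.charCodeAt(offset - 1))) {
        return offset - 1;
    }
    return offset;
}

var text = 'velit \uD83D\uDC4A\uD83D\uDC4A'; // "velit " followed by two fist emoji
var naive = text.slice(0, 7);                // cuts inside the first pair
var safe = text.slice(0, safeBoundary(text, 7));

console.log(isHighSurrogate(naive.charCodeAt(naive.length - 1))); // true
console.log(JSON.stringify(safe));                                // "velit "
```

The `safeBoundary` helper is a hypothetical client-side workaround; the real fix belongs in the highlighter's fragmenter.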

1.7 branch, and Java 1.8

I take it that this combination is not currently supported, given the flood of Maven errors, but correct me if I am wrong.

Elasticsearch Keyword not getting highlighted

Hi,
I'm using Elasticsearch version 7.0.0 and search-highlighter 6.5.4. When I try to highlight a match on a keyword field, the highlight results are not populated, but the same code works fine for a text field. Any help or documentation reference is appreciated.

        final HighlightBuilder highlightBuilder = new HighlightBuilder();
        highlightBuilder.highlighterType("experimental");
        final Map<String , Object> options = new HashMap<>();
        options.put("return_offsets" , true);
        highlightBuilder.options(options);

Highlighting not working for * field

If I try this query:

GET test_index/text/_search
{
  "query": {
    "match": {
      "_all": {
        "query": "test"
      }
    }
  },
  "highlight": {
    "fields": {
      "*": {
        "type": "experimental"
      }
    }
  }
}

I get the following error:
FetchPhaseExecutionException[[test_index][1]: query[filtered(_all:test)->cache(_type:test)],from[0],size[5]: Fetch Failed [Failed to highlight field [_size]]]; nested: NumberFormatException[For input string: ""];

I only have string fields in my mapping, and I don't use a field called "_size".

If I explicitly specify the fields for highlighting (instead of "*"), it works. However, this is not really an option for me, since the query should be agnostic to new fields in the mapping.

search-highlighter version: 1.4.0

How to install from source?

Hello!
I have downloaded the sources and compiled them, and then I got stuck.

pom.xml:

-        <elasticsearch.version>7.5.1</elasticsearch.version>
+        <elasticsearch.version>7.6.2</elasticsearch.version>
-        <lucene.version>8.3.0</lucene.version>
+        <lucene.version>8.4.0</lucene.version>

/src/main/java/org/wikimedia/highlighter/experimental/elasticsearch/FieldWrapper.java

-    MappedFieldType fieldType = context.context.getMapperService().fullName(fieldName);
+    final MappedFieldType fieldType = context.context.getMapperService().fieldType(fieldName);
mvn -Denforcer.skip=true -Dmaven.test.skip=true -DskipTests package
mvn -Denforcer.skip=true -Dmaven.test.skip=true -DskipTests install

ES error:
unknown highlighter type [experimental] for the field [content]

How do I add the plugin to Elasticsearch?

Is there any way to highlight a phrase query spanning an array

This is probably a feature request :),
If I have an indexed array, [{text:"one a",o:"1",d:"2"},{text:"two b",o:"2",d:"2"},{text:"three c",o:"3",d:"2"}]
and a phrase query "one a two", highlighted using fetch_fields: ["o", "d"],

is there any way I can get highlighted results that keep the phrase intact across array boundaries?
Usually this query will produce two fragment results, and not necessarily in document order, i.e. something like:
["two",2,2,
"one a",1,2]

what I would like to get is some way to determine result groups that match the original query,
perhaps as a sub array?
[["one a",1,2,
"two",2,2]]

suggestions for approaches also gratefully accepted

Cheers
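Not a plugin feature, but one standard Elasticsearch mapping knob is closely related: string/text fields insert a position gap (default 100) between array entries, and that gap is what stops a phrase from matching across them. A hedged sketch of a mapping that removes the gap, so "one a two" can phrase-match across ["one a", "two b"] (index and field names are hypothetical; on ES 1.x the parameter was spelled position_offset_gap):

```javascript
// Hypothetical names throughout; this object would be passed to
// esClient.indices.create(). position_increment_gap: 0 removes the
// positional gap between array entries, letting phrase queries (and their
// highlights) span the boundary. Trade-off: phrases also match across
// entries you may consider logically distinct.
var params = {
    index: 'test_array_phrase',
    body: {
        mappings: {
            default: {
                properties: {
                    text: {
                        type: 'string',
                        position_increment_gap: 0,
                    },
                },
            },
        },
    },
};

console.log(JSON.stringify(params, null, 4));
```

Grouping the resulting fragments back into sub-arrays per matched phrase, as asked above, would still need support in the highlighter itself.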

string_index_out_of_bounds_exception

Plugins:

https://github.com/jprante/elasticsearch-langdetect

Steps to reproduce:

curl -XDELETE http://127.0.0.1:9200/test?pretty=true
curl -XPUT http://127.0.0.1:9200/test?pretty=true -d '
{
  "mappings": {
    "_default_": {
      "properties": {
        "description": {
          "type": "text",
          "index_options": "offsets",
          "term_vector": "with_positions_offsets",
          "fields": {
            "language": {
              "type": "langdetect",
              "languages": [ "de", "en" ],
              "language_to": {
                "de": "description_de",
                "en": "description_en"
              }
            }
          }
        },
        "description_de": {
          "type": "text",
          "index_options": "offsets",
          "term_vector": "with_positions_offsets",
          "analyzer": "german"
        },
        "description_en": {
          "type": "text",
          "index_options": "offsets",
          "term_vector": "with_positions_offsets",
          "analyzer": "english"
        }
      }
    }
  }
}'
curl -XPUT http://127.0.0.1:9200/test/table/1?pretty=true -d '
{
   "description" : "Eine wunderbare Heiterkeit hat meine ganze Seele eingenommen, gleich den süßen Frühlingsmorgen, die ich mit ganzem Herzen genieße. Ich bin allein und freue mich meines Lebens in dieser Gegend, die für solche Seelen geschaffen ist wie die meine. Ich bin so glücklich, mein Bester, so ganz in dem Gefühle von ruhigem Dasein versunken, daß meine Kunst darunter leidet. Ich könnte jetzt nicht zeichnen, nicht einen Strich, und bin nie ein größerer Maler gewesen als in diesen Augenblicken. Wenn das liebe Tal um mich dampft, und die hohe Sonne an der Oberfläche der undurchdringlichen Finsternis meines Waldes ruht, und nur einzelne Strahlen sich in das innere Heiligtum stehlen, ich dann im hohen Grase am fallenden Bache liege, und näher an der Erde tausend mannigfaltige Gräschen mir merkwürdig werden; wenn ich das Wimmeln der kleinen Welt zwischen Halmen, die unzähligen, unergründlichen Gestalten der Würmchen, der Mückchen näher an meinem Herzen fühle, und fühle die Gegenwart des Allmächtigen, der uns nach seinem Bilde schuf, das Wehen des Alliebenden, der uns in ewiger Wonne schwebend trägt und erhält; mein Freund! Wenn es dann um meine Augen dämmert, und die Welt um mich her und der Himmel ganz in meiner Seele ruhn wie die Gestalt einer"
}'
sleep 1
curl -XGET http://127.0.0.1:9200/test/table/_search?pretty=true -d '
{
  "query": {
    "simple_query_string" : {
      "query": "Sonne",
      "fields": ["description*"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "description*": {
        "type": "experimental"
      }
    }
  }
}'

Result:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 4,
    "failed" : 1,
    "failures" : [
      {
        "shard" : 3,
        "index" : "test",
        "node" : "US4V_2DJTm6tr3hd4G2ukg",
        "reason" : {
          "type" : "string_index_out_of_bounds_exception",
          "reason" : "String index out of range: -534"
        }
      }
    ]
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.52231133,
    "hits" : [ ]
  }
}

highlighting too many terms causes highlighter to return nothing

I'm using the latest release of this plugin 1.4.1 with ES 1.4.2

In the following query I've manually expanded the fields to work around problems with the highlighter and wildcard queries (it seems to arbitrarily decide not to return any highlights depending on the query fields, possibly related to internal expansion and this 'too many terms' issue). E.g. if I have transcript.en*.p.text in my primary search, it may or may not work.

I've tried it against indexes with "index_options": "offsets" and with "term_vector": "with_positions_offsets", but it doesn't seem to make much difference for this issue.

Manually expanding the terms in the highlighter does give a more consistent result; below I've omitted all but the first entry of the highlighter fields for brevity.

With this many terms in the main query OR the highlight_query, I get zero highlighted results from this experimental highlighter (there is no error), whereas the fvh highlighter returns good results.
If I reduce the number of terms, I get results from this highlighter.

The issue only seems to affect queries where phrase_prefix is used, but it doesn't occur in every query with a phrase_prefix in it.
When it occurs for a given query (irrespective of the query terms), it fails consistently.

If I have a phrase_prefix query ONLY as a highlight_query, and nowhere in the main query, then it seems to give results, but if I have it in both it doesn't.

Neither the fvh nor this highlighter seems to respect the common terms cutoff frequency, hence my use of the highlight_query below.

I realise this is a bit vague, but I can't pin down exactly what is going on. I love this highlighter, mainly for the fetch_fields feature, which improves my performance hugely, so I'm very keen to have (and help get) the issue resolved.

any hints/help greatly appreciated.

{
  "from" : 0,
  "size" : 10,
  "query" : {
    "filtered" : {
      "query" : {
        "bool" : {
          "must" : {
            "multi_match" : {
              "query" : "to influence",
              "fields" : [ "transcript.en.p.text", "transcript.en-AU.p.text",  "transcript.en-BZ.p.text",  "transcript.en-CA.p.text",  "transcript.en-CB.p.text",  "transcript.en-GB.p.text",  "transcript.en-IE.p.text", "transcript.en-JM.p.text", "transcript.en-NZ.p.text",  "transcript.en-PH.p.text", "transcript.en-TT.p.text",  "transcript.en-US.p.text", "transcript.en-ZA.p.text", "transcript.en-ZW.p.text" ],
              "type" : "phrase_prefix",
              "slop" : 2,
              "minimum_should_match" : "75%",
              "cutoff_frequency" : 0.01
            }
          },
          "should" : {
            "multi_match" : {
              "query" : "to influence",
              "fields" : ["title^0.2", "description^0.7" ],
              "type" : "best_fields",
               "cutoff_frequency" : 0.01
            }
          }
        }
      }
    }
  },
  "min_score" : 1.8E-5,
  "highlight" : {
    "order" : "score",
    "type" : "experimental",
    "highlight_query" : {
      "multi_match" : {
        "query" : "to influence",
        "fields" : ["transcript.en.p.text", "transcript.en-AU.p.text",  "transcript.en-BZ.p.text",  "transcript.en-CA.p.text",  "transcript.en-CB.p.text",  "transcript.en-GB.p.text",  "transcript.en-IE.p.text", "transcript.en-JM.p.text", "transcript.en-NZ.p.text",  "transcript.en-PH.p.text", "transcript.en-TT.p.text",  "transcript.en-US.p.text", "transcript.en-ZA.p.text", "transcript.en-ZW.p.text"],
        "type" : "phrase_prefix",
        "slop" : 2
      }
    },
    "fields" : {
      "transcript.en.p.text" : {
        "fragment_size" : 150,
        "number_of_fragments" : 10,
        "fragmenter" : "scan",
        "options" : {
          "hit_source" : "postings",
          "skip_if_last_matched" : false
        }
      }
... more follow 
    }
  }
}

how to affect fragment scoring (elasticsearch plugin)

Hi! Thanks for the plugin; I hope I can get it to work for my use case.

How do you affect fragment scoring in the elasticsearch plugin? I search my data for "organic compound" and find that fragments (I'm using sentences) containing only "compound" show up higher than fragments containing "organic compound", and that, in fact, "organic" is not highlighted at all, but "compound" is.

I'm looking for a way to get back sentences, and to ensure that both proximity and order affect fragment scoring such that the following would be the order in which the sentences would be returned:

  1. Organic compounds can also be classified or subdivided by the presence of heteroatoms.
  2. Organic chemistry is the science concerned with all aspects of organic compounds.
  3. Others state that if a molecule contains carbon―it is organic.
  4. Natural compounds refer to those that are produced by plants or animals.

(where I can choose to have organic compounds highlighted as a phrase or separate terms)

You mention fragment_weigher as a way to customize fragment scoring, but I can't seem to get it to do anything (the fragments returned are always the same, and in the same order). Is this the parameter I should be looking at?

If you could point me to some examples that would do what I'm trying to do, I'd sure appreciate it.

thank you!
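For reference, a hedged sketch of where the plugin's options sit in a request body. The field name `text` is hypothetical; `fragment_weigher` values the plugin documents are, if I'm reading the docs right, `sum` and `exponential` (the latter matching the ExponentialSnippetWeigher mentioned elsewhere in these issues), and `sentence` is one of its documented fragmenters:

```javascript
// Hypothetical index/field names. fragment_weigher is a per-field option
// under "options"; 'sum' weighs fragments by the plain sum of hit weights,
// while the default 'exponential' damps repeated hits on the same terms.
var body = {
    query: { match: { text: 'organic compound' } },
    highlight: {
        order: 'score',          // without this, fragments come back in document order
        fields: {
            text: {
                type: 'experimental',
                fragmenter: 'sentence',
                number_of_fragments: 4,
                options: {
                    fragment_weigher: 'sum',
                },
            },
        },
    },
};

console.log(JSON.stringify(body, null, 4));
```

One thing worth double-checking for the symptom above: without "order": "score", fragments are returned in document order regardless of the weigher, which would match "always the same, and in the same order".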

Documentation Clarification - Offsets in postings or term vectors

In the docs under 'Offsets in postings or term vectors', it states:

" ...Unless you have a compelling reason go with adding offsets to the postings. That is faster (by my tests, at least) and uses much less space."

Does this mean that setting term_vector to with_positions_offsets is faster and uses less space?
Could you clarify this, please?

Cheers.
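Reading the quoted sentence the other way around: it is adding offsets to the postings (via index_options) that the docs call faster and smaller; term vectors are the bulkier alternative. A sketch of the two mappings side by side, with hypothetical field names:

```javascript
// Two ways to store offsets for the highlighter; the quoted docs recommend
// the postings route (index_options) as faster and using much less space.
// Field names are hypothetical.
var properties = {
    text_postings: {
        type: 'string',
        index_options: 'offsets',              // recommended by the quoted docs
    },
    text_vectors: {
        type: 'string',
        term_vector: 'with_positions_offsets', // also works, but uses more space
    },
};

console.log(JSON.stringify(properties, null, 4));
```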

trying to find a way to remove common terms and debris from highlighted terms

If I do a match or multi_match query with a cutoff frequency set, the high-frequency (common) terms are still highlighted.

For example, a multi-term query of something like 'I have a fish' on a reasonably sized corpus with a cutoff_frequency set produces reasonable query results, but the highlighting is poor: the first few highlights for a given result are good, but then it deteriorates into an assortment of single 'I' and 'a' results, which I would like to prevent. I've tried variously combining weighted phrase queries with others to achieve the desired result, but no luck. I've also tried using a separate highlight query, which works very well in some cases, but because it's not a match for the main search query, sometimes there are no highlights at all.

I'm looking for a way to trim the highlighted results by hit score, since I don't seem to be able to control what's being highlighted well enough by query.
I've been hunting through the code, but it's not clear to me where I should best do this.

The BasicScoreBasedSnippetChooser seems like the most likely place,
in the mustKeepGoing method, but I can't see how to access the info needed.

So far my best attempt (in some cases) has been to add the following condition in the results method,
where 'cutOff' is some float value:

        for (ProtoSnippet proto : protos) {
            if (proto.weight > cutOff) {
                results.add(new Snippet(proto.pickedBounds.startOffset(), proto.pickedBounds.endOffset(), proto.hits));
            }
        }

In some cases this produces good results, partly, I suspect, because it influences the exponentialSnippetWeigher rather than just clipping the junk from the results. But it certainly doesn't do what I intended, and it's clearly not the right place to be trying this; it was simply the easiest place to get access to some values I could use.

Any insights into what I may be doing wrong, or how I could best achieve this, will be greatly appreciated.

thanks

match_phrase not highlighting when stopwords removed

When searching an exact phrase where the search terms contain a stopword and stopwords have been removed, the experimental highlighter does not highlight the phrase. However, ES finds the phrase and the plain highlighter highlights it correctly. I don't think stemming and word_delimiter have anything to do with the problem, but they are part of the real-world analyzer where I found the problem. Below is a complete Node.js test case.

OS: Ubuntu 14.04
ES version: 2.4.1

var async = require('async');
var es = require('elasticsearch');

var INDEX_NAME = 'test_word_delimiter';
var SEARCH_TERMS = 'board of directors';
var TEST_SENTENCE = '\
    On February 9, 2017 in Form 8-K/A, the Board of Directors (the “Board”) of Tractor \
    Supply Company ("the Company"), amended and restated the Company’s \
    Fourth Amended and Restated By-laws (the “By-laws” and, as amended \
    and restated, the “Amended By-laws”). The following is a brief summary \
    of the material changes effected by adoption of the Amended By-laws, \
    which is qualified in its entirety by reference to the Amended By-laws \
    filed as Exhibit 3.1(i) hereto.';

var esClient = new es.Client({
    apiVersion: '2.4',
    hosts: [ 'localhost:9200' ],
});

async.waterfall([
    function (callback) {

        var params = {
            index: INDEX_NAME,
        };

        esClient.indices.delete(params, function (err) {

            if (err && err.response) {
                var res = JSON.parse(err.response);
                if (res.error && res.error.type === 'index_not_found_exception') {
                    return callback(null);
                }
            }

            return callback(err);
        });
    },
    function (callback) {

        var params = {
            index: INDEX_NAME,
            body: {
                mappings: {
                    default: {
                        _all: { enabled: false },
                        properties: {
                            text: {
                                analyzer: 'word_delimiter_stopword_stem',
                                type: 'string',
                            },
                        },
                    },
                },
                settings: {
                    analysis: {
                        char_filter: {
                            single_quotes: {
                                type: 'mapping',
                                mappings: [
                                    '\\u0091=>\\u0027',
                                    '\\u0092=>\\u0027',
                                    '\\u2018=>\\u0027',
                                    '\\u2019=>\\u0027',
                                    '\\u201B=>\\u0027'
                                ],
                            },
                        },
                        filter: {
                            en_US: {
                                type: 'stemmer',
                                language: 'english',
                            },
                            english_stopwords: {
                                type: 'stop',
                                stopwords: '_english_',
                            },
                            word_delimiter: {
                                type: 'word_delimiter',
                                catenate_all: true,
                                generate_number_parts: false,
                                generate_word_parts: false,
                                preserve_original: false,
                                split_on_case_change: false,
                                split_on_numerics: false,
                                stem_english_possessive: true,
                            },
                        },
                        analyzer: {
                            word_delimiter_stopword_stem: {
                                char_filter: [ 'single_quotes' ],
                                filter: [
                                    'lowercase',
                                    'word_delimiter',
                                    'english_stopwords',
                                    'en_US',
                                ],
                                tokenizer: 'whitespace',
                            },
                        },
                    },
                },
            },
        };

        esClient.indices.create(params, function (err) {
            return callback(err);
        });
    },
    function (callback) {

        var params = {
            index: INDEX_NAME,
            type: 'default',
            id: 1,
            body: {
                text: TEST_SENTENCE,
            },
            refresh: true,
        };

        esClient.index(params, function (err) {
            return callback(err);
        });
    },
    function (callback) {

        console.log('----------------------------------------------------------');
        console.log('  No highlight returned using experimental highlighter.');
        console.log('----------------------------------------------------------');

        var params = {
            index: INDEX_NAME,
            type: 'default',
            body: {
                query: {
                    match_phrase: {
                        text: {
                            query: SEARCH_TERMS,
                        },
                    },
                },
                highlight: {
                    fields: {
                        text: {
                            type: 'experimental',
                        },
                    },
                },
            },
        };

        esClient.search(params, function (err, res) {

            if (err) {
                return callback(err);
            }

            console.log(JSON.stringify(res,null,4));

            return callback(null);
        });
    },
    function (callback) {

        console.log('----------------------------------------------------------');
        console.log('  Correctly highlighted using plain highlighter.');
        console.log('----------------------------------------------------------');

        var params = {
            index: INDEX_NAME,
            type: 'default',
            body: {
                query: {
                    match_phrase: {
                        text: {
                            query: SEARCH_TERMS,
                        },
                    },
                },
                highlight: {
                    fields: {
                        text: {},
                    },
                },
            },
        };

        esClient.search(params, function (err, res) {

            if (err) {
                return callback(err);
            }

            console.log(JSON.stringify(res,null,4));

            return callback(null);
        });
    },
],
function (err) {
    esClient.close();
    if (err) {
        console.error(JSON.stringify(JSON.parse(err.response),null,4));
        console.error(err.stack);
    }
});
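A simplified simulation (plain JS, not the real analyzer) of why token positions matter in this report: the stop filter removes 'of' but preserves its position increment, so 'directors' is indexed at position 2. A match_phrase query tolerates that gap because the query terms are analyzed the same way against the index; the experimental highlighter's phrase handling is presumably where the behavior diverges. Stopword list and token records here are illustrative:

```javascript
// Mimic a whitespace tokenizer followed by a stop filter that keeps
// position increments, the way Elasticsearch's stop filter does.
var STOPWORDS = ['a', 'an', 'of', 'the'];

function analyze(text) {
    var pos = -1;
    return text
        .toLowerCase()
        .split(/\s+/)
        .map(function (token) {
            pos += 1; // removed tokens still advance the position counter
            return STOPWORDS.indexOf(token) === -1 ? { token: token, pos: pos } : null;
        })
        .filter(Boolean);
}

console.log(analyze('board of directors'));
// [ { token: 'board', pos: 0 }, { token: 'directors', pos: 2 } ]
```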

Notice: Lucene UnifiedHighlighter

I just wish to bring this to the attention of the developers here: https://issues.apache.org/jira/browse/LUCENE-7438 "UnifiedHighlighter" (I did most of it)

Some differences I've observed reading about the Wikimedia Experimental Highlighter (WEH) (not exhaustive!):

  • WEH effectively forces requireFieldMatch=false whereas UH forces requireFieldMatch=true. Of course it'd be nice for this to be user-configurable.
  • WEH supports not only BreakIterator-based fragmentation but also position-gap-based fragmentation like the FVH. I wonder if better/custom BreakIterator impls could shore up the desire for whatever people like in the FVH approach?
  • UH supports SpanQueries, including custom ones the user may have
  • WEH can highlight phrases in an analysis mode in a streaming fashion without resorting to fully analyzing the content (unlike the UH)
  • UH has a unique "postings with light term vector" mode in which term vectors (no pos/offsets needed) are there only for accelerating multi-term queries (e.g. wildcards)
  • I figure the UH & PH PassageScorer "k1" param could be lowered to 1.0 or something to further decrease the term-frequency component of the score and thus increase term-diversity in passages. WEH seems to more directly have term diversity support for the passage? Nevertheless I have plans for the UH to address this more holistically (i.e. across snippets).
  • WEH has an explicit merge_fields feature, just as the FVH does. UnifiedHighlighter internally supports this (we've customized it to do so in our app) with some code but it's not exposed as first class feature. I think the UH should add this now.
  • WEH can be configured to not highlight "common" terms. Presumably that's "TF" (term freq). The UH (& PH from which it derives) factors the TF into passage weighting but it's ultimately highlighted.
  • The fall-back style support in WEH looks like a useful optimization to avoid highlighting fields.
  • It seems WEH has an optimization to not fetch the field value if there are no hits in the field? In contrast, the UH grabs all values up front, even across the documents to highlight so that the stored-data disk is accessed first and then the postings/TV second (better mechanical sympathy than doc-at-a-time). I'm not sure which is better, but I'm sure "it depends" of course.

BTW, nice work on this Wikimedia Experimental Highlighter!
