elasticsearch-batch-percolator's Introduction

Deprecation Warning

Our Elasticsearch stack has evolved, and we no longer use this component. It has therefore been deprecated and is no longer actively maintained by Meltwater.

Elasticsearch batch percolator

The batch percolator is a fork of the official Elasticsearch percolator. It is highly optimized for high-volume percolation with complex Lucene queries such as wildcards, spans and phrases.

Using the official multi percolator we were able to reach ~1 document/second with 100.000 registered queries. With the batch-percolator, we are currently handling ~1000 documents/second with 225.000 registered queries. However, this will differ greatly depending on the nature of your queries and on whether you have an efficient strategy for filtering out queries.

For more information, see this blog post.

Installation

elasticsearch/bin/plugin --install elasticsearch-batch-percolator -u "https://dl.bintray.com/meltwater/elasticsearch-batch-percolator/com/meltwater/elasticsearch-batch-percolator/1.1.2/elasticsearch-batch-percolator-1.1.2.zip"

Version matrix:

┌─────────────────────────────────────────┬──────────────────────────┐
│ Elasticsearch batch percolator          │ Elasticsearch            │
├─────────────────────────────────────────┼──────────────────────────┤
│ 1.x.x                                   │ 1.7.0 ─► 1.7.6           │
└─────────────────────────────────────────┴──────────────────────────┘

API documentation

Create index with mapping

curl -XPUT localhost:9200/index -d '{
  "mappings": {
    "type": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "field1": {
          "type": "string",
          "index": "analyzed"
        },
        "field2": {
          "type": "string",
          "index": "analyzed"
        }
      }
    }
  }
}'

You could also use a template, or store a document to get a dynamic mapping for the document type to percolate against.

Registration of queries

curl -XPOST localhost:9200/index/.batchpercolator/query1 -d '{
  "query": {
    "term": {
      "field1": "fox"
    }
  },
  "highlight": {
    "pre_tags": [
      "<b>"
    ],
    "post_tags": [
      "</b>"
    ],
    "fields": {
      "field1": {}
    }
  }
}'

Sending in documents

curl -XPOST localhost:9200/index/type/_batchpercolate -d '{
  "docs": [
    {
      "_id": "doc1",
      "field1": "the fox is here",
      "field2": "meltwater"
    },
    {
      "_id": "doc2",
      "field1": "the fox is not here",
      "field2": "percolator"
    }
  ]
}'

Example response:

{
  "took": 23,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "results": [
    {
      "doc": "doc1",
      "matches": [
        {
          "query_id": "query1",
          "highlights": {
            "field1": [
              "the <b>fox</b> is here"
            ]
          }
        }
      ]
    }
  ]
}
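
A client will usually want to turn such a response into a mapping from document ID to the IDs of the queries that matched it. The following is a minimal Python sketch of that step, assuming the response has the shape shown in the example response; the helper name `matches_by_doc` is hypothetical and not part of the plugin.

```python
import json

# Sample response, copied from the example response in this README.
SAMPLE_RESPONSE = """
{
  "took": 23,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "results": [
    {
      "doc": "doc1",
      "matches": [
        {
          "query_id": "query1",
          "highlights": {"field1": ["the <b>fox</b> is here"]}
        }
      ]
    }
  ]
}
"""


def matches_by_doc(response):
    """Map each document ID to the list of query IDs that matched it.

    Hypothetical helper: it only assumes the "results", "doc", "matches"
    and "query_id" keys seen in the sample response.
    """
    return {
        result["doc"]: [match["query_id"] for match in result.get("matches", [])]
        for result in response.get("results", [])
    }


response = json.loads(SAMPLE_RESPONSE)
print(matches_by_doc(response))
```

Highlights can be read the same way from the "highlights" object of each match.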

How does it differ from the official (multi) percolator?

Batching of documents

The official multi-percolator uses a 'MemoryIndex' which is a highly optimized index often used for percolation. The downside with the MemoryIndex is that it can only hold one document at a time.

The batch percolator instead uses a RAMDirectory, which can hold many documents, so we can process documents in batches.
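
To illustrate why batching helps, here is a toy Python sketch. It is not the plugin's code and does not use Lucene: it assumes a simple whitespace tokenizer and plain term queries, and the names `build_batch_index` and `percolate_batch` are hypothetical. The point is only that the batch is indexed once and each registered query is then evaluated once per batch rather than once per document.

```python
from collections import defaultdict


def build_batch_index(docs):
    """Build a toy inverted index for the batch: term -> set of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def percolate_batch(docs, term_queries):
    """Run every registered term query once against the whole batch."""
    index = build_batch_index(docs)
    results = defaultdict(list)
    for query_id, term in term_queries.items():
        # One lookup per query covers all documents in the batch.
        for doc_id in sorted(index.get(term, ())):
            results[doc_id].append(query_id)
    return dict(results)


docs = {"doc1": "the fox is here", "doc2": "no match in this one"}
queries = {"query1": "fox", "query2": "percolator"}
print(percolate_batch(docs, queries))
```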

Two-phase query execution

Complex queries like Span, Phrase and especially MultiPhraseQueries are orders of magnitude slower than Term or Boolean queries. All complex queries can be approximated using cheaper queries (for example, a SpanNear can be approximated using an AND query).

In the batch-percolator, a simplified approximation of each query is first executed on the batch of documents. We only execute the original expensive query if the approximated query has any matches in the batch. This is similar to how Lucene 5 executes those queries, and we expect to phase out this step once Elasticsearch 2.0 has a stable release.
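
The two phases can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the plugin's implementation: documents are tokenized on whitespace, the expensive query is an exact phrase match, its cheap approximation is an AND over the phrase's terms, and all function names are hypothetical.

```python
def tokens(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def approx_matches(phrase, doc_text):
    """Cheap phase: every term of the phrase occurs somewhere in the document."""
    doc_terms = set(tokens(doc_text))
    return all(term in doc_terms for term in tokens(phrase))


def exact_matches(phrase, doc_text):
    """Expensive phase: the terms occur consecutively and in order."""
    words, wanted = tokens(doc_text), tokens(phrase)
    n = len(wanted)
    return any(words[i:i + n] == wanted for i in range(len(words) - n + 1))


def two_phase_percolate(phrase_queries, batch):
    """Return query ID -> matching doc IDs, skipping queries whose
    approximation has no match anywhere in the batch."""
    results = {}
    for query_id, phrase in phrase_queries.items():
        # Phase 1: run the cheap approximation over the batch.
        if not any(approx_matches(phrase, text) for text in batch.values()):
            continue
        # Phase 2: the approximation matched, so run the original query.
        hits = [doc_id for doc_id, text in batch.items()
                if exact_matches(phrase, text)]
        if hits:
            results[query_id] = hits
    return results


batch = {"doc1": "the quick brown fox", "doc2": "fox brown quick"}
queries = {"q1": "quick brown", "q2": "lazy dog"}
print(two_phase_percolate(queries, batch))
```

In this sketch `q2` never reaches the expensive phase, while `q1` passes the approximation on both documents but only matches `doc1` exactly.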

Fewer features

We've removed a lot of features from the official multi-percolator. This means that you can no longer use filter queries or aggregations on matching queries. You should consider this plugin to be 'vanilla percolation'. Some of the features were removed because they cannot be supported in batch-mode. Some have been removed to reduce the complexity of the code.

elasticsearch-batch-percolator's Issues

Rewrite tests using elasticsearch test suite

The Elasticsearch test suite comes with a number of nice features (randomized number of nodes, setup etc). They have started to publish the test suite as a test-jar which we can use, so we don't need to fork it.

Batch percolation no longer matches after restarting Elasticsearch

Hi, I am using elasticsearch-batch-percolator version 1.0.1 with Elasticsearch version 1.7.1. It works great, with good performance. But every time I restart Elasticsearch, batch percolation can't match docs anymore. The index containing the queries is still there and I can still query the query docs, but calling the percolate API now never matches any docs.
I have to reindex the query docs to make it work again.
