
elasticsearch-assets's Introduction

elasticsearch-assets

A bundle of Teraslice processors for reading and writing elasticsearch data

Getting Started

This asset bundle requires a running Teraslice cluster; you can find the documentation here.

# Step 1: make sure you have teraslice-cli installed
yarn global add teraslice-cli

# Step 2: deploy the asset bundle to your Teraslice cluster
teraslice-cli assets deploy <cluster-alias> --build

APIs

Operations

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT licensed.


elasticsearch-assets's Issues

elasticsearch reader doesn't work with persistent job lifecycle

Elasticsearch reader throws an error

Failure to determine slice: TypeError: Cannot read property '0' of undefined
elasticsearch_reader/elasticsearch_date_range/slicer.js:182:56

The dateParams object is missing the interval property, and possibly other parameters, by the time it gets further along in the process.

Handle out-of-range documents in index selector op

With a timeseries index selector, there is potential to create unintended indices when the data source contains bad data. It would be nice to specify a valid range in relative terms (e.g. ["-1 year", "+1 month"]) and the index to ship such out-of-range documents to (e.g. "errors").

If we decide to include something like this, we should consider breaking the index selector into two processors: one for time-series indices and one for everything else.
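
A hypothetical configuration for such an option might look like the sketch below; the valid_date_range and error_index option names (and the surrounding options) are illustrative only and do not exist in the current processor:

{
    "_op": "elasticsearch_index_selector",
    "index": "events",
    "timeseries": "daily",
    "date_field": "created",
    "valid_date_range": ["-1 year", "+1 month"],
    "error_index": "errors"
}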

Elasticsearch reader throws error when using '*' in index name, used when reading from multiple indices in one job

Using * in the index name to read from multiple indices at a time with asset version 2.1.0 throws the error:

"error": "TSError: Cannot read property 'defaults' of undefined\n    at pRetry (...utils/dist/src/promises.js:61:21)\n    at processTicksAndRejections (internal/process/task_queues.js:97:5)\n    at ElasticsearchAPI.getWindowSize (.../dist/elasticsearch_reader_api/elasticsearch-api.js:346:49)\n 

Tried using both the long and short form with the same result.

Version 2.0.3 works with these same settings.

Job settings:

{
    "name": "temp#reindex",
    "lifecycle": "once",
    "workers": 1,
    "assets": [
        "elasticsearch:2.1.0",
        "standard:0.10.0"
    ],
    "apis": [
        {
            "_name": "elasticsearch_reader_api",
             "connection": "CONNECTION",
            "index": "index-name-v2-*",
            "type": "_doc",
            "date_field_name": "DATEFIELD",
            "interval": "auto",
            "time_resolution": "ms",
            "size": 100000
        },
        {
            "_name": "elasticsearch_sender_api",
            "index": "INDEX-NAME",
            "size": 10000
        }
    ],
    "operations": [
        {
            "_op": "elasticsearch_reader",
           "api_name": "elasticsearch_reader_api"
        },
        {
            "_op": "date_router",
            "field": "DATEFIELD",
            "resolution": "monthly",
            "field_delimiter": ".",
            "include_date_units": false
        },
        {
            "_op": "routed_sender",
            "api_name": "elasticsearch_sender_api",
            "routing": {
                "**": "CONNECTION"
            }
        }
    ],

ID slicer not working correctly with many slicers

Given a dataset of 499,865 records with an md5 key and the default configuration, the keys divided up are sometimes wrong depending on the number of slicers.

For 2 slicers I get the expected result (which fetches the full dataset):

[
    {
        "type": "ID",
        "total_slicers": 2,
        "range": {
            "keys": [
                "a",
                "b",
                "c",
                "d",
                "e",
                "f",
                "g",
                "h",
                "i",
                "j",
                "k",
                "l",
                "m",
                "n",
                "o",
                "p",
                "q",
                "r",
                "s",
                "t",
                "u",
                "v",
                "w",
                "x",
                "y",
                "z",
                "A",
                "B",
                "C",
                "D",
                "E",
                "F"
            ],
            "count": 187558
        },
        "id": 0
    },
    {
        "type": "ID",
        "total_slicers": 2,
        "range": {
            "keys": [
                "G",
                "H",
                "I",
                "J",
                "K",
                "L",
                "M",
                "N",
                "O",
                "P",
                "Q",
                "R",
                "S",
                "T",
                "U",
                "V",
                "W",
                "X",
                "Y",
                "Z",
                "0",
                "1",
                "2",
                "3",
                "4",
                "5",
                "6",
                "7",
                "8",
                "9",
                "-",
                "_"
            ],
            "count": 312307
        },
        "id": 1
    }
]

For 4 slicers I get the wrong result (3x the number of records will be fetched):

[
    {
        "type": "ID",
        "total_slicers": 4,
        "range": {
            "keys": [
                "a",
                "b",
                "c",
                "d",
                "e",
                "f",
                "g",
                "h",
                "i",
                "j",
                "k",
                "l",
                "m",
                "n",
                "o",
                "p"
            ],
            "count": 187558
        },
        "id": 0
    },
    {
        "type": "ID",
        "total_slicers": 4,
        "range": {
            "keys": [],
            "count": 499865
        },
        "id": 1
    },
    {
        "type": "ID",
        "total_slicers": 4,
        "range": {
            "keys": [],
            "count": 499865
        },
        "id": 2
    },
    {
        "type": "ID",
        "total_slicers": 4,
        "range": {
            "keys": [
                "q",
                "r",
                "s",
                "t",
                "u",
                "v",
                "w",
                "x",
                "y",
                "z",
                "A",
                "B",
                "C",
                "D",
                "E",
                "F",
                "G",
                "H",
                "I",
                "J",
                "K",
                "L",
                "M",
                "N",
                "O",
                "P",
                "Q",
                "R",
                "S",
                "T",
                "U",
                "V",
                "W",
                "X",
                "Y",
                "Z",
                "0",
                "1",
                "2",
                "3",
                "4",
                "5",
                "6",
                "7",
                "8",
                "9",
                "-",
                "_"
            ],
            "count": 312307
        },
        "id": 3
    }
]
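
For reference, a minimal sketch (not the asset's actual slicer code) of how the 64-character base64url keyspace could be divided so that every slicer gets a disjoint, non-empty key range; with 4 slicers this would yield four 16-key ranges rather than two empty ones:

// Illustrative only: evenly split the base64url keyspace across `totalSlicers`.
const BASE64URL_KEYS: string[] = [
    ...'abcdefghijklmnopqrstuvwxyz',
    ...'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    ...'0123456789',
    '-', '_'
];

function splitKeys(totalSlicers: number): string[][] {
    const perSlicer = Math.ceil(BASE64URL_KEYS.length / totalSlicers);
    const ranges: string[][] = [];
    for (let id = 0; id < totalSlicers; id++) {
        ranges.push(BASE64URL_KEYS.slice(id * perSlicer, (id + 1) * perSlicer));
    }
    return ranges;
}

// splitKeys(2) => two 32-key ranges; splitKeys(4) => four 16-key ranges.
// No slicer should end up with an empty `keys` array that counts the whole index.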

Needs documentation

The documentation from Teraslice core should be pulled over to this project and updated with information about deploying the asset.

elasticsearch date reader issues

The date reader does not properly handle repeated sequences of data that have many empty slices followed by massive spikes that cannot be broken down any further.

For example, with size: 100 the sequence of counts per interval might be [0, 1000, 0, 1000, 0, 1000, 0, 0, 1000, 0].

There was a slice record that encapsulated three of the spikes instead of one.

We should consider a threshold that simply fails the slice if the data is too large compared to size. How this would work for the slicer still needs to be thought out.
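
A minimal sketch of the proposed guard (names and shapes are illustrative, not the slicer's actual API): fail or flag a slice whose document count exceeds the configured size by some threshold instead of silently merging spikes.

// Illustrative guard: reject a slice whose count is far larger than the target size.
interface SliceRange {
    start: string;
    end: string;
    count: number;
}

function checkSliceSize(slice: SliceRange, size: number, threshold = 2): SliceRange {
    if (slice.count > size * threshold) {
        // In the real slicer this might mark the slice as failed or send it to a
        // retry/error path instead of throwing.
        throw new Error(
            `slice ${slice.start} - ${slice.end} has ${slice.count} docs, `
            + `exceeding ${size} * ${threshold}`
        );
    }
    return slice;
}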

elasticsearch_reader does not work with elasticsearch v7

The elasticsearch_reader only processes one slice then marks the job as complete when reading from elasticsearch v7 indices.

I tested this locally and on de2 on various data types and indices and saw the same behavior every time.

used asset version 1.8.5

latest job settings for local test:

 "_op": "elasticsearch_reader",
 "connection": "dev_es_v7",
 "size": 10000,
 "index": "es-generated-v1",
  "date_field_name": "created",
  "interval": "auto",
  "time_resolution": "ms"

Don't really know why it is doing this, but will add more as I dig in.

Large windowSize can trip the ES circuit breaker, fail slices, and crash ES nodes

windowSize is defined here in the elasticsearch-reader-api, but it does not appear to be externally configurable via job parameters (adding windowSize to the job does not override the value):

/**
 * we should expose this because in some cases
 * it might be an optimization to set this externally
 */
windowSize: number|undefined = undefined;

windowSize seems to be automatically set to match the per-index max_result_window setting from elasticsearch, and is ultimately used as the size parameter for the query:

/**
 * This is used to verify that the index.max_result_window size
 * will be big enough to fit the requested slice size
 */
async getWindowSize(): Promise<number> {
    const window = 'index.max_result_window';
    const { index } = this.config;
    const settings = await this.getSettings(index);
    const matcher = indexMatcher(index);

    for (const [key, configs] of Object.entries(settings)) {
        if (matcher(key)) {
            const defaultPath = configs.defaults[window];
            const configPath = configs.settings[window];
            // config goes first as it overrides any defaults
            if (configPath) return toIntegerOrThrow(configPath);
            if (defaultPath) return toIntegerOrThrow(defaultPath);
        }
    }
    return this.config.size;
}

Here we can see that varying the size (windowSize) dramatically affects performance, despite the record count remaining the same (less than 50 records returned for this query):

curl -Ss "cluster/index*/_search?size=2000&track_total_hits=true" -XPOST -d '{"query":{"bool":{"must":[{"wildcard":{"_key":"aaad*"}}]}}}' -H 'content-type: application/json' | jq .took
38
curl -Ss "cluster/index*/_search?size=20000&track_total_hits=true" -XPOST -d '{"query":{"bool":{"must":[{"wildcard":{"_key":"aaad*"}}]}}}' -H 'content-type: application/json' | jq .took
45
curl -Ss "cluster/index*/_search?size=200000&track_total_hits=true" -XPOST -d '{"query":{"bool":{"must":[{"wildcard":{"_key":"aaad*"}}]}}}' -H 'content-type: application/json' | jq .took
484
curl -Ss "cluster/index*/_search?size=2000000&track_total_hits=true" -XPOST -d '{"query":{"bool":{"must":[{"wildcard":{"_key":"aaad*"}}]}}}' -H 'content-type: application/json' | jq .took
5134

Presumably we could manually set max_result_window on the indices and the Teraslice job would no longer trip the circuit breaker or crash ES nodes, but ideally this would just be set on the job or adapt more intelligently to slice sizes; if a slice only had 1000 records, perhaps a default windowSize of 1000*2 would be appropriate.
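
As a workaround, index.max_result_window can be raised (or lowered) manually on the affected indices through the settings API; the index pattern and value here are only examples:

curl -Ss "cluster/index*/_settings" -XPUT -d '{"index": {"max_result_window": 200000}}' -H 'content-type: application/json'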

Maybe this issue was alluding to this problem: #13

(Thanks to @briend for this description)

Convert CI from Travis to GitHub Actions

We need to convert the ES asset CI to GitHub Actions. The tests currently iterate over opensearch versions and aren't as simple as the other assets' (e.g. it's not just yarn test:all), so we will only re-use the shared build/publish workflow and implement a custom test workflow for the time being.

This will probably address the need for a Node 18 build as well:

#999

refactor slicers/readers

We need to make these more generic and able to specify which is the primary slicer and which is the subslicing algorithm. We need to make things more flexible.

spaces_reader appears to be reading duplicate records unless using 1 worker or interval set to 1s

Found that the spaces_reader returns too many records unless using 1 worker and 1 slicer. Reducing the interval to 1s, independent of the number of workers or slicers, returns close to the correct number of records but is still a bit too high.

Used a control group of data of 6.95M records in all the tests.

Tests were run with elasticsearch-asset version 2.6.2, node 12 on dataeng3, teraslice version 0.76.1.

workers  slicers  interval  docs returned (M)
20       10       auto      8.38
20       1        auto      7.81
1        1        auto      6.95
20       10       1s        6.97
20       10       1m        8.19
20       10       1hr       8.48

Ran a job with 20 workers, 10 slicers, and interval auto that deduped the records; the count came to 6.95M, so it looks like it's picking up duplicate records.

doc fixes

  • The index option on operations needs to mention that index is optional if a full api is specified; otherwise it is required. Make sure the api schema says index is required, and note this in the docs.

The instantiation code checks for the presence of index, but the docs need to be clearer.

id_reader needs to work with elasticsearch v7

The id_reader does not read all the data from an elasticsearch v7 index. Tested against different indices on different clusters, and the resulting index was short by several multiples.

For this local test the original index had 1,955,764 docs and the index from the id_reader had 288,685 docs.

test job settings:

{
    "_op": "id_reader",
    "connection": "dev_es_v7",
    "size": 10000,
    "index": "test_index",
    "field": "_key"
},

Bulk Sender `chunkRequests` Bug

The count variable in the chunkRequests function in the BulkSender class is not reset after it reaches config.size, which results in every record after the count exceeds config.size becoming its own bulk request.

This bug was introduced in ES-Assets v2.7.8 and is present in every version thereafter.
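
A minimal sketch of the intended chunking behavior (illustrative, not the BulkSender source): the running count must be reset each time a chunk is emitted, otherwise every subsequent record becomes its own chunk.

// Illustrative chunking loop: `count` is reset whenever a chunk is flushed.
function chunkRequests<T>(records: T[], size: number): T[][] {
    const chunks: T[][] = [];
    let current: T[] = [];
    let count = 0;

    for (const record of records) {
        current.push(record);
        count += 1;
        if (count >= size) {
            chunks.push(current);
            current = [];
            count = 0; // the missing reset described above
        }
    }
    if (current.length) chunks.push(current);
    return chunks;
}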

Inline script syntax has changed in newer ES versions

On ES 5 and above this warning is logged by elasticsearch when using a script with the elasticsearch_index_selector.

Deprecated field [inline] used, expected [source] instead

This means the field used to provide the script in the bulk request has changed from inline to source. Simply changing this will make the scripting functionality incompatible with ES 2; not sure if that's an issue.

olivere/elastic#627
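
For reference, a minimal bulk update body on a newer cluster uses source inside the script object, where older clusters used inline; the index name, id, and script below are placeholders:

{ "update": { "_index": "some-index", "_id": "1" } }
{ "script": { "source": "ctx._source.counter += params.count", "lang": "painless", "params": { "count": 1 } } }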

Refactor Sender APIs

  • Need to refactor the senders so the API logic lives in the new exportable elasticsearch-asset-apis directory
  • Need to update docs on the new APIs and on the headers/retry configs for spaces

id_reader: auto choose better defaults

The id_reader requires too much thought and testing in particular scenarios to work effectively, which makes it very hard to use. It seems we should be able to do more introspection of the index and select more reasonable defaults for things like starting_key_depth and size.
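
One way such introspection might work (purely a sketch, not the asset's implementation): derive starting_key_depth from the index document count and the configured slice size, since each additional key character multiplies the number of key partitions by 64.

// Illustrative heuristic: choose a key depth such that 64^depth partitions
// of roughly `size` documents would cover the whole index.
function suggestKeyDepth(docCount: number, size: number, keyspace = 64): number {
    if (docCount <= size) return 1;
    return Math.ceil(Math.log(docCount / size) / Math.log(keyspace));
}

// e.g. suggestKeyDepth(500_000, 10_000) === 1, suggestKeyDepth(20_000_000, 10_000) === 2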

Slice error when writing to rolling opensearch cluster

We were rolling an internal 1.3.* Opensearch cluster today and noticed that we started getting slice errors during the roll. Specifically we were getting this error:

class NoLivingConnectionsError extends OpenSearchClientError {
  constructor(message, meta) {
    super(message);
    Error.captureStackTrace(this, NoLivingConnectionsError);
    this.name = 'NoLivingConnectionsError';
    this.message =
      message ||
      'Given the configuration, the ConnectionPool was not able to find a usable Connection for this request.';
    this.meta = meta;
  }
}

https://github.com/opensearch-project/opensearch-js/blob/871a6669c9153d8161b3bbbce8747b86f9e6f758/lib/errors.js#L66

Restrict the concurrency with which bulk requests are sent

The elasticsearch_bulk_multisender uses concurrency on Promise.map to limit the number of concurrent bulk requests that can be sent at one time. (Note: elasticsearch_bulk_multisender is not yet publicly released; it's in the internal common_processor asset bundle.)

The elasticsearch_bulk asset should support this option as well.
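
A minimal sketch of the idea without any library dependency (illustrative; the internal processor reportedly uses Bluebird's Promise.map with a concurrency option):

// Illustrative concurrency limiter: send at most `concurrency` bulk requests at once.
async function sendWithConcurrency<T>(
    chunks: T[][],
    send: (chunk: T[]) => Promise<void>,
    concurrency: number
): Promise<void> {
    const queue = [...chunks];
    const workers = Array.from({ length: concurrency }, async () => {
        while (queue.length) {
            const chunk = queue.shift();
            if (chunk) await send(chunk);
        }
    });
    await Promise.all(workers);
}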

api reader broken in 1.3.1

It looks like the simple_api_reader is broken in the latest release.

Error: Cannot find module '../elasticsearch_reader/elasticsearch_date_range/reader'

simple_api_reader is broken

Trying to use the simple_api_reader throws the following error.

TSError: Slice failed processing, caused by TSError: client.search is not a function

Test and Yarn errors

  • yarn upgrade-interactive --latest is broken; it fails when run and only updates some of the dependencies before failing
  • the istanbul reporter for all tests is broken, which I believe happened after splitting out the apis to /packages

elasticsearch reader with persist lifecycle creates too many slices when no data is incoming to the index being read

NOTE - using es_asset version 1.7.0

When using the elasticsearch_reader with a persistent lifecycle and reading from an index that gets its data in chunks, the slicer seems to create too many slices for the workers, or gets stuck creating empty slices when the incoming data pauses.

In this case the index being read from gets data around every 10 minutes, and the slicer queue quickly hit 10,000 slices. I upped the workers to 30, then 50, then 100, and the queue never got lower than 4,800 slices. The job ran for 45 minutes or so, processed over 200k empty slices, and never processed any data.

Previously I had run a similar job that read from an index with a constant stream of data and had no issues processing data, using job settings similar to those below; the only changes were the index-specific fields and a 5s interval.

Originally I started the job with a 5s interval but changed it to 60s once I saw how fast the slice queue was building up.

elasticsearch_reader settings (sensitive info replaced with CAPS):

"name": "JOB_NAME",
    "lifecycle": "persistent",
    "workers": 3,
    "assets": [
       ASSET_NAMES
    ],
    "targets": [
        {
            "key": "TARGET",
            "value": "VALUE"
        }
    ],
    "memory": 4294967296,
...
{
            "_op": "elasticsearch_reader",
            "connection": "CLUSTER_CONNECTION",
            "index": "INDEX_NAME",
            "type": "ES_TYPE",
            "query": "_exists_:FIELD_NAME",
            "date_field_name": "DATE_FIELD",
            "start": "2019-11-22T15:15:10.395Z",
            "interval": "60s",
            "delay": "1m",
            "time_resolution": "ms",
            "fields": [
                FIELD_NAMES
            ],
            "size": 100000
        },
...

`retry_on_conflict` does not work with ES 5

Using the v1.6.1 asset with update_retry_on_conflict set in the opConfig results in the following error for jobs writing to ES 5:

[illegal_argument_exception] Action/metadata line [1] contains an unknown parameter [retry_on_conflict]

Consolidate redundant boolean options on elasticsearch_index_selector into multi-value option

There are options on the elasticsearch_index_selector operator that I think are mutually exclusive; these include:

  • delete
  • upsert
  • create
  • update

Could these be replaced with a single operation option that allows the user to select one of the values?

https://github.com/terascope/teraslice/blob/master/docs/reference.md#elasticsearch_index_selector
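
A hypothetical consolidated config might look like the sketch below, where a single operation option replaces the four booleans (the option name and values are illustrative only):

{
    "_op": "elasticsearch_index_selector",
    "index": "INDEX_NAME",
    "operation": "upsert"
}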

@godber commented on May 3, 2017:
This doesn't really matter much but might resolve some ambiguities in the code.

@kstaken commented on May 4, 2017:
Yes, this would probably help.

@kstaken added the priority:medium label on Aug 23, 2017.

@kstaken commented on Aug 23, 2017:
This is a big breaking change but gets into the over-complication of this particular processor. The processor likely needs to be refactored and broken up to reduce this complexity. The existing processor can be deprecated but kept around until all jobs have been converted.

Add support for `fields` to simple_api_reader

The elasticsearch_reader supports the fields option, but this doesn't appear to be implemented in the simple_api_reader. The Teraserver search api does support fields, so we should add support for this to the reader.

elasticsearch_sender_api schema blocks use of the routed_sender with elasticsearch_sender_api

When using the routed sender with the elasticsearch_sender_api I get this error:

Could not find elasticsearch endpoint configuration for connection default

The cause looks to be this schema line here.

Job set up:

"apis": [
        {
            "_name": "elasticsearch_sender_api",
            "index": "INDEX_NAME",
            "type": "_doc",
            "upsert": true,
            "script": "script stuff....",
            "script_params": {
                   params here....
            },
            "size": 10000
        }
    ],
...
"operations":
...
{
            "_op": "date_router",
            "field": "DATE_FIELD",
            "resolution": "monthly",
            "field_delimiter": ".",
            "include_date_units": false
        },
        {
            "_op": "routed_sender",
            "api_name": "elasticsearch_sender_api",
            "routing": {
            "**": "CONNECTION_NAME"
            }
        }

The schema is ensuring there is a connection in the config, but the connection for this job is in the doc metadata. Could the schema check whether the routed_sender is present when there is no connection config?

api reader does not combine dates with provided queries correctly

If the end user provides a query with OR clauses, the API reader will not correctly combine the date range using AND.

User provides:

"query": "field:value1 OR field:value2"

The reader will add a date range that looks like this, which results in incorrect data being returned:

date:["..." TO "..."] AND field:value1 OR field:value2

In order to evaluate correctly it needs to be:

date:["..." TO "..."] AND (field:value1 OR field:value2)

elasticsearch date reader slicer can't slice data with gaps in the time field

This could be related to issue #11.

It seems like the issue appears when there is a significant time period with no data then data again.

details:

  • job:
    • assets: elasticsearch: 1.6.1
    • 10 workers
    • 1 slicer
    • operations: elasticsearch_reader -> noop

When reading data from an index, 2 slices failed with the error: Elasticsearch Error: [query_phase_execution_exception] Result window is too large

The data itself has second resolution with milliseconds in the timestamp, but the milliseconds are always 000. The slicer is millisecond resolution.

2 slices failed during the initial job:

  • slice1: {"start":"2019-05-22T00:00:34+00:00","end":"2019-05-27T00:03:11+00:00","count":2293241}
  • slice2: {"start":"2019-05-14T00:02:50+00:00","end":"2019-05-17T00:05:39+00:00","count":3873168}

When I broke up the slices into smaller groups, the jobs ran without issues.

I moved the start date in the job based on the slice error start until the job didn't error out anymore.

Slice 1 succeeded with start: 2019-05-22T00:00:34 and end: 2019-05-26T22:34:15, then start: 2019-05-26T22:34:15, end: 2019-05-27T00:03:11

Slice 2 succeeded with start: 2019-05-14T00:02:50 , end: 2019-05-16T23:12:16 then start:2019-05-16T23:12:16 and end:2019-05-17T00:05:39

Index searches for slice 1 showed that a search between 2019-05-22T00:00:34 and 2019-05-26T22:34:15 returns 0 results, while a search between 2019-05-26T22:34:15 and 2019-05-27T00:03:11.000 returns 2,293,241 docs, the same record count as slice 1.

Index searches for slice 2 showed that a search with date_field:>2019-05-14T00:02:50.000 AND date_field:<2019-05-16T23:12:16.000 returns a count of 0 docs, while searching date_field:>2019-05-16T23:12:16 AND date_field:<2019-05-17T00:05:39.000 returns 3,873,168 docs, the same record count as slice 2.

Removing the time periods of 0 results resulted in the test jobs finishing with no issues.

api doc updates

  • need better examples of using the apis, their configuration, and how they are used in a processor
  • remove full_response from docs

preserve_id on ES readers

This is to work around the issue with elasticsearch readers and full_response returning an object that forces every downstream processor to have to deal with it. full_response should probably be marked deprecated.

preserve_id should simply copy the document id into an _id field on each record.

Long term we'll want to deal with this more formally, but for now we need to stop the ugliness caused by full_response.
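
A minimal sketch of the intended behavior, assuming the reader has access to the raw search hits (not the asset's actual code):

// Illustrative only: copy each search hit's document id onto the record as `_id`.
interface SearchHit {
    _id: string;
    _source: Record<string, unknown>;
}

function preserveId(hits: SearchHit[]): Record<string, unknown>[] {
    return hits.map((hit) => ({ ...hit._source, _id: hit._id }));
}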

id slicer will not work with elasticsearch 6

{jsnoble}
They are deprecating how _uid operates in this version, and it is starting to use an _id search instead, which only allows simple queries like term searches. This will also affect our elasticsearch_reader if we use the capability to further split a time segment.

docs:
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/mapping-uid-field.html
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/mapping-id-field.html

For that last one, pay attention to the note posted towards the top of the page and the paragraph just past it.

{kstaken}
This appears to be the root of the problem: elastic/elasticsearch#18154

And where they decided to not worry about what this breaks: elastic/elasticsearch#24247

Just so that we maintain an option for working in this regard, we may need to look at sliced scroll queries and evaluate just how much risk there is from the state held server-side on active indices. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

Also if this were applied as a subslicing technique for further division of an indivisible date range query, the impact of the state held server side would be relatively small and not held for the entire duration of a reader operation. It would only have to be held for the duration of the subslicing.

In cases where an index is static, using the scroll API should be perfectly fine, so it may still give us a reasonable set of options for when range slicing doesn't work.
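
For reference, a sliced scroll splits a single scroll across workers by adding a slice object to the search body; a minimal example request (endpoint, index name, and slice count are placeholders):

curl -Ss "cluster/some-index/_search?scroll=1m" -XPOST -H 'content-type: application/json' -d '{"slice": {"id": 0, "max": 2}, "size": 1000, "query": {"match_all": {}}}'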

spaces_reader times out with 10 or more slicers

When using the spaces_reader in a teraslice job with the slicers set to 10, the job failed with this error:

TSError: slicer for ex c882e2aa-437b-4385-85a5-7511b5a02903 had an error, shutting down execution, caused by TSError2: HTTP request timed out connecting to API endpoint.\n

Changing the number of slicers to 1 or 5 worked fine; the error only appeared with 10 or more slicers.

job config:

{
    "name": "test#spaces_reader",
    "lifecycle": "once",
    "workers": 20,
    "slicers": 10,
    "assets": [
        "elasticsearch:2.7.2",
        "kafka:3.2.2"
    ],
    "memory": 3221225472,
    "cpu": 2,
    "operations": [
        {
            "_op": "spaces_reader",
            "endpoint": "REDACTED",
            "index": "REDACTED",
            "query": "REDACTED"
            "date_field_name": "REDACTED",
            "token": "REDACTED",
            "fields": [
                "REDACTED",
                "REDACTED",
                "REDACTED",
                "REDACTED"
            ],
            "size": 100000
        },
        {
            "_op": "kafka_sender",
            "connection": "REDACTED",
            "topic": "REDACTED",
            "size": 100000
        }
    ]
}

Changing the job to use the elasticsearch_reader instead of the spaces_reader worked fine with 10 slicers.

We discussed the possibility that the client timeout setting is shorter than the elasticsearch cluster's timeout setting, or that the slicers were all being handled by one node instead of a node per slicer.

elasticsearch-assets index_selector should work with elasticsearch v7

removal of types:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/removal-of-types.html

Start to discuss the changes needed to work with elasticsearch version 7.

elasticsearch_reader_api configuration overrides id_reader/ elasticsearch_reader configs

In this job setup:

"name": "test-job",
"lifecycle": "once",
"workers": 1,
"assets": [
        "elasticsearch:2.1.0"
    ],
"apis":[
  {
    "_name": "elasticsearch_reader_api",
     "connection": "homeslice",
    "index": "test-index",
    "field": "_key",
    "query": "key:someKey",
    "size": 10000
  }
],
    "operations": [
        {
            "_op": "id_reader",
            "connection": "otherConnection",
            "index": "second-test-index",
            "field": "_key",
            "size": 10000
        },
        {
            "_op": "custom-api-reader-op",
            "api_name": "elasticsearch_reader_api"
        }
    ]

The settings for the elasticsearch_reader_api in the apis property override the id_reader configs. It seems like the elasticsearch_reader_api should have to be called in the _op config for the api configs to be applied.

`id_reader` Slicer has problems with records containing `-` in the `id_field`

I am testing with the v1.8.0 asset bundle and ES 6.8.1. For testing, I added an id field to 2M records using characters from the reader's base64url character list and set the id_field with the elasticsearch_index_selector. Then (after spot checking a handful of records), I just used the id_reader to copy that dataset to another index in the same cluster. When setting id_field in the elasticsearch_index_selector on the copy job, I ended up with 2M records and ~330k deleted docs in the new index, and without setting that, I ended up with about 550k extra records in the new index.

Comparing the records between indices and looking at the slices, I noticed two distinct problems related to having at least one - present in the id field:

  1. Slice queries will match characters immediately following a - in the id field. In this test case, I set the slice size to 10k, so it ended up setting each slice's query depth to two characters from the base64url list. So, a record that included something like "id": "A7-ib-r0JR2Yc7vbRCxcz" would end up being included in the three slices with the queries a7*, ib*, and r0*. Also to note here is that the queries weren't case-sensitive (all slice queries had lower case letters), which might be another issue if that's not the intended behavior.

  2. Looking at each query in every slice, none of them contained a -. As mentioned above, the slices only contained lower case letters, so the only characters present in the queries were a-z, 0-9, and _. The interesting thing here is that I think the id_reader likely did end up picking up all of the records from my test dataset due to the first issue, but I'd have to check all of the record ids to validate that hunch. The only records that wouldn't end up being included here would be records that do not have at least two consecutive non-`-` characters in the id field.

id_reader with elasticsearch_reader_api requires "field" params in 2 places

The elasticsearch_reader_api requires the field parameter, and the _op also requires the field parameter.

example:

"apis": [
        {
            "_name": "elasticsearch_reader_api",
            "connection": "homeslice",
            "index": "local-index",
            "type": "_doc",
            "field": "_key",
            "size": 1000
        }
    ],
    "operations": [
        {
            "_op": "id_reader"
            "field": "_key
        },

The reason is this check here: https://github.com/terascope/elasticsearch-assets/blob/master/asset/src/id_reader/slicer.ts#L23

If I remove the check from the slicer and remove field from the _op parameters, the job runs fine. It seems like the check should be on the api config, if it's not there already.

job errors get lost or ignored in v2.4.0, v2.2.1

Found two examples that seemed similar where job errors are being ignored or lost somewhere.

error example with incorrect slice size setting:

with version 1.9.1:

config:

"assets": [
        "elasticsearch:1.9.1"
    ],
    "operations": [
        {
            "_op": "elasticsearch_reader",
            "connection": "CONNECTION",
            "index": "INDEX",
            "type": "TYPE",
            "date_field_name": "date",
            "size": 20000
        },
  • ending job status: failed
  • errors: TSError: Elasticsearch Error: [query_phase_execution_exception] Result window is too large, from + size must be less than or equal to: [10000] but was [11305]

which makes sense because the max_result_window is set at 10,000 and slice size is 20,000.

same settings with version 2.4.0:

config:

 "assets": [
        "elasticsearch:2.4.0",
    ],
    "operations": [
        {
            "_op": "elasticsearch_reader",
            "connection": "CONNECTION",
            "index": "INDEX",
            "type": "TYPE",
            "date_field_name": "date",
            "size": 20000
        },
  • ending job status: completed
  • errors: none

But it only reads 10,000 docs from the index and then marks the job as completed. So there is still an issue with the slice size being bigger than max_result_window, but the error is getting lost somewhere.

error example with incorrect lifecycle setting:

with version v1.9.1

config:

 "name": "JOBNAME",
    "lifecycle": "persistent",
    "workers": 1,
    "assets": [
        "elasticsearch:1.9.1"
    ],

returns:

 Invalid interval parameter, must be manually set while job is in persistent mode

with v2.4.0

config:

"name": "JOBNAME",
    "lifecycle": "persistent",
    "workers": 1,
    "assets": [
        "elasticsearch:2.4.0",
    ],

No error is returned, and when the job is started it stays in the initializing status. This seems similar to the above example where the error is getting lost.
