
uptasticsearch


Introduction

uptasticsearch tackles the issue of getting data out of Elasticsearch and into a tabular format in R and Python. It should work for all versions of Elasticsearch from 1.0.0 onwards, but is not regularly tested against all of them. If you run into a problem, please open an issue.

How it Works

The core functionality of this package is the es_search() function. This returns a data.table containing the parsed result of any given query. Note that this includes aggs queries.

Installation

R

Lifecycle: Maturing

Releases of this package can be installed from CRAN:

install.packages(
  'uptasticsearch'
  , repos = "http://cran.rstudio.com"
)

or from conda-forge

conda install -c conda-forge r-uptasticsearch

To use the development version of the package, which has the newest changes, you can install directly from GitHub

remotes::install_github(
  "uptake/uptasticsearch"
  , subdir = "r-pkg"
)

Python

Lifecycle: Dormant

This package is not currently available on PyPI. To build the development version from source, clone this repo, then:

cd py-pkg
pip install .

Usage Examples

The examples presented here pertain to a fictional Elasticsearch index holding some information on a movie theater business.

Example 1: Get a Batch of Documents

The most common use case for this package will be the case where you have an Elasticsearch query and want to get a data frame representation of many resulting documents.

In the example below, we use uptasticsearch to look for all survey results in which customers said their satisfaction was "low" or "very low" and mentioned food in their comments.

library(uptasticsearch)

# Build your query in an R string
qbody <- '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "customer_comments"
              }
            },
            {
              "terms": {
                "overall_satisfaction": ["very low", "low"]
              }
            }
          ]
        }
      }
    },
    "query": {
      "match_phrase": {
        "customer_comments": "food"
      }
    }
  }
}'

# Execute the query, parse into a data.table
commentDT <- es_search(
    es_host = 'http://mydb.mycompany.com:9200'
    , es_index = "survey_results"
    , query_body = qbody
    , scroll = "1m"
    , n_cores = 4
)

Example 2: Aggregation Results

Elasticsearch ships with a rich set of aggregations for creating summarized views of your data. uptasticsearch has built-in support for these aggregations.

In the example below, we use uptasticsearch to create daily timeseries of summary statistics like total revenue and average payment amount.

library(uptasticsearch)

# Build your query in an R string
qbody <- '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "pmt_amount"
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "timestamp": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "revenue": {
          "extended_stats": {
            "field": "pmt_amount"
          }
        }
      }
    }
  },
  "size": 0
}'

# Execute the query, parse result into a data.table
revenueDT <- es_search(
    es_host = 'http://mydb.mycompany.com:9200'
    , es_index = "transactions"
    , size = 1000
    , query_body = qbody
    , n_cores = 1
)

In the example above, we used the date_histogram and extended_stats aggregations. es_search() has built-in support for many other aggregations and combinations of aggregations, with more on the way. Please see the table below for the current status of the package. Note that names of the form "agg1 - agg2" refer to the ability to handle aggregations nested inside other aggregations.

Agg type R support? Python support?
"cardinality" YES NO
"date_histogram" YES NO
date_histogram - cardinality YES NO
date_histogram - extended_stats YES NO
date_histogram - histogram YES NO
date_histogram - percentiles YES NO
date_histogram - significant_terms YES NO
date_histogram - stats YES NO
date_histogram - terms YES NO
"extended_stats" YES NO
"histogram" YES NO
"percentiles" YES NO
"significant terms" YES NO
"stats" YES NO
"terms" YES NO
terms - cardinality YES NO
terms - date_histogram YES NO
terms - date_histogram - cardinality YES NO
terms - date_histogram - extended_stats YES NO
terms - date_histogram - histogram YES NO
terms - date_histogram - percentiles YES NO
terms - date_histogram - significant_terms YES NO
terms - date_histogram - stats YES NO
terms - date_histogram - terms YES NO
terms - extended_stats YES NO
terms - histogram YES NO
terms - percentiles YES NO
terms - significant_terms YES NO
terms - stats YES NO
terms - terms YES NO

uptasticsearch's People

Contributors

austin3dickey, bernardbeckerman, chrsblck, csyhuang, dmaynard51, drkarthi, erichall87, erichalluptake, falconred, gwengww, irfansener, jameslamb, jmcelve2, jzeph, kszela24, mattbsg, mfrasco, mohneet, ngparas, skirmer, staftermath, terrytangyuan, varadpoddar, yashgoswami-infy


uptasticsearch's Issues

unimplemented agg: terms - date_histogram

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: "cardinality"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Windows support in Travis

Travis recently added Windows support. Since many R users are also Windows users and Windows compatibility is necessary for packages to make it to CRAN, it would be valuable to test on Windows at CI time.

For anyone addressing this issue...you do not need to replicate ALL of our current Linux builds with Windows. One R build and one Python build (both with ES 6.2.x) would be sufficient.

unimplemented agg: "terms"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: date_histogram - terms

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: "extended_stats"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Long parsing time

Scrolling the data takes minutes (fine), but parsing takes tens of minutes (very strange) for 700-800K documents.

I tried keeping only plain (not nested) fields, about 4-5 fields per document.

In at least one case the same request hit this slow path, but after updating the library it was processed within minutes (fine).

I can't work out what data to include here to reproduce the issue. Any thoughts?

I am using the latest version of uptasticsearch and the strange behavior still occurs.

Is this known behavior?
How can I avoid it?

Unify CI for Python and R

The Python code now duplicates some of the integration testing and isn't set up with Travis. Per @jameslamb, here's an example of testing R and Python packages in different sub-builds within one Travis build using their nifty matrix thing: https://github.com/dmlc/xgboost/blob/master/.travis.yml#L15

The redundant files are now under py-pkg/dummy_data, and the logic that needs to be moved to travis is in py-pkg/Makefile and py-pkg/docker-compose.yml

unimplemented agg: "date_histogram"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

CI stuff should make hostname configurable

Right now in both the Python and R packages, host names to hit are hard-coded into the tests. We should consider making ES_HOST configurable as an environment variable and have the tests pick that up.
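
The idea above can be sketched in a few lines of R. This is a hypothetical helper (the function name and fallback host are assumptions, not current package code): read the ES_HOST environment variable, falling back to a local default when it is unset.

```r
# Hypothetical helper (not in the package today): resolve the test
# cluster's host from the ES_HOST environment variable, falling back
# to a local default when the variable is unset or empty.
.get_es_host <- function() {
    host <- Sys.getenv("ES_HOST", unset = "")
    if (identical(host, "")) {
        host <- "http://localhost:9200"
    }
    return(host)
}
```

Tests would then call .get_es_host() instead of hard-coding a hostname, so CI could point the suite at any cluster by exporting ES_HOST.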

unimplemented agg: terms - cardinality

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: "percentiles"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Python package

After much discussion internally at Uptake, we have decided to add the Python version of uptasticsearch to this project!!!

Initial plan is to host the Python package here and provide documentation on how to install it from source. Whether or not we eventually push to PyPi has not been decided.

In anticipation of this, I've created R and Python labels for issues in this project.

This issue can be closed once py-pkg is created.

Travis might be broken

(screenshot of a failed Travis build)

This could be related to the changes in #85 . I overrode some things that were previously being handled automagically by Travis. I wonder if that includes manually installing devtools? Still weird that #85 built though...

unimplemented agg: date_histogram - histogram

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: "significant terms"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Move R into r-pkg directory

Now that the python project is released under the py-pkg directory, we should move the R into r-pkg for consistency

unimplemented agg: date_histogram - percentiles

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Documentation should use inheritParams for shared arguments like 'es_host'

Roxygen's @inheritParams tools allows you to import parameter documentation from one function into another. Some arguments (like es_host and es_index) are re-used throughout uptasticsearch. Right now documentation for those parameters is duplicated across functions, but we could guarantee consistency across functions by centralizing documentation for those parameters in a null object decorated with @inheritParams
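The pattern described above can be sketched roughly as follows (the object name and parameter wording here are illustrative, not the package's actual documentation): document the shared parameters once on a NULL object, then pull them into each function with @inheritParams.

```r
#' Shared parameter docs (illustrative, not the package's actual text)
#'
#' @name doc_shared
#' @param es_host A string identifying an Elasticsearch host, e.g.
#'   "http://mydb.mycompany.com:9200"
#' @param es_index The name of an Elasticsearch index to be queried
NULL

#' Execute a query and return a data.table
#'
#' @inheritParams doc_shared
#' @export
es_search <- function(es_host, es_index) {
    # docs for es_host and es_index are inherited from doc_shared,
    # so their wording stays consistent across all exported functions
}
```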

unimplemented agg: date_histogram - cardinality

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

no support for reverse_nested aggregations

Per discussion in #58 ... es_search does not currently support reverse_nested aggregations. Well, technically it might support them but we have no tests around that.

To be honest I have no idea how those aggregations work, but would love for someone knowledgeable to take a run at this issue. Reference on them:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-reverse-nested-aggregation.html

Requirements for closing this issue:

  1. Add code to our integration tests that creates a structure in our test ES cluster that even allows this type of query (might require figuring out which versions of Elasticsearch actually support reverse_nested stuff)
  2. Create tests that confirm expected results for queries with reverse_nested stuff in them

performance benchmarks

The changes introduced in #51 (thanks again @wdearden !) did not impact any user code or change the algorithmic correctness of uptasticsearch. They did, however, substantially improve the speed and efficiency of the package.

Verifying that this change actually did what it said it would was tricky. We had to try the PR branch manually on our own machines, against other instances of ES, and use system.time() to check the speed.

Would love if someone would take a shot at adding performance benchmarks to our tests for CI! I would like to test the following:

  • peak memory usage
  • total execution time

This will be pretty difficult, I think, because of this library's reliance on connecting to a separate service over a network connection. I envision these tests being limited to the processing of data once it is returned from the server.

unimplemented agg: date_histogram - stats

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: "histogram"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Python aggs support docs missing in README

Now that Python's released, we need to update the aggs support table in README.md.

Also, es_search will throw NotImplementedError for aggs right now, so this feature needs to be integrated.

unimplemented agg: date_histogram - extended_stats

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

es_search() : aggregation query crashes if empty bins

I'm having issues getting aggregate searches working. I use the exact aggregate query within the elastic site search and get results fine, but when I run it via the es_search() function, I get an error about missing data:

Error in log_fatal(msg) :
The column given to unpack_nested_data had no data in it.

I understand that there are empty buckets in the aggregation that are affecting the "unpack_nested_data" function. However, this isn't a problem when the aggregation results are written to a file and then read back in and parsed with "chomp_aggs".

This project does not yet have integration tests.

This is a client for a database, but uptasticsearch currently only has unit tests. This project should have integration tests which use uptasticsearch functions on an actual Elasticsearch index.

handling of versioning

@ngparas comments on #73

since there are also minor changes in scrolling between the versions i'd love if we could pull all the version-specific stuff up higher and configure w/ passing functions instead of putting internal switches like this

IMHO the best way to handle this would be to use an internal R6 object that just holds all of the version specific stuff and have that object get passed around to methods that need it (all internal). But idk...open to suggestions
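As a rough illustration of that R6 idea (an assumption about the design, not actual package code), the class name and method below are invented; the object resolves version-specific details once and is then passed to whatever internal function needs them:

```r
# Illustrative only: an internal R6 object holding version-specific
# behavior, so internal functions just ask it for what they need
# instead of branching on the version themselves.
library(R6)

EsVersionConfig <- R6::R6Class(
    "EsVersionConfig"
    , public = list(
        major_version = NULL
        , initialize = function(major_version) {
            self$major_version <- major_version
        }
        # one example of a version-specific detail: the shape of the
        # request body used to continue a scroll changed across versions
        , scroll_body = function(scroll_id, scroll = "5m") {
            if (self$major_version >= 2) {
                return(sprintf('{"scroll": "%s", "scroll_id": "%s"}', scroll, scroll_id))
            }
            return(scroll_id)
        }
    )
)
```

es_search() would build one of these after hitting the cluster's version endpoint, then hand it to the internal scroll/parse functions.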

pkgdown site is broken

As of #87 , our pkgdown docs are no longer in a docs/ folder at the repo root. That means Github-pages can't pick them up!

Is it time for 1.0.0?

Should the next release we do (whenever that happens) be v1.0???

  • v0.3.0 just went out and is now on CRAN.
  • by the time we do our next release, this library will be more than 1 year old
  • the library has a very limited scope and I see no added functionality in the future except auth

Would love to hear your thoughts @wdearden @jayqi @austin3dickey @ngparas @skirmer and any others finding their way to this issue

Support for Auth

Today, uptasticsearch only works when you have the ability to directly query the cluster. If your cluster has some authentication / authorization set up on it (e.g. Shield), this library will not work.

I think we need to add support for auth-enabled 5.x and 6.x clusters. As of 5.x, you can use X-pack to enable security features for Elasticsearch.

I do NOT think we should put any time or effort into supporting Shield on earlier versions of ES, but open to discussion if some users say they have a need for it.

unimplemented agg: terms - date_histogram - cardinality

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Python package should have Sphinx infra set up

The Python package uses Sphinx docstrings and we declare various sphinx packages in setup.py, but none of the actual infra (like a conf.py) to run the docs is set up.

Closing this issue involves walking through sphinx-quickstart.

appveyor testing is not set up

Looking for someone to figure out how to configure appveyor testing for this repo. I've added the project to the UptakeOpenSource account with appveyor, just need someone to figure out how to create the .appveyor.yml that will run our tests

Remove "old = names(DT)" from calls to data.table::setnames

A few times in our codebase, we have

data.table::setnames(someDT, old = names(someDT), new = new_names)

This is actually less safe than the following call with unnamed arguments:

data.table::setnames(someDT, new_names)

because the former will break when someDT has duplicate column names. It's not this package's job to break if there are duplicate names in a user's data.
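The failure mode can be demonstrated directly with a small standalone example (not package code):

```r
# Demonstration: with duplicate column names, the named-argument form
# of setnames() fails, while the positional form renames by position.
library(data.table)

# data.table permits duplicate column names (check.names defaults to FALSE)
someDT <- data.table(a = 1:3, a = 4:6)

# named form: old = names(someDT) contains duplicates, so this errors
res <- try(
    setnames(someDT, old = names(someDT), new = c("left", "right"))
    , silent = TRUE
)
print(inherits(res, "try-error"))

# positional form: renames by position, duplicates and all
setnames(someDT, c("left", "right"))
print(names(someDT))
```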

unimplemented agg: date_histogram - significant_terms

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

es_search should reject NULL passed to index

I accidentally did this tonight, and got unexpected behavior. With a NULL for index, your URL will look like http://mycluster.whatever:9200//_search, which is totally valid for Elasticsearch and means "search all indexes".

I propose that the case where you are explicitly passing a NULL is almost CERTAINLY a mistake and we should err on the side of caution and break when that happens. Searching over all indexes is a valid use case (e.g. for cases where you have data with the same mapping stored in different indexes for different time periods), but we should only support it by explicitly passing _all.

Thoughts @ngparas @austin3dickey @mfrasco ?
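A minimal sketch of the proposed guard (the helper name is hypothetical, not current package code):

```r
# Hypothetical validation: refuse NULL outright, and require an explicit
# "_all" from callers who really do want to search every index.
.validate_es_index <- function(es_index) {
    if (is.null(es_index)) {
        stop(
            "es_index cannot be NULL. If you really want to search across "
            , "all indexes, explicitly pass es_index = '_all'."
        )
    }
    return(invisible(es_index))
}
```

es_search() would call this at the top, so the accidental-NULL case fails loudly before any URL is built.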

unimplemented agg: "stats"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

`unpack_nested_data` is slower than `tidyr::unnest`

From what I can see, unpack_nested_data is an order of magnitude slower than tidyr::unnest.

library(tidyverse)
library(microbenchmark)
library(uptasticsearch)
library(data.table)

n <- 1000

nested_df <-
  tibble(
    x = 1:n,
    y = rep(list(c("a", "b", "c")), n)
  )

nested_dt <- as.data.table(nested_df)

microbenchmark(
  unnest(nested_dt),
  unnest(nested_df),
  unpack_nested_data(nested_dt, col_to_unpack = "y"),
  times = 100
)

Since unpack_nested_data has the same functionality as tidyr::unnest but is limited to data.tables and to unnesting a single column, it should be faster than tidyr::unnest in those cases.

Most of the computational cost is in the line lapply(listDT, data.table::as.data.table). I wrote a version of the function which uses the basic idea of tidyr::unnest without the nonstandard evaluation and without the dplyr verbs (but still with purrr, since that's already an imported package).

unpack_nested_data <- function(chomped_df, col_to_unpack) {

    # input checks
    if (!("data.table" %in% class(chomped_df))) {
        msg <- "For unpack_nested_data, chomped_df must be a data.table"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    if (".id" %in% names(chomped_df)) {
        msg <- "For unpack_nested_data, chomped_df cannot have a column named '.id'"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) != 1) {
        msg <- "For unpack_nested_data, col_to_unpack must be a character of length 1"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    if (!(col_to_unpack %in% names(chomped_df))) {
        msg <- "For unpack_nested_data, col_to_unpack must be one of the column names"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }

    outDT <- data.table::copy(chomped_df)
    listDT <- outDT[[col_to_unpack]]

    # the nested column must hold all data frames or all atomic vectors
    is_df <- purrr::map_lgl(listDT, is.data.frame)
    is_atomic <- purrr::map_lgl(listDT, purrr::is_atomic)
    if (all(is_df)) {
        newDT <- data.table::rbindlist(listDT, fill = TRUE)
    } else if (all(is_atomic)) {
        newDT <- data.table::as.data.table(unlist(listDT))
    } else {
        msg <- "For unpack_nested_data, col_to_unpack must be all atomic vectors or all data frames"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }

    if (nrow(newDT) == 0) {
        msg <- "The column given to unpack_nested_data had no data in it."
        futile.logger::flog.fatal(msg)
        stop(msg)
    }

    # repeat the other columns once per element of the unpacked column
    group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
    n <- purrr::map_int(listDT, NROW)
    rest <- chomped_df[rep(seq_len(nrow(chomped_df)), n), ..group_vars]
    outDT <- data.table::data.table(newDT, rest)

    # unlist() on atomic vectors yields a column named V1; restore the original name
    if ("V1" %in% names(outDT)) {
        data.table::setnames(outDT, "V1", col_to_unpack)
    }
    return(outDT)
}

This is about 2.5x faster than tidyr::unnest in the example above. I can submit a pull request if this makes sense. It works but it doesn't match all of the edge cases in the tests yet.
