
uptasticsearch


Introduction

uptasticsearch tackles the issue of getting data out of Elasticsearch and into a tabular format in R and Python. It should work for all versions of Elasticsearch from 1.0.0 onwards, but is not regularly tested against all of them. If you run into a problem, please open an issue.

How it Works

The core functionality of this package is the es_search() function. This returns a data.table containing the parsed result of any given query. Note that this includes aggs queries.

Installation

R

Lifecycle: Maturing

Releases of this package can be installed from CRAN:

install.packages(
  'uptasticsearch'
  , repos = "http://cran.rstudio.com"
)

or from conda-forge

conda install -c conda-forge r-uptasticsearch

To use the development version of the package, which has the newest changes, you can install directly from GitHub

remotes::install_github(
  "uptake/uptasticsearch"
  , subdir = "r-pkg"
)

Python

Lifecycle: Dormant

This package is not currently available on PyPI. To build the development version from source, clone this repo, then:

cd py-pkg
pip install .

Usage Examples

The examples presented here pertain to a fictional Elasticsearch index holding some information on a movie theater business.

Example 1: Get a Batch of Documents

The most common use case for this package will be the case where you have an Elasticsearch query and want to get a data frame representation of many resulting documents.

In the example below, we use uptasticsearch to look for all survey results in which customers said their satisfaction was "low" or "very low" and mentioned food in their comments.

library(uptasticsearch)

# Build your query in an R string
qbody <- '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "customer_comments"
              }
            },
            {
              "terms": {
                "overall_satisfaction": ["very low", "low"]
              }
            }
          ]
        }
      }
    },
    "query": {
      "match_phrase": {
        "customer_comments": "food"
      }
    }
  }
}'

# Execute the query, parse into a data.table
commentDT <- es_search(
    es_host = 'http://mydb.mycompany.com:9200'
    , es_index = "survey_results"
    , query_body = qbody
    , scroll = "1m"
    , n_cores = 4
)

Example 2: Aggregation Results

Elasticsearch ships with a rich set of aggregations for creating summarized views of your data. uptasticsearch has built-in support for these aggregations.

In the example below, we use uptasticsearch to create daily timeseries of summary statistics like total revenue and average payment amount.

library(uptasticsearch)

# Build your query in an R string
qbody <- '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "pmt_amount"
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "timestamp": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "revenue": {
          "extended_stats": {
            "field": "pmt_amount"
          }
        }
      }
    }
  },
  "size": 0
}'

# Execute the query, parse result into a data.table
revenueDT <- es_search(
    es_host = 'http://mydb.mycompany.com:9200'
    , es_index = "transactions"
    , size = 1000
    , query_body = qbody
    , n_cores = 1
)

In the example above, we used the date_histogram and extended_stats aggregations. es_search() has built-in support for many other aggregations and combinations of aggregations, with more on the way. Please see the table below for the current status of the package. Note that names of the form "agg1 - agg2" refer to the ability to handle aggregations nested inside other aggregations.

Agg type R support? Python support?
"cardinality" YES NO
"date_histogram" YES NO
date_histogram - cardinality YES NO
date_histogram - extended_stats YES NO
date_histogram - histogram YES NO
date_histogram - percentiles YES NO
date_histogram - significant_terms YES NO
date_histogram - stats YES NO
date_histogram - terms YES NO
"extended_stats" YES NO
"histogram" YES NO
"percentiles" YES NO
"significant terms" YES NO
"stats" YES NO
"terms" YES NO
terms - cardinality YES NO
terms - date_histogram YES NO
terms - date_histogram - cardinality YES NO
terms - date_histogram - extended_stats YES NO
terms - date_histogram - histogram YES NO
terms - date_histogram - percentiles YES NO
terms - date_histogram - significant_terms YES NO
terms - date_histogram - stats YES NO
terms - date_histogram - terms YES NO
terms - extended_stats YES NO
terms - histogram YES NO
terms - percentiles YES NO
terms - significant_terms YES NO
terms - stats YES NO
terms - terms YES NO

uptasticsearch's People

Contributors

austin3dickey, bernardbeckerman, chrsblck, csyhuang, dmaynard51, drkarthi, erichall87, erichalluptake, falconred, gwengww, irfansener, jameslamb, jmcelve2, jzeph, kszela24, mattbsg, mfrasco, mohneet, ngparas, skirmer, staftermath, terrytangyuan, varadpoddar, yashgoswami-infy


uptasticsearch's Issues

unimplemented agg: terms - date_histogram

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: "cardinality"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Windows support in Travis

Travis recently added Windows support. Since many R users are also Windows users and Windows compatibility is necessary for packages to make it to CRAN, it would be valuable to test on Windows at CI time.

For anyone addressing this issue...you do not need to replicate ALL of our current Linux builds with Windows. One R build and one Python build (both with ES 6.2.x) would be sufficient.

unimplemented agg: "terms"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: date_histogram - terms

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: "extended_stats"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Long parsing time

Scrolling the data takes minutes (fine), but parsing takes tens of minutes (very strange) for 700-800K documents.

I tried keeping only plain (not nested) fields, about 4-5 fields per document.

In at least one case the same request hit this slow path, but after updating the library it was processed within minutes (fine).

I can't work out what data to include here to reproduce the issue. Any thoughts?

I am using the latest version of uptasticsearch and the strange behavior still occurs.

Is this known behavior?
How can I avoid it?

Unify CI for Python and R

The Python code now duplicates some of the integration testing and isn't set up with Travis. Per @jameslamb, here's an example of testing R and Python packages in different sub-builds within one Travis build using their nifty matrix thing: https://github.com/dmlc/xgboost/blob/master/.travis.yml#L15

The redundant files are now under py-pkg/dummy_data, and the logic that needs to be moved to travis is in py-pkg/Makefile and py-pkg/docker-compose.yml

unimplemented agg: "date_histogram"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

CI stuff should make hostname configurable

Right now in both the Python and R packages, host names to hit are hard-coded into the tests. We should consider making ES_HOST configurable as an environment variable and have the tests pick that up.
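
The idea above can be sketched in a few lines of R. This is a hypothetical helper (the function name and fallback host are assumptions, not current package code): read the ES_HOST environment variable, falling back to a local default when it is unset.

```r
# Hypothetical helper (not in the package today): resolve the test
# cluster's host from the ES_HOST environment variable, falling back
# to a local default when the variable is unset or empty.
.get_es_host <- function() {
    host <- Sys.getenv("ES_HOST", unset = "")
    if (identical(host, "")) {
        host <- "http://localhost:9200"
    }
    return(host)
}
```

Tests would then call .get_es_host() instead of hard-coding a hostname, so CI could point the suite at any cluster by exporting ES_HOST.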

unimplemented agg: terms - cardinality

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: "percentiles"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Python package

After much discussion internally at Uptake, we have decided to add the Python version of uptasticsearch to this project!!!

Initial plan is to host the Python package here and provide documentation on how to install it from source. Whether or not we eventually push to PyPi has not been decided.

In anticipation of this, I've created R and Python labels for issues in this project.

This issue can be closed once py-pkg is created.

Travis might be broken

(screenshot of a failed Travis build)

This could be related to the changes in #85 . I overrode some things that were previously being handled automagically by Travis. I wonder if that includes manually installing devtools? Still weird that #85 built though...

unimplemented agg: date_histogram - histogram

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: "significant terms"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Move R into r-pkg directory

Now that the python project is released under the py-pkg directory, we should move the R into r-pkg for consistency

unimplemented agg: date_histogram - percentiles

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Documentation should use inheritParams for shared arguments like 'es_host'

Roxygen's @inheritParams tools allows you to import parameter documentation from one function into another. Some arguments (like es_host and es_index) are re-used throughout uptasticsearch. Right now documentation for those parameters is duplicated across functions, but we could guarantee consistency across functions by centralizing documentation for those parameters in a null object decorated with @inheritParams
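The pattern described above can be sketched roughly as follows (the object name and parameter wording here are illustrative, not the package's actual documentation): document the shared parameters once on a NULL object, then pull them into each function with @inheritParams.

```r
#' Shared parameter docs (illustrative, not the package's actual text)
#'
#' @name doc_shared
#' @param es_host A string identifying an Elasticsearch host, e.g.
#'   "http://mydb.mycompany.com:9200"
#' @param es_index The name of an Elasticsearch index to be queried
NULL

#' Execute a query and return a data.table
#'
#' @inheritParams doc_shared
#' @export
es_search <- function(es_host, es_index) {
    # docs for es_host and es_index are inherited from doc_shared,
    # so their wording stays consistent across all exported functions
}
```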

unimplemented agg: date_histogram - cardinality

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

no support for reverse_nested aggregations

Per discussion in #58 ... es_search does not currently support reverse_nested aggregations. Well, technically it might support them but we have no tests around that.

To be honest I have no idea how those aggregations work, but would love for someone knowledgeable to take a run at this issue. Reference on them:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-reverse-nested-aggregation.html

Requirements for closing this issue:

  1. Add code to our integration tests that creates a structure in our test ES cluster that even allows this type of query (might require figuring out which versions of Elasticsearch actually support reverse_nested stuff)
  2. Create tests that confirm expected results for queries with reverse_nested stuff in them

performance benchmarks

The changes introduced in #51 (thanks again @wdearden !) did not impact any user code or change the algorithmic correctness of uptasticsearch. They did, however, substantially improve the speed and efficiency of the package.

Verifying that this change actually did what it said it would was tricky. We had to try the PR branch manually on our own machines, against other instances of ES, and use system.time() to check the speed.

Would love if someone would take a shot at adding performance benchmarks to our tests for CI! I would like to test the following:

  • peak memory usage
  • total execution time

This will be pretty difficult, I think, because of this library's reliance on connecting to a separate service over a network connection. I envision these tests being limited to the processing of data once it is returned from the server.

unimplemented agg: date_histogram - stats

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

unimplemented agg: "histogram"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Python aggs support docs missing in README

Now that Python's released, we need to update the aggs support table in README.md.

Also, es_search will throw NotImplementedError for aggs right now, so this feature needs to be integrated.

unimplemented agg: date_histogram - extended_stats

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

es_search() : aggregation query crashes if empty bins

I'm having issues getting aggregate searches working. I use the exact aggregate query within the elastic site search and get results fine, but when I run it via the es_search() function, I get an error about missing data:

Error in log_fatal(msg) :
The column given to unpack_nested_data had no data in it.

I understand that there are empty buckets in the aggregation that are affecting the "unpack_nested_data" function. However, this isn't a problem when the aggregation results are written to a file and then read back in and parsed with "chomp_aggs".

This project does not yet have integration tests.

This is a client for a database, but uptasticsearch currently only has unit tests. This project should have integration tests which use uptasticsearch functions on an actual Elasticsearch index.

handling of versioning

@ngparas comments on #73

since there are also minor changes in scrolling between the versions i'd love if we could pull all the version-specific stuff up higher and configure w/ passing functions instead of putting internal switches like this

IMHO the best way to handle this would be to use an internal R6 object that just holds all of the version specific stuff and have that object get passed around to methods that need it (all internal). But idk...open to suggestions
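As a rough illustration of that R6 idea (an assumption about the design, not actual package code), the class name and method below are invented; the object resolves version-specific details once and is then passed to whatever internal function needs them:

```r
# Illustrative only: an internal R6 object holding version-specific
# behavior, so internal functions just ask it for what they need
# instead of branching on the version themselves.
library(R6)

EsVersionConfig <- R6::R6Class(
    "EsVersionConfig"
    , public = list(
        major_version = NULL
        , initialize = function(major_version) {
            self$major_version <- major_version
        }
        # one example of a version-specific detail: the shape of the
        # request body used to continue a scroll changed across versions
        , scroll_body = function(scroll_id, scroll = "5m") {
            if (self$major_version >= 2) {
                return(sprintf('{"scroll": "%s", "scroll_id": "%s"}', scroll, scroll_id))
            }
            return(scroll_id)
        }
    )
)
```

es_search() would build one of these after hitting the cluster's version endpoint, then hand it to the internal scroll/parse functions.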

pkgdown site is broken

As of #87 , our pkgdown docs are no longer in a docs/ folder at the repo root. That means Github-pages can't pick them up!

Is it time for 1.0.0?

Should the next release we do (whenever that happens) be v1.0???

  • v0.3.0 just went out and is now on CRAN.
  • by the time we do our next release, this library will be more than 1 year old
  • the library has a very limited scope and I see no added functionality in the future except auth

Would love to hear your thoughts @wdearden @jayqi @austin3dickey @ngparas @skirmer and any others finding their way to this issue

Support for Auth

Today, uptasticsearch only works when you have the ability to directly query the cluster. If your cluster has some authentication / authorization set up on it (e.g. Shield), this library will not work.

I think we need to add support for auth-enabled 5.x and 6.x clusters. As of 5.x, you can use X-pack to enable security features for Elasticsearch.

I do NOT think we should put any time or effort into supporting Shield on earlier versions of ES, but open to discussion if some users say they have a need for it.

unimplemented agg: terms - date_histogram - cardinality

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

Python package should have Sphinx infra set up

The Python package uses Sphinx docstrings and we declare various sphinx packages in setup.py, but none of the actual infra (like a conf.py) to run the docs is set up.

Closing this issue involves walking through sphinx-quickstart.

appveyor testing is not set up

Looking for someone to figure out how to configure appveyor testing for this repo. I've added the project to the UptakeOpenSource account with appveyor, just need someone to figure out how to create the .appveyor.yml that will run our tests

Remove "old = names(DT)" from calls to data.table::setnames

A few times in our codebase, we have

data.table::setnames(someDT, old = names(someDT), new = new_names)

This is actually less safe than the following call with unnamed arguments:

data.table::setnames(someDT, new_names)

because the former will break when someDT has duplicate column names. It's not this package's job to break if there are duplicate names in a user's data.
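The failure mode can be demonstrated directly with a small standalone example (not package code):

```r
# Demonstration: with duplicate column names, the named-argument form
# of setnames() fails, while the positional form renames by position.
library(data.table)

# data.table permits duplicate column names (check.names defaults to FALSE)
someDT <- data.table(a = 1:3, a = 4:6)

# named form: old = names(someDT) contains duplicates, so this errors
res <- try(
    setnames(someDT, old = names(someDT), new = c("left", "right"))
    , silent = TRUE
)
print(inherits(res, "try-error"))

# positional form: renames by position, duplicates and all
setnames(someDT, c("left", "right"))
print(names(someDT))
```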

unimplemented agg: date_histogram - significant_terms

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

es_search should reject NULL passed to index

I accidentally did this tonight, and got unexpected behavior. With a NULL for index, your URL will look like http://mycluster.whatever:9200//_search, which is totally valid for Elasticsearch and means "search all indexes".

I propose that the case where you are explicitly passing a NULL is almost CERTAINLY a mistake and we should err on the side of caution and break when that happens. Searching over all indexes is a valid use case (e.g. for cases where you have data with the same mapping stored in different indexes for different time periods), but we should only support it by explicitly passing _all.

Thoughts @ngparas @austin3dickey @mfrasco ?
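A minimal sketch of the proposed guard (the helper name is hypothetical, not current package code):

```r
# Hypothetical validation: refuse NULL outright, and require an explicit
# "_all" from callers who really do want to search every index.
.validate_es_index <- function(es_index) {
    if (is.null(es_index)) {
        stop(
            "es_index cannot be NULL. If you really want to search across "
            , "all indexes, explicitly pass es_index = '_all'."
        )
    }
    return(invisible(es_index))
}
```

es_search() would call this at the top, so the accidental-NULL case fails loudly before any URL is built.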

unimplemented agg: "stats"

Currently, this type of aggregation is not supported by the Python package. "not supported" means that es_search() cannot parse a result from this type of query into a pandas DataFrame.

It's possible that this is handled easily and correctly by pandas.DataFrame.from_json().

To close this issue, a PR would need to:

  1. implement unit tests similar to the corresponding R tests in r-pkg/tests/testthat/test-chomp_aggs.R. See the test results in test_data/. There is one file there corresponding to each aggregation type (or combination of agg types) mentioned in README.md
  2. update the corresponding line in the agg support table in README.md
  3. if necessary, add code to fetch_all.py. Parsing functions should be implemented as internal functions (with leading underscores) to ensure that es_search() stays magical. Reference the R implementation to see how this could be done.

`unpack_nested_data` is slower than `tidyr::unnest`

From what I can see, unpack_nested_data is an order of magnitude slower than tidyr::unnest.

library(tidyverse)
library(microbenchmark)
library(uptasticsearch)
library(data.table)

n <- 1000

nested_df <-
  tibble(
    x = 1:n,
    y = rep(list(c("a", "b", "c")), n)
  )

nested_dt <- as.data.table(nested_df)

microbenchmark(
  unnest(nested_dt),
  unnest(nested_df),
  unpack_nested_data(nested_dt, col_to_unpack = "y"),
  times = 100
)

Since unpack_nested_data has the same functionality as tidyr::unnest but is limited to data.tables and to unnesting a single column, it should be faster than tidyr::unnest in those cases.

Most of the computational cost is in the line lapply(listDT, data.table::as.data.table). I wrote a version of the function which uses the basic idea of tidyr::unnest without the nonstandard evaluation and without the dplyr verbs (but still with purrr, since that's already an imported package).

unpack_nested_data <- function(chomped_df, col_to_unpack) {

    # input checks
    if (!("data.table" %in% class(chomped_df))) {
        msg <- "For unpack_nested_data, chomped_df must be a data.table"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    if (".id" %in% names(chomped_df)) {
        msg <- "For unpack_nested_data, chomped_df cannot have a column named '.id'"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    if (!("character" %in% class(col_to_unpack)) || length(col_to_unpack) != 1) {
        msg <- "For unpack_nested_data, col_to_unpack must be a character of length 1"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }
    if (!(col_to_unpack %in% names(chomped_df))) {
        msg <- "For unpack_nested_data, col_to_unpack must be one of the column names"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }

    outDT <- data.table::copy(chomped_df)
    listDT <- outDT[[col_to_unpack]]

    # the nested column must hold all data frames or all atomic vectors
    is_df <- purrr::map_lgl(listDT, is.data.frame)
    is_atomic <- purrr::map_lgl(listDT, purrr::is_atomic)
    if (all(is_df)) {
        newDT <- data.table::rbindlist(listDT, fill = TRUE)
    } else if (all(is_atomic)) {
        newDT <- data.table::as.data.table(unlist(listDT))
    } else {
        msg <- "For unpack_nested_data, col_to_unpack must be all atomic vectors or all data frames"
        futile.logger::flog.fatal(msg)
        stop(msg)
    }

    if (nrow(newDT) == 0) {
        msg <- "The column given to unpack_nested_data had no data in it."
        futile.logger::flog.fatal(msg)
        stop(msg)
    }

    # repeat the other columns once per element of the unpacked column
    group_vars <- setdiff(names(chomped_df), c(names(newDT), col_to_unpack))
    n <- purrr::map_int(listDT, NROW)
    rest <- chomped_df[rep(seq_len(nrow(chomped_df)), n), ..group_vars]
    outDT <- data.table::data.table(newDT, rest)

    # unlist() on atomic vectors yields a column named V1; restore the original name
    if ("V1" %in% names(outDT)) {
        data.table::setnames(outDT, "V1", col_to_unpack)
    }
    return(outDT)
}

This is about 2.5x faster than tidyr::unnest in the example above. I can submit a pull request if this makes sense. It works but it doesn't match all of the edge cases in the tests yet.
