dyfanjones / noctua
Connect R to Athena using paws SDK (DBI Interface)
Home Page: https://dyfanjones.github.io/noctua/
License: Other
Hello Dyfan,
I am running RStudio on an Amazon Linux EC2 instance, and I am getting this error message when attempting to install the noctua package.
I am currently running this analysis through this link (http://localhost:8787/).
install.packages("noctua")
Installing package into ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependencies ‘curl’, ‘openssl’, ‘httr’, ‘xml2’, ‘paws.common’, ‘paws.compute’, ‘paws.storage’, ‘paws.database’, ‘paws.networking’, ‘paws.management’, ‘paws.machine.learning’, ‘paws.analytics’, ‘paws.security.identity’, ‘paws.application.integration’, ‘paws.cost.management’, ‘paws.customer.engagement’, ‘paws’
trying URL 'https://cran.rstudio.com/src/contrib/curl_4.3.tar.gz'
Content type 'application/x-gzip' length 673779 bytes (657 KB)
==================================================
downloaded 657 KB
trying URL 'https://cran.rstudio.com/src/contrib/openssl_1.4.2.tar.gz'
Content type 'application/x-gzip' length 1204168 bytes (1.1 MB)
==================================================
downloaded 1.1 MB
trying URL 'https://cran.rstudio.com/src/contrib/httr_1.4.2.tar.gz'
Content type 'application/x-gzip' length 159950 bytes (156 KB)
==================================================
downloaded 156 KB
trying URL 'https://cran.rstudio.com/src/contrib/xml2_1.3.2.tar.gz'
Content type 'application/x-gzip' length 271876 bytes (265 KB)
==================================================
downloaded 265 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.common_0.3.3.tar.gz'
Content type 'application/x-gzip' length 63406 bytes (61 KB)
==================================================
downloaded 61 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.compute_0.1.9.tar.gz'
Content type 'application/x-gzip' length 864214 bytes (843 KB)
==================================================
downloaded 843 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.storage_0.1.9.tar.gz'
Content type 'application/x-gzip' length 385287 bytes (376 KB)
==================================================
downloaded 376 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.database_0.1.9.tar.gz'
Content type 'application/x-gzip' length 628403 bytes (613 KB)
==================================================
downloaded 613 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.networking_0.1.9.tar.gz'
Content type 'application/x-gzip' length 475508 bytes (464 KB)
==================================================
downloaded 464 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.management_0.1.9.tar.gz'
Content type 'application/x-gzip' length 842090 bytes (822 KB)
==================================================
downloaded 822 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.machine.learning_0.1.9.tar.gz'
Content type 'application/x-gzip' length 424388 bytes (414 KB)
==================================================
downloaded 414 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.analytics_0.1.9.tar.gz'
Content type 'application/x-gzip' length 465442 bytes (454 KB)
==================================================
downloaded 454 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.security.identity_0.1.9.tar.gz'
Content type 'application/x-gzip' length 887814 bytes (867 KB)
==================================================
downloaded 867 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.application.integration_0.1.9.tar.gz'
Content type 'application/x-gzip' length 178463 bytes (174 KB)
==================================================
downloaded 174 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.cost.management_0.1.9.tar.gz'
Content type 'application/x-gzip' length 66152 bytes (64 KB)
==================================================
downloaded 64 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.customer.engagement_0.1.9.tar.gz'
Content type 'application/x-gzip' length 202482 bytes (197 KB)
==================================================
downloaded 197 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws_0.1.9.tar.gz'
Content type 'application/x-gzip' length 360692 bytes (352 KB)
==================================================
downloaded 352 KB
trying URL 'https://cran.rstudio.com/src/contrib/noctua_1.8.0.tar.gz'
Content type 'application/x-gzip' length 381011 bytes (372 KB)
==================================================
downloaded 372 KB
* installing *source* package ‘curl’ ...
** package ‘curl’ successfully unpacked and MD5 sums checked
Package libcurl was not found in the pkg-config search path.
Perhaps you should add the directory containing `libcurl.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libcurl' found
Package libcurl was not found in the pkg-config search path.
Perhaps you should add the directory containing `libcurl.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libcurl' found
Using PKG_CFLAGS=
Using PKG_LIBS=-lcurl
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libcurl was not found. Try installing:
* deb: libcurl4-openssl-dev (Debian, Ubuntu, etc)
* rpm: libcurl-devel (Fedora, CentOS, RHEL)
* csw: libcurl_dev (Solaris)
If libcurl is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libcurl.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘curl’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/curl’
Warning in install.packages :
installation of package ‘curl’ had non-zero exit status
* installing *source* package ‘openssl’ ...
** package ‘openssl’ successfully unpacked and MD5 sums checked
Using PKG_CFLAGS=
--------------------------- [ANTICONF] --------------------------------
Configuration failed because openssl was not found. Try installing:
* deb: libssl-dev (Debian, Ubuntu, etc)
* rpm: openssl-devel (Fedora, CentOS, RHEL)
* csw: libssl_dev (Solaris)
* brew: [email protected] (Mac OSX)
If openssl is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a openssl.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
-------------------------- [ERROR MESSAGE] ---------------------------
tools/version.c:1:10: fatal error: openssl/opensslv.h: No such file or directory
#include <openssl/opensslv.h>
^~~~~~~~~~~~~~~~~~~~
compilation terminated.
--------------------------------------------------------------------
ERROR: configuration failed for package ‘openssl’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/openssl’
Warning in install.packages :
installation of package ‘openssl’ had non-zero exit status
* installing *source* package ‘xml2’ ...
** package ‘xml2’ successfully unpacked and MD5 sums checked
Package libxml-2.0 was not found in the pkg-config search path.
Perhaps you should add the directory containing `libxml-2.0.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libxml-2.0' found
Package libxml-2.0 was not found in the pkg-config search path.
Perhaps you should add the directory containing `libxml-2.0.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libxml-2.0' found
Using PKG_CFLAGS=
Using PKG_LIBS=-lxml2
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libxml-2.0 was not found. Try installing:
* deb: libxml2-dev (Debian, Ubuntu, etc)
* rpm: libxml2-devel (Fedora, CentOS, RHEL)
* csw: libxml2_dev (Solaris)
If libxml-2.0 is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libxml-2.0.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘xml2’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/xml2’
Warning in install.packages :
installation of package ‘xml2’ had non-zero exit status
ERROR: dependencies ‘curl’, ‘openssl’ are not available for package ‘httr’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/httr’
Warning in install.packages :
installation of package ‘httr’ had non-zero exit status
ERROR: dependencies ‘httr’, ‘xml2’ are not available for package ‘paws.common’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.common’
Warning in install.packages :
installation of package ‘paws.common’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.compute’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.compute’
Warning in install.packages :
installation of package ‘paws.compute’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.storage’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.storage’
Warning in install.packages :
installation of package ‘paws.storage’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.database’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.database’
Warning in install.packages :
installation of package ‘paws.database’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.networking’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.networking’
Warning in install.packages :
installation of package ‘paws.networking’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.management’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.management’
Warning in install.packages :
installation of package ‘paws.management’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.machine.learning’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.machine.learning’
Warning in install.packages :
installation of package ‘paws.machine.learning’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.analytics’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.analytics’
Warning in install.packages :
installation of package ‘paws.analytics’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.security.identity’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.security.identity’
Warning in install.packages :
installation of package ‘paws.security.identity’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.application.integration’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.application.integration’
Warning in install.packages :
installation of package ‘paws.application.integration’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.cost.management’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.cost.management’
Warning in install.packages :
installation of package ‘paws.cost.management’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.customer.engagement’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.customer.engagement’
Warning in install.packages :
installation of package ‘paws.customer.engagement’ had non-zero exit status
ERROR: dependencies ‘paws.compute’, ‘paws.storage’, ‘paws.database’, ‘paws.networking’, ‘paws.management’, ‘paws.machine.learning’, ‘paws.analytics’, ‘paws.security.identity’, ‘paws.application.integration’, ‘paws.cost.management’, ‘paws.customer.engagement’ are not available for package ‘paws’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws’
Warning in install.packages :
installation of package ‘paws’ had non-zero exit status
ERROR: dependency ‘paws’ is not available for package ‘noctua’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/noctua’
Warning in install.packages :
installation of package ‘noctua’ had non-zero exit status
My apologies for the long error message; I wanted to provide as much context as possible.
I am fairly new to the tech industry and am wondering if this is a noctua-specific problem?
If so, any advice is greatly appreciated.
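This isn't noctua-specific: every failure above is a missing system library. One way to confirm is to ask pkg-config directly; a sketch, assuming a RHEL-family system such as Amazon Linux (package names differ on other distros):

```shell
# Check whether the development headers that curl, openssl and xml2 need
# are visible to pkg-config.
if command -v pkg-config >/dev/null 2>&1; then
  for pkg in libcurl openssl libxml-2.0; do
    if pkg-config --exists "$pkg"; then
      echo "$pkg: found"
    else
      echo "$pkg: MISSING (install the matching -devel package)"
    fi
  done
else
  echo "pkg-config is not installed"
fi
# The usual fix on Amazon Linux / RHEL (run with sudo outside this check):
# sudo yum install -y libcurl-devel openssl-devel libxml2-devel
```

After installing the missing -devel packages, re-run install.packages("noctua").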
library(noctua)
con <- dbConnect(noctua::athena())
res <- dbExecute(con, "select * from sampledb.elb_logs")
result <- noctua:::poll(res)
result_info <- noctua:::split_s3_uri(result$QueryExecution$ResultConfiguration$OutputLocation)
obj <- res@connection@ptr$S3$get_object(Bucket = result_info$bucket, Key = result_info$key)
# method_1: read the raw S3 body straight into data.table
data.table::fread(readBin(obj$Body, what = "character"))
# method_2
writeBin(obj$Body, con = "test.csv")
data.table::fread("test.csv")
Data return size: 19 X 1,356,206
Data memory size: 335.3 Mb
The latest noctua and RAthena releases caused an error in the "CRAN Package Check Results for Package noctua".
The error comes from the suggested package arrow failing to install:
Error: write_parquet requires the arrow package, please install it first and try again
As arrow is only a suggested dependency, all unit tests using arrow should first check that arrow is correctly installed before proceeding (per CRAN's Writing R Extensions). This was an oversight in the latest unit tests.
noctua and RAthena will be re-released before 2020-11-25.
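A minimal sketch of the kind of guard the unit tests need, using testthat's skip_if_not_installed(); the test name and body are illustrative, not noctua's actual tests:

```r
library(testthat)

test_that("parquet upload works", {
  # Skip cleanly on CRAN machines where the suggested arrow package
  # is not installed, instead of erroring.
  skip_if_not_installed("arrow")
  tmp <- tempfile(fileext = ".parquet")
  arrow::write_parquet(data.frame(x = 1:3), tmp)
  expect_true(file.exists(tmp))
})
```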
It appears that some users are running into throttling errors. A user has found a possible solution: https://community.rstudio.com/t/use-exponential-back-off-with-dbi/56870. To make life easier for users, this solution should be built into the package.
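A sketch of the linked exponential back-off idea, assuming a DBI connection `con`; retry_query and the error matching are illustrative, not noctua's built-in implementation:

```r
# Retry a query with exponential back-off plus jitter when AWS throttles.
retry_query <- function(con, sql, max_retries = 5) {
  for (i in seq_len(max_retries)) {
    result <- tryCatch(DBI::dbGetQuery(con, sql), error = identity)
    if (!inherits(result, "error")) return(result)
    # Only retry throttling errors; re-raise everything else.
    if (!grepl("ThrottlingException", conditionMessage(result))) stop(result)
    Sys.sleep(2^i + stats::runif(1))
  }
  stop("Rate exceeded: giving up after ", max_retries, " retries")
}
```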
Similar to RAthena, noctua's next release will be slightly delayed due to changes in \donttest.
Please see DyfanJones/RAthena#108 for full details
For some reason codecov has dropped from 83% to 3%. The code changes shouldn't have caused this issue. Plus it seems codecov isn't picking up the new unit tests in #108.
Hey Dyfan,
I apologize for coming over here with yet another issue.
dbConnect() returns the following error:
Error in get_region(cfgs$credentials$profile) : No region provided
library(noctua)
library(DBI)
con <- dbConnect(noctua::athena(),
                 aws_access_key_id = 'MyAccessKey',
                 aws_secret_access_key = 'MySecretKey',
                 s3_staging_dir = 's3://my-staging-dir-us-west-2/',
                 region_name = 'us-west-2')
Currently erroring with the R-hub check.
https://builder.r-hub.io/status/noctua_1.10.0.tar.gz-f2324215e0b840b1ae002c4fead75766
Hi Dyfan, bumped into another bug.
If a glue table definition has a trailing '/' in the s3 path, dbRemoveTable doesn't work. (note, such a table definition is allowed in glue)
I think the reason is on this line and the pasting of the extra '/' at the end of the s3_path$key:
Line 608 in e3ce929
For example, if we have a path like "s3://mybucket/mypath/mysubpath/", this will produce a key "s3://mybucket/mypath/mysubpath//" and things will break down.
Also, if a table definition points directly to a single file, as in "s3://mybucket/mypath/mysubpath/myfile.csv", dbRemoveTable will not work either.
As a fix, I would propose omitting the extra '/'.
All the best and happy holidays,
Ossi
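A sketch of the proposed fix; the helper name is an assumption, not noctua's actual code:

```r
# Strip any trailing slashes from an S3 key before building the delete
# prefix, so "mypath/mysubpath/" and "mypath/mysubpath" behave the same.
normalise_key <- function(key) sub("/+$", "", key)

normalise_key("mypath/mysubpath/")  # "mypath/mysubpath"
normalise_key("mypath/mysubpath")   # unchanged
```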
Hi @DyfanJones, I'm sure you're aware of this, but anyway, the functions referenced in DyfanJones/RAthena#121 should be implemented in noctua too.
Hello Dyfan,
I raised this question on stackoverflow.com as @datascientist2117; thank you for your response.
Here is the code I currently have.
library(tidyverse)
library(dplyr)
library(dbplyr)
library(odbc)
library(DBI)
library(RAthena)
library(rJava)
odbcListDrivers()
sort((unique(odbcListDrivers()[[1]])))
con <- DBI::dbConnect(
odbc::odbc(),
Driver = "[Simba Athena ODBC Driver]",
S3OutputLocation = "[s3://bucket-folder/]",
AwsRegion = "[region]",
AuthenticationType = "IAM Credentials",
Schema = "[schema]",
UID = rstudioapi::askForPassword("AWS Access Key"),
PWD = rstudioapi::askForPassword("AWS Secret Key"))
All code above is exactly the same except for the omitted values in S3OutputLocation, AwsRegion, and Schema.
I am currently using Simba Athena ODBC Driver Version 1.1.2
I hope this information is sufficient; please let me know if there is anything else I can provide!
The 'statement' argument to dbGetQuery and dbSendQuery methods isn't exposed in older versions of R, resulting in an error that it cannot find the 'statement' object.
First, install version 2.3.0 of the noctua package under R version 3.4.4. Then take a look at
library(noctua)
showMethods("dbGetQuery", includeDefs=TRUE)
showMethods("dbSendQuery", includeDefs=TRUE)
It will say:
Function: dbGetQuery (package DBI)
conn="AthenaConnection", statement="character"
function (conn, ...)
{
.local <- function (conn, statement = NULL, statistics = FALSE,
unload = FALSE, ...)
{
con_error_msg(conn, msg = "Connection already closed.")
stopifnot(is.logical(statistics), is.logical(unload))
rs <- dbSendQuery(conn, statement = statement, unload = unload)
if (statistics)
print(dbStatistics(rs))
out <- dbFetch(res = rs, n = -1, ...)
dbClearResult(rs)
return(out)
}
.local(conn, statement, ...)
}
and
Function: dbSendQuery (package DBI)
conn="AthenaConnection", statement="character"
function (conn, ...)
{
.local <- function (conn, statement = NULL, unload = FALSE,
...)
{
con_error_msg(conn, msg = "Connection already closed.")
stopifnot(is.logical(unload))
res <- AthenaResult(conn = conn, statement = statement,
s3_staging_dir = conn@info$s3_staging, unload = unload)
return(res)
}
.local(conn, statement, ...)
}
Whereas, under R version 3.5.0 (and later), the methods do expose the statement argument, and there is no error when using them. So naturally I am wondering if this is due to R-core correcting a bug going from version 3.4.4 to 3.5.0 of R, or could it be considered a bug in noctua itself? If the former, should noctua require R version >= 3.5.0? In any case, curious to hear your thoughts or suggestions.
devtools::session_info()
#> output
Session info -------------------------------------------------------------------
setting value
version R version 3.4.4 (2018-03-15)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
tz America/New_York
date 2021-11-04
Packages -----------------------------------------------------------------------
package * version date source
data.table 1.12.6 2019-10-18 CRAN (R 3.4.4)
DBI 1.1.0 2019-12-15 CRAN (R 3.4.4)
devtools 1.12.0 2016-12-05 CRAN (R 3.4.0)
digest 0.6.27 2020-10-24 CRAN (R 3.4.4)
memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
noctua * 2.3.0 2021-10-26 CRAN (R 3.4.4)
paws 0.1.12 2021-09-03 CRAN (R 3.4.1)
rstudioapi 0.11 2020-02-07 CRAN (R 3.4.4)
withr 2.1.2 2018-03-15 CRAN (R 3.4.4)
There seems to be no implementation of grepl in the sql_translator for use with dbplyr.
For example:
> dplyr::tbl(db_connection, "concept") %>% filter(grepl('search_str', concept_name)) %>% show_query()
<SQL>
SELECT *
FROM "concept"
WHERE (grepl('search_str', "concept_name"))
Instead I expected the output to be something like this:
<SQL>
SELECT *
FROM "concept"
WHERE concept_name like '%search_str%'
Other backends for dbplyr include support for grepl, e.g. here is the postgres implementation.
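A sketch of what the translation could look like, mapping grepl() onto Athena/Presto's regexp_like(); the function name and the integration point in noctua's translator are assumptions:

```r
library(dbplyr)

# Hypothetical translator entry: grepl(pattern, x) would become
# regexp_like(x, pattern) in the generated SQL.
athena_grepl <- function(pattern, x) {
  sql_expr(regexp_like(!!x, !!pattern))
}
```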
Please look at DyfanJones/RAthena#129 for extra details.
Under noctua 1.10.0, going to paws.database 0.1.10 seems to cause curl timeouts when using the dplyr interface to queries.
Generating a clean reprex is tricky, but I have a local query managed under renv that reliably replicates the problem. Here is a redacted query (hitting an Apache log store in parquet format) that demonstrates the problem:
> con <- dbConnect(noctua::athena(),
profile_name = "REDACTED",
region = "us-east-2",
s3_staging_dir = 's3://REDACTED',
work_group = "REDACTED")
> query <- str_glue("
SELECT date_parse(timestamp, '%d/%b/%Y:%H:%i:%s +0000') AS timestamp,
verb, request, response, CAST(bytes as integer) AS bytes, referrer, agent
FROM REDACTED")
> dat <- tbl(con, sql(query)) %>% collect()
Info: (Data scanned: 14.09 MB)
Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [REDACTED.s3.us-east-2.amazonaws.com] Operation timed out after 10000 milliseconds with 119006796 out of 316782524 bytes received
Request failed. Retrying in 0.7 seconds...
Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [REDACTED.s3.us-east-2.amazonaws.com] Operation timed out after 10000 milliseconds with 112822860 out of 316782524 bytes received
Request failed. Retrying in 2.2 seconds...
If left to run, the query goes through exponential back-off and eventually fails.
Running the same query under paws.database 0.1.9 works without issue. noctua 1.9.1 also hits this problem, so this seems to be something in the interface with paws (or maybe even a problem with paws itself).
Really appreciate the package. If there's a better way to help debug this, please let me know!
Development of this will be tracked in RAthena ticket: #26
Thanks for the very nifty package!
paws.common recently added support for credential_process in .aws/config files, allowing role assumption, nested profiles, and external credentials all to work seamlessly together. It looks like noctua constructs the config object itself rather than relying on paws, though I'm not clear on the full object model of paws. Is it practical for noctua to hand off all authentication work to paws, allowing a user to take advantage of this greatly improved credential process?
One speculative idea: would it make sense to allow a user to optionally skip reading in the glue catalog when running dbConnect in rstudio?
Of course it is often helpful to have, for example, the data catalog available for browsing in the connections pane in Rstudio.
However, when the catalog becomes very large, it can take pretty long to establish the connection (a minute or so in my case, I think some lake formation shares make things worse), and often one doesn't need the catalog for anything. In interactive work this kind of a delay can be a bit annoying, but obviously nothing too dangerous.
So, what do you think, would this be doable, and would it make sense to implement something like this? I'm unfortunately pretty clueless with the interaction with rstudio, so not sure what this would require.
When uploading data frames to AWS Athena, noctua only splits compressed delimited files, due to DyfanJones/RAthena#36. Would other file.types benefit from similar file splitting?
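A sketch of the file-splitting idea for other file types: break a data.frame into bounded chunks before writing and uploading each part. The chunk size is illustrative, not noctua's actual default:

```r
# Split a data.frame into chunks of at most max_rows rows each.
split_frame <- function(df, max_rows = 1000000L) {
  split(df, (seq_len(nrow(df)) - 1L) %/% max_rows)
}

length(split_frame(data.frame(x = 1:25), max_rows = 10L))  # 3 chunks
```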
The next release is planned for 14/05/2020. This is to prevent the package being released too frequently to CRAN.
If the new features are required before the planned release, please let me know and the release date can be changed.
Hi Dyfan,
it appears that noctua doesn't know how to interpret a query result with the type 'real'.
For example:
res <- DBI::dbSendStatement(cona, "create table temp.realexample as (select cast(1 as real) as realcol)")
# Wait until table is created
dplyr::tbl(cona, dbplyr::in_schema("temp", "realexample"))
produces the error: "Error: Unknown shortcut: real"
Should the AthenaToRDataType function include a mapping for 'real' (instead of 'float')?
Line 23 in 3b9d5d5
Line 47 in 3b9d5d5
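A sketch of the suggested mapping: treat Athena's 'real' like 'float', i.e. convert it to an R double. The lookup table below is illustrative, not noctua's actual AthenaToRDataType implementation:

```r
# Athena type -> R type, with the proposed 'real' entry added.
athena_to_r <- c(
  boolean = "logical",
  integer = "integer",
  float   = "double",
  real    = "double",   # proposed addition
  varchar = "character"
)
athena_to_r[["real"]]  # "double"
```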
First, noctua is excellent and very impressive!
Just a slight improvement: the DESCRIPTION file should be modified to suggest or require uuid version >= 0.1-4. Earlier versions of uuid do not allow the n argument of uuid::UUIDgenerate() and may cause noctua::dbWriteTable to fail.
Best regards,
Michael
Support GZIP compression:
https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html
Use the GZIP compression in Athena for querying Amazon Kinesis Data Firehose logs. Athena and Amazon Kinesis Data Firehose each support different versions of SNAPPY, so GZIP is the only compatible format.
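A sketch of producing a GZIP-compressed CSV from R with data.table (>= 1.12.4), which Athena can query once the table DDL points at the file; the file name and columns are illustrative:

```r
library(data.table)

# fwrite gained a compress argument in 1.12.4; "gzip" writes a .csv.gz
# directly, no external gzip step needed.
dt <- data.table(ts = Sys.time(), status = 200L)
fwrite(dt, "firehose_logs.csv.gz", compress = "gzip")
```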
noctua::athena() does connect to the database, BUT only to the database with the default profile in the AWS .config/.credentials files.
When I run this
con <- dbConnect(noctua::athena(),
profile_name = "default",
s3_staging_dir = "s3://aws-athena-query-results-for-default-profile-us-east-1")
or
con <- dbConnect(noctua::athena(),
s3_staging_dir = "s3://aws-athena-query-results-for-default-profile-us-east-1")
It does connect to the correct AWS database under the default profile in the AWS .config/.credentials file and I'm able to run queries against it.
$s3_staging
[1] "s3://aws-athena-query-results-for-default-profile-us-east-1"
$dbms.name
[1] "default"
$work_group
[1] "primary"
$poll_interval
NULL
$encryption_option
NULL
$kms_key
NULL
$expiration
NULL
$region_name
[1] "us-east-1"
$paws
[1] "0.1.7"
$noctua
[1] "1.6.0"
However, when I run this,
con_2 <- dbConnect(noctua::athena(),
profile_name = "other-profile-that-is-not-default",
s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/",
region = "us-east-1")
$profile_name
[1] "other-profile-that-is-not-default"
$s3_staging
[1] "s3://aws-athena-query-results-for-other-profile-us-east-1"
$dbms.name
[1] "default"
$work_group
[1] "primary"
$poll_interval
NULL
$encryption_option
NULL
$kms_key
NULL
$expiration
NULL
$region_name
[1] "us-east-1"
$paws
[1] "0.1.7"
$noctua
[1] "1.6.0"
then it still connects, but again to the default database. How can I successfully connect to a non-default profile? It looks like dbGetInfo(con_2) has the correct information, but for whatever reason it does not connect to the correct database and I see the exact same tables as for con.
This is what the .config file looks like:
[default]
region = us-east-1
output = text
[profile other-profile-that-is-not-default]
region = us-east-1
saml.session_duration = 43200
saml.idp_arn = arn:aws:iam::XXXX:saml-provider/g-suite
saml.role_arn = arn:aws:iam::XXXXXXX:role/other-profile_that-is-not-default
and this is what the .credentials file looks like:
[default]
aws_access_key_id = XXXX
aws_secret_access_key = XXXXXXX
[other-profile-that-is-not-default]
aws_access_key_id = XXXXX
aws_secret_access_key = XXXX
aws_session_token = XXXXX
aws_security_token = XXXXXX
Any help would be greatly appreciated especially since I have no idea how to set up the .config/.credentials file differently.
Currently noctua doesn't support the AWS Athena data types [array, map, struct].
The development of this will be covered by RAthena ticket: #25
When 'connecting' to a table with dplyr's tbl()-function, a query is sent to Athena with a 'WHERE 0 == 1' clause for getting the column names. This query is generated by db_query_fields.DBIconnection.
The thing is, querying Athena can be slow at times, and a much faster response could be obtained from Glue.
This, however, works only if a direct connection to a table is made, like tbl(con, in_schema("schema", "table")). If one has a subquery in tbl() (e.g., tbl(con, sql("select * from table where b=c"))), Glue cannot help here.
To handle this, one could define a method, such as:
db_query_fields.AthenaConnection <- function(con, sql, ...) {
  # Test if we have a subquery or a direct table definition
  is_direct <- grepl('^"?[a-å]+"?\\.?"?[a-å]+"?$', trimws(tolower(sql)))
  if (is_direct) { # If a direct definition, get the fields from Glue
    if (!dbIsValid(con)) stop("Connection already closed.", call. = FALSE)
    if (grepl("\\.", sql)) {
      dbms.name <- gsub("\\..*", "", sql)
      Table <- gsub(".*\\.", "", sql)
    } else {
      dbms.name <- con@info$dbms.name
      Table <- sql
    }
    output <- con@ptr$glue$get_table(
      DatabaseName = dbms.name,
      Name = Table)$Table$StorageDescriptor$Columns
    sapply(output, function(y) y$Name)
  } else { # If a subquery, query Athena for the fields
    sql <- sql_select(con, dplyr::sql("*"), sql_subquery(con, sql),
                      where = dplyr::sql("0 = 1"))
    qry <- dbSendQuery(con, sql)
    on.exit(dbClearResult(qry))
    res <- dbFetch(qry, 0)
    names(res)
  }
}
So basically test if we have a direct table definition or a subquery, and query Glue or Athena accordingly.
The weakest part of this would be the first regex for trying to see if we have a direct table def. The good thing is, that if the regex match returns FALSE, we revert to dplyr's default behaviour.
What do you think?
p.s. I have been trying noctua as opposed to RAthena for a few days and it really seems to work as a drop-in replacement. Really like the native 'all R' aspect of it!
| Build ID: | noctua_1.2.0.tar.gz-a51a35c2ada44db7b831b7f50f5f0697 |
| --- | --- |
| Platform: | Ubuntu Linux 16.04 LTS, R-release, GCC |
| Submitted: | 12 minutes 12.5 seconds ago |
| Build time: | 12 minutes 11 seconds |
ERROR: dependency ‘openssl’ is not available for package ‘httr’
Currently Travis CI has broken for noctua and RAthena. Instead of fixing the pipeline, will swap to GitHub Actions instead.
Prepare for release:
Submit to CRAN:
Wait for CRAN...
A user on RStudio Community is experiencing issues where dbWriteTable isn't repairing the table once data has been uploaded to AWS Athena (link). This could be down to dbExecute not fully reporting errors from AWS Athena and not notifying the user that the repair has failed.
Might need to explore using ALTER TABLE instead of MSCK REPAIR for general speed if the user has multiple partitions.
Note: dbExecute has been fixed in the latest PR #80 and will return AWS Athena errors. Will need to check what version the user is using and then debug.
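A sketch of the ALTER TABLE alternative: register just the new partition instead of rescanning everything with MSCK REPAIR TABLE. The database, table, partition key and S3 location below are all illustrative:

```r
# Add a single partition explicitly; much faster than MSCK REPAIR TABLE
# when only one partition changed. Assumes an open noctua connection `con`.
DBI::dbExecute(con, "
  ALTER TABLE mydb.mytable
  ADD IF NOT EXISTS PARTITION (dt = '2020-05-14')
  LOCATION 's3://mybucket/mytable/dt=2020-05-14/'")
```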
Hi Dyfan,
I am uploading data into s3 buckets using your package “noctua” for that purpose -- which is great. But there is one thing I have come across which, after endless testing, I have not yet managed to solve using this code snippet:
### Connect to database
con <- dbConnect(noctua::athena()
, aws_access_key_id = aws_access_key_id
, aws_secret_access_key = aws_secret_access_key
, aws_session_token = aws_session_token
, s3_staging_dir = 's3://bucket/path/to/folder/'
, region = 'us-east-1'
, schema = 'schema'
, work_group = 'abc_ef_gh_jklmnoqr'
)
### Upload to Athena
dbWriteTable(conn = con
, name = 'schema.aaa_bbb_cccccc'
, value = AbcDt
, overwrite = ifelse(r==1, TRUE, FALSE)
, append = ifelse(r==1, FALSE, TRUE)
, file.type = 'json'
, partition=c("ZEITSTEMPEL" = format(Sys.Date(), "%Y%m%d"))
, s3.location = 's3://bucket/path/to/folder/'
)
I am using this code in a loop. By doing so, I have noticed that for the first loop (r==1), the data can be queried in Athena but does not seem to be available or visible in s3. For the following loops (r>1), i.e. if the data is appended, the data is shown in s3, too.
Is this anything you can reproduce?
Many thanks in advance (and for the great work you have put into the package)
Christian
Currently waiting for #310. This is getting resolved by the rhub team and the community. Once it has been fixed, will run through rhub testing before releasing new features for noctua and RAthena on CRAN.
- dbConnect #69
- paste/paste0 defaulted to concat_ws
- data.table: due to fwrite supporting compression in (>= 1.12.4), the dependency needs to be altered to reflect this
- dbWriteTable: utilise existing s3 location when appending
- db_compute: to take into account schema and name from a string
- ThrottlingException: Rate exceeded #58
- paste/paste0 defaulted to concat_ws
- roxygen2 from 6.1.1 to 7.0.2
- pkgdown::site documentation with new roxygen2
- AWS_ATHENA_WORK_GROUP
- csv to tsv. This is to enable array and json to be pushed to Athena from R
- dbRemoveTable: to be able to delete Athena table s3 files
- sql_translate_env: integer misclassified
- NA when uploading to Athena

An RStudio Community issue was raised around how JSON DDLs are created for AWS Athena:
The confusion in how the DDL is created and what row format is accepted. This will mean an update to documentation is required.
Can other json ddl formats be support with json line?
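For reference, one row format Athena does accept for json-lines data is the OpenX JSON SerDe. The sketch below uses placeholder database, table, column, and s3 names, and assumes `con` is an open noctua connection.

```r
library(DBI)

# Sketch DDL for a json-lines table (one JSON object per line).
# Database, table, columns, and the s3 location are placeholders.
DBI::dbExecute(con, "
  CREATE EXTERNAL TABLE mydb.mytable (
    id int,
    payload string
  )
  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  LOCATION 's3://bucket/path/to/json/'
")
```

Documenting which SerDe the generated DDL uses would resolve much of the confusion described above.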
I don't know if you remember, but not too long ago we discovered an issue in paws.common together, in an issue that had to do with pulling profiles into the profile
parameter of dbConnect()
. Everything works fine even with AWS_PROFILE as an env variable. I've been working with the updated packages, and a new error prevents me from running a script on the server.
When I run a query, the data is pulled and written into an r
object (see test_results
below). However, I also get this error message:
Info: (Data scanned: 625.16 GB)
additional arguments ignored in warning()
Warning: AccessDenied (HTTP 403). Access Denied
Which is weird since the data are actually pulled and the connection to the database is successfully established. Any idea why this message is popping up?
As always, any insights would be great!
Example:
library(RJDBC)
library(glue)
library(tidyverse)
library(magrittr)
library(rlang)
library(noctua)
library(DBI)
library(paws.common)
library(paws)
con_3 <- dbConnect(noctua::athena(),
profile_name = "profile",
s3_staging_dir = "s3://aws-athena-query-results-XXXXXXXXXX-us-east-1/",
region = "us-east-1")
test_query <- glue_sql("select event_date, count(x) as x_count from logs.x group by 1 limit 10;", .con = con_3)
# make sure all results are returned from athena
setMethod("dbGetQuery", signature(conn = "JDBCConnection", statement = "character"),
          def = function(conn, statement, ...) {
  # use the method's own arguments rather than the global `con`/`test_query`
  r <- dbSendQuery(conn, statement)
  on.exit(.jcall(r@stat, "V", "close"))
  # Athena can only pull 999 rows at a time
  if (conn@jc %instanceof% "com.amazonaws.athena.jdbc.AthenaConnection") fetch(r, -1, 999)
  else fetch(r, -1)
})
test_results <- dbGetQuery(conn = con_3, statement = test_query)
> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.5.3 (2019-03-11)
os macOS 10.15.3
system x86_64, darwin15.6.0
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2020-06-12
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.5.2)
backports 1.1.7 2020-05-13 [1] CRAN (R 3.5.3)
bit 1.1-15.1 2020-01-14 [2] CRAN (R 3.5.2)
bit64 0.9-7 2017-05-08 [2] CRAN (R 3.5.0)
blob 1.2.1 2020-01-20 [2] CRAN (R 3.5.2)
broom 0.5.6 2020-04-20 [1] CRAN (R 3.5.3)
callr 3.4.3 2020-03-28 [1] CRAN (R 3.5.3)
cellranger 1.1.0 2016-07-27 [2] CRAN (R 3.5.0)
cli 2.0.2 2020-02-28 [1] CRAN (R 3.5.2)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.5.2)
crayon 1.3.4 2017-09-16 [2] CRAN (R 3.5.0)
curl 4.3 2019-12-02 [1] CRAN (R 3.5.2)
data.table 1.12.8 2019-12-09 [1] CRAN (R 3.5.2)
DBI * 1.1.0 2019-12-15 [2] CRAN (R 3.5.2)
dbplyr 1.4.4 2020-05-27 [1] CRAN (R 3.5.3)
desc 1.2.0 2018-05-01 [2] CRAN (R 3.5.0)
devtools 2.3.0 2020-04-10 [1] CRAN (R 3.5.3)
digest 0.6.25 2020-02-23 [1] CRAN (R 3.5.2)
dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.5.2)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.5.3)
fansi 0.4.1 2020-01-08 [2] CRAN (R 3.5.2)
forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.5.2)
fs 1.4.1 2020-04-04 [1] CRAN (R 3.5.3)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.5.0)
ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.5.2)
glue * 1.4.1 2020-05-13 [1] CRAN (R 3.5.3)
gt * 0.2.1 2020-05-23 [1] CRAN (R 3.5.3)
gtable 0.3.0 2019-03-25 [2] CRAN (R 3.5.3)
haven 2.3.0 2020-05-24 [1] CRAN (R 3.5.3)
hms 0.5.3 2020-01-08 [1] CRAN (R 3.5.2)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.5.2)
ini 0.3.1 2018-05-20 [2] CRAN (R 3.5.0)
jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.5.2)
knitr 1.28 2020-02-06 [1] CRAN (R 3.5.2)
lattice 0.20-38 2018-11-04 [2] CRAN (R 3.5.3)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.5.2)
lubridate 1.7.8 2020-04-06 [1] CRAN (R 3.5.3)
magrittr * 1.5 2014-11-22 [2] CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 [2] CRAN (R 3.5.0)
modelr 0.1.8 2020-05-19 [1] CRAN (R 3.5.3)
munsell 0.5.0 2018-06-12 [2] CRAN (R 3.5.0)
nlme 3.1-143 2019-12-10 [2] CRAN (R 3.5.2)
noctua * 1.7.0 2020-05-14 [1] CRAN (R 3.5.3)
packrat 0.5.0 2018-11-14 [2] CRAN (R 3.5.0)
paws * 0.1.8 2020-04-28 [1] CRAN (R 3.5.3)
paws.analytics 0.1.8 2020-04-26 [1] CRAN (R 3.5.3)
paws.common * 0.3.1 2020-04-26 [1] CRAN (R 3.5.3)
paws.storage 0.1.8 2020-04-27 [1] CRAN (R 3.5.3)
pillar 1.4.3 2019-12-20 [1] CRAN (R 3.5.2)
pkgbuild 1.0.6 2019-10-09 [2] CRAN (R 3.5.2)
pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 3.5.2)
pkgload 1.0.2 2018-10-29 [2] CRAN (R 3.5.0)
prettyunits 1.1.1 2020-01-24 [2] CRAN (R 3.5.2)
processx 3.4.2 2020-02-09 [1] CRAN (R 3.5.2)
ps 1.3.2 2020-02-13 [1] CRAN (R 3.5.2)
purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.5.3)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.5.2)
Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 3.5.3)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.5.0)
readxl 1.3.1 2019-03-13 [1] CRAN (R 3.5.2)
remotes 2.1.1 2020-02-15 [1] CRAN (R 3.5.2)
reprex 0.3.0 2019-05-16 [2] CRAN (R 3.5.2)
rJava * 0.9-11 2019-03-29 [2] CRAN (R 3.5.2)
RJDBC * 0.2-7.1 2018-04-16 [2] CRAN (R 3.5.0)
rlang * 0.4.6 2020-05-02 [1] CRAN (R 3.5.3)
RPostgreSQL 0.6-2 2017-06-24 [2] CRAN (R 3.5.0)
rprojroot 1.3-2 2018-01-03 [2] CRAN (R 3.5.0)
rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.5.2)
rvest 0.3.5 2019-11-08 [1] CRAN (R 3.5.2)
scales 1.1.0 2019-11-18 [2] CRAN (R 3.5.2)
sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 3.5.0)
stringi 1.4.6 2020-02-17 [1] CRAN (R 3.5.2)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.5.2)
testthat 2.3.2 2020-03-02 [1] CRAN (R 3.5.2)
tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.5.3)
tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.5.2)
tidyselect 1.0.0 2020-01-27 [2] CRAN (R 3.5.3)
tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.5.2)
usethis 1.6.1 2020-04-29 [2] CRAN (R 3.5.3)
vctrs 0.3.0 2020-05-11 [1] CRAN (R 3.5.3)
withr 2.2.0 2020-04-20 [1] CRAN (R 3.5.3)
xfun 0.12 2020-01-13 [1] CRAN (R 3.5.2)
xml2 1.3.2 2020-04-23 [1] CRAN (R 3.5.3)
The RAthena unit tests will be a good starting point.
Currently noctua just collapses lists using paste
. This is incorrect, as Athena won't be able to interpret the result. For example:
paste(list(list("var3" = 1:3, "var4" = list("var5" = letters))), collapse = "|")
'list(var3 = 1:3, var4 = list(var5 = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z")))'
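One alternative Athena can interpret is serialising nested lists to JSON strings (for example with jsonlite) instead of collapsing the deparsed R structure; Athena's json functions can then parse the column. A sketch:

```r
library(jsonlite)

# Serialise the nested list to a JSON string rather than paste()-ing the
# deparsed R representation, which Athena cannot parse.
x <- list("var3" = 1:3, "var4" = list("var5" = letters[1:3]))
as.character(jsonlite::toJSON(x))
# '{"var3":[1,2,3],"var4":{"var5":["a","b","c"]}}'
```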
Hello Dyfan,
Lately I had problems with RAthena
, so I turned to noctua
instead, and I noticed some peculiar, almost random, issues.
When running the code below I usually get Error: InvalidRequestException (HTTP 400). Idempotent parameters do not match
or sometimes Error: InvalidRequestException (HTTP 400). Could not find results
. However, about 1 in 5 times I retrieve the desired result; it almost seems random. The issue occurs in both the R console and in RStudio. As a side note: if I use, for example, library(aws.s3); bucketlist()
, then it always retrieves the desired result, so I know that I'm assuming the role correctly.
Example:
library(DBI)
library(noctua)
noctua::assume_role(
role_arn = "arn:aws:iam::666666666:role/ARole",
duration_seconds = 3600L,
region_name = "eu-west-1",
set_env = TRUE)
con <- dbConnect(noctua::athena(), s3_staging_dir = "s3://aws-athena-query-results-eu-west-1-328396111111")
dbGetQuery(con, "SELECT row_names, mpg, cyl, disp FROM adhoc.mtcars limit 10")
Screenshot:
Traceback:
dbGetQuery(con, "SELECT row_names, mpg, cyl, disp FROM adhoc.mtcars limit 10")
Error: InvalidRequestException (HTTP 400). Idempotent parameters do not match
> traceback()
8: stop(resp)
7: retry_api_call(response <- conn@ptr$Athena$start_query_execution(QueryString = statement,
QueryExecutionContext = list(Database = conn@info$dbms.name),
ResultConfiguration = ResultConfiguration(conn), WorkGroup = conn@info$work_group))
6: AthenaResult(conn = conn, statement = statement, s3_staging_dir = s3_staging_dir)
5: dbSendQuery(conn, statement = statement)
4: dbSendQuery(conn, statement = statement)
3: .local(conn, statement, ...)
2: dbGetQuery(con, "SELECT * FROM adhoc.mtcars limit 10")
1: dbGetQuery(con, "SELECT * FROM adhoc.mtcars limit 10")
dbGetQuery(con, "SELECT row_names, mpg, cyl, disp FROM adhoc.mtcars limit 10")
Error: InvalidRequestException (HTTP 400). Could not find results
> traceback()
7: stop(resp)
6: retry_api_call(result_class <- res@connection@ptr$Athena$get_query_results(QueryExecutionId = res@info$QueryExecutionId,
MaxResults = as.integer(1))$ResultSet$ResultSetMetadata$ColumnInfo)
5: dbFetch(res = rs, n = -1, ...)
4: dbFetch(res = rs, n = -1, ...)
3: .local(conn, statement, ...)
2: dbGetQuery(con, "SELECT * FROM adhoc.mtcars limit 10")
1: dbGetQuery(con, "SELECT * FROM adhoc.mtcars limit 10")
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] noctua_1.8.0 DBI_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 paws.analytics_0.1.9 digest_0.6.20 zeallot_0.1.0 crayon_1.3.4
[6] paws.common_0.3.3 R6_2.4.0 jsonlite_1.6 backports_1.1.4 httr_1.4.0
[11] pillar_1.4.2 rlang_0.4.0 curl_4.0 data.table_1.13.0 ini_0.3.1
[16] paws_0.1.9 xml2_1.2.1 vctrs_0.2.0 paws.security.identity_0.1.9 readr_1.3.1
[21] paws.storage_0.1.9 hms_0.5.0 compiler_3.6.1 pkgconfig_2.0.2 tibble_2.1.3
Hi,
I have been trying to connect to Athena from Sagemaker. I have been successful, but in the case of slightly bigger data, when I run the following code:
install.packages('noctua')
install.packages('DBI')
con <- DBI::dbConnect(noctua::athena(),
s3_staging_dir = "s3://aws-athena...",
region_name = 'us-west-2')
res <- DBI::dbExecute(con, 'SELECT *
FROM db.table
LIMIT 10000000')
df <- DBI::dbFetch(res)
I get the following error on Sagemaker:
Error: ThrottlingException: Rate exceeded
Traceback:
At first I thought that I needed more RAM, so I resized my instance to 64 GB (the previous one had 16 GB), but the error still persists.
I am fairly new to AWS and its intricacies. While I was looking for a way to overcome this issue I stumbled upon the following article:
Looking at the article, I guess my question would be: is there a way to use some sort of rate-limiting function, if it is supported by paws?
Ps. I hope I wrote this in the correct format and place.
Any help is greatly appreciated,
Ricardo
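As a client-side workaround, a small exponential-backoff wrapper can retry the throttled call. The helper below is purely illustrative and not part of noctua (depending on your version, noctua_options() may also expose retry settings):

```r
# Illustrative exponential-backoff helper (not part of noctua): retries a
# function while the error message contains "ThrottlingException".
with_backoff <- function(f, max_tries = 5, base_wait = 1) {
  for (i in seq_len(max_tries)) {
    out <- tryCatch(f(), error = function(e) {
      if (!grepl("ThrottlingException", conditionMessage(e)) || i == max_tries)
        stop(e)
      # wait 1, 2, 4, ... seconds plus random jitter before retrying
      Sys.sleep(base_wait * 2^(i - 1) + stats::runif(1))
      NULL
    })
    if (!is.null(out)) return(out)
  }
}

df <- with_backoff(function() DBI::dbFetch(res))
```

Passing a function (rather than a bare expression) ensures the call is actually re-executed on each retry instead of R reusing the first forced result.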
Hi Dyfan,
a situation came up when upgrading to dbplyr 2.0.0:
In the new version, in_schema() returns an object of class "dbplyr_schema" (previously "ident"), and the database and table names are automatically quoted.
As an example:
dbplyr < 2.0.0:
> in_schema("database", "table")
<IDENT> database.table
> class(in_schema("database", "table"))
[1] "ident_q" "ident" "character"
dbplyr 2.0.0:
> in_schema("database", "table")
<SCHEMA> `database`.`table`
> class(in_schema("database", "table"))
[1] "dbplyr_schema"
Consequently, in_schema does not work with noctua + dbplyr 2.0.0.
One workaround would be to use tbl(con, sql("database.table")), but this is not too elegant, and a big negative is that Athena is used for getting the table structure instead of paws::glue(), which makes things much slower.
I'm not sure what the best fix would be. Perhaps an AthenaConnection-specific in_schema method? Or perhaps just check for "dbplyr_schema" in addition to "ident" when deciding whether to query glue or Athena (
Line 216 in 1e34ca1
Line 221 in 1e34ca1
What are your thoughts on this?
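Until a proper fix lands, one hypothetical stopgap is to rebuild the unquoted "ident"-style reference that noctua already recognises. The in_schema_ident helper below is illustrative only and not part of noctua or dbplyr:

```r
library(dbplyr)

# Illustrative helper: produce the pre-2.0.0 "ident"-style table reference
# so noctua's glue-based table lookup still triggers.
in_schema_ident <- function(schema, table) {
  dbplyr::ident_q(paste(schema, table, sep = "."))
}

dplyr::tbl(con, in_schema_ident("database", "table"))
```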
Currently RAthena and noctua are having issues with rhub checks. Please see DyfanJones/RAthena#111 for more information.
The paws sdk has developed a method to pass credentials to aws objects. This method is preferred as it will enable multiple connections to different AWS accounts.
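For reference, paws clients accept per-client credentials through their config argument, which is what makes simultaneous connections to different AWS accounts possible. A sketch, with placeholder key values:

```r
library(paws)

# Sketch: build an Athena client with explicit credentials. The key values
# are placeholders; each client can carry a different account's credentials.
athena <- paws::athena(config = list(
  credentials = list(
    creds = list(
      access_key_id = "placeholder",
      secret_access_key = "placeholder",
      session_token = ""
    )
  ),
  region = "us-east-1"
))
```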
Hello again! I've found what I think is another issue around translating to SQL.
When I try to use the summarise
function to get the median value for a column, I get the following error.
> query <- dplyr::tbl(db_connection, "table_name") %>% summarise(median_value=median(column_name))
> query %>% collect()
Error: InvalidRequestException (HTTP 400). line 1:42: mismatched input '('. Expecting: 'BY'
Here is the SQL being executed:
> query %>% show_query()
<SQL>
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "column_name") AS "median_value"
FROM "table_name"
So I believe the issue is that PERCENTILE_CONT is not a valid Presto function. Perhaps we could use one of the approx_percentile functions here?
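Until the translation is changed, one workaround is to bypass dplyr and call Presto's approx_percentile() directly, reusing the connection and the placeholder table/column names from the example above:

```r
library(DBI)

# approx_percentile(x, 0.5) is Presto's approximate median; the table and
# column names are the placeholders from the example above.
median_value <- dbGetQuery(
  db_connection,
  'SELECT approx_percentile("column_name", 0.5) AS "median_value"
   FROM "table_name"'
)
```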
Several users have reached out with some confusion around how the s3 file path is created in the backend. To minimise this confusion, the documentation should be more open and let users know exactly what is going on with the s3 backend.
AWS Athena supports UNLOAD, which allows Athena to write out different file types, e.g. parquet.
This will allow noctua/RAthena to utilise AWS Athena UNLOAD queries and read the parquet format similar to how AWS Data Wrangler
currently does (NOTE: AWS Data Wrangler wraps queries with CTAS (https://docs.aws.amazon.com/athena/latest/ug/ctas.html))
AWS Data Wrangler example with pros/cons of their current CTAS method:
https://aws-data-wrangler.readthedocs.io/en/stable/tutorials/006%20-%20Amazon%20Athena.html
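For reference, an Athena UNLOAD statement writing parquet looks like the following sketch; the s3 path is a placeholder, `con` is assumed to be an open noctua connection, and the wrapping would eventually happen internally in noctua/RAthena:

```r
library(DBI)

# Sketch: wrap a SELECT in UNLOAD so Athena writes the result set out as
# parquet files under the given (placeholder) s3 prefix.
DBI::dbExecute(con, "
  UNLOAD (SELECT * FROM db.table)
  TO 's3://bucket/path/to/unload/'
  WITH (format = 'PARQUET')
")
```

The driver would then read the parquet output from that prefix instead of parsing the csv staging files.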
Hi Dyfan, one more for the week:
when noctua queries glue for a list of databases with get_databases(), the list is truncated in case the result gets paginated. Same with get_tables().
Not sure what the underlying reason is, but in one of our environments the first call to glue returns only the first 23 databases, and for one reason or another setting the MaxResults parameter doesn't help here.
One fix would be to implement a separate paginated reader function and replace the native paws calls with that, right? Perhaps a more elegant implementation could be devised.
get_all_databases <- function(glue) {
  cresults <- glue$get_databases()
  cdbs <- sapply(cresults$DatabaseList, function(x) x$Name)
  citers <- 0
  # follow NextToken until glue stops paginating (capped at 1000 iterations)
  while (length(cresults$NextToken) > 0 && citers < 1000) {
    cresults <- glue$get_databases(NextToken = cresults$NextToken)
    cdbs <- c(cdbs, sapply(cresults$DatabaseList, function(x) x$Name))
    citers <- citers + 1
  }
  cdbs
}
(I tend to get nervous with these while loops so I added the max iteration count there...)
Currently Windows Server 2008 R2 SP1, R-devel, 32/64 bit is failing:
* checking package dependencies ... ERROR
Package required but not available: 'paws'
Packages suggested but not available:
'arrow', 'data.table', 'dplyr', 'dbplyr', 'testthat'
The suggested packages are required for a complete check.
Checking can be attempted without them by setting the environment
variable _R_CHECK_FORCE_SUGGESTS_ to a false value.
See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.
Issue raised to rhub: #305