dyfanjones / noctua
Connect R to Athena using paws SDK (DBI Interface)
Home Page: https://dyfanjones.github.io/noctua/
License: Other
Hello Dyfan,
I am running RStudio on an Amazon Linux EC2 instance, and I am getting this error message when attempting to install the noctua package.
I am currently running this analysis through this link (http://localhost:8787/).
install.packages("noctua")
Installing package into ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
also installing the dependencies ‘curl’, ‘openssl’, ‘httr’, ‘xml2’, ‘paws.common’, ‘paws.compute’, ‘paws.storage’, ‘paws.database’, ‘paws.networking’, ‘paws.management’, ‘paws.machine.learning’, ‘paws.analytics’, ‘paws.security.identity’, ‘paws.application.integration’, ‘paws.cost.management’, ‘paws.customer.engagement’, ‘paws’
trying URL 'https://cran.rstudio.com/src/contrib/curl_4.3.tar.gz'
Content type 'application/x-gzip' length 673779 bytes (657 KB)
==================================================
downloaded 657 KB
trying URL 'https://cran.rstudio.com/src/contrib/openssl_1.4.2.tar.gz'
Content type 'application/x-gzip' length 1204168 bytes (1.1 MB)
==================================================
downloaded 1.1 MB
trying URL 'https://cran.rstudio.com/src/contrib/httr_1.4.2.tar.gz'
Content type 'application/x-gzip' length 159950 bytes (156 KB)
==================================================
downloaded 156 KB
trying URL 'https://cran.rstudio.com/src/contrib/xml2_1.3.2.tar.gz'
Content type 'application/x-gzip' length 271876 bytes (265 KB)
==================================================
downloaded 265 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.common_0.3.3.tar.gz'
Content type 'application/x-gzip' length 63406 bytes (61 KB)
==================================================
downloaded 61 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.compute_0.1.9.tar.gz'
Content type 'application/x-gzip' length 864214 bytes (843 KB)
==================================================
downloaded 843 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.storage_0.1.9.tar.gz'
Content type 'application/x-gzip' length 385287 bytes (376 KB)
==================================================
downloaded 376 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.database_0.1.9.tar.gz'
Content type 'application/x-gzip' length 628403 bytes (613 KB)
==================================================
downloaded 613 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.networking_0.1.9.tar.gz'
Content type 'application/x-gzip' length 475508 bytes (464 KB)
==================================================
downloaded 464 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.management_0.1.9.tar.gz'
Content type 'application/x-gzip' length 842090 bytes (822 KB)
==================================================
downloaded 822 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.machine.learning_0.1.9.tar.gz'
Content type 'application/x-gzip' length 424388 bytes (414 KB)
==================================================
downloaded 414 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.analytics_0.1.9.tar.gz'
Content type 'application/x-gzip' length 465442 bytes (454 KB)
==================================================
downloaded 454 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.security.identity_0.1.9.tar.gz'
Content type 'application/x-gzip' length 887814 bytes (867 KB)
==================================================
downloaded 867 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.application.integration_0.1.9.tar.gz'
Content type 'application/x-gzip' length 178463 bytes (174 KB)
==================================================
downloaded 174 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.cost.management_0.1.9.tar.gz'
Content type 'application/x-gzip' length 66152 bytes (64 KB)
==================================================
downloaded 64 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws.customer.engagement_0.1.9.tar.gz'
Content type 'application/x-gzip' length 202482 bytes (197 KB)
==================================================
downloaded 197 KB
trying URL 'https://cran.rstudio.com/src/contrib/paws_0.1.9.tar.gz'
Content type 'application/x-gzip' length 360692 bytes (352 KB)
==================================================
downloaded 352 KB
trying URL 'https://cran.rstudio.com/src/contrib/noctua_1.8.0.tar.gz'
Content type 'application/x-gzip' length 381011 bytes (372 KB)
==================================================
downloaded 372 KB
* installing *source* package ‘curl’ ...
** package ‘curl’ successfully unpacked and MD5 sums checked
Package libcurl was not found in the pkg-config search path.
Perhaps you should add the directory containing `libcurl.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libcurl' found
Package libcurl was not found in the pkg-config search path.
Perhaps you should add the directory containing `libcurl.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libcurl' found
Using PKG_CFLAGS=
Using PKG_LIBS=-lcurl
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libcurl was not found. Try installing:
* deb: libcurl4-openssl-dev (Debian, Ubuntu, etc)
* rpm: libcurl-devel (Fedora, CentOS, RHEL)
* csw: libcurl_dev (Solaris)
If libcurl is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libcurl.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘curl’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/curl’
Warning in install.packages :
installation of package ‘curl’ had non-zero exit status
* installing *source* package ‘openssl’ ...
** package ‘openssl’ successfully unpacked and MD5 sums checked
Using PKG_CFLAGS=
--------------------------- [ANTICONF] --------------------------------
Configuration failed because openssl was not found. Try installing:
* deb: libssl-dev (Debian, Ubuntu, etc)
* rpm: openssl-devel (Fedora, CentOS, RHEL)
* csw: libssl_dev (Solaris)
* brew: [email protected] (Mac OSX)
If openssl is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a openssl.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
-------------------------- [ERROR MESSAGE] ---------------------------
tools/version.c:1:10: fatal error: openssl/opensslv.h: No such file or directory
#include <openssl/opensslv.h>
^~~~~~~~~~~~~~~~~~~~
compilation terminated.
--------------------------------------------------------------------
ERROR: configuration failed for package ‘openssl’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/openssl’
Warning in install.packages :
installation of package ‘openssl’ had non-zero exit status
* installing *source* package ‘xml2’ ...
** package ‘xml2’ successfully unpacked and MD5 sums checked
Package libxml-2.0 was not found in the pkg-config search path.
Perhaps you should add the directory containing `libxml-2.0.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libxml-2.0' found
Package libxml-2.0 was not found in the pkg-config search path.
Perhaps you should add the directory containing `libxml-2.0.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libxml-2.0' found
Using PKG_CFLAGS=
Using PKG_LIBS=-lxml2
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libxml-2.0 was not found. Try installing:
* deb: libxml2-dev (Debian, Ubuntu, etc)
* rpm: libxml2-devel (Fedora, CentOS, RHEL)
* csw: libxml2_dev (Solaris)
If libxml-2.0 is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libxml-2.0.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘xml2’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/xml2’
Warning in install.packages :
installation of package ‘xml2’ had non-zero exit status
ERROR: dependencies ‘curl’, ‘openssl’ are not available for package ‘httr’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/httr’
Warning in install.packages :
installation of package ‘httr’ had non-zero exit status
ERROR: dependencies ‘httr’, ‘xml2’ are not available for package ‘paws.common’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.common’
Warning in install.packages :
installation of package ‘paws.common’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.compute’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.compute’
Warning in install.packages :
installation of package ‘paws.compute’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.storage’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.storage’
Warning in install.packages :
installation of package ‘paws.storage’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.database’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.database’
Warning in install.packages :
installation of package ‘paws.database’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.networking’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.networking’
Warning in install.packages :
installation of package ‘paws.networking’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.management’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.management’
Warning in install.packages :
installation of package ‘paws.management’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.machine.learning’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.machine.learning’
Warning in install.packages :
installation of package ‘paws.machine.learning’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.analytics’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.analytics’
Warning in install.packages :
installation of package ‘paws.analytics’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.security.identity’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.security.identity’
Warning in install.packages :
installation of package ‘paws.security.identity’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.application.integration’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.application.integration’
Warning in install.packages :
installation of package ‘paws.application.integration’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.cost.management’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.cost.management’
Warning in install.packages :
installation of package ‘paws.cost.management’ had non-zero exit status
ERROR: dependency ‘paws.common’ is not available for package ‘paws.customer.engagement’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws.customer.engagement’
Warning in install.packages :
installation of package ‘paws.customer.engagement’ had non-zero exit status
ERROR: dependencies ‘paws.compute’, ‘paws.storage’, ‘paws.database’, ‘paws.networking’, ‘paws.management’, ‘paws.machine.learning’, ‘paws.analytics’, ‘paws.security.identity’, ‘paws.application.integration’, ‘paws.cost.management’, ‘paws.customer.engagement’ are not available for package ‘paws’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/paws’
Warning in install.packages :
installation of package ‘paws’ had non-zero exit status
ERROR: dependency ‘paws’ is not available for package ‘noctua’
* removing ‘/home/ec2-user/R/x86_64-redhat-linux-gnu-library/3.4/noctua’
Warning in install.packages :
installation of package ‘noctua’ had non-zero exit status
My apologies for the long error message; I wanted to provide as much context as possible.
I am fairly new to the tech industry and am wondering if this is a noctua-specific problem?
If so, any advice is greatly appreciated.
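This isn't noctua-specific: every failure above is a missing system library. One way to confirm is to ask pkg-config directly; a sketch, assuming a RHEL-family system such as Amazon Linux (package names differ on other distros):

```shell
# Check whether the development headers that curl, openssl and xml2 need
# are visible to pkg-config.
if command -v pkg-config >/dev/null 2>&1; then
  for pkg in libcurl openssl libxml-2.0; do
    if pkg-config --exists "$pkg"; then
      echo "$pkg: found"
    else
      echo "$pkg: MISSING (install the matching -devel package)"
    fi
  done
else
  echo "pkg-config is not installed"
fi
# The usual fix on Amazon Linux / RHEL (run with sudo outside this check):
# sudo yum install -y libcurl-devel openssl-devel libxml2-devel
```

After installing the missing -devel packages, re-run install.packages("noctua").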
library(noctua)
con <- dbConnect(noctua::athena())
res <- dbExecute(con, "select * from sampledb.elb_logs")
result <- noctua:::poll(res)
result_info <- noctua:::split_s3_uri(result$QueryExecution$ResultConfiguration$OutputLocation)
obj <- res@connection@ptr$S3$get_object(Bucket = result_info$bucket, Key = result_info$key)
# method_1: read the raw S3 body straight into data.table
data.table::fread(readBin(obj$Body, what = "character"))
# method_2
writeBin(obj$Body, con = "test.csv")
data.table::fread("test.csv")
Data return size: 19 X 1,356,206
Data memory size: 335.3 Mb
The latest noctua and RAthena releases caused an error in the "CRAN Package Check Results for Package noctua".
The error comes from the suggested package arrow failing to install:
Error: write_parquet requires the arrow package, please install it first and try again
As arrow is only a suggested dependency, all unit tests using arrow should first check that arrow is correctly installed before proceeding (per CRAN's Writing R Extensions). This was an oversight in the latest unit tests.
noctua and RAthena will be re-released before 2020-11-25.
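A minimal sketch of the kind of guard the unit tests need, using testthat's skip_if_not_installed(); the test name and body are illustrative, not noctua's actual tests:

```r
library(testthat)

test_that("parquet upload works", {
  # Skip cleanly on CRAN machines where the suggested arrow package
  # is not installed, instead of erroring.
  skip_if_not_installed("arrow")
  tmp <- tempfile(fileext = ".parquet")
  arrow::write_parquet(data.frame(x = 1:3), tmp)
  expect_true(file.exists(tmp))
})
```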
It appears that some users are running into throttling errors. A user has found a possible solution: https://community.rstudio.com/t/use-exponential-back-off-with-dbi/56870. To make life easier for users, this solution should be built into the package.
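A sketch of the linked exponential back-off idea, assuming a DBI connection `con`; retry_query and the error matching are illustrative, not noctua's built-in implementation:

```r
# Retry a query with exponential back-off plus jitter when AWS throttles.
retry_query <- function(con, sql, max_retries = 5) {
  for (i in seq_len(max_retries)) {
    result <- tryCatch(DBI::dbGetQuery(con, sql), error = identity)
    if (!inherits(result, "error")) return(result)
    # Only retry throttling errors; re-raise everything else.
    if (!grepl("ThrottlingException", conditionMessage(result))) stop(result)
    Sys.sleep(2^i + stats::runif(1))
  }
  stop("Rate exceeded: giving up after ", max_retries, " retries")
}
```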
Similar to RAthena, noctua's next release will be slightly delayed due to changes in \donttest.
Please see DyfanJones/RAthena#108 for full details
For some reason codecov has dropped from 83% to 3%. The code changes shouldn't have caused this issue. Plus it seems codecov isn't picking up the new unit tests in #108.
Hey Dyfan,
I apologize for coming over here with yet another issue.
dbConnect() returns the following error:
Error in get_region(cfgs$credentials$profile) : No region provided
library(noctua)
library(DBI)
con <- dbConnect(noctua::athena(),
                 aws_access_key_id = 'MyAccessKey',
                 aws_secret_access_key = 'MySecretKey',
                 s3_staging_dir = 's3://my-staging-dir-us-west-2/',
                 region_name = 'us-west-2')
Currently erroring with the R-hub check.
https://builder.r-hub.io/status/noctua_1.10.0.tar.gz-f2324215e0b840b1ae002c4fead75766
Hi Dyfan, bumped into another bug.
If a glue table definition has a trailing '/' in the s3 path, dbRemoveTable doesn't work. (note, such a table definition is allowed in glue)
I think the reason is on this line and the pasting of the extra '/' at the end of the s3_path$key:
Line 608 in e3ce929
For example, if we have a path like "s3://mybucket/mypath/mysubpath/", this will produce a key "s3://mybucket/mypath/mysubpath//" and things will break down.
Also, if a table definition points directly to a single file, as in "s3://mybucket/mypath/mysubpath/myfile.csv", dbRemoveTable will not work either.
As a fix, I would propose omitting the extra '/'.
All the best and happy holidays,
Ossi
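A sketch of the proposed fix; the helper name is an assumption, not noctua's actual code:

```r
# Strip any trailing slashes from an S3 key before building the delete
# prefix, so "mypath/mysubpath/" and "mypath/mysubpath" behave the same.
normalise_key <- function(key) sub("/+$", "", key)

normalise_key("mypath/mysubpath/")  # "mypath/mysubpath"
normalise_key("mypath/mysubpath")   # unchanged
```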
Hi @DyfanJones, I'm sure you're aware of this, but anyway, the functions referenced in DyfanJones/RAthena#121 should be implemented in noctua too.
Hello Dyfan,
I raised this question on stackoverflow.com as @datascientist2117; thank you for your response.
Here is the code I currently have.
library(tidyverse)
library(dplyr)
library(dbplyr)
library(odbc)
library(DBI)
library(RAthena)
library(rJava)
odbcListDrivers()
sort((unique(odbcListDrivers()[[1]])))
con <- DBI::dbConnect(
odbc::odbc(),
Driver = "[Simba Athena ODBC Driver]",
S3OutputLocation = "[s3://bucket-folder/]",
AwsRegion = "[region]",
AuthenticationType = "IAM Credentials",
Schema = "[schema]",
UID = rstudioapi::askForPassword("AWS Access Key"),
PWD = rstudioapi::askForPassword("AWS Secret Key"))
All code above is exactly the same except for the omitted values in S3OutputLocation, AwsRegion, and Schema.
I am currently using Simba Athena ODBC Driver Version 1.1.2
I hope this information is sufficient; please let me know if there is anything else I can provide!
The 'statement' argument to dbGetQuery and dbSendQuery methods isn't exposed in older versions of R, resulting in an error that it cannot find the 'statement' object.
First, install version 2.3.0 of the noctua package under R version 3.4.4. Then take a look at
library(noctua)
showMethods("dbGetQuery", includeDefs=TRUE)
showMethods("dbSendQuery", includeDefs=TRUE)
It will say:
Function: dbGetQuery (package DBI)
conn="AthenaConnection", statement="character"
function (conn, ...)
{
.local <- function (conn, statement = NULL, statistics = FALSE,
unload = FALSE, ...)
{
con_error_msg(conn, msg = "Connection already closed.")
stopifnot(is.logical(statistics), is.logical(unload))
rs <- dbSendQuery(conn, statement = statement, unload = unload)
if (statistics)
print(dbStatistics(rs))
out <- dbFetch(res = rs, n = -1, ...)
dbClearResult(rs)
return(out)
}
.local(conn, statement, ...)
}
and
Function: dbSendQuery (package DBI)
conn="AthenaConnection", statement="character"
function (conn, ...)
{
.local <- function (conn, statement = NULL, unload = FALSE,
...)
{
con_error_msg(conn, msg = "Connection already closed.")
stopifnot(is.logical(unload))
res <- AthenaResult(conn = conn, statement = statement,
s3_staging_dir = conn@info$s3_staging, unload = unload)
return(res)
}
.local(conn, statement, ...)
}
Whereas, under R version 3.5.0 (and later), the methods do expose the statement argument, and there is no error when using them. So naturally I am wondering if this is due to R-core correcting a bug going from version 3.4.4 to 3.5.0 of R, or could it be considered a bug in noctua itself? If the former, should noctua require R version >= 3.5.0? In any case, curious to hear your thoughts or suggestions.
devtools::session_info()
#> output
Session info -------------------------------------------------------------------
setting value
version R version 3.4.4 (2018-03-15)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
tz America/New_York
date 2021-11-04
Packages -----------------------------------------------------------------------
package * version date source
data.table 1.12.6 2019-10-18 CRAN (R 3.4.4)
DBI 1.1.0 2019-12-15 CRAN (R 3.4.4)
devtools 1.12.0 2016-12-05 CRAN (R 3.4.0)
digest 0.6.27 2020-10-24 CRAN (R 3.4.4)
memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
noctua * 2.3.0 2021-10-26 CRAN (R 3.4.4)
paws 0.1.12 2021-09-03 CRAN (R 3.4.1)
rstudioapi 0.11 2020-02-07 CRAN (R 3.4.4)
withr 2.1.2 2018-03-15 CRAN (R 3.4.4)
There seems to be no implementation of grepl in the sql_translator for use with dbplyr.
For example:
> dplyr::tbl(db_connection, "concept") %>% filter(grepl('search_str', concept_name)) %>% show_query()
<SQL>
SELECT *
FROM "concept"
WHERE (grepl('search_str', "concept_name"))
Instead I expected the output to be something like this:
<SQL>
SELECT *
FROM "concept"
WHERE concept_name like '%search_str%'
Other backends for dbplyr include support for grepl, e.g. here is the postgres implementation.
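A sketch of what the translation could look like, mapping grepl() onto Athena/Presto's regexp_like(); the function name and the integration point in noctua's translator are assumptions:

```r
library(dbplyr)

# Hypothetical translator entry: grepl(pattern, x) would become
# regexp_like(x, pattern) in the generated SQL.
athena_grepl <- function(pattern, x) {
  sql_expr(regexp_like(!!x, !!pattern))
}
```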
Please look at DyfanJones/RAthena#129 for extra details.
Under noctua 1.10.0, going to paws.database 0.1.10 seems to cause curl timeouts when using the dplyr interface to queries.
Generating a clean reprex is tricky, but I have a local query managed under renv that reliably replicates the problem. Here is a redacted query (hitting an Apache log store in parquet format) that demonstrates the problem:
> con <- dbConnect(noctua::athena(),
profile_name = "REDACTED",
region = "us-east-2",
s3_staging_dir = 's3://REDACTED',
work_group = "REDACTED")
> query <- str_glue("
SELECT date_parse(timestamp, '%d/%b/%Y:%H:%i:%s +0000') AS timestamp,
verb, request, response, CAST(bytes as integer) AS bytes, referrer, agent
FROM REDACTED")
> dat <- tbl(con, sql(query)) %>% collect()
Info: (Data scanned: 14.09 MB)
Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [REDACTED.s3.us-east-2.amazonaws.com] Operation timed out after 10000 milliseconds with 119006796 out of 316782524 bytes received
Request failed. Retrying in 0.7 seconds...
Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [REDACTED.s3.us-east-2.amazonaws.com] Operation timed out after 10000 milliseconds with 112822860 out of 316782524 bytes received
Request failed. Retrying in 2.2 seconds...
If left to run, the query goes through exponential back-off and eventually fails.
Running the same query under paws.database 0.1.9 works without issue. noctua 1.9.1 also hits this problem, so this seems to be something in the interface with paws (or maybe even a problem with paws itself).
Really appreciate the package. If there's a better way to help debug this, please let me know!
Development of this will be tracked in RAthena ticket: #26
Thanks for the very nifty package!
paws.common recently added support for credential_process in .aws/config files, allowing role assumption, nested profiles, and external credentials all to work seamlessly together. It looks like noctua constructs the config object itself rather than relying on paws, though I'm not clear on the full object model of paws. Is it practical for noctua to hand off all authentication work to paws, allowing a user to take advantage of this greatly improved credential process?
One speculative idea: would it make sense to allow a user to optionally skip reading in the glue catalog when running dbConnect in rstudio?
Of course it is often helpful to have, for example, the data catalog available for browsing in the connections pane in Rstudio.
However, when the catalog becomes very large, it can take pretty long to establish the connection (a minute or so in my case, I think some lake formation shares make things worse), and often one doesn't need the catalog for anything. In interactive work this kind of a delay can be a bit annoying, but obviously nothing too dangerous.
So, what do you think, would this be doable, and would it make sense to implement something like this? I'm unfortunately pretty clueless with the interaction with rstudio, so not sure what this would require.
When uploading data frames to AWS Athena, noctua only splits compressed delimited files, due to DyfanJones/RAthena#36. Would other file.types benefit from similar file splitting?
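A sketch of the file-splitting idea for other file types: break a data.frame into bounded chunks before writing and uploading each part. The chunk size is illustrative, not noctua's actual default:

```r
# Split a data.frame into chunks of at most max_rows rows each.
split_frame <- function(df, max_rows = 1000000L) {
  split(df, (seq_len(nrow(df)) - 1L) %/% max_rows)
}

length(split_frame(data.frame(x = 1:25), max_rows = 10L))  # 3 chunks
```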
The next release is planned for 14/05/2020. This is to prevent the package being released too frequently to CRAN.
If the new features are required before the planned release, please let me know and the release date can be changed.
Hi Dyfan,
it appears that noctua doesn't know how to interpret a query result with the type 'real'.
For example:
res <- DBI::dbSendStatement(cona, "create table temp.realexample as (select cast(1 as real) as realcol)")
# Wait until table is created
dplyr::tbl(cona, dbplyr::in_schema("temp", "realexample"))
produces the error: "Error: Unknown shortcut: real"
Should the AthenaToRDataType function include a mapping for 'real' (instead of 'float')?
Line 23 in 3b9d5d5
Line 47 in 3b9d5d5
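A sketch of the suggested mapping: treat Athena's 'real' like 'float', i.e. convert it to an R double. The lookup table below is illustrative, not noctua's actual AthenaToRDataType implementation:

```r
# Athena type -> R type, with the proposed 'real' entry added.
athena_to_r <- c(
  boolean = "logical",
  integer = "integer",
  float   = "double",
  real    = "double",   # proposed addition
  varchar = "character"
)
athena_to_r[["real"]]  # "double"
```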
First, noctua is excellent and very impressive!
Just a slight improvement: the DESCRIPTION file should be modified to suggest or require uuid version >= 0.1-4. Earlier versions of uuid do not allow the n argument of uuid::UUIDgenerate() and may cause noctua::dbWriteTable to fail.
Best regards,
Michael
Support GZIP compression:
https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html
Use the GZIP compression in Athena for querying Amazon Kinesis Data Firehose logs. Athena and Amazon Kinesis Data Firehose each support different versions of SNAPPY, so GZIP is the only compatible format.
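A sketch of producing a GZIP-compressed CSV from R with data.table (>= 1.12.4), which Athena can query once the table DDL points at the file; the file name and columns are illustrative:

```r
library(data.table)

# fwrite gained a compress argument in 1.12.4; "gzip" writes a .csv.gz
# directly, no external gzip step needed.
dt <- data.table(ts = Sys.time(), status = 200L)
fwrite(dt, "firehose_logs.csv.gz", compress = "gzip")
```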
noctua::athena() does connect to the database, BUT only to the database with the default profile in the AWS .config/.credentials files.
When I run this
con <- dbConnect(noctua::athena(),
profile_name = "default",
s3_staging_dir = "s3://aws-athena-query-results-for-default-profile-us-east-1")
or
con <- dbConnect(noctua::athena(),
s3_staging_dir = "s3://aws-athena-query-results-for-default-profile-us-east-1")
It does connect to the correct AWS database under the default profile in the AWS .config/.credentials file and I'm able to run queries against it.
$s3_staging
[1] "s3://aws-athena-query-results-for-default-profile-us-east-1"
$dbms.name
[1] "default"
$work_group
[1] "primary"
$poll_interval
NULL
$encryption_option
NULL
$kms_key
NULL
$expiration
NULL
$region_name
[1] "us-east-1"
$paws
[1] "0.1.7"
$noctua
[1] "1.6.0"
However, when I run this,
con_2 <- dbConnect(noctua::athena(),
profile_name = "other-profile-that-is-not-default",
s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/",
region = "us-east-1")
$profile_name
[1] "other-profile-that-is-not-default"
$s3_staging
[1] "s3://aws-athena-query-results-for-other-profile-us-east-1"
$dbms.name
[1] "default"
$work_group
[1] "primary"
$poll_interval
NULL
$encryption_option
NULL
$kms_key
NULL
$expiration
NULL
$region_name
[1] "us-east-1"
$paws
[1] "0.1.7"
$noctua
[1] "1.6.0"
then it still connects, but again to the default database. How can I successfully connect to a non-default profile? It looks like dbGetInfo(con_2) has the correct information, but for whatever reason it does not connect to the correct database and I see the exact same tables as for con.
This is what the .config file looks like:
[default]
region = us-east-1
output = text
[profile other-profile-that-is-not-default]
region = us-east-1
saml.session_duration = 43200
saml.idp_arn = arn:aws:iam::XXXX:saml-provider/g-suite
saml.role_arn = arn:aws:iam::XXXXXXX:role/other-profile_that-is-not-default
and this is what the .credentials file looks like:
[default]
aws_access_key_id = XXXX
aws_secret_access_key = XXXXXXX
[other-profile-that-is-not-default]
aws_access_key_id = XXXXX
aws_secret_access_key = XXXX
aws_session_token = XXXXX
aws_security_token = XXXXXX
Any help would be greatly appreciated especially since I have no idea how to set up the .config/.credentials file differently.
Currently noctua doesn't support the AWS Athena data types [array, map, struct].
The development of this will be covered by RAthena ticket: #25
When 'connecting' to a table with dplyr's tbl()-function, a query is sent to Athena with a 'WHERE 0 == 1' clause for getting the column names. This query is generated by db_query_fields.DBIconnection.
The thing is, querying Athena can be slow at times, and a much faster response could be obtained from Glue.
This, however, works only if a direct connection to a table is made, like tbl(con, in_schema("schema", "table")). If one has a subquery in tbl() (e.g., tbl(con, sql("select * from table where b=c"))), Glue cannot help here.
To handle this, one could define a method, such as:
db_query_fields.AthenaConnection <- function(con, sql, ...) {
  # Test if we have a subquery or a direct table definition
  is_direct <- grepl('^"?[a-å]+"?\\.?"?[a-å]+"?$', trimws(tolower(sql)))
  if (is_direct) { # If a direct definition, get the fields from Glue
    if (!dbIsValid(con)) stop("Connection already closed.", call. = FALSE)
    if (grepl("\\.", sql)) {
      dbms.name <- gsub("\\..*", "", sql)
      Table <- gsub(".*\\.", "", sql)
    } else {
      dbms.name <- con@info$dbms.name
      Table <- sql
    }
    output <- con@ptr$glue$get_table(
      DatabaseName = dbms.name,
      Name = Table)$Table$StorageDescriptor$Columns
    sapply(output, function(y) y$Name)
  } else { # If a subquery, query Athena for the fields
    sql <- sql_select(con, dplyr::sql("*"), sql_subquery(con, sql),
                      where = dplyr::sql("0 = 1"))
    qry <- dbSendQuery(con, sql)
    on.exit(dbClearResult(qry))
    res <- dbFetch(qry, 0)
    names(res)
  }
}
So basically test if we have a direct table definition or a subquery, and query Glue or Athena accordingly.
The weakest part of this would be the first regex for trying to see if we have a direct table def. The good thing is, that if the regex match returns FALSE, we revert to dplyr's default behaviour.
What do you think?
p.s. I have been trying noctua as opposed to RAthena for a few days and it really seems to work as a drop-in replacement. Really like the native 'all R' aspect of it!
| Build ID: | noctua_1.2.0.tar.gz-a51a35c2ada44db7b831b7f50f5f0697 |
| --- | --- |
| Platform: | Ubuntu Linux 16.04 LTS, R-release, GCC |
| Submitted: | 12 minutes 12.5 seconds ago |
| Build time: | 12 minutes 11 seconds |
ERROR: dependency ‘openssl’ is not available for package ‘httr’
Currently Travis CI has broken for noctua and RAthena. Instead of fixing the pipeline, will swap to GitHub Actions instead.
Prepare for release:
Submit to CRAN:
Wait for CRAN...
A user on RStudio Community is experiencing issues where dbWriteTable isn't repairing the table once data has been uploaded to AWS Athena (link). This could be down to dbExecute not fully reporting errors from AWS Athena and not notifying the user that the repair has failed.
Might need to explore using ALTER TABLE instead of MSCK REPAIR for general speed if the user has multiple partitions.
Note: dbExecute has been fixed in the latest PR #80 and will return AWS Athena errors. Will need to check what version the user is using and then debug.
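A sketch of the ALTER TABLE alternative: register just the new partition instead of rescanning everything with MSCK REPAIR TABLE. The database, table, partition key and S3 location below are all illustrative:

```r
# Add a single partition explicitly; much faster than MSCK REPAIR TABLE
# when only one partition changed. Assumes an open noctua connection `con`.
DBI::dbExecute(con, "
  ALTER TABLE mydb.mytable
  ADD IF NOT EXISTS PARTITION (dt = '2020-05-14')
  LOCATION 's3://mybucket/mytable/dt=2020-05-14/'")
```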
Hi Dyfan,
I am uploading data into s3 buckets using your package “noctua” for that purpose -- which is great. But there is one thing I have come across which, after endless testing, I have not yet managed to solve using this code snippet:
### Connect to database
con <- dbConnect(noctua::athena()
, aws_access_key_id = aws_access_key_id
, aws_secret_access_key = aws_secret_access_key
, aws_session_token = aws_session_token
, s3_staging_dir = 's3://bucket/path/to/folder/'
, region = 'us-east-1'
, schema = 'schema'
, work_group = 'abc_ef_gh_jklmnoqr'
)
### Upload to Athena
dbWriteTable(conn = con
, name = 'schema.aaa_bbb_cccccc'
, value = AbcDt
, overwrite = ifelse(r==1, TRUE, FALSE)
, append = ifelse(r==1, FALSE, TRUE)
, file.type = 'json'
, partition=c("ZEITSTEMPEL" = format(Sys.Date(), "%Y%m%d"))
, s3.location = 's3://bucket/path/to/folder/'
)
I am using this code in a loop. By doing so, I have noticed that for the first loop (r==1), the data can be queried in Athena but does not seem to be available or visible in s3. For the following loops (r>1), i.e. if the data is appended, the data is shown in s3, too.
Is this anything you can reproduce?
Many thanks in advance (and for the great work you have put into the package)
Christian
Currently waiting for #310. This is getting resolved by the rhub team and the community. Once it has been fixed, will run through rhub testing before releasing new features for noctua and RAthena on CRAN.
- dbConnect #69
- paste/paste0 defaulted to concat_ws
- data.table: due to fwrite supporting compression in (>= 1.12.4), the dependency needs to be altered to reflect this
- dbWriteTable: utilise existing s3 location when appending
- db_compute: to take into account schema and name from a string
- ThrottlingException: Rate exceeded #58
- paste/paste0 defaulted to concat_ws
- roxygen2 from 6.1.1 to 7.0.2
- pkgdown::site documentation with new roxygen2
- AWS_ATHENA_WORK_GROUP
- csv to tsv. This is to enable array and json to be pushed to Athena from R
- dbRemoveTable: to be able to delete Athena table s3 files
- sql_translate_env: integer misclassified
- NA when uploading to Athena

An RStudio Community issue was raised around how JSON DDLs are created for AWS Athena:
The confusion in how the DDL is created and what row format is accepted. This will mean an update to documentation is required.
Can other json ddl formats be support with json line?
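For reference, one row format Athena does accept for json-lines data is the OpenX JSON SerDe. The sketch below uses placeholder database, table, column, and s3 names, and assumes `con` is an open noctua connection.

```r
library(DBI)

# Sketch DDL for a json-lines table (one JSON object per line).
# Database, table, columns, and the s3 location are placeholders.
DBI::dbExecute(con, "
  CREATE EXTERNAL TABLE mydb.mytable (
    id int,
    payload string
  )
  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  LOCATION 's3://bucket/path/to/json/'
")
```

Documenting which SerDe the generated DDL uses would resolve much of the confusion described above.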
I don't know if you remember, but not too long ago we discovered an issue in paws.common together, in an issue that had to do with pulling profiles into the profile
parameter of dbConnect()
. Everything works fine even with AWS_PROFILE as an env variable. I've been working with the updated packages, and a new error prevents me from running a script on the server.
When I run a query, the data is pulled and written into an r
object (see test_results
below). However, I also get this error message:
Info: (Data scanned: 625.16 GB)
additional arguments ignored in warning()
Warning: AccessDenied (HTTP 403). Access Denied
Which is weird since the data are actually pulled and the connection to the database is successfully established. Any idea why this message is popping up?
As always, any insights would be great!
Example:
library(RJDBC)
library(glue)
library(tidyverse)
library(magrittr)
library(rlang)
library(noctua)
library(DBI)
library(paws.common)
library(paws)
con_3 <- dbConnect(noctua::athena(),
profile_name = "profile",
s3_staging_dir = "s3://aws-athena-query-results-XXXXXXXXXX-us-east-1/",
region = "us-east-1")
test_query <- glue_sql("select event_date, count(x) as x_count from logs.x group by 1 limit 10;", .con = con_3)
# make sure all results are returned from athena
setMethod("dbGetQuery", signature(conn = "JDBCConnection", statement = "character"),
          def = function(conn, statement, ...) {
  # use the method's own arguments rather than the global `con`/`test_query`
  r <- dbSendQuery(conn, statement)
  on.exit(.jcall(r@stat, "V", "close"))
  # Athena can only pull 999 rows at a time
  if (conn@jc %instanceof% "com.amazonaws.athena.jdbc.AthenaConnection") fetch(r, -1, 999)
  else fetch(r, -1)
})
test_results <- dbGetQuery(conn = con_3, statement = test_query)
> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.5.3 (2019-03-11)
os macOS 10.15.3
system x86_64, darwin15.6.0
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2020-06-12
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.5.2)
backports 1.1.7 2020-05-13 [1] CRAN (R 3.5.3)
bit 1.1-15.1 2020-01-14 [2] CRAN (R 3.5.2)
bit64 0.9-7 2017-05-08 [2] CRAN (R 3.5.0)
blob 1.2.1 2020-01-20 [2] CRAN (R 3.5.2)
broom 0.5.6 2020-04-20 [1] CRAN (R 3.5.3)
callr 3.4.3 2020-03-28 [1] CRAN (R 3.5.3)
cellranger 1.1.0 2016-07-27 [2] CRAN (R 3.5.0)
cli 2.0.2 2020-02-28 [1] CRAN (R 3.5.2)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.5.2)
crayon 1.3.4 2017-09-16 [2] CRAN (R 3.5.0)
curl 4.3 2019-12-02 [1] CRAN (R 3.5.2)
data.table 1.12.8 2019-12-09 [1] CRAN (R 3.5.2)
DBI * 1.1.0 2019-12-15 [2] CRAN (R 3.5.2)
dbplyr 1.4.4 2020-05-27 [1] CRAN (R 3.5.3)
desc 1.2.0 2018-05-01 [2] CRAN (R 3.5.0)
devtools 2.3.0 2020-04-10 [1] CRAN (R 3.5.3)
digest 0.6.25 2020-02-23 [1] CRAN (R 3.5.2)
dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.5.2)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.5.3)
fansi 0.4.1 2020-01-08 [2] CRAN (R 3.5.2)
forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.5.2)
fs 1.4.1 2020-04-04 [1] CRAN (R 3.5.3)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.5.0)
ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.5.2)
glue * 1.4.1 2020-05-13 [1] CRAN (R 3.5.3)
gt * 0.2.1 2020-05-23 [1] CRAN (R 3.5.3)
gtable 0.3.0 2019-03-25 [2] CRAN (R 3.5.3)
haven 2.3.0 2020-05-24 [1] CRAN (R 3.5.3)
hms 0.5.3 2020-01-08 [1] CRAN (R 3.5.2)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.5.2)
ini 0.3.1 2018-05-20 [2] CRAN (R 3.5.0)
jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.5.2)
knitr 1.28 2020-02-06 [1] CRAN (R 3.5.2)
lattice 0.20-38 2018-11-04 [2] CRAN (R 3.5.3)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.5.2)
lubridate 1.7.8 2020-04-06 [1] CRAN (R 3.5.3)
magrittr * 1.5 2014-11-22 [2] CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 [2] CRAN (R 3.5.0)
modelr 0.1.8 2020-05-19 [1] CRAN (R 3.5.3)
munsell 0.5.0 2018-06-12 [2] CRAN (R 3.5.0)
nlme 3.1-143 2019-12-10 [2] CRAN (R 3.5.2)
noctua * 1.7.0 2020-05-14 [1] CRAN (R 3.5.3)
packrat 0.5.0 2018-11-14 [2] CRAN (R 3.5.0)
paws * 0.1.8 2020-04-28 [1] CRAN (R 3.5.3)
paws.analytics 0.1.8 2020-04-26 [1] CRAN (R 3.5.3)
paws.common * 0.3.1 2020-04-26 [1] CRAN (R 3.5.3)
paws.storage 0.1.8 2020-04-27 [1] CRAN (R 3.5.3)
pillar 1.4.3 2019-12-20 [1] CRAN (R 3.5.2)
pkgbuild 1.0.6 2019-10-09 [2] CRAN (R 3.5.2)
pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 3.5.2)
pkgload 1.0.2 2018-10-29 [2] CRAN (R 3.5.0)
prettyunits 1.1.1 2020-01-24 [2] CRAN (R 3.5.2)
processx 3.4.2 2020-02-09 [1] CRAN (R 3.5.2)
ps 1.3.2 2020-02-13 [1] CRAN (R 3.5.2)
purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.5.3)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.5.2)
Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 3.5.3)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.5.0)
readxl 1.3.1 2019-03-13 [1] CRAN (R 3.5.2)
remotes 2.1.1 2020-02-15 [1] CRAN (R 3.5.2)
reprex 0.3.0 2019-05-16 [2] CRAN (R 3.5.2)
rJava * 0.9-11 2019-03-29 [2] CRAN (R 3.5.2)
RJDBC * 0.2-7.1 2018-04-16 [2] CRAN (R 3.5.0)
rlang * 0.4.6 2020-05-02 [1] CRAN (R 3.5.3)
RPostgreSQL 0.6-2 2017-06-24 [2] CRAN (R 3.5.0)
rprojroot 1.3-2 2018-01-03 [2] CRAN (R 3.5.0)
rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.5.2)
rvest 0.3.5 2019-11-08 [1] CRAN (R 3.5.2)
scales 1.1.0 2019-11-18 [2] CRAN (R 3.5.2)
sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 3.5.0)
stringi 1.4.6 2020-02-17 [1] CRAN (R 3.5.2)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.5.2)
testthat 2.3.2 2020-03-02 [1] CRAN (R 3.5.2)
tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.5.3)
tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.5.2)
tidyselect 1.0.0 2020-01-27 [2] CRAN (R 3.5.3)
tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.5.2)
usethis 1.6.1 2020-04-29 [2] CRAN (R 3.5.3)
vctrs 0.3.0 2020-05-11 [1] CRAN (R 3.5.3)
withr 2.2.0 2020-04-20 [1] CRAN (R 3.5.3)
xfun 0.12 2020-01-13 [1] CRAN (R 3.5.2)
xml2 1.3.2 2020-04-23 [1] CRAN (R 3.5.3)
The RAthena unit tests will be a good starting point.
Currently noctua just collapses lists using paste
. This is incorrect, as Athena won't be able to interpret the result. For example:
paste(list(list("var3" = 1:3, "var4" = list("var5" = letters))), collapse = "|")
'list(var3 = 1:3, var4 = list(var5 = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z")))'
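One alternative Athena can interpret is serialising nested lists to JSON strings (for example with jsonlite) instead of collapsing the deparsed R structure; Athena's json functions can then parse the column. A sketch:

```r
library(jsonlite)

# Serialise the nested list to a JSON string rather than paste()-ing the
# deparsed R representation, which Athena cannot parse.
x <- list("var3" = 1:3, "var4" = list("var5" = letters[1:3]))
as.character(jsonlite::toJSON(x))
# '{"var3":[1,2,3],"var4":{"var5":["a","b","c"]}}'
```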
Hello Dyfan,
Lately I had problems with RAthena
, so I turned to noctua
instead, and I noticed some peculiar, almost random, issues.
When running the code below I usually get Error: InvalidRequestException (HTTP 400). Idempotent parameters do not match
or sometimes Error: InvalidRequestException (HTTP 400). Could not find results
. However, about 1 in 5 times I retrieve the desired result; it almost seems random. The issue occurs in both the R console and in RStudio. As a side note: if I use, for example, library(aws.s3); bucketlist()
, then it always retrieves the desired result, so I know that I'm assuming the role correctly.
Example:
library(DBI)
library(noctua)
noctua::assume_role(
role_arn = "arn:aws:iam::666666666:role/ARole",
duration_seconds = 3600L,
region_name = "eu-west-1",
set_env = TRUE)
con <- dbConnect(noctua::athena(), s3_staging_dir = "s3://aws-athena-query-results-eu-west-1-328396111111")
dbGetQuery(con, "SELECT row_names, mpg, cyl, disp FROM adhoc.mtcars limit 10")
Screenshot:
Traceback:
dbGetQuery(con, "SELECT row_names, mpg, cyl, disp FROM adhoc.mtcars limit 10")
Error: InvalidRequestException (HTTP 400). Idempotent parameters do not match
> traceback()
8: stop(resp)
7: retry_api_call(response <- conn@ptr$Athena$start_query_execution(QueryString = statement,
QueryExecutionContext = list(Database = conn@info$dbms.name),
ResultConfiguration = ResultConfiguration(conn), WorkGroup = conn@info$work_group))
6: AthenaResult(conn = conn, statement = statement, s3_staging_dir = s3_staging_dir)
5: dbSendQuery(conn, statement = statement)
4: dbSendQuery(conn, statement = statement)
3: .local(conn, statement, ...)
2: dbGetQuery(con, "SELECT * FROM adhoc.mtcars limit 10")
1: dbGetQuery(con, "SELECT * FROM adhoc.mtcars limit 10")
dbGetQuery(con, "SELECT row_names, mpg, cyl, disp FROM adhoc.mtcars limit 10")
Error: InvalidRequestException (HTTP 400). Could not find results
> traceback()
7: stop(resp)
6: retry_api_call(result_class <- res@connection@ptr$Athena$get_query_results(QueryExecutionId = res@info$QueryExecutionId,
MaxResults = as.integer(1))$ResultSet$ResultSetMetadata$ColumnInfo)
5: dbFetch(res = rs, n = -1, ...)
4: dbFetch(res = rs, n = -1, ...)
3: .local(conn, statement, ...)
2: dbGetQuery(con, "SELECT * FROM adhoc.mtcars limit 10")
1: dbGetQuery(con, "SELECT * FROM adhoc.mtcars limit 10")
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] noctua_1.8.0 DBI_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 paws.analytics_0.1.9 digest_0.6.20 zeallot_0.1.0 crayon_1.3.4
[6] paws.common_0.3.3 R6_2.4.0 jsonlite_1.6 backports_1.1.4 httr_1.4.0
[11] pillar_1.4.2 rlang_0.4.0 curl_4.0 data.table_1.13.0 ini_0.3.1
[16] paws_0.1.9 xml2_1.2.1 vctrs_0.2.0 paws.security.identity_0.1.9 readr_1.3.1
[21] paws.storage_0.1.9 hms_0.5.0 compiler_3.6.1 pkgconfig_2.0.2 tibble_2.1.3
Hi,
I have been trying to connect to Athena from Sagemaker. I have been successful, but in the case of slightly bigger data, when I run the following code:
install.packages('noctua')
install.packages('DBI')
con <- DBI::dbConnect(noctua::athena(),
s3_staging_dir = "s3://aws-athena...",
region_name = 'us-west-2')
res <- DBI::dbExecute(con, 'SELECT *
FROM db.table
LIMIT 10000000')
df <- DBI::dbFetch(res)
I get the following error on Sagemaker:
Error: ThrottlingException: Rate exceeded
Traceback:
At first I thought that I needed more RAM, so I resized my instance to 64 GB (the previous one had 16 GB), but the error still persists.
I am fairly new to AWS and its intricacies. While I was looking for a way to overcome this issue I stumbled upon the following article:
Looking at the article, I guess my question would be: is there a way to use some sort of rate-limiting function, if it is supported by paws?
Ps. I hope I wrote this in the correct format and place.
Any help is greatly appreciated,
Ricardo
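As a client-side workaround, a small exponential-backoff wrapper can retry the throttled call. The helper below is purely illustrative and not part of noctua (depending on your version, noctua_options() may also expose retry settings):

```r
# Illustrative exponential-backoff helper (not part of noctua): retries a
# function while the error message contains "ThrottlingException".
with_backoff <- function(f, max_tries = 5, base_wait = 1) {
  for (i in seq_len(max_tries)) {
    out <- tryCatch(f(), error = function(e) {
      if (!grepl("ThrottlingException", conditionMessage(e)) || i == max_tries)
        stop(e)
      # wait 1, 2, 4, ... seconds plus random jitter before retrying
      Sys.sleep(base_wait * 2^(i - 1) + stats::runif(1))
      NULL
    })
    if (!is.null(out)) return(out)
  }
}

df <- with_backoff(function() DBI::dbFetch(res))
```

Passing a function (rather than a bare expression) ensures the call is actually re-executed on each retry instead of R reusing the first forced result.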
Hi Dyfan,
a situation came up when upgrading to dbplyr 2.0.0:
In the new version, in_schema() returns an object of class "dbplyr_schema" (previously "ident"), and the database and table names are automatically quoted.
As an example:
dbplyr < 2.0.0:
> in_schema("database", "table")
<IDENT> database.table
> class(in_schema("database", "table"))
[1] "ident_q" "ident" "character"
dbplyr 2.0.0:
> in_schema("database", "table")
<SCHEMA> `database`.`table`
> class(in_schema("database", "table"))
[1] "dbplyr_schema"
Consequently, in_schema does not work with noctua + dbplyr 2.0.0.
One workaround would be to use tbl(con, sql("database.table")), but this is not too elegant, and a big negative is that Athena is used for getting the table structure instead of paws::glue(), which makes things much slower.
I'm not sure what the best fix would be. Perhaps an AthenaConnection-specific in_schema method? Or perhaps just check for "dbplyr_schema" in addition to "ident" when deciding whether to query glue or Athena (
Line 216 in 1e34ca1
Line 221 in 1e34ca1
What are your thoughts on this?
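Until a proper fix lands, one hypothetical stopgap is to rebuild the unquoted "ident"-style reference that noctua already recognises. The in_schema_ident helper below is illustrative only and not part of noctua or dbplyr:

```r
library(dbplyr)

# Illustrative helper: produce the pre-2.0.0 "ident"-style table reference
# so noctua's glue-based table lookup still triggers.
in_schema_ident <- function(schema, table) {
  dbplyr::ident_q(paste(schema, table, sep = "."))
}

dplyr::tbl(con, in_schema_ident("database", "table"))
```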
Currently RAthena and noctua are having issues with rhub checks. Please see DyfanJones/RAthena#111 for more information.
The paws sdk has developed a method to pass credentials to aws objects. This method is preferred as it will enable multiple connections to different AWS accounts.
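For reference, paws clients accept per-client credentials through their config argument, which is what makes simultaneous connections to different AWS accounts possible. A sketch, with placeholder key values:

```r
library(paws)

# Sketch: build an Athena client with explicit credentials. The key values
# are placeholders; each client can carry a different account's credentials.
athena <- paws::athena(config = list(
  credentials = list(
    creds = list(
      access_key_id = "placeholder",
      secret_access_key = "placeholder",
      session_token = ""
    )
  ),
  region = "us-east-1"
))
```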
Hello again! I've found what I think is another issue around translating to SQL.
When I try to use the summarise
function to get the median value for a column, I get the following error.
> query <- dplyr::tbl(db_connection, "table_name") %>% summarise(median_value=median(column_name))
> query %>% collect()
Error: InvalidRequestException (HTTP 400). line 1:42: mismatched input '('. Expecting: 'BY'
Here is the SQL being executed:
> query %>% show_query()
<SQL>
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "column_name") AS "median_value"
FROM "table_name"
So I believe the issue is that PERCENTILE_CONT is not a valid Presto function. Perhaps we could use one of the approx_percentile functions here?
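Until the translation is changed, one workaround is to bypass dplyr and call Presto's approx_percentile() directly, reusing the connection and the placeholder table/column names from the example above:

```r
library(DBI)

# approx_percentile(x, 0.5) is Presto's approximate median; the table and
# column names are the placeholders from the example above.
median_value <- dbGetQuery(
  db_connection,
  'SELECT approx_percentile("column_name", 0.5) AS "median_value"
   FROM "table_name"'
)
```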
Several users have reached out with some confusion around how the s3 file path is created in the backend. To minimise this confusion, the documentation should be more open and let users know exactly what is going on with the s3 backend.
AWS Athena supports UNLOAD, which allows Athena to write out different file types, e.g. parquet.
This will allow noctua/RAthena to utilise AWS Athena UNLOAD queries and read the parquet format similar to how AWS Data Wrangler
currently does (NOTE: AWS Data Wrangler wraps queries with CTAS (https://docs.aws.amazon.com/athena/latest/ug/ctas.html))
AWS Data Wrangler example with pros/cons of their current CTAS method:
https://aws-data-wrangler.readthedocs.io/en/stable/tutorials/006%20-%20Amazon%20Athena.html
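For reference, an Athena UNLOAD statement writing parquet looks like the following sketch; the s3 path is a placeholder, `con` is assumed to be an open noctua connection, and the wrapping would eventually happen internally in noctua/RAthena:

```r
library(DBI)

# Sketch: wrap a SELECT in UNLOAD so Athena writes the result set out as
# parquet files under the given (placeholder) s3 prefix.
DBI::dbExecute(con, "
  UNLOAD (SELECT * FROM db.table)
  TO 's3://bucket/path/to/unload/'
  WITH (format = 'PARQUET')
")
```

The driver would then read the parquet output from that prefix instead of parsing the csv staging files.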
Hi Dyfan, one more for the week:
when noctua queries glue for a list of databases with get_databases(), the list is truncated in case the result gets paginated. Same with get_tables().
Not sure what the underlying reason is, but in one of our environments the first call to glue returns only the first 23 databases, and for one reason or another setting the MaxResults parameter doesn't help here.
One fix would be to implement a separate paginated reader function and replace the native paws calls with that, right? Perhaps a more elegant implementation could be devised.
get_all_databases <- function(glue) {
  cresults <- glue$get_databases()
  cdbs <- sapply(cresults$DatabaseList, function(x) x$Name)
  citers <- 0
  # follow NextToken until glue stops paginating (capped at 1000 iterations)
  while (length(cresults$NextToken) > 0 && citers < 1000) {
    cresults <- glue$get_databases(NextToken = cresults$NextToken)
    cdbs <- c(cdbs, sapply(cresults$DatabaseList, function(x) x$Name))
    citers <- citers + 1
  }
  cdbs
}
(I tend to get nervous with these while loops so I added the max iteration count there...)
Currently Windows Server 2008 R2 SP1, R-devel, 32/64 bit is failing:
* checking package dependencies ... ERROR
Package required but not available: 'paws'
Packages suggested but not available:
'arrow', 'data.table', 'dplyr', 'dbplyr', 'testthat'
The suggested packages are required for a complete check.
Checking can be attempted without them by setting the environment
variable _R_CHECK_FORCE_SUGGESTS_ to a false value.
See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.
Issue raised to rhub: #305