Comments (19)
@OssiLehtinen This looks really good, I will review the code to double check, but I am keen to get it in. I am happy you like noctua
as well, I am planning to keep both RAthena
and noctua
up to date and at the same level. So hopefully it will feel seamless between using one or the other.
from noctua.
Just to confirm, the inherits
solution works well in my tests too.
from noctua.
Haha, I somehow managed to miss your latest message about the inherits solution earlier and noticed the solution only after looking at your pull request (and after posting about my regex-improvement etc.). Well, at least I got some practice on regex...
from noctua.
i am guessing you are using the if statement due to the nature of what is returned in sapply
from the partition call. To over come this vapply
can be used instead with expected values to be returned:
vapply(output$PartitionKeys,function(y) y$Name, FUN.VALUE = character(1))
This will make the following:
db_query_fields.AthenaConnection <- function(con, sql, ...) {
# check if sql is dbplyr ident
is_ident <- inherits(sql, "ident")
if(is_ident) { # If ident, get the fields from Glue
if (grepl("\\.", sql)) {
dbms.name <- gsub("\\..*", "" , sql)
Table <- gsub(".*\\.", "" , sql)
} else {
dbms.name <- con@info$dbms.name
Table <- sql}
tryCatch(
output <- con@ptr$glue$get_table(DatabaseName = dbms.name,
Name = Table)$Table)
col_names = vapply(output$StorageDescriptor$Columns, function(y) y$Name, FUN.VALUE = character(1))
partitions = vapply(output$PartitionKeys,function(y) y$Name, FUN.VALUE = character(1))
c(col_names, partitions)
} else { # If a subquery, query Athena for the fields
# return dplyr methods
sql_select <- pkg_method("sql_select", "dplyr")
sql_subquery <- pkg_method("sql_subquery", "dplyr")
dplyr_sql <- pkg_method("sql", "dplyr")
sql <- sql_select(con, dplyr_sql("*"), sql_subquery(con, sql), where = dplyr_sql("0 = 1"))
qry <- dbSendQuery(con, sql)
on.exit(dbClearResult(qry))
res <- dbFetch(qry, 0)
names(res)
}
}
Yes dbListFields
should have this as currently it will be missing partition colnames. Which is a bug
from noctua.
Speed test:
library(DBI)
library(dplyr)
con <- dbConnect(noctua::athena())
system.time(tbl(con, "iris")) -> t1 # new method
system.time(tbl(con, sql("select * from iris"))) -> t2 # to replicate old method
# new method
user system elapsed
0.082 0.012 0.288
# old method
user system elapsed
0.993 0.138 3.660
The speed increase is really good and makes it alot more user interactive.
from noctua.
Due to the speed increase I think the documentation will have to be updated to advise users to use the new method as much as possible if they can
from noctua.
@OssiLehtinen coming across cran check:
checking R files for non-ASCII characters ... WARNING
Found the following file with non-ASCII characters:
dplyr_integration.R
Portable packages must use only ASCII characters in their R code,
except perhaps in comments.
Use \uxxxx escapes for other characters.
I believe it relates to:
grepl('^"?[a-å]+"?\\.?"?[a-å]+"?$', trimws(tolower(sql)))
Do you have a possible solution for this?
from noctua.
@OssiLehtinen don't worry I have modified the code not to look if it is a sub query but to check if the class is "ident" or not:
db_query_fields.AthenaConnection <- function(con, sql, ...) {
# check if sql is dbplyr ident
is_ident <- inherits(sql, "ident")
if(is_ident) { # If ident, get the fields from Glue
if (grepl("\\.", sql)) {
dbms.name <- gsub("\\..*", "" , sql)
Table <- gsub(".*\\.", "" , sql)
} else {
dbms.name <- con@info$dbms.name
Table <- sql}
tryCatch(
output <- con@ptr$glue$get_table(DatabaseName = dbms.name,
Name = Table)$Table$StorageDescriptor$Columns)
sapply(output, function(y) y$Name)
} else { # If a subquery, query Athena for the fields
# return dplyr methods
sql_select <- pkg_method("sql_select", "dplyr")
sql_subquery <- pkg_method("sql_subquery", "dplyr")
dplyr_sql <- pkg_method("sql", "dplyr")
sql <- sql_select(con, dplyr_sql("*"), sql_subquery(con, sql), where = dplyr_sql("0 = 1"))
qry <- dbSendQuery(con, sql)
on.exit(dbClearResult(qry))
res <- dbFetch(qry, 0)
names(res)
}
}
This will give the same results :) so all good
from noctua.
The speed up is great!
And yes, the 'å's are surely the culprit for the cran error. I included those in the range from purely Scandinavic reasoning, but now that I think of it, that regex is not optimal in any case:
One could limit the letter range to a-z, but the bigger problem are numbers, underscores etc. I'm on my mobile at the moment so can't test it right now, but [a-z1-9_] could do the trick? Regex is really not my forte, so have to test it still.
Sorry for the rushed version, will look into it more once at the keyboard tomorrow. Fortunately a failure to catch a valid table name results in using the dplyr default method, but of course it would be best to use the faster one when evever possible.
from noctua.
This statement seems to do the trick in my tests (the \p{L} should match any unicode character):
is_direct <- grepl('^"?[\\p{L}0-9_]+"?\\.?"?[\\p{L}0-9_]+"?$', trimws(tolower(sql)), perl = T)
https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html
from noctua.
Ah, you seem to have found a more elegant solution with inherits, right? That's great, as the regex solution is pretty kludgey.
from noctua.
One more glitch came up!
With the current version, partition names will be missed from the list of names returned.
The following snippet will also fetch the partitionnames if they exist:
if(is_ident) { # If a direct definiton, get the fields from Glue
message("direct")
if (!dbIsValid(con)) {stop("Connection already closed.", call. = FALSE)}
if (grepl("\\.", sql)) {
dbms.name <- gsub("\\..*", "" , sql)
Table <- gsub(".*\\.", "" , sql)
} else {
dbms.name <- conn@info$dbms.name
Table <- sql}
tryCatch(
table_definition <- con@ptr$glue$get_table(DatabaseName = dbms.name,
Name = Table)$Table)
columns <- sapply(table_definition$StorageDescriptor$Columns, function(y) y$Name)
partitions <- NULL
if(length(table_definition$PartitionKeys) > 0) partitions <- sapply(table_definition$PartitionKeys, function(y) y$Name)
c(columns, partitions)
}
from noctua.
Btw, should something similar be done in dbListFields also?
from noctua.
The partition is for the db_query_fields
?
from noctua.
ah I see what you mean, good spot
from noctua.
This issue persisted in the RStudio connection tab view. PR #65 fixes this issue.
from noctua.
PR #65 passed unit tests. If this issue persists please re-open or open another one
from noctua.
Where does pkg_method
come from?
from noctua.
Where does
pkg_method
come from?
pkg_method
function is created:
Lines 168 to 175 in 1a4b000
from noctua.
Related Issues (20)
- Method to set unload at a package level HOT 1
- Release noctua 2.4.0 on to cran
- Prevent Noctua from printing Data Scanned -information HOT 7
- Release noctua 2.5.0 HOT 3
- Release noctua 2.6.0 HOT 1
- Sub-query fails with dplyr interface indicating "Only one sql statement is allowed" HOT 6
- cran-2.6.1 release
- Can I set various parameters in `.aws/config` file and have `DBI::dbConnect()` read those directly from that file? HOT 5
- Add catalog support HOT 26
- Column Bucketing
- Allow for Partition columns to change data types
- Can't write/append an empty data frame
- Connecting using long-term-creds returns Error 400 HOT 5
- dbFetch(..., n=small number) is quite slow when run on a large result set HOT 4
- `dbExistsTable()` doesn't work anymore HOT 3
- fix: for dbplyr 2.3.3.9000 +
- dbExistsTable() returns an incorrect result when the table name is defined by Id() or SQL() HOT 1
- InvalidRequestException with dbGetQuery HOT 4
- [Question]: Requesting guidance and best practices - Athena shinyApp with noctua HOT 2
- Unload option returns `null` results when `s3_staging_dir` is a bucket only HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from noctua.