Comments (26)
Hey man, thanks for the fast turnaround on this. I'll give it a test spin and let you know if anything doesn't work as expected.
from noctua.
Thanks for that @fabhans. It looks like calling Athena directly isn't the fastest method to get all of the tables. After checking the API documentation, it restricts results to 50 per call, so not great :( I will switch to information_schema, which should speed things up on connection.
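For context, listing all tables via information_schema from R might look something like the sketch below. The connection setup and the exact query noctua ends up using are assumptions here, not the package's confirmed implementation:

```r
library(DBI)

# Connect to Athena (assumes AWS credentials are already configured)
con <- dbConnect(noctua::athena())

# A single query returns every table across schemas, avoiding the
# 50-results-per-call limit of the Athena table-metadata API
tables <- dbGetQuery(con, "
  select table_schema, table_name, table_type
  from information_schema.tables
  where table_schema != 'information_schema'
")

dbDisconnect(con)
```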
I think this should be documented so that users are more aware of it.
Thanks for raising this ticket :) I agree this is an important feature that needs to be added. Unfortunately I won't be able to get around to this in the next couple of weeks. I am expecting my son to be born any minute now :)
Once everything has settled down my side I will get onto developing this :) In the meantime, if you wish to develop this feature, I welcome all PRs :)
Congratulations on becoming a father! I'll see if I can figure this out and put in a PR if yes.
Hi @fabhans, sorry about the wait, this took longer than expected. Been a little busy :)
This has proved to be a little trickier than expected, as noctua needed to be de-coupled from AWS Glue so that it could work with different catalogs.
remotes::install_github("dyfanjones/noctua", ref = "catalog")

To change the default catalog (AwsDataCatalog), please use the new catalog_name parameter in dbConnect:

library(DBI)
con <- dbConnect(noctua::athena(), catalog_name = "MyCatalog")
Please have a go, any feedback would be greatly appreciated :)
Will leave this open until the CRAN release.
I'm getting this error when trying to connect with catalog_name specified:
An error occurred while the 'noctua' package was updating the RStudio Connections pane:
Error in names(tbl_type) <- x$Name: 'names' attribute [1] must be the same length as the vector [0]
If necessary, these warnings can be squelched by setting options(rstudio.connectionObserver.errorsSuppressed = TRUE).
Hi @fabhans, I am unable to reproduce your error. Can you share the environment you are using, please?
Actually I think I know what is going on, please try:
remotes::install_github("dyfanjones/noctua", ref = "table_type")
Connecting works with that, but it takes very long (>20 mins). Btw, this is the same time the error message took to appear before.
@fabhans how many tables do you have in your data catalogs?
I wonder if, for large data catalogs, getting the tables from the SDK isn't the most optimal method.
10k+ tables. If that's the issue I could forego connecting to a specific catalog and use dbplyr::in_catalog()
to target a specific table in a specific catalog instead.
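For anyone following along, the dbplyr::in_catalog() workaround mentioned above might look roughly like this. The catalog, schema, and table names are placeholders, and this assumes dbplyr >= 2.3.0 (where in_catalog() was introduced):

```r
library(DBI)
library(dplyr)

# Connect with the default catalog
con <- dbConnect(noctua::athena())

# Address a table in a non-default catalog per query, instead of
# switching the whole connection over with catalog_name
tbl(con, dbplyr::in_catalog("MyCatalog", "my_schema", "my_table")) |>
  filter(year == 2023) |>
  collect()
```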
You can turn off the connection tab using the following:
library(DBI)
con <- dbConnect(noctua::athena(), catalog_name = "mycatalog", rstudio_conn_tab = FALSE)
However I would like to see if I can improve the performance of getting multiple tables for the connection.
That fixed the issue.
@fabhans can you do me a favour and run the following? I would like to understand what happens at scale, as I only have a small handful of tables myself.
library(DBI)
con <- dbConnect(noctua::athena(), rstudio_conn_tab = F)

bm <- bench::mark(
  method1 = noctua::dbGetTables(con),
  method2 = dbGetQuery(
    con,
    "select
      table_catalog as Catalog,
      table_schema as Schema,
      table_name as TableName,
      table_type as TableType
    from information_schema.tables
    where table_schema != 'information_schema'
    order by table_catalog, table_schema"
  ),
  check = F,
  iterations = 1
)
#> INFO: (Data scanned: 2.37 KB)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
bm
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 method1 1.94s 1.94s 0.516 6.68MB 0.516
#> 2 method2 3.36s 3.36s 0.297 2.94MB 0.297
Created on 2023-04-26 with reprex v2.0.2
Here you go. Got a few warnings:
Warning messages:
1: In rbindlist(unlist(lapply(schema[[n]], function(s) list_tables(athena, :
Column 4 ['TableType'] of item 57293 is length 0. This (and 4 others like it) has been filled with NA (NULL for list columns) to make each item uniform.
2: In rbindlist(unlist(lapply(schema[[n]], function(s) list_tables(athena, :
Column 4 ['TableType'] of item 57296 is length 0. This (and 4 others like it) has been filled with NA (NULL for list columns) to make each item uniform.
3: Some expressions had a GC in every iteration; so filtering is disabled.
> bm
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 method1 15.01m 15.01m 0.00111 1.45GB 0.225 1 203 15.01m <NULL> <Rprofmem> <bench_tm [1]> <tibble>
2 method2 3.41s 3.41s 0.294 2.6MB 0 1 0 3.41s <NULL> <Rprofmem> <bench_tm [1]> <tibble>
I have been doing some tweaking to the RStudio connection tab helper functions. @fabhans, can you test it out please? I am hoping it is usable now :)
remotes::install_github("dyfanjones/noctua", ref = "catalog_performance")
library(DBI)
con <- dbConnect(noctua::athena())
Let me know how it works.
Hi @DyfanJones. It's faster than before, but still pretty slow. Same benchmark as above:
INFO: (Data scanned: 3.74 MB)
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
> bm
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 method1 12.3m 12.3m 0.00136 121MB 0.0217 1 16 12.3m <NULL> <Rprofmem> <bench_tm [1]> <tibble>
2 method2 12.2m 12.2m 0.00137 120MB 0.0205 1 15 12.2m <NULL> <Rprofmem> <bench_tm [1]> <tibble>
>
Thanks for running the test for me. It is interesting that method 2 has slowed down from 3.4s to 12.2m, as there has been no change to it.
Out of interest, how fast does Athena execute the raw SQL from the Athena console?
select
  table_catalog as Catalog,
  table_schema as Schema,
  table_name as TableName,
  table_type as TableType
from information_schema.tables
where table_schema != 'information_schema'
order by table_catalog, table_schema
I really appreciate all the testing you are doing.
Run time: 22 min 51.742 sec
Happy to test as much as necessary!
Ah ok, so if raw Athena can't run it any faster, I guess this is the fastest we can get it.
I'm not sure where the 3.4 secs yesterday came from. Might have had something to do with the warnings I've been seeing.
I guess if the data catalogs are extremely large like in my case, one will just have to live without the connection pane (or wait for it). I'm personally completely fine not using it.
Keeping this open until the CRAN release.