Comments (26)
Hey man, thanks for the fast turnaround on this. I'll give it a test spin and let you know if anything doesn't work as expected.
from noctua.
Thanks for that @fabhans. It looks like calling Athena directly isn't the fastest method to get all of the tables. After checking the API documentation, it restricts results to 50 per call, so not great :( I will switch to information_schema, which should speed things up on connection.
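For context, listing all tables via information_schema from R might look something like the sketch below. The connection setup and the exact query noctua ends up using are assumptions here, not the package's confirmed implementation:

```r
library(DBI)

# Connect to Athena (assumes AWS credentials are already configured)
con <- dbConnect(noctua::athena())

# A single query returns every table across schemas, avoiding the
# 50-results-per-call limit of the Athena table-metadata API
tables <- dbGetQuery(con, "
  select table_schema, table_name, table_type
  from information_schema.tables
  where table_schema != 'information_schema'
")

dbDisconnect(con)
```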
I think this should be documented so that users are more aware of it.
Thanks for raising this ticket :) I agree this is an important feature that needs to be added. Unfortunately I won't be able to get around to this in the next couple of weeks. I am expecting my son to be born any minute now :)
Once everything has settled down my side I will get onto developing this :) In the meantime, if you wish to develop this feature, I welcome all PRs :)
Congratulations on becoming a father! I'll see if I can figure this out and put in a PR if yes.
Hi @fabhans, sorry about the wait, this took longer than expected. Been a little busy :)
This has proved to be a little trickier than expected, as noctua needed to be de-coupled from AWS Glue so that it could work with different catalogs.
remotes::install_github("dyfanjones/noctua", ref = "catalog")

To change the default catalog (AwsDataCatalog), please use the new catalog_name parameter in dbConnect:

library(DBI)
con <- dbConnect(noctua::athena(), catalog_name = "MyCatalog")
Please have a go, any feedback would be greatly appreciated :)
Will leave this open until the CRAN release.
I'm getting this error when trying to connect with catalog_name specified:
An error occurred while the 'noctua' package was updating the RStudio Connections pane:
Error in names(tbl_type) <- x$Name: 'names' attribute [1] must be the same length as the vector [0]
If necessary, these warnings can be squelched by setting options(rstudio.connectionObserver.errorsSuppressed = TRUE).
Hi @fabhans, I am unable to reproduce your error. Can you share the environment you are using, please?
Actually I think I know what is going on, please try:
remotes::install_github("dyfanjones/noctua", ref = "table_type")
Connecting works with that, but it takes very long (>20 mins). Btw, this is the same time the error message took to appear before.
@fabhans how many tables do you have in your data catalogs?
I wonder if, for large data catalogs, getting the tables from the SDK isn't the most optimal method.
10k+ tables. If that's the issue I could forego connecting to a specific catalog and use dbplyr::in_catalog()
to target a specific table in a specific catalog instead.
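For anyone following along, the dbplyr::in_catalog() workaround mentioned above might look roughly like this. The catalog, schema, and table names are placeholders, and this assumes dbplyr >= 2.3.0 (where in_catalog() was introduced):

```r
library(DBI)
library(dplyr)

# Connect with the default catalog
con <- dbConnect(noctua::athena())

# Address a table in a non-default catalog per query, instead of
# switching the whole connection over with catalog_name
tbl(con, dbplyr::in_catalog("MyCatalog", "my_schema", "my_table")) |>
  filter(year == 2023) |>
  collect()
```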
You can turn off the connection tab using the following:
library(DBI)
con <- dbConnect(noctua::athena(), catalog_name = "mycatalog", rstudio_conn_tab = FALSE)
However I would like to see if I can improve the performance of getting multiple tables for the connection.
That fixed the issue.
@fabhans can you do me a favour and run the following? I would like to understand what happens at scale, as I only have a small handful of tables myself.
library(DBI)
con <- dbConnect(noctua::athena(), rstudio_conn_tab = F)

bm <- bench::mark(
  method1 = noctua::dbGetTables(con),
  method2 = dbGetQuery(
    con,
    "select
      table_catalog as Catalog,
      table_schema as Schema,
      table_name as TableName,
      table_type as TableType
    from information_schema.tables
    where table_schema != 'information_schema'
    order by table_catalog, table_schema"
  ),
  check = F,
  iterations = 1
)
#> INFO: (Data scanned: 2.37 KB)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
bm
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 method1 1.94s 1.94s 0.516 6.68MB 0.516
#> 2 method2 3.36s 3.36s 0.297 2.94MB 0.297
Created on 2023-04-26 with reprex v2.0.2
Here you go. Got a few warnings:
Warning messages:
1: In rbindlist(unlist(lapply(schema[[n]], function(s) list_tables(athena, :
Column 4 ['TableType'] of item 57293 is length 0. This (and 4 others like it) has been filled with NA (NULL for list columns) to make each item uniform.
2: In rbindlist(unlist(lapply(schema[[n]], function(s) list_tables(athena, :
Column 4 ['TableType'] of item 57296 is length 0. This (and 4 others like it) has been filled with NA (NULL for list columns) to make each item uniform.
3: Some expressions had a GC in every iteration; so filtering is disabled.
> bm
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 method1 15.01m 15.01m 0.00111 1.45GB 0.225 1 203 15.01m <NULL> <Rprofmem> <bench_tm [1]> <tibble>
2 method2 3.41s 3.41s 0.294 2.6MB 0 1 0 3.41s <NULL> <Rprofmem> <bench_tm [1]> <tibble>
I have been doing some tweaking to the RStudio connection tab helper functions. @fabhans, can you test it out please? I am hoping it is usable now :)
remotes::install_github("dyfanjones/noctua", ref = "catalog_performance")
library(DBI)
con <- dbConnect(noctua::athena())
Let me know how it works.
Hi @DyfanJones. It's faster than before, but still pretty slow. Same benchmark as above:
INFO: (Data scanned: 3.74 MB)
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
> bm
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 method1 12.3m 12.3m 0.00136 121MB 0.0217 1 16 12.3m <NULL> <Rprofmem> <bench_tm [1]> <tibble>
2 method2 12.2m 12.2m 0.00137 120MB 0.0205 1 15 12.2m <NULL> <Rprofmem> <bench_tm [1]> <tibble>
>
Thanks for running the test for me. It is interesting that method 2 has slowed down from 3.4s to 12.2m, as there has been no change to it.
Out of interest, how fast does Athena execute the raw SQL from the Athena console?
select
  table_catalog as Catalog,
  table_schema as Schema,
  table_name as TableName,
  table_type as TableType
from information_schema.tables
where table_schema != 'information_schema'
order by table_catalog, table_schema
I really appreciate all the testing you are doing.
Run time: 22 min 51.742 sec
Happy to test as much as necessary!
Ah ok, so if raw Athena can't run it any faster, I guess this is the fastest we can get it.
I'm not sure where the 3.4 secs yesterday came from. Might have had something to do with the warnings I've been seeing.
I guess if the data catalogs are extremely large like in my case, one will just have to live without the connection pane (or wait for it). I'm personally completely fine not using it.
Keeping this open until the CRAN release.