Giter Club home page Giter Club logo

Comments (26)

fabhans avatar fabhans commented on September 28, 2024 1

Hey man, thanks for the fast turnaround on this. I'll give it a test spin and let you know if anything doesn't work as expected.

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024 1

Thanks for that @fabhans, it looks like calling Athena directly isn't the fastest method to get all of the tables. After checking the API documentation, it looks like it restricts it to 50 results per call so not great :( I will switch to the information_schema, this should speed things up on connection 😄

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024 1

I think it should document so that users are more aware 😃

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

Thanks for raising this ticket :) I agree this is an important feature that needs to be added. Unfortunately I wont be able to get around to this in the next couple of weeks. I am expecting my son to be born any minute now :)

Once everything has settle down myside I will get onto deving this out :) In the meantime if you wish to develop this feature, I welcome all PRs :)

from noctua.

fabhans avatar fabhans commented on September 28, 2024

Congratulations on becoming a father! I'll see if I can figure this out and put in a PR if yes.

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

Hi @fabhans, sorry about the wait, this too longer than expected. Been a little busy :)

This has proved to be a little more tricky than expected as noctua needed to be de-coupled from AWS Glue to so that it could work with different catalogs.

remotes::install_github("dyfanjones/noctua", ref = "catalog")

To change default catalog (AwsDataCatalog) please use the new parameter catalog_name in dbConnect:

library(DBI)

con <- dbConnect(noctua::athena(), catalog_name = "MyCatalog")

Please have a go, any feedback would be greatly appreciated :)

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

Will leave this open until cran release.

from noctua.

fabhans avatar fabhans commented on September 28, 2024

I'm getting this error when trying to connect with catalog_name specified:

An error occurred while the 'noctua' package was updating the RStudio Connections pane: Error in names(tbl_type) <- x$Name: 'names' attribute [1] must be the same length as the vector [0] If necessary, these warnings can be squelched by setting options(rstudio.connectionObserver.errorsSuppressed = TRUE).

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

Hi @fabhans I am unable to reproduce your error. Can you share your environment you are using please

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

Actually I think I know what is going on, please try:

remotes::install_github("dyfanjones/noctua", ref = "table_type")

from noctua.

fabhans avatar fabhans commented on September 28, 2024

Connecting works with that, but it takes very long (>20mins). Btw this is the same time the error message took to appear.

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

@fabhans how many table do you have in your data catalogs? 🤔

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

i wonder for large data catalogs getting the tables from the sdk isn't the most optimal method 🤔

from noctua.

fabhans avatar fabhans commented on September 28, 2024

10k+ tables. If that's the issue I could forego connecting to a specific catalog and use dbplyr::in_catalog() to target a specific table in a specific catalog instead.

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

You can turn off the connection tab using the following

library(DBI)
con <- dbConnect(noctua::athena(), catalog_name = "mycatalog", rstudio_conn_tab = FALSE)

However I would like to see if I can improve the performance of getting multiple tables for the connection.

from noctua.

fabhans avatar fabhans commented on September 28, 2024

That fixed the issue.

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

@fabhans can you do me a favour and run the following. I would like to understand what happens at scale. I only have a small handful of tables so I don't know what happens at scale.

library(DBI)

con <- dbConnect(noctua::athena(), rstudio_conn_tab = F)

bm <- bench::mark(
  method1 = noctua::dbGetTables(con),
  method2 = dbGetQuery(
    con,
    "select 
      table_catalog as Catalog,
      table_schema as Schema,
      table_name as TableName,
      table_type as TableType
    from information_schema.tables
    where table_schema != 'information_schema'
    order by table_catalog, table_schema"
  ),
  check = F,
  iterations = 1
)

#> INFO: (Data scanned: 2.37 KB)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
bm
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 method1       1.94s    1.94s     0.516    6.68MB    0.516
#> 2 method2       3.36s    3.36s     0.297    2.94MB    0.297

Created on 2023-04-26 with reprex v2.0.2

from noctua.

fabhans avatar fabhans commented on September 28, 2024

Here you go. Got a few warnings:

Warning messages:
1: In rbindlist(unlist(lapply(schema[[n]], function(s) list_tables(athena,  :
  Column 4 ['TableType'] of item 57293 is length 0. This (and 4 others like it) has been filled with NA (NULL for list columns) to make each item uniform.
2: In rbindlist(unlist(lapply(schema[[n]], function(s) list_tables(athena,  :
  Column 4 ['TableType'] of item 57296 is length 0. This (and 4 others like it) has been filled with NA (NULL for list columns) to make each item uniform.
3: Some expressions had a GC in every iteration; so filtering is disabled. 
> bm
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time           gc      
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>         <list>  
1 method1      15.01m   15.01m   0.00111    1.45GB    0.225     1   203     15.01m <NULL> <Rprofmem> <bench_tm [1]> <tibble>
2 method2       3.41s    3.41s   0.294       2.6MB    0         1     0      3.41s <NULL> <Rprofmem> <bench_tm [1]> <tibble>

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

I have been doing to some tweaking to the rstudio connection tab helper functions. @fabhans can you test it out please. I am hoping it is useable now :)

remotes::install_github("dyfanjones/noctua", ref = "catalog_performance")
library(DBI)

con <- dbConnect(noctua::athena())

Let me know how it works.

from noctua.

fabhans avatar fabhans commented on September 28, 2024

Hi @DyfanJones. It's faster than previously now, but still pretty slow. Same benchmark as above:

INFO: (Data scanned: 3.74 MB)
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 
> bm
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time           gc      
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>         <list>  
1 method1       12.3m    12.3m   0.00136     121MB   0.0217     1    16      12.3m <NULL> <Rprofmem> <bench_tm [1]> <tibble>
2 method2       12.2m    12.2m   0.00137     120MB   0.0205     1    15      12.2m <NULL> <Rprofmem> <bench_tm [1]> <tibble>
> 

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

Thanks for running the test for me, it is interesting method 2 has slowed down from 3.4s to 12.1m, as there has been no change to it 🤔

Out of interest how fast does Athena execute the raw SQL from the Athena console?

select 
      table_catalog as Catalog,
      table_schema as Schema,
      table_name as TableName,
      table_type as TableType
from information_schema.tables
where table_schema != 'information_schema'
order by table_catalog, table_schema

I really appreciate all the testing you are doing.

from noctua.

fabhans avatar fabhans commented on September 28, 2024

Run time: 22 min 51.742 sec

Happy to test as much as necessary!

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

Ah ok so if raw Athena can't run any faster I guess this is the fastest we can get it 🤔

from noctua.

fabhans avatar fabhans commented on September 28, 2024

I'm not sure where the 3.4 secs yesterday came from. Might have had something to do with the warnings I've been seeing.

I guess if the data catalogs are extremely large like in my case, one will just have to live without the connection pane (or wait for it). I'm personally completely fine not using it.

from noctua.

DyfanJones avatar DyfanJones commented on September 28, 2024

Keeping open until cran release

from noctua.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.