Comments (24)
PR #139 fixes this issue 😄
from noctua.
Sorry @OssiLehtinen, I am not 100% sure what the issue is. Can you give me an example of what is happening? Is the issue that not all databases and tables are being returned, or that table names are getting truncated?
from noctua.
Ah, sorry for being unclear. The issue is that not all databases and tables are returned. For example, the connection panel in RStudio shows only the first 23 databases, even though many more exist in the catalog.
from noctua.
Ah, I see :) I will have a look into it. Strange that it is only displaying the top 23 databases; if I remember correctly, the Glue call should return up to 1000 🤔.
But you are right, it does need the NextToken, as it won't return everything otherwise.
Just thinking out loud (don't know if it is true, so don't quote me 😆): is there a restriction on the RStudio connection tab, i.e. on the number of databases and tables it will show?
from noctua.
Yeah, it is pretty strange, and I don't quite understand why we're getting just the first 23 databases (even if we explicitly set MaxResults = 1000L in the paws function call). 23 is a pretty strange number of results to return.
It shouldn't be an issue with RStudio, as the problem also occurs with a direct call to glue$get_databases() and in an R session in a terminal.
BTW, boto3 has the same issue, so it doesn't seem to be a bug specific to paws either. Well, of course it is in principle a bug in both, as the MaxResults parameter isn't really respected. aws-cli returns all the databases, however, but maybe it has some built-in pagination handling.
p.s. I guess someone might have 1000+ databases (not saying that anyone should :D) and then pagination would be needed even if things were working correctly...
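The NextToken loop being described here can be sketched generically. This is a hypothetical illustration only: the `list_databases` function below is a fake paged API standing in for a paws/boto3-style `get_databases` call (the names `DatabaseList`, `MaxResults`, and `NextToken` mirror the real Glue response shape, but nothing here touches AWS). The point is simply to keep calling until the response no longer carries a NextToken:

```python
def list_databases(MaxResults=100, NextToken=None):
    """Fake paged API: returns at most MaxResults items per call,
    plus a NextToken while more items remain in the 'catalog'."""
    catalog = [f"db_{i}" for i in range(55)]
    start = int(NextToken) if NextToken else 0
    end = min(start + MaxResults, len(catalog))
    page = {"DatabaseList": catalog[start:end]}
    if end < len(catalog):
        page["NextToken"] = str(end)  # opaque continuation marker
    return page

def get_all_databases(page_size=23):
    """Accumulate pages until the API stops returning a NextToken."""
    databases, token = [], None
    while True:
        resp = list_databases(MaxResults=page_size, NextToken=token)
        databases += resp["DatabaseList"]
        token = resp.get("NextToken")
        if token is None:
            break
    return databases

print(len(get_all_databases()))  # → 55, across three pages of 23
```

With `page_size=23` the fake catalog comes back in pages of 23, 23, and 9 items, which is exactly the "only the first 23 databases" symptom if the loop stops after the first call.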
from noctua.
Interesting, I guess I need to make 24 temporary databases in my Athena just to see this happening :D hahaha
from noctua.
Haha, maybe you would see it, maybe not. The strange thing is that this happens in only one of our environments/accounts. One specific feature there is that we have quite a lot of content shared via Lake Formation to that account, but the 'breaking point' of the database list doesn't seem to correlate with anything.
from noctua.
Does this happen in dbGetTables and dbListTables? I think I will need to update these methods accordingly.
from noctua.
Yes, the same issue there. It must be everywhere glue$get_databases() is used.
from noctua.
Interesting, and does this only happen when you are working in that environment?
from noctua.
Yes, only in there.
from noctua.
hmmm interesting. Are the permissions to aws glue the same in that environment compared to your others?
from noctua.
Or is that Glue catalogue synced up correctly with your others?
from noctua.
Well, there certainly could be something weird going on, especially with the permissions granted through Lake Formation, but we haven't spotted anything relevant so far. And when we do another get_databases() query with the NextToken, we get the rest of the results, so in my opinion there is no fundamental problem with permissions.
I will let you know if we make any discoveries about what might cause this.
from noctua.
Well, NextToken needs to be added to these functions. Hopefully this will fix it going forward :) Very interesting problem, as I haven't come across it before :)
from noctua.
Perhaps a helper like this could be handy?:
glue <- paws::glue()
get_paginated_from_paws <- function(paws_fun, token_field = "NextToken", content_node = NULL, ...) {
  # collect the extra arguments to forward to the paws function
  cargs <- list(...)
  cresults <- do.call(paws_fun, cargs)
  # default to the first element of the response (e.g. DatabaseList)
  if (is.null(content_node)) content_node <- names(cresults)[1]
  ccontent <- cresults[[content_node]]
  citers <- 0
  # keep requesting pages while a NextToken is returned (capped at 100 iterations)
  while (length(cresults[[token_field]]) > 0 && citers < 100) {
    cargs[[token_field]] <- cresults[[token_field]]
    cresults <- do.call(paws_fun, cargs)
    ccontent <- c(ccontent, cresults[[content_node]])
    citers <- citers + 1
  }
  ccontent
}
get_paginated_from_paws(glue$get_databases)
sapply(get_paginated_from_paws(glue$get_databases), function(x) x$Name)
get_paginated_from_paws(glue$get_table, DatabaseName = "mydatabase", Name = "mytable")
The idea is that you can pass it different functions, such as glue$get_databases or glue$get_table, as an argument, and the function will handle the looping. Some assumptions are made about the structure of the query results. It shouldn't add much overhead if NextToken is empty after the first call.
from noctua.
@OssiLehtinen thanks :) I will have a look to see how this works in practise :D
from noctua.
@OssiLehtinen are you available to test whether the branch aws-table-list fixes the issue you are coming across in your environment?
# Installation method for branch: aws-table-list
remotes::install_github("dyfanjones/noctua", ref="aws-table-list")
This works in my environment but I would like to make sure it is fixed in yours :)
from noctua.
Thanks for this :) I have created two helper functions for the glue query, as I wanted to format the response returned from AWS. This helps reduce the number of loops required to parse the named lists, and helps with performance :)
from noctua.
Everything seems to work correctly now. The connections tab shows all the databases, and dbListTables and dbGetTables show the full list now too. So, great work again, thanks!
p.s. the rstudio_conn_tab = F option really comes in handy, as reading the whole Glue catalog is even slower now (obviously it takes more time to actually read in the complete catalog :) )
from noctua.
You must have a fairly large Glue catalogue for it to be that slow :D. From the R side of things it should be nice and fast when parsing the lists 😄 (well, as fast as I know how to make it currently 😋)
🤔 How are dbGetTables and dbListTables? Do they run slow-ish?
from noctua.
Yeah, it's some 1200 tables or so at the moment. But I think using Lake Formation is the final nail in the coffin (quite a few of those tables are shared from other accounts to this specific account, where things are especially slow).
I totally agree the slowness shouldn't be an R-side issue. I think I will raise an issue with AWS about it.
from noctua.
Oh, dbListTables and dbGetTables are equally slow. Also 'select * from information_schema.tables'.
from noctua.
@OssiLehtinen Thanks for giving an update on this 😄
from noctua.