Comments (24)
PR #139 fixes this issue 😄
from noctua.
Sorry @OssiLehtinen, I am not 100% sure what the issue is. Can you give me an example of what is happening? Is the issue that not all databases and tables are being returned, or that table names are getting truncated?
from noctua.
Ah, sorry for being unclear. The issue is that not all databases and tables are returned. For example, the connection panel in RStudio shows only the first 23 databases, even though many more exist in the catalog.
from noctua.
Ah, I see :) I will have a look into it. Strange that it is only displaying the top 23 databases; if I remember correctly, the Glue call should return up to 1000 🤔.
But you are right, it does need the NextToken, as it won't return everything otherwise.
Just thinking out loud (don't know if it is true, so don't quote me 😆): is there a restriction on the RStudio connection tab, i.e. on the number of databases and tables it will show?
from noctua.
Yeah, it is pretty strange, and I don't quite understand why we're getting just the first 23 databases (even if we explicitly set MaxResults = 1000L in the paws function call). 23 is a pretty strange number of results to return.
It shouldn't be an issue with RStudio, as the problem also occurs with a direct call to glue$get_databases() and in an R session in a terminal.
BTW, boto3 has the same issue, so it doesn't seem to be a bug specific to paws either. Well, of course it is in principle a bug in both, as the MaxResults parameter isn't really respected. aws-cli returns all the databases, however, but maybe it has some built-in pagination handling.
p.s. I guess someone might have 1000+ databases (not saying that anyone should :D) and then pagination would be needed even if things were working correctly...
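The NextToken loop being described here can be sketched generically. This is a hypothetical illustration only: the `list_databases` function below is a fake paged API standing in for a paws/boto3-style `get_databases` call (the names `DatabaseList`, `MaxResults`, and `NextToken` mirror the real Glue response shape, but nothing here touches AWS). The point is simply to keep calling until the response no longer carries a NextToken:

```python
def list_databases(MaxResults=100, NextToken=None):
    """Fake paged API: returns at most MaxResults items per call,
    plus a NextToken while more items remain in the 'catalog'."""
    catalog = [f"db_{i}" for i in range(55)]
    start = int(NextToken) if NextToken else 0
    end = min(start + MaxResults, len(catalog))
    page = {"DatabaseList": catalog[start:end]}
    if end < len(catalog):
        page["NextToken"] = str(end)  # opaque continuation marker
    return page

def get_all_databases(page_size=23):
    """Accumulate pages until the API stops returning a NextToken."""
    databases, token = [], None
    while True:
        resp = list_databases(MaxResults=page_size, NextToken=token)
        databases += resp["DatabaseList"]
        token = resp.get("NextToken")
        if token is None:
            break
    return databases

print(len(get_all_databases()))  # → 55, across three pages of 23
```

With `page_size=23` the fake catalog comes back in pages of 23, 23, and 9 items, which is exactly the "only the first 23 databases" symptom if the loop stops after the first call.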
from noctua.
Interesting, I guess I need to make 24 temporary databases in my Athena just to see this happening :D hahaha
from noctua.
Haha, maybe you would see it, maybe not. The strange thing is that this happens in only one of our environments/accounts. One specific feature there is that we have quite a lot of content shared via Lake Formation to that account, but the 'breaking point' of the database list doesn't seem to correlate with anything.
from noctua.
Does this happen in dbGetTables and dbListTables? I think I will need to update these methods accordingly.
from noctua.
Yes, the same issue there. It must be everywhere glue$get_databases() is used.
from noctua.
Interesting, and does this only happen when you are working in that environment?
from noctua.
Yes, only in there.
from noctua.
hmmm interesting. Are the permissions to aws glue the same in that environment compared to your others?
from noctua.
Or is that Glue catalogue synced up correctly with your others?
from noctua.
Well, there certainly could be something weird going on, especially with the permissions granted through Lake Formation, but we haven't spotted anything relevant so far. And when we do another get_databases() query with the NextToken, we get the rest of the results, so in my opinion there is no fundamental problem with permissions.
I will let you know if we make any discoveries about what might cause this.
from noctua.
Well, NextToken needs to be added to these functions. Hopefully this will fix it going forward :) Very interesting problem, as I haven't come across it before :)
from noctua.
Perhaps a helper like this could be handy?:
glue <- paws::glue()
get_paginated_from_paws <- function(paws_fun, token_field = "NextToken", content_node = NULL, ...) {
  # collect the extra arguments to forward to the paws function
  cargs <- list(...)
  cresults <- do.call(paws_fun, cargs)
  # default to the first element of the response (e.g. DatabaseList)
  if (is.null(content_node)) content_node <- names(cresults)[1]
  ccontent <- cresults[[content_node]]
  citers <- 0
  # keep requesting pages while a NextToken is returned (capped at 100 iterations)
  while (length(cresults[[token_field]]) > 0 && citers < 100) {
    cargs[[token_field]] <- cresults[[token_field]]
    cresults <- do.call(paws_fun, cargs)
    ccontent <- c(ccontent, cresults[[content_node]])
    citers <- citers + 1
  }
  ccontent
}
get_paginated_from_paws(glue$get_databases)
sapply(get_paginated_from_paws(glue$get_databases), function(x) x$Name)
get_paginated_from_paws(glue$get_table, DatabaseName = "mydatabase", Name = "mytable")
The idea is that you can pass it different functions, such as glue$get_databases or glue$get_table, as an argument, and the function will handle the looping. Some assumptions are made about the structure of the query results. It shouldn't add much overhead if NextToken is empty after the first call.
from noctua.
@OssiLehtinen thanks :) I will have a look to see how this works in practise :D
from noctua.
@OssiLehtinen are you available to test whether the branch aws-table-list fixes the issue you are coming across in your environment?
# Installation method for branch: aws-table-list
remotes::install_github("dyfanjones/noctua", ref="aws-table-list")
This works in my environment but I would like to make sure it is fixed in yours :)
from noctua.
Thanks for this :) I have created two helper functions for the glue query, as I wanted to format the response returned from AWS. This helps reduce the number of loops required to parse the named lists, and helps with performance :)
from noctua.
Everything seems to work correctly now. The connections tab shows all the databases, and dbListTables and dbGetTables show the full list now too. So, great work again, thanks!
p.s. the rstudio_conn_tab = F option really comes in handy, as reading the whole Glue catalog is even slower now (obviously it takes more time to actually read in the complete catalog :) )
from noctua.
You must have a fairly large Glue catalogue for it to be that slow :D. From the R side of things it should be nice and fast when parsing the lists 😄 (well, as fast as I know how to make it currently 😋)
🤔 How are dbGetTables and dbListTables? Do they run slow-ish?
from noctua.
Yeah, it's some 1200 tables or so at the moment. But I think using Lake Formation is the final nail in the coffin (quite a few of those tables are shared from other accounts to this specific account, where things are especially slow).
I totally agree the slowness shouldn't be an R-side issue. I think I will raise an issue with AWS about it.
from noctua.
Oh, dbListTables and dbGetTables are equally slow. Also 'select * from information_schema.tables'.
from noctua.
@OssiLehtinen Thanks for giving an update on this 😄
from noctua.