Comments (24)
Hi @mraess ,
This behaviour is correct. profile_name
doesn't affect the schema/database you are connecting to. The profile_name
basically passes your credentials safely to the SDK so you can make the connection. So, in effect, you will be connecting as a different user with each profile_name, depending on the credentials.
What I think you want to do is change the default schema
when you make the connection. To do this you need to do the following:
library(DBI)
library(noctua)
# changing profile name but same schema
con1 <- dbConnect(athena())
con2 <- dbConnect(athena(),
profile_name = "demo_profile")
# changing schema/database
con3 <- dbConnect(athena(),
schema_name = "temp")
con4 <- dbConnect(athena(),
profile_name = "demo_profile",
schema_name = "temp")
dbGetInfo(con1)
# $profile_name
# [1] "default"
#
# $s3_staging
# [1] "s3://mybucket/"
#
# $dbms.name
# [1] "default"
#
# $work_group
# [1] "primary"
#
# $poll_interval
# NULL
#
# $encryption_option
# NULL
#
# $kms_key
# NULL
#
# $expiration
# NULL
#
# $region_name
# [1] "eu-west-1"
#
# $paws
# [1] "0.1.7"
#
# $noctua
# [1] "1.6.0.9000"
dbGetInfo(con2)
# $profile_name
# [1] "demo_profile"
#
# $s3_staging
# [1] "s3://mybucket/"
#
# $dbms.name
# [1] "default"
#
# $work_group
# [1] "primary"
#
# $poll_interval
# NULL
#
# $encryption_option
# NULL
#
# $kms_key
# NULL
#
# $expiration
# NULL
#
# $region_name
# [1] "eu-west-1"
#
# $paws
# [1] "0.1.7"
#
# $noctua
# [1] "1.6.0.9000"
dbGetInfo(con3)
# $profile_name
# [1] "default"
#
# $s3_staging
# [1] "s3://mybucket/"
#
# $dbms.name
# [1] "temp"
#
# $work_group
# [1] "primary"
#
# $poll_interval
# NULL
#
# $encryption_option
# NULL
#
# $kms_key
# NULL
#
# $expiration
# NULL
#
# $region_name
# [1] "eu-west-1"
#
# $paws
# [1] "0.1.7"
#
# $noctua
# [1] "1.6.0.9000"
dbGetInfo(con4)
# $profile_name
# [1] "demo_profile"
#
# $s3_staging
# [1] "s3://mybucket/"
#
# $dbms.name
# [1] "temp"
#
# $work_group
# [1] "primary"
#
# $poll_interval
# NULL
#
# $encryption_option
# NULL
#
# $kms_key
# NULL
#
# $expiration
# NULL
#
# $region_name
# [1] "eu-west-1"
#
# $paws
# [1] "0.1.7"
#
# $noctua
# [1] "1.6.0.9000"
I hope this makes sense :)
from noctua.
When you make a connection and you have multiple schemas in your Athena instance you can still query them for example:
library(DBI)
library(noctua)
# changing profile name but same schema
con1 <- dbConnect(athena())
dbGetQuery(con1, "select * from temp.iris_parquet")
# Info: (Data scanned: 666 Bytes)
# petal_length petal_width time_stamp
# 1: 1.4 0.2 20200421
# 2: 1.4 0.2 20200421
# 3: 1.3 0.2 20200421
# 4: 1.5 0.2 20200421
# 5: 1.4 0.2 20200421
# ---
# 146: 5.2 2.3 20200421
# 147: 5.0 1.9 20200421
# 148: 5.2 2.0 20200421
# 149: 5.4 2.3 20200421
# 150: 5.1 1.8 20200421
The benefit of setting the schema_name
in the connection is when you use dplyr
for example:
library(dplyr)
con2 <- dbConnect(athena(),
schema_name = "temp")
tbl(con2, "iris_parquet")
# Info: (Data scanned: 666 Bytes)
# Source: table<iris_parquet> [?? x 3]
# Database: Athena 0.1.7 [default@eu-west-1/temp]
# petal_length petal_width time_stamp
# <dbl> <dbl> <chr>
# 1 1.4 0.2 20200421
# 2 1.4 0.2 20200421
# 3 1.3 0.2 20200421
# 4 1.5 0.2 20200421
# 5 1.4 0.2 20200421
# 6 1.7 0.4 20200421
# 7 1.4 0.3 20200421
# 8 1.5 0.2 20200421
# 9 1.4 0.2 20200421
# 10 1.5 0.1 20200421
# … with more rows
However this still can be achieved with dbplyr::in_schema()
tbl(con1, dbplyr::in_schema("temp","iris_parquet"))
# Info: (Data scanned: 666 Bytes)
# Source: table<temp.iris_parquet> [?? x 3]
# Database: Athena 0.1.7 [default@eu-west-1/default]
# petal_length petal_width time_stamp
# <dbl> <dbl> <chr>
# 1 1.4 0.2 20200421
# 2 1.4 0.2 20200421
# 3 1.3 0.2 20200421
# 4 1.5 0.2 20200421
# 5 1.4 0.2 20200421
# 6 1.7 0.4 20200421
# 7 1.4 0.3 20200421
# 8 1.5 0.2 20200421
# 9 1.4 0.2 20200421
# 10 1.5 0.1 20200421
# … with more rows
I hope this has helped with your initial query. I will leave this open just in case I misunderstood your initial question.
Hey, thanks for the quick answer - I'll certainly give this a shot. One question though: wouldn't the different s3 staging directories determine which database it connects to? And how does it know what the standard database is, which I'm connecting to right now?
I guess what I'm asking is: if I have all my tables in AWS Athena, but some are under dev-a, which the default credentials are tied to, and a second schema/database, prod-a, to which the non-default profile is tied, would my problem be solved with the schema setting you mentioned? I'm asking because I was under the impression that the staging directory is tied to the specific database, i.e. dev-a or prod-a.
Let me know if this makes sense :)
s3_staging_dir
is the s3 bucket that AWS Athena outputs its results into. What noctua
does is get those results and return them to R. By default noctua
cleans up after each query to keep your S3 buckets tidy. However, you can turn that off by activating the caching feature: AWS Athena Query Caching
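As a rough sketch (the exact argument name is an assumption here - check ?noctua_options for the current API), enabling the caching feature might look like:

```r
library(noctua)

# Assumed sketch: a non-zero cache_size tells noctua to keep and reuse
# the results of previously executed queries instead of re-running
# them in Athena (and cleaning up the S3 output each time).
noctua_options(cache_size = 10)
```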
By default schema_name = "default"; this is just hard-coded. So by changing the schema_name
to the correct database it should work. In the backend, noctua passes your connection database to the Athena api call functions. This is what I am guessing is the problem you are running into.
I am not sure how to get the schema_name
from your profile using the AWS SDKs. If you know how, then I am happy to receive a pull request - or even if you share the code in this thread, I can add this to the package :)
Haha, sounds good - I just might have to hard-code the schema_name. To clarify, this should work even if dev-a and prod-a are different databases with their own schemas.tables, correct? E.g. for ingest.table-a in dev-a, would that have ingest as the schema name?
It should be able to query across schemas, so for example:
library(DBI)
library(noctua)
con <- dbConnect(athena(),
schema_name = "default")
# querying default schema
dbGetQuery(con, "select * from iris")
# Info: (Data scanned: 860 Bytes)
# sepal_length sepal_width petal_length petal_width species
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# ---
# 146: 6.7 3.0 5.2 2.3 virginica
# 147: 6.3 2.5 5.0 1.9 virginica
# 148: 6.5 3.0 5.2 2.0 virginica
# 149: 6.2 3.4 5.4 2.3 virginica
# 150: 5.9 3.0 5.1 1.8 virginica
# querying temp schema
dbGetQuery(con, "select * from temp.iris_parquet")
# Info: (Data scanned: 666 Bytes)
# petal_length petal_width time_stamp
# 1: 1.4 0.2 20200421
# 2: 1.4 0.2 20200421
# 3: 1.3 0.2 20200421
# 4: 1.5 0.2 20200421
# 5: 1.4 0.2 20200421
# ---
# 146: 5.2 2.3 20200421
# 147: 5.0 1.9 20200421
# 148: 5.2 2.0 20200421
# 149: 5.4 2.3 20200421
# 150: 5.1 1.8 20200421
Thanks for the quick replies!! I'll check this out for sure. Thanks also for keeping the issue open for now!
Hey, so I implemented your suggestions - unfortunately to no avail. I have been able to narrow the problem down some more though.
A little more context: both dev-a and prod-a are in two completely different AWS accounts.
When I replace the key/secret for dev-a in the default profile in .credentials with a key/secret pair for prod-a, it does connect to prod-a, even though in the actual R code everything is left the same, e.g.
con_2 <- dbConnect(noctua::athena(),
profile_name = "other-profile-that-is-not-default",
s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/",
region = "us-east-1")
pointing to a non-default profile.
This leads me to believe that, for some reason, only the default profile is being picked up, because what changed wasn't the R code but the underlying credentials in the default profile, while the function was pointing to a non-default profile.
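For reference, the ~/.aws/credentials layout assumed in this description (all key values are placeholders) would be along these lines:

```ini
[default]
aws_access_key_id = XXXX
aws_secret_access_key = XXXX

[other-profile-that-is-not-default]
aws_access_key_id = XXXX
aws_secret_access_key = XXXX
```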
Another thing I tried was to put SAML credentials into the default
profile in the .credentials
file to see if that would work.
[default]
aws_access_key_id = XXXXX
aws_secret_access_key = XXXX
aws_session_token = XXXXX
aws_security_token = XXXXXX
In the .config
file I put
[default]
region = us-east-1
saml.session_duration = 43200
saml.idp_arn = arn:aws:iam::XXXX:saml-provider/g-suite
saml.role_arn = arn:aws:iam::XXXXXXX:role/other-profile_that-is-not-default
This however results in an error:
Error: UnrecognizedClientException (HTTP 400). The security token included in the request is invalid.
Anyways, these might be two different problems altogether, but it seems like the function is only picking up the default profile from the .config file.
Again, any insights and help are greatly appreciated! I'm really hoping to figure this out :)
Will have a little look tomorrow :)
Awesome, thanks!!
I am just a little confused about what is going on, sorry. Do you mind running through it with me again so I have it straight in my head :)
Please correct me if I have misunderstood:
- The default profile connects to the dev AWS account, where you want to connect to the dev-a database.
- The other-profile-that-is-not-default profile connects to the prod AWS account, where you want to connect to the prod-a database.
Is this correct?
Your initial comment was able to make the connection to both AWS accounts, but not to the right databases, as you were able to get connection objects from both dbConnect()
and view the metadata with dbGetInfo().
With your initial .credentials/.config
file did you try the following?
con_2 <- dbConnect(noctua::athena(),
profile_name = "other-profile-that-is-not-default",
schema_name = "prod-a",
s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/")
Hey, thanks for following up. I'm happy to provide some more details. Yes, I tried the code at the bottom but that did not work.
For context, assume that dev-a and prod-a are completely different AWS accounts (which they are), like from two different customers. Under each of these accounts, there are several schemas with one or more tables.
However, I tried to narrow the use case down even more to make potential debugging easier :)
I wrote a short test script, which I then ran from the terminal and which prints out a list of tables under the connection. This allowed me to narrow down the problem more:
library(noctua)
library(DBI)
con_3 <- dbConnect(noctua::athena(),
profile_name = "other-profile-that-is-not-default",
s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/",
region = "us-east-1")
dbListTables(con_3)
Observations:
- In the .credentials file, I removed the default profile. When you run AWS command-line tools with the other-profile-that-is-not-default profile active, it works fine, i.e. in the terminal, it lists all the right tables, etc. This shows that the credentials in the profile are valid and work. However, when running the test script above, it says it can't find any credentials, while the aws cli finds them fine.
- If there is no AWS environment variable, it does not pick up a named profile even when passed as a parameter to the function. This happens even when a static key/secret is put into the .credentials file for other-profile-that-is-not-default:
Error in f() : No credentials provided
Calls: dbListTables ... sign_with_body -> get_credentials -> call_with_args -> f
Execution halted
- Even when the AWS profile is passed in via the environment, whether you pass profile_name or not, you get the following error:
Error: UnrecognizedClientException (HTTP 400). The security token included in the request is invalid.
Execution halted
This seems to indicate 1) that the profile_name is being ignored for sure and the connection works only with a static key/secret under the [default] profile, and 2) that STS credentials, which have a session token and a security token, are not working at all, even when put in the default profile.
In conclusion, hopefully this test proved 1) that the profile_name parameter is being ignored, but the package is able to talk to different profiles/accounts when they are put in the [default] profile with a static key/secret, and 2) that STS credentials don't work at all, whether they are in [default] or not.
Please let me know if this makes sense. I'm happy to provide further insights/clarification.
Thanks again for your awesome responsiveness here!!! Happy Friday!
Ah ok cool, just had a little look, and the profile not being passed correctly looks like it was a known issue with the paws
sdk, which has been fixed in the dev version of paws.common:
remotes::install_github("paws-r/paws/paws.common")
From a quick test this fixes part of your issue, but I don't know if it will fix the second part.
For the STS, what code are you using? I might need to add more explicit support for SAML
access methods.
For the STS use-case, we are using a SAML provider in conjunction with sts assume-role-with-saml in order to establish the temporary STS session. STS generates temporary AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SECURITY_TOKEN values that can be stored in the ~/.aws/credentials and then used with standard SDK tools to call AWS services. Once the session is established, it works like any other STS session, so there is no more SAML involved at that point. This is similar to what you would get if you used normal keys and called aws sts get-session-token
to generate temporary session keys for short-term expiration or a multi-factor upgrade.
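As a rough illustration of the flow described above (all ARNs and file names here are placeholders, not values from this thread), the assume-role-with-saml step looks roughly like:

```shell
# Hypothetical sketch of establishing the temporary STS session.
# The response's Credentials block contains AccessKeyId, SecretAccessKey
# and SessionToken, which are then written into ~/.aws/credentials.
aws sts assume-role-with-saml \
  --role-arn arn:aws:iam::111122223333:role/example-role \
  --principal-arn arn:aws:iam::111122223333:saml-provider/example-idp \
  --saml-assertion file://assertion.b64
```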
@pmarkert Thanks for the extra information. I will need to create a branch that gives more explicit support for SAML connection methods, similar to assume_role(). I will need to do a little extra reading so I know how to set up my own AWS account with a SAML connection method so I can add some unit tests around this :)
@DyfanJones thanks a lot!!
@mraess I will create an initial pull request - is it possible for you to test it? I currently don't know how to set up my AWS account for SAML connections. This would be a massive help :)
Actually, it looks like this second issue is also related to that underlying driver library. paws-r/paws#280 describes exactly the problem. From your code, you are already passing the aws_session_token through any time the access_key is referenced, which would be all you need to do to ensure proper support, so I don't know that you need to set up any SAML or role integrations.
@pmarkert Thanks for finding that out. @mraess are you able to try this out with the dev version of paws.common?
remotes::install_github("paws-r/paws/paws.common")
@DyfanJones yes, I'll get on that later today or early next week and let you know how it goes :)
I am getting the package ready for the next release, which comes with retry functionality, dbFetch
being able to return data in chunks greater than 1000 when n != -1,
and a simple wrapper for converting backend s3 files. If you come into any more issues, I would like to capture them in this release so that it is all live on CRAN :)
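The chunked dbFetch behaviour mentioned above can be sketched with standard DBI calls (the connection and table are assumed, reusing the temp.iris_parquet example from earlier in the thread):

```r
library(DBI)
library(noctua)

con <- dbConnect(athena())
res <- dbSendQuery(con, "select * from temp.iris_parquet")

# Assumed sketch: with the new release, n can exceed Athena's usual
# 1000-row page size; n = -1 would fetch the whole result set.
chunk <- dbFetch(res, n = 5000)

dbClearResult(res)
```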
@DyfanJones Sounds great!
I was able to test with the updated paws library, and it looks like it is working to me - both parts: reading from a named profile and picking up the STS session. Thanks!
Thanks @pmarkert for testing. I will close this ticket. If this issue crops up again please re-open. Or open up another ticket :)