DyfanJones commented on September 25, 2024

Hi @mraess ,

This behaviour is correct. profile_name doesn't affect the schema/database you are connecting to; it simply passes your credentials safely to the SDK so the connection can be made. So, in effect, you connect as a different user with each profile_name, depending on the credentials.

What I think you want to do is change the default schema when you make the connection. To do this you need to do the following:

library(DBI)
library(noctua)


# changing profile name but same schema
con1 <- dbConnect(athena())

con2 <- dbConnect(athena(), 
                  profile_name = "demo_profile")


# changing schema/database
con3 <- dbConnect(athena(),
                  schema_name = "temp")

con4 <- dbConnect(athena(), 
                  profile_name = "demo_profile",
                  schema_name = "temp")

dbGetInfo(con1)
# $profile_name
# [1] "default"
# 
# $s3_staging
# [1] "s3://mybucket/"
# 
# $dbms.name
# [1] "default"
# 
# $work_group
# [1] "primary"
# 
# $poll_interval
# NULL
# 
# $encryption_option
# NULL
# 
# $kms_key
# NULL
# 
# $expiration
# NULL
# 
# $region_name
# [1] "eu-west-1"
# 
# $paws
# [1] "0.1.7"
# 
# $noctua
# [1] "1.6.0.9000"

dbGetInfo(con2)
# $profile_name
# [1] "demo_profile"
# 
# $s3_staging
# [1] "s3://mybucket/"
# 
# $dbms.name
# [1] "default"
# 
# $work_group
# [1] "primary"
# 
# $poll_interval
# NULL
# 
# $encryption_option
# NULL
# 
# $kms_key
# NULL
# 
# $expiration
# NULL
# 
# $region_name
# [1] "eu-west-1"
# 
# $paws
# [1] "0.1.7"
# 
# $noctua
# [1] "1.6.0.9000"

dbGetInfo(con3)
# $profile_name
# [1] "default"
# 
# $s3_staging
# [1] "s3://mybucket/"
# 
# $dbms.name
# [1] "temp"
# 
# $work_group
# [1] "primary"
# 
# $poll_interval
# NULL
# 
# $encryption_option
# NULL
# 
# $kms_key
# NULL
# 
# $expiration
# NULL
# 
# $region_name
# [1] "eu-west-1"
# 
# $paws
# [1] "0.1.7"
# 
# $noctua
# [1] "1.6.0.9000"

dbGetInfo(con4)
# $profile_name
# [1] "demo_profile"
# 
# $s3_staging
# [1] "s3://mybucket/"
# 
# $dbms.name
# [1] "temp"
# 
# $work_group
# [1] "primary"
# 
# $poll_interval
# NULL
# 
# $encryption_option
# NULL
# 
# $kms_key
# NULL
# 
# $expiration
# NULL
# 
# $region_name
# [1] "eu-west-1"
# 
# $paws
# [1] "0.1.7"
# 
# $noctua
# [1] "1.6.0.9000"

I hope this makes sense :)

DyfanJones commented on September 25, 2024

When you make a connection and you have multiple schemas in your Athena instance, you can still query all of them, for example:

library(DBI)
library(noctua)

# changing profile name but same schema
con1 <- dbConnect(athena())

dbGetQuery(con1, "select * from temp.iris_parquet")
# Info: (Data scanned: 666 Bytes)
# petal_length petal_width time_stamp
# 1:          1.4         0.2   20200421
# 2:          1.4         0.2   20200421
# 3:          1.3         0.2   20200421
# 4:          1.5         0.2   20200421
# 5:          1.4         0.2   20200421
# ---                                    
#   146:          5.2         2.3   20200421
# 147:          5.0         1.9   20200421
# 148:          5.2         2.0   20200421
# 149:          5.4         2.3   20200421
# 150:          5.1         1.8   20200421

The benefit of setting schema_name in the connection shows up when you use dplyr, for example:

library(dplyr)
con2 <- dbConnect(athena(),
                 schema_name = "temp")

tbl(con2, "iris_parquet")
# Info: (Data scanned: 666 Bytes)
# Source:   table<iris_parquet> [?? x 3]
# Database: Athena 0.1.7 [default@eu-west-1/temp]
# petal_length petal_width time_stamp
# <dbl>       <dbl> <chr>     
#   1          1.4         0.2 20200421  
# 2          1.4         0.2 20200421  
# 3          1.3         0.2 20200421  
# 4          1.5         0.2 20200421  
# 5          1.4         0.2 20200421  
# 6          1.7         0.4 20200421  
# 7          1.4         0.3 20200421  
# 8          1.5         0.2 20200421  
# 9          1.4         0.2 20200421  
# 10          1.5         0.1 20200421  
# … with more rows

However, the same thing can still be achieved with dbplyr::in_schema():

tbl(con1, dbplyr::in_schema("temp","iris_parquet"))
# Info: (Data scanned: 666 Bytes)
# Source:   table<temp.iris_parquet> [?? x 3]
# Database: Athena 0.1.7 [default@eu-west-1/default]
# petal_length petal_width time_stamp
# <dbl>       <dbl> <chr>     
#   1          1.4         0.2 20200421  
# 2          1.4         0.2 20200421  
# 3          1.3         0.2 20200421  
# 4          1.5         0.2 20200421  
# 5          1.4         0.2 20200421  
# 6          1.7         0.4 20200421  
# 7          1.4         0.3 20200421  
# 8          1.5         0.2 20200421  
# 9          1.4         0.2 20200421  
# 10          1.5         0.1 20200421  
# … with more rows

I hope this has helped with your initial query. I will leave this open just in case I misunderstood your initial question.

mraess commented on September 25, 2024

Hey, thanks for the quick answer - I'll certainly give this a shot. One question though: wouldn't the different s3 staging directories determine which database it connects to? And how does it know which database is the standard one I'm connecting to right now?

I guess what I'm asking is: if I have all my tables in AWS Athena, but some are under dev-a, which the default credentials are tied to, and others under a second schema/database, prod-a, which the non-default profile is tied to, would my problem be solved with the schema setting you mentioned? I'm asking because I was under the impression that the staging directory is tied to the specific database, i.e. dev-a or prod-a.

Let me know if this makes sense :)

DyfanJones commented on September 25, 2024

s3_staging_dir is the S3 bucket that AWS Athena outputs its results into. noctua retrieves those results and returns them to R. By default, noctua cleans up after each query to keep your S3 buckets tidy. However, you can turn that off by activating the caching feature: AWS Athena Query Caching.
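
For reference, caching is switched on through noctua_options(); a minimal sketch (the cache_size value here is illustrative):

library(DBI)
library(noctua)

# Cache up to 10 query executions locally; a repeated query can then
# reuse the existing S3 result instead of re-running against Athena.
noctua_options(cache_size = 10)

con <- dbConnect(athena())
dbGetQuery(con, "select * from iris") # first call runs the query
dbGetQuery(con, "select * from iris") # repeat call can reuse the cached result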

By default, schema_name = "default" - this is simply hard-coded. So, by changing schema_name to the correct database, it should work. In the backend, noctua passes your connection's database to the Athena API call functions, which is what I am guessing is the problem you are running into.

I am not sure how to get the schema_name from your profile using the AWS SDKs. If you know how, I am happy to receive a pull request, or you can share the code in this thread and I can add it to the package :)

mraess commented on September 25, 2024

Haha, sounds good - I just might have to hard-code the schema_name. To clarify, this should work even if dev-a and prod-a are different databases with their own schemas/tables, correct? E.g. for ingest.table-a in dev-a, would ingest be the schema name?

DyfanJones commented on September 25, 2024

It should be able to query across schemas, for example:

library(DBI)
library(noctua)

con <- dbConnect(athena(),
                  schema_name = "default")

# querying default schema
dbGetQuery(con, "select * from iris")
# Info: (Data scanned: 860 Bytes)
# sepal_length sepal_width petal_length petal_width   species
# 1:          5.1         3.5          1.4         0.2    setosa
# 2:          4.9         3.0          1.4         0.2    setosa
# 3:          4.7         3.2          1.3         0.2    setosa
# 4:          4.6         3.1          1.5         0.2    setosa
# 5:          5.0         3.6          1.4         0.2    setosa
# ---                                                            
# 146:          6.7         3.0          5.2         2.3 virginica
# 147:          6.3         2.5          5.0         1.9 virginica
# 148:          6.5         3.0          5.2         2.0 virginica
# 149:          6.2         3.4          5.4         2.3 virginica
# 150:          5.9         3.0          5.1         1.8 virginica

# querying temp schema
dbGetQuery(con, "select * from temp.iris_parquet")
# Info: (Data scanned: 666 Bytes)
# petal_length petal_width time_stamp
# 1:          1.4         0.2   20200421
# 2:          1.4         0.2   20200421
# 3:          1.3         0.2   20200421
# 4:          1.5         0.2   20200421
# 5:          1.4         0.2   20200421
# ---                                    
# 146:          5.2         2.3   20200421
# 147:          5.0         1.9   20200421
# 148:          5.2         2.0   20200421
# 149:          5.4         2.3   20200421
# 150:          5.1         1.8   20200421

mraess commented on September 25, 2024

Thanks for the quick replies!! I'll check this out for sure. Thanks also for keeping the issue open for now!

mraess commented on September 25, 2024

Hey, so I implemented your suggestions - unfortunately to no avail. I have been able to narrow the problem down some more though.

A little more context: both dev-a and prod-a are in two completely different AWS accounts.

When I replace the key/secret for dev-a in the default profile in .credentials with a key/secret pair for prod-a, it does connect to prod-a, even though everything in the actual R code is left the same, e.g.

con_2 <- dbConnect(noctua::athena(),
                   profile_name = "other-profile-that-is-not-default",
                   s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/", 
                   region = "us-east-1")

pointing to a non-default profile.

This leads me to believe that, for some reason, only the default profile is being picked up, because what changed wasn't the R code but the underlying credentials in the default profile, while the function was pointing to a non-default profile.

Another thing I tried was to put SAML credentials into the default profile in the .credentials file to see if that would work.

[default]
aws_access_key_id = XXXXX
aws_secret_access_key = XXXX
aws_session_token = XXXXX
aws_security_token = XXXXXX

In the .config file I put

[default]
region = us-east-1
saml.session_duration = 43200
saml.idp_arn = arn:aws:iam::XXXX:saml-provider/g-suite
saml.role_arn = arn:aws:iam::XXXXXXX:role/other-profile_that-is-not-default

This, however, results in an error:

Error: UnrecognizedClientException (HTTP 400). The security token included in the request is invalid.

Anyway, these might be two different problems altogether, but it seems like the function is only picking up the default profile from the .config file.

Again, any insights and help are greatly appreciated! I'm really hoping to figure this out :)

DyfanJones commented on September 25, 2024

Will have a little look tomorrow :)

mraess commented on September 25, 2024

Awesome, thanks!!

DyfanJones commented on September 25, 2024

I am just a little confused about what is going on, sorry. Do you mind running through it with me again so I have it straight in my head? :)

Please correct me if I have misunderstood:

  • The default profile connects to the dev AWS account, where you want to connect to the dev-a database.
  • The other-profile-that-is-not-default profile connects to the prod AWS account, where you want to connect to the prod-a database.

Is this correct?

In your initial comment you were able to make the connection to both AWS accounts, but not to the right databases, as you were able to get connection objects from both dbConnect() calls and view the metadata with dbGetInfo().

With your initial .credentials/.config file did you try the following?

con_2 <- dbConnect(noctua::athena(),
                   profile_name = "other-profile-that-is-not-default",
                   schema_name = "prod-a",
                   s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/")

mraess commented on September 25, 2024

Hey, thanks for following up. I'm happy to provide some more details. Yes, I tried the code at the bottom but that did not work.

For context, assume that dev-a and prod-a are completely different AWS accounts (which they are), like from two different customers. Under each of these accounts there are several schemas, with one or more tables.

However, I tried to narrow the use case down even more to make potential debugging easier :)

I wrote a short test script, which I then ran from the terminal and which prints out a list of tables under the connection. This allowed me to narrow down the problems further:

library(noctua)
library(DBI)
con_3 <- dbConnect(noctua::athena(),
                   profile_name = "other-profile-that-is-not-default"
                   s3_staging_dir = "s3://aws-athena-query-results-for-other-profile-us-east-1/", 
                   region = "us-east-1")

dbListTables(con_3)

Observations:

  1. In the .credentials file, I removed the default profile. When you run AWS CLI commands with the other-profile-that-is-not-default profile active, it works fine, i.e. in the terminal it lists all the right tables, etc. This shows that the credentials in the profile are valid and work. However, when running the test script above, it says it can't find any credentials, while the AWS CLI finds them fine.

  2. If there is no AWS environment variable, it does not pick up a named profile, even when one is passed as a parameter to the function. This happens even when a static key/secret pair is put into the .credentials file for other-profile-that-is-not-default:

Error in f() : No credentials provided
Calls: dbListTables ... sign_with_body -> get_credentials -> call_with_args -> f
Execution halted

  3. Even when the AWS profile is passed in via the environment, whether you pass profile_name or not, you get the following error:
Error: UnrecognizedClientException (HTTP 400). The security token included in the request is invalid.
Execution halted

This seems to indicate 1) that profile_name is definitely being ignored and the connection works only with a static key/secret under the [default] profile, and 2) that STS credentials, which have a session token and a security token, are not working at all, even when put in the default profile.

In conclusion, this test hopefully proved 1) that the profile_name parameter is being ignored, but the package is able to talk to different profiles/accounts when their static key/secret is put in the [default] profile, and 2) that STS credentials don't work at all, whether they are in [default] or not.

Please let me know if this makes sense. I'm happy to provide further insights/clarification.

Thanks again for your awesome responsiveness here!!! Happy Friday!

DyfanJones commented on September 25, 2024

Ah ok, cool. I just had a little look, and the profile not being passed correctly looks like it was a known issue with the paws SDK, which has been fixed in the dev version of paws.common:

paws-r/paws#283

remotes::install_github("paws-r/paws/paws.common")

From a quick test this fixes part of your issue, but I don't know if it will fix the second part.

For the STS, what code are you using? I might need to add more explicit support for SAML access methods.

pmarkert commented on September 25, 2024

For the STS use-case, we are using a SAML provider in conjunction with sts assume-role-with-saml in order to establish the temporary STS session. STS generates temporary AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SECURITY_TOKEN values that can be stored in ~/.aws/credentials and then used with standard SDK tools to call AWS services. Once the session is established it works like any other STS session, so there is no more SAML involved at that point. This is similar to what you would get if you used normal keys and called aws sts get-session-token to generate temporary session keys for short-term expiration or a multi-factor upgrade.
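
A sketch of how those temporary keys could be handed to noctua directly, assuming dbConnect() accepts the explicit credential arguments aws_access_key_id, aws_secret_access_key, and aws_session_token (all values below are placeholders):

library(DBI)
library(noctua)

# Pass the temporary STS session credentials straight to the connection
# instead of reading them from ~/.aws/credentials.
con <- dbConnect(athena(),
                 aws_access_key_id     = "ASIAXXXXXXXXXXXXXXXX",
                 aws_secret_access_key = "XXXXXXXXXXXXXXXXXXXX",
                 aws_session_token     = "XXXXXXXXXXXXXXXXXXXX",
                 s3_staging_dir        = "s3://mybucket/",
                 region_name           = "us-east-1")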

DyfanJones commented on September 25, 2024

@pmarkert Thanks for the extra information. I will need to create a branch that gives more explicit support for SAML connection methods, similar to assume_role(). I will need to do a little extra reading so I know how to set up my own AWS account with a SAML connection method, so I can add some unit tests around this :)
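
For comparison, the existing assume-role path is driven from the connection call; a sketch assuming dbConnect()'s role_arn argument (the ARN is a placeholder):

library(DBI)
library(noctua)

# noctua assumes the given IAM role via STS before connecting to Athena.
con <- dbConnect(athena(),
                 role_arn       = "arn:aws:iam::123456789012:role/demo-role",
                 s3_staging_dir = "s3://mybucket/",
                 region_name    = "us-east-1")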

mraess commented on September 25, 2024

@DyfanJones thanks a lot!!

DyfanJones commented on September 25, 2024

@mraess I will create an initial pull request - is it possible for you to test it? I currently don't know how to set up my AWS account for SAML connections, so this would be a massive help :)

pmarkert commented on September 25, 2024

Actually, it looks like this second issue is also related to that underlying driver library. paws-r/paws#280 describes exactly the problem. From your code, you are already passing the aws_session_token through any time the access_key is referenced, which would be all you need to do to ensure the proper support, so I don't know that you need to set up any SAML or role integrations.

DyfanJones commented on September 25, 2024

@pmarkert Thanks for finding that out. @mraess, are you able to try this out with the dev version of paws.common?

remotes::install_github("paws-r/paws/paws.common")

mraess commented on September 25, 2024

@DyfanJones yes, I'll get on that later today or early next week and let you know how it goes :)

DyfanJones commented on September 25, 2024

I am getting the package ready for the next release, which comes with retry functionality, a dbFetch that can return data in chunks greater than 1000 when n != -1, and a simple wrapper for converting backend S3 files. If you come across any more issues, I would like to capture them in this release so that it all goes live on CRAN :)
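
A quick DBI-level sketch of the chunked fetching described above (the table name is just an example):

library(DBI)
library(noctua)

con <- dbConnect(athena())

res <- dbSendQuery(con, "select * from temp.iris_parquet")
chunk <- dbFetch(res, n = 5000) # a single chunk, no longer capped at 1000 rows
rest  <- dbFetch(res, n = -1)   # fetch the remaining rows in one go
dbClearResult(res)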

mraess commented on September 25, 2024

@DyfanJones Sounds great!

pmarkert commented on September 25, 2024

I was able to test with the updated paws library and it looks like it is working for me - both parts, reading from a named profile and picking up the STS session. Thanks!

DyfanJones commented on September 25, 2024

Thanks @pmarkert for testing. I will close this ticket. If this issue crops up again, please re-open it or open another ticket :)
