
dbtools

This is a simple package that lets you query databases using Amazon Athena and get the S3 path to the Athena output (as a CSV). This is significantly faster than using the database drivers provided by Amazon, so it can be a good option when pulling in large data.

Note: this package works alongside user IAM policies on the Analytical Platform and requires you to be given standard database access. If you are in our GitHub organisation you will be able to access the repo to request standard database access here.

Setup

This package uses the Python package pydbtools under the hood. Make sure your RStudio deployment is up to date and has Python 3.6 or higher installed.

Before using dbtools you'll need to install pydbtools v2.0.1 or higher. Do this from the R terminal:

# in terminal
pip install -U pydbtools==2.0.2
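
Optionally, you can check that R can see the right Python and that pydbtools is importable. A quick sanity check via reticulate (which dbtools uses to call Python) looks like this:

# in R
reticulate::py_config()                       # shows which Python interpreter will be used
reticulate::py_module_available("pydbtools")  # should return TRUE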

Then install dbtools itself. The best way to do this is via conda:

# in terminal
conda install -c moj-analytical-services r-dbtools 

Package requirements are:

  • s3tools (preinstalled)

  • reticulate

  • python (preinstalled - version 3.6 or higher)

  • boto3 (preinstalled)

  • readr (preinstalled)

  • data.table (version 1.11.8 or above)

Examples

The easiest way to read in the data:

# returns the result of the SQL query as a tibble, with matching data types
df <- dbtools::read_sql("SELECT * from crest_v1.flatfile limit 10000")

# read the same query in as a data.table instead
dt <- dbtools::read_sql("SELECT * from crest_v1.flatfile limit 10000", return_df_as = "data.table")

If you want to read in your data using a specific method:

### Read SQL query output using your own CSV reader
response <- dbtools::get_athena_query_response("SELECT * from crest_v1.flatfile limit 10000")

# print out the path to the Athena query output (a CSV in S3)
print(response$s3_path)

# print out the meta data
print(response$meta)

# read in the data using whichever CSV reader you want
# (here data.table::fread, reading every column as a string)
s3_path_stripped <- gsub("s3://", "", response$s3_path)
df <- s3tools::read_using(FUN = data.table::fread, s3_path = s3_path_stripped)

Meta data conformance

When using the read_sql function you can specify the type of dataframe to return:

  • tibble (default)
  • data.table
  • dataframe

Note: to find out more about this function, see its documentation, i.e. ?dbtools::read_sql

Each is a type of dataframe in R, and each has different quirks when converting from Athena data types to R data types.

  • tibble: This is the default choice, as it is the only dataframe type that converts dates and datetimes (aka timestamps) on read, rather than requiring a second parse of the data to convert them from strings. It is a good option if your data is not that large and you like tidyverse things. One downside is that long integers are actually stored as doubles, because tibbles currently don't support 64-bit integers - see issue.

  • data.table: This dataframe class is a good choice for larger datasets, as it is more memory efficient. Long integers are read in as int64. Dates and datetimes are read in as strings, because data.table::fread doesn't convert them on read - see documentation. Feel free to cast the columns afterwards (see the sketch after this list).

  • dataframe: Supported because it's the base dataframe type in R. However, Athena exports CSVs with every value in double quotes; because of this, the scan function that read.csv calls internally throws an error unless you specify the columns as character (see issue). Therefore the returned dataframe has every column typed as character. Feel free to cast the columns afterwards.
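
As a minimal sketch of that casting step (date_col and datetime_col are hypothetical column names - substitute your own), converting the string columns of a data.table result might look like:

# in R
library(data.table)

dt <- dbtools::read_sql(
  "SELECT * from crest_v1.flatfile limit 10000",
  return_df_as = "data.table"
)

# cast the string columns to proper R date/datetime types
# (date_col and datetime_col are hypothetical column names)
dt[, date_col := as.Date(date_col)]
dt[, datetime_col := as.POSIXct(datetime_col, tz = "UTC")]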

Meta data conversion

Below is a table showing how each of our generic data types is converted for each supported dataframe type in R (when using the read_sql function):

data type | tibble type (R atomic type)        | data.table type (R atomic type) | dataframe type (R atomic type)
----------|------------------------------------|---------------------------------|-------------------------------
character | readr::col_character() (character) | character                       | character
int       | readr::col_integer() (integer)     | integer                         | character
long      | readr::col_double() (double)       | bit64::integer64() (double)     | character
date      | readr::col_date() (double)         | character                       | character
datetime  | readr::col_datetime() (double)     | character                       | character
boolean   | readr::col_logical() (logical)     | logical                         | character
float     | readr::col_double() (double)       | double                          | character
double    | readr::col_double() (double)       | double                          | character

Note: If the R atomic type is not listed in the table above then it is the same as the type specified
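
To check the conversion yourself, you can inspect the classes of the returned columns; a small check using the earlier example query:

# in R
df <- dbtools::read_sql("SELECT * from crest_v1.flatfile limit 10000")

# print the R class of each column to compare against the table above
sapply(df, class)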

Meta data

The output from dbtools::get_athena_query_response(...) is a list, and one of its keys is meta. The meta key is a list where each element gives the name (name) and data type (type) of a column in your Athena query output. For example, this table output:

col1 col2
1 2018-01-01
2 2018-01-02
...

would have a meta like:

response$meta[[1]]$name # col1
response$meta[[1]]$type # int

response$meta[[2]]$name # col2
response$meta[[2]]$type # date
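
If you want all the column names and types at once, a minimal sketch that loops over the meta list (reusing the response object from above) could be:

# build a named character vector mapping column names to their types
types <- sapply(response$meta, function(m) m$type)
names(types) <- sapply(response$meta, function(m) m$name)
print(types)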

The meta types follow the generic meta data types used in etl_manager. If you want the actual Athena meta data types instead, set the return_athena_types input parameter to TRUE, e.g.

response <- dbtools::get_athena_query_response("SELECT * from crest_v1.flatfile limit 10000", return_athena_types=TRUE)

print(response$meta)

Notes:

  • Amazon Athena uses a flavour of SQL called Presto; docs can be found here
  • To query a date column in Athena you need to specify that your value is a date, e.g. SELECT * FROM db.table WHERE date_col > date '2018-12-31'
  • To query a datetime or timestamp column in Athena you need to specify that your value is a timestamp, e.g. SELECT * FROM db.table WHERE datetime_col > timestamp '2018-12-31 23:59:59'
  • Note the date and datetime formatting used above. See more specifics around dates and datetimes here
  • To specify a string in your SQL query, always use '' not "". Using "" means that you are referencing a database, table or column, etc.
  • When data is pulled back into RStudio, the column types are either R characters (for any column that was a date, datetime or character) or doubles (for everything else).
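
Putting the quoting and date rules together, a hedged example (db.table, date_col and status_col are hypothetical names) might be:

# in R
df <- dbtools::read_sql(
  "SELECT * FROM db.table
   WHERE date_col > date '2018-12-31'
   AND status_col = 'open'"
)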

Under The Hood

When you run an SQL query against our databases you are using Athena. When Athena produces the output of a query, it always writes it to a location in S3 as a CSV. dbtools defines that S3 location based on your AWS role: it writes the output CSV into a folder only you have read/write access to, reads it in using s3tools, and then deletes the CSV from your folder once the data has been read into a dataframe.

Note: dbtools requires you to have the StandardDatabaseAccess group policy attached. If you want to use dbtools, please ask the data engineering team (ideally on Slack via the #analytical_platform channel).

Changelog:

2.0.3 - 2020-04-29

  • Fixes prompts to install miniconda - now automatically uses main Analytical Platform Conda Python, based on sys path

2.0.2 - 2019-06-14

  • Fixed issue where credentials would not refresh
  • Is now dependent on the pydbtools package
  • SQL queries like SHOW COLUMNS FROM db.table now work for read_sql and return a df.

2.0.1 - 2019-04-23

  • Updated the version in the DESCRIPTION file to the correct version

v2.0.0 - 2019-02-08

  • Removed input parameters bucket and output_folder from read_sql and get_athena_query_response functions. New section to README named 'Under The Hood' explains why.
  • Note package now requires the group policy StandardDatabaseAccess to be attached to the role that needs to use this package.

v1.0.0 - 2019-01-14

  • Added function read_sql which reads an SQL query directly into an R dataframe. See R documentation (i.e. ?read_sql)
  • Input parameter out_path in function get_athena_query_response has been replaced by two input parameters: bucket and output_folder. E.g. if your out_path="s3://my-bucket/__temp__" then the new input params are bucket=my-bucket and output_folder=__temp__. Note that output_folder defaults to the value __athena_temp__; it is recommended that you leave this unchanged.

v0.0.2 - 2018-10-12

  • timeout is now an input parameter to get_athena_query_response; if not set, there is no timeout for the Athena query.
  • get_athena_query_response will now print out the athena_client response if the Athena query fails.

v0.0.1 - First Release
