athena_query_to_dataframe_speedtests

We're testing against 10 million records from the flights database. The dataset in csv format is 1.29Gb on disk.

Performance so far:

In Python

Athena create table as using parquet, snappy compression: 19 seconds Athena create table as using parquet, gzip compression: 21 seconds Athena standard query outputting to csv: 117 seconds Read from disk in csv format on macbook pro: 40 seconds Read from disk in parquet format on macbook pro: 8 seconds Read from sqlite databse on disk: 160 seconds

This repo contains various code snippets that we've used to time the performance of data reads into pandas.

Specifically, we're interested in optimising reading sql query results into pandas dataframes.

In this repo we test various things:

Reading from Athena using ODBC/JDBC drivers
Reading from Athena by picking up the results from the .csv byproducts
Performing a Create Table As and reding the results from parquet
Comparison to reading from csv and parquet on disk
Comparison to reading from a sqlite database on disk
Comparison to reading from a RDS/postgres instance

So….i’ve been using Athena’s create table as to get better performance of reading a sql query into pandas memory. Once the apache arrow projects delivers parquet reads in R, we should be able to get the same benefits by using create table as we get to choose our output format, so we can do query -> parquet dump in s3. We can also choose the number of partitions (files). we can then use pyarrow to read a folder of parquet files, increasing read performance First results are reasonably good. We can perform a query that returns a ~1.1Gb file (well, it’s 1.1Gb in csv format, the actual output is four smaller parquet files), and load that into a pandas dataframe. The whole thing takes 19 seconds. (edited) The parquet output has the advantage that we don’t need to worry about type conversion.

robinlinacre [3:58 PM] Reading the same dataset from disk (SSD) as a .csv file on my Macbook pro takes 40 seconds. As a .parquet takes 8 seconds. (edited)

robinlinacre [4:18 PM] Timings of the same query in current version of dbtools: 117 seconds in Python and 230 seconds for R. (edited) I think this means using this new method we’re getting much faster query performance than we would from a relational database. (edited)

uk-gov-mirror / moj-analytical-services.athena_query_to_dataframe_speedtests Goto Github PK

moj-analytical-services.athena_query_to_dataframe_speedtests's Introduction

athena_query_to_dataframe_speedtests

Performance so far:

In Python

moj-analytical-services.athena_query_to_dataframe_speedtests's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent