
athena_query_to_dataframe_speedtests

We're testing against 10 million records from the flights database. The dataset is 1.29GB on disk in csv format.

Performance so far:

In Python

  • Athena create table as using parquet, snappy compression: 19 seconds
  • Athena create table as using parquet, gzip compression: 21 seconds
  • Athena standard query outputting to csv: 117 seconds
  • Read from disk in csv format on Macbook Pro: 40 seconds
  • Read from disk in parquet format on Macbook Pro: 8 seconds
  • Read from sqlite database on disk: 160 seconds

This repo contains various code snippets that we've used to time the performance of data reads into pandas.

Specifically, we're interested in optimising reading sql query results into pandas dataframes.

In this repo we test various things:

  • Reading from Athena using ODBC/JDBC drivers
  • Reading from Athena by picking up the results from the .csv byproducts (see the sketch after this list)
  • Performing a Create Table As and reading the results from parquet
  • Comparison to reading from csv and parquet on disk
  • Comparison to reading from a sqlite database on disk
  • Comparison to reading from an RDS/postgres instance
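
Below is a minimal sketch of the "pick up the .csv byproduct" approach: Athena writes every query's results to S3 as a CSV named after the query execution id, so we can fetch that file directly instead of paging through the API. The bucket, region, database and table names are hypothetical placeholders, and reading an s3:// path with pandas assumes s3fs is installed.

    import time

    import boto3
    import pandas as pd

    athena = boto3.client("athena", region_name="eu-west-1")
    RESULTS_BUCKET = "my-athena-results"  # hypothetical bucket name

    # Kick off the query; Athena writes the results to OutputLocation as a CSV
    response = athena.start_query_execution(
        QueryString="SELECT * FROM flights_db.flights",
        ResultConfiguration={"OutputLocation": f"s3://{RESULTS_BUCKET}/tmp/"},
    )
    query_id = response["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    # Read the CSV byproduct directly rather than via the Athena results API
    df = pd.read_csv(f"s3://{RESULTS_BUCKET}/tmp/{query_id}.csv")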

So… I’ve been using Athena’s create table as to get better performance when reading a sql query into pandas memory. (Once the Apache Arrow project delivers parquet reads in R, we should be able to get the same benefits there.) With create table as we get to choose our output format, so we can do query -> parquet dump in s3. We can also choose the number of partitions (files), and we can then use pyarrow to read a folder of parquet files, increasing read performance.

First results are reasonably good. We can perform a query that returns a ~1.1GB file (well, it’s 1.1GB in csv format; the actual output is four smaller parquet files) and load that into a pandas dataframe. The whole thing takes 19 seconds. The parquet output has the added advantage that we don’t need to worry about type conversion.
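As a sketch of that workflow (the table, bucket and column names are illustrative, not the repo's actual ones), the CTAS statement below asks Athena to dump the query results as four snappy-compressed parquet files in s3, which recent versions of pyarrow can read back as a single dataset:

    import pyarrow.parquet as pq

    ctas = """
    CREATE TABLE flights_db.flights_tmp
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = 's3://my-bucket/ctas-output/flights_tmp/',
        bucketed_by = ARRAY['carrier'],  -- illustrative column
        bucket_count = 4                 -- four output files
    ) AS
    SELECT * FROM flights_db.flights
    """
    # ...run `ctas` via start_query_execution as in the earlier sketch,
    # and wait for it to reach SUCCEEDED...

    # Read the folder of parquet files that CTAS produced into one dataframe
    df = pq.read_table("s3://my-bucket/ctas-output/flights_tmp/").to_pandas()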

robinlinacre [3:58 PM] Reading the same dataset from disk (SSD) as a .csv file on my Macbook Pro takes 40 seconds; as a .parquet file it takes 8 seconds.
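
A rough version of the timing harness behind that on-disk comparison (file paths are placeholders for the 10-million-record flights extract):

    import time

    import pandas as pd

    start = time.time()
    df = pd.read_csv("flights.csv")          # ~40 seconds on the Macbook Pro
    print(f"csv: {time.time() - start:.1f}s")

    start = time.time()
    df = pd.read_parquet("flights.parquet")  # ~8 seconds on the Macbook Pro
    print(f"parquet: {time.time() - start:.1f}s")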

robinlinacre [4:18 PM] Timings of the same query in the current version of dbtools: 117 seconds in Python and 230 seconds in R. I think this means that with this new method we’re getting much faster query performance than we would from a relational database.
