Giter Club home page Giter Club logo

Comments (7)

DyfanJones avatar DyfanJones commented on June 17, 2024

Set up awswrangler example for noctua benchmarks

import awswrangler as wr

import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/data/"


if "awswrangler_test" not in wr.catalog.databases().values:
    wr.catalog.create_database("awswrangler_test")

cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/189",
    names=cols,
    parse_dates=["dt", "obs_time"])  # Read 10 files from the 1890 decade (~1GB)

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="awswrangler_test",
    table="noaa"
);

wr.catalog.table(database="awswrangler_test", table="noaa")

Initial benchmark testing:

remotes::install_github("dyfanjones/noctua", ref="parquet_unload")

library(DBI)

con <- dbConnect(noctua::athena())

# AWS Athena result is outputed as CSV and then read into R.
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa")
})
# Info: (Data scanned: 80.88 MB)
# user  system elapsed 
# 81.819  38.409 667.637 

# AWS Athena result is outputed as Parquet and then read into R.
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
# user  system elapsed 
# 17.899   2.992  57.022 

from noctua.

DyfanJones avatar DyfanJones commented on June 17, 2024

When caching is enable extra time performance is gained:

noctua::noctua_options(cache_size = 1)

# AWS Athena result is outputed as Parquet and then read into R.
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
# user  system elapsed 
# 17.899   2.992  57.022 

system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})

# Info: (Data scanned: 80.88 MB)
# user  system elapsed 
# 14.094   3.117  34.663 

from noctua.

DyfanJones avatar DyfanJones commented on June 17, 2024

The unload method does come with some down side. Possibly create a vignette and document pros and cons of both methods, so that users are fully aware of what to expect.

from noctua.

DyfanJones avatar DyfanJones commented on June 17, 2024

Benchmark running on sagemaker ml.t3.xlarge instance:

library(DBI)

con <- dbConnect(noctua::athena())

# Query ran using CSV output
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa")
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  57.004   8.430 160.567 

noctua::noctua_options(cache_size = 1)

# Query ran using UNLOAD Parquet output
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed 
#  21.622   2.350  39.232 

# Query ran using cache
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed 
#  17.514   2.430  14.249 

# Query ran using cache with unload parameter not specified
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa")
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed 
#  15.319   2.445  11.924 

Results:

csv:
seconds:160.567

unload:
seconds: 39.232
4X faster

cache unload:
seconds:11.924
13.4X faster

noctua: 2.2.0.9000
R: 4.1.1

from noctua.

DyfanJones avatar DyfanJones commented on June 17, 2024

Comparison to awswrangler:

Note: Benchmarks ran on sagemaker ml.t3.xlarge instance:

import awswrangler as wr

# Query with AWS Athena csv output
%%time
wr.athena.read_sql_query("SELECT * FROM noaa", database="awswrangler_test", ctas_approach=False)
# CPU times: user 1min 30s, sys: 5.78 s, total: 1min 36s
# Wall time: 3min 9s
# seconds: 189

# Query with AWS Athena Parquet output
%%time
wr.athena.read_sql_query("SELECT * FROM noaa", database="awswrangler_test")
# CPU times: user 1min 3s, sys: 6.53 s, total: 1min 10s
# Wall time: 1min 32s
# seonds: 92

# Query with AWS Athena Parquet output
%%time
wr.athena.read_sql_query("SELECT * FROM noaa", database="awswrangler_test", categories=["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"])
# CPU times: user 18.1 s, sys: 2.21 s, total: 20.3 s
# Wall time: 35.9 s
# seconds: 35.9

Results:

csv:
seconds: 189

ctas:
seonds: 92
2.05X faster

ctas with categories:
seconds: 35.9
5.26X faster

awswrangler version: 2.11.0
python: 3.6.13

from noctua.

DyfanJones avatar DyfanJones commented on June 17, 2024

Closing issues as AWS Athena Unload has been added the RAthena and noctua.

from noctua.

OssiLehtinen avatar OssiLehtinen commented on June 17, 2024

Great stuff!

from noctua.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.