These initial benchmarks were carried out using the dev branch "append-static-s3-dir"; however, a previous version of noctua should be sufficient, as this is benchmarking AWS Athena.
# install the noctua dev branch: append-static-s3-dir
remotes::install_github("dyfanjones/noctua", ref = "append-static-s3-dir")
Benchmark code
library(DBI)
library(data.table)
X <- 1e8
# 100 million rows: an integer id, a random letter and a random logical
value <- data.table(x = 1:X,
                    y = sample(letters, X, replace = TRUE),
                    z = sample(c(TRUE, FALSE), X, replace = TRUE))
con <- dbConnect(noctua::athena())
# write the same data as compressed parquet, split into files of at most max.batch rows
dbWriteTable(con, "test_split1", value, file.type = "parquet", compress = TRUE, overwrite = TRUE)
dbWriteTable(con, "test_split2", value, file.type = "parquet", compress = TRUE, overwrite = TRUE, max.batch = .5*X)
dbWriteTable(con, "test_split3", value, file.type = "parquet", compress = TRUE, overwrite = TRUE, max.batch = .1*X)
dbWriteTable(con, "test_split4", value, file.type = "parquet", compress = TRUE, overwrite = TRUE, max.batch = .05*X)
library(microbenchmark)
res <- microbenchmark(
  files_1 = dbGetQuery(con, "select * from test_split1 limit 10"),
  files_2 = dbGetQuery(con, "select * from test_split2 limit 10"),
  files_10 = dbGetQuery(con, "select * from test_split3 limit 10"),
  files_20 = dbGetQuery(con, "select * from test_split4 limit 10"),
  times = 10
)
library(ggplot2)
autoplot(res) +
  labs(title = "AWS Athena benchmark with compressed parquet file",
       subtitle = "compression type: snappy") +
  theme_bw() +
  theme(text = element_text(size = 15))
From this initial benchmark, it looks like splitting the parquet data across multiple files does have its advantages.
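To make the benchmark labels easier to follow, here is a minimal sketch of how the `max.batch` values used above map to the number of files per table. It assumes that `max.batch` caps the number of rows written to each file and that a call without `max.batch` produces a single file; the vector name `max_batch` is only illustrative.

```r
# Assumption: max.batch is the maximum number of rows per uploaded file,
# so the file count is ceiling(total rows / max.batch).
X <- 1e8

# Rows per file for each table (X / 2 is the same value as .5 * X, and so on)
max_batch <- c(files_1 = X, files_2 = X / 2, files_10 = X / 10, files_20 = X / 20)

n_files <- ceiling(X / max_batch)
print(n_files)
```

This is why the four tables are labelled `files_1`, `files_2`, `files_10` and `files_20` in the benchmark.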
from noctua.
Next steps for benchmarking:
- More complex sql queries
- AWS Athena tables with different column data types
- All the different file types noctua supports:
  - csv/tsv (uncompressed)
  - json
  - parquet (uncompressed)
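One possible way to organise these next steps is to lay out the combinations of file type and file count up front. The sketch below only builds the plan as a data frame; the table names (`test_<type>_split<n>`) are hypothetical, and nothing is written to Athena.

```r
# Hypothetical benchmark plan: every file type crossed with every file count.
file_types <- c("csv", "tsv", "json", "parquet")
n_files <- c(1L, 2L, 10L, 20L)

grid <- expand.grid(file_type = file_types,
                    n_files = n_files,
                    stringsAsFactors = FALSE)

X <- 1e8
# Made-up naming scheme, chosen for illustration only
grid$table <- sprintf("test_%s_split%d", grid$file_type, grid$n_files)
# Rows per file needed to end up with n_files files (assuming max.batch caps rows per file)
grid$max_batch <- X / grid$n_files

print(head(grid))
```

Each row could then be passed to `dbWriteTable()` with the matching `file.type` and `max.batch`.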
Compressed Parquet group by:
Note: Data was taken from: #112 (comment)
res <- microbenchmark(
  files_1 = dbGetQuery(con, "select y, sum(x) as tot_x, avg(x) as avg_x from test_split1 group by y"),
  files_2 = dbGetQuery(con, "select y, sum(x) as tot_x, avg(x) as avg_x from test_split2 group by y"),
  files_10 = dbGetQuery(con, "select y, sum(x) as tot_x, avg(x) as avg_x from test_split3 group by y"),
  files_20 = dbGetQuery(con, "select y, sum(x) as tot_x, avg(x) as avg_x from test_split4 group by y"),
  times = 10
)
library(ggplot2)
autoplot(res) +
  labs(title = "AWS Athena benchmark with compressed parquet file",
       subtitle = "compression type: snappy") +
  theme_bw() +
  theme(text = element_text(size = 15))
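The four queries above differ only in the table name, so they could be generated from a single template. This is a small sketch of that idea using base R only; `query_template` and `queries` are illustrative names, not part of noctua.

```r
# Build the four group-by queries from one template.
tables <- sprintf("test_split%d", 1:4)
labels <- c("files_1", "files_2", "files_10", "files_20")

query_template <- "select y, sum(x) as tot_x, avg(x) as avg_x from %s group by y"
queries <- setNames(sprintf(query_template, tables), labels)

print(queries)
```

Each element of `queries` can be passed to `dbGetQuery(con, ...)` inside the benchmark, which makes it easier to swap in a more complex statement later.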
Compressed Parquet group by # 2:
Note: Data was taken from: #112 (comment)
res <- microbenchmark(
  # range is max - min; the order does not affect the work Athena has to do
  files_1 = dbGetQuery(con, "select y, max(x) - min(x) as range_x from test_split1 group by y"),
  files_2 = dbGetQuery(con, "select y, max(x) - min(x) as range_x from test_split2 group by y"),
  files_10 = dbGetQuery(con, "select y, max(x) - min(x) as range_x from test_split3 group by y"),
  files_20 = dbGetQuery(con, "select y, max(x) - min(x) as range_x from test_split4 group by y"),
  times = 10
)
library(ggplot2)
autoplot(res) +
  labs(title = "AWS Athena benchmark with compressed parquet file",
       subtitle = "compression type: snappy") +
  theme_bw() +
  theme(text = element_text(size = 15))
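To compare the runs numerically as well as from the plot, something like the helper below could be used. It relies on a microbenchmark result being a data frame with an `expr` column and a `time` column in nanoseconds. The function name `relative_median` is hypothetical, and the timings in `toy` are placeholders made up purely to exercise the helper; they are not measurements from Athena.

```r
# Hypothetical helper: median time per expression, relative to a baseline.
relative_median <- function(res, baseline = "files_1") {
  med <- vapply(split(res$time, res$expr), median, numeric(1))
  med / med[[baseline]]
}

# Placeholder timings in nanoseconds (NOT measured values)
toy <- data.frame(
  expr = factor(rep(c("files_1", "files_2"), each = 3)),
  time = c(4e9, 5e9, 6e9, 2e9, 2.5e9, 3e9)
)

rel <- relative_median(toy)
print(rel)
```

With the real `res` object, `relative_median(res)` would give each configuration's median run time as a fraction of the single-file run.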