db-benchmark's Introduction

Repository for reproducible benchmarking of database-like operations in a single-node environment.
Benchmark report is available at h2oai.github.io/db-benchmark.
We focus mainly on portability and reproducibility. The benchmark is routinely re-run to present up-to-date timings. Most of the solutions used are automatically upgraded to their stable or development versions.
This benchmark is meant to compare scalability both in data volume and in data complexity.
Contribution and feedback are very welcome!

Tasks

  • groupby
  • join
  • groupby2014

Solutions

More solutions have been proposed. Their status can be tracked in the issue tracker of our project repository using the new solution label.

Reproduce

Batch benchmark run

  • edit path.env and set the julia and java paths
  • if a solution uses Python, create a new virtualenv as $solution/py-$solution; for example, for pandas use virtualenv pandas/py-pandas --python=/usr/bin/python3.6
  • install every solution, following the $solution/setup-$solution.sh scripts
  • edit run.conf to define the solutions and tasks to benchmark
  • generate data; for groupby use Rscript _data/groupby-datagen.R 1e7 1e2 0 0 to create G1_1e7_1e2_0_0.csv, re-save to a binary format where needed (see below), create a data directory and keep all data files there
  • edit _control/data.csv to define the data sizes to benchmark using the active flag
  • ensure swap is disabled and the ClickHouse server is not yet running
  • start the benchmark with ./run.sh (see the sketch below)
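A condensed sketch of the batch sequence above, for the pandas and data.table solutions on the groupby task. The editor calls and setup-script paths are illustrative (they follow the $solution/setup-$solution.sh naming described above); adjust them to your machine.

    vim path.env                                   # set julia and java paths
    virtualenv pandas/py-pandas --python=/usr/bin/python3.6
    ./pandas/setup-pandas.sh                       # repeat for every solution to benchmark
    ./datatable/setup-datatable.sh
    vim run.conf                                   # pick the solutions and tasks to run
    Rscript _data/groupby-datagen.R 1e7 1e2 0 0    # creates G1_1e7_1e2_0_0.csv
    mkdir -p data && mv G1_1e7_1e2_0_0.csv data/
    vim _control/data.csv                          # set the active flag per data size
    sudo swapoff -a                                # ensure swap is disabled
    ./run.sh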

Single solution benchmark

  • install the solution software
    • for Python we recommend using virtualenv for better isolation
    • for R ensure that the library is installed in a solution subdirectory, so that library("dplyr", lib.loc="./dplyr/r-dplyr") or library("data.table", lib.loc="./datatable/r-datatable") works
    • note that some solutions may require another one to be installed to speed up csv data load; for example, dplyr requires data.table and, similarly, pandas requires (py)datatable
  • generate data using the _data/*-datagen.R scripts; for example, Rscript _data/groupby-datagen.R 1e7 1e2 0 0 creates G1_1e7_1e2_0_0.csv; put the data files in the data directory
  • run the benchmark for a single solution using ./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7
  • run other data cases by passing the extra parameters --k=1e2 --na=0 --sort=0
  • use --quiet=true to suppress the script's output and print timings only; use --print=question,run,time_sec to specify the columns printed to the console, or --print=* to print all
  • use --out=time.csv to write timings to a file rather than to the console (a combined example follows below)
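Putting these options together, a single-solution invocation might look like this (the flag values are the examples used above):

    ./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7 \
      --k=1e2 --na=0 --sort=0 \
      --quiet=true --print=question,run,time_sec \
      --out=time.csv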

Running script interactively

  • install the software in the expected location, as detailed above
  • ensure the data name to be used in the env var below is present in the ./data dir
  • source the Python virtual environment if needed
  • call SRC_DATANAME=G1_1e7_1e2_0_0 R; if desired, replace R with python or julia
  • proceed by pasting code from the benchmark script (see the sketch below)
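A short sketch of starting an interactive session; the virtualenv path is the pandas example from above, and the benchmark script to paste from depends on the solution and task you are running.

    source pandas/py-pandas/bin/activate     # only when running a Python solution
    SRC_DATANAME=G1_1e7_1e2_0_0 python       # or: SRC_DATANAME=G1_1e7_1e2_0_0 R (or julia)
    # then paste code from the corresponding benchmark script step by step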

Extra care needed

  • cudf uses conda instead of virtualenv

Example environment

Acknowledgment

Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all solutions. Some solutions might also run out of memory when running the benchmark script, which results in the process being killed by the OS. Lastly, we also added a timeout for a single benchmark script; once the timeout value is reached, the script is terminated. Please check the exceptions label in our repository for a list of issues/defects in solutions that make us unable to provide all timings. There is also a no documentation label that lists issues blocked by missing documentation in the solutions we are benchmarking.

db-benchmark's People

Contributors

bkamins, hannes, jangorecki, mattdowle, michaelchirico, nalimilan, pallharaldsson, ravwojdyla, ritchie46, trivialfis

db-benchmark's Issues

Add Scala

I'm not very experienced in these things, but I do think there is some overhead in using the Python/R APIs to Spark, and co-workers report using native Scala when trying to optimize performance. I think the same goes for the SQL API, which I believe has an interpretation layer.

Include timestamp within each image file

The same report-generation timestamp that's at the bottom of the page, but embedded within all the images in the bottom-right corner. Keep the report-generation timestamp at the bottom of the page too. Sometimes there are browser refresh issues and old images haven't updated yet, needing a hard refresh or a browser cache clear. Including the timestamp within the image wouldn't solve that problem, but it would let the viewer of the page spot it by comparing the image timestamp to the report-generation timestamp at the bottom of the page, so at least they know to hard-refresh.

Report the history of benchmark timings

It would be great to automatically produce a chart where the X axis is the date, the Y axis is the benchmark timing, and the values plotted are limited to a particular package + question. Multiple lines may be present to show a single package/several questions, or multiple packages/a single question.

This chart would allow us to view how the timings changed over time, and in particular, identify any possible performance regressions.

We could use a JS library like Flot to make the chart directly in the browser.

Force consistent sort order for spark?

I guess related to #12.

It's possible that part of Spark's advantage is that it doesn't constrain itself to return groups in the same order as they're fed in. Forcing that would require a separate query for the by = functionality in other languages as well.

The query to force this would be substantially more complicated and might slow down Spark.

Also related: Rdatatable/data.table#1880

Also, it's probably a bug that Spark gets away with using group by and not group by ... order by at the moment...

Add data ingest timings

Need to clear the OS cache before each 1st run, e.g. using sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'. I'd say that the 2nd run should be allowed to benefit from the OS cache, and we'd expect the 2nd run to be faster. One option is that variance can be obtained using a time series of daily observations of 1st and 2nd runs. Another option is to repeat the drop-cache step and do the 1st and 2nd run a second time (4 timings total). In this case of data ingest, a 3rd run may be of benefit (6 timings total), since the OS cache affects the 1st run. Although 6 runs may push the limits of what can be achieved within a single 24hr period, since db-bench's total time is currently at 5hrs, before the addition of dask.
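A sketch of the repeat-the-cycle option, reusing the launcher from the Reproduce section: each benchmark script already performs a 1st and 2nd run, so two drop-cache cycles give the 4 timings mentioned above (the flag values are just the earlier examples).

    for i in 1 2; do
      sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'    # 1st run reads cold, 2nd run benefits from the OS cache
      ./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7 --out=time.csv
    done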

Aside: since db-bench reruns on the same machine each day, there's a chance that yesterday's run affects today's run in various ways. So perhaps a daily machine reboot should be considered to remove any possibility of that. In other words: even if it's proved that a machine reboot doesn't make a jot of difference, still do it anyway, for the avoidance of doubt.

add `modin` (pandas on ray) as new solution

Installing modin forces pip to uninstall the recent pandas and install pandas 0.22.0.
@st-pasha should we use separate virtual envs for pandas and modin? Maybe also another one for pydatatable?

Installing collected packages: pyyaml, funcsigs, flatbuffers, click, redis, ray, pandas, modin
  Found existing installation: pandas 0.23.0
    Uninstalling pandas-0.23.0:
      Successfully uninstalled pandas-0.23.0
Successfully installed click-6.7 flatbuffers-2015.12.22.1 funcsigs-1.0.2 modin-0.1.1 pandas-0.22.0 pyyaml-3.13 ray-0.5.0 redis-2.10.6
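One way this could look, assuming we keep the per-solution virtualenv convention from the Reproduce section; the modin/py-modin name is hypothetical, following the $solution/py-$solution pattern.

    virtualenv pandas/py-pandas --python=/usr/bin/python3.6
    virtualenv modin/py-modin --python=/usr/bin/python3.6
    pandas/py-pandas/bin/pip install pandas    # keeps the current pandas release
    modin/py-modin/bin/pip install modin       # free to pin its own pandas 0.22.0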

Use binary load for Dask and Pandas grouping tests

Currently the Dask and Pandas grouping tests are shown as a fail at the 50GB size, where other products work. But this is only because reading the csv file fails, which isn't to do with grouping per se. These grouping tests could instead use pickle or feather to load the dataset. It's not like the time to load the data is included in the test anyway. It would also be faster to run the grouping tests, since the time to read from csv would not need to happen first. Reading data from csv is due to be added to db-bench as a separate set of tests, where the fail point would be fairly represented separately.

Similarly, pydatatable could read the test data from its memory map before grouping, and data.table could read from fst. So long as the result of reading from these binary formats is just the same as if read from csv (so no pre-computed data like indexes or similar allowed (**)), it would be faster for db-bench to run, as well as providing a timing for Dask and Pandas, which probably do in fact work at this size on this machine.

(**) separate tests to be added in future where pre-computed indexes and similar are allowed.

With this done, #45 could be enabled again.

turn off dask and pandas on 1e9 rows

We are losing a lot of time just waiting for memory errors when attempting to run 1e9-row groupings for pandas and dask. They should be disabled and revisited from time to time, if csv reading is improved in terms of memory usage.

Need more flexible task scheduler

The benchmark suite grows both in depth and in width. In the near future we will not be able to afford the "run everything" strategy anymore. Limiting the number of tasks that are run every day will be a necessity. Approaches such as #50 are a good starting point, but that may not be enough.

I propose to implement a new system, where at the beginning of the benchmarking session each task will be assigned a "worth" score. The tasks can then be sorted by that score, and only the top ones run (until the benchmarking server runs out of time).

The function that evaluates the worthiness of each task will be dynamic, taking into account many factors:

  • How long ago the task was run before;
  • Estimated run time of the task, based on the previous runs;
  • Relative importance of the task (e.g. join/groupby may be more useful than filtering/updating);
  • Relative importance of the solution being measured (e.g. data.table / datatable should be tested more frequently than others);
  • Age of the latest version of the solution, esp. compared to the time the task was run before (#50);
  • Manual overrides, allowing the user to force execution of a particular task;
  • etc.

advanced questions for `join` tests

Presently the join tests are made on two integer-column tables of equal size, with an inner join on a single column. This is because it was difficult to achieve good random numbers for the 1e10 datasets used before. Now that we won't go beyond 1e9, we can easily use another set of data.
Based on the questions we want to answer in these tests, we will pick/generate the expected datasets.
An initial list of queries we might want to test is given below. We need to choose those we want included in the first iteration; the rest will be left for future extensions. My picks are as follows.

Types of queries:

  • left/right outer
  • inner
  • full outer
  • join on multiple columns
  • non-equi join
  • lookup (update on join #24)
  • lookup from multiple tables
  • row explosion join (some multiple matches, partial cross join)
  • cross join (full cartesian product)
  • temporal join

Types of fields:

  • join on integer column
  • join on factor column(s) #21

Sizes of datasets:

  • big to big (1e9-1e9)
  • big to medium (1e9-1e6)
  • big to small (1e9-1e3)

Using different datasets will heavily complicate presenting the benchmark results (as this is another dimension to present in the report). We can think about how to overcome that.
Also, we need to choose the subset of queries/fields/sizes wisely, as my current selection of 4*2*3 gives 24 different questions; multiplying by 3 sizes (1e7, 1e8, 1e9) gives 72 tests, while the current groupby task has only 5*3 = 15 tests.
@mattdowle

Add total time to summary table, and more.

The new table is very nice! Minor tweaks:

  • add total time row to the bottom of the summary table, in the same spirit as the total time already at the top of the barplot.

  • move the text "according to the following pattern G1_[in-rows]_[k-cardinality-factor]_[NA-pct]_[is-sorted].csv" down to just above the summary table (i.e. where the reader needs it). Or, even better, split the first coded column into separate columns: [rows, K, nas, issorted] so the pattern doesn't need to be understood by the reader of the page. Also use 2 rather than 2e0, and 10 rather than 1e1, to make it easier and neater. To avoid adding a note explaining what K is, change "K" to "group size (rows)" in the column heading.

  • add a blank line between the barplot image and the table.

  • what does the 1 mean in "G1"? It isn't question 1, because all 5 questions appear in the question column yet all rows in the table are marked G1.

  • is the reason pandas is missing because the 1e9 data won't load? Is there any hope of getting it to load from a binary file? I imagine a common thought from readers of this summary table will be: what about pandas? Another approach might be to reduce the size from 1e9 to 5e8 so pandas can appear there. Since the size is a constant in that summary table (1e9), it could be a different constant (5e8). That table is concerned with varying things other than size, so it doesn't need to be the largest size. Reducing to 5e8 would also help with the overall runtime of db-bench. (Still keep the barplots at 1e9, assuming the data load problem can be overcome.) At the very least, why pandas isn't in the summary table should be explained in a note somewhere on the page.

benchmark floating point rounding error

There might be improvements coming to the current functionality which could eventually impact rounding error. It would be good to spot this and implement rounding-error correction.

add more complex groupby questions

Currently, among the 5 questions, 4 group by a single column and one by 2 columns. We need more complex grouping queries that group by most of the columns, especially:

  • by all dimensions, so all id* columns, aggregating v1, v2, v3 (the process of producing a data mart)
  • K=10 and K=2
  • by all columns (detecting duplicated rows), a less real-life use case, especially grouping by a double-type column
  • sorted input

prerequisite: h2oai/datatable#1082

add `subset` task

As we plot results by run (1, 2), we can nicely see the impact of an index there.
We can consider adding two tasks: subset and unindexed subset.
The latter will ensure indexes are not being created.

Order results within each test, perhaps.

Each piece of software is already colored, and the syntax is next to each bar, so if they are ordered by time within each test it might be easier to compare quickly, and it won't be difficult to identify which software is which. The question is what to order by: first-run time, the average of both runs, or the total of both runs. As more packages are added, ordering within each test might become more necessary. As always, providing a toggle to let the viewer choose the sort method appropriate for them would be ideal, but of course more work to achieve.

use data.table update procedure

We can drop the data.table update procedure defined in the helpers R script. The same functionality is already in data.table 1.10.5. Also, the field name in the DESCRIPTION file has since been changed from Commit to Revision.

add `update on join` task

A common process in data analytics is to look up attributes from dimension hierarchies. Update on join is able to handle that process in place, but not all tools support it, most likely incurring the penalty of a data copy on every join.

add spark

This time on a single node and with pyspark, not Scala. Initially just groupby.

Chop outlier bars

Where one or two solutions are a lot slower than the others, lower the axis range so the faster solutions can be compared in more detail. Then split the overshooting bars with a white break and put the timing there as text. An alternative would be a log scale, but please don't use a log scale; I find simple timings significantly easier to interpret very quickly.

report needs to display more dimensions

A kind of "browser" UI selector; currently the report has a selector for data size only.
Over time we will need to select:

  • task (groupby, join, rollfun, ...)
  • data (so we can present different K in groupby benchmark)
  • number of rows
  • multiple data/datasizes for 2+ tables tasks (join)
  • batch (to view historical benchmarks)

Include timings at the end of each bar.

#27 (comment)

It would be nice to place the timing in seconds at the end of each bar, too. Say "7.3s; 7.1s" in small text at the end of each bar for the 1st and 2nd timing. Just to read off the timings extremely easily and conveniently.

measure memory usage

One of the tools is currently failing on 1e9 due to lack of memory; 126GB is not enough to aggregate a 50GB csv. This would require a bigger machine to run the 1e9 grouping benchmark. Still, memory usage should be investigated and probably added to the report.
