db-benchmark's Introduction

Repository for reproducible benchmarking of database-like operations in a single-node environment.
Benchmark report is available at h2oai.github.io/db-benchmark.
We focus mainly on portability and reproducibility. The benchmark is routinely re-run to present up-to-date timings. Most of the solutions used are automatically upgraded to their stable or development versions.
This benchmark is meant to compare scalability both in data volume and in data complexity.
Contribution and feedback are very welcome!

Tasks

  • groupby
  • join
  • groupby2014

Solutions

More solutions have been proposed. Their status can be tracked in the issue tracker of our project repository using the new solution label.

Reproduce

Batch benchmark run

  • edit path.env and set the julia and java paths
  • if a solution uses Python, create a new virtualenv as $solution/py-$solution; for example, for pandas use virtualenv pandas/py-pandas --python=/usr/bin/python3.6
  • install every solution, following the $solution/setup-$solution.sh scripts
  • edit run.conf to define the solutions and tasks to benchmark
  • generate data; for groupby use Rscript _data/groupby-datagen.R 1e7 1e2 0 0 to create G1_1e7_1e2_0_0.csv, re-save to a binary format where needed (see below), create a data directory and keep all data files there
  • edit _control/data.csv to define the data sizes to benchmark using the active flag
  • ensure swap is disabled and the ClickHouse server is not yet running
  • start the benchmark with ./run.sh (see the sketch below)
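A condensed sketch of the batch sequence above, for the pandas and data.table solutions on the groupby task. The editor calls and setup-script paths are illustrative (they follow the $solution/setup-$solution.sh naming described above); adjust them to your machine.

    vim path.env                                   # set julia and java paths
    virtualenv pandas/py-pandas --python=/usr/bin/python3.6
    ./pandas/setup-pandas.sh                       # repeat for every solution to benchmark
    ./datatable/setup-datatable.sh
    vim run.conf                                   # pick the solutions and tasks to run
    Rscript _data/groupby-datagen.R 1e7 1e2 0 0    # creates G1_1e7_1e2_0_0.csv
    mkdir -p data && mv G1_1e7_1e2_0_0.csv data/
    vim _control/data.csv                          # set the active flag per data size
    sudo swapoff -a                                # ensure swap is disabled
    ./run.sh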

Single solution benchmark

  • install the solution software
    • for Python we recommend using virtualenv for better isolation
    • for R ensure that the library is installed in a solution subdirectory, so that library("dplyr", lib.loc="./dplyr/r-dplyr") or library("data.table", lib.loc="./datatable/r-datatable") works
    • note that some solutions may require another one to be installed to speed up csv data load; for example, dplyr requires data.table and, similarly, pandas requires (py)datatable
  • generate data using the _data/*-datagen.R scripts; for example, Rscript _data/groupby-datagen.R 1e7 1e2 0 0 creates G1_1e7_1e2_0_0.csv; put the data files in the data directory
  • run the benchmark for a single solution using ./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7
  • run other data cases by passing the extra parameters --k=1e2 --na=0 --sort=0
  • use --quiet=true to suppress the script's output and print timings only; use --print=question,run,time_sec to specify the columns printed to the console, or --print=* to print all
  • use --out=time.csv to write timings to a file rather than to the console (a combined example follows below)
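Putting these options together, a single-solution invocation might look like this (the flag values are the examples used above):

    ./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7 \
      --k=1e2 --na=0 --sort=0 \
      --quiet=true --print=question,run,time_sec \
      --out=time.csv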

Running script interactively

  • install the software in the expected location, as detailed above
  • ensure the data name to be used in the env var below is present in the ./data dir
  • source the Python virtual environment if needed
  • call SRC_DATANAME=G1_1e7_1e2_0_0 R; if desired, replace R with python or julia
  • proceed by pasting code from the benchmark script (see the sketch below)
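A short sketch of starting an interactive session; the virtualenv path is the pandas example from above, and the benchmark script to paste from depends on the solution and task you are running.

    source pandas/py-pandas/bin/activate     # only when running a Python solution
    SRC_DATANAME=G1_1e7_1e2_0_0 python       # or: SRC_DATANAME=G1_1e7_1e2_0_0 R (or julia)
    # then paste code from the corresponding benchmark script step by step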

Extra care needed

  • cudf uses conda instead of virtualenv

Example environment

Acknowledgment

Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all solutions. Some solutions might also run out of memory when running the benchmark script, which results in the process being killed by the OS. Lastly, we also added a timeout for a single benchmark script; once the timeout value is reached, the script is terminated. Please check the exceptions label in our repository for a list of issues/defects in solutions that make us unable to provide all timings. There is also a no documentation label that lists issues blocked by missing documentation in the solutions we are benchmarking.

db-benchmark's People

Contributors

bkamins, hannes, jangorecki, mattdowle, michaelchirico, nalimilan, pallharaldsson, ravwojdyla, ritchie46, trivialfis

db-benchmark's Issues

Add Scala

I'm not very experienced in these things, but I do think there is some overhead in using the Python/R APIs to Spark, and co-workers report using native Scala when trying to optimize performance. I think the same goes for the SQL API, which I believe has an interpretation layer.

Include timestamp within each image file

The same report-generation timestamp that's at the bottom of the page, but embedded within all the images in the bottom-right corner. Keep the report-generation timestamp at the bottom of the page too. Sometimes there are browser refresh issues and old images haven't updated yet, needing a hard refresh or a browser cache clear. Including the timestamp within the image wouldn't solve that problem, but it would let the viewer of the page spot it by comparing the image timestamp to the report-generation timestamp at the bottom of the page, so at least they know to hard-refresh.

Report the history of benchmark timings

It would be great to automatically produce a chart where the X axis is the date, the Y axis is the benchmark timing, and the values plotted are limited to a particular package + question. Multiple lines may be present to show a single package/several questions, or multiple packages/a single question.

This chart would allow us to view how the timings changed over time, and in particular, identify any possible performance regressions.

We could use a JS library like Flot to make the chart directly in the browser.

Force consistent sort order for spark?

I guess related to #12.

It's possible that part of Spark's advantage is that it doesn't constrain itself to return groups in the same order as they're fed in. Forcing that would require a separate query for the by = functionality in other languages as well.

The query to force this would be substantially more complicated and might slow down Spark.

Also related: Rdatatable/data.table#1880

Also, it's probably a bug that Spark gets away with using group by and not group by ... order by at the moment...

Add data ingest timings

Need to clear the OS cache before each 1st run, e.g. using sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'. I'd say that the 2nd run should be allowed to benefit from the OS cache, and we'd expect the 2nd run to be faster. One option is that variance can be obtained using a time series of daily observations of 1st and 2nd runs. Another option is to repeat the drop-cache step and do the 1st and 2nd run a second time (4 timings total). In this case of data ingest, a 3rd run may be of benefit (6 timings total), since the OS cache affects the 1st run. Although 6 runs may push the limits of what can be achieved within a single 24hr period, since db-bench's total time is currently at 5hrs, before the addition of dask.
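A sketch of the repeat-the-cycle option, reusing the launcher from the Reproduce section: each benchmark script already performs a 1st and 2nd run, so two drop-cache cycles give the 4 timings mentioned above (the flag values are just the earlier examples).

    for i in 1 2; do
      sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'    # 1st run reads cold, 2nd run benefits from the OS cache
      ./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7 --out=time.csv
    done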

Aside: since db-bench reruns on the same machine each day, there's a chance that yesterday's run affects today's run in various ways. So perhaps a daily machine reboot should be considered to remove any possibility of that. In other words: even if it's proved that a machine reboot doesn't make a jot of difference, still do it anyway, for the avoidance of doubt.

add `modin` (pandas on ray) as new solution

Installing modin forces pip to uninstall the recent pandas and install pandas 0.22.0.
@st-pasha should we use separate virtual envs for pandas and modin? Maybe also another one for pydatatable?

Installing collected packages: pyyaml, funcsigs, flatbuffers, click, redis, ray, pandas, modin
  Found existing installation: pandas 0.23.0
    Uninstalling pandas-0.23.0:
      Successfully uninstalled pandas-0.23.0
Successfully installed click-6.7 flatbuffers-2015.12.22.1 funcsigs-1.0.2 modin-0.1.1 pandas-0.22.0 pyyaml-3.13 ray-0.5.0 redis-2.10.6
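One way this could look, assuming we keep the per-solution virtualenv convention from the Reproduce section; the modin/py-modin name is hypothetical, following the $solution/py-$solution pattern.

    virtualenv pandas/py-pandas --python=/usr/bin/python3.6
    virtualenv modin/py-modin --python=/usr/bin/python3.6
    pandas/py-pandas/bin/pip install pandas    # keeps the current pandas release
    modin/py-modin/bin/pip install modin       # free to pin its own pandas 0.22.0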

Use binary load for Dask and Pandas grouping tests

Currently the Dask and Pandas grouping tests are shown as a fail at the 50GB size, where other products work. But this is only because reading the csv file fails, which isn't to do with grouping per se. These grouping tests could instead use pickle or feather to load the dataset. It's not like the time to load the data is included in the test anyway. It would also be faster to run the grouping tests, since the time to read from csv would not need to happen first. Reading data from csv is due to be added to db-bench as a separate set of tests, where the fail point would be fairly represented separately.

Similarly, pydatatable could read the test data from its memory map before grouping, and data.table could read from fst. So long as the result of reading from these binary formats is just the same as if read from csv (so no pre-computed data like indexes or similar allowed (**)), it would be faster for db-bench to run, as well as providing a timing for Dask and Pandas, which probably do in fact work at this size on this machine.

(**) separate tests to be added in future where pre-computed indexes and similar are allowed.

With this done, #45 could be enabled again.

turn off dask and pandas on 1e9 rows

We are losing a lot of time just waiting for memory errors when attempting to run 1e9-row groupings for pandas and dask. They should be disabled and revisited from time to time, if csv reading is improved in terms of memory usage.

Need more flexible task scheduler

The benchmark suite grows both in depth and in width. In the near future we will not be able to afford the "run everything" strategy anymore. Limiting the number of tasks that are run every day will be a necessity. Approaches such as #50 are a good starting point, but that may not be enough.

I propose to implement a new system, where at the beginning of the benchmarking session each task will be assigned a "worth" score. The tasks can then be sorted by that score, and only the top ones run (until the benchmarking server runs out of time).

The function that evaluates the worthiness of each task will be dynamic, taking into account many factors:

  • How long ago the task was run before;
  • Estimated run time of the task, based on the previous runs;
  • Relative importance of the task (e.g. join/groupby may be more useful than filtering/updating);
  • Relative importance of the solution being measured (e.g. data.table / datatable should be tested more frequently than others);
  • Age of the latest version of the solution, esp. compared to the time the task was run before (#50);
  • Manual overrides, allowing the user to force execution of a particular task;
  • etc.

advanced questions for `join` tests

Presently the join tests are made on two integer-column tables of equal size, with an inner join on a single column. This is because it was difficult to achieve good random numbers for the 1e10 datasets used before. Now that we won't go beyond 1e9, we can easily use another set of data.
Based on the questions we want to answer in these tests, we will pick/generate the expected datasets.
An initial list of queries we might want to test is given below. We need to choose those we want included in the first iteration; the rest will be left for future extensions. My picks are as follows.

Types of queries:

  • left/right outer
  • inner
  • full outer
  • join on multiple columns
  • non-equi join
  • lookup (update on join #24)
  • lookup from multiple tables
  • row explosion join (some multiple matches, partial cross join)
  • cross join (full cartesian product)
  • temporal join

Types of fields:

  • join on integer column
  • join on factor column(s) #21

Sizes of datasets:

  • big to big (1e9-1e9)
  • big to medium (1e9-1e6)
  • big to small (1e9-1e3)

Using different datasets will heavily complicate presenting the benchmark results (as this is another dimension to present in the report). We can think about how to overcome that.
Also, we need to choose the subset of queries/fields/sizes wisely, as my current selection of 4*2*3 gives 24 different questions; multiplying by 3 sizes (1e7, 1e8, 1e9) gives 72 tests, while the current groupby task has only 5*3 = 15 tests.
@mattdowle

Add total time to summary table, and more.

The new table is very nice! Minor tweaks:

  • add total time row to the bottom of the summary table, in the same spirit as the total time already at the top of the barplot.

  • move the text "according to the following pattern G1_[in-rows]_[k-cardinality-factor]_[NA-pct]_[is-sorted].csv" down to just above the summary table (i.e. where the reader needs it). Or, even better, split the first coded column into separate columns: [rows, K, nas, issorted] so the pattern doesn't need to be understood by the reader of the page. Also use 2 rather than 2e0, and 10 rather than 1e1, to make it easier and neater. To avoid adding a note explaining what K is, change "K" to "group size (rows)" in the column heading.

  • add a blank line between the barplot image and the table.

  • what does the 1 mean in "G1"? It isn't question 1, because all 5 questions appear in the question column yet all rows in the table are marked G1.

  • is the reason pandas is missing because the 1e9 data won't load? Is there any hope of getting it to load from a binary file? I imagine a common thought from readers of this summary table will be: what about pandas? Another approach might be to reduce the size from 1e9 to 5e8 so pandas can appear there. Since the size is a constant in that summary table (1e9), it could be a different constant (5e8). That table is concerned with varying things other than size, so it doesn't need to be the largest size. Reducing to 5e8 would also help with the overall runtime of db-bench. (Still keep the barplots at 1e9, assuming the data load problem can be overcome.) At the very least, why pandas isn't in the summary table should be explained in a note somewhere on the page.

benchmark floating point rounding error

There might be improvements coming to the current functionality which could eventually impact rounding error. It would be good to spot this and implement rounding-error correction.

add more complex groupby questions

Currently, among the 5 questions, 4 group by a single column and one by 2 columns. We need more complex grouping queries that group by most of the columns, especially:

  • by all dimensions, so all id* columns, aggregating v1, v2, v3 (the process of producing a data mart)
  • K=10 and K=2
  • by all columns (detecting duplicated rows), a less real-life use case, especially grouping by a double-type column
  • sorted input

prerequisite: h2oai/datatable#1082

add `subset` task

As we plot results by run (1, 2), we can nicely see the impact of an index there.
We can consider adding two tasks: subset and unindexed subset.
The latter will ensure indexes are not being created.

Order results within each test, perhaps.

Each piece of software is already colored, and the syntax is next to each bar, so if they are ordered by time within each test it might be easier to compare quickly, and it won't be difficult to identify which software is which. The question is what to order by: first-run time, the average of both runs, or the total of both runs. As more packages are added, ordering within each test might become more necessary. As always, providing a toggle to let the viewer choose the sort method appropriate for them would be ideal, but of course more work to achieve.

use data.table update procedure

We can drop the data.table update procedure defined in the helpers R script. The same functionality is already in data.table 1.10.5. Also, the field name in the DESCRIPTION file has since been changed from Commit to Revision.

add `update on join` task

A common process in data analytics is to look up attributes from dimension hierarchies. Update on join is able to handle that process in place, but not all tools support it, most likely incurring the penalty of a data copy on every join.

add spark

This time on a single node and with pyspark, not Scala. Initially just groupby.

Chop outlier bars

Where one or two solutions are a lot slower than the others, lower the axis range so the faster solutions can be compared in more detail. Then split the overshooting bars with a white break and put the timing there as text. An alternative would be a log scale, but please don't use a log scale; I find simple timings significantly easier to interpret very quickly.

report needs to display more dimensions

A kind of "browser" UI selector; currently the report has a selector for data size only.
Over time we will need to select:

  • task (groupby, join, rollfun, ...)
  • data (so we can present different K in groupby benchmark)
  • number of rows
  • multiple data/datasizes for 2+ tables tasks (join)
  • batch (to view historical benchmarks)

Include timings at the end of each bar.

#27 (comment)

It would be nice to place the timing in seconds at the end of each bar, too. Say "7.3s; 7.1s" in small text at the end of each bar for the 1st and 2nd timing. Just to read off the timings extremely easily and conveniently.

measure memory usage

One of the tools is currently failing on 1e9 due to lack of memory; 126GB is not enough to aggregate a 50GB csv. This would require a bigger machine to run the 1e9 grouping benchmark. Still, memory usage should be investigated and probably added to the report.
