
db-benchmark's Introduction

Forked from h2oai/db-benchmark

Repository for reproducible benchmarking of database-like operations in single-node environment.
Benchmark report is available at duckdblabs.github.io/db-benchmark.
We focused mainly on portability and reproducibility. The benchmark is routinely re-run to present up-to-date timings. Most of the solutions used are automatically upgraded to their stable or development versions.
This benchmark is meant to compare scalability both in data volume and data complexity.
Contribution and feedback are very welcome!

Tasks

  • groupby
  • join
  • groupby2014

Solutions

If you would like your solution to be included, feel free to file a PR with the necessary setup-solution/ver-solution/groupby-solution/join-solution scripts. If the team at DuckDB Labs approves the PR, it will be merged. In the interest of transparency and fairness, only results from open-source data-science tools will be merged.

Reproduce

Batch benchmark run

  • if a solution uses Python, create a new virtualenv as $solution/py-$solution; for example, for pandas use virtualenv pandas/py-pandas --python=/usr/bin/python3.10
  • install every solution by following the $solution/setup-$solution.sh scripts by hand; they are not fully automated scripts
  • edit run.conf to define solutions and tasks to benchmark
  • generate data; for groupby use Rscript _data/groupby-datagen.R 1e7 1e2 0 0 to create G1_1e7_1e2_0_0.csv, re-save it to a binary format where needed (see below), create a data directory and keep all data files there
  • edit _control/data.csv to define data sizes to benchmark using active flag
  • ensure swap is disabled and the ClickHouse server is not yet running
  • start benchmark with ./run.sh
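A minimal end-to-end sketch of these steps, assuming pandas and data.table as example solutions and the 0.5 GB groupby data case (the exact setup-script invocations and the assumption that the generator writes its CSV to the working directory are illustrative):

# Python-based solution: create its virtualenv
virtualenv pandas/py-pandas --python=/usr/bin/python3.10

# install the solutions by walking through their setup scripts by hand
./pandas/setup-pandas.sh
./datatable/setup-datatable.sh

# generate the 0.5 GB groupby data case and keep it in ./data
mkdir -p data
Rscript _data/groupby-datagen.R 1e7 1e2 0 0
mv G1_1e7_1e2_0_0.csv data/

# after editing run.conf and _control/data.csv, start the batch run
./run.sh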

Single solution benchmark

  • install solution software
    • for Python we recommend using a virtualenv for better isolation
    • for R ensure that the library is installed in a solution subdirectory, so that library("dplyr", lib.loc="./dplyr/r-dplyr") or library("data.table", lib.loc="./datatable/r-datatable") works
    • note that some solutions may require another solution to be installed to speed up CSV data loading; for example, dplyr requires data.table, and similarly pandas requires (py)datatable
  • generate data using _data/*-datagen.R scripts, for example, Rscript _data/groupby-datagen.R 1e7 1e2 0 0 creates G1_1e7_1e2_0_0.csv, put data files in data directory
  • run benchmark for a single solution using ./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7
  • run other data cases by passing extra parameters --k=1e2 --na=0 --sort=0
  • use --quiet=true to suppress the script's output and print timings only; use --print=question,run,time_sec to specify which columns are printed to the console, or --print=* to print all
  • use --out=time.csv to write timings to a file rather than console
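Putting these options together, a single-solution run for the 0.5 GB groupby case with the extra data parameters and timings written to a file might look like this (the flag values are just the examples given above):

./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7 \
  --k=1e2 --na=0 --sort=0 \
  --quiet=true --print=question,run,time_sec --out=time.csv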

Running script interactively

  • install the software in the expected location (details above)
  • ensure the data name used in the environment variable below is present in the ./data directory
  • source the Python virtual environment if needed
  • call SRC_DATANAME=G1_1e7_1e2_0_0 R, replacing R with python or julia if desired
  • proceed by pasting code from the benchmark script
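For example, an interactive pandas session might be started like this (the virtualenv path follows the layout described above; the script name groupby-pandas.py is assumed from the groupby-solution naming convention):

# activate the solution's virtualenv (Python-based solutions only)
source pandas/py-pandas/bin/activate

# expose the data case via the environment variable and start the REPL
SRC_DATANAME=G1_1e7_1e2_0_0 python

# then paste code from the benchmark script, e.g. pandas/groupby-pandas.py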

Updating the benchmark

The benchmark will now be updated upon request. A request can be made by creating a PR with a combination of the following.

The PR must include

  • updates to the time.csv and log.csv files of a run on a c6id.metal machine. If you are re-enabling a query for a solution, you can include only the new times and logs for that query; however, the version must match the currently reported version.

The PR must include one of the following

  • changes to a solution VERSION file.
  • changes to a solution groupby or join script. This can mean:
    1. Loading the data differently
    2. Changing settings for a solution.
    3. Re-enabling a query for a solution

To facilitate creating an instance identical to the one with the current results, the script _utils/format_and_mount.sh was created. The script does the following

  1. Formats and mounts an nvme drive so that solutions have access to instance storage
  2. Creates a new directory db-benchmark-metal on the nvme drive. This directory is a clone of the repository. Having a clone of the benchmark on the nvme drive enables the solutions to load the data faster (assuming you follow the steps to copy the data onto the nvme mount).

Once the db-benchmark-metal directory is created, you will need to

  1. Create or generate all the datasets. The benchmark will not be updated if only a subset of the datasets is tested.
  • If you call ./_utils/format_and_mount.sh -c the datasets will be created for you. Creating every dataset takes over an hour.
  2. Install the solutions you wish to have updated. The {{solution}}/setup-{{solution}}.sh script should have everything you need.
  3. Update the solution(s) groupby or join scripts with any desired changes.
  4. Benchmark your solution against all datasets.
  5. Generate the report to see how the results compare to other solutions; it should be generated automatically and can be found in public.
  6. Create your PR! Include the updates to the time.csv and logs.csv files.
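A hedged sketch of that workflow on a fresh c6id.metal instance (the exact path of the db-benchmark-metal clone on the NVMe mount and the choice of duckdb as the solution to update are illustrative):

# format and mount the NVMe drive, clone the repo there and (-c) generate all datasets
./_utils/format_and_mount.sh -c

# continue from the clone on the NVMe mount (exact path may differ)
cd ../db-benchmark-metal

# install the solution(s) you want updated and re-run the benchmark for them
./duckdb/setup-duckdb.sh
./run.sh

# the report should be generated automatically; check it in public, then open a PR
# that includes the updated time.csv and logs.csv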

The PR will then be reviewed by the DuckDB Labs team where we will run the benchmark again ourselves to validate the new results. If there aren't any questions, we will merge the PR and publish a new report!

Example environment

Acknowledgment

Timings for solutions from before the fork have been deleted. You can still view them in the original h2oai/db-benchmark repository. Including those timings in report generation resulted in errors, and since all libraries have been updated and benchmarked on new hardware, the decision was made to start a new results file. Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all solutions. Some solutions might also run out of memory when running the benchmark script, which results in the process being killed by the OS. There is also a timeout for a single benchmark script to run; once the timeout is reached the script is terminated. Please check the exceptions label in the original h2oai repository for a list of issues/defects in solutions that prevent us from providing all timings. There is also a no documentation label that lists issues blocked by missing documentation in the solutions we are benchmarking.

Notice

In the interest of transparency and fairness, only results from open-source data-science tools will be included in the benchmark.

db-benchmark's People

Contributors

bkamins, dandandan, fjetter, gvelasq, hannes, jangorecki, mattdowle, michaelchirico, milesgranger, moelf, mroeschke, nalimilan, pallharaldsson, qoega, ravwojdyla, ritchie46, sebkrantz, sl-solution, szarnyasg, themantra108, tmonster, trivialfis, vincentarelbundock


db-benchmark's Issues

Recommended Benchmarking Arrangement

I would like to thank the DuckDB team for keeping this benchmark going! It can be expensive to run benchmark tests for so many software packages. As a single developer working on my current dataframe project, I run into problems if I spend too much time learning how to configure other software for benchmarking.

However, I suggest that each software developer write their own script; the results would be fairer, since different script settings result in different performance. The software developers must be able to configure their scripts to get optimized performance.

It is unbelievable that none of the software in the benchmark can complete the JoinTable test.

I use a 67 GB CSV fact table of 1 billion rows x 14 columns joined to a master table of 99,696 rows, a full join on a 3-column composite key, returning a 90 GB CSV file. Peaks can complete the JoinTable test using only 32 GB of memory. See https://github.com/hkpeaks/peaks-consolidation/tree/main/Polars-PeaksBenchmarking for more information.

You can use the Peaks pre-release to see if it can complete the JoinTable test on your much better machine: https://github.com/hkpeaks/peaks-consolidation/releases. I would like to know the results, as I do not have a Google Cloud account. I am exploring which cloud service accepts PayPal for prepayment; using a credit card for online payment is very risky for me.

I suggest that your upcoming benchmark test the following categories, selecting all records except in the filter test:

  • Select all records for a sorting test on CSV and Parquet (covering 3+ sort columns)
  • Select all records for a groupby test on CSV and Parquet (covering 3+ groupby columns and 3+ aggregate columns)
  • Select all records for a distinct test on CSV and Parquet (covering 3+ distinct columns)
  • Select all records for a JoinTable test on CSV and Parquet (covering a 3+ column composite key)
  • Filter tests with different comparison operators, including non-matching filters, on CSV and Parquet (filtering at least 50% of the input data volume)

This way it can be more easily understood by business users.

Suggestion: Include reshape benchmarks

Stoked to see this back up and running!

(As an aside, the relentless performance gains of DuckDB are truly impressive.)

Two suggestions:

  1. Please consider the collapse R package (link). In my own set of benchmarks, collapse is typically at or near the top for various groupby operations on datasets on the order of 0.5-5 GB. (I haven't tested larger than that, and should also say it doesn't support join operations yet.) I can add a PR if interested. Closed via #33.
  2. There was talk over at the old repo of adding a set of reshape benchmarks. Personally, I think this would be great to have. See: h2oai#175

Thanks again for all the effort in resurrecting this.

Clean-up `/tmp` directory

I notice that on-disk solutions may create large temporary files during their runs but may not clean up afterward (e.g. Polars creates .ipc files). This can cause undefined exception errors for other solutions when they run within the same session.

`=~` matching in run.sh causes "duckdb-latest" to match "duckdb"

Commit 17d41db, which introduced the "duckdb-latest" solution, causes the following line in run.sh

if [[ "$RUN_SOLUTIONS" =~ "duckdb" ]]; then ./duckdb/ver-duckdb.sh; fi;

to match duckdb even when duckdb itself is not being run, only duckdb-latest. As a result the benchmark may fail with the following message:

Error in read.dcf(system.file(package = "duckdb", lib.loc = "./duckdb/r-duckdb",  : 
  cannot open the connection
In addition: Warning message:
In read.dcf(system.file(package = "duckdb", lib.loc = "./duckdb/r-duckdb",  :
  cannot open compressed file '', probable reason 'No such file or directory'
Execution halted

To resolve it we should use a more precise comparison than =~ (see the sketch below); until then, let's keep this issue open so that anyone hitting this error can find it when searching.
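One possible fix, sketched here rather than taken from an actual change, is to compare each whole word of $RUN_SOLUTIONS instead of doing a substring match:

# exact word comparison, so "duckdb-latest" no longer matches "duckdb"
run_duckdb=false
for s in $RUN_SOLUTIONS; do
  if [[ "$s" == "duckdb" ]]; then run_duckdb=true; fi
done
if [[ "$run_duckdb" == "true" ]]; then ./duckdb/ver-duckdb.sh; fi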

Add one Digit to Benchmark Summary

Minor issue, but I'd like to suggest adding one digit to the summary table for a better display of results, especially for the 0.5 GB and 5 GB benchmarks.

duckdb-latest fails

When trying to setup duckdb-latest I get an error:

~/db-benchmark$ bash duckdb-latest/setup-duckdb-latest.sh
Downloading GitHub repo duckdb/duckdb@HEAD
Error: Failed to install 'duckdb' from GitHub:
  Does not appear to be an R package (no DESCRIPTION)
Execution halted

The docs page recommends doing the following:

install.packages('duckdb', repos=c('https://duckdb.r-universe.dev', 'https://cloud.r-project.org'))

I'm probably doing something wrong as I'm not very familiar with R

datafusion does not correctly make chunk results

tagging @Dandandan since they added the solution.

While running the benchmarks again, the output of DataFusion has caused some issues during report generation.

Specifically, when write_log is called and a chunk is written using chk=make_chk([chk]), the chunk output is incorrectly formatted.

Running the 500 MB groupby benchmark on G1_1e7_1e2_0_0 logs the following result for one of the queries:

{{codename}},1696851068,1696851075.3744261,groupby,G1_1e7_1e2_0_0,10000000,sum v1 mean v3 by id3,100000,3,datafusion,31.0.0,,.groupby,1,0.269,1.877,TRUE,[29998789.          4999719.62234443],5.398,,FALSE

The column with the value [29998789. 4999719.62234443] is the offending one. It should contain semicolon-separated answers.

The report is ready to be published and I would like to have the DataFusion results included.

Add cuDF

Any particular reason why cuDF is left out, especially since it's included in the original repo?

many solutions are missing from join 1e9

It is a bit confusing that join 1e9 has so few solutions, and there are no notes below the benchplot explaining why. Ideally all solutions would be included, and whenever something is not working, an exception would be defined in the benchplot dictionary referring to the GitHub issue number of the particular project. You will find multiple examples in the current benchplot dictionary, because that was how we handled failures and exceptions in different solutions. This way you also allow users to eventually fix the problems behind those issues.

Add additional data wrangling methods

Thank you DuckDB team for keeping this benchmark going!!!

I see there are a lot of variations on group bys and joins; however, I think it would be highly beneficial to incorporate additional data wrangling methods. A few that come to mind (others should add to this list) include:

  • Unions
  • Subsetting data
  • Sampling data
  • Rolling joins (see data.table)
  • Pivots long and wide
  • Rolling / windowing operations by groups over time, such as lags and moving averages
  • Differencing data by groups based on a time column
  • Updating records in a data frame / table
  • Categorical encoding methods: target encoding, James-Stein encoding
  • Column type conversions

I believe a broader set of operations serves several purposes. For one, I would like to know whether a particular framework can actually do the operation. Secondly, I would like to see benchmarks on their performance. Lastly, I think it would be a huge community benefit to see what the actual code ends up looking like to get the greatest performance, which isn't always available through documentation or Stack Overflow.

Thanks in advance,
Adrian

suspicious timings for groupby q10

groupby q10 is quite a specific query, meant to stress aggregations really well.

Looking at the current results, I observe that duckdb and clickhouse deal with this question suspiciously well (0.14s, 0.47s) compared to the remaining solutions (1.56s).

My guess, and what I would like to clarify in this issue, is that those databases hold statistics on the tables and run the queries faster using those statistics. Could you comment on whether that is a likely scenario? For duckdb @Tmonster, and for clickhouse @qoega?

If that is the case, then:

  • there should be a note about that, probably even the first note just below the timings;
  • we should look for corresponding features in other solutions. For example, in the data.table test script we could set indices (these are data.table's native statistics, which can be enabled) or a key; those optimizations are currently not there (see the sketch below).
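For reference, enabling such a data.table optimization would be roughly a one-line change in the test script, along these lines (the data file and the id3 grouping column are just the example case used elsewhere in this benchmark):

# set a secondary index on the grouping column before running the queries
Rscript -e 'library(data.table); DT <- fread("data/G1_1e7_1e2_0_0.csv"); setindex(DT, id3); print(indices(DT))'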

Benchmark with Parquet

Is there any interest in using Parquet datasets to benchmark, particularly for the 50GB dataset case? Parquet is very common for large-scale data analytics, and as far as I know, most if not all of the libraries tested support Parquet files. Furthermore, some libraries have special support for Parquet datasets, like reading encoded columns and performing analytics on them directly in memory.

I think it still makes sense to do the 0.5GB and 5GB benchmarks in CSV, but I think it would be helpful to have an additional 50GB benchmark in Parquet (or even replace the existing one). That way, the benchmark doesn't take much longer.

Question

Hi,
I wanted to know how often the benchmark will be updated.
Just curious to know how some new solutions are performing.
Thank you

Inclusion of Julia in the benchmarks

Are you open to re-include Julia DataFrames.jl in the benchmarks?

The scripts in https://github.com/duckdblabs/db-benchmark/tree/master/juliadf should just work. However, I can also update them to the latest Julia and related package versions. It would be roughly h2oai#232.

A related big question is precompilation. In the H2O benchmark the compilation time for Julia was included in the benchmark (that is the reason why there are two runs per test). Also there was a requirement that no named functions should be created and called from the code performing the operations. Do you want to keep these two restrictions?

Inconsistent multicore use across languages?

Probably a question for @jangorecki, since this applies to data.table's thread retrieval function. But it will affect other R implementations relying on getDTthreads, e.g. the newly proposed r-collapse PR (#33).

Here's an example from my own EC2 server.

First, basic machine characteristics.

[~]$ grep -E -w 'VERSION|NAME|PRETTY_NAME' /etc/os-release    
NAME="Amazon Linux"
VERSION="2"
PRETTY_NAME="Amazon Linux 2"

[~]$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('  
CPU(s):              96
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2

The Python call gets the 96 CPU number right.

[~]$ python3 -c 'import multiprocessing as mp; print(mp.cpu_count())'
96

But data.table only recognises half of the available cores (physical only?).

[~]$ Rscript -e "data.table::getDTthreads()" 
[1] 48
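For what it's worth, data.table defaults to using only about half of the logical CPUs; if the benchmark should use all of them, the thread count can be raised explicitly (a sketch; setDTthreads also accepts an absolute count):

# ask data.table for 100% of logical CPUs instead of its default
Rscript -e "data.table::setDTthreads(percent = 100); data.table::getDTthreads(verbose = TRUE)"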

I created a benchmark but DuckDB run times are super slow and not sure why

Here's a link to the repo: https://github.com/AdrianAntico/Benchmarks

I use 3 datasets: one with a million rows, one with 10M, and one with 100M. I am currently just running a sum aggregation, with a varying number of grouping variables and aggregated numeric variables. One main difference between those datasets and the ones used here is that I make use of a Date type column in all aggregations. It also seems that Polars has a harder time with that data type. My results show data.table as the fastest for all queries except the single Date-type aggregation (this is where DuckDB wins).

I copied some of the code from this repo. Hoping someone can take a look because the results were a bit unexpected.

Published duckdb results are not reproducible

Hi.
I created the environment you use for benchmarks and tried to reproduce the currently published results.

curl http://169.254.169.254/latest/meta-data/instance-type
c6id.metal

The benchmark data is stored on a local NVMe disk:

~/nvme/h2oai-db-benchmark$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme2n1    1.8T  510G  1.3T  29% /home/ubuntu/nvme

lsblk | grep -v loop
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme4n1      259:0    0  1000G  0 disk
├─nvme4n1p1  259:1    0 999.9G  0 part /
├─nvme4n1p14 259:2    0     4M  0 part
└─nvme4n1p15 259:3    0   106M  0 part /boot/efi
nvme2n1      259:4    0   1.7T  0 disk /home/ubuntu/nvme
nvme3n1      259:5    0   1.7T  0 disk
nvme1n1      259:6    0   1.7T  0 disk /var/lib/clickhouse
nvme0n1      259:7    0   1.7T  0 disk /nvme

Group by G1_1e9_1e2_5_0 fails with OOM for duckdb 0.8.1.3

cat run_duckdb_groupby_G1_1e9_1e2_5_0.err
Error: rapi_execute: Failed to run query
Error: Out of Memory Error: could not allocate block of size 262KB (216.2GB/216.2GB used)
Database is launched in in-memory mode and no temporary directory is specified.
Unused blocks cannot be offloaded to disk.

Launch the database with a persistent storage back-end
Or set PRAGMA temp_directory='/path/to/tmp.tmp'
Timing stopped at: 768 538.4 33.28
Execution halted
Warning messages:
1: Connection is garbage-collected, use dbDisconnect() to avoid this.
2: Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this.
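As the error message suggests, one workaround (a sketch using duckdb's R API; the paths and the choice between the two options are illustrative) is to open the database with a persistent storage backend, or keep it in-memory but point temp_directory at the NVMe drive so blocks can spill to disk:

# option 1: persistent database file, so unused blocks can be offloaded to disk
Rscript -e "library(DBI); con <- dbConnect(duckdb::duckdb(), dbdir = '/nvme/db-benchmark.duckdb')"

# option 2: stay in-memory but give duckdb a temp directory to spill to
Rscript -e "library(DBI); con <- dbConnect(duckdb::duckdb()); dbExecute(con, \"PRAGMA temp_directory='/nvme/duckdb_tmp'\")"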

Add a new backend - Clojure's tablecloth

Hi there! This is a really cool project, thanks for keeping it going. Would the maintainers be open to accepting a solution using Clojure's tablecloth? It's the main dataframe/datatable manipulation library in our toolkit, entirely free and open source. I'm active in the Clojure-for-data-science community and would happily contribute a PR with the Clojure setup and solutions, but wanted to check first whether it's likely to be accepted. If so, are there any specific guidelines or requirements for contributing a new language (Clojure runs on the JVM)?

Add setkey to data.table operations

setkey makes joins much faster in data.table; the join benchmark code is not setting keys, which can affect the main results.

It is also important in other kinds of data manipulation, such as deduplication. For instance:
setkey(DT, key)
unique(DT, by = 'key')

is very much faster than calling
unique(DT, by = 'key')

without setting the key first. This can go from 15 minutes to seconds for 100 GB+ datasets.

Joins work the same way:

setkey(DTA, key)
setkey(DTB, key)

DTA[DTB, on = .(key)]

I hope this can make the benchmark better!!

Incorrect timings in groupby 50 GB charts

Data in this chart (groupby 50 GB) is incorrect. Timings for each question seem way too small (smaller than for 5 GB) and do not add up to the timing at the top of the chart. E.g. DuckDB's top timing is 143s, but the timings for each question are less than 0.5s.
