
DeepDB: Learn from Data, not from Queries!

DeepDB is a data-driven learned database component achieving state-of-the-art performance in cardinality estimation and approximate query processing (AQP). This is the implementation described in

Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, Carsten Binnig: "DeepDB: Learn from Data, not from Queries!", VLDB'2020. [PDF]

DeepDB Overview

Setup

Tested with Python 3.7 and 3.8.

git clone https://github.com/DataManagementLab/deepdb-public.git
cd deepdb-public
sudo apt install -y libpq-dev gcc python3-dev
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt

For Python 3.8: the spflow installation sometimes fails. In this case, remove spflow from requirements.txt, install the remaining requirements, and run

pip3 install spflow --no-deps
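
To verify the installation, here is a minimal Python smoke test (spflow installs under the package name spn; the imported classes are just illustrative):

# Import smoke test for spflow (installed under the package name "spn").
from spn.structure.Base import Sum, Product
print("spflow imports OK")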

Reproduce Experiments

Cardinality Estimation

Download the JOB (Join Order Benchmark) dataset. Generate hdf files from the csvs.

python3 maqp.py --generate_hdf
    --dataset imdb-light
    --csv_seperator ,
    --csv_path ../imdb-benchmark
    --hdf_path ../imdb-benchmark/gen_single_light
    --max_rows_per_hdf_file 100000000
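
The generated files are pandas HDF5 stores and can be inspected directly. A minimal sketch, assuming pandas with PyTables is installed (the file name title.hdf is an assumption; check the output directory for the actual per-table names):

# Inspect one of the generated HDF files (pandas HDF5 format; requires PyTables).
import pandas as pd

df = pd.read_hdf('../imdb-benchmark/gen_single_light/title.hdf')
print(df.shape)
print(df.head())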

Generate sampled hdf files from csvs.

python3 maqp.py --generate_sampled_hdfs
    --dataset imdb-light
    --hdf_path ../imdb-benchmark/gen_single_light
    --max_rows_per_hdf_file 100000000
    --hdf_sample_size 10000

Learn the ensemble with the optimized RDC strategy (requires Postgres with the IMDB dataset).

python3 maqp.py --generate_ensemble
    --dataset imdb-light 
    --samples_per_spn 10000000 10000000 1000000 1000000 1000000
    --ensemble_strategy rdc_based
    --hdf_path ../imdb-benchmark/gen_single_light
    --max_rows_per_hdf_file 100000000
    --samples_rdc_ensemble_tests 10000
    --ensemble_path ../imdb-benchmark/spn_ensembles
    --database_name imdb
    --post_sampling_factor 10 10 5 1 1
    --ensemble_budget_factor 5
    --ensemble_max_no_joins 3
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl
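
The pairwise RDC values computed during ensemble selection are cached in the file passed via --pairwise_rdc_path. A minimal way to peek at the cache (that it is a plain pickle is an assumption; its exact structure is best explored interactively):

# Load the cached pairwise RDC values (assumed to be a plain pickle).
import pickle

with open('../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl', 'rb') as f:
    pairwise_rdc = pickle.load(f)
print(type(pairwise_rdc))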

Alternatively: Learn a base ensemble over the different tables with the naive strategy. (This does not require Postgres, but it does not work with different dataset sizes because the join sizes are hard-coded.)

python3 maqp.py --generate_ensemble
    --dataset imdb-light 
    --samples_per_spn 1000000 1000000 1000000 1000000 1000000
    --ensemble_strategy relationship
    --hdf_path ../imdb-benchmark/gen_single_light
    --ensemble_path ../imdb-benchmark/spn_ensembles
    --max_rows_per_hdf_file 100000000
    --post_sampling_factor 10 10 5 1 1

Evaluate the estimated cardinalities for the JOB-light queries.

python3 maqp.py --evaluate_cardinalities
    --rdc_spn_selection
    --max_variants 1
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl
    --dataset imdb-light
    --target_path ./baselines/cardinality_estimation/results/deepDB/imdb_light_model_based_budget_5.csv
    --ensemble_location ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl
    --query_file_location ./benchmarks/job-light/sql/job_light_queries.sql
    --ground_truth_file_location ./benchmarks/job-light/sql/job_light_true_cardinalities.csv
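
The resulting CSV can then be scored with the usual q-error metric. A minimal sketch, assuming the file contains one predicted and one true cardinality per query (the column names 'predicted' and 'true' are assumptions; check the actual CSV header):

# Compute q-error statistics from the evaluation output
# (column names are assumptions; adapt them to the actual CSV header).
import pandas as pd

df = pd.read_csv('./baselines/cardinality_estimation/results/deepDB/imdb_light_model_based_budget_5.csv')
q_error = pd.concat([df['predicted'] / df['true'], df['true'] / df['predicted']], axis=1).max(axis=1)
print(f"median q-error: {q_error.median():.2f}, 95th percentile: {q_error.quantile(0.95):.2f}")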

Updates

Conditional incremental learning (i.e., all films before 2013 are learned initially; newer films are learned incrementally).

python3 maqp.py  --generate_ensemble
    --dataset imdb-light
    --samples_per_spn 10000000 10000000 1000000 1000000 1000000
    --ensemble_strategy rdc_based
    --hdf_path ../imdb-benchmark/gen_single_light
    --max_rows_per_hdf_file 100000000
    --samples_rdc_ensemble_tests 10000
    --ensemble_path ../imdb-benchmark/spn_ensembles
    --database_name JOB-light
    --post_sampling_factor 10 10 5 1 1
    --ensemble_budget_factor 0
    --ensemble_max_no_joins 3
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl
    --incremental_condition "title.production_year<2013"

Optimized Inference

Generate the C++ code (currently this only works for cardinality estimation).

python3 maqp.py --code_generation 
    --ensemble_path ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl

Compile it in a venv with pybind11 installed. Sometimes installing pybind11 yields ModuleNotFoundError: No module named 'pip.req'. One workaround is to downgrade pip (pip3 install pip==9.0.3), as described here.

The command below works for Ubuntu 18.04. Make sure the generated .so file is in the root directory of the project.

g++ -O3 -Wall -shared -std=c++11 -ftemplate-depth=2048 -ftime-report -fPIC `python3 -m pybind11 --includes` optimized_inference.cpp -o optimized_inference`python3-config --extension-suffix`
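
To verify the build, the compiled extension should be importable from the project root (the module name follows from the -o flag above):

# Smoke test for the compiled pybind11 module; run from the project root,
# where the generated .so file must be located.
import optimized_inference
print(optimized_inference.__file__)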

If you now want to leverage the compiled module, you have to pass --use_generated_code when evaluating cardinalities.

python3 maqp.py --evaluate_cardinalities 
    --rdc_spn_selection 
    --max_variants 1 
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl 
    --dataset imdb-light 
    --target_path ./baselines/cardinality_estimation/results/deepDB/imdb_light_model_based_budget_5.csv 
    --ensemble_location ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl 
    --query_file_location ./benchmarks/job-light/sql/job_light_queries.sql 
    --ground_truth_file_location ./benchmarks/job-light/sql/job_light_true_cardinalities.csv 
    --use_generated_code

AQP

SSB pipeline

Generate the standard SSB dataset (scale factor 500) and convert the .tbl files to CSV with the correct separator.

# Strip the trailing '|' from each .tbl file and write the result as .csv
# ($TMP_DIR must be set to the desired output directory).
for i in `ls *.tbl`; do
    sed 's/|$//' $i > $TMP_DIR/${i/tbl/csv} &
    echo $i;
done

Create a lineorder sample. The sampling rate of 0.003333 ≈ 1/300 matches the --confidence_upsampling_factor 300 used below and yields roughly 10M rows at SF=500.

cat lineorder.csv | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .003333) print $0}' > lineorder_sampled.csv

Generate hdf files from csvs.

python3 maqp.py --generate_hdf
    --dataset ssb-500gb
    --csv_seperator \|
    --csv_path ../mqp-data/ssb-benchmark
    --hdf_path ../mqp-data/ssb-benchmark/gen_hdf

Learn the ensemble with the naive (single) strategy.

python3 maqp.py --generate_ensemble 
    --dataset ssb-500gb
    --samples_per_spn 1000000
    --ensemble_strategy single 
    --hdf_path ../mqp-data/ssb-benchmark/gen_hdf 
    --ensemble_path ../mqp-data/ssb-benchmark/spn_ensembles
    --rdc_threshold 0.3
    --post_sampling_factor 10

Optional: Compute the ground truth for the AQP queries (requires Postgres with the SSB schema).

python3 maqp.py --aqp_ground_truth
    --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql
    --target_path ./benchmarks/ssb/ground_truth_500GB.pkl
    --database_name ssb

Evaluate the AQP queries.

python3 maqp.py --evaluate_aqp_queries
    --dataset ssb-500gb
    --target_path ./baselines/aqp/results/deepDB/ssb_500gb_model_based.csv
    --ensemble_location ../mqp-data/ssb-benchmark/spn_ensembles/ensemble_single_ssb-500gb_1000000.pkl
    --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/ssb/ground_truth_500GB.pkl
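
The resulting CSV can be scored with the standard relative error. A minimal sketch, assuming one approximate and one exact aggregate per query (the column names 'aqp_value' and 'true_value' are assumptions; check the actual CSV header):

# Compute the mean relative error of the AQP results
# (column names are assumptions; adapt them to the actual CSV header).
import pandas as pd

df = pd.read_csv('./baselines/aqp/results/deepDB/ssb_500gb_model_based.csv')
rel_error = (df['aqp_value'] - df['true_value']).abs() / df['true_value'].abs()
print(f"mean relative error: {rel_error.mean():.4f}")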

Optional: Create the ground truth for the confidence intervals (with 10M samples, because we also use 10M samples for training).

python3 maqp.py --aqp_ground_truth
    --query_file_location ./benchmarks/ssb/sql/confidence_queries.sql
    --target_path ./benchmarks/ssb/confidence_intervals/confidence_interval_10M.pkl
    --database_name ssb

Evaluate the confidence intervals.

python3 maqp.py --evaluate_confidence_intervals
    --dataset ssb-500gb
    --target_path ./baselines/aqp/results/deepDB/ssb500GB_confidence_intervals.csv
    --ensemble_location ../mqp-data/ssb-benchmark/spn_ensembles/ensemble_single_ssb-500gb_1000000.pkl
    --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/ssb/confidence_intervals/confidence_interval_10M.pkl
    --confidence_upsampling_factor 300
    --confidence_sample_size 10000000

Flights pipeline

Generate the flights dataset with a scale of 1 billion tuples using IDEBench, and create a sample; the rate of 0.01 = 1/100 matches the --confidence_upsampling_factor 100 used below and yields roughly 10M rows.

cat dataset.csv | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' > dataset_sampled.csv

Generate hdf files from csvs.

python3 maqp.py --generate_hdf
    --dataset flights1B
    --csv_seperator ,
    --csv_path ../mqp-data/flights-benchmark
    --hdf_path ../mqp-data/flights-benchmark/gen_hdf

Learn the ensemble.

python3 maqp.py --generate_ensemble 
    --dataset flights1B
    --samples_per_spn 10000000 
    --ensemble_strategy single 
    --hdf_path ../mqp-data/flights-benchmark/gen_hdf 
    --ensemble_path ../mqp-data/flights-benchmark/spn_ensembles
    --rdc_threshold 0.3
    --post_sampling_factor 10

Optional: Compute ground truth

python3 maqp.py --aqp_ground_truth
    --dataset flights1B
    --query_file_location ./benchmarks/flights/sql/aqp_queries.sql
    --target_path ./benchmarks/flights/ground_truth_1B.pkl
    --database_name flights   

Evaluate the AQP queries.

python3 maqp.py --evaluate_aqp_queries
    --dataset flights1B
    --target_path ./baselines/aqp/results/deepDB/flights1B_model_based.csv
    --ensemble_location ../mqp-data/flights-benchmark/spn_ensembles/ensemble_single_flights1B_10000000.pkl
    --query_file_location ./benchmarks/flights/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/flights/ground_truth_1B.pkl

Optional: Create the ground truth for the confidence intervals (with 10M samples, because we also use 10M samples for training).

python3 maqp.py --aqp_ground_truth
    --dataset flights1B
    --query_file_location ./benchmarks/flights/sql/confidence_queries.sql
    --target_path ./benchmarks/flights/confidence_intervals/confidence_interval_10M.pkl
    --database_name flights10M_origsample 

Evaluate the confidence intervals.

python3 maqp.py --evaluate_confidence_intervals
    --dataset flights1B
    --target_path ./baselines/aqp/results/deepDB/flights1B_confidence_intervals.csv
    --ensemble_location ../mqp-data/flights-benchmark/spn_ensembles/ensemble_single_flights1B_10000000.pkl
    --query_file_location ./benchmarks/flights/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/flights/confidence_intervals/confidence_interval_10M.pkl
    --confidence_upsampling_factor 100
    --confidence_sample_size 10000000

TPC-DS (Single Table) pipeline

As an additional example of how to work with DeepDB, we provide a walkthrough on just a single table of the TPC-DS schema for the queries in ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql. As a prerequisite, you need a 10-million-tuple sample of the store_sales table at ../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv (one way to create it is sketched below). Afterwards, you can run the following commands. To compute the ground truth, you need a Postgres instance with a 1TB TPC-DS dataset.
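
If you still need to create the sample, here is a minimal streaming sampler in Python (the input file name and sampling rate are assumptions; tune the rate so that total_rows * rate ≈ 10,000,000, analogous to the awk one-liners above):

# Randomly sample rows from a large CSV without loading it into memory.
# The rate is a hypothetical value for store_sales at a 1TB scale factor;
# adjust it so that the sample ends up at roughly 10M rows.
import random

rate = 0.0035
with open('store_sales.dat') as src, \
        open('../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv', 'w') as dst:
    for line in src:
        if random.random() <= rate:
            dst.write(line)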

Generate hdf files from csvs

python3 maqp.py --generate_hdf
    --dataset tpc-ds-1t
    --csv_seperator \|
    --csv_path ../mqp-data/tpc-ds-benchmark
    --hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf

Learn the ensemble

python3 maqp.py --generate_ensemble 
    --dataset tpc-ds-1t
    --samples_per_spn 10000000 
    --ensemble_strategy single 
    --hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf
    --ensemble_path ../mqp-data/tpc-ds-benchmark/spn_ensembles 
    --rdc_threshold 0.3 
    --post_sampling_factor 10

Compute ground truth

python3 maqp.py --aqp_ground_truth
    --dataset tpc-ds-1t
    --query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql
    --target_path ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl
    --database_name tcpds

Evaluate the AQP queries

python3 maqp.py --evaluate_aqp_queries
    --dataset tpc-ds-1t
    --target_path ./baselines/aqp/results/deepDB/tpcds1t_model_based.csv
    --ensemble_location ../mqp-data/tpc-ds-benchmark/spn_ensembles/ensemble_single_tpc-ds-1t_10000000.pkl
    --query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl

How to Experiment with DeepDB on a New Dataset

  • Specify a new schema in the schemas folder (a sketch follows after this list)
  • Due to the current implementation, make sure to declare
    • the primary key,
    • the filename of the csv sample file,
    • the correct table size and sample rate,
    • the relationships among tables if you do not just run queries over a single table,
    • any non-key functional dependencies (this is rather an implementation detail),
    • and all columns in the no-compression list by default (as done for the IMDB benchmark)
  • To further reduce the training time, you can exclude columns you do not need in your experiments (as also done for the IMDB benchmark)
  • Generate the HDF/sampled HDF files and learn the RSPN ensemble
  • Use the RSPN ensemble to answer queries
  • For reference, please check the commands to reproduce the results of the paper
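
A minimal sketch of such a schema file, loosely modeled on the IMDB schema (SchemaGraph and Table exist in the repository, but the constructor arguments and all values below are illustrative assumptions; check schemas/imdb/schema.py for the authoritative signatures):

# Hypothetical schema definition for a new two-table dataset.
# Argument names follow the existing schema files but should be verified
# against ensemble_compilation/graph_representation.py.
from ensemble_compilation.graph_representation import SchemaGraph, Table

def gen_my_dataset_schema(csv_path):
    schema = SchemaGraph()
    schema.add_table(Table('orders',
                           attributes=['id', 'customer_id', 'amount'],
                           csv_file_location=csv_path + '/orders_sampled.csv',
                           table_size=1000000,   # true size of the full table
                           sample_rate=0.01,     # fraction covered by the csv sample
                           primary_key=['id'],
                           no_compression=['id', 'customer_id', 'amount']))
    schema.add_table(Table('customers',
                           attributes=['id', 'name'],
                           csv_file_location=csv_path + '/customers.csv',
                           table_size=10000,
                           sample_rate=1.0,
                           primary_key=['id'],
                           no_compression=['id', 'name']))
    # relationship: orders.customer_id references customers.id
    schema.add_relationship('orders', 'customer_id', 'customers', 'id')
    return schema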

Reference

If you find this repository useful in your work, please cite our paper:

@article{deepdb,
  title={DeepDB: Learn from Data, not from Queries!},
  author={Benjamin Hilprecht and Andreas Schmidt and Moritz Kulessa and Alejandro Molina and Kristian Kersting and Carsten Binnig},
  journal={Proceedings of the VLDB Endowment},
  volume={13},
  number={7},
  pages={992--1005},
  year={2020},
  publisher={VLDB Endowment}
}


deepdb-public's Issues

Problem reproducing the SSB experiment results

Hi,
Thank you for the great work! I am reproducing the results for the SSB dataset. Due to limited storage, I generated SSB (SF=10) with the unified SSB generator (https://github.com/eyalroz/ssb-dbgen). However, after I trained the model and ran the evaluation:

2022-12-22 05:50:28,693 [INFO ]  Evaluating AQP query 4: select d_year, p_brand1, sum(lo_revenue) from lineorder, dwdate, part, supplier where lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_brand1 in ('MFGR#2221','MFGR#2222','MFGR#2223','MFGR#2224','MFGR#2225','MFGR#2226','MFGR#2227','MFGR#2228') and s_region = 'ASIA' group by d_year, p_brand1 order by d_year, p_brand1;
Traceback (most recent call last):
  File "/Users/jw/Desktop/deepdb-public/maqp.py", line 234, in <module>
    evaluate_aqp_queries(args.ensemble_location, args.query_file_location, args.target_path, schema,
  File "/Users/jw/Desktop/deepdb-public/evaluation/aqp_evaluation.py", line 127, in evaluate_aqp_queries
    confidence_intervals, aqp_result = spn_ensemble.evaluate_query(query, rdc_spn_selection=rdc_spn_selection,
  File "/Users/jw/Desktop/deepdb-public/ensemble_compilation/spn_ensemble.py", line 746, in evaluate_query
    group_bys_scopes, result_tuples, result_tuples_translated = self._evaluate_group_by_spn_ensembles(query)
  File "/Users/jw/Desktop/deepdb-public/ensemble_compilation/spn_ensemble.py", line 677, in _evaluate_group_by_spn_ensembles
    group_bys_scope, temporary_results, temporary_results_translated = spn.evaluate_group_by_combinations(
  File "/Users/jw/Desktop/deepdb-public/aqp_spn/aqp_spn.py", line 251, in evaluate_group_by_combinations
    range_conditions = self._parse_conditions(range_conditions)
  File "/Users/jw/Desktop/deepdb-public/aqp_spn/aqp_spn.py", line 409, in _parse_conditions
    assert len(matching_cols) == 1 or len(matching_fd_cols) == 1, "Found multiple or no matching columns"
AssertionError: Found multiple or no matching columns

For query 4 (Q4), I printed the matching columns and saw:

matching_fd_cols: []
matching_cols: []

The first three queries ran smoothly, but I am stuck on this one because of the error. How can I deal with it? Thank you for your help!

Cardinality Estimation Anomalies for ORDERKEY and PARTKEY in TPC-H Dataset (SF=1)

I've been working with the TPC-H dataset (Scale Factor 1) in DeepDB and noticed an unusual pattern in cardinality estimation (CE). Specifically, when querying numerical columns with limited distinct values, such as ORDERKEY and PARTKEY in the LINEITEM table (6,001,215 total records), the system's predictions come out as multiples of the inverse of the sampling rate, or exactly one (e.g., the CE results were 1, 6, 12, 18, ... when samples_per_spn = 1000000 1000000 1000000 1000000 1000000).
This occurs even after listing these columns under the no_compression section of the schema file to rule out compression effects. I'd appreciate any guidance or recommendations to mitigate this issue.

Non-tree graphs

Hi there,

Thank you for the code. I am wondering if this can be used on non-tree schemas such as JOB (full IMDB)?

Thanks,

TPC-DS single-table cardinality estimation

I wonder if it is possible to do single-table (single-SPN) cardinality estimation without modifying the code in the SPNEnsemble class. I tried with a single-table toy dataset and it ran into a lot of errors. I assume you must have tested the code on a single table first before making it work on multiple tables. Could you let me know how to estimate on a single table?
Thanks a lot.
