Giter Club home page Giter Club logo

microsoft / dsb Goto Github PK

View Code? Open in Web Editor NEW
34.0 8.0 7.0 10.69 MB

The DSB benchmark is designed for evaluating both workloaddriven and traditional database systems on modern decision support workloads. DSB is adapted from the widely-used industrialstandard TPC-DS benchmark. It enhances the TPC-DS benchmark with complex data distribution and challenging yet semantically meaningful query templates. DSB also introduces configurable and dynamic workloads to assess the adaptability of database systems. Since workload-driven and traditional database systems have different performance dimensions, including the additional resources required for tuning and maintaining the systems, we provide guidelines on evaluation methodology and metrics to report.

License: MIT License

Shell 1.00% PLSQL 0.28% C 73.34% Smarty 20.78% Makefile 2.04% Yacc 0.80% Lex 0.37% Python 0.58% TSQL 0.82%

dsb's Introduction

DSB Benchmark

The DSB benchmark is designed for evaluating both workloaddriven and traditional database systems on modern decision support workloads. DSB is adapted from the widely-used industrialstandard TPC-DS benchmark. It enhances the TPC-DS benchmark with complex data distribution and challenging yet semantically meaningful query templates. DSB also introduces configurable and dynamic workloads to assess the adaptability of database systems. Since workload-driven and traditional database systems have different performance dimensions, including the additional resources required for tuning and maintaining the systems, we provide guidelines on evaluation methodology and metrics to report.

The detail of this benchmark is described here:

Bailu Ding, Surajit Chaudhuri, Johannes Gehrke, and Vivek Narasayya. DSB: A Decision Support Benchmark for Workload-Driven and Traditional Database Systems PVLDB, 14(13): 3376 - 3388, 2021. doi:10.14778/3484224.3484234

(http://www.vldb.org/pvldb/vol14/p3376-ding.pdf)

Disclaimer: The DSB benchmark is derived from TPC-DS and as such is not comparable to published TPC-DS results, as the DSB benchmark does not comply with the TPC-DS benchmark

Compile the code

  • The code can be compiled based on the instructions in ./code/v2.11.0rc2/tools/How_To_Guide-DS-V2.0.0.docx.

Data generation

  • The data can be generated based on the instructions in ./code/v2.11.0rc2/tools/How_To_Guide-DS-V2.0.0.docx.
  • Because the DSB benchmark includes correlation between tables, the tables must be generated following a partial order. * We strongly suggest that the users generate ALL the tables when populating the data files *. Generating / repopulating an individual table file can result in incorrect correlation between tables.
  • Sample script to generate data files: ./scripts/generate_dsb_db_files.py
  • Sample script to load the data files to Microsoft SQL Server: ./scripts/load_data_sqlserver.py
  • Sample script to load the data files to Postgres: ./scripts/load_data_pg.py

Physical configuration

  • We provide a sample physical configuration of 56 B+ tree indexes for the database. The physical configuration is produced by the Database Tuning Advisor (DTA) from Microsoft SQL Server based on a 100GB DSB database. This physical configuration is only for demonstration purpose. We suggest the users to produce their own physical configuration based on the database instance and the query workloads.
  • The SQL file to create the indexes for Microsoft SQL Server: ./scripts/dsb_index_sqlserver.sql
  • THe SQL file to create the indexes for Postgres: ./scripts/dsb_index_pg.sql

Query templates

  • The query templates are adapted from the TPC-DS benchmark with three new queries (100, 101, 102). For the query templates adapted from TPC-DS benchmark, we keep the original query ID of the template.
  • The queries are divided into two groups: agg_queries (i.e., single block queries) and multi-block queries.
  • The DSB benchmark also includes a set of single-block SPJ queries that are derived from the query templates for evaluating techniques with limited capabilities.
  • The query templates for Microsoft SQL Server dialect: ./query_templates_sqlserver
  • The query templates for Postgres dialect: ./query_templates_pg

Query generation

  • The query workloads can be generated with workload configurations.
  • Sample script to generate workloads: ./scripts/generate_workload.py
      • This script MUST be executed from the path of the binary of the query generation tool, e.g., D:\code\v2.11.0rc2\tools. *
  • Sample workload configuration: ./scripts/workload_config.json
  • As part of the query generation, our tool will output a tpcds.idx file, which stores the probability distributions of the values in each domain in a workload.

Workload configuration parameters

  • A workload configuration is a JSON object that consists of a sequence of workload distributions
  • Each workload configuration has the following meta parameters:
    • output_dir: The path to store the output query files
    • binary_dir: The path of the binary of the query generation tool
    • query_template_root_dir: The root directory of the query templates. It will be traversed recursively
    • dialect: "sqlserver" or "postgres"
    • workload: an array of workload distributions
  • Each workload distribution has the following parameters:
    • id: The id of the workload distribution
    • query_template_names: The names of the query template to be included in this workload distribution. An empty list means including all the query templates
    • instance_count: The number of query instances per query template
    • param_dist: "normal" or "default", where "normal" means using Gaussian distribution.
    • param_sigma: a positive numerical value. This is the variance of the Gaussian distribution (if applicable)
    • param_center: a numerical value between [-0.5, 0.5]. This is the shift of the center of the Gaussian distribution (if applicable)
    • rngseed: an integer. This is used for generating the permutation of the values in each domain. The same rngseed can be used to fix the parameter value permutation.

Test

  • The code and scripts for data generation and query generation are tested under Windows Server 2019.
  • The query templates are tested under Microsoft SQL Server 2019 and Postgres V13.

Known issues

  • The code has known compatibility issues with older versions of Linux and GCC. Please run the code with a newer version of Linux and GCC, e.g., Ubuntu 20.04 LTS.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

dsb's People

Contributors

bailud avatar microsoft-github-operations[bot] avatar microsoftopensource avatar themanojkumar avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

dsb's Issues

What is the format of distribution file for dsdgen

DSB is really creative and useful. May I ask What the format of distribution file for dsdgen is ?

Advanced Options

DELIMITER = -- use as output field separator
DISTRIBUTIONS = -- read distributions from file
FORCE = [Y|N] -- over-write data files without prompting
SUFFIX = -- use as output file suffix
TERMINATE = [Y|N] -- end each record with a field delimiter
VCOUNT = -- set number of validation rows to be produced
VSUFFIX = -- set file suffix for data validation
RNGSEED = -- set RNG seed

i.e. What should "DISTRIBUTIONS = -- read distributions from file " the file contain?

Compilation fails on Ubuntu 22.04 due to linker issues.

Hi, thanks for open-sourcing the code! AFAIK, the code is meant to work with newer versions of Linux ("Please run the code with a newer version of Linux and GCC"), but there are some small issues with Ubuntu 22.04.

Summary

  • A small number of files have multiple definitions of the same global variable. This causes linker errors.
  • Similarly, yacc wants yydebug to be extern.

Environment

$ lsb_release -d
Description:    Ubuntu 22.04 LTS
$ uname -r
5.15.0-39-generic
$ gcc --version
gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
$ yacc --version
bison (GNU Bison) 3.8.2

Details

The output of make is copied below.

$ make
...
gcc -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DYYDEBUG  -DLINUX -g -Wall  -o dsdgen s_brand.o s_customer_address.o s_call_center.o s_catalog.o s_catalog_order.o s_catalog_order_lineitem.o s_catalog_page.o s_catalog_promotional_item.o s_catalog_returns.o s_category.o s_class.o s_company.o s_customer.o s_division.o s_inventory.o s_item.o s_manager.o s_manufacturer.o s_market.o s_pline.o s_product.o s_promotion.o s_purchase.o s_reason.o s_store.o s_store_promotional_item.o s_store_returns.o s_subcategory.o s_subclass.o s_warehouse.o s_web_order.o s_web_order_lineitem.o s_web_page.o s_web_promotinal_item.o s_web_returns.o s_web_site.o s_zip_to_gmt.o w_call_center.o w_catalog_page.o w_catalog_returns.o w_catalog_sales.o w_customer_address.o w_customer.o w_customer_demographics.o w_datetbl.o w_household_demographics.o w_income_band.o w_inventory.o w_item.o w_promotion.o w_reason.o w_ship_mode.o w_store.o w_store_returns.o w_store_sales.o w_timetbl.o w_warehouse.o w_web_page.o w_web_returns.o w_web_sales.o w_web_site.o dbgen_version.o joint_distribution.o address.o build_support.o date.o decimal.o dist.o driver.o error_msg.o genrand.o join.o list.o load.o misc.o nulls.o parallel.o permute.o pricing.o print.o r_params.o StringBuffer.o tdef_functions.o tdefs.o text.o scd.o scaling.o release.o sparse.o validate.o -lm
/usr/bin/ld: s_purchase.o:/<snip>/git/dsb/code/tools/s_purchase.c:55: multiple definition of `nItemIndex'; s_catalog_order.o:/<snip>/git/dsb/code/tools/s_catalog_order.c:56: first defined here
/usr/bin/ld: s_web_order.o:/<snip>/git/dsb/code/tools/s_web_order.c:56: multiple definition of `nItemIndex'; s_catalog_order.o:/<snip>/git/dsb/code/tools/s_catalog_order.c:56: first defined here
/usr/bin/ld: s_web_order_lineitem.o:/<snip>/git/dsb/code/tools/s_web_order_lineitem.c:54: multiple definition of `g_s_web_order_lineitem'; s_web_order.o:/<snip>/git/dsb/code/tools/s_web_order.c:54: first defined here
/usr/bin/ld: w_catalog_page.o:/<snip>/git/dsb/code/tools/w_catalog_page.c:52: multiple definition of `g_w_catalog_page'; s_catalog_page.o:/<snip>/git/dsb/code/tools/s_catalog_page.c:51: first defined here
/usr/bin/ld: w_customer.o:/<snip>/git/dsb/code/tools/w_customer.c:59: multiple definition of `pCountyRecord'; w_customer_address.o:/<snip>/git/dsb/code/tools/w_customer_address.c:60: first defined here
/usr/bin/ld: w_customer.o:/<snip>/git/dsb/code/tools/w_customer.c:62: multiple definition of `pStateRecord'; w_customer_address.o:/<snip>/git/dsb/code/tools/w_customer_address.c:61: first defined here
/usr/bin/ld: w_household_demographics.o:/<snip>/git/dsb/code/tools/w_household_demographics.c:62: multiple definition of `pDemoRecord'; w_customer_demographics.o:/<snip>/git/dsb/code/tools/w_customer_demographics.c:54: first defined here
/usr/bin/ld: w_warehouse.o:/<snip>/git/dsb/code/tools/w_warehouse.c:54: multiple definition of `g_w_warehouse'; s_warehouse.o:/<snip>/git/dsb/code/tools/s_warehouse.c:51: first defined here
/usr/bin/ld: w_web_site.o:/<snip>/git/dsb/code/tools/w_web_site.c:60: multiple definition of `g_w_web_site'; s_web_site.o:/<snip>/git/dsb/code/tools/s_web_site.c:51: first defined here
collect2: error: ld returned 1 exit status
make: *** [makefile:233: dsdgen] Error 1

Undefined reference to `print_dist_index'

Hello, when I run 'make' in linux system, it shows errors like that. Could you give me a help on this issue? Sorry for my poor understanding on C language.

ubuntu@VM-4-4-ubuntu:~/dsb/code/tools$ make
gcc -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DYYDEBUG  -DLINUX -g -Wall  -o distcomp dcgram.o dcomp.o grammar.o error_msg.o StringBuffer.o r_params.o  -lm
/usr/bin/ld: dcomp.o: in function `AddGuassianDist':
/home/ubuntu/dsb/code/tools/dcomp.c:248: undefined reference to `init_rand'
/usr/bin/ld: /home/ubuntu/dsb/code/tools/dcomp.c:263: undefined reference to `genrand_gaussian_dist'
/usr/bin/ld: /home/ubuntu/dsb/code/tools/dcomp.c:271: undefined reference to `print_dist_index'
/usr/bin/ld: dcomp.o: in function `main':
/home/ubuntu/dsb/code/tools/dcomp.c:333: undefined reference to `print_dist_index'
collect2: error: ld returned 1 exit status
make: *** [makefile:237: distcomp] Error 1

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.