Giter Club home page Giter Club logo

tpcds-kit's Introduction

tpcds-kit

The official TPC-DS tools can be found at tpc.org.

This version is based on v2.10.0 and has been modified to:

  • Allow compilation under macOS (commit 2ec45c5)
  • Address obvious query template bugs like
  • Rename s_web_returns column wret_web_site_id to wret_web_page_id to match specification. See #22 & #42.

To see all modifications, diff the files in the master branch to the version branch. Eg: master vs v2.10.0.

Setup

Linux

Make sure the required development tools are installed:

Ubuntu:

sudo apt-get install gcc make flex bison byacc git

CentOS/RHEL:

sudo yum install gcc make flex bison byacc git

Then run the following commands to clone the repo and build the tools:

git clone https://github.com/gregrahn/tpcds-kit.git
cd tpcds-kit/tools
make OS=LINUX

macOS

Make sure the required development tools are installed:

xcode-select --install

Then run the following commands to clone the repo and build the tools:

git clone https://github.com/gregrahn/tpcds-kit.git
cd tpcds-kit/tools
make OS=MACOS

Using the TPC-DS tools

Data generation

Data generation is done via dsdgen. See dsdgen -help for all options. If you do not run dsdgen from the tools/ directory then you will need to use the option -DISTRIBUTIONS /.../tpcds-kit/tools/tpcds.idx. The output directory (specified via the -DIR option) must exist prior to running dsdgen.

Query generation

Query generation is done via dsqgen. See dsqgen -help for all options.

The following command can be used to generate all 99 queries in numerical order (-QUALIFY) for the 10TB scale factor (-SCALE) using the Netezza dialect template (-DIALECT) with the output going to /tmp/query_0.sql (-OUTPUT_DIR).

dsqgen \
-DIRECTORY ../query_templates \
-INPUT ../query_templates/templates.lst \
-VERBOSE Y \
-QUALIFY Y \
-SCALE 10000 \
-DIALECT netezza \
-OUTPUT_DIR /tmp

tpcds-kit's People

Contributors

gregrahn avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

tpcds-kit's Issues

y.tab.h not found

Looks like a recent change broke this:

[ec2@ip-111 tools]$ make
yacc -d -v qgen.y
gcc -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DYYDEBUG -DLINUX -g -Wall -c -o expr.o expr.c
expr.c:49:19: fatal error: y.tab.h: No such file or directory
#include "y.tab.h"
^
compilation terminated.
make: *** [expr.o] Error 1
[ec2@ip-111 tools]$

(I have the instance around, let me know if you want me to test something out)

tpcds_ri.sql has several bugs

Errors related to two references to columns that do not exist and three duplicate name collisions.

psql:tpcds-kit/tools/tpcds_ri.sql:39: ERROR:  column "cp_promo_id" referenced in foreign key constraint does not exist
psql:tpcds-kit/tools/tpcds_ri.sql:50: ERROR:  constraint "cr_i" for relation "catalog_returns" already exists
psql:tpcds-kit/tools/tpcds_ri.sql:55: ERROR:  column "cr_ship_date_sk" referenced in foreign key constraint does not exist
psql:tpcds-kit/tools/tpcds_ri.sql:119: ERROR:  constraint "wr_ret_cd" for relation "web_returns" already exists
psql:tpcds-kit/tools/tpcds_ri.sql:124: ERROR:  constraint "ws_b_cd" for relation "web_sales" already exists

Error: open of distributions failed: tpcds.idx: no such file or directory

I have the following nix expression to set up the TPCDS toolkit. All is well, the executables (dsdgen dsqgen distcomp mkheader checksum) are copied but then when I run dsdgen I get this error: "Error: open of distributions failed: tpcds.idx: no such file or directory"

{ pkgs, ... }:
    pkgs.stdenv.mkDerivation rec {
      pname = "tpcds-kit";
      version = "unstable-2019-06-14";
      src = pkgs.fetchFromGitHub {
        owner = "gregrahn";
        repo = pname;
        rev = "9d01e73403c32d8e3d89a987d096b98cbfae3c62";
        sha256 = "removed";
      };
      preBuild = "cd tools";
      buildInputs = with pkgs; [ bison byacc (pkgs.writeScriptBin "lex" "exec ${flex}/bin/flex $@") ];
      hardeningDisable = [ "all" ];

      installPhase = ''
        mkdir -p $out/bin
        cp dsdgen dsqgen distcomp mkheader checksum tpcds $out/bin;

      '';

    }

It seems that the tpcds.idx (since its not an executable, maybe?) is not copied into that folder, but the 5 executables are. If cd-ing into it I can't see it, dsdgen can't see it either and returns this error. Can confirm that in my local install of tpcds-kit, the tpcds.idx and tpcds.idx.h files are there and dsdgen can be run just fine. I am trying to set it up in Google cloud instance to work the same way. Tried adding tpcds to the cp command, but it seems to not do anything.

Is there something obvious I am missing? thank you in advance

FILTER flag no longer working

It seems that with the recent changes the FILTER flag for dsdgen no longer ouputs to STDOUT. __FILTER is still listed but data is output to local dir.

pb running dsqgen

Hi,
i need help to run dsqgen
./dsqgen -input /TPC_DS/tpc-ds_gen/query_templates/templates.lst -directory /host/Users/TPC_DS/tpc-ds_gen/query_templates -dialect oracle -scale 1

the output is :
ERROR: Substitution'STATE' is used before being initialized at line 57 in /TPC_DS/tpc-ds_gen/query_templates/query1.tpl
and is almost the same when running with -template query1.tpl or any other query
best,
rim

bugs in query22a

As of v2.4 the results CTE has a group by that appears to be a mistake. There is no aggregate and when removing the group by query22a returns the same result then as query22.

ERROR:  column "inventory.inv_quantity_on_hand" must appear in the GROUP BY clause or be used in an aggregate function
LINE 6:              ,inv_quantity_on_hand qoh
                      ^

Additionally query22a has a reference to the warehouse table and corresponding join -- which is not present in query22.

from inventory
           ,date_dim
           ,item
           ,warehouse  -- <<< Not in query22
       where  inv_date_sk=d_date_sk
              and inv_item_sk=i_item_sk
              and inv_warehouse_sk = w_warehouse_sk  -- <<< Not in query22
              and d_month_seq between [DMS] and [DMS] + 11

1 Column missed in s_web_returns

Dear all,
When I'm preparing the refresh data, I created the source table with tpcds_source.sql. In this sql file, there's 1 table named s_web_returns. The statement is --
create table s_web_returns
(
wret_web_site_id char(16) ,
wret_order_id integer not null,
wret_line_number integer not null,
wret_item_id char(16) not null,
wret_return_customer_id char(16) ,
wret_refund_customer_id char(16) ,
wret_return_date char(10) ,
wret_return_time char(10) ,
wret_return_qty integer ,
wret_return_amt numeric(7,2) ,
wret_return_tax numeric(7,2) ,
wret_return_fee numeric(7,2) ,
wret_return_ship_cost numeric(7,2) ,
wret_refunded_cash numeric(7,2) ,
wret_reversed_charge numeric(7,2) ,
wret_account_credit numeric(7,2) ,
wret_reason_id char(16)
);

But while creating the fact view wrv on it, we can see an error -- Column 'wret_web_page_id' not found, and by check for wrv's statement --

CREATE VIEW wrv AS
SELECT
d_date_sk wr_return_date_sk
,t_time_sk wr_return_time_sk
,i_item_sk wr_item_sk
,c1.c_customer_sk wr_refunded_customer_sk
,c1.c_current_cdemo_sk wr_refunded_cdemo_sk
,c1.c_current_hdemo_sk wr_refunded_hdemo_sk
,c1.c_current_addr_sk wr_refunded_addr_sk
,c2.c_customer_sk wr_returning_customer_sk
,c2.c_current_cdemo_sk wr_returning_cdemo_sk
,c2.c_current_hdemo_sk wr_returning_hdemo_sk
,c2.c_current_addr_sk wr_returing_addr_sk
,wp_web_page_sk wr_web_page_sk
,r_reason_sk wr_reason_sk
,wret_order_id wr_order_number
,wret_return_qty wr_return_quantity
,wret_return_amt wr_return_amt
,wret_return_tax wr_return_tax
,wret_return_amt + wret_return_tax AS wr_return_amt_inc_tax
,wret_return_fee wr_fee
,wret_return_ship_cost wr_return_ship_cost
,wret_refunded_cash wr_refunded_cash
,wret_reversed_charge wr_reversed_charge
,wret_account_credit wr_account_credit
,wret_return_amt+wret_return_tax+wret_return_fee
-wret_refunded_cash-wret_reversed_charge-wret_account_credit wr_net_loss
FROM s_web_returns LEFT OUTER JOIN date_dim ON (cast(wret_return_date as date) = d_date)
LEFT OUTER JOIN time_dim ON ((CAST(SUBSTR(wret_return_time,1,2) AS integer)*3600
+CAST(SUBSTR(wret_return_time,4,2) AS integer)*60+CAST(SUBSTR(wret_return_time,7,2) AS integer))=t_time)
LEFT OUTER JOIN item ON (wret_item_id = i_item_id)
LEFT OUTER JOIN customer c1 ON (wret_return_customer_id = c1.c_customer_id)
LEFT OUTER JOIN customer c2 ON (wret_refund_customer_id = c2.c_customer_id)
LEFT OUTER JOIN reason ON (wret_reason_id = r_reason_id)
LEFT OUTER JOIN web_page ON (wret_web_page_id = WP_WEB_PAGE_id)
WHERE i_rec_end_date IS NULL AND wp_rec_end_date IS NULL;

But there's no wret_web_page_id column defined in s_web_returns. So would you please have a look and also modify the refresh data generator to add this column?

Thanks in advance for your help.

Regards
Eisen

Where is the query type classification for TPC-DS queries?

The queries are categorized into four types and few examples are there.

Reporting Class 41
Ad Hoc Class 59
Iterative Class 4
Data Mining Class 23

I did not find the type of each query anywhere. One hint is that store_sales queries are ad-hoc and catalog_sales are reporting. There is no other data point to classify the queries.

Running the tests for MySQL Server

Hi gregrahn,

Thank you for this repository

I have an error trying to run the tests on a MySQL server on my machine. I suppose it has to do with the fact that the 'with' operator isn't supported in MySQL. Can help me out in trying to run these tests on a MySQL server?

Any help is appreciated.

Thanks,
Bhargav

What's the bkc for view for schema source table?

Dear all,

 In "TPC BENCHMARK ™ DS Standard Specification Version 2.3.0", Page 49, there's Data 

Maintenance Method 1 description --
Method 1: Fact Table Load
for every row v in view V corresponding to fact table F
get row v into local variable lv
for every type 1 business key column bkc in v
get row d from dimension table D corresponding to bkc
where the business keys of v and d are equal
update bkc of lv with surrogate key of d
end for
for every type 2 business key column bkc in v
get row d from dimension table D corresponding to bkc
where the business keys of v and d are equal and rec_end_date is NULL
update bkc of lv with surrogate key of d
end for
insert lv into F
end for

 But what is the bkc for Fact Table or the view? -- As we know only Dimension table has the business

key. And by check all the Business Key column should name with postfix "_id" and for all the DM views
-- I can't find any column named like "..._id" ... And even weird -- there are two kinds of business keys
-- "type 1 business keys" and "type 2 business keys" ... Really confused on this. Would anyone kind help
me out?
Thanks in advance for your help.

Regards
Eisen

Changing range of dates

With the default setting, data is generated from 1998-01-02 to 2003-01-02.
Is it possible to generate data for a different range of dates ?
for instance 2015-01-01 to 2020-01-01..

query91 mismatch between business question and SQL code

The business question for query91 states:

Display total returns of catalog sales by call center and manager in a particular month for male customers of unknown education or female customers with advanced degrees with a specified buy potential and from a particular time zone.

However the SQL contains

and     ( (cd_marital_status       = 'M' and cd_education_status     = 'Unknown')
        or(cd_marital_status       = 'W' and cd_education_status     = 'Advanced Degree'))

For male/female customers the query should use cd_gender and not cd_marital_status like such:

and     ( (cd_gender       = 'M' and cd_education_status     = 'Unknown')
        or(cd_gender       = 'F' and cd_education_status     = 'Advanced Degree'))

query11.tpl missing column alias

The SELECTCONE variable references a column customer_preferred_cust_flag but there is no column alias matching that. Need to add that alias for c_preferred_cust_flag from customer.

How are validation queries generated using substitution parameters?

There is an answer_sets folder but it does not work for normal queries generated using scale factor 1. We need to change substitution parameters (section 11 appendix B of TPCDS.1.3.1.pdf doc) after which results match. I could not find any automated way to do this.
Is there an option in dsqgen to generate validation queries using correct substitution parameters for scale factor 1 so that validation can be done?

ERROR: Malloc Failed at 116 in permute.c

Hi gregrahn,

Thank you for this repository 🥇
I have an error after this:
dsqgen \ -DIRECTORY ../query_templates \ -INPUT ../query_templates/templates.lst \ -VERBOSE Y \ -QUALIFY Y \ -SCALE 10000 \ -DIALECT netezza \ -OUTPUT_DIR /tmp

I have in output:
qgen2 Query Generator (Version 2.6.0) Copyright Transaction Processing Performance Council (TPC) 2001 - 2017 Parsed 99 templates Malloc Failed at 116 in permute.c

Have you ever had this error?
Thank you!

error in query77a

As of v2.4 query77a is missing cr_call_center_sk in the select list and group by making it semantically different

query77a

 cr as
 (select
        sum(cr_return_amount) as returns,
        sum(cr_net_loss) as profit_loss
 from catalog_returns,
      date_dim
 where cr_returned_date_sk = d_date_sk
       and d_date between cast('[SALES_DATE]' as date)
                  and (cast('[SALES_DATE]' as date) +  30 )
 ),

query77

cr as
 (select cr_call_center_sk,
         sum(cr_return_amount) as returns,
         sum(cr_net_loss) as profit_loss
 from catalog_returns,
      date_dim
 where cr_returned_date_sk = d_date_sk
       and d_date between cast('[SALES_DATE]' as date)
                  and (cast('[SALES_DATE]' as date) +  30 days)
 group by cr_call_center_sk
 ), 

query72.tpl contains duplicate predicates

query72.tpl contains two repeated predicates:

where d1.d_week_seq = d2.d_week_seq
  and inv_quantity_on_hand < cs_quantity 
  and d3.d_date > d1.d_date + 5
  and hd_buy_potential = '[BP]'
  and d1.d_year = [YEAR]
  and hd_buy_potential = '[BP]'  -- repeated predicate
  and cd_marital_status = '[MS]'
  and d1.d_year = [YEAR]         -- repeated predicate

customer table: c_last_review_date vs c_last_review_date_sk

The definition of customer in tpcds.sql lists the last column as
c_last_review_date char(10)
however the TPC-DS spec v2.4 list the last column as
c_last_review_date_sk identifier with FK to d_date_sk

The DDL looks incorrect. This column change also impacts query30.tpl

Is there a support for dialect mysql in tpcds query generation tool?

Thanks you for helping so many people with this repository. It would have been impossible to get this to work by using the source from official site.

I have the data ready in the database. During query generation I did not find any option for mysql as a dialect. I have tried ansi and sqlserver but the queries generated there are not valid mysql queries.

Any help in this regard will be highly appreciated. Thank you.

can not generate query statement

./dsqgen -input ../query_templates/templates.lst -scale 8 -dialect oracle -directory ../query_templates

ERROR: Substitution'_END' is used before being initialized at line 64 in ../query_templates/query1.tpl
for each tpl, I must add this :define _END = ""; ???

there are so many problems in generating data&query。 AND I CAN NOT STAND SO MANY PROBLEMS IN THIS PROJECT, DOCS EVEN NOT UPDATED.

query24 missing join predicate

Both iterations (queries) in query24.tpl seem to be missing the PK/FK join condition between between customer and customer_address which correlates to customers who currently live in their birth countries and in the neighborhood of the store (snippet of business question from specification).

as of v2.3.0

from store_sales
    ,store_returns
    ,store
    ,item
    ,customer
    ,customer_address
where ss_ticket_number = sr_ticket_number
  and ss_item_sk = sr_item_sk
  and ss_customer_sk = c_customer_sk
  and ss_item_sk = i_item_sk
  and ss_store_sk = s_store_sk
  and c_current_addr_sk = ca_address_sk -- this join is missing in query24.tpl
  and c_birth_country = upper(ca_country)
  and s_zip = ca_zip
  and s_market_id=[MARKET]

query51a - erroneous line break

Line 90 has an erronous line break causing the column alias to be on two lines. Seems like an obvious mistake.

select v1.item_sk, v1.d_date, v1.web_sales, v1.store_sales, max(v2.web_sales) web_cumulative, max(v2.store_sales) s
tore_cumulative

"Error: open of distributions failed: tpcds.idx: No such file or directory"

Dear All,

I just downloaded the most recent tpcds-kit-master. I first tested with --
/asiq160/rs/tpc-ds/tpcds-kit-master/tools/dsqgen -template query99.tpl -dir /asiq160/rs/tpc-ds/tpcds-kit-master/query_templates -dialect sybase
It works fine and generate a valid sql file. But while I ran --
/asiq160/rs/tpc-ds/tpcds-kit-master/tools/dsqgen -template query98.tpl -dir /asiq160/rs/tpc-ds/tpcds-kit-master/query_templates -dialect sybase
It raised error as --
qgen2 Query Generator (Version 2.3.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2016
Error: open of distributions failed: tpcds.idx: No such file or directory

I don't know if it's due to query98.tpl is not in the list... And the batch statement --
/asiq160/rs/tpc-ds/tpcds-kit-master/tools/dsqgen -input /asiq160/rs/tpc-ds/tpcds-kit-master/query_templates/templates.lst -scale 8 -dialect sybase -directory /asiq160/rs/tpc-ds/tpcds-kit-master/query_templates

Also raised the same error --

asiq160@bigDataIQ:/iq_data/dump/query> /asiq160/rs/tpc-ds/tpcds-kit-master/tools/dsqgen -input /asiq160/rs/tpc-ds/tpcds-kit-master/query_templates/templates.lst -scale 8 -dialect sybase -directory /asiq160/rs/tpc-ds/tpcds-kit-master/query_templates
qgen2 Query Generator (Version 2.3.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2016
Error: open of distributions failed: tpcds.idx: No such file or directory

So maybe due to the tpcds.idx not contains all the 99 templates? Please kind help. Thanks in advance for your help.

Regards
Eisen

When I make the code, I will make a mistake.

When I make the code, I will make a mistake.

[root@master tools]# make gcc -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DYYDEBUG -DLINUX -g -Wall -o dsqgen address.o date.o decimal.o dist.o error_msg.o expr.o eval.o genrand.o grammar_support.o keywords.o list.o nulls.o permute.o print.o QgenMain.o query_handler.o r_params.o scaling.o StringBuffer.o substitution.o tdefs.o text.o tokenizer.o w_inventory.o y.tab.o release.o scd.o build_support.o parallel.o -lm grammar_support.o: In function include_file':
/opt/tpcds-kit-master/tools/grammar_support.c:128: undefined reference to yy_create_buffer' /opt/tpcds-kit-master/tools/grammar_support.c:129: undefined reference to yy_switch_to_buffer'
y.tab.o: In function yyparse': /opt/tpcds-kit-master/tools/y.tab.c:551: undefined reference to yylex'
/opt/tpcds-kit-master/tools/y.tab.c:1129: undefined reference to yylex' collect2: ld returned 1 exit status make: *** [dsqgen] Error 1

I need help to solve this issue. Thanks.

error in query10a

v2.5.0 introduced query10a but as-is, does not run. Each branch of the union all contains two where clauses -- the second one needs to be an and.

   exists (select *
          from (
             select ws_bill_customer_sk as customer_sk, d_year,d_moy
             from web_sales, date_dim where ws_sold_date_sk = d_date_sk
             where d_year = [YEAR] and
                   d_moy between [MONTH] and [MONTH]+3
             union all
             select cs_ship_customer_sk as customer_sk, d_year, d_moy
             from catalog_sales, date_dim where cs_sold_date_sk = d_date_sk
             where d_year = [YEAR] and
                   d_moy between [MONTH] and [MONTH]+3
	     ) x
            where c.c_customer_sk = customer_sk)

Missing comma in query77a.tpl

A change introduced in v2.9 modified query77a.tpl to fix the incorrect inline view cr (lines 75-84) is missing a comma after the first column (L76).

How to generate answer sets?

Sorry, could be a stupid question. I would like to do some tests under different scale factor, but I just can't find a way to generate the corresponding query results like those in answer_sets folder.

ORACLE date arithmetic is suspect

There is a bunch of date arithmetic in templates, that uses bad ORACLE syntax, i.e.:

query5.tpl:                  and (cast('[SALES_DATE]' as date) +  14 days)
query5.tpl:                  and (cast('[SALES_DATE]' as date) +  14 days)
query5.tpl:                  and (cast('[SALES_DATE]' as date) +  14 days)

Queries fail with:

query 7 in stream 0 using template query32.tpl
        (cast('1999-02-22' as date) + 90 days)
                                         *
ERROR at line 10:
ORA-00907: missing right parenthesis

The correct syntax is: (date) +90 or (date) + interval '90' day.

Does it make sense to adjust templates with something like: + 90 &DAYSLITERAL ?

> grep -HP '\+\s+\d+\s+days' query*.tpl
query12.tpl:				and (cast('[SDATE]' as date) + 30 days)
query16.tpl:           (cast('[YEAR]-[MONTH]-01' as date) + 60 days)
query20.tpl: 				and (cast('[SDATE]' as date) + 30 days)
query21.tpl:                    and (cast ('[SALES_DATE]' as date) + 30 days)
query32.tpl:        (cast('[CSDATE]' as date) + 90 days)
query32.tpl:                             (cast('[CSDATE]' as date) + 90 days)
query37.tpl: and d_date between cast('[INVDATE]' as date) and (cast('[INVDATE]' as date) +  60 days)
query40.tpl:                and (cast ('[SALES_DATE]' as date) + 30 days)
query5.tpl:                  and (cast('[SALES_DATE]' as date) +  14 days)
query5.tpl:                  and (cast('[SALES_DATE]' as date) +  14 days)
query5.tpl:                  and (cast('[SALES_DATE]' as date) +  14 days)
query77.tpl:                  and (cast('[SALES_DATE]' as date) +  30 days)
query77.tpl:                  and (cast('[SALES_DATE]' as date) +  30 days)
query77.tpl:                  and (cast('[SALES_DATE]' as date) +  30 days)
query77.tpl:                  and (cast('[SALES_DATE]' as date) +  30 days)
query77.tpl:                  and (cast('[SALES_DATE]' as date) +  30 days)
query77.tpl:                  and (cast('[SALES_DATE]' as date) +  30 days)
query80.tpl:                  and (cast('[SALES_DATE]' as date) +  30 days)
query80.tpl:                  and (cast('[SALES_DATE]' as date) +  30 days)
query80.tpl:                  and (cast('[SALES_DATE]' as date) +  30 days)
query82.tpl: and d_date between cast('[INVDATE]' as date) and (cast('[INVDATE]' as date) +  60 days)
query92.tpl:        (cast('[WSDATE]' as date) + 90 days)
query92.tpl:                             (cast('[WSDATE]' as date) + 90 days)
query94.tpl:           (cast('[YEAR]-[MONTH]-01' as date) + 60 days)
query95.tpl:           (cast('[YEAR]-[MONTH]-01' as date) + 60 days)
query98.tpl:				and (cast('[SDATE]' as date) + 30 days)

Dialect definitions missing _BEGIN and _END defines

At some point the dialect definitions lost the following two lines and dsqgen will error if not present.

define _BEGIN = "-- start query " + [_QUERY] + " in stream " + [_STREAM] + " using template " + [_TEMPLATE];
define _END = "-- end query " + [_QUERY] + " in stream " + [_STREAM] + " using template " + [_TEMPLATE];

How to choose directory properly?

Hello I am on windows trying to generate data with dsdgen.exe
Command I've tried: dsdgen SCALE 1 DIR c:\tmp
But it doesn't create in the specified directory, instead just does it in the local tools directory.
My question is what did I do wrong? I checked the help section

bug in q59

The below section has week1/week2 but the tues_sales is

tue_sales1/tue_sales1

and should likely be

tue_sales1/tue_sales2
[_LIMITA] select [_LIMITB] s_store_name1,s_store_id1,d_week_seq1
       ,sun_sales1/sun_sales2,mon_sales1/mon_sales2
       ,tue_sales1/tue_sales1,wed_sales1/wed_sales2,thu_sales1/thu_sales2
       ,fri_sales1/fri_sales2,sat_sales1/sat_sales2

Cannot generate TPC-DS data at scale factors < 1

When I attempt to generate TPC-DS data at scale factor < 1, I get the following:

[arunm2@vm-arunm2 tools]$ ./dsdgen -DIR ~/tpcds -SCALE 0.01
dsdgen Population Generator (Version 2.10.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2018
Segmentation fault (core dumped)

Small scale factors are extremely useful to test functionality.

Functional dependency between i_brand_id and brand does not hold

In the item table, for all current records (where i_rec_end_date is NULL) each unique value for column i_brand_id does not point to only only 1 brand name. In other words, the brand name cannot be determined from the brand id itself. Same is the case with class_id and class columns. Even if it is the way it was meant to be, isn't it semantically incorrect?

Another issue is that the hierarchy item->brand->class-> category does not exist in the generated data.

Failed to open output file

Hi gregrahn,

I am getting this error. Any idea how can I fix it?

dsdgen Population Generator (Version 2.10.0)
dsdgen Population Generator (Version 2.10.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2018
Copyright Transaction Processing Performance Council (TPC) 2001 - 2018
dsdgen Population Generator (Version 2.10.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2018
Warning: Selected scale factor is NOT valid for result publication
Warning: Selected scale factor is NOT valid for result publication
dsdgen Population Generator (Version 2.10.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2018
Warning: Selected scale factor is NOT valid for result publication
dsdgen Population Generator (Version 2.10.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2018
Warning: Selected scale factor is NOT valid for result publication
dsdgen Population Generator (Version 2.10.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2018
Warning: Selected scale factor is NOT valid for result publication
[ec2-user@ip-10-0-0-131 hands-on-with-redshift]$ Warning: Selected scale factor is NOT valid for result publication
dsdgen Population Generator (Version 2.10.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2018
dsdgen Population Generator (Version 2.10.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2018
Warning: Selected scale factor is NOT valid for result publication
Warning: Selected scale factor is NOT valid for result publication
ERROR: Failed to open output file!
File: print.c
Line: 490
ERROR: Failed to open output file!
File: print.c
Line: 490
ERROR: Failed to open output file!
File: print.c
Line: 490
ERROR: Failed to open output file!
File: print.c
Line: 490
ERROR: Failed to open output file!
File: print.c
Line: 490
ERROR: Failed to open output file!
File: print.c
Line: 490
ERROR: Failed to open output file!
File: print.c
Line: 490
ERROR: Failed to open output file!
File: print.c
Line: 490

TPC-DS database problems with foreign keys in MySQL

Hi, I am using TPC-DS and I already uploaded the data generated with dsgen to the Database. However, when I try to run tpcds_ri.sql to include the foreign keys in the system I have several violations of these constraints. Is there any recommendation to avoid this problem? Should I follow a different process to implement these foreign key constraints?
I am generating a 1 GB database.
Thanks

Select rename column should contain "AS"

When aliasing a column in SELECT query, I believe the word "AS" is the standard SQL (e.g., select sum(x) AS s). Though some system allow omitting it, others don't, such as MonetDB.

The current template files don't have AS when aliasing a column. I think it is desired to add it.

Edit: I just find "AS" is missing only in some queries, e.g., query3.

// Generate the data build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d <dsdgenDir> -s <scaleFactor> -l <location> -f <format>" run error

[error] at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:215)
[error] at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:197)
[error] at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
[error] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
[error] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[error] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
[error] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[error] at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
[error] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[error] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[error] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[error] at org.apache.spark.scheduler.Task.run(Task.scala:127)
[error] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
[error] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
[error] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:748)
[error] Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
[error] at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
[error] at org.apache.spark.sql.Row.isNullAt(Row.scala:211)
[error] at org.apache.spark.sql.Row.isNullAt$(Row.scala:211)
[error] at org.apache.spark.sql.catalyst.expressions.GenericRow.isNullAt(rows.scala:166)
[error] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown Source)
[error] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[error] at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:211)
[error] ... 17 more
[error] 21/08/19 16:16:28 ERROR Executor: Exception in task 5.0 in stage 0.0 (TID 5)
[error] java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 0

...

MySQL issue

Hello,
Is there any way to generate queries to use on MySQL? I've trying to use the netezza template but ERROR 1064 pops up on almost every single query. I'm just starting to use it so I'm rookie. Any help? Can't use other DBMS since it's a school project.

Best regards

dsdgen: command not found

[root@master tools]# dsdgen --help
-bash: dsdgen: command not found

and

[root@master tools]# dsdgen --DISTRIBUTIONS tpcds.idx
-bash: dsdgen: command not found

no data generated for the 2nd of March for non-leap years

There appears to be a bug in the data generation that only generates data for the 2nd of March when the year is a leap year, in this case year 2000. As a result, 1998, 1999, 2001, and 2002 have no data for the 2nd of March. This means 1823 of 1827 days have data for the given date range, 1998-01-02 to 2003-01-02.

Not a critical bug, but documenting it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.