The repository is structured as follows:
- In the `generator` folder we store the scripts to generate parquet files with the different parquet writers.
- In the `job` and `tpch` folders we store the scripts to run the benchmarks on each database.
Requirements:
- Spark-Shell
- Python packages: `duckdb` and `tableauhyperapi` (see `requirements.txt` for the versions used)
```shell
./tpch-data-generator.py <source-path> <destination-path>
```

```shell
./spark-shell --conf spark.driver.args="<source-path> <destination-path> [compressed] [onefile]"
:load tpch-data-generator.scala
```
The benchmark scripts assume that the parquet files are located in the current directory.
```shell
./duckdb_tpch.py
./tpch-data-generator.scala
./umbra_tpch.sh
```