msc-project-gpu-joins-evaluation's Introduction

Performance Study of GPU-based Joins

This repository contains four GPU-based join implementations and the code to evaluate them. The implementations are:

  • SMJI - State-of-the-art sort merge join using radix sort and Merge Path. "SMJ-UM" in the paper.
  • SMJ - Improved SMJI using the GFTR technique. "SMJ-OM" in the paper.
  • PHJ - State-of-the-art partitioned hash join implementation from Sioulas et al. "PHJ-UM" in the paper.
  • SHJ - Improved PHJ using the GFTR pattern with a redesigned partitioning strategy. "PHJ-OM" in the paper.

Dependencies

Our code is developed and tested under the following environment.

  • CUDA 12.2 or 12.3 (including nvcc, cub, thrust and so on)
  • GCC 10.2.1 or 11.4.0 (10.2.1 for the A100 machine, 11.4.0 for the RTX 3090 machine)

Usage

Configure the project

There are three places in the code base that need to be customized to your machine.

  1. In the Makefile, specify the compute capability of your GPU. We have tested the code on RTX 3090 (8.6) and A100 (8.0).
  2. In src/volcano/utils.cuh, change the mem_pool_size variable to your own GPU's memory capacity.
  3. In src/volcano/tpc_utils.hpp, change the TPC_DATA_PREFIX (absolute path) to the directory where the TPC-H and TPC-DS data are stored. See more in "Run the TPC-H/DS benchmarks".
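
If you are unsure which values to use in steps 1 and 2, the sketch below shows one way to look them up. It assumes `nvidia-smi` is installed and that the driver is recent enough to support the `compute_cap` query field. `cc_to_arch` is a hypothetical helper, not part of this repository, that converts a compute capability such as 8.6 into the `sm_86` form accepted by nvcc's `-arch` flag. How the Makefile actually spells the setting is not shown here, so check the file itself.

```shell
# Hypothetical helper (not part of this repository): turn a compute
# capability such as "8.6" into the "sm_86" form accepted by nvcc -arch.
cc_to_arch() {
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Query the local GPU only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    # memory.total is reported in MiB; compute_cap needs a recent driver.
    nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv,noheader
fi

cc_to_arch 8.6   # RTX 3090 -> sm_86
cc_to_arch 8.0   # A100     -> sm_80
```

The reported memory total is the card's full capacity, so treat it as an upper bound when setting mem_pool_size.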

Then in the project home directory, run

sh configure.sh

This will compile all available executables to bin/volcano/, including

  • bin/volcano/join_exp_4b4b: 4-byte keys + 4-byte non-keys for Section 5.2.1 - 5.2.6.
  • bin/volcano/join_exp_4b8b: 4-byte keys + 8-byte non-keys for Section 5.2.5.
  • bin/volcano/join_exp_8b8b: 8-byte keys + 8-byte non-keys for Section 5.2.5.
  • bin/volcano/join_pipeline: sequence of joins for Section 5.2.7.
  • bin/volcano/tpch_[7,18,19]: Joins extracted from TPC-H Q7, 18, 19 for Section 5.3.
  • bin/volcano/tpcds_[64,95]: Joins extracted from TPC-DS Q64, 95 for Section 5.3.

The microbenchmarks generate their own input data, and generation is slow, so it is more efficient to cache the generated data on disk. To store the input data in <path_to_input_data>, run

mkdir -p <path_to_input_data>/int/ <path_to_input_data>/long/

Run the microbenchmarks

You can directly invoke the four executables and pass in the configurations. You can check the instructions by passing the -h flag to each executable.

Alternatively, we have prepared a script to run all the microbenchmarks in run.sh. Simply run the following command in the project home directory.

sh run.sh <repeat times> <path_to_input_data>

The results will be written to the exp_results/gpu_join/ directory, and each microbenchmark is stored as a separate CSV file.
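
After a run, a quick way to see which microbenchmarks produced output is to list each CSV file with its row count. This is a minimal sketch: `summarize_results` is a hypothetical helper, and it assumes each CSV starts with a single header line, which is not confirmed here.

```shell
# List each result file with its number of data rows (header excluded).
# Assumption: every CSV begins with exactly one header line.
summarize_results() {
    dir="${1:-exp_results/gpu_join}"
    for f in "$dir"/*.csv; do
        [ -e "$f" ] || continue          # directory missing or no CSV files yet
        rows=$(($(wc -l < "$f") - 1))
        printf '%s\t%s rows\n' "$(basename "$f")" "$rows"
    done
}

summarize_results exp_results/gpu_join
```

The function prints nothing when the directory is empty or does not exist yet, so it is safe to call before the first run.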

Run the TPC-H/DS benchmarks

Running TPC-H/DS requires input data to be generated from the data generators, i.e., dbgen and dsdgen. For ease of use, we have uploaded all relevant data to the polybox. Download the vldb24_tpch_tpcds.tar.xz to your machine and then decompress and extract the data.

xz -d -v vldb24_tpch_tpcds.tar.xz
tar -xvf vldb24_tpch_tpcds.tar -C .

In the end, make sure your TPC_DATA_PREFIX directory has the following structure.

  • TPC_DATA_PREFIX
    • tpch_sf10
    • tpcds_sf100
      • q64
      • q95
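
The layout above can be checked before running the benchmarks. The sketch below uses a hypothetical helper, `check_tpc_layout`, and a placeholder path. Substitute the directory you set as TPC_DATA_PREFIX. It only verifies that the directories exist, not that the data files inside them are complete.

```shell
# Report whether the expected sub-directories exist under a data prefix.
# Returns 0 when all are present and 1 otherwise.
check_tpc_layout() {
    prefix="$1"
    status=0
    for d in tpch_sf10 tpcds_sf100/q64 tpcds_sf100/q95; do
        if [ -d "$prefix/$d" ]; then
            echo "ok       $prefix/$d"
        else
            echo "missing  $prefix/$d"
            status=1
        fi
    done
    return "$status"
}

# /path/to/tpc_data is a placeholder for your TPC_DATA_PREFIX value.
check_tpc_layout /path/to/tpc_data || echo "layout incomplete"
```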

Notes

  1. It is recommended to turn on the persistence mode to reduce the program launch time. See this guide. You can check if it is turned on via nvidia-smi.
