Giter Club home page Giter Club logo

blitzcrank's Introduction

Blitzcrank: Fast Semantic Compression for In-memory Online Transaction Processing

Introducing Delayed Coding: A Novel Entropy Coding Algorithm

This repository contains the code for the paper titled "Blitzcrank: Fast Semantic Compression for In-memory Online Transaction Processing," accepted by VLDB'24.

Blitzcrank is a library for compressing row-store OLTP databases. It uses a novel entropy coding algorithm called Delayed Coding, which achieves near-entropy compression factors while maintaining fast decompression speeds.

Clone Instructions

To clone this project successfully, ensure you have git-lfs installed. We use it to manage large dataset files. Depending on your internet connection, the clone process may take several minutes to finish.

Project Structure

The main project structure is as follows. Two versions of Blitzcrank are provided: Delayed Coding and Arithmetic Coding. You can switch between them by modifying the top-level CMakeLists.txt.

.
├── CMakeLists.txt
├── README.md
├── LICENSE
├── rapidjson    // it is a third-party library used to parse JSON.
├── delayed_coding
│         ├── CMakeLists.txt
│         ├── include
│         │         ├── base.h
│         │         ├── blitzcrank_exception.h
│         │         ├── categorical_model.h
│         │         ├── categorical_tree_model.h
│         │         ├── compression.h
│         │         ├── data_io.h
│         │         ├── decompression.h
│         │         ├── index.h
│         │         ├── json_base.h
│         │         ├── json_compression.h
│         │         ├── json_decompression.h
│         │         ├── json_model.h
│         │         ├── json_model_learner.h
│         │         ├── model.h
│         │         ├── model_learner.h
│         │         ├── numerical_model.h
│         │         ├── simple_prob_interval_pool.h
│         │         ├── string_model.h
│         │         ├── string_squid.h
│         │         ├── string_tools.h
│         │         ├── timeseries_model.h
│         │         └── utility.h
│         └── src
│             ├── categorical_model.cpp
│             ├── categorical_tree_model.cpp
│             ├── compression.cpp
│             ├── data_io.cpp
│             ├── decompression.cpp
│             ├── json_base.cpp
│             ├── json_compression.cpp
│             ├── json_decompression.cpp
│             ├── json_model.cpp
│             ├── json_model_learner.cpp
│             ├── model.cpp
│             ├── model_learner.cpp
│             ├── numerical_model.cpp
│             ├── simple_prob_interval_pool.cpp
│             ├── string_model.cpp
│             ├── string_squid.cpp
│             ├── string_tools.cpp
│             ├── timeseries_model.cpp
│             └── utility.cpp
├── plain_ra.cpp
├── JSON.cpp
└── tabular.cpp

Build Instructions

Cmake is an open-source, cross-platform family of tools designed to build, test, and package software. A cmake config is provided. It can generate Makefiles or other scripts to create Blitzcrank binary.

We build Blitzcrank with:

cd ./build-release
cmake -DCMAKE_BUILD_TYPE=Release ..
make

Compression Instructions

./tabular_blitzcrank [mode] [dataset] [config] [if use "|" as delimiter] [if skip learning] [block size]
  • [mode]:

    • -c for compression
    • -d for decompression
    • -b for benchmarking
  • [dataset]: path to the dataset

  • [config]: path to the config file

  • [if use "|" as delimiter]:

    • 0 for comma
    • 1 for "|"
  • [if skip learning]:

    • 0 for learning
    • 1 for skipping learning
  • [block size]: block size for compression


Example: USCensus1990

Let's use the USCensus1990 dataset as an example. You can find this dataset and its configuration file in the build-release folder. Alternatively, you can download the dataset directly from this link. The configuration file is necessary for compressing or decompressing a dataset, as it contains metadata specific to the dataset. We've provided the config file for the USCensus1990 dataset.

Benchmark Mode:

./tabular_blitzcrank -b USCensus1990.dat USCensus1990.config 0 1 20000


Delimiter: ,	Skip Learning: 1	Block Size: 20000	
[Compression]	Start load data into memory...
Data loaded.
Model Size: 3.26465 KB. 
Throughput:  108.524 MiB/s	Time:  3.15137 s
[Decompression]	Block Size: 286 tuple
Throughput:  187.538 MiB/s	Time:  1.82363 s
[Compression Factor (Origin Size / CompressedSize)]: 11.5654
Compressed Size: 31030821

Compress Mode:

./tabular_blitzcrank -c USCensus1990.dat USCensus1990.com USCensus1990.config 0 1 20000


Delimiter: ,	Skip Learning: 1	Block Size: 20000	
Start load data into memory...
Data loaded.
Iteration 1 Starts
Model Size: 3.26465 KB. 
Compressed Size: 31030821

Decompress Mode

./tabular_blitzcrank -d USCensus1990.com USCensus1990.rec USCensus1990.config 0 20000

After the execution, we check the compression correctness:

diff USCensus1990.dat USCensus1990.rec

blitzcrank's People

Contributors

yimingqiao avatar

Stargazers

 avatar 家有仙旺 avatar  avatar Joseph Wyatt avatar  avatar 姚思锐 avatar 雖千萬事吾往矣 avatar  avatar  avatar  avatar Yummy Cha avatar Xiao Chen avatar  avatar  avatar 刘译蓬 avatar Мацутакэ avatar snowflowersnowflake avatar Nikhil Bhende avatar Yueyue Wang avatar  avatar Peirong Liu avatar 白马非马 avatar  avatar  avatar Mike avatar James avatar Winter Cao avatar Erichen avatar X-LEFT avatar Maynor avatar  avatar  avatar  avatar daydalek avatar Lwy avatar shinyim avatar  avatar Jiaoyi Zhang avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.