Giter Club home page Giter Club logo

selectivity-estimation-for-querying-set-valued-data's Introduction

Selectivity-estimation-for-querying-set-valued-data

We use a sample dataset from Twitter which contains 10000 tuples. The running codes mainly contains 4 steps. Preprocessing. Preprocess partitioning with graph coloring based method. ST. Convert the data and query with ST method. STH. Convert the data and query with ST-hist method. Estimation. Estimate the converted query on Postgres, Neurocard or DeepDB.

Preprocessing

To partition the elements with graph coloring based method, you could run:

bash preprocessing.sh

This command will generate indices for elements with high-frequency-first order first (generate_idx.cc). Then generate graph based on the set-valued column (generate_graph.cc), and partition the graph with graph coloring method (graph_color_partition.py). The partition result is stored in the folder ./graph_color.

ST

After preprocessing, you could run the following command to convert the set-valued data into 5 numeric subcolumns and convert the queries with ST method:

bash run_settrie.sh 5

It first greedy partition the elements based on graph coloring partition result (color_based_dis_part.cc). Then construct settrie based on the partitioning (set_map.cc). Finally convert the queries based on the settrie (settrie_trans_query.cc). The partition result, settrie and converted query are stored in the folder ./color_partition_5. You can change the parameter of this command to change the number of subcolumns.

STH

Alternatively, you could run the following command to convert the set-valued data into 5 numeric subcolumns and convert the queries with STH method by keeping 500 nodes on each settrie:

bash run_settrie_hist.sh 5 500

The partition is same as the partition in ST, so in the script we directly copy the partition. If you don't partition first, please use color_based_dis_part.cc to partition the elements first. Then construct histogram based settrie based on the partitioning (freq_set_map.cc). Finally convert the queries based on the settrie (settrie_trans_query_hist.cc). The partition result, settrie and converted query are stored in the folder ./partition_5_500. You can change the parameter of this command to change the number of subcolumns and the number of kept nodes on each settrie.

Estimation

If you want to do estimation with ST on any estimator, you should run the following command first to convert the dataset:

python3 clique_trans_table.py --partnum 5

Alternatively, if you want to do estimation with STH, please modify the line 20 in clique_trans_table.py first to change the folder, then run the following command:

python3 clique_trans_table.py --partnum 5 --keepnum 500

To estimate with Postgres, you can modify generate_sql.py and run it to populate the converted dataset into Postgres. You need the package psycopg2 to run it. Then you can modify the folder path and parameters in postgres_est_new.py and run it to estimate.

If you want to estimate with Neurocard, we have modified the source code of Neurocard to support our method. You can refer to Neurocard to insert new table and run queries. We also provide the running scripts, but you need to modify some folder path and parameters in datasets.py and run.py.

If you want to estimate with Neurocard, we have modified the source code of DeepDB to support our method. You can refer to DeepDB to insert new table and run queries. We also provide the running scripts, but you need to modify some folder path and parameters in maqp.py and schemas/geotweet/schema.py.

Incremental ST and STH

The incremental ST and STH are in the folder incremental. The running procedure are similar with the base methods.

selectivity-estimation-for-querying-set-valued-data's People

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.