ldbc_snb_example_data's Introduction

LDBC SNB Data Converter

Scripts to convert from raw graphs produced by the SNB Datagen to graph data sets using various layouts (e.g. storing edges as merged foreign keys).

This repository uses a mix of Bash, Python, and DuckDB SQL scripts. The get.sh script installs the Python dependencies and downloads a recent DuckDB binary if it does not exist in the repository directory (the script is automatically invoked by load.sh).

If you want to use a custom-built DuckDB binary:

  • set the DUCKDB_PATH environment variable to the location of the duckdb binary (default value: .)
  • make sure the Python packages have been recompiled (see instructions)
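
As a rough illustration of the lookup described above, the following Python sketch resolves the binary location from DUCKDB_PATH and falls back to the current directory. The helper name resolve_duckdb_binary is hypothetical, and treating DUCKDB_PATH as a directory that contains a file named duckdb is an assumption; the repository's scripts may resolve the path differently.

```python
import os
from pathlib import Path


def resolve_duckdb_binary(env=None):
    """Return the assumed location of the duckdb binary.

    Assumption: DUCKDB_PATH names the directory holding the binary,
    and "." is used when the variable is unset (the documented default).
    """
    env = os.environ if env is None else env
    return Path(env.get("DUCKDB_PATH", ".")) / "duckdb"


# Prints the path that would be used in the current environment.
print(resolve_duckdb_binary())
```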

Example data set

The example data set in this repository reflects the toy graphs used in the LDBC SNB.

The example graph is serialized using the raw serializer (composite-merged-fk layout), which contains the entire temporal graph without filtering or batching.

Generate data sets

Use the data generator in raw mode to generate the data sets. Set the $LDBC_DATA_DIRECTORY environment variable to point to the directory of Datagen's output (containing the static and dynamic directories). Currently, you also have to concatenate the CSVs using the following script.

DATAGEN_OUTPUT_DIR=TodoSetMe
LDBC_DATA_DIRECTORY=${DATAGEN_OUTPUT_DIR}/csv/raw/composite-merged-fk
./spark-concat.sh ${LDBC_DATA_DIRECTORY}

Processing data sets

To process the data sets, run the following scripts (the first one downloads DuckDB if it's not yet available):

./load.sh ${LDBC_DATA_DIRECTORY} --no-header
./transform.sh
./export.sh
# optional
./rename.sh

The duckdb directory contains Python and SQL scripts to convert data to other formats (e.g. CsvCompositeProjectedFK and CsvSingularMergedFK).

Deployed data sets

Parameter generation

Run paramgen as follows:

./load.sh ${LDBC_DATA_DIRECTORY} --no-header
./transform.sh
./factor-tables.sh
./paramgen.sh

Workflows

The workflow-* directories test the benchmark workflow, i.e. loading the initial data set, then applying the batches sequentially. Each batch consists of deletes and inserts. Currently, the scripts first apply the deletes, then the inserts. Note, however, that the updates can be applied in any order, even interleaved.
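
The deletes-then-inserts order used by the scripts can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the code in the workflow-* directories: entities are plain string identifiers held in a set, and apply_batches and the sample identifiers are made up for illustration.

```python
def apply_batches(snapshot, batches):
    """Apply each batch in sequence: its deletes first, then its inserts."""
    state = set(snapshot)
    for deletes, inserts in batches:
        state -= set(deletes)  # remove the entities deleted by this batch
        state |= set(inserts)  # add the entities inserted by this batch
    return state


# Toy identifiers, not taken from the example data set.
initial = {"p1", "p2", "p3"}
batches = [
    ({"p2"}, {"p4"}),        # batch 1: delete p2, insert p4
    ({"p1"}, {"p5", "p6"}),  # batch 2: delete p1, insert p5 and p6
]
final_state = apply_batches(initial, batches)
print(sorted(final_state))
```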

Generating batches

To generate batches and test them, first load the data with load.sh (parameterized for your data set), then run the scripts that produce the initial snapshot and the batches.

./load.sh
./transform.sh
./generate-batches.sh

  • The transform.sh script produces the initial snapshot of the data.
  • The generate-batches.sh script produces batches of a given timespan (e.g. one per year) in the batches/ directory.

On the example graph:

  • The data spans 4 years in the interval 2010-2013 (inclusive on both ends).
  • There is one batch per year.
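
To make the "one batch per year" idea concrete, here is a minimal Python sketch that buckets dated records by calendar year. How generate-batches.sh actually assigns records to batches is not shown here, so the function batch_by_year and the sample records are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date


def batch_by_year(events):
    """Group (creation_date, payload) pairs by calendar year, oldest first."""
    batches = defaultdict(list)
    for created, payload in events:
        batches[created.year].append(payload)
    return dict(sorted(batches.items()))


# Made-up records covering the 2010-2013 interval.
events = [
    (date(2010, 3, 1), "a"),
    (date(2011, 7, 15), "b"),
    (date(2013, 12, 31), "c"),
    (date(2012, 1, 1), "d"),
    (date(2010, 11, 9), "e"),
]
yearly = batch_by_year(events)
print(list(yearly))
```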

ldbc_snb_example_data's People

Contributors

jackwaudby, szarnyasg

ldbc_snb_example_data's Issues

Materialize views

Views in DuckDB are not materialized.
For parameter generation, our views are small and we use them multiple times, so it's worth creating them as tables.
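
The difference between a view and a materialized table can be sketched as follows. To keep the example runnable without extra packages, it uses Python's built-in sqlite3 module as a stand-in for DuckDB; DuckDB also accepts CREATE TABLE ... AS SELECT. The Person table, its columns, and the view name are hypothetical and do not come from the paramgen scripts.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Person (id INTEGER, creationYear INTEGER)")
con.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [(1, 2010), (2, 2010), (3, 2012)],
)

# A view stores only the query; it is re-evaluated on every use.
con.execute(
    "CREATE VIEW personsPerYear_v AS "
    "SELECT creationYear, count(*) AS n FROM Person GROUP BY creationYear"
)

# Materializing the view as a table computes the result once and stores it.
con.execute("CREATE TABLE personsPerYear AS SELECT * FROM personsPerYear_v")

rows = con.execute(
    "SELECT creationYear, n FROM personsPerYear ORDER BY creationYear"
).fetchall()
print(rows)
```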

Cascading deletes in SQL

Some SQL systems lack support for cascading deletes.

There are 4 deletes with cascading effects:

  1. DEL1 invokes DEL4, DEL6 and DEL7
  2. DEL4 invokes DEL6
  3. DEL6 invokes DEL7
  4. DEL7 is recursive

It's easy to see that, for the first three cascading effects, the sequence DEL1, DEL4, DEL6, DEL7 forms a topological ordering.
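
The ordering can be checked mechanically. The sketch below encodes the first three cascading effects as a dependency graph and sorts it with Python's standard graphlib module (Python 3.9+). The self-recursion of DEL7 is left out on purpose, since a self-loop is a cycle and has no topological order; the variable names are illustrative.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each delete operation to the operations that invoke it.
invoked_by = {
    "DEL4": {"DEL1"},
    "DEL6": {"DEL1", "DEL4"},
    "DEL7": {"DEL1", "DEL6"},
}

# static_order() yields every operation after all operations that invoke it.
order = list(TopologicalSorter(invoked_by).static_order())
print(order)
```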

This implies that one can make use of the candidate tables: when deleting Persons based on the Person_Delete_candidates table, we can add the Posts of each Person to the Post_Delete_candidates table, and so on.
Similarly, when deleting Posts, we can add the Posts' child Comments to the Comment_Delete_candidates table.
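
A minimal sketch of this candidate-table approach, written in Python with the built-in sqlite3 module so that it runs without a cascading-delete feature. The candidate table names follow the text above; the schema (column names such as CreatorPersonId, ParentPostId, and ParentCommentId) and the sample rows are assumptions, and only the Person, Post, Comment chain is modelled. The recursive Comment step is written as a loop that repeats until no new candidates appear.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Person (id INTEGER PRIMARY KEY);
CREATE TABLE Post (id INTEGER PRIMARY KEY, CreatorPersonId INTEGER);
CREATE TABLE Comment (id INTEGER PRIMARY KEY, ParentPostId INTEGER, ParentCommentId INTEGER);
CREATE TABLE Person_Delete_candidates (id INTEGER);
CREATE TABLE Post_Delete_candidates (id INTEGER);
CREATE TABLE Comment_Delete_candidates (id INTEGER);

INSERT INTO Person VALUES (1), (2);
INSERT INTO Post VALUES (10, 1), (11, 2);
INSERT INTO Comment VALUES (100, 10, NULL), (101, NULL, 100), (102, 11, NULL);
INSERT INTO Person_Delete_candidates VALUES (1);

-- Person -> Post: Posts created by a deleted Person become candidates.
INSERT INTO Post_Delete_candidates
SELECT id FROM Post
WHERE CreatorPersonId IN (SELECT id FROM Person_Delete_candidates);

-- Post -> Comment: direct replies to a deleted Post become candidates.
INSERT INTO Comment_Delete_candidates
SELECT id FROM Comment
WHERE ParentPostId IN (SELECT id FROM Post_Delete_candidates);
""")

# Comment -> Comment: add replies to candidate Comments until nothing changes.
while True:
    cur = con.execute("""
        INSERT INTO Comment_Delete_candidates
        SELECT id FROM Comment
        WHERE ParentCommentId IN (SELECT id FROM Comment_Delete_candidates)
          AND id NOT IN (SELECT id FROM Comment_Delete_candidates)
    """)
    if cur.rowcount == 0:
        break

# With all candidates collected, delete without relying on cascades.
for table in ("Comment", "Post", "Person"):
    con.execute(
        f"DELETE FROM {table} WHERE id IN (SELECT id FROM {table}_Delete_candidates)"
    )

remaining = {
    table: [row[0] for row in con.execute(f"SELECT id FROM {table} ORDER BY id")]
    for table in ("Person", "Post", "Comment")
}
print(remaining)
```

In this toy data, deleting Person 1 removes Post 10, its reply Comment 100, and the nested reply Comment 101, while the content of Person 2 is untouched.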
