ldbc_snb_bi's People

Contributors

agubichev, alexaverbuch, aneeshdurg, antaljanosbenjamin, arnauprat, benhamad, dependabot[bot], filipecosta90, hbirler, jackwaudby, jmarton, marcelned, marci543, mdrz7, mgree, mirkospasic, mkaufmann, oerling, petere, pstutz, szarnyasg, thsoft, trumanwangtg, yipingxiongtg, yuchenzhangtg

ldbc_snb_bi's Issues

Q15 paramgen: person pairs without paths are returned

Q15a paramgen sometimes generates pairs that do not have a path between them, e.g.:

$ grep '|15a' output-sf1/results.csv | grep ' -1.0'
15|15a|{"person1Id": "28587302324732", "person2Id": "10995116284647", "startDate": "2012-11-02", "endDate": "2012-11-11"}|[{"weight": -1.0}]
15|15a|{"person1Id": "28587302324732", "person2Id": "10995116284647", "startDate": "2012-11-02", "endDate": "2012-11-11"}|[{"weight": -1.0}]
15|15a|{"person1Id": "28587302324732", "person2Id": "10995116284647", "startDate": "2012-11-02", "endDate": "2012-11-11"}|[{"weight": -1.0}]
15|15a|{"person1Id": "28587302324732", "person2Id": "10995116284647", "startDate": "2012-11-02", "endDate": "2012-11-11"}|[{"weight": -1.0}]
15|15a|{"person1Id": "28587302324732", "person2Id": "10995116284647", "startDate": "2012-11-02", "endDate": "2012-11-11"}|[{"weight": -1.0}]

This is due to ldbc/ldbc_snb_datagen_spark#419 and Q15a not taking this into account.

Also, the temporal personKnowsPersonDays table only contains each edge in one direction:

D select * from t where person1id = 15393162793718 and person2id = 15393162795213;
┌────────────────┬────────────────┬─────────────────────┬─────────────────────┐
│   Person1Id    │   Person2Id    │     creationDay     │     deletionDay     │
├────────────────┼────────────────┼─────────────────────┼─────────────────────┤
│ 15393162793718 │ 15393162795213 │ 2012-02-02 00:00:00 │ 2019-12-30 00:00:00 │
└────────────────┴────────────────┴─────────────────────┴─────────────────────┘

So the paramgen should also insert these edges in the reverse direction, which it currently does not.
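
A minimal DuckDB-style sketch of making the table effectively undirected before selecting path endpoints (table and column names follow the snippet above; the actual paramgen query may differ):

-- expose each temporal knows edge in both directions
CREATE OR REPLACE VIEW personKnowsPersonDaysUndirected AS
    SELECT Person1Id, Person2Id, creationDay, deletionDay FROM personKnowsPersonDays
    UNION ALL
    SELECT Person2Id AS Person1Id, Person1Id AS Person2Id, creationDay, deletionDay FROM personKnowsPersonDays;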

Simplify paramgen queries

Paramgen should not rely on a separate temporal/ directory. Instead, the temporal attributes should be inlined into the factor tables (and used by the paramgen accordingly).

Missing features in Umbra implementation

Working on adding an Umbra implementation. Some lessons learnt:

  • Umbra cannot load from compressed (.csv.gz) files:

    COPY FROM PROGRAM not implemented yet
  • The default Umbra storage engine does not support deletes yet. Use CREATE TABLE ... WITH (storage = paged) for the paged storage engine, which does (see the sketch below).
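
A minimal sketch of a table definition using the paged storage engine (the table and its columns are illustrative, not the actual schema):

-- the paged storage engine supports deletes
CREATE TABLE Person (
    id bigint PRIMARY KEY,
    creationDate timestamp
) WITH (storage = paged);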

Rework environment variable scripts

  • The Neo4j/Cypher scripts should set default variables themselves instead of requiring the user to source them before running the script.
  • Python scripts should read the environment variable for the CSV directory themselves instead of receiving it as a parameter (see the sketch below).
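
A minimal Python sketch of reading the CSV directory from the environment with a fallback default (the variable name NEO4J_CSV_DIR appears in the setup steps elsewhere on this page; the default path is illustrative):

import os

# read the CSV directory from the environment instead of a CLI parameter,
# falling back to a default if the variable is not set
csv_dir = os.environ.get("NEO4J_CSV_DIR", "out-sf1/graphs/csv/bi/composite-projected-fk/")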

Run SF1 tests in CI

We should cross-validate on SF1 in CI even if this means a 40+ minute overhead in build times (as it does for Interactive).

TigerGraph benchmark fails with a "KeyError" exception

Error while running tigergraph/benchmark.py:

Traceback (most recent call last):
  File "benchmark.py", line 35, in <module>
    duration = run_batch_update(batch_date, args)
  File "/mnt/nvme0n1/pgrabusz/ldbc_snb_bi-main/tigergraph/batches.py", line 95, in run_batch_update
    result, duration = run_query(f'del_{vertex}', {'file':str(docker_path/fp.name), 'header':args.header}, args.endpoint)
  File "/mnt/nvme0n1/pgrabusz/ldbc_snb_bi-main/tigergraph/batches.py", line 35, in run_query
    return response['results'][0]['result'], duration
KeyError: 'results'

Setup:
3-node TigerGraph cluster (bare metal, outside Docker)

Steps to reproduce:

<<BASE_PATH>> is a path on the main node (like /home/root/benchmarks)
<<BASE_NODE>> is the main node's IP address

All using root:

=================
--- DATA GEN: ---
=================

Download repo
	wget https://github.com/ldbc/ldbc_snb_datagen_spark/archive/refs/heads/main.zip
		commit hash: c1438ce36d9d7baa070978512965d4e043aaa123
	cd ldbc_snb_datagen_spark-main

tools/build.sh
	mvn version: Apache Maven 3.6.3
	java version: openjdk 11.0.15 2022-04-19
				  OpenJDK Runtime Environment (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1)
				  OpenJDK 64-Bit Server VM (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1, mixed mode, sharing)
	
Install Python tools
	python3 -m virtualenv .venv
		python version: 3.8.10
	. .venv/bin/activate
	pip install -U pip
		pip 22.1 (python 3.8)
	pip install ./tools

If not running for the first time, clean up leftover marker files:
	find $TG_DATA_DIR -name _SUCCESS -delete
	find $TG_DATA_DIR -name '*.crc' -delete

Run data gen
	export SPARK_HOME=<<BASE_PATH>>/spark-3.1.2-bin-hadoop3.2
	export PATH="$SPARK_HOME/bin":"$PATH"
	export PLATFORM_VERSION=2.12_spark3.1
	export DATAGEN_VERSION=0.5.0-SNAPSHOT
	export SF=1

	rm -rf out-sf${SF}/
	tools/run.py \
		--cores $(nproc) \
		--memory 8G \
		./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar \
		-- \
		--format csv \
		--scale-factor ${SF} \
		--explode-edges \
		--mode bi \
		--output-dir out-sf${SF}/ \
		--generate-factors \
		# --format-options compression=gzip
		
generated data:
	generator runs for about 4 min for SF1
	<<BASE_PATH>>/ldbc_snb_datagen_spark-main/out-sf1

====================
--- BI PARAMGEN: ---
====================

repo as in bi load data
venv the same as above

install dependencies
	scripts/install-dependencies.sh does not work,
	installing manually: pip install duckdb==0.3.4 pytz
	
copy data to
	paramgen/factors and paramgen/temporal
	cp -r <<BASE_PATH>>/ldbc_snb_datagen_spark-main/out-sf1/factors/csv/raw/composite-merged-fk/* factors/
	cp -r <<BASE_PATH>>/ldbc_snb_datagen_spark-main/out-sf1/graphs/parquet/raw/composite-merged-fk/dynamic/{Person,Person_knows_Person,Person_studyAt_University,Person_workAt_Company} temporal/

run paramgen
	scripts/paramgen.sh

parameters generated to
	<<BASE_PATH>>/ldbc_snb_bi-main/parameters

=====================
--- BI LOAD DATA: ---
=====================

download repo
	wget https://github.com/ldbc/ldbc_snb_bi/archive/refs/heads/main.zip
		commit hash: 37e3a2ec30dd2e79fb9bbd9bb9a5e80c4ededf59
	cd ldbc_snb_bi-main/tigergraph
	
configure
	export TG_DATA_DIR=<<BASE_PATH>>/ldbc_snb_datagen_spark-main/out-sf1/graphs/csv/bi/composite-projected-fk/
	export TG_HEADER=true
	export SF=1
	export TG_VERSION=latest
	export TG_DDL_DIR=<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/ddl
	export TG_DML_DIR=<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/dml

	sed "s;header=\"false\";header=\"$TG_HEADER\";" $TG_DDL_DIR/load_static.gsql > $TG_DDL_DIR/load.gsql
	sed "s;header=\"false\";header=\"$TG_HEADER\";" $TG_DDL_DIR/load_dynamic.gsql >> $TG_DDL_DIR/load.gsql
	sed "s;header=\"false\";header=\"$TG_HEADER\";" $TG_DML_DIR/ins_Vertex.gsql >> $TG_DDL_DIR/load.gsql
	sed "s;header=\"false\";header=\"$TG_HEADER\";" $TG_DML_DIR/ins_Edge.gsql >> $TG_DDL_DIR/load.gsql
	sed "s;header=\"false\";header=\"$TG_HEADER\";" $TG_DML_DIR/del_Edge.gsql >> $TG_DDL_DIR/load.gsql
	
load data
	su tigergraph
	(run .venv)
	export all variables from above (repeat all export commands)
	
	ddl/setup.sh \
		<<BASE_PATH>>/ldbc_snb_datagen_spark-main/out-sf1/graphs/csv/bi/composite-projected-fk \
		<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/queries \
		<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/dml \

=========================
--- RUN BI BENCHMARK: ---
=========================

run in tigergraph path as tigergraph user:
	su tigergraph
	(run .venv)
	export all variables from above (repeat all export commands)

	export TG_PARAMETER=<<BASE_PATH>>/ldbc_snb_bi-main/parameters
	export TG_ENDPOINT=http://<<BASE_NODE>>:9000
	
	scripts/benchmark.sh
		
	error:
	
Traceback (most recent call last):
  File "benchmark.py", line 35, in <module>
    duration = run_batch_update(batch_date, args)
  File "<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/batches.py", line 95, in run_batch_update
    result, duration = run_query(f'del_{vertex}', {'file':str(docker_path/fp.name), 'header':args.header}, args.endpoint)
  File "<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/batches.py", line 35, in run_query
    return response['results'][0]['result'], duration
KeyError: 'results'

After a little investigation:

  1. The scripts ignore some variables set in the environment and/or passed to scripts/benchmark.sh (vars.sh overwrites them).
  2. The Python scripts also lose the path when passing it on to methods.
  3. The run_query method (batches.py) fails: the request to http://100.67.80.11:9000/query/ldbc_snb/del_Comment returns {'version': {'edition': 'enterprise', 'api': 'v2', 'schema': 0}, 'error': True, 'message': "Runtime Error: File '/data/deletes/dynamic/Comment/batch_id=2012-11-29/part-00000-763d926e-20e7-4961-b04c-f510e70e9e80.c000.csv' does not exist."}, so there is no 'results' key. As the message points out, there is no such file (probably connected to the path issue described above). A defensive check is sketched below.
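
A minimal sketch of how run_query could surface the server-side error instead of raising a bare KeyError (the response structure is inferred from the message above; the actual batches.py code may differ):

# inside run_query, after the HTTP response has been parsed into the `response` dict
if response.get('error') or 'results' not in response:
    # report the server-side message instead of failing with KeyError: 'results'
    raise RuntimeError(f"Query failed: {response.get('message', response)}")
return response['results'][0]['result'], duration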

Rework waiting scripts

Use more elegant waiting scripts, e.g.:

echo -n "Waiting for the database to start ."
until python3 scripts/test-db-connection.py > /dev/null 2>&1; do
    echo -n " ."
    sleep 1
done
echo
echo "Database started"

Cleanup Python scripts

The queries.py and batches.py scripts are redundant: their tasks can be performed by the benchmark.py scripts, which only need to be extended with a few arguments for this (see the sketch below).
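
A minimal argparse sketch of the kind of flags benchmark.py could gain to subsume the other scripts (the flag names are illustrative, not the actual interface):

import argparse

parser = argparse.ArgumentParser(description='LDBC SNB BI benchmark driver')
# illustrative flags covering the queries.py / batches.py use cases
parser.add_argument('--queries-only', action='store_true', help='run the queries without applying batch updates')
parser.add_argument('--batches-only', action='store_true', help='apply the batch updates without running the queries')
args = parser.parse_args()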

PostgreSQL BI queries

Run tuning on SF 3+

Setup script for Azure (machine type: E8ds_v4):

#!/bin/bash

export DEBIAN_FRONTEND=noninteractive

sudo apt update
sudo apt install -y tmux htop git wget curl zstd docker.io mc vim silversearcher-ag nmon openjdk-11-jdk maven python3 python3-pip 
sudo gpasswd -a ${USER} docker

curl https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz | sudo tar -xz -C /opt/
echo "export SPARK_HOME=/opt/spark-3.1.2-bin-hadoop3.2" >> ~/.bashrc
echo "export PATH=\$SPARK_HOME/bin:$PATH" >> ~/.bashrc
echo "export PLATFORM_VERSION=2.12_spark3.1" >> ~/.bashrc
echo "export DATAGEN_VERSION=0.5.0-SNAPSHOT" >> ~/.bashrc

sudo chown -R ${USER}:${USER} /mnt
cd /mnt

git clone https://github.com/ldbc/ldbc_snb_bi/
cd ldbc_snb_bi
cypher/scripts/install-dependencies.sh
paramgen/scripts/install-dependencies.sh
cd ..

git clone https://github.com/ldbc/ldbc_snb_datagen_spark
cd ldbc_snb_datagen_spark
echo "export LDBC_SNB_DATAGEN_DIR=`pwd`" >> ~/.bashrc
tools/build.sh
cd ..

echo "export NEO4J_VERSION=4.4.2-enterprise" >> ~/.bashrc
echo "export NEO4J_ENV_VARS=--env=NEO4J_ACCEPT_LICENSE_AGREEMENT=yes" >> ~/.bashrc

Reboot for Docker to work.

sudo reboot

Generate data, factors, and load to Neo4j.

cd /mnt

cd ldbc_snb_datagen_spark
export SF=3
tools/run.py --cores 8 --memory 50G target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor ${SF} --explode-edges --mode bi --output-dir out-sf${SF}/ --generate-factors --format-options header=false,quoteAll=true
cd ..

cd ldbc_snb_bi

# paramgen
cd paramgen
cp -r ${LDBC_SNB_DATAGEN_DIR}/out-sf${SF}/factors/csv/raw/composite-merged-fk/* factors/
scripts/paramgen.sh
cd ..

# cypher
cd cypher
export NEO4J_CSV_DIR=${LDBC_SNB_DATAGEN_DIR}/out-sf${SF}/graphs/csv/bi/composite-projected-fk/
scripts/load-in-one-step.sh
cd ..

# tuning
cd tuning
# do the tuning ...

Precompute edge weights in Neo4j

Queries 19 and 20 lend themselves to precomputed edge weights. Moreover, precomputation is feasible in this benchmark because updates are batch-based (they run separately from the queries and are relatively rare). Such precomputation would greatly speed up query execution; a sketch is below.
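
A minimal Cypher sketch of storing a weight property on KNOWS edges (labels and relationship types follow the usual SNB Cypher schema but are assumptions here; only one reply direction is counted, and the actual Q19/Q20 weight definitions are given in the specification). The statement would have to be re-run after each batch update:

// count direct replies by p1 to messages of p2 and store them on the KNOWS edge
MATCH (p1:Person)-[k:KNOWS]->(p2:Person)
OPTIONAL MATCH (p1)<-[:HAS_CREATOR]-(c:Comment)-[:REPLY_OF]->(:Message)-[:HAS_CREATOR]->(p2)
WITH k, count(c) AS replies
SET k.weight = replies;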

Add support for validation

The script should have create-validation-parameters and validate modes like the Interactive driver.
Or just have create-validation-parameters, and UNIX diff should handle the rest (see the example below).
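
For the diff-based approach, a minimal example (file names are illustrative):

# compare a fresh run against the previously created validation results
diff output/results.csv validation/expected-results.csv && echo "validation passed"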

Cypher CI test data set

The Cypher implementation should use quoted CSVs, as the loader in Neo4j 4.x right-trims CSV fields (i.e. it strips trailing spaces). This may cause differences from the Postgres implementation.

Paramgen: Selecting values around a given percentile

For many paramgen queries, we need to select a range in a distribution. Example SQL queries to select a range around a given percentile:

-- toy data
create table t(x int);
insert into t values (10), (20), (30), (20), (40), (50), (30), (20), (60), (70), (80), (90);

-- variant 1: the 5 values closest (by absolute difference) to the 58th percentile
SELECT t.x, abs(t.x - (select percentile_disc(0.58) within group (order by t.x) from t)) AS diff
FROM t
ORDER BY diff
LIMIT 5;

-- variant 2: the values whose rowid lies within ±2 of a row holding the percentile value
select t.x
from t, (select rowid from t where x = (select percentile_disc(0.58) within group (order by t.x) from t)) p
where p.rowid - 3 < t.rowid
  and p.rowid + 3 > t.rowid;

Use materialized views in Umbra

Umbra would produce much better query plans if we used materialized views (with PK information) instead of simple views.
However, these would need to be maintained upon refresh operations. A sketch is below.
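
A PostgreSQL-style sketch of materializing a view as a table with a primary key (Umbra's exact DDL may differ; the view and column names are illustrative):

-- materialize the Message view (union of Comment and Post) as a table with a PK
CREATE TABLE Message_materialized (
    id bigint PRIMARY KEY,
    creationDate timestamp,
    CreatorPersonId bigint
);
INSERT INTO Message_materialized
    SELECT id, creationDate, CreatorPersonId FROM Comment
    UNION ALL
    SELECT id, creationDate, CreatorPersonId FROM Post;
-- must be rebuilt or incrementally maintained after every batch update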

Umbra optimizations

  • Turn off exhaustive PK indexing on edges as it consumes too much memory.
  • Check Q11's query plan to see whether it uses multi-way worst-case optimal (WCO) joins.
  • Turn on creationDate indexing as it may help performance.

Revamp SQL queries

The id field should be more specific, e.g. PersonId, MessageId, etc.

The return values should follow the spec (to be updated soon), which specifies names such as PersonId and Person1Id.

This also allows using NATURAL JOIN for some queries (but it is debatable whether using NATURAL JOIN is good practice, so we will not use it for now). A sketch of the renaming is below.
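
A minimal sketch (table and column names are illustrative):

-- before: generic ids force explicit join conditions
SELECT p.id, m.id
FROM Person p JOIN Message m ON m.CreatorPersonId = p.id;

-- after: specific names are self-documenting and, where keys align, enable NATURAL JOIN
SELECT PersonId, MessageId
FROM Person NATURAL JOIN Message_hasCreator_Person;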

Speed up Q20 paramgen query

I attempted to speed up the Q20 paramgen query in 76f41c5.
However, this resulted in excessive memory consumption.

A good trick would be to add temporal attributes to the sameUniversityKnows factor table in the datagen, i.e. its schema should be:

person1id, person2id, creationDate, deletionDate

where

creationDate = max(knows.creationDate, studyAt1.creationDate, studyAt2.creationDate)

and

deletionDate = min(knows.deletionDate, studyAt1.deletionDate, studyAt2.deletionDate)

(the derived edge only exists while the knows edge and both studyAt edges exist, so its deletion date is the earliest of the three).
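
A rough DuckDB-style sketch of deriving such a table from the raw datagen output (table and column names are assumptions, not the actual factor generator code):

CREATE TABLE sameUniversityKnows AS
SELECT k.Person1Id AS person1id,
       k.Person2Id AS person2id,
       greatest(k.creationDate, s1.creationDate, s2.creationDate) AS creationDate,
       least(k.deletionDate, s1.deletionDate, s2.deletionDate) AS deletionDate
FROM Person_knows_Person k
JOIN Person_studyAt_University s1 ON s1.PersonId = k.Person1Id
JOIN Person_studyAt_University s2 ON s2.PersonId = k.Person2Id
                                 AND s2.UniversityId = s1.UniversityId;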

Neo4j error for query 4

Query 4 results in the following error message using Neo4j Enterprise Edition's pipelined and/or parallel runtimes. The exact runtime is unclear as I have trouble reproducing the bug.

The query is prefixed with a runtime selector:

CYPHER runtime=parallel
MATCH ...

and fails with:

Neo.DatabaseError.Statement.ExecutionFailed
Tried overwriting already taken variable name 'topForum1' as LongSlot(2,false,Node) (was: RefSlot(1,true,Any))

To test with different versions, navigate to the cypher directory and run the following commands:

export SF=1
. scripts/use-datagen-data-set.sh
export NEO4J_VERSION=4.4.3-enterprise
scripts/load-in-one-step.sh
scripts/queries.sh

SF1 is required to trigger the error.

Using the most recent version, 4.4.11-enterprise, makes the error go away.

A guide for adding new implementations?

I see that the existing implementations (Cypher, Postgres) follow a common folder/file structure.

Could you share a guide (or even simple pointers) on how to propose a new implementation? Which folder/file is responsible for what, what is the execution order, etc.?

Having this kind of instruction could attract more graph databases and make testing/comparing them easier.

Use compressed CSV files

Both Neo4j and Postgres are capable of loading from .csv.gz files (Neo4j can do this by default, while Postgres supports COPY ... FROM PROGRAM 'zcat ...'). There should be a flag which enables this (see the example below).
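
A minimal PostgreSQL example (table name, path, and CSV options are illustrative; FROM PROGRAM requires superuser or pg_execute_server_program privileges):

-- load a gzipped CSV by piping it through zcat on the server side
COPY Person FROM PROGRAM 'zcat /data/dynamic/Person/part-00000.csv.gz' WITH (FORMAT csv, DELIMITER '|', HEADER);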

Umbra container does not work on macOS

The current (cd95046) Umbra container crashes under macOS, both on Intel and M1 Macs. On the latter, it produces a segmentation fault through QEMU:

. scripts/use-sample-data-set.sh
✗ scripts/load-in-one-step.sh
Cleaning up running Umbra containers . . . Cleanup completed.
Starting container . . . Container started.
Creating database . . . Database created.
Starting the database . qemu: uncaught target signal 11 (Segmentation fault) - core dumped

Q4 Cypher implementation is slow

It would be worth optimizing it or splitting it into three subqueries (find and mark the top 100 forums, run the main query, clean up), as sketched below.
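
A minimal Cypher sketch of the mark/clean-up mechanism (labels, relationship types, and the ranking criterion are assumptions; the actual criteria are defined by Q4):

// step 1: mark the top 100 forums with a temporary label
MATCH (f:Forum)-[:HAS_MEMBER]->(:Person)
WITH f, count(*) AS members
ORDER BY members DESC
LIMIT 100
SET f:TopForum;

// step 2: run the main Q4 query restricted to (:TopForum) nodes

// step 3: remove the temporary label
MATCH (f:TopForum)
REMOVE f:TopForum;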

Revise gitignores

Benchmark results are currently gitignored, which makes them susceptible to getting lost (cleaned up or never committed).

Add BI Q20a/b variants

Variants:

  • (a) guaranteed that no path exists
  • (b) guaranteed that there is a path

We already support variant (b). Variant (a) is tricky: while randomly selected endpoints are very likely to have no path between them, this is not guaranteed. To guarantee it, we would have to actually run the query and check the result. This is of course a challenge on large-scale data sets.

Use timeout command to kill measurements

A measurement run should be limited to ~1h 15min and then be killed by the timeout command. This ensures that, given a throughput batch size of 15 minutes, we have all batches up to the 1-hour mark (see the example below).
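
A minimal example (the script name follows the benchmark scripts used elsewhere on this page):

# kill the benchmark run 75 minutes after it started
timeout 75m scripts/benchmark.sh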

Remove redundant entries from path queries

The PostgreSQL path queries use ARRAY[person1Id, person2Id]::bigint[] in their starting subquery. I cannot recall why -- maybe it is there to avoid duplicates (?) -- but it seems incorrect, so it should probably simply be ARRAY[person1Id]::bigint[].
