ldbc_snb_bi's People

Contributors

agubichev, alexaverbuch, aneeshdurg, antaljanosbenjamin, arnauprat, benhamad, dependabot[bot], filipecosta90, hbirler, jackwaudby, jmarton, marcelned, marci543, mdrz7, mgree, mirkospasic, mkaufmann, oerling, petere, pstutz, szarnyasg, thsoft, trumanwangtg, yipingxiongtg, yuchenzhangtg

ldbc_snb_bi's Issues

Q15 paramgen: person pairs without paths are returned

Q15a paramgen sometimes generates pairs that do not have a path between them, e.g.:

$ grep '|15a' output-sf1/results.csv | grep ' -1.0'
15|15a|{"person1Id": "28587302324732", "person2Id": "10995116284647", "startDate": "2012-11-02", "endDate": "2012-11-11"}|[{"weight": -1.0}]
15|15a|{"person1Id": "28587302324732", "person2Id": "10995116284647", "startDate": "2012-11-02", "endDate": "2012-11-11"}|[{"weight": -1.0}]
15|15a|{"person1Id": "28587302324732", "person2Id": "10995116284647", "startDate": "2012-11-02", "endDate": "2012-11-11"}|[{"weight": -1.0}]
15|15a|{"person1Id": "28587302324732", "person2Id": "10995116284647", "startDate": "2012-11-02", "endDate": "2012-11-11"}|[{"weight": -1.0}]
15|15a|{"person1Id": "28587302324732", "person2Id": "10995116284647", "startDate": "2012-11-02", "endDate": "2012-11-11"}|[{"weight": -1.0}]

This is due to ldbc/ldbc_snb_datagen_spark#419 and Q15a not taking this into account.

Also, the temporal personKnowsPersonDays table only contains each edge in one direction:

D select * from t where person1id = 15393162793718 and person2id = 15393162795213;
┌────────────────┬────────────────┬─────────────────────┬─────────────────────┐
│   Person1Id    │   Person2Id    │     creationDay     │     deletionDay     │
├────────────────┼────────────────┼─────────────────────┼─────────────────────┤
│ 15393162793718 │ 15393162795213 │ 2012-02-02 00:00:00 │ 2019-12-30 00:00:00 │
└────────────────┴────────────────┴─────────────────────┴─────────────────────┘

So the paramgen should also insert these edges in the reverse direction, which it currently does not.
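
A minimal DuckDB-style sketch of making the table effectively undirected before selecting path endpoints (table and column names follow the snippet above; the actual paramgen query may differ):

-- expose each temporal knows edge in both directions
CREATE OR REPLACE VIEW personKnowsPersonDaysUndirected AS
    SELECT Person1Id, Person2Id, creationDay, deletionDay FROM personKnowsPersonDays
    UNION ALL
    SELECT Person2Id AS Person1Id, Person1Id AS Person2Id, creationDay, deletionDay FROM personKnowsPersonDays;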

Simplify paramgen queries

Paramgen should not rely on a separate temporal/ directory. Instead, the temporal attributes should be inlined into the factor tables (and used by the paramgen accordingly).

Missing features in Umbra implementation

Working on adding an Umbra implementation. Some lessons learnt:

  • Umbra cannot load from compressed (.csv.gz) files:

    COPY FROM PROGRAM not implemented yet
  • The default Umbra storage engine does not support deletes yet. Use CREATE TABLE ... WITH (storage = paged) for the paged storage engine, which does (see the sketch below).
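
A minimal sketch of a table definition using the paged storage engine (the table and its columns are illustrative, not the actual schema):

-- the paged storage engine supports deletes
CREATE TABLE Person (
    id bigint PRIMARY KEY,
    creationDate timestamp
) WITH (storage = paged);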

Rework environment variable scripts

  • The Neo4j/Cypher scripts should set default variables themselves instead of requiring the user to source them before running the script.
  • Python scripts should read the environment variable for the CSV directory themselves instead of receiving it as a parameter (see the sketch below).
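
A minimal Python sketch of reading the CSV directory from the environment with a fallback default (the variable name NEO4J_CSV_DIR appears in the setup steps elsewhere on this page; the default path is illustrative):

import os

# read the CSV directory from the environment instead of a CLI parameter,
# falling back to a default if the variable is not set
csv_dir = os.environ.get("NEO4J_CSV_DIR", "out-sf1/graphs/csv/bi/composite-projected-fk/")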

Run SF1 tests in CI

We should cross-validate on SF1 in CI even if this means a 40+ minute overhead in build times (as it does for Interactive).

TigerGraph benchmark fails with a "KeyError" exception

Error while running tigergraph/benchmark.py:

Traceback (most recent call last):
  File "benchmark.py", line 35, in <module>
    duration = run_batch_update(batch_date, args)
  File "/mnt/nvme0n1/pgrabusz/ldbc_snb_bi-main/tigergraph/batches.py", line 95, in run_batch_update
    result, duration = run_query(f'del_{vertex}', {'file':str(docker_path/fp.name), 'header':args.header}, args.endpoint)
  File "/mnt/nvme0n1/pgrabusz/ldbc_snb_bi-main/tigergraph/batches.py", line 35, in run_query
    return response['results'][0]['result'], duration
KeyError: 'results'

Setup:
3-node TigerGraph cluster (bare metal, outside Docker)

Steps to reproduce:

<<BASE_PATH>> is a path on the main node (like /home/root/benchmarks)
<<BASE_NODE>> is the main node's IP address

All using root:

=================
--- DATA GEN: ---
=================

Download repo
	wget https://github.com/ldbc/ldbc_snb_datagen_spark/archive/refs/heads/main.zip
		commit hash: c1438ce36d9d7baa070978512965d4e043aaa123
	cd ldbc_snb_datagen_spark-main

tools/build.sh
	mvn version: Apache Maven 3.6.3
	java version: openjdk 11.0.15 2022-04-19
				  OpenJDK Runtime Environment (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1)
				  OpenJDK 64-Bit Server VM (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1, mixed mode, sharing)
	
Install Python tools
	python3 -m virtualenv .venv
		python version: 3.8.10
	. .venv/bin/activate
	pip install -U pip
		pip 22.1 (python 3.8)
	pip install ./tools

If not running for the first time, clean up leftover marker files:
	find $TG_DATA_DIR -name _SUCCESS -delete
	find $TG_DATA_DIR -name '*.crc' -delete

Run data gen
	export SPARK_HOME=<<BASE_PATH>>/spark-3.1.2-bin-hadoop3.2
	export PATH="$SPARK_HOME/bin":"$PATH"
	export PLATFORM_VERSION=2.12_spark3.1
	export DATAGEN_VERSION=0.5.0-SNAPSHOT
	export SF=1

	rm -rf out-sf${SF}/
	tools/run.py \
		--cores $(nproc) \
		--memory 8G \
		./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar \
		-- \
		--format csv \
		--scale-factor ${SF} \
		--explode-edges \
		--mode bi \
		--output-dir out-sf${SF}/ \
		--generate-factors \
		# --format-options compression=gzip
		
generated data:
	generator runs for about 4 min for SF1
	<<BASE_PATH>>/ldbc_snb_datagen_spark-main/out-sf1

====================
--- BI PARAMGEN: ---
====================

repo as in bi load data
venv the same as above

install dependencies
	scripts/install-dependencies.sh does not work,
	installing manually: pip install duckdb==0.3.4 pytz
	
copy data to
	paramgen/factors and paramgen/temporal
	cp -r <<BASE_PATH>>/ldbc_snb_datagen_spark-main/out-sf1/factors/csv/raw/composite-merged-fk/* factors/
	cp -r <<BASE_PATH>>/ldbc_snb_datagen_spark-main/out-sf1/graphs/parquet/raw/composite-merged-fk/dynamic/{Person,Person_knows_Person,Person_studyAt_University,Person_workAt_Company} temporal/

run paramgen
	scripts/paramgen.sh

parameters generated to
	<<BASE_PATH>>/ldbc_snb_bi-main/parameters

=====================
--- BI LOAD DATA: ---
=====================

download repo
	wget https://github.com/ldbc/ldbc_snb_bi/archive/refs/heads/main.zip
		commit hash: 37e3a2ec30dd2e79fb9bbd9bb9a5e80c4ededf59
	cd ldbc_snb_bi-main/tigergraph
	
configure
	export TG_DATA_DIR=<<BASE_PATH>>/ldbc_snb_datagen_spark-main/out-sf1/graphs/csv/bi/composite-projected-fk/
	export TG_HEADER=true
	export SF=1
	export TG_VERSION=latest
	export TG_DDL_DIR=<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/ddl
	export TG_DML_DIR=<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/dml

	sed "s;header=\"false\";header=\"$TG_HEADER\";" $TG_DDL_DIR/load_static.gsql > $TG_DDL_DIR/load.gsql
	sed "s;header=\"false\";header=\"$TG_HEADER\";" $TG_DDL_DIR/load_dynamic.gsql >> $TG_DDL_DIR/load.gsql
	sed "s;header=\"false\";header=\"$TG_HEADER\";" $TG_DML_DIR/ins_Vertex.gsql >> $TG_DDL_DIR/load.gsql
	sed "s;header=\"false\";header=\"$TG_HEADER\";" $TG_DML_DIR/ins_Edge.gsql >> $TG_DDL_DIR/load.gsql
	sed "s;header=\"false\";header=\"$TG_HEADER\";" $TG_DML_DIR/del_Edge.gsql >> $TG_DDL_DIR/load.gsql
	
load data
	su tigergraph
	(run .venv)
	export all variables from above (repeat all export commands)
	
	ddl/setup.sh \
		<<BASE_PATH>>/ldbc_snb_datagen_spark-main/out-sf1/graphs/csv/bi/composite-projected-fk \
		<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/queries \
		<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/dml \

=========================
--- RUN BI BENCHMARK: ---
=========================

run in tigergraph path as tigergraph user:
	su tigergraph
	(run .venv)
	export all variables from above (repeat all export commands)

	export TG_PARAMETER=<<BASE_PATH>>/ldbc_snb_bi-main/parameters
	export TG_ENDPOINT=http://<<BASE_NODE>>:9000
	
	scripts/benchmark.sh
		
	error:
	
Traceback (most recent call last):
  File "benchmark.py", line 35, in <module>
    duration = run_batch_update(batch_date, args)
  File "<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/batches.py", line 95, in run_batch_update
    result, duration = run_query(f'del_{vertex}', {'file':str(docker_path/fp.name), 'header':args.header}, args.endpoint)
  File "<<BASE_PATH>>/ldbc_snb_bi-main/tigergraph/batches.py", line 35, in run_query
    return response['results'][0]['result'], duration
KeyError: 'results'

After a little investigation:

  1. The scripts ignore some variables set in the environment and/or passed to scripts/benchmark.sh (vars.sh overwrites them).
  2. The Python scripts also lose the path when passing it on to methods.
  3. The run_query method (batches.py) fails: the request to http://100.67.80.11:9000/query/ldbc_snb/del_Comment returns {'version': {'edition': 'enterprise', 'api': 'v2', 'schema': 0}, 'error': True, 'message': "Runtime Error: File '/data/deletes/dynamic/Comment/batch_id=2012-11-29/part-00000-763d926e-20e7-4961-b04c-f510e70e9e80.c000.csv' does not exist."}, so there is no 'results' key. As the message points out, there is no such file (probably connected to the path issue described above). A defensive check is sketched below.
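
A minimal sketch of how run_query could surface the server-side error instead of raising a bare KeyError (the response structure is inferred from the message above; the actual batches.py code may differ):

# inside run_query, after the HTTP response has been parsed into the `response` dict
if response.get('error') or 'results' not in response:
    # report the server-side message instead of failing with KeyError: 'results'
    raise RuntimeError(f"Query failed: {response.get('message', response)}")
return response['results'][0]['result'], duration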

Rework waiting scripts

Use more elegant waiting scripts, e.g.:

echo -n "Waiting for the database to start ."
until python3 scripts/test-db-connection.py > /dev/null 2>&1; do
    echo -n " ."
    sleep 1
done
echo
echo "Database started"

Cleanup Python scripts

The queries.py and batches.py scripts are redundant: their tasks can be performed by the benchmark.py scripts, which only need to be extended with a few arguments for this (see the sketch below).
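
A minimal argparse sketch of the kind of flags benchmark.py could gain to subsume the other scripts (the flag names are illustrative, not the actual interface):

import argparse

parser = argparse.ArgumentParser(description='LDBC SNB BI benchmark driver')
# illustrative flags covering the queries.py / batches.py use cases
parser.add_argument('--queries-only', action='store_true', help='run the queries without applying batch updates')
parser.add_argument('--batches-only', action='store_true', help='apply the batch updates without running the queries')
args = parser.parse_args()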

PostgreSQL BI queries

Run tuning on SF 3+

Setup script for Azure (machine type: E8ds_v4):

#!/bin/bash

export DEBIAN_FRONTEND=noninteractive

sudo apt update
sudo apt install -y tmux htop git wget curl zstd docker.io mc vim silversearcher-ag nmon openjdk-11-jdk maven python3 python3-pip 
sudo gpasswd -a ${USER} docker

curl https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz | sudo tar -xz -C /opt/
echo "export SPARK_HOME=/opt/spark-3.1.2-bin-hadoop3.2" >> ~/.bashrc
echo "export PATH=\$SPARK_HOME/bin:$PATH" >> ~/.bashrc
echo "export PLATFORM_VERSION=2.12_spark3.1" >> ~/.bashrc
echo "export DATAGEN_VERSION=0.5.0-SNAPSHOT" >> ~/.bashrc

sudo chown -R ${USER}:${USER} /mnt
cd /mnt

git clone https://github.com/ldbc/ldbc_snb_bi/
cd ldbc_snb_bi
cypher/scripts/install-dependencies.sh
paramgen/scripts/install-dependencies.sh
cd ..

git clone https://github.com/ldbc/ldbc_snb_datagen_spark
cd ldbc_snb_datagen_spark
echo "export LDBC_SNB_DATAGEN_DIR=`pwd`" >> ~/.bashrc
tools/build.sh
cd ..

echo "export NEO4J_VERSION=4.4.2-enterprise" >> ~/.bashrc
echo "export NEO4J_ENV_VARS=--env=NEO4J_ACCEPT_LICENSE_AGREEMENT=yes" >> ~/.bashrc

Reboot for Docker to work.

sudo reboot

Generate data, factors, and load to Neo4j.

cd /mnt

cd ldbc_snb_datagen_spark
export SF=3
tools/run.py --cores 8 --memory 50G target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor ${SF} --explode-edges --mode bi --output-dir out-sf${SF}/ --generate-factors --format-options header=false,quoteAll=true
cd ..

cd ldbc_snb_bi

# paramgen
cd paramgen
cp -r ${LDBC_SNB_DATAGEN_DIR}/out-sf${SF}/factors/csv/raw/composite-merged-fk/* factors/
scripts/paramgen.sh
cd ..

# cypher
cd cypher
export NEO4J_CSV_DIR=${LDBC_SNB_DATAGEN_DIR}/out-sf${SF}/graphs/csv/bi/composite-projected-fk/
scripts/load-in-one-step.sh
cd ..

# tuning
cd tuning
# do the tuning ...

Precompute edge weights in Neo4j

Queries 19 and 20 lend themselves to precomputed edge weights. Moreover, precomputation is feasible in this benchmark because updates are batch-based (they run separately from the queries and are relatively rare). Such precomputation would greatly speed up query execution; a sketch is below.
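
A minimal Cypher sketch of storing a weight property on KNOWS edges (labels and relationship types follow the usual SNB Cypher schema but are assumptions here; only one reply direction is counted, and the actual Q19/Q20 weight definitions are given in the specification). The statement would have to be re-run after each batch update:

// count direct replies by p1 to messages of p2 and store them on the KNOWS edge
MATCH (p1:Person)-[k:KNOWS]->(p2:Person)
OPTIONAL MATCH (p1)<-[:HAS_CREATOR]-(c:Comment)-[:REPLY_OF]->(:Message)-[:HAS_CREATOR]->(p2)
WITH k, count(c) AS replies
SET k.weight = replies;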

Add support for validation

The script should have create-validation-parameters and validate modes like the Interactive driver.
Or just have create-validation-parameters, and UNIX diff should handle the rest (see the example below).
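
For the diff-based approach, a minimal example (file names are illustrative):

# compare a fresh run against the previously created validation results
diff output/results.csv validation/expected-results.csv && echo "validation passed"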

Cypher CI test data set

The Cypher implementation should use quoted CSVs, as the loader in Neo4j 4.x right-trims CSV fields (i.e. it strips trailing spaces). This may cause differences from the Postgres implementation.

Paramgen: Selecting values around a given percentile

For many paramgen queries, we need to select a range in a distribution. Example SQL queries to select a range around a given percentile:

-- toy data
create table t(x int);
insert into t values (10), (20), (30), (20), (40), (50), (30), (20), (60), (70), (80), (90);

-- variant 1: the 5 values closest (by absolute difference) to the 58th percentile
SELECT t.x, abs(t.x - (select percentile_disc(0.58) within group (order by t.x) from t)) AS diff
FROM t
ORDER BY diff
LIMIT 5;

-- variant 2: the values whose rowid lies within ±2 of a row holding the percentile value
select t.x
from t, (select rowid from t where x = (select percentile_disc(0.58) within group (order by t.x) from t)) p
where p.rowid - 3 < t.rowid
  and p.rowid + 3 > t.rowid;

Use materialized views in Umbra

Umbra would produce much better query plans if we used materialized views (with PK information) instead of simple views.
However, these would need to be maintained upon refresh operations. A sketch is below.
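
A PostgreSQL-style sketch of materializing a view as a table with a primary key (Umbra's exact DDL may differ; the view and column names are illustrative):

-- materialize the Message view (union of Comment and Post) as a table with a PK
CREATE TABLE Message_materialized (
    id bigint PRIMARY KEY,
    creationDate timestamp,
    CreatorPersonId bigint
);
INSERT INTO Message_materialized
    SELECT id, creationDate, CreatorPersonId FROM Comment
    UNION ALL
    SELECT id, creationDate, CreatorPersonId FROM Post;
-- must be rebuilt or incrementally maintained after every batch update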

Umbra optimizations

  • Turn off exhaustive PK indexing on edges as it consumes too much memory.
  • Check Q11's query plan to see whether it uses multi-way worst-case optimal (WCO) joins.
  • Turn on creationDate indexing as it may help performance.

Revamp SQL queries

The id field should be more specific, e.g. PersonId, MessageId, etc.

The return values should follow the spec (to be updated soon), which specifies names such as PersonId and Person1Id.

This also allows using NATURAL JOIN for some queries (but it is debatable whether using NATURAL JOIN is good practice, so we will not use it for now). A sketch of the renaming is below.
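
A minimal sketch (table and column names are illustrative):

-- before: generic ids force explicit join conditions
SELECT p.id, m.id
FROM Person p JOIN Message m ON m.CreatorPersonId = p.id;

-- after: specific names are self-documenting and, where keys align, enable NATURAL JOIN
SELECT PersonId, MessageId
FROM Person NATURAL JOIN Message_hasCreator_Person;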

Speed up Q20 paramgen query

I attempted to speed up the Q20 paramgen query in 76f41c5.
However, this resulted in excessive memory consumption.

A good trick would be to add temporal attributes to the sameUniversityKnows factor table in the datagen, i.e. its schema should be:

person1id, person2id, creationDate, deletionDate

where

creationDate = max(knows.creationDate, studyAt1.creationDate, studyAt2.creationDate)

and

deletionDate = min(knows.deletionDate, studyAt1.deletionDate, studyAt2.deletionDate)

(the derived edge only exists while the knows edge and both studyAt edges exist, so its deletion date is the earliest of the three).
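
A rough DuckDB-style sketch of deriving such a table from the raw datagen output (table and column names are assumptions, not the actual factor generator code):

CREATE TABLE sameUniversityKnows AS
SELECT k.Person1Id AS person1id,
       k.Person2Id AS person2id,
       greatest(k.creationDate, s1.creationDate, s2.creationDate) AS creationDate,
       least(k.deletionDate, s1.deletionDate, s2.deletionDate) AS deletionDate
FROM Person_knows_Person k
JOIN Person_studyAt_University s1 ON s1.PersonId = k.Person1Id
JOIN Person_studyAt_University s2 ON s2.PersonId = k.Person2Id
                                 AND s2.UniversityId = s1.UniversityId;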

Neo4j error for query 4

Query 4 results in the following error message using Neo4j Enterprise Edition's pipelined and/or parallel runtimes. The exact runtime is unclear as I have trouble reproducing the bug.

The query is prefixed with a runtime selector:

CYPHER runtime=parallel
MATCH ...

and fails with:

Neo.DatabaseError.Statement.ExecutionFailed
Tried overwriting already taken variable name 'topForum1' as LongSlot(2,false,Node) (was: RefSlot(1,true,Any))

To test with different versions, navigate to the cypher directory and run the following commands:

export SF=1
. scripts/use-datagen-data-set.sh
export NEO4J_VERSION=4.4.3-enterprise
scripts/load-in-one-step.sh
scripts/queries.sh

SF1 is required to trigger the error.

Using the most recent version, 4.4.11-enterprise, makes the error go away.

A guide for adding new implementations?

I see that the existing implementations (Cypher, Postgres) follow a common folder/file structure.

Could you share a guide (or even simple pointers) on how to propose a new implementation? Which folder/file is responsible for what, what is the execution order, etc.?

Having this kind of instruction could attract more graph databases and make testing/comparing them easier.

Use compressed CSV files

Both Neo4j and Postgres are capable of loading from .csv.gz files (Neo4j can do this by default, while Postgres supports COPY ... FROM PROGRAM 'zcat ...'). There should be a flag which enables this (see the example below).
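
A minimal PostgreSQL example (table name, path, and CSV options are illustrative; FROM PROGRAM requires superuser or pg_execute_server_program privileges):

-- load a gzipped CSV by piping it through zcat on the server side
COPY Person FROM PROGRAM 'zcat /data/dynamic/Person/part-00000.csv.gz' WITH (FORMAT csv, DELIMITER '|', HEADER);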

Umbra container does not work on macOS

The current (cd95046) Umbra container crashes under macOS, both on Intel and M1 Macs. On the latter, it produces a segmentation fault through QEMU:

. scripts/use-sample-data-set.sh
✗ scripts/load-in-one-step.sh
Cleaning up running Umbra containers . . . Cleanup completed.
Starting container . . . Container started.
Creating database . . . Database created.
Starting the database . qemu: uncaught target signal 11 (Segmentation fault) - core dumped

Q4 Cypher implementation is slow

It would be worth optimizing it or splitting it into three subqueries (find and mark the top 100 forums, run the main query, clean up), as sketched below.
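
A minimal Cypher sketch of the mark/clean-up mechanism (labels, relationship types, and the ranking criterion are assumptions; the actual criteria are defined by Q4):

// step 1: mark the top 100 forums with a temporary label
MATCH (f:Forum)-[:HAS_MEMBER]->(:Person)
WITH f, count(*) AS members
ORDER BY members DESC
LIMIT 100
SET f:TopForum;

// step 2: run the main Q4 query restricted to (:TopForum) nodes

// step 3: remove the temporary label
MATCH (f:TopForum)
REMOVE f:TopForum;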

Revise gitignores

Benchmark results are currently gitignored, which makes them susceptible to getting lost (cleaned up or never committed).

Add BI Q20a/b variants

Variants:

  • (a) guaranteed that no path exists
  • (b) guaranteed that there is a path

We already support variant (b). Variant (a) is tricky: while randomly selected endpoints are very likely to have no path between them, this is not guaranteed. To guarantee it, we would have to actually run the query and check the result. This is of course a challenge on large-scale data sets.

Use timeout command to kill measurements

A measurement run should be limited to ~1h 15min and then be killed by the timeout command. This ensures that, given a throughput batch size of 15 minutes, we have all batches up to the 1-hour mark (see the example below).
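
A minimal example (the script name follows the benchmark scripts used elsewhere on this page):

# kill the benchmark run 75 minutes after it started
timeout 75m scripts/benchmark.sh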

Remove redundant entries from path queries

The PostgreSQL path queries use ARRAY[person1Id, person2Id]::bigint[] in their starting subquery. I cannot recall why -- maybe it is there to avoid duplicates (?) -- but it seems incorrect, so it should probably simply be ARRAY[person1Id]::bigint[].
