ldbc / ldbc_snb_datagen_spark Goto Github PK

View Code? Open in Web Editor NEW

166.0 18.0 58.0 291.09 MB

Synthetic graph generator for the LDBC Social Network Benchmark, running on Spark

Home Page: https://ldbcouncil.org/benchmarks/snb

License: Apache License 2.0

Python 3.11% Java 66.93% Shell 1.20% R 2.75% Dockerfile 0.17% Scala 25.85%

snb

ldbc_snb_datagen_spark's Introduction

LDBC SNB Datagen (Spark-based)

The LDBC SNB Data Generator (Datagen) produces the datasets for the LDBC Social Network Benchmark's workloads. The generator is designed to produce directed labelled graphs that mimic the characteristics of those graphs of real data. A detailed description of the schema produced by Datagen, as well as the format of the output files, can be found in the latest version of official LDBC SNB specification document.

📜 If you wish to cite the LDBC SNB, please refer to the documentation repository.

⚠️ There are two different versions of the Datagen:

The Hadoop-based Datagen generates the Interactive workload's SF1-1000 data sets.
For the BI workload, use the Spark-based Datagen (in this repository).
For the Interactive workloads's larger data sets, see the conversion script in the driver repository.

For each commit on the main branch, the CI deploys freshly generated small data sets.

Quick start

Build the JAR

To assemble the JAR file with SBT, run:

sbt assembly

Install Python tools

Some of the build utilities are written in Python. To use them, you have to create a Python virtual environment and install the dependencies.

E.g. with pyenv and pyenv-virtualenv:

pyenv install 3.7.13
pyenv virtualenv 3.7.13 ldbc_datagen_tools
pyenv local ldbc_datagen_tools
pip install -U pip
pip install ./tools

If the environment already exists, activate it with:

pyenv activate

Running locally

The ./tools/run.py script is intended for local runs. To use it, download and extract Spark as follows.

Spark 3.2.x

Spark 3.2.x is the recommended runtime to use. The rest of the instructions are provided assuming Spark 3.2.x.

To place Spark under /opt/:

scripts/get-spark-to-opt.sh
export SPARK_HOME="/opt/spark-3.2.2-bin-hadoop3.2"
export PATH="${SPARK_HOME}/bin":"${PATH}"

To place it under ${HOME}/:

scripts/get-spark-to-home.sh
export SPARK_HOME="${HOME}/spark-3.2.2-bin-hadoop3.2"
export PATH="${SPARK_HOME}/bin":"${PATH}"

Both Java 8 and Java 11 are supported, but Java 17 is not (Spark 3.2.2 will fail, since it uses internal Java APIs and does not set the permissions appropriately).

Building the project

Run:

scripts/build.sh

Running the generator

Once you have Spark in place and built the JAR file, run the generator as follows:

export PLATFORM_VERSION=$(sbt -batch -error 'print platformVersion')
export DATAGEN_VERSION=$(sbt -batch -error 'print version')
export LDBC_SNB_DATAGEN_JAR=$(sbt -batch -error 'print assembly / assemblyOutputPath')
./tools/run.py <runtime configuration arguments> -- <generator configuration arguments>

Runtime configuration arguments

The runtime configuration arguments determine the amount of memory, number of threads, degree of parallelism. For a list of arguments, see:

./tools/run.py --help

To generate a single part-* file, reduce the parallelism (number of Spark partitions) to 1.

./tools/run.py --parallelism 1 -- --format csv --scale-factor 0.003 --mode bi

Generator configuration arguments

The generator configuration arguments allow the configuration of the output directory, output format, layout, etc.

To get a complete list of the arguments, pass --help to the JAR file:

./tools/run.py -- --help

Generating csv-composite-merged-fk files in BI mode resulting in compressed .csv.gz files:

./tools/run.py -- --format csv --scale-factor 0.003 --mode bi --format-options compression=gzip

Generating csv-composite-merged-fk files in BI mode and generating factors:

./tools/run.py -- --format csv --scale-factor 0.003 --mode bi --generate-factors

Generating CSVs in raw mode:

./tools/run.py -- --format csv --scale-factor 0.003 --mode raw --output-dir sf0.003-raw

Generating Parquet files in BI mode:

./tools/run.py -- --format parquet --scale-factor 0.003 --mode bi

Use epoch milliseconds encoded as longs for serializing date and datetime values in BI mode (this is equivalent to using the LongDateFormatter in the Hadoop Datagen):
```
./tools/run.py -- --format csv --scale-factor 0.003 --mode bi --epoch-millis
```
For the BI mode, the --format-options argument allows passing formatting options such as timestamp/date formats, the presence/abscence of headers (see the Spark formatting options for details), and whether quoting the fields in the CSV required:
```
./tools/run.py -- --format csv --scale-factor 0.003 --mode bi --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y,header=false,quoteAll=true
```
The --explode-attrs argument implies one of the csv-singular-{projected-fk,merged-fk} formats, which has separate files to store multi-valued attributes (email, speaks).
```
./tools/run.py -- --format csv --scale-factor 0.003 --mode bi --explode-attrs
```
The --explode-edges argument implies one of the csv-{composite,singular}-projected-fk formats, which has separate files to store many-to-one edges (e.g. Person_isLocatedIn_City, Tag_hasType_TagClass, etc.).
```
./tools/run.py -- --format csv --scale-factor 0.003 --mode bi --explode-edges
```
The --explode-attrs and --explode-edges arguments together imply the csv-singular-projected-fk format:
```
./tools/run.py -- --format csv --scale-factor 0.003 --mode bi --explode-attrs --explode-edges
```

To change the Spark configuration directory, adjust the SPARK_CONF_DIR environment variable.

A complex example:

export SPARK_CONF_DIR=./conf
./tools/run.py --parallelism 4 --memory 8G -- --format csv --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y --explode-edges --explode-attrs --mode bi --scale-factor 0.003

It is also possible to pass a parameter file:

./tools/run.py -- --format csv --param-file params.ini

Docker images

SNB Datagen images are available via Docker Hub. The image tags follow the pattern ${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION}, e.g ldbc/datagen-standalone:0.5.0-2.12_spark3.2.

When building images ensure that you use BuildKit.

Standalone Docker image

The standalone image bundles Spark with the JAR and Python helpers, so you can run a workload in a container similarly to a local run, as you can see in this example:

export SF=0.003
mkdir -p out_sf${SF}_bi   # create output directory
docker run \
    --mount type=bind,source="$(pwd)"/out_sf${SF}_bi,target=/out \
    --mount type=bind,source="$(pwd)"/conf,target=/conf,readonly \
    -e SPARK_CONF_DIR=/conf \
    ldbc/datagen-standalone:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION} \
    --parallelism 1 \
    -- \
    --format csv \
    --scale-factor ${SF} \
    --mode bi \
    --generate-factors

The standalone Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:

export PLATFORM_VERSION=$(sbt -batch -error 'print platformVersion')
export DATAGEN_VERSION=$(sbt -batch -error 'print version')
export DOCKER_BUILDKIT=1
docker build . --target=standalone -t ldbc/datagen-standalone:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION}

JAR-only image

The ldbc/datagen-jar image contains the assembly JAR, so it can bundled in your custom container:

FROM my-spark-image
ARG VERSION
COPY --from=ldbc/datagen-jar:${VERSION} /jar /lib/ldbc-datagen.jar

The JAR-only Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:

docker build . --target=jar -t ldbc/datagen-jar:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION}

Pushing to Docker Hub

To release a new snapshot version on Docker Hub, run:

docker tag ldbc/datagen-jar:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION} ldbc/datagen-jar:latest
docker push ldbc/datagen-jar:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION}
docker push ldbc/datagen-jar:latest
docker tag ldbc/datagen-standalone:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION} ldbc/datagen-standalone:latest
docker push ldbc/datagen-standalone:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION}
docker push ldbc/datagen-standalone:latest

To release a new stable version, create a new Git tag (e.g. by creating a new release on GitHub), then build the Docker image and push it.

Elastic MapReduce

We provide scripts to run Datagen on AWS EMR. See the README in the ./tools/emr directory for details.

Graph schema

The graph schema is as follows:

Troubleshooting

When running the tests, they might throw a java.net.UnknownHostException: your_hostname: your_hostname: Name or service not known coming from org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal. The solution is to add an entry of your machine's hostname to the /etc/hosts file: 127.0.1.1 your_hostname.
If you are using Docker and Spark runs out of space, make sure that Docker has enough space to store its containers. To move the location of the Docker containers to a larger disk, stop Docker, edit (or create) the /etc/docker/daemon.json file and add { "data-root": "/path/to/new/docker/data/dir" }, then sync the old folder if needed, and restart Docker. (See more detailed instructions).
If you are using a local Spark installation and run out of space in /tmp (java.io.IOException: No space left on device), set the SPARK_LOCAL_DIRS to point to a directory with enough free space.
The Docker image may throw the following error when generating factors java.io.FileNotFoundException: /tmp/blockmgr-.../.../temp_shuffle_... (No file descriptors available). This error occurs on Fedora 36 host machines. Changing to an Ubuntu 22.04 host machine resolves the problem. Related issue: #420.

ldbc_snb_datagen_spark's People

Stargazers

Watchers

ldbc_snb_datagen_spark's Issues

Parameter generation is very slow

According to reports by several users, for large datesets (eg. SF1000) the generation of parameters becomes the most expensive part of the generation process (90 minutes of generating data, 12 hours for generating the parameters). We should rethink its implementation (maybe porting it to a hadoop job).

Tests locally delete generated files before terminating

@ArnauPrat I observed some very strange phenomenon.

On Travis, the CSV files are generated, and also the substitution parameters are created.

However, when I run mvn test locally, by the time the tests finish, there are no files in test_data (just an empty directory). Even more interestingly, if I run the tests from IntelliJ, they work fine. I also tried running the tests from an sh shell session (so that my .bashrc and .bash_profile settings are not loaded), but it did not make a difference.

give BI Parameters files helpful header names

at present headers are something like this, for all files:

Param0|Param1|Param2

it would be great if they described their content

cc @mkaufmann

Help with finding datagen version used for benchmarks

This issue probably doesn't belong here but I would still appreciate if you could help me figure it out.

I am trying to replicate the benchmarks mentioned in this paper. It uses the ldbc_snb_datagen to generate data for scale factors 3 and 10 and then loads them into various databases.

In particular, I am facing problems while loading data into Postgres. I am using this script to do so after loading the db_schema.sql and the stored_procedures.sql. I have noticed that the script expects certain columns (e.g. Creator to be part of Person table) and hence the script fails. You would notice that it doesn't try to read files like comment_hasCreator_person_0_0.csv, post_hasCreator_person_0_0.csv among others.

My question is that do you know which version of the ldbc_snb_datagen is being used for these benchmarks? I have opened an issue here but haven't got a reply yet. Any help would be much appreciated.

generateparams error when scaling

Hi there,

I am trying to generate the 100GB dataset, but I am getting the following error:
I do have enough memory on the YARN containers (set even to 200GB each) and still have this error happening.

Traceback (most recent call last):
  File "paramgenerator/generateparams.py", line 296, in <module>
    sys.exit(main())
  File "paramgenerator/generateparams.py", line 152, in main
    selectedPersonParams[i] = discoverparams.generate(factors)
  File "/home/[user]/ldbc/ldbc_snb_datagen/paramgenerator/discoverparams.py", line 122, in generate
    params = len(factors[0]) -1

So far I was able to generate up to the 10GB one. Any ideas of what could this be?

Update substitution parameter generator

The parameter generator should be updated in accordance with the latest specification and driver, see issues ldbc/ldbc_snb_docs#70 and ldbc/ldbc_snb_interactive_v1_driver#77.

The main task now is to update the BI parameter generator: https://github.com/ldbc/ldbc_snb_datagen/blob/38045cb60b666fab3e39a4d2860e8f7af80c427f/paramgenerator/generateparamsbi.py

I already fixed a number of inconsistencies, but there are a few left:

query 3: generate year and month
query 16: generate minPathDistance and maxPathDistance
query 18: generate lengthThreshold and languages
query 20: generate tagClasses and not just a single tagClass
query 25: add implementation

@mkaufmann I think originally you wrote most of this, so if you could jump in, that'd help speed things up.

Missing distribution.R in the tools folder

So now that I have my nice generated graph, I wanted to use the degreeAnaysis script provided in the tools folder. The problem is that it calls this distribution R script that seems to have been lost (there's just a cumulative.R). For me it's fine, I have my own R script that does that, I just need to do some awks to bring it to the desired form. I just thought it would be nice to let you guys know about this one. Maybe I'm doing something wrong?

Thanks!
Andra

walkmod.xml is not present and causes NullPointerException

After installing hadoop and executing run.sh, it fails with a NullPointerException.

ERROR [main] - /home/hadoop/ldbc_snb_datagen/walkmod.xml is invalid. Please, execute walkmod with -e to see the details

I am assuming that walkmod.xml is a configuration file which is absent. If not, maybe how to get around this can be a wiki/readme instruction

Merge keys for hasType

As a followup of ldbc/ldbc_snb_docs#121, we should merge the keys of hasType to Tag for the merge foreign keys format.

Broken link for workload generator

The link for the ldbc_socialnet_qgen is broken in the wiki. Does the workload generator still exist as a separate component?

Not a valid .JAR

Dear all,

I found a minor bug. Since, the pom.xml includes:
"0.2.7"

When running ./run.sh -> Not a valid .JAR: /path/ldbc_dnb_datagen-0.2.5-jar-with-dependencies.jar will appear.

I will fix the .sh

Regards.

LDBC BI

I read the documentation of the SNB data generator, but I didn't actually get the differnce between the workloads of interactive and BI. when to call the workload BI is it based on the scale factor ?

or in another way, how to generate a LDBC BI data set using the generator?

and where can I find the Queries of the interactive as BI workloads ?

Thank you for you help and support

The max length of the type string is not enforced

Hi,

The benchmark specification on pp.12 defines the type string as a variable length text of size 40 unicode characters. However it appears that at least using the SNB generator with SF1, several strings have longer lengths. For instance:

forum.name:
Failed to import table line 316 field title 'varchar(40)' expected in 'Group for Agatha_Christie in Newcastle_upon_Tyne'
Failed to import table line 13860 field title 'varchar(60)' expected in 'Group for Titanic:_Music_from_the_Motion_Picture in Shah_Alam'
organisation.name:
Failed to import table line 172 field name 'varchar(40)' expected in 'SAVAG__Sociedade_Anônima_Viação_Aérea_Gaúcha'
Failed to import table line 2705 field name 'varchar(80)' expected in 'École_Nationale_Supérieure_d'Électronique,_d'Électrotechnique,_d'Informatique,_d'Hydraulique_et_des_Télécommunications'
person.email:
Failed to import table line 640 field email 'varchar(40)' expected in '[email protected]'
place.name:
Failed to import table line 760 field name 'varchar(40)' expected in 'http://maps.google.com/maps?ll=25.530889,69.017306&spn=0.01,0.01&t=m&q=25.530889,69.017306'
tag.name:
Failed to import table line 919 field name 'varchar(40)' expected in 'Princess_Victoria_of_Saxe-Coburg-Saalfeld'
Failed to import table line 15618 field name 'varchar(80)' expected in 'Children_of_Nuggets:_Original_Artyfacts_from_the_Second_Psychedelic_Era,_1976–1995'

Am I misinterpreting the definition or either the specification or the datagen implementation is in defect?

Thanks,
Dean

PARAM_GENERATION seems deprecated

@ArnauPrat I was maintaining the https://github.com/ldbc/ldbc_snb_datagen/wiki/Configuration wikipage that says:

PARAM_GENERATION: indicates whether the parameters for SNB queries are generated. You should only use it with standard scaleFactor (e.g., SF 1). Always disable PARAM_GENERATION when using the data generator for non-standard input parameters (e.g., when you set numYears instead of using scaleFactor).

However, I cannot find any instances of the PARAM_GENERATION string across LDBC projects.

Release v0.2.8

We should bump the version and publish release before introducing the breaking changes required by #56.

java.lang.Exception

17/06/03 01:12:15 INFO mapreduce.Job:  map 0% reduce 0%
17/06/03 01:12:28 INFO mapreduce.Job:  map 67% reduce 0%
17/06/03 01:12:35 INFO mapreduce.Job: Task Id : attempt_1496421666783_0002_m_000000_0, Status : FAILED
Error: GC overhead limit exceeded
17/06/03 01:12:36 INFO mapreduce.Job:  map 0% reduce 0%
17/06/03 01:12:46 INFO mapreduce.Job:  map 67% reduce 0%
17/06/03 01:12:53 INFO mapreduce.Job: Task Id : attempt_1496421666783_0002_m_000000_1, Status : FAILED
Error: GC overhead limit exceeded
17/06/03 01:12:54 INFO mapreduce.Job:  map 0% reduce 0%
17/06/03 01:13:04 INFO mapreduce.Job:  map 67% reduce 0%
17/06/03 01:20:10 INFO mapreduce.Job: Task Id : attempt_1496421666783_0002_m_000000_2, Status : FAILED
Error: Java heap space
17/06/03 01:20:11 INFO mapreduce.Job:  map 0% reduce 0%
17/06/03 01:20:22 INFO mapreduce.Job:  map 67% reduce 0%
17/06/03 01:20:30 INFO mapreduce.Job:  map 100% reduce 0%
17/06/03 01:20:31 INFO mapreduce.Job:  map 100% reduce 100%
17/06/03 01:20:32 INFO mapreduce.Job: Job job_1496421666783_0002 failed with state FAILED due to: Task failed task_1496421666783_0002_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

17/06/03 01:20:32 INFO mapreduce.Job: Counters: 8
	Job Counters 
		Failed map tasks=4
		Launched map tasks=4
		Other local map tasks=4
		Total time spent by all maps in occupied slots (ms)=486829
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=486829
		Total vcore-seconds taken by all map tasks=486829
		Total megabyte-seconds taken by all map tasks=498512896
Error during execution
null
java.lang.Exception
	at ldbc.snb.datagen.hadoop.HadoopPersonGenerator.run(HadoopPersonGenerator.java:138)
	at ldbc.snb.datagen.generator.LDBCDatagen.runGenerateJob(LDBCDatagen.java:97)
	at ldbc.snb.datagen.generator.LDBCDatagen.main(LDBCDatagen.java:372)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

When I run run.sh, I got this message. What I only changed is HADOOP_HOME and LDBC_SNB_DATAGEN_HOME.

Error: java.lang.NullPointerException during "Serializing persons"

While executing the data generator with scale factor 1 on an hadoop cluster I'm getting this error during "Serializing persons" execution:

17/09/28 11:39:11 INFO client.RMProxy: Connecting to ResourceManager at siti-rack.crumb.disco.it/172.24.1.201:8032
17/09/28 11:39:11 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/09/28 11:39:12 INFO input.FileInputFormat: Total input paths to process : 1
17/09/28 11:39:12 INFO mapreduce.JobSubmitter: number of splits:1
17/09/28 11:39:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1505825335849_0082
17/09/28 11:39:12 INFO impl.YarnClientImpl: Submitted application application_1505825335849_0082
17/09/28 11:39:12 INFO mapreduce.Job: The url to track the job: http://siti-rack.crumb.disco.it:8088/proxy/application_1505825335849_0082/
17/09/28 11:39:12 INFO mapreduce.Job: Running job: job_1505825335849_0082
17/09/28 11:39:16 INFO mapreduce.Job: Job job_1505825335849_0082 running in uber mode : false
17/09/28 11:39:16 INFO mapreduce.Job:  map 0% reduce 0%
17/09/28 11:39:20 INFO mapreduce.Job:  map 100% reduce 0%
17/09/28 11:39:29 INFO mapreduce.Job: Task Id : attempt_1505825335849_0082_r_000000_0, Status : FAILED
Error: java.lang.NullPointerException
	at ldbc.snb.datagen.hadoop.HadoopPersonSortAndSerializer$HadoopPersonSerializerReducer.cleanup(HadoopPersonSortAndSerializer.java:123)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

17/09/28 11:39:36 INFO mapreduce.Job: Task Id : attempt_1505825335849_0082_r_000000_1, Status : FAILED
Error: java.lang.NullPointerException
	at ldbc.snb.datagen.hadoop.HadoopPersonSortAndSerializer$HadoopPersonSerializerReducer.cleanup(HadoopPersonSortAndSerializer.java:123)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

17/09/28 11:39:43 INFO mapreduce.Job: Task Id : attempt_1505825335849_0082_r_000000_2, Status : FAILED
Error: java.lang.NullPointerException
	at ldbc.snb.datagen.hadoop.HadoopPersonSortAndSerializer$HadoopPersonSerializerReducer.cleanup(HadoopPersonSortAndSerializer.java:123)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

17/09/28 11:39:51 INFO mapreduce.Job:  map 100% reduce 100%
17/09/28 11:39:51 INFO mapreduce.Job: Job job_1505825335849_0082 failed with state FAILED due to: Task failed task_1505825335849_0082_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1

17/09/28 11:39:51 INFO mapreduce.Job: Counters: 37
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=169318
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=59137
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=4
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Job Counters 
		Failed reduce tasks=4
		Launched map tasks=1
		Launched reduce tasks=4
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=4420
		Total time spent by all reduces in occupied slots (ms)=79484
		Total time spent by all map tasks (ms)=2210
		Total time spent by all reduce tasks (ms)=19871
		Total vcore-seconds taken by all map tasks=2210
		Total vcore-seconds taken by all reduce tasks=19871
		Total megabyte-seconds taken by all map tasks=4526080
		Total megabyte-seconds taken by all reduce tasks=81391616
	Map-Reduce Framework
		Map input records=100
		Map output records=100
		Map output bytes=59199
		Map output materialized bytes=29575
		Input split bytes=166
		Combine input records=0
		Spilled Records=100
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=7
		CPU time spent (ms)=380
		Physical memory (bytes) snapshot=519487488
		Virtual memory (bytes) snapshot=2384412672
		Total committed heap usage (bytes)=501219328
	File Input Format Counters 
		Bytes Read=58971
Error during execution
null
java.lang.Exception
	at ldbc.snb.datagen.hadoop.HadoopPersonSortAndSerializer.run(HadoopPersonSortAndSerializer.java:166)
	at ldbc.snb.datagen.generator.LDBCDatagen.runGenerateJob(LDBCDatagen.java:155)
	at ldbc.snb.datagen.generator.LDBCDatagen.main(LDBCDatagen.java:340)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Exception in thread "main" java.lang.Exception
	at ldbc.snb.datagen.hadoop.HadoopPersonSortAndSerializer.run(HadoopPersonSortAndSerializer.java:166)
	at ldbc.snb.datagen.generator.LDBCDatagen.runGenerateJob(LDBCDatagen.java:155)
	at ldbc.snb.datagen.generator.LDBCDatagen.main(LDBCDatagen.java:340)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Someone have a possible solution? I can't figure it out by myself

Edit: I solved the problem adding the following configuration in the params.ini file:

ldbc.snb.datagen.serializer.updateStreams:false

I'm still able to run a benchmark with that configuration set to false? Maybe generating the update streams parameters after the data generation?

Why didn't I output the csv file?

Hello,when I run the./run.sh,everything is ok.But my path not include the csv file.What Happen？
Thank you for your answer,hope you can help me.

all Add Friendship events end up in updateStream partition 0

this is my config (feel free to ask for more details!):

# SIZE - FROM SCALE FACTOR
# ldbc.snb.datagen.generator.scaleFactor:1

# SIZE - FROM SPECIFIC SIZE PARAMETERS
ldbc.snb.datagen.generator.numPersons:1000
ldbc.snb.datagen.generator.numYears:1
ldbc.snb.datagen.generator.startYear:2010

# GENERAL
ldbc.snb.datagen.serializer.compressed:false
ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer
ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVInvariantSerializer
ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonActivitySerializer
ldbc.snb.datagen.serializer.updateStreams:true
ldbc.snb.datagen.serializer.numUpdatePartitions:6
ldbc.snb.datagen.serializer.outputDir:/Users/alexaverbuch/hadoopTempDir/output/

ldbc.snb.datagen.generator.numThreads:4

# https://github.com/ldbc-dev/ldbc_snb_datagen_0.2/wiki/Compilation_Execution
# https://github.com/ldbc-dev/ldbc_snb_datagen_0.2/blob/master/src/main/resources/params.ini

Parameter generation does not work for large scale factors

For example, queries 4 and 14 for SF 10 return empty parameters.

params.ini might be overwritten

Hi,
when using the provided run.sh script and running the program on a hadoop cluster, the execution immediately aborts due to configuration errors. The following messages are reported to the screen:

Available scale factor configuration set snb.interactive.1
Available scale factor configuration set snb.interactive.3
Available scale factor configuration set snb.interactive.10
Available scale factor configuration set snb.interactive.30
Available scale factor configuration set snb.interactive.100
Available scale factor configuration set snb.interactive.300
Available scale factor configuration set snb.interactive.1000
Available scale factor configuration set graphalytics.1
Available scale factor configuration set graphalytics.3
Available scale factor configuration set graphalytics.10
Available scale factor configuration set graphalytics.30
Available scale factor configuration set graphalytics.100
Available scale factor configuration set graphalytics.300
Available scale factor configuration set graphalytics.1000
Available scale factor configuration set graphalytics.3000
Available scale factor configuration set graphalytics.10000
Available scale factor configuration set graphalytics.30000
Number of scale factors read 17
Applied configuration of scale factor snb.interactive.1
Applied configuration of scale factor snb.interactive.1
******* Configuration *******
ldbc.snb.datagen.serializer.personSerializer ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer
ldbc.snb.datagen.serializer.formatter.StringDateFormatter.dateTimeFormat yyyy-MM-dd'T'HH:mm:ss.SSSZ
ldbc.snb.datagen.serializer.socialNetworkDir .//social_network
ldbc.snb.datagen.generator.numThreads 1
ldbc.snb.datagen.parametergenerator.parameters true
ldbc.snb.datagen.generator.numPersons 11000
ldbc.snb.datagen.generator.startYear 2010
ldbc.snb.datagen.serializer.compressed false
ldbc.snb.datagen.generator.numYears 3
ldbc.snb.datagen.serializer.persons.sort true
ldbc.snb.datagen.serializer.endlineSeparator false
  bc.snb.datagen.serializer.personActivitySerializer ldbc.snb.datagen.serializer.snb.interactive.CSVPersonActivitySerializer
ldbc.snb.datagen.serializer.outputDir ./
ldbc.snb.datagen.generator.knowsGenerator ldbc.snb.datagen.generator.DistanceKnowsGenerator
ldbc.snb.datagen.serializer.updateStreams true
ldbc.snb.datagen.parametergenerator.python python
ldbc.snb.datagen.serializer.dateFormatter ldbc.snb.datagen.serializer.formatter.StringDateFormatter
ldbc.snb.datagen.serializer.formatter.StringDateFormatter.dateFormat yyyy-MM-dd
ldbc.snb.datagen.generator.activity true
ldbc.snb.datagen.serializer.invariantSerializer ldbc.snb.datagen.serializer.snb.interactive.CSVInvariantSerializer
ldbc.snb.datagen.serializer.numUpdatePartitions 1
ldbc.snb.datagen.generator.distribution.degreeDistribution ldbc.snb.datagen.generator.distribution.FacebookDegreeDistribution
ldbc.snb.datagen.serializer.hadoopDir .//hadoop
ldbc.snb.datagen.generator.deltaTime 10000
ldbc.snb.datagen.generator.person.similarity ldbc.snb.datagen.objects.similarity.GeoDistanceSimilarity
ldbc.snb.datagen.serializer.numPartitions 1
*********************************
Error reading scale factors
Missing ldbc.snb.datagen.generator.baseProbCorrelated parameter

The issue originates from the existence of two files named params.ini, the one in the root folder and another in src/main/resources, eventually packaged inside the jar artifact. Looking at LDBCDatagen::main:

ConfigParser.readConfig(conf, args[0]);
ConfigParser.readConfig(conf, LDBCDatagen.class.getResourceAsStream("/params.ini"));

this code has the effect of parsing the same file twice. As workaround, renaming one of the two files resolves this issue. Though it might be more reasonable to avoid completely such name clashes or at least report to the user when such conflict occurs.

out of memory error

Hi,
Trying to run SF30 on 24 threads ran into the following issue:

15/02/11 00:24:21 WARN mapred.LocalJobRunner: job_local940749219_0013
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.OutOfMemoryError: Java heap space
at ldbc.snb.datagen.objects.Knows.<init>(Knows.java:66)
at ldbc.snb.datagen.generator.KnowsGenerator.createKnow(KnowsGenerator.java:61)
at ldbc.snb.datagen.generator.KnowsGenerator.generateKnows(KnowsGenerator.java:35)
at ldbc.snb.datagen.hadoop.HadoopKnowsGenerator$HadoopKnowsGeneratorReducer.reduce(HadoopKnowsGenerator.java:43)
at ldbc.snb.datagen.hadoop.HadoopKnowsGenerator$HadoopKnowsGeneratorReducer.reduce(HadoopKnowsGenerator.java:25)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Configuration ideas? Should I increase the general heap space / site heap space / job heap space?
BTW, the machine itself has 378GB RAM.
Thanks
Tomer

Making java 1.7 compatible

Related to 833868f, List#sort is still Java 1.8. For some reason the error is not caught by the Java compiler, but I assure you a JVM based on 1.7 catches it strong & clear. Attached a tentative patch using Collections#sort
diff.txt

Rename packages

Our current plan is to use the Interactive Datagen for BI as well. As everything is named snb.interactive, we should clarify this in the docs.

We could also rename it, but that'd break user's param files, etc.

$  ag -Q "snb.interactive" 
src/main/resources/scale_factors.xml
3:    <scale_factor name="snb.interactive.1" >
22:    <scale_factor name="snb.interactive.3" >
41:    <scale_factor name="snb.interactive.10" >
60:    <scale_factor name="snb.interactive.30" >
79:    <scale_factor name="snb.interactive.100" >
98:    <scale_factor name="snb.interactive.300" >
117:    <scale_factor name="snb.interactive.1000" >

src/main/java/ldbc/snb/datagen/util/ConfigParser.java
70:        conf.set("ldbc.snb.datagen.serializer.personSerializer", "ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer");
71:        conf.set("ldbc.snb.datagen.serializer.invariantSerializer", "ldbc.snb.datagen.serializer.snb.interactive.CSVInvariantSerializer");
72:        conf.set("ldbc.snb.datagen.serializer.personActivitySerializer", "ldbc.snb.datagen.serializer.snb.interactive.CSVPersonActivitySerializer");

src/main/java/ldbc/snb/datagen/generator/LDBCDatagen.java
216:                aux = Long.parseLong(properties.getProperty("ldbc.snb.interactive.min_write_event_start_time"));
218:                aux = Long.parseLong(properties.getProperty("ldbc.snb.interactive.max_write_event_start_time"));
220:                aux = Long.parseLong(properties.getProperty("ldbc.snb.interactive.num_events"));
230:                    aux = Long.parseLong(properties.getProperty("ldbc.snb.interactive.min_write_event_start_time"));
232:                    aux = Long.parseLong(properties.getProperty("ldbc.snb.interactive.max_write_event_start_time"));
234:                    aux = Long.parseLong(properties.getProperty("ldbc.snb.interactive.num_events"));
243:            output.write(new String("ldbc.snb.interactive.gct_delta_duration:" + DatagenParams.deltaTime + "\n")
245:            output.write(new String("ldbc.snb.interactive.min_write_event_start_time:" + minDate + "\n").getBytes());
246:            output.write(new String("ldbc.snb.interactive.max_write_event_start_time:" + maxDate + "\n").getBytes());
247:            output.write(new String("ldbc.snb.interactive.update_interleave:" + (maxDate - minDate) / count + "\n")
249:            output.write(new String("ldbc.snb.interactive.num_events:" + count).getBytes());

src/main/java/ldbc/snb/datagen/serializer/UpdateEventSerializer.java
106:                                                             .getProperty("ldbc.snb.interactive.min_write_event_start_time"));
108:                                                             .getProperty("ldbc.snb.interactive.max_write_event_start_time"));
109:                    stats_.count_ = Long.parseLong(properties.getProperty("ldbc.snb.interactive.num_events"));
187:                output.write(new String("ldbc.snb.interactive.gct_delta_duration:" + DatagenParams.deltaTime + "\n")
189:                output.write(new String("ldbc.snb.interactive.min_write_event_start_time:" + stats_.minDate_ + "\n")
191:                output.write(new String("ldbc.snb.interactive.max_write_event_start_time:" + stats_.maxDate_ + "\n")
194:                    output.write(new String("ldbc.snb.interactive.update_interleave:" + (stats_.maxDate_ - stats_.minDate_) / stats_.count_ + "\n")
197:                    output.write(new String("ldbc.snb.interactive.update_interleave:" + "0" + "\n").getBytes());
199:                output.write(new String("ldbc.snb.interactive.num_events:" + stats_.count_).getBytes());

src/main/java/ldbc/snb/datagen/serializer/snb/interactive/CSVPersonActivitySerializer.java
36:package ldbc.snb.datagen.serializer.snb.interactive;

src/main/java/ldbc/snb/datagen/serializer/snb/interactive/CSVMergeForeignPersonSerializer.java
36:package ldbc.snb.datagen.serializer.snb.interactive;

src/main/java/ldbc/snb/datagen/serializer/snb/interactive/TurtleInvariantSerializer.java
36:package ldbc.snb.datagen.serializer.snb.interactive;

src/main/java/ldbc/snb/datagen/serializer/snb/interactive/TurtlePersonActivitySerializer.java
36:package ldbc.snb.datagen.serializer.snb.interactive;

src/main/java/ldbc/snb/datagen/serializer/snb/interactive/CSVPersonSerializer.java
38:package ldbc.snb.datagen.serializer.snb.interactive;

src/main/java/ldbc/snb/datagen/serializer/snb/interactive/CSVMergeForeignInvariantSerializer.java
36:package ldbc.snb.datagen.serializer.snb.interactive;

src/main/java/ldbc/snb/datagen/serializer/snb/interactive/CSVMergeForeignPersonActivitySerializer.java
36:package ldbc.snb.datagen.serializer.snb.interactive;

src/main/java/ldbc/snb/datagen/serializer/snb/interactive/TurtlePersonSerializer.java
38:package ldbc.snb.datagen.serializer.snb.interactive;

src/main/java/ldbc/snb/datagen/serializer/snb/interactive/CSVInvariantSerializer.java
36:package ldbc.snb.datagen.serializer.snb.interactive;

test_params.ini
1:ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1
3:ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer
4:ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVInvariantSerializer
5:ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonActivitySerializer

params.ini
1:ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1
3:ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer
4:ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVInvariantSerializer
5:ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonActivitySerializer

params.ini-graph.bak
1:ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1
7:ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer
8:ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVInvariantSerializer
9:ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonActivitySerializer

params-foreign-key.ini
1:ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1
3:ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVMergeForeignPersonSerializer
4:ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVMergeForeignInvariantSerializer
5:ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVMergeForeignPersonActivitySerializer

person_likes_* dataset files sorted oppositely from rest of edge files

It appears that while all other dataset edge files are sorted by the SOURCE vertex for directed edges, the person_likes_* files are sorted by the DEST vertex.

For example, person_hasInterest_tag lists edges in groups by the SOURCE person:

Person.id|Tag.id
933|59
933|291
933|565
933|569
933|779
...
1129|61
1129|62
1129|141
1129|205
...

However person_likes_comment files list edges in groups by the DEST comment:

Person.id|Comment.id|creationDate
32985348833579|2061584302097|2012-09-07T16:44:37.850+0000
4139|2061584302100|2012-07-21T08:37:04.052+0000
6597069777240|2061584302100|2012-07-24T21:46:12.715+0000
...
4139|1786706395175|2012-03-23T18:07:36.755+0000
6597069777240|1786706395175|2012-03-22T11:05:46.421+0000
...

While it's not technically a bug since edge ordering in the dataset files was never explicity specified, this difference in edge ordering between the files does create headaches for loader code, which in my case, takes advantage of the SOURCE vertex grouping of edges. The different ordering for person_likes_* edges means that code handling edge loading potentially needs to treat person_likes_* as a special case. In my particular case, I actually run a map-reduce job over the edge files anyway to generate the files sorted both ways (one set of files grouped by SOURCE vertex, and another group of files sorted by DEST vertex, for each directed edge type).

However the data generator decides to group edges by, either by SOURCE or DEST, ideally it should be consistent for all directed edge list files.

As a side note... this different ordering for person_likes_* edge list files resulted in incorrect outgoing "like" edge lists from person vertices to messge vertices in my database, yet I still managed to pass validation (using the Neo4j validation dataset). Looking throug the benchmark, I realized the benchmark actually never validates traversals over like edges from person vertices TO message vertices, only message vertices TO person vertices (which were correct in my case). Thus, I hadn't actually discovered my bug until just now :)

distribution of WORKS_AT.workFrom year seems odd

the distribution of years on this property looks fine between ~1980 and ~2013, like a bell curve.
however, it has a very large number of entries at 1970 & 1971.

something like this

1970 *********************************
1971 *********************************

1980 ***
******
*********
***************
*********************
******************
***************
*********
******
2013 ***

is that expected?

Wrong q2_params.txt in Bi branch

@mkaufmann
According to BI query2 specification (http://wiki.ldbcouncil.org/display/TUC/Business+Intelligence+Workload), the query requires five parameters:
* date1 - Date
* date2 - Date
* country1 - String
* country2 - String
* endDate - Date

Currently, q2_params.txt file contains something in the form of:

Param0|Param1|Param2
494|570|United_States,United_Kingdom,Germany
164|300|Croatia,Egypt,United_Kingdom,Germany

It seems that the first two columns do not contain valid dates. Second, the file should contain five columns instead of three. Third, this query does not require three countries but just two, put in separate columns with the correct separator. Something similar to this should be output:

Param0|Param1|Param2|Param3|Param4
1330819200000|1339459200000|United_States|Germany|1356998400000
1330819200000|1339459200000|Croatia|Egypt|1356998400000

Note that Param4 is the end of simulated time. By now, you can just handcode the epoch value of 2013-01-01 (as I did in this example). This is what I used to make the conversion: http://www.epochconverter.com/

Arnau

no data found

hi,I complied it with hadoop 2.7.3? I have already complied your project with hadoop2.7.3,and execute run.sh，it job seems to be worded,but I have got no data generated?where can I find it?

some of the updateStream files are empty

Hi,
Seems that when generating SF1 on 24 threads, some of the update stream files are generated empty, causing a Null pointer error in the ldbc driver. FYI @ArnauPrat and @alexaverbuch
Tomer

BI Q12 substitution parameters have a very high like threshold

400 is too high for smaller scale factors, and in any case, it should not be hardcoded in the generator.

https://github.com/ldbc/ldbc_snb_datagen/blob/b581cfb88f474c39d8d7391461c6eb1c5201bd14/paramgenerator/generateparamsbi.py#L180-L183

Docker image

There are many things to get right on a new systems for a simple data generation: installing Java and Hadoop, setting JAVA_HOME and HADOOP_HOME, while not forgetting about installing Maven and Python 2 (!).

We should really just make a Docker image out of this.

Timezone for Turtle serializer

With @antaljanosbenjamin we discovered that the Turtle generator uses GMT-2 (at least on our machines - even though Hungary is in GMT-1).

PR #65 addresses this. We are currently testing whether it resolves the issue.

Remarks on the Turtle serializer

The prefix for tags is generated, but never used, even though it could be used.

@prefix sntag: <http://www.ldbc.eu/ldbc_socialnet/1.0/tag/> .
[...]
<http://www.ldbc.eu/ldbc_socialnet/1.0/tag/Rembrandt> foaf:name "Rembrandt" .
<http://www.ldbc.eu/ldbc_socialnet/1.0/tag/Rembrandt> rdf:type _:tagclass000250 .
<http://www.ldbc.eu/ldbc_socialnet/1.0/tag/Rembrandt> snvoc:id "1451"^^xsd:int .

Some prefixes are included multiple times -- this may cause some implementations to throw an error.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix snvoc: <http://www.ldbc.eu/ldbc_socialnet/1.0/vocabulary/> .
@prefix sntag: <http://www.ldbc.eu/ldbc_socialnet/1.0/tag/> .
@prefix sn: <http://www.ldbc.eu/ldbc_socialnet/1.0/data/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .

(Related ticket: #7, where the Turtle serializer was introduced.)

update stream: person adds post before becoming member of forum

I created a validation dataset using

ldbc.snb.datagen.generator.numPersons:3000
ldbc.snb.datagen.generator.numYears:1
ldbc.snb.datagen.generator.startYear:2010
ldbc.snb.datagen.serializer.compressed:false
ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVMergeForeignPersonSerializer
ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVMergeForeignInvariantSerializer
ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVMergeForeignPersonActivitySerializer
ldbc.snb.datagen.serializer.updateStreams:true
ldbc.snb.datagen.serializer.numUpdatePartitions:6
ldbc.snb.datagen.serializer.outputDir:/Users/alexaverbuch/hadoopTempDir/output/
ldbc.snb.datagen.generator.numThreads:8
ldbc.snb.datagen.generator.numPartitions:1

and have found that in updateStream_0_4_forum.csv this line (Line 9861 --- Add Post)
1294783648083|1293778021203|6|42949894023||1294783648083|200.93.242.168|Safari|tk|About Isabella I of Castil. About Samuel Johnson, his . About Johnny Carson, their. About Spain, The sec.|105|5497558140960|25769815355|47|1771;2038;3076;6424

occurs before this line (Line 9950 --- Add Forum Membership)
1294791536201|1293778021203|5|25769815355|5497558140960|1294791536201

i.e., Person 5497558140960 adds Post 42949894023 to Forum 25769815355 before becoming member of that Forum

is this valid behavior?

Add person stream contains university that doesn't exist

Hi,
I'm getting errors from trying to add a person to the database and then update their studyAt relationship to a university that doesn't exist. e.g.:

$ grep '805306368' *.csv
updateStream_0_0_person.csv:1388685681014|0|1|5497558139396|Eugene|Roindefo|female|383986676563|1388685681014|196.192.46.225|Chrome|1206|mg;en|[email protected];[email protected]|282;289;458;544;1164;1175;1191;1206;1532;1545;1765;1909;1953;1996;2030;2061;2777;2780;2781;2785;2786;2793;2797;2840;2984;3034;3054;3059;3060;4844;5183;6400;6945;6962;7012;7517;9097;9102;9333;11531;11695|805306368,2004|758,2010
updateStream_0_0_person.csv:1388980347766|0|1|5497558139187|Jimmy|Tsiranana|male|447319296852|1388980347766|41.188.18.41|Safari|1203|mg;en||6;1615;1616;1760;1953;2057;2082;2120;2783;2820;2845;3041;6377;6993;7380;7516;10222;11250|805306368,2005|758,2013

Please advise,
Tomer

Rewrite parameter generator in Julia

I am currently learning Julia and will use it to take a stab at issue #51, i.e. I'll rewrite the parameter generator (currently implemented in Python 2) in Julia.

I understand that Julia is still a niche language, so users might not be able to trivially install it on their systems. To mitigate this, I'll update the Docker image to support Julia.

The current Docker image is based on Alpine Linux which doesn't have a maintainer for recent Julia releases. Alpine uses musl libc, which has only Tier 3 support in Julia, which might include test failures (e.g. see the issue that has only been resolved 11 days ago). Therefore I'd just move from openjdk:8-jdk-alpine to the Debian-based openjdk:8-jdk-stretch. This also has an old Julia version, but it's possible to install with

wget -q https://julialang-s3.julialang.org/bin/linux/x64/1.1/julia-1.1.1-linux-x86_64.tar.gz
tar xf julia-1.1.1-linux-x86_64.tar.gz
export PATH=$PATH:`pwd`/julia-1.1.1/bin
julia

The work happens on the rework-paramgen branch.

The specified bucket does not exist

I'm trying to download some of the existing data sets and I used the following command:
$ aws s3 sync s3://ldbc-snb.s3.amazonaws.com/ .
But got the following error:

fatal error: An error occurred (NoSuchBucket) when calling the ListObjects operation: The specified bucket does not exist

Are the files still hosted there?

Runtime Error in stage "Sorting update streams"

When i launched the program with Hadoop, reported the following error log:

************************************************
* Sorting update streams  *
************************************************
19/05/06 12:32:59 INFO impl.TimelineClientImpl: Timeline service address: http://worker21:8188/ws/v1/timeline/
19/05/06 12:32:59 INFO client.RMProxy: Connecting to ResourceManager at /172.16.104.21:8032
19/05/06 12:32:59 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/05/06 12:33:00 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/yanda/.staging/job_1556954960956_0089
Error during execution
Input path does not exist: hdfs://proj99:9000/user/yanda/hadoop/temp_updateStream_person_0_0
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://proj99:9000/user/yanda/hadoop/temp_updateStream_person_0_0
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
	at ldbc.snb.datagen.hadoop.HadoopUpdateStreamSorterAndSerializer.run(HadoopUpdateStreamSorterAndSerializer.java:130)
	at ldbc.snb.datagen.generator.LDBCDatagen.runGenerateJob(LDBCDatagen.java:197)
	at ldbc.snb.datagen.generator.LDBCDatagen.main(LDBCDatagen.java:340)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://proj99:9000/user/yanda/hadoop/temp_updateStream_person_0_0
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
	at ldbc.snb.datagen.hadoop.HadoopUpdateStreamSorterAndSerializer.run(HadoopUpdateStreamSorterAndSerializer.java:130)
	at ldbc.snb.datagen.generator.LDBCDatagen.runGenerateJob(LDBCDatagen.java:197)
	at ldbc.snb.datagen.generator.LDBCDatagen.main(LDBCDatagen.java:340)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

I tracked into the code, and found that where the probelm is, starting from Line 185 in LDBCDataGen.java:

List<String> personStreamsFileNames = new ArrayList<String>(); 
            List<String> forumStreamsFileNames = new ArrayList<String>(); 
            for (int i = 0; i < DatagenParams.numThreads; ++i) { 
                int numPartitions = conf.getInt("ldbc.snb.datagen.serializer.numUpdatePartitions", 1); 
                for (int j = 0; j < numPartitions; ++j) { 
                    personStreamsFileNames.add(DatagenParams.hadoopDir +  "/temp_updateStream_person_" + i + "_" + j); 
                    if (conf.getBoolean("ldbc.snb.datagen.generator.activity", false)) { 
                        forumStreamsFileNames.add(DatagenParams.hadoopDir +  "/temp_updateStream_forum_" + i + "_" + j); 
                    } 
                } 
            } 
HadoopUpdateStreamSorterAndSerializer updateSorterAndSerializer = new HadoopUpdateStreamSorterAndSerializer(conf); 
updateSorterAndSerializer.run(personStreamsFileNames, "person");

I'm confused why it happened. why the intermediate files "/temp_updateStream_forum_" + i + "_" + j" were not generated successfully?

turtle or n3 serialiser

Hi, the previous version of SNB provides the n3 and turtle serialiser, however I can't find it in new code. Can you please help me in this matter. Cheers

Compilation errors with run.sh

Hi,

I tried to compile the datagen with the run script run.sh, but I got the compilation error as bellow (I'm using : openjdk version "1.8.0_71")

[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/utils/Bucket.java:[3,19] package javafx.util does not exist
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/utils/Bucket.java:[13,66] cannot find symbol
symbol: class Pair
location: class ldbc.snb.datagen.generator.distribution.utils.Bucket
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/EmpiricalDistribution.java:[3,19] package javafx.util does not exist
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/BTERKnowsGenerator.java:[3,19] package javafx.util does not exist
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/BTERKnowsGenerator.java:[30,39] cannot find symbol
symbol: class Pair
location: class ldbc.snb.datagen.generator.BTERKnowsGenerator
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/utils/Bucket.java:[19,14] cannot find symbol
symbol: class Pair
location: class ldbc.snb.datagen.generator.distribution.utils.Bucket
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/EmpiricalDistribution.java:[24,19] cannot find symbol
symbol: class Pair
location: class ldbc.snb.datagen.generator.distribution.EmpiricalDistribution
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/EmpiricalDistribution.java:[24,68] cannot find symbol
symbol: class Pair
location: class ldbc.snb.datagen.generator.distribution.EmpiricalDistribution
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/EmpiricalDistribution.java:[31,35] cannot find symbol
symbol: class Pair
location: class ldbc.snb.datagen.generator.distribution.EmpiricalDistribution
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/BTERKnowsGenerator.java:[134,19] cannot find symbol
symbol: class Pair
location: class ldbc.snb.datagen.generator.BTERKnowsGenerator
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/BTERKnowsGenerator.java:[134,69] cannot find symbol
symbol: class Pair
location: class ldbc.snb.datagen.generator.BTERKnowsGenerator
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/BTERKnowsGenerator.java:[141,40] cannot find symbol
symbol: class Pair
location: class ldbc.snb.datagen.generator.BTERKnowsGenerator
[INFO] 12 errors
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4.680s
[INFO] Finished at: Mon Jul 25 17:00:25 CEST 2016
[INFO] Final Memory: 27M/332M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.5.1:compile (default-compile) on project ldbc_snb_datagen: Compilation failure: Compilation failure:
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/utils/Bucket.java:[3,19] package javafx.util does not exist
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/utils/Bucket.java:[13,66] cannot find symbol
[ERROR] symbol: class Pair
[ERROR] location: class ldbc.snb.datagen.generator.distribution.utils.Bucket
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/EmpiricalDistribution.java:[3,19] package javafx.util does not exist
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/BTERKnowsGenerator.java:[3,19] package javafx.util does not exist
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/BTERKnowsGenerator.java:[30,39] cannot find symbol
[ERROR] symbol: class Pair
[ERROR] location: class ldbc.snb.datagen.generator.BTERKnowsGenerator
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/utils/Bucket.java:[19,14] cannot find symbol
[ERROR] symbol: class Pair
[ERROR] location: class ldbc.snb.datagen.generator.distribution.utils.Bucket
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/EmpiricalDistribution.java:[24,19] cannot find symbol
[ERROR] symbol: class Pair
[ERROR] location: class ldbc.snb.datagen.generator.distribution.EmpiricalDistribution
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/EmpiricalDistribution.java:[24,68] cannot find symbol
[ERROR] symbol: class Pair
[ERROR] location: class ldbc.snb.datagen.generator.distribution.EmpiricalDistribution
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/distribution/EmpiricalDistribution.java:[31,35] cannot find symbol
[ERROR] symbol: class Pair
[ERROR] location: class ldbc.snb.datagen.generator.distribution.EmpiricalDistribution
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/BTERKnowsGenerator.java:[134,19] cannot find symbol
[ERROR] symbol: class Pair
[ERROR] location: class ldbc.snb.datagen.generator.BTERKnowsGenerator
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/BTERKnowsGenerator.java:[134,69] cannot find symbol
[ERROR] symbol: class Pair
[ERROR] location: class ldbc.snb.datagen.generator.BTERKnowsGenerator
[ERROR] /home/danh.lephuoc/ldbc/ldbc_snb_datagen/src/main/java/ldbc/snb/datagen/generator/BTERKnowsGenerator.java:[141,40] cannot find symbol
[ERROR] symbol: class Pair
[ERROR] location: class ldbc.snb.datagen.generator.BTERKnowsGenerator

Is hadoop a dependency of this project?

I ran the program with the example .ini file and received the following error: run.sh: line 21: /home/user/hadoop-2.6.0/bin/hadoop: No such file or directory

No data was generated

Syntax error in generateparams.py

Hi,

I am new to LDBC. I'd like to play with it to generate a nice 50 billion vertex graph :).
For now, all I did was to snoop around the guidelines and to place the contents of ldbc_snb_datagen on my cluster nodes. I saw that you guys have a script that runs everything so I tried sh run.sh; beforehand I modified the first two lines to contain the correct paths...

Still I get this error:

Serializing invariant schema *
146 total seconds
Person generation time: 5
University correlated edge generation time: 4
Interest correlated edge generation time: 6
Random correlated edge generation time: 7
Person serialization time: 8
Person activity generation and serialization time: 67
Sorting update streams time: 48
Invariant schema serialization time: 0
Total Execution time: 146
File "paramgenerator/generateparams.py", line 92
def handlePairCountryParam((Country1, Country2)):
^
SyntaxError: invalid syntax

I am sure I am doing it wrong, but could someone please explain what? Needless to say I did not touch that python script.

Thanks!
Andra

Implement generator for BI query 3

https://github.com/ldbc/ldbc_snb_datagen/blob/master/paramgenerator/generateparamsbi.py#L117

Merge keys for isSubclassOf

As a followup of ldbc/ldbc_snb_docs#114, we should merge the keys of isSubclassOf to TagClass for the merge foreign keys format.

How to generate update streams with dates as Strings

I set the dateFormatter to the StringDateFormatter in the params.ini file.

ldbc.snb.datagen.serializer.dateFormatter:ldbc.snb.datagen.serializer.formatter.StringDateFormatter

The node/property/relationships are fine, but the updateStream_* files contain epoch times. Is there a way to work around this?

generating messages between user

Hi there,
i'm interested in the message frequency between users and i am wondering if there's any plan to implement this feature

person update stream columns are in the wrong order

should be like this:

personId
firstName
lastName
gender
birthday
creationDate
locationIp
browserUsed
cityId
languages
emails
tagIds
studyAts
workAts

New installation of datagen hangs

Hi,
I'm trying to run SF10 on a new machine.
I just now pulled the latest version and I think I configured hadoop ok this time...

Two issues:

It seems to ignore my SF param.
Datagen hangs on the first map-reduce.

This is my params.ini:

scaleFactor:10
compressed:false
serializer:csv
updateStreams:true
outputDir:./res

numThreads:8

This is where it hangs:

Reading scale factors..
Number of scale factors read 7
Executing with scale factor 1
 ... Num Persons 11000
 ... Start Year 2010
 ... Num Years 3
Done ... 49558 surnames were extracted
Done ... 42970 given names were extracted
************************************************
* Starting: Person generation *
************************************************
15/03/03 04:00:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/03/03 04:00:46 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/03/03 04:00:47 INFO input.FileInputFormat: Total input paths to process : 1
15/03/03 04:00:47 INFO mapreduce.JobSubmitter: number of splits:1
15/03/03 04:00:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1425372778972_0002
15/03/03 04:00:48 INFO impl.YarnClientImpl: Submitted application application_1425372778972_0002
15/03/03 04:00:48 INFO mapreduce.Job: The url to track the job: http://SarahN.labs.hpl.hp.com:8088/proxy/application_1425372778972_0002/
15/03/03 04:00:48 INFO mapreduce.Job: Running job: job_1425372778972_0002
15/03/03 04:00:55 INFO mapreduce.Job: Job job_1425372778972_0002 running in uber mode : false
15/03/03 04:00:55 INFO mapreduce.Job:  map 0% reduce 0%
15/03/03 04:01:05 INFO mapreduce.Job:  map 67% reduce 0%
15/03/03 04:01:06 INFO mapreduce.Job:  map 100% reduce 0%

Ideas?

Serialize generated data to multiple output formats

For a complex query where we are not sure if the results are correct, we would like to be able to verify it against another system. If the system under test can only understand one output format of the data generator (e.g. RDF) and the verification system only supports another output format (e.g. CSV), we would have to convert the generated data. Howerver, it should be easier to simply stream the generated data through multiple serializers in order to produce multiple output formats.

Some persons do not have an email address

Hi,

At least for the generated dataset with SF1, some persons do not have an email address. For instance, from the output of the CSVMergeForeign serializer:

# awk 'BEGIN { FS= "|" }; $1 == 8698' person_0_0.csv
8698|Chen|Liu|female|1982-05-29|2010-02-21T09:44:41.479+0000|14.103.81.196|Firefox|432
# awk 'BEGIN { FS= "|" }; $1 == 8698' person_email_emailaddress_0_0.csv 
... no result ...

The specification describes on pp. 16 that the attribute email of a Person consists of 1 or more strings.

Is this a discrepancy between the specification and the actual behaviour of the datagen tool?