
Command line interface based RDF processing toolkit to run sequences of SPARQL statements ad-hoc on RDF datasets, streams of bindings and streams of named graphs with support for processing JSON, CSV and XML using function extensions

Home Page: https://smartdataanalytics.github.io/RdfProcessingToolkit/

License: Other

Languages: Java 96.50%, Shell 2.52%, Makefile 0.99%
Topics: jena, sparql-functions, sparql-extensions, named-graphs, rdf, debian-package, docker-image, rpm-package, sparql


RDF Processing Toolkit

News

  • 2023-05-19 New quality of life features: cpcat command and the canned queries tree.rq and gtree.rq.
  • 2023-04-04 Release v1.9.5! RPT now ships with sansa (Apache Spark based tooling) and rmltk (RML Toolkit) features. A proper GitHub release will follow once Apache Jena 4.8.0 is out as some code depends on its latest SNAPSHOT changes.
  • 2023-03-28 Started updating documentation to latest changes (ongoing)

Previous entries

Overview

RDF/SPARQL Workflows on the Command Line made easy. The RDF Processing Toolkit (RPT) integrates several of our tools into a single CLI frontend: It features commands for running SPARQL statements on triple- and quad-based data, both streaming and static. SPARQL extensions for working with CSV, JSON and XML are included, as is an RML toolkit that allows one to convert RML to SPARQL (or TARQL). RPT ships with Jena's ARQ and TDB SPARQL engines as well as one based on Apache Spark.

RPT is a Java tool that comes with Debian and RPM packaging. It is invoked as rpt <command>, where the following commands are supported:

  • integrate: This command is the most relevant one for day-to-day RDF processing. It features ad-hoc querying, transformation and updating of RDF datasets, with support for SPARQL extensions for ingesting CSV, XML and JSON. It also supports jq-compatible JSON output, which makes building bash pipes a breeze.
  • ngs: Processor for named graph streams (ngs), which enables processing of collections of named graphs in a streaming fashion. Process huge datasets without running into memory issues.
  • sbs: Processor for SPARQL binding streams (sbs), which enables processing of SPARQL result sets in a streaming fashion. Most prominently used for aggregating the output of an ngs map operation.
  • rmltk: These are the (sub-)commands of our (R2)RML toolkit. The full documentation is available here.
  • sansa: These are the (sub-)commands of our Semantic Analysis Stack (SANSA) - a Big Data RDF processing framework. Features parallel execution of RML/SPARQL and TARQL (if the involved sources support it).

Check this documentation for the supported SPARQL extensions; it contains many examples.

Example Usage

  • integrate allows one to load multiple RDF files and run multiple queries on them in a single invocation. Furthermore, prefixes from a snapshot of prefix.cc are predefined, and we made the SELECT keyword of SPARQL optional in order to make scripting less verbose. The --jq flag enables JSON output for interoperability with the conventional jq tool.
rpt integrate data.nt update.ru more-data.ttl query.rq

rpt integrate --jq file.ttl '?s { ?s a foaf:Person }' | jq '.[].s'
  • ngs adapts well-known bash tooling such as head, tail and wc to named graphs instead of lines of text
# Group RDF into graphs based on consecutive subjects and, for each named graph, count the number of triples
cat file.ttl | ngs subjects | ngs map --sparql 'CONSTRUCT { ?s eg:triples ?c } { SELECT ?s (COUNT(*) AS ?c) { ?s ?p ?o } GROUP BY ?s }'

# Count number of named graphs
rpt ngs wc file.trig

# Output the first 3 graphs produced by another command
./produce-graphs.sh | ngs head -n 3

Canned Queries

RPT ships with several useful queries on its classpath. Classpath resources can be printed out using cpcat. The following snippet shows examples of invocations and their output:

Overview

$ rpt cpcat spo.rq
CONSTRUCT WHERE { ?s ?p ?o }

$ rpt cpcat gspo.rq
CONSTRUCT WHERE { GRAPH ?g { ?s ?p ?o } }

Any resource (query or data) on the classpath can be used as an argument to the integrate command:

rpt integrate yourdata.nt spo.rq
# When spo.rq is executed, the data is queried and printed to STDOUT

Reference

The exact definitions can be viewed with rpt cpcat resource.rq.

  • spo.rq: Output triples from the default graph
  • gspo.rq: Output quads from the named graphs
  • tree.rq: Deterministically replaces all intermediate nodes with blank nodes. Intermediate nodes are those that appear both as subject and as object. Useful in conjunction with --out-format turtle/pretty for formatting e.g. RML; see the examples after this list.
  • gtree.rq: Named graph version of tree.rq
  • rename.rq: Replaces all occurrences of an IRI in subject and object positions with a different one. Usage (using environment variables): FROM='urn:from' TO='urn:to' rpt integrate data.nt rename.rq
  • count.rq: Return the sum of the counts of triples in the default graph and quads in the named graphs.
  • s.rq: List the distinct subjects in the default graph
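
The following invocations illustrate some of these canned queries; the input file names (mapping.rml.ttl, data.trig) are placeholders:

# Pretty-print an RML mapping; tree.rq deterministically turns intermediate nodes into blank nodes
rpt integrate mapping.rml.ttl tree.rq --out-format turtle/pretty

# Report the total number of triples and quads in a dataset
rpt integrate data.trig count.rq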

Example Use Cases

  • Lodservatory implements SPARQL endpoint monitoring; it uses these tools in this script, which is called from this git action.
  • Linked Sparql Queries provides tools to RDFize SPARQL query logs and run benchmarks on the resulting RDF. The triples related to a query represent an instance of a sophisticated domain model and are grouped in a named graph. Depending on the input size one can end up with millions of named graphs describing queries, amounting to billions of triples. With ngs one can easily extract complete samples of the queries' models without a related triple being left behind; see the sketch below.
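
A minimal sketch of such a sampling step, assuming the query log has been RDFized into a placeholder file query-log.trig:

# Take the first 1000 query descriptions (named graphs) as a complete sample
rpt ngs head -n 1000 query-log.trig > sample.trig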

Building

The build requires Maven.

For convenience, this Makefile defines essential goals for common tasks. To build a "jar-with-dependencies", use the distjar goal. The path to the created jar bundle is shown when the build finishes. In order to build and install a deb or rpm package, use the deb-rere or rpm-rere goals, respectively.

$ make

make help                # Show these help instructions
make distjar             # Create only the standalone jar-with-dependencies of rpt
make rpm-rebuild         # Rebuild the rpm package (minimal build of only required modules)
make rpm-reinstall       # Reinstall rpm (requires prior build)
make rpm-rere            # Rebuild and reinstall rpm package
make deb-rebuild         # Rebuild the deb package (minimal build of only required modules)
make deb-reinstall       # Reinstall deb (requires prior build)
make deb-rere            # Rebuild and reinstall deb package
make docker              # Build Docker image
make release-bundle      # Create files for Github upload

A Docker image is available at https://registry.hub.docker.com/r/aksw/rpt
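
A minimal usage sketch, assuming the image's entrypoint invokes rpt and that input files are mounted into the container (paths and file names are placeholders):

docker run --rm -v "$PWD":/data aksw/rpt integrate /data/data.ttl spo.rq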

License

The source code of this repo is published under the Apache License Version 2.0. Dependencies may be licensed under different terms. When in doubt please refer to the licenses of the dependencies declared in the pom.xml files. The dependency tree can be viewed with Maven using mvn dependency:tree.

Acknowledgements

History

  • (no entry yet)


Contributors

aklakan, dependabot[bot], gordiandziwis, lpmeyer, simonbin


Issues

Relative IRIs in SERVICE clauses may become rewritten

http://www.scholarlydata.org/sparql/ has a graph <conference-ontology>. This query yields no results when run from sparql integrate:

CONSTRUCT { ?s ?p ?o } { SERVICE <http://www.scholarlydata.org/sparql> { graph <conference-ontology> { ?s ?p ?o } } }

As this graph name is a relative IRI, it is likely that sparql-integrate always expands it, making it impossible to query for it. Needs investigation of a workaround.

Debug queries

Something like echo for debugging queries would be nice.

Make use of Jena's rsparql extension mechanism

Jena already provides SPARQL command line tooling with extension support.
As sparql-integrate does a bit more than just registering extensions (e.g. it supports processing sequences of queries from multiple files), it does not seem possible to integrate it completely into (r)sparql (i.e. to have sparql-integrate be only a command line wrapper that invokes (r)sparql with the appropriate extensions).

Yet, there are 2 things we can do:

  • Make it possible to use sparql-integrate extensions via Jena; this means documenting the namespaces for the extensions and testing them with Jena's native command line tooling
  • Extend sparql-integrate to support the same extension points as (r)sparql

SPARQL IRI function does not work

Running rpt from the jena-4.6.0 branch, the IRI function does not return anything.

IRI("http://example/") does not create an IRI.

Example:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

construct {
  <http://example.com/12>
    rdfs:seeAlso ?a .
}
where {
  bind(IRI("http://example/111") as ?a)
}

There is no error.

Only output used prefixes

Output in all formats that support prefixes causes all of the prefix.cc prefixes to be written out. This makes the output extremely cluttered.

  • By default, output should be deferred until a certain amount of data has been seen on which basis the set of used prefixes is determined. This is likely to work for 80-99% of the use cases. The amount of data to consider should be configurable.
  • To capture the remaining corner cases, there should be an option to provide the output prefixes directly

completed deferred output support on:

  • sparql integrate
  • named graph streams (ngs)
  • [n/a] sparql binding streams (sbs)

Install instructions for develop are out of date

After mvn clean install there are no debian packages built to install.

FROM maven:3-jdk-11

ENV SHA=b50353b3eb0424197194732b9faac5ce7330159f

RUN apt-get update && apt-get install -y \
  default-jre-headless \
  jq

WORKDIR /app/
RUN git init
RUN git remote add origin https://github.com/SmartDataAnalytics/RdfProcessingToolkit.git
RUN echo $SHA
RUN git fetch --depth 1 origin $SHA
RUN git checkout FETCH_HEAD
RUN mvn clean install -DskipTests=true
RUN dpkg -i $(find . -name "rdf-processing-toolkit*.deb")

ENTRYPOINT [ "" ]
CMD [ "" ]

Namespaces are not recognized inside of insert graph patterns

LOAD <dcat.ttl>

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
INSERT {
?distribution
  eg:workload ?sourceString ;
  dcat:accessURL ?distribution ;
  eg:resultGraph ?distribution .
}
WHERE {
  ?distribution dct:isFormatOf ?sourceDistribution .
  ?sourceDistribution dcat:downloadURL ?sourceFile .
  ?sourceFile url:text ?source .
  BIND(STRDT(?source, xsd:json) AS ?sourceString)
}

# Create stops
INSERT {
  GRAPH ?x { ?s a dcat:Stop }  <--- not recognized
  GRAPH ?x { ?s eg:stopId ?i }
}

Base URL is ignored for files specified as CLI arguments

Dumping data loaded using

rpt integrate file.ttl

where file.ttl has the content

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@base <http://example.com/base/> .

<TriplesMap1>
    a rr:TriplesMap .

Returns <TriplesMap1> without it being resolved against the base IRI.
The bug appears to be somewhere in RDFDataMgrRx.createIteratorTriples.
Also, by now Jena has the improved AsyncParser, so it may be better to switch to it.

Conditional Triple Pattern Reordering Deactivation when property functions and SERVICE clauses are used

Triple pattern reordering w.r.t. property functions and SERVICE causes unexpected results.
In general, if a property function computes bindings from input bindings, the reordering will cause lots of issues.
For example, if (?x pfn:retrieve "search string") requires the object to be bound in order to yield bindings with matching resources, then { <foo> rdfs:label ?l . ?x pfn:retrieve ?l } is vastly different from { ?x pfn:retrieve ?l . <foo> rdfs:label ?l . }.

The underlying issue is query planning over restricted sources (or that's the name I recall from a db lecture) - so pfn:retrieve can be seen as a relation whose 'object' column is non-enumerable. AFAIK there is no way in Jena to express that.

Hence, there should be a feature to auto-disable triple pattern reordering in the presence of property functions (or make the reordering aware of the pfn capabilities).

W.r.t. SERVICE: I think there was a similar issue where problems can occur when
{ {BEFORE} SERVICE {...} { AFTER } } is transformed into { {BEFORE AFTER} SERVICE {...} } - but right now I don't recall the issue anymore...

NoClassDefFoundError: org/apache/hadoop/shaded/org/apache/commons/configuration2/Configuration

> java -Dspark.kryoserializer.buffer.max="2047" -jar ./rpt-1.9.7-rc9.jar sansa query mapping.rq > result.hs12.tiny.raw.ttl 

15:35:57 [INFO] [n.s.s.c.i.CmdSansaTarqlImpl:66] - 'spark.master' not set - defaulting to: local[*]
15:35:57 [WARN] [o.a.s.u.Utils:73] - Your hostname, coypuserver.coypu.org resolves to a loopback address: 127.0.1.1; using 159.69.72.186 instead (on interface enp7s0)
15:35:57 [WARN] [o.a.s.u.Utils:73] - Set SPARK_LOCAL_IP if you need to bind to another address
15:35:58 [INFO] [o.a.s.SparkContext:61] - Running Spark version 3.3.2
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/org/apache/commons/configuration2/Configuration
	at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:43)
	at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:41)
	at org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:149)
	at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:265)
	at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2561)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2561)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:316)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2714)
	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
	at net.sansa_stack.spark.cli.impl.CmdSansaQueryImpl.run(CmdSansaQueryImpl.java:50)
	at net.sansa_stack.spark.cli.cmd.CmdSansaQuery.call(CmdSansaQuery.java:45)
	at net.sansa_stack.spark.cli.cmd.CmdSansaQuery.call(CmdSansaQuery.java:13)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.aksw.rdf_processing_toolkit.cli.cmd.CmdUtils.callCmd(CmdUtils.java:77)
	at org.aksw.rdf_processing_toolkit.cli.cmd.CmdUtils.callCmd(CmdUtils.java:40)
	at org.aksw.rdf_processing_toolkit.cli.cmd.CmdUtils.execCmd(CmdUtils.java:21)
	at org.aksw.rdf_processing_toolkit.cli.main.MainCliRdfProcessingToolkit.main(MainCliRdfProcessingToolkit.java:9)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.apache.commons.configuration2.Configuration
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	... 27 more

Build error

maven clean install fails with:

[ERROR] Failed to execute goal org.vafer:jdeb:1.5:jdeb (default) on project sparql-integrate-debian-cli: Failed to create debian package /home/beavis/Cloud/work/SparqlIntegrate/sparql-integrate-debian-cli/target/sparql-integrate-cli_1.0.0~20180119135008_all.deb: Could not create deb package: Data source not found : /home/beavis/Cloud/work/SparqlIntegrate/sparql-integrate-debian-cli/src/deb/resources/usr/bin -> [Help 1]

Conditional OpExecutor selection

Some property functions do not work as expected when reordered, because they need input passed to them.
Currently reordering is globally disabled, which sometimes gives unacceptably long runtimes. Resolution of this issue should involve:

  • A heuristic to automatically choose reordering (e.g. default is enabled, but use of property functions on a black list disables it)
  • An explicit option to set the reordering mode (on/off/auto)

For example, fs:find requires ?path to be bound in order to yield results.

foo a ?path .
?path fs:find ?file

SPARQL insert results in a java.lang.NullPointerException

I started RPT with the following parameters:
integrate -X --out-format txt edited2.ttl '* {?s ?p ?o }' --server

In the browser I entered the following query:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ns0: <http://rdfunit.aksw.org/ns/core#>

insert { ?testresult prov:wasAssociatedWith ?metric . }
where {
  	?testresult a ns0:TestCaseResult ;
		ns0:testCase ?bn .
	?bn a ?test .
  	?test rdfs:seeAlso ?metric .
}

edited2.ttl:

@prefix : <http://stream-ontology.com/maturitymodel/> .
@prefix rlog: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/rlog#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ns0: <http://rdfunit.aksw.org/ns/core#> .
@prefix ownshaclcorrectness: <http://stream-ontology.com/maturitymodel/shapes/Correctness/> .
@prefix msv: <http://stream-ontology.com/metrics-severity/> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .

ownshaclcorrectness:2
        rdfs:seeAlso  :M2 .

ownshaclcorrectness:1
        rdfs:seeAlso  :M1 .

<urn:uuid:85a78c5a-233a-439f-918f-514568c85f58>
        sh:conforms                false ;
        ns0:testsTimeout           "0"^^xsd:nonNegativeInteger ;
        prov:wasStartedBy          <http://localhost/> ;
        prov:startedAtTime         "2022-06-20T14:41:08.088Z"^^xsd:dateTime ;
        ns0:testsSuceedded         "1"^^xsd:nonNegativeInteger ;
        ns0:testsError             "0"^^xsd:nonNegativeInteger ;
        prov:endedAtTime           "2022-06-20T14:41:08.167Z"^^xsd:dateTime ;
        rdf:type                   ns0:TestExecution ;
        ns0:testsRun               "2"^^xsd:nonNegativeInteger ;
        rdf:type                   prov:Activity ;
        ns0:testsFailed            "1"^^xsd:nonNegativeInteger ;
        sh:result                  <urn:uuid:85a78c5a-233a-439f-918f-514568c85f58/bebe5da3-32f6-4234-bf0e-1948cd770f69> ;
        ns0:executionType          "shaclTestCaseResult" ;
        ns0:source                 <http://example.de/dataset1> ;
        rdf:type                   sh:ValidationReport ;
        prov:wasAssociatedWith     <file:///app/maturityModel/shapes.ttl> ;
        ns0:totalIndividualErrors  "1"^^xsd:nonNegativeInteger .


<urn:uuid:85a78c5a-233a-439f-918f-514568c85f58/bebe5da3-32f6-4234-bf0e-1948cd770f69>
        dcterms:date         "2022-06-20T14:41:08.149Z"^^xsd:dateTime ;
        ns0:testCase         _:b4 ;
        rdf:type             ns0:TestCaseResult ;
        rdf:type             sh:ValidationResult ;
        prov:wasGeneratedBy  <urn:uuid:85a78c5a-233a-439f-918f-514568c85f58> ;
        sh:focusNode         <http://stream-ontology.com/matvoc-core/> ;
        sh:message           "The ontology should provide some basic metadata, like rdfs:comment, dct:creator, rdfs:label, owl:versionInfo, dct:modified and owl:priorVersion." ;
        sh:severity          rlog:WARN .


_:b4    rdf:type  ownshaclcorrectness:1 .

Error:

java.lang.NullPointerException
	at org.apache.jena.update.UpdateExecutionDatasetBuilder.dataset(UpdateExecutionDatasetBuilder.java:70)
	at org.apache.jena.update.UpdateExecution.dataset(UpdateExecution.java:33)
	at org.apache.jena.update.UpdateExecutionFactory.make(UpdateExecutionFactory.java:236)
	at org.apache.jena.update.UpdateExecutionFactory.create(UpdateExecutionFactory.java:123)
	at org.aksw.jenax.arq.connection.core.RDFConnectionUtils$1.update(RDFConnectionUtils.java:171)
	at org.aksw.jenax.arq.connection.fix.RDFLinkAdapterFix.update(RDFLinkAdapterFix.java:43)
	at org.apache.jena.rdflink.RDFLinkModular.update(RDFLinkModular.java:116)
	at org.apache.jena.rdflink.RDFConnectionAdapter.update(RDFConnectionAdapter.java:131)
	at org.aksw.jenax.web.servlet.SparqlEndpointBase.processUpdateAsync(SparqlEndpointBase.java:526)
	at org.aksw.jenax.web.servlet.SparqlEndpointBase.processStmtAsync(SparqlEndpointBase.java:269)
	at org.aksw.jenax.web.servlet.SparqlEndpointBase.executeWildcardPost(SparqlEndpointBase.java:78)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167)
	at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:159)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:391)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80)
	at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:253)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:232)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:680)
	at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:791)
	at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626)
	at org.aksw.jenax.web.filter.SparqlStmtTypeAcceptHeaderFilter.doFilter(SparqlStmtTypeAcceptHeaderFilter.java:142)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
	at org.aksw.jenax.web.filter.FilterPost.doFilter(FilterPost.java:40)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
	at org.aksw.jenax.web.filter.CorsFilter.doFilter(CorsFilter.java:48)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
	at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:602)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
	at java.base/java.lang.Thread.run(Thread.java:829)

Environment substitution not working with the new SparqlScriptProcessor.

Environment substitution for env: IRIs such as in

$> FOO=bar sparql-integrate file.sparql

file.sparql:
SELECT * { BIND(<env:FOO> AS ?bar) }

stopped working after consolidating lots of logic into the SparqlScriptProcessor.

Although the SparqlScriptProcessor is configured with a parser that is wrapped with a post processor to perform environment substitution, the SparqlScriptProcessor invokes SparqlStmtUtils.processFile(...) which internally creates a new parser instance.

Hence, processFile needs to be changed to accept a parser as an argument.

Sansa - KryoException: Buffer overflow

I started with:

sansa query mapping.rq

The mapping file is simple, but the linked CSV is long (440MB).
For the first few minutes there were jobs using all my cores, then for a while only up to two jobs used them, and after 10 minutes or so the error was thrown.
Here is the end of the log:

3:12:48 [INFO] [o.a.s.s.DAGScheduler:61] - Missing parents: List()
13:12:48 [INFO] [o.a.s.s.DAGScheduler:61] - Submitting ResultStage 1522 (MapPartitionsRDD[243] at mapPartitions at JavaRddOfBindingsOps.java:144), which has no missing parents
13:12:48 [INFO] [o.a.s.s.m.MemoryStore:61] - Block broadcast_172 stored as values in memory (estimated size 40.0 KiB, free 9.1 GiB)
13:12:48 [INFO] [o.a.s.s.m.MemoryStore:61] - Block broadcast_172_piece0 stored as bytes in memory (estimated size 9.5 KiB, free 9.1 GiB)
13:12:48 [INFO] [o.a.s.s.BlockManagerInfo:61] - Added broadcast_172_piece0 in memory on Pulsar-5047.lan:41025 (size: 9.5 KiB, free: 9.2 GiB)
13:12:48 [INFO] [o.a.s.SparkContext:61] - Created broadcast 172 from broadcast at DAGScheduler.scala:1513
13:12:48 [INFO] [o.a.s.s.DAGScheduler:61] - Submitting 1 missing tasks from ResultStage 1522 (MapPartitionsRDD[243] at mapPartitions at JavaRddOfBindingsOps.java:144) (first 15 tasks are for partitions Vector(75))
13:12:48 [INFO] [o.a.s.s.TaskSchedulerImpl:61] - Adding task set 1522.0 with 1 tasks resource profile 0
13:12:48 [INFO] [o.a.s.s.TaskSetManager:61] - Starting task 0.0 in stage 1522.0 (TID 279) (Pulsar-5047.lan, executor driver, partition 75, NODE_LOCAL, 4380 bytes) taskResourceAssignments Map()
13:12:48 [INFO] [o.a.s.e.Executor:61] - Running task 0.0 in stage 1522.0 (TID 279)
13:12:48 [INFO] [o.a.s.s.ShuffleBlockFetcherIterator:61] - Getting 14 (11.9 MiB) non-empty blocks including 14 (11.9 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
13:12:48 [INFO] [o.a.s.s.ShuffleBlockFetcherIterator:61] - Started 0 remote fetches in 0 ms
13:12:52 [ERROR] [o.a.s.e.Executor:98] - Exception in task 0.0 in stage 1522.0 (TID 279)
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:392) ~[rpt.jar:?]
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:593) [rpt.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167) ~[rpt.jar:?]
	at com.esotericsoftware.kryo.io.Output.writeVarInt(Output.java:284) ~[rpt.jar:?]
	at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:682) ~[rpt.jar:?]
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:646) ~[rpt.jar:?]
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:20) ~[rpt.jar:?]
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:16) ~[rpt.jar:?]
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651) ~[rpt.jar:?]
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361) ~[rpt.jar:?]
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302) ~[rpt.jar:?]
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651) ~[rpt.jar:?]
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:388) ~[rpt.jar:?]
	... 4 more
13:12:52 [WARN] [o.a.s.s.TaskSetManager:73] - Lost task 0.0 in stage 1522.0 (TID 279) (Pulsar-5047.lan executor driver): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:392)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:593)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
	at com.esotericsoftware.kryo.io.Output.writeVarInt(Output.java:284)
	at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:682)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:646)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:20)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:16)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:388)
	... 4 more

13:12:52 [ERROR] [o.a.s.s.TaskSetManager:77] - Task 0 in stage 1522.0 failed 1 times; aborting job
13:12:52 [INFO] [o.a.s.s.TaskSchedulerImpl:61] - Removed TaskSet 1522.0, whose tasks have all completed, from pool 
13:12:52 [INFO] [o.a.s.s.TaskSchedulerImpl:61] - Cancelling stage 1522
13:12:52 [INFO] [o.a.s.s.TaskSchedulerImpl:61] - Killing all running tasks in stage 1522: Stage cancelled
13:12:52 [INFO] [o.a.s.s.DAGScheduler:61] - ResultStage 1522 (hasNext at Iterator.java:132) failed in 4.145 s due to Job aborted due to stage failure: Task 0 in stage 1522.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1522.0 (TID 279) (Pulsar-5047.lan executor driver): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:392)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:593)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
	at com.esotericsoftware.kryo.io.Output.writeVarInt(Output.java:284)
	at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:682)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:646)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:20)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:16)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:388)
	... 4 more

Driver stacktrace:
13:12:52 [INFO] [o.a.s.s.DAGScheduler:61] - Job 78 failed: hasNext at Iterator.java:132, took 4.147336 s
java.lang.RuntimeException: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1522.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1522.0 (TID 279) (Pulsar-5047.lan executor driver): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:392)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:593)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
	at com.esotericsoftware.kryo.io.Output.writeVarInt(Output.java:284)
	at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:682)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:646)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:20)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:16)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:388)
	... 4 more

Driver stacktrace:
	at org.aksw.commons.util.exception.ExceptionUtilsAksw.rethrowUnless(ExceptionUtilsAksw.java:40)
	at org.aksw.commons.util.exception.ExceptionUtilsAksw.rethrowIfNotBrokenPipe(ExceptionUtilsAksw.java:88)
	at org.aksw.rdf_processing_toolkit.cli.cmd.CmdUtils.lambda$callCmd$0(CmdUtils.java:70)
	at picocli.CommandLine.execute(CommandLine.java:2088)
	at org.aksw.rdf_processing_toolkit.cli.cmd.CmdUtils.callCmd(CmdUtils.java:77)
	at org.aksw.rdf_processing_toolkit.cli.cmd.CmdUtils.callCmd(CmdUtils.java:40)
	at org.aksw.rdf_processing_toolkit.cli.cmd.CmdUtils.execCmd(CmdUtils.java:21)
	at org.aksw.rdf_processing_toolkit.cli.main.MainCliRdfProcessingToolkit.main(MainCliRdfProcessingToolkit.java:9)
Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1522.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1522.0 (TID 279) (Pulsar-5047.lan executor driver): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:392)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:593)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
	at com.esotericsoftware.kryo.io.Output.writeVarInt(Output.java:284)
	at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:682)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:646)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:20)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:16)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:388)
	... 4 more

Driver stacktrace:
	at net.sansa_stack.spark.io.rdf.output.RddRdfWriter.runUnchecked(RddRdfWriter.java:130)
	at net.sansa_stack.spark.cli.impl.CmdSansaMapImpl.writeOutRdfSources(CmdSansaMapImpl.java:96)
	at net.sansa_stack.spark.cli.impl.CmdSansaQueryImpl.run(CmdSansaQueryImpl.java:126)
	at net.sansa_stack.spark.cli.cmd.CmdSansaQuery.call(CmdSansaQuery.java:45)
	at net.sansa_stack.spark.cli.cmd.CmdSansaQuery.call(CmdSansaQuery.java:13)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	... 4 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1522.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1522.0 (TID 279) (Pulsar-5047.lan executor driver): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:392)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:593)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
	at com.esotericsoftware.kryo.io.Output.writeVarInt(Output.java:284)
	at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:682)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:646)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:20)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:16)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:388)
	... 4 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
	at org.apache.spark.rdd.RDD.collectPartition$1(RDD.scala:1036)
	at org.apache.spark.rdd.RDD.$anonfun$toLocalIterator$3(RDD.scala:1038)
	at org.apache.spark.rdd.RDD.$anonfun$toLocalIterator$3$adapted(RDD.scala:1038)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:132)
	at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31)
	at net.sansa_stack.spark.io.rdf.output.RddRdfWriter.runOutputToConsole(RddRdfWriter.java:206)
	at net.sansa_stack.spark.io.rdf.output.RddRdfWriter.run(RddRdfWriter.java:137)
	at net.sansa_stack.spark.io.rdf.output.RddRdfWriter.runUnchecked(RddRdfWriter.java:128)
	... 16 more
Caused by: org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:392)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:593)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
	at com.esotericsoftware.kryo.io.Output.writeVarInt(Output.java:284)
	at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:682)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:646)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:20)
	at org.aksw.jenax.io.kryo.jena.QuadSerializer.write(QuadSerializer.java:16)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:388)
	... 4 more
13:12:52 [INFO] [o.a.s.SparkContext:61] - Invoking stop() from shutdown hook
13:12:52 [INFO] [o.s.j.s.AbstractConnector:383] - Stopped Spark@396e5758{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
13:12:52 [INFO] [o.a.s.u.SparkUI:61] - Stopped Spark web UI at http://Pulsar-5047.lan:4040
13:12:52 [INFO] [o.a.s.MapOutputTrackerMasterEndpoint:61] - MapOutputTrackerMasterEndpoint stopped!
13:12:52 [INFO] [o.a.s.s.m.MemoryStore:61] - MemoryStore cleared
13:12:52 [INFO] [o.a.s.s.BlockManager:61] - BlockManager stopped
13:12:52 [INFO] [o.a.s.s.BlockManagerMaster:61] - BlockManagerMaster stopped
13:12:52 [INFO] [o.a.s.s.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:61] - OutputCommitCoordinator stopped!
13:12:52 [INFO] [o.a.s.SparkContext:61] - Successfully stopped SparkContext
13:12:52 [INFO] [o.a.s.u.ShutdownHookManager:61] - Shutdown hook called
13:12:52 [INFO] [o.a.s.u.ShutdownHookManager:61] - Deleting directory /tmp/spark-c4c495f7-e3c7-4dbb-ba74-b222d203e83e

I have 64GB RAM (which was at 60% usage) and 16 logical cores.
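
As the exception message suggests, a workaround is to raise spark.kryoserializer.buffer.max. When running the standalone jar this can be passed as a system property, as in the NoClassDefFoundError report above (jar and file names are placeholders):

java -Dspark.kryoserializer.buffer.max=2047 -jar rpt.jar sansa query mapping.rq > result.ttl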

Non-deterministic Output Formatting

It seems that the output itself is fine, but there seems to be a race condition when configuring the output writer.
An alternative reason could be that the writer configuration still uses a reflection hack, which might break.
Needs investigation.

Running rpt integrate /tmp/data.jsonld spo.rq --out-format turtle gives output with different formatting such as:

Variant A:

@prefix geo:  <http://www.opengis.net/ont/geosparql#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<https://data.coypu.org/country/DEU>
        rdf:type                        <https://schema.coypu.org/global#Country>;
        rdfs:label                      "Germany"@en;

Variant B:

@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<https://data.coypu.org/country/DEU> rdfs:label "Germany"@en .
<https://data.coypu.org/country/DEU> rdf:type <https://schema.coypu.org/global#Country> .

allow for java -jar usage

currently java -jar runs are not possible:

java -jar /usr/src/app/sparql-integrate-debian-cli/target/sparql-integrate-debian-cli-1.0.0.jar
no main manifest attribute, in /usr/src/app/sparql-integrate-debian-cli/target/sparql-integrate-debian-cli-1.0.0.jar

Please extend it in order to allow easier usage patterns ...

Executing construct queries in streaming and sorted ways

In the code there are transformations for 'client-side' construct query execution as well as construct-to-lateral transformers.
The lateral transformation pushes the construct template into the WHERE part, which effectively allows for expressing sort conditions on the produced triples in SPARQL.
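
As a purely conceptual sketch (not the actual implementation, and assuming the usual rdf/foaf prefixes are predefined), such a construct-to-lateral rewrite could bind the template triples inside a LATERAL block so that an ORDER BY over the produced triples becomes expressible:

# Hypothetical lateral form of: CONSTRUCT { ?person a foaf:Person ; foaf:name ?name } WHERE { ?person foaf:name ?name }
SELECT ?s ?p ?o {
  { SELECT ?person ?name { ?person foaf:name ?name } }
  LATERAL {
    { BIND(?person AS ?s) BIND(rdf:type AS ?p) BIND(foaf:Person AS ?o) }
    UNION
    { BIND(?person AS ?s) BIND(foaf:name AS ?p) BIND(?name AS ?o) }
  }
}
ORDER BY ?s ?p ?o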

RPT/integrate needs to expose these features in the CLI:

  • Add an option for streaming construct queries (construct-to-lateral with an optional distinct/reduced operation), e.g. --construct-streaming[=n] where the value is the window size for reducing duplicate triples; absent = distinct, 0 = no duplicate removal.
  • Add convenience option to implicitly sort all construct queries (construct-to-lateral with ORDER BY ?s ?p ?o ?g for nquads)

rpt integrate has no --help and fails when running without arguments

ergo: I am not able to see what is possible.

rpt integrate --help
Usage: rpt integrate [-h]
Sparql Integrate
  -h, --help
∴ rpt integrate
[ERROR] Application startup failed
java.lang.IllegalArgumentException: Args must not be null
	at org.springframework.util.Assert.notNull(Assert.java:134)
	at org.springframework.boot.DefaultApplicationArguments.<init>(DefaultApplicationArguments.java:41)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:294)
	at org.springframework.boot.builder.SpringApplicationBuilder.run(SpringApplicationBuilder.java:134)
	at org.aksw.sparql_integrate.cli.MainCliSparqlIntegrate.main(MainCliSparqlIntegrate.java:557)
	at org.aksw.rdf_processing_toolkit.cli.cmd.CmdSparqlIntegrateMain.run(CmdSparqlIntegrateMain.java:20)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1919)
	at picocli.CommandLine.access$1100(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
	at picocli.CommandLine.execute(CommandLine.java:2058)
	at org.aksw.rdf_processing_toolkit.cli.main.MainCliRdfProcessingToolkit.main(MainCliRdfProcessingToolkit.java:16)
java.lang.RuntimeException: java.lang.IllegalArgumentException: Args must not be null
	at org.aksw.commons.util.exception.ExceptionUtils.rethrowUnless(ExceptionUtils.java:35)
	at org.aksw.commons.util.exception.ExceptionUtils.rethrowIfNotBrokenPipe(ExceptionUtils.java:14)
	at org.aksw.rdf_processing_toolkit.cli.main.MainCliRdfProcessingToolkit.lambda$main$0(MainCliRdfProcessingToolkit.java:13)
	at picocli.CommandLine.execute(CommandLine.java:2068)
	at org.aksw.rdf_processing_toolkit.cli.main.MainCliRdfProcessingToolkit.main(MainCliRdfProcessingToolkit.java:16)
Caused by: java.lang.IllegalArgumentException: Args must not be null
	at org.springframework.util.Assert.notNull(Assert.java:134)
	at org.springframework.boot.DefaultApplicationArguments.<init>(DefaultApplicationArguments.java:41)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:294)
	at org.springframework.boot.builder.SpringApplicationBuilder.run(SpringApplicationBuilder.java:134)
	at org.aksw.sparql_integrate.cli.MainCliSparqlIntegrate.main(MainCliSparqlIntegrate.java:557)
	at org.aksw.rdf_processing_toolkit.cli.cmd.CmdSparqlIntegrateMain.run(CmdSparqlIntegrateMain.java:20)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1919)
	at picocli.CommandLine.access$1100(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
	at picocli.CommandLine.execute(CommandLine.java:2058)
	... 1 more

Handle mixed construct queries

A construct query could return triples and/or quads. In the "and/or" case a mixed output is returned. I think it would be better to return only quads if a quad statement exists. The output could then have the "filetype" n-quads.

java.lang.OutOfMemoryError with integrate

I was loading a ttl file (29GB) via tdb2 and tried to output it into another ttl file.
I noticed that the output file was still empty at least 10 minutes before the error occurred.

> java  -Xmx6g -jar ~/Documents/rpt.jar integrate --loc /home/kjunghanns/Documents/Coypu/data-sources/baci/db.db --db-engine=tdb2 output/coytradegraph.all.rpt.ttl spo.rq -o=output/coytradegraph.all.rpt.small.ttl

09:16:31 [INFO] [o.a.s.c.m.SparqlIntegrateCmdImpls:196] - Inferred output format from output/coytradegraph.all.rpt.small.ttl: Turtle/pretty
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:270] - Interpreting argument #1: 'output/coytradegraph.all.rpt.ttl'
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:440] - Detected data format: text/turtle
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:467] - A total of 84 prefixes known after processing output/coytradegraph.all.rpt.ttl
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:270] - Interpreting argument #2: 'spo.rq'
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:438] - Argument does not appear to be (RDF) data because content type probing yeld no result
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:372] - Preparing SPARQL statement at line 1, column 1
09:16:31 [INFO] [o.a.c.d.RdfDataEngineFactoryTdb2:75] - Created new directory (its content will deleted when done): /home/kjunghanns/Documents/Coypu/data-sources/baci/db.db
09:16:31 [INFO] [o.a.c.d.RdfDataEngineFactoryTdb2:109] - Connecting to TDB2 database in folder /home/kjunghanns/Documents/Coypu/data-sources/baci/db.db
09:16:31 [INFO] [o.a.s.c.m.SparqlIntegrateCmdImpls:622] - Processing output/coytradegraph.all.rpt.ttl
10:40:21 [INFO] [o.a.s.c.m.SparqlIntegrateCmdImpls:622] - Processing spo.rq:1:1

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "HttpClient-1-SelectorManager"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "HttpClient-2-SelectorManager"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-3-thread-1"
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
10:45:58 [INFO] [o.a.c.d.RdfDataEngineFactoryTdb2:77] - Deleting created directory: /home/kjunghanns/Documents/Coypu/data-sources/baci/db.db
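
A possible mitigation, assuming the Turtle/pretty writer buffers the whole result before writing, might be to request a streaming-friendly serialization instead; the exact format name accepted by --out-format is an assumption here, and it is untested whether this avoids the OOM:

java -Xmx6g -jar ~/Documents/rpt.jar integrate --loc /home/kjunghanns/Documents/Coypu/data-sources/baci/db.db --db-engine=tdb2 output/coytradegraph.all.rpt.ttl spo.rq --out-format=ntriples -o=output/coytradegraph.all.rpt.small.nt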

Service Enhancer breaks literals in service clause

This issue actually needs to be fixed in the Service Enhancer Jena plugin - but for now I document it here:

The following example incorrectly uses <env:S> (string substitution) rather than env://S (IRI substitution).

S=https://query.wikidata.org/sparql rpt integrate 'SELECT * { SERVICE <env:S> { <http://www.wikidata.org/entity/Q54837> ?p ?o } }'

This causes an NPE in my Service Enhancer plugin. The plugin should not interfere and should simply pass the execution request on to the engine.

Caused by: java.lang.NullPointerException: Cannot invoke "java.util.List.size()" because "opts" is null
	at org.apache.jena.sparql.service.enhancer.impl.ChainingServiceExecutorBulkServiceEnhancer.createExecution(ChainingServiceExecutorBulkServiceEnhancer.java:53) ~[jena-serviceenhancer-4.8.0-SNAPSHOT.jar:4.8.0-SNAPSHOT]
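
For comparison, the intended IRI-substitution form of the same invocation (as described above) would be:

S=https://query.wikidata.org/sparql rpt integrate 'SELECT * { SERVICE <env://S> { <http://www.wikidata.org/entity/Q54837> ?p ?o } }'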

No SPARQL response for GET query

reported by @splattater

curl -vvv http://localhost:8642/sparql --url-query query='select * { ?s ?p ?o }'
< HTTP/1.1 200 OK
< Date: Thu, 15 Jun 2023 08:51:29 GMT
< Content-Type: text/html;charset=utf-8
< Content-Length: 1169
< Server: Jetty(9.4.40.v20210413)
< 
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
...
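
For comparison, explicitly requesting a SPARQL result format via standard content negotiation would look like this (whether it changes the observed behaviour has not been verified):

curl -vvv -H 'Accept: application/sparql-results+json' http://localhost:8642/sparql --url-query query='select * { ?s ?p ?o }'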

integrate shows strange behavior when JSON-LD is used

Command:

java -jar $PATH_RPT integrate -o=local_licenses.jsonld --out-format=jsonld datasources.n3 get_licenses.rq

Version: 1.9.1

datasources.n3:

<http://dalicc.net/dependencygraph/dg_default/>
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <https://vocab.eccenca.com/di/Dataset> .

<https://coypu.org/datasource/COY%20Ontology%20%2F%20CoyPu%20Schema> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/dcat#Dataset>.
<https://coypu.org/datasource/COY%20Ontology%20%2F%20CoyPu%20Schema> <http://purl.org/dc/terms/title> "COY Ontology / CoyPu Schema".
<https://coypu.org/datasource/COY%20Ontology%20%2F%20CoyPu%20Schema> <http://purl.org/dc/terms/language> "http://id.loc.gov/vocabulary/iso639-1/en".
<https://coypu.org/datasource/COY%20Ontology%20%2F%20CoyPu%20Schema> <http://purl.org/dc/terms/license> <http://dalicc.net/licenselibrary/CC-BY_v4>.

get_licenses.rq:

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX odrl: <http://www.w3.org/ns/odrl/2/>
construct { ?license a odrl:Set . }
 WHERE {
   SELECT distinct ?license WHERE {
     ?ds a <http://www.w3.org/ns/dcat#Dataset> ;
     dcterms:license ?license .
   }

}

It shows an RDF result in the terminal, but writes [] into the JSON-LD file. With other formats it works.
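
A possible workaround, assuming Apache Jena's riot command line tool is available, is to write a format that works and convert it afterwards (a sketch, untested with this data):

rpt integrate -o=local_licenses.ttl datasources.n3 get_licenses.rq
riot --output=jsonld local_licenses.ttl > local_licenses.jsonld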

Listen on multiple ports with different read-only/write privileges.

A simple setup for workflows would have one public port with read-only data access, and another write-enabled SPARQL endpoint that only listens on localhost.

Something like

rpt integrate --server \
  --port 4567 --listen '*' \
  --port 5678 --allow-update --listen 'localhost'

DBMS Probing log output should be hidden

Currently every startup logs warnings similar to those below:

16:07:56 [INFO] [o.a.j.r.s.SparqlScriptProcessor:406] - Preparing SPARQL statement at line 1, column 1
16:07:56 [WARN] [o.a.j.a.exec:80] - URI <bif:sys_stat> has no registered function factory
16:07:56 [WARN] [o.a.j.a.exec:80] - URI <bif:sys_stat> has no registered function factory
16:07:56 [WARN] [o.a.j.s.u.MappedLoader:54] - Loading function or property function with old style 'jena.hpl.hp.com' used - preferred style is to use 'jena.apache.org': http://jena.hpl.hp.com/ARQ/property#versionARQ => http://jena.apache.org/ARQ/property#versionARQ
16:07:56 [WARN] [o.a.j.s.u.ClsLoader:54] - Class not found: org.apache.jena.sparql.pfunction.library.versionARQ
16:07:56 [WARN] [o.a.j.a.exec:80] - URI <tag:stardog:api:functions:toRadians> has no registered function factory
16:07:56 [WARN] [o.a.j.a.exec:80] - URI <http://www.ontotext.com/sparql/functions/toRadians> has no registered function factory
16:07:56 [INFO] [o.a.s.c.m.SparqlIntegrateCmdImpls:511] - Detected DBMS: jena

SPARQL Endpoint throws exception for SPARQL Update (POST, x-www-form-urlencoded)

Running a SPARQL endpoint with the following setting
rpt integrate --server any_rdf.ttl

I get an exception after executing a POST request with a valid SPARQL Update using https://www.w3.org/TR/sparql11-protocol/#update-via-post-urlencoded

10:43:25 [WARN] [o.g.j.s.WebComponent:613] - A servlet request to the URI http://localhost:8642/sparql contains form parameters in the request body but the request body has been consumed by the servlet or a servlet filter accessing the request parameters. Only resource methods using @FormParam will work as expected. Resource methods consuming the request body by other means will not work as expected.
10:43:25 [WARN] [o.a.j.w.p.UncaughtExceptionProvider:25] - java.lang.RuntimeException: Both 'query' and 'update' statement strings provided in a single request; query=INSERT DATA {
Using POST request with content-type x-www-form-urlencoded ...

It seems that a 'query' parameter was added somewhere after my request (which used the 'update' parameter) came in.
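
A minimal reproduction of such a request, with a made-up update statement, could look like this (curl sends --data-urlencode bodies as application/x-www-form-urlencoded by default):

curl -X POST http://localhost:8642/sparql --data-urlencode 'update=INSERT DATA { <urn:ex:s> <urn:ex:p> "o" }'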

Chromium/Chrome does not allow YASGUI access to localhost

These browsers don't allow web pages to access localhost for security reasons (to prevent leaking information from localhost into the web).
However, for the YASGUI use case this access is actually desired.
This needs to be documented in a troubleshooting guide.

Improve Documentation

There are many features that are currently either poorly documented or not documented at all:

There should be dedicated sections on how to use rpt in different architecture setups, e.g. as a caching proxy.

Thanks to @SimonBin the Maven build produces ASCII documentation files for the picocli commands, but they still need to be added to the ...

  • documentation web page.
  • deb / rpm packaging
