
blazegraph-runner

blazegraph-runner provides a simple command-line wrapper for the Blazegraph open source RDF database. It provides operations on an "offline" database, so that you can easily load data or execute queries against a Blazegraph journal file, without needing to run it as an HTTP SPARQL server.

Usage

 blazegraph-runner [options] command [command options]

Options

   --informat   : Input format
   --journal    : Blazegraph journal file
   --outformat  : Output format
   --properties : Blazegraph properties file

Commands

   construct <query> <output> : SPARQL construct

   dump [command options] <output> : Dump Blazegraph database to an RDF file
      --graph : Named graph to dump triples from

   load [command options] <data-files> ... : Load triples
      --base=STRING
      --graph              : Named graph to load triples into
      --use-ontology-graph

   reason [command options] : Materialize inferences
      --append-graph-name=STRING : If a target-graph is not provided, append this text to the end of source graph name to use as target graph for inferred statements.
      --merge-sources            : Merge all selected source graphs into one set of statements before reasoning. Inferred statements will be stored in provided `target-graph`, or else in the default graph. If `merge-sources` is false (default), source graphs will be reasoned separately and in parallel.
      --ontology                 : Ontology to use as rule source. If the passed value is a valid filename, the ontology will be read from the file. Otherwise, if the value is an ontology IRI, it will be loaded from the database if such a graph exists, or else, from the web.
      --parallelism=NUM          : Maximum graphs to simultaneously either read from database or run reasoning on.
      --reasoner=STRING          : Reasoner choice: 'arachne' (default) or 'whelk'
      --rules-file               : Reasoning rules in Jena syntax.
      --source-graphs            : Space-separated graph IRIs on which to perform reasoning (must be passed as one shell argument).
      --source-graphs-query      : File name or query text of SPARQL select used to obtain graph names on which to perform reasoning. The query must return a column named `source_graph`.
      --target-graph             : Named graph to store inferred statements.

   select <query> <output> : SPARQL select

   update <update> : SPARQL update

General options

There are a number of general options that apply to all commands:

  • journal: Blazegraph journal file
  • properties: Blazegraph properties file. If this is not set, a default properties file is used that includes named graphs and text indexing.
  • informat: Input format. Valid values for this option depend on the command.
  • outformat: Output format. Valid values for this option depend on the command.
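If you want to supply your own properties file, it uses the standard Blazegraph properties format. The following is an illustrative sketch only (these are standard Blazegraph property names, but the exact defaults bundled with blazegraph-runner may differ):

```properties
# Journal file location (can also be given via --journal)
com.bigdata.journal.AbstractJournal.file=blazegraph.jnl
# Store quads so that named graphs are available
com.bigdata.rdf.store.AbstractTripleStore.quads=true
# Enable the full-text index
com.bigdata.rdf.store.AbstractTripleStore.textIndex=true
```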

Load

Load RDF data from files into a Blazegraph journal. A list of files or folders can be passed to the command; folders will be recursively searched for data files.

blazegraph-runner load --journal=blazegraph.jnl --graph="http://example.org/mydata" --informat=rdfxml mydata1.rdf mydata2.rdf

When loading multiple files (or a folder with files), by default each file is loaded under its own graph, currently named using the file's path. (This can be exploited in a SPARQL query, for example to distinguish between triples coming from different files.)
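Because each file lands in its own named graph, a SPARQL query can use the GRAPH keyword to report which file a given triple came from. A minimal sketch (the graph names will be the file paths assigned at load time):

```sparql
SELECT ?source_file ?s ?p ?o
WHERE {
  GRAPH ?source_file { ?s ?p ?o . }
}
```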

If your data files are OWL ontologies, blazegraph-runner can efficiently search within each file to find the ontology IRI if you want to use it as the target named graph:

blazegraph-runner load --journal=blazegraph.jnl --use-ontology-graph=true --informat=rdfxml go.owl

If you set --use-ontology-graph=true and also provide a value for --graph, the --graph will be used as a fallback value in the case that an ontology IRI is not found.

Dump

Export RDF data from a Blazegraph journal to a file. If a value for --graph is provided, only data from that graph is exported. If --graph is not provided, data from the default graph will be exported. In the future this command should be extended to dump all graphs to separate files or dump all data to a quad format.

blazegraph-runner dump --journal=blazegraph.jnl --graph="http://example.org/mydata" --outformat=turtle mydata.ttl

Select

Query a Blazegraph journal using SPARQL SELECT. Results can be output as TSV, XML, or JSON.

blazegraph-runner select --journal=blazegraph.jnl --outformat=tsv myquery.rq mydata.tsv

Construct

Query a Blazegraph journal using SPARQL CONSTRUCT. Results can be output as Turtle, RDFXML, or N-triples.

blazegraph-runner construct --journal=blazegraph.jnl --outformat=turtle myquery.rq mydata.ttl
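The query file is an ordinary SPARQL CONSTRUCT query. For example, a hypothetical myquery.rq that copies all rdfs:label triples out of the database:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT { ?s rdfs:label ?label }
WHERE { ?s rdfs:label ?label }
```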

Update

Apply a SPARQL UPDATE to modify data in a Blazegraph journal.

blazegraph-runner update --journal=blazegraph.jnl myupdate.rq
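The update file is plain SPARQL Update. As a hypothetical example (the graph IRI and triple shown here are made up for illustration), myupdate.rq could insert data into a named graph:

```sparql
PREFIX ex: <http://example.org/>

INSERT DATA {
  GRAPH <http://example.org/mydata> {
    ex:item1 ex:comment "added by SPARQL update" .
  }
}
```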

Reason

Materialize inferences derived from data in a Blazegraph journal, and store the inferred triples back to the journal. Reasoning rules are applied in memory using the Arachne reasoner (the default; Whelk can be selected with --reasoner). This command has a number of options:

  • rules-file: a file of reasoning rules in Jena rules format (not all Jena rule constructs are supported by Arachne).
  • ontology: an OWL ontology to convert to reasoning rules. If the passed value is a valid filename, the ontology will be read from the file. Otherwise, if the value is an IRI, it will be loaded from the Blazegraph journal if such a graph exists, or else, blazegraph-runner will attempt to download it from the web.
  • target-graph: the graph IRI in which to store inferred triples
  • append-graph-name: if target-graph is not provided, text provided with this option will be appended to the graph name of a given source graph to create a target graph IRI in which to store inferred triples.
  • source-graphs: space-separated list of graph IRIs on which to perform reasoning (must be passed as one shell argument).
  • source-graphs-query: file name or query text of SPARQL SELECT query used to obtain graph IRIs on which to perform reasoning. The query must return a column named source_graph.
  • merge-sources: whether to merge all selected source graphs into one set of statements before reasoning. Inferred statements will be stored in provided target-graph, or else in the default graph. If merge-sources is false (default), source graphs will be reasoned separately and in parallel, with results stored either together in target-graph or separately using append-graph-name.
  • parallelism: set the number of concurrent workers to use for reasoning on a set of graphs. Arachne is single-threaded, but if reasoning is applied independently to a set of graphs, this can occur in parallel.

This command line will select all named graphs from the database, materialize inferences for each one separately (up to 8 simultaneously), using rules derived from the RO ontology, and store the inferred triples in separate graphs corresponding to each source graph:

blazegraph-runner reason --journal=blazegraph.jnl --ontology="http://purl.obolibrary.org/obo/ro.owl" --source-graphs-query=graphs.rq --append-graph-name="#inferred" --merge-sources=false --parallelism=8

graphs.rq could look like this:

SELECT DISTINCT ?source_graph
WHERE {
  GRAPH ?source_graph { ?s ?p ?o . }
}

Building

If you clone the blazegraph-runner repository and want to build locally, you will need to have SBT installed.

Package a local version to run from the repo:

sbt stage
./target/universal/stage/bin/blazegraph-runner <options>

Zip up a distribution

sbt universal:packageZipTarball

Contributors

balhoff, dougli1sqrd, hlapp, scala-steward

blazegraph-runner's Issues

multiple loads of the same ontology file into the same journal grow the journal each time

I would guess this is an issue with how Blazegraph makes/handles bnodes used in ontology files. I'm not sure whether it's an issue with your code or not, but it surprised me, so I thought I'd let you know.

Example:

blazegraph-runner load --journal=mondo_reasoner.jnl --use-ontology-graph=true --informat=rdfxml mondo_reasoned.owl

I get a file with size 418906112. Running the same command again, the file increases to 817823744 and produces slightly different results when running SPARQL queries over logical definitions and the subClassOf* graph. (mondo_reasoned.owl is the output of robot reason on mondo.owl from a few months ago.)

Enable, or document how to obtain CSV output

The default output format from select seems to be TSV. It's not clear whether this can be changed to CSV; specifically, --outformat=csv and --outformat=text/csv both result in an error ("java.lang.IllegalArgumentException: Invalid SPARQL select output format").

It seems that Blazegraph itself is capable of producing CSV; for example, the instance at https://treatment.ld.plazi.org/sparql returns it when sent an Accept: text/csv header in an HTTP GET request (with a query parameter, of course). (Interestingly, this also drops the angle brackets when reporting entity URIs; the TSV produced by blazegraph-runner requires using STR() to achieve this.)

Add ability to load a set of owl files from a registry

note: this functionality could easily be done in a python wrapper, not sure what makes most sense

The purpose is to be able to load a set of owl or turtle files from a registry, rather than explicitly on the command line

For example, the OBO registry http://purl.obolibrary.org/meta/ontologies.ttl

has triples like

<http://purl.obolibrary.org/obo/go>
        <http://obofoundry.github.io/vocabulary/activity_status>  "active" ;
...
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go.owl> ;
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go.obo> ;
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go.json> ;
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go/extensions/go-plus.owl> ;
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go/go-base.owl> ;
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go/extensions/go-plus.json> ;
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go/go-basic.obo> ;
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go/go-basic.json> ;
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go/extensions/go-taxon-groupings.owl> ;
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go/snapshot/go.owl> ;
        <http://www.w3.org/ns/dcat#distribution>  <http://purl.obolibrary.org/obo/go/snapshot/go.obo> ;

...

<http://purl.obolibrary.org/obo/go/extensions/go-plus.owl>
        <http://purl.org/dc/elements/1.1/description>  "The main ontology plus axioms connecting to select external ontologies, with subsets of those ontologies" ;
        <http://purl.org/dc/elements/1.1/title>  "GO-Plus" ;
        <http://www.w3.org/ns/dcat#accessURL>  "http://purl.obolibrary.org/obo/go/extensions/go-plus.owl" ;
        <http://xmlns.com/foaf/0.1/page>  <http://geneontology.org/page/download-ontology> .

<http://purl.obolibrary.org/obo/go/go-base.owl>
        <http://purl.org/dc/elements/1.1/description>  "The main ontology plus axioms connecting to select external ontologies, excluding the external ontologies themselves" ;
        <http://purl.org/dc/elements/1.1/title>  "GO Base Module" ;
        <http://www.w3.org/ns/dcat#accessURL>  "http://purl.obolibrary.org/obo/go/go-base.owl" ;
        <http://xmlns.com/foaf/0.1/page>  <http://geneontology.org/page/download-ontology> .

(We can modify this if need be; it looks like we are missing typing of the distributions by format.)

I'd like to load this ttl file, and then have bgr load all targets of dcat:distribution that are rdf/owl files. Note: the use case here is to load all owl distributions for the same ontology. Each ontology can go in its own NG.

The longer term vision is to have a framework that is more metadata driven. Each ontology can declare reasoning strategy/expectations. Not sure if this framework 'sits above' bgr? see https://docs.google.com/document/d/1ld73pVz_BIH22jRBZuV0RVDeSiuGyQpD1u_F9Yv9gg0/edit#

Support rdf*

@prefix : <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@base <http://example.org/> .

    :bob :knows :alice .
<< :bob :knows :alice >> :source :fred .

riot from jena is fine with this:

$ riot --version
Jena:       VERSION: 3.16.0
Jena:       BUILD_DATE: 2020-07-09T16:13:45+0000
$ riot tests/rdfstar.ttl 
<http://example.org/bob> <http://example.org/knows> <http://example.org/alice> .
<< <http://example.org/bob> <http://example.org/knows> <http://example.org/alice> >> <http://example.org/source> <http://example.org/fred> .

but when I try bg-load I get:

Caused by: org.openrdf.rio.RDFParseException: IRI included an unencoded space: '32' [line 10]
        at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:440)
        at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:685)
        at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1366)
        at org.openrdf.rio.turtle.TurtleParser.parseURI(TurtleParser.java:948)
        at org.openrdf.rio.turtle.TurtleParser.parseValue(TurtleParser.java:613)
        at org.openrdf.rio.turtle.TurtleParser.parseSubject(TurtleParser.java:448)
        at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:382)
        at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:260)
        at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:215)
        at com.bigdata.rdf.rio.BasicRioLoader.loadRdf2(BasicRioLoader.java:236)
        at com.bigdata.rdf.rio.BasicRioLoader.loadRdf(BasicRioLoader.java:176)
        at com.bigdata.rdf.store.DataLoader.loadData4_ParserErrors_Not_Trapped(DataLoader.java:1595)
        at com.bigdata.rdf.store.DataLoader.loadFiles(DataLoader.java:1359)

note: would be nice to also have an option to set rdr in bg without specifying a whole properties file

Importing process hangs for some big data

Hello, I am experiencing a weird error when trying to import data using blazegraph-runner.

  • I tried to import a number of different ttl data files from this repository (https://data.monarchinitiative.org/ttl/) using the load command provided by this program. Importing small datasets was perfectly fine, but importing big datasets sometimes does not work, i.e. the program hung without any other error messages. This issue happens intermittently, e.g. around 5 or 6 times out of 10.

  • I ran this program using sbt shell, e.g., "> run --properties=src/main/resources/org/renci/blazegraph/blazegraph.properties load go.ttl"

  • I initially suspected that the data itself might be wrong, but importing with the load command provided by Blazegraph's admin webpage (https://wiki.blazegraph.com/wiki/index.php/Quick_Start#Load_Data) worked without any errors, e.g. "load file:///home/yy20716/blazegraph-runner/go.ttl". I also tested the data with other triplestores such as the Stardog community edition, so the data itself seems okay.

Could you please let me know what I should try to solve this issue? Are there any options or configurations I can adjust to see what happens behind the scenes? Thank you.

tweaking performance: runner vs robot tdb

I've been experimenting with comparing queries against the NCIT using both runner and robot query with the `--tdb true --keep-tdb-mappings true` options.

So far, robot is doing much better: 30%-50% faster on average.

I'm attaching the hacky makefile I used for testing. You can run with make -f test-blazegraph-runner.txt run-tests.

Although you may need to run multiple times.

Are there any tweaks to blazegraph-runner to improve performance?

test-blazegraph-runner.txt
