Giter Club home page Giter Club logo

scio-sparql's Introduction

Scio + SPARQL = Distributed offline graph database

The scio-sparql project is a Scio extension that lets you use SPARQL on top of a collection of RDF triples.

Features

  • Mostly SPARQL 1.1 compliant
  • rdf4j based
  • RDF/Turtle/Trig/etc. parsers for statements

Quick Start

import com.spotify.scio.values.SCollection
import es.jolivar.scio.sparql.Interpreter.SCollectionStatements
import org.eclipse.rdf4j.model.Statement
import org.eclipse.rdf4j.query.BindingSet

val statements: SCollection[Statement] = ???
val bindingSets: SCollection[BindingSet] = statements.executeSparql(
  """
    |PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    |SELECT ?name 
    |       ?email
    |WHERE
    |  {
    |    ?person  a          foaf:Person .
    |    ?person  foaf:name  ?name .
    |    ?person  foaf:mbox  ?email .
    |  }""".stripMargin
)
bindingSets.map { bindingSet =>
  val name = bindingSet.getValue("name").stringValue()
  // Do something with the bindings
}

SPARQL Limitations

Since Apache Beam doesn't have a concept of ordering there's no support for ORDER BY. In fact, this is a no-op in the implementation. What is supported though it's using ORDER BY in a sliced context. That is to say, that the following query will return the 3 largest elements after skiping the first 5 largest ones:

SELECT ?a ?b ?c
WHERE {
  ?a ?b ?c
}
ORDER BY ?c
OFFSET 5
LIMIT 3 

A caveat about it though, that since Apache Beam offers no ordering guarantees, the results will not appear in order but are guaranteed to be the elements of the correct result set.

MINUS and inner FILTERs are not currently supported due to the fact of having to process statement by statement individually.

Property paths are supported as long as they are finite. In particular, the following quantity modifiers aren't supported:

  • iri* ZeroOrMorePath
  • iri+ OneOrMorePath

An additional limitation due to the nature of the Apache Beam model is that if there's a duplicate statement in the original SCollection it will surface through to the end if the query matches it. To avoid this it would mean making a SCollection[Statement].distinct shuffle at the beginning, which is potentially very expensive.

Currently, there's also no support for SERVICE within the SPARQL query as that logic is an entire different can of worms.

How it works

At dataflow building time the program parses the SPARQL query and converts the query into its equivalent algebra operations. It then proceeds to build the operations based on interpreting the algebra into its equivalence in Scio operations. This in turn statically compiles the dataflow for execution, meaning the overhead is equivalent to writing out the operations by hand.

One very important caveat to notice is that SPARQL is very, very prone to using JOINS. This will translate to very large amounts of shuffling, but that's the price to pay for running this process in a distributed fashion.

In economic terms this means that some queries might have a larger shuffle cost than a compute cost. In particular this is true of Cloud Dataflow using its Dataflow Shuffle service.

scio-sparql's People

Contributors

jordiolivares avatar scala-steward avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Forkers

scala-steward

scio-sparql's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.