
sparql-dev's Issues

Formal definition of the SERVICE + VAR pattern semantics

Why?

The SERVICE + VAR pattern allows a variable in the URL position of a SERVICE clause, e.g.

SERVICE ?endpoint {
   SELECT ... WHERE { ... }
}
BIND (... as ?endpoint)
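For illustration, a minimal self-contained sketch (the void:sparqlEndpoint lookup and the data behind it are assumptions, not part of the proposal): the endpoint IRI is bound from local data and the sub-pattern is then evaluated against each endpoint.

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?endpoint ?label
WHERE {
  # Candidate endpoints are taken from local data (hypothetical dataset descriptions).
  ?dataset void:sparqlEndpoint ?endpoint .
  # The sub-query is sent to each endpoint bound above.
  SERVICE ?endpoint {
    SELECT ?label WHERE { ?s rdfs:label ?label } LIMIT 10
  }
}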

This pattern is only informative in SPARQL 1.1.

As a result, some SPARQL endpoints support this feature, some do not (e.g. Virtuoso does not).

There is a need to formalize the semantics of this feature.

Considerations for backward compatibility

This would "just" move an informative feature to a normative feature. I don't think this would cause any backward compatibility issue.

Ability to use Turtle-like syntax for language filtering

Why?

Currently if I want to filter a literal based on its language, I need to write:

?x skos:prefLabel ?pref .
FILTER(langMatches(lang(?pref), "fr"))

or, if I am lazy (which happens often):

?x skos:prefLabel ?pref .
FILTER(lang(?pref) = "fr")

This is tedious for a common use-case.

What could it look like?

I'd like to be able to use Turtle-like syntax for language filtering:

?x skos:prefLabel ?pref@fr .

Considerations for backward-compatibility

This would make the @ character a special character in variable names.

FROM in subqueries

According to the current grammar, FROM is not allowed in subqueries.
The distributed nature of RDF practically screams for it.
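A sketch of what this could look like (not valid under the SPARQL 1.1 grammar; the graph and class IRIs are hypothetical): the subquery would scope its own dataset with FROM, independently of the outer query.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?label
WHERE {
  {
    # Hypothetical: a FROM clause local to the subquery.
    SELECT ?s ?label
    FROM <http://example.org/graphs/labels>
    WHERE { ?s rdfs:label ?label }
  }
  ?s a <http://example.org/Thing> .
}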

easier langtag and datatype agnostic matching of literals

Why?

For exploration or candidate generation I'm often doing things like the following in order to get results independent of the language tag or datatype of the literal:

?s ?p ?l .
FILTER(STR(?l) = "XYZ") .

The problem with this is that many SPARQL endpoints don't seem to optimize for such queries (i.e., VALUES ?l {"XYZ"@en "XYZ"@de "XYZ"@fr ...10more... "XYZ" } ?s ?p ?l . tends to be a lot faster than the above FILTER clause!). While this could be seen as a common query-plan optimization problem of SPARQL engines (failing to identify the perfect string lookup and to use their existing indices to answer the query quickly), I'd like to honor this "frequent special case" with its own little syntax add-on, which might also make it a lot easier to identify and optimize for.

For a syntax extension some things come to mind, such as "XYZ"@*, "XYZ"^^* and/or "XYZ"*, but nothing is decided at all.
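Purely as an illustration of one of these candidate forms (none of them is standard):

SELECT ?s ?p
WHERE {
  # Hypothetical wildcard: match "XYZ" with any language tag (or none).
  ?s ?p "XYZ"@* .
}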

Previous work

Could be related to #13 and #17, but they seem to focus on other aspects.

Considerations for backward compatibility

How do current endpoints deal with "XYZ"@* ?

Arithmetic operators for durations, dates, and times

Why?

Querying temporal data often requires arithmetic operations on durations, dates, and times. For example, we might want to retrieve books that were published during the past year. Such a query currently either needs to hard-code the publication date of the oldest included book, or has to rely on non-standard behaviour or an extension function of a specific SPARQL engine.

SPARQL 1.1 reuses arithmetic operators for numerics from XPath. The behaviour of arithmetic operators, such as addition or subtraction, for durations, dates, and times is undefined.

What could it look like?

Several SPARQL engines already allow some arithmetic operations on durations, dates, and times, such as Apache Jena, where the following query returns the date a year ago from now:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT (now() - "P1Y"^^xsd:duration AS ?year_ago)
WHERE {}

Similarly to the arithmetic operators for numerics in SPARQL 1.1, SPARQL 1.2 can reuse the arithmetic operators for durations, dates, and times from XPath by extending the mapping of operators to XPath functions.
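For instance, the motivating example above could then be written as follows (dct:issued is only an assumption about how the data is modelled):

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?book
WHERE {
  ?book dct:issued ?date .
  # Keep books published within the last year, relying on the proposed XPath-style date arithmetic.
  FILTER(?date > now() - "P1Y"^^xsd:duration)
}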

Previous work

Some SPARQL engines offer extension functions for arithmetic on durations, dates, and times, such as bif:datediff() in OpenLink's Virtuoso.

Considerations for backward compatibility

This change defines a previously undefined behaviour, so it is backwards-compatible.

Ability to do "LOAD ... WHERE { }" queries

Why?

Ability to do "LOAD ... WHERE { }" queries would enable "crawling" linked data using SPARQL queries, by following references to URIs and loading their RDF representations in the triplestore.

Currently we can do LOAD <http://...> queries; if I want to crawl and aggregate linked data I need to do:

  1. SELECT ?geoNamesUri WHERE { ?x :livesInCountry ?geoNamesUri }
  2. Iterate on ?geoNamesUri in the application code
  3. For each ?geoNamesUri, issue a LOAD <geoNamesUri> query

What could it look like?

Instead of the steps above, I'd like to be able to write:

LOAD ?geoNamesUri WHERE { ?x :livesInCountry ?geoNamesUri }

and even

LOAD ?geoNamesUri INTO ?geoNamesUri WHERE { ?x :livesInCountry ?geoNamesUri }

Previous work

None I am aware of.

Considerations for backward-compatibility

I think this is an extension of current LOAD queries, so backward-compatibility is preserved.

SPARQL Function

Extension function definition based on SPARQL filter language

Previous work

Linked Data Script Language:
http://ns.inria.fr/sparql-extension

Proposed solution

select *
where {
  ?x ?p ?y
  filter us:test(?x)
}

function us:test(?x) {
  strstarts(str(?x), rdf:) || strstarts(str(?x), rdfs:)
}

A minimal protocol for all services

A minimal protocol MUST be defined for SPARQL services to facilitate the development of SPARQL clients [2]:

  1. For reading and writing, the same SPARQL endpoint is used, with a URL ending in sparql (for example: http://example.org/dataset/sparql).
  2. For reading, the HTTP GET and POST (for very long queries) methods are supported, and both use the same parameter named "query" to transmit a query.
  3. For writing, the POST method is supported and uses the parameter named "update" to send a request.
  4. The default response format (i.e. when no Accept header specifies a format) must depend only on the query type, e.g. JSON for SPARQL SELECT.
  5. The response formats for SPARQL SELECT are JSON (and XML?).
  6. To choose the format of the service response, the Accept header must be supported in the HTTP request.

For CONSTRUCT, DESCRIBE, INSERT, UPDATE, and CLEAR, the default format also needs to be defined.

To simplify the development of tests, I propose to add two things:
7. Support deletion of all data with a CLEAR ALL request (which may be disabled or enabled in the configuration of the SPARQL service).
8. Support loading data (Turtle) with a LOAD ... INTO GRAPH query.

[2] (chapter 5) Le Linked Data à l'université : la plateforme LinkedWiki (in French). K. Rafes, 2019.

Edit: The default format differs depending on the query type (CONSTRUCT, CLEAR, etc.).
Edit 2: Clarification of points 4 and 5.

SPARQL Execution Sequence

The SPARQL execution sequence is mostly decided by the execution engine and by query optimizations that vary from engine to engine.

In general, there are only a couple of features that allow influencing the execution sequence, such as VALUES, SERVICE, and subqueries.

This request splits into the following:

  1. A clear sequence of what must be executed first: SERVICE, subqueries, etc.
  2. Execution hints, with which the author of the query can override the engine's execution order or optimizations.

[META] GitHub issue management

In order to streamline the issue management of this repo,
I would suggest making use of issue labels and issue templates.

I volunteer to follow up on this once there is an agreement.

Issue labels

GitHub issue labels make grouping and filtering of issues by topic possible.
I propose to label issues by their corresponding W3C recommendation.

  • management: Issues related to the management of the community group and this GitHub repository.
  • spec:query: SPARQL query language
  • spec:update: SPARQL update language
  • spec:service: SPARQL service description
  • spec:protocol: SPARQL HTTP protocol
  • spec:results: SPARQL result formats
  • spec:entailment: SPARQL entailment regimes
  • spec:graphstore: SPARQL graph store HTTP protocol

Issue templates

In order to make issues have a consistent format,
GitHub's issue templates could help out.

For spec-related issues, I would propose to extend @JervenBolleman's template.
It contains the following required blocks:

  • Why?
  • Previous work
  • Proposed solution
  • Considerations for backward compatibility

For other types of issues, I would leave the format open.

DESCRIBE using Shapes

  1. OData has ways to describe a Business Object (entity).
  2. GraphQL allows you to specify exactly the pieces of one or more objects that you want to get.
  3. RDF has just triples; how to delineate (or "circumscribe") a business object is non-trivial.
  4. You could use a Named Graph for this, but most people use named graphs to capture a Unit of Work (transaction), so you can use SPARQL graph protocol's PUT to overwrite it; and attach provenance to it. There are many cases when a unit of work does not coincide with a business object, eg:
    • an object is too large to update at once and you want to update only a few props
    • a data process updates a slice (aspect) of data that goes across many objects
  5. SPARQL leaves the semantics of DESCRIBE vague on purpose; a repo is supposed to return "useful info" about the resources being described, but what exactly is left to its discretion.
  6. Many repos return the CBD or SCBD but that uses blank nodes to delineate "owned sub-objects", and imho blank nodes are a bad practice that should never be used.
  7. Jena has DESCRIBE handlers but you have to program these in Java
  8. If you have a complex SPARQL query to find some objects and then you need a complex query to fetch those objects, it's highly non-trivial how to do this because of the bottom-up execution semantics of SPARQL.
  9. Sometimes you need different object "profiles" for different purposes, eg full data for a detailed object view, but brief data for a list view.

I feel it should be possible to describe business objects using RDF Shapes (SHACL, ShEx)

  • Possible interaction with JSONLD Frames and GraphQL is also interesting to explore.
  • It should be possible to attach shapes to objects using existing SHACL and ShEx mechanisms, but also as part of the DESCRIBE query/headers (full vs brief view)

cc @afs @jeenbroekstra @HolgerKnublauch @ericprud @jimkont @labra @azaroth42 @gkellogg @msporny

Keyword for functions that can produce multiple bindings

Currently, BIND assignments can produce at most one binding for a single variable.

Many SPARQL implementations have a notion of "special" or "magic" predicates that can be used to bind multiple variables at once, and to produce multiple rows at once. Examples include Jena's built-in magic properties/property functions (e.g. https://jena.apache.org/documentation/query/extension.html#property-functions) and various free-text search implementations that take a string as input and produce all subjects that have matching values. Further, some implementations even use the SERVICE keyword as a hack, e.g. SERVICE wikibase:label. From our own modest experience, there are tons of use cases for such multi-binding functions.

Even if SPARQL 1.2 doesn't include any specific built-in multi-binding functions, it would be great to have a proper syntax and an extension point so that these special functions can be consistently identified, supported by syntax-directed editors etc.

I struggle to find a proper keyword, so as a placeholder I am using ITERATE in this proposal:

SELECT * WHERE { ... ITERATE (ex:myFunction(?inputA, ?inputB, ...) AS ?outputA, ?outputB, ...) . }

The above would produce an iterator of bindings evaluating whatever is implemented by ex:myFunction. The right hand side would have unbound variables that get bound in each iteration. It's similar to BIND otherwise.
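For example, a free-text search facility could be exposed through this mechanism (ex:textSearch and its arguments are purely hypothetical):

SELECT ?doc ?score
WHERE {
  # Hypothetical multi-binding function: one result row per matching document.
  ITERATE (ex:textSearch("sparql", "en") AS ?doc, ?score) .
  ?doc a ex:Document .
}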

As related work, SHACL-AF includes a mechanism to declare new SPARQL functions that wrap ASK or SELECT queries. Obviously this could be extended to turn SELECT queries into reusable "stored procedures" that can then be reused in other queries using the new keyword.

Asynchronous SPARQL

It should be possible to have a standard way to submit and respond to a standing query. One possible approach is below:

POST a query, with parameter async=true

Response is 201 CREATED, with a header of Location: <x> and possibly an authorization token for modification.

Performing a GET on that location at any point responds with any results accumulated so far.
If the auth token is available, user can perform a DELETE to remove/deactivate the query.
GETs can include limit and offset parameters for paging.

This would allow expensive and/or standing queries to be computed and stored, rather than recomputed fresh. A delete from the graph would need to reset or "un-match" the results, but streaming in new data would be equivalent to a streaming engine. Non-streaming engines can use this to accumulate and save, even if their results aren't incremental.

JSON+LD/RDF serialization of SPARQL queries

Up to now, SPARQL has been defined as a string of characters. However, there is a wish for an easier-to-manipulate format for queries.

Why?

It is currently difficult to give feedback to users regarding how their query was executed and how it can be improved. Systems that do this need to patch query strings, which is not straightforward even for simple cases like automatically adding OFFSET and LIMIT clauses in the right place of a query string.

This feature would be a basic building block for sharing query execution plans: an RDF/JSON-LD query object can be extended, as in the strawman example below, to indicate an error and the kind of join planned by the database.

SELECT ?somethingBad ?badClass WHERE { ?somethingBad a ?badClass , "bad" }
{"base": "<http://example/base/>",
 "project" : 
    {"variables" : [ "?somethingBad", "?badClass" ],
    "source" : 
        { "bgp" : [
            { "@type" : "triple" , 
            "subject" : "?somethingBad" ,
            "predicate" :  "rdf:type",
            "object" :  "?badClass" },
            { "@type" : "triple" , 
            "subject" : "?somethingBad" ,
            "predicate" :  "rdf:type",
            "object" :  "bad" ,
            "_error_" : "Error: can't have a literal for a type in our well designed system" },
        "_join_" : "join:hash"
        }
    }
}

The basics can be built from the SPARQL algebra model.

The desire for a JSON-LD model is so that it can be embedded in an application/sparql-results+json(+ld) result set.

Previous work

The SPIN notation provides a notation for SPARQL in RDF.

Considerations for backward compatibility

SPARQL 1.0 and 1.1 only understand a query string, and do not expect JSON or RDF in a query field.
The suggestion is to add a new standard form parameter, e.g. rdfquery, for the HTTP protocol.

Extending the formats of application/sparql-results with new fields can break many clients in the field.

GROUP_CONCAT sorting

There should be a way to sort group rows before aggregation, particularly for GROUP_CONCAT.

Why

The use of GROUP_CONCAT is limited by the output not being deterministic with respect to the ordering of values in the group. For example, a GROUP_CONCAT(?name) aggregate might produce "Alice Bob Eve" in one implementation, and "Bob Alice Eve" in another (or even within the same implementation across different query evaluations). Allowing users to specify an implicit/explicit ordering for rows in an aggregation group would improve interoperability.
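As a strawman only (the ORDER BY inside the aggregate call is not part of any spec), the ordering could be made explicit in the aggregate itself; the ex:memberOf property is hypothetical:

PREFIX ex:   <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?group (GROUP_CONCAT(?name ORDER BY ?name; SEPARATOR=", ") AS ?names)
WHERE {
  ?person ex:memberOf ?group ;
          foaf:name ?name .
}
GROUP BY ?group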

Previous work

This was raised as ISSUE-66 and subsequently postponed during work on SPARQL 1.1.

Jindřich Mynarz discusses this as a potential SPARQL 1.2 feature.

Considerations for backward compatibility

At the user-level, this is purely additive as SPARQL 1.1 aggregate groups do not have any explicit ordering. However, it would require updates to the existing definition of SPARQL Algebra Set Functions which are defined in terms of multisets, not ordered sequences.

SPARQL Protocol support for structured error responses

Errors returned by a SPARQL Protocol implementation should be machine-readable. Common types of errors should be identifiable by IRI, and have relevant associated data included. For example:

  • tokenization errors and their location (byte/character/line offset)
  • parsing errors and the location of the incorrect tokens (offset ranges)
  • evaluation errors due to resource limits

Why

Representing endpoint errors in a machine-readable format would allow client tools to intelligently communicate the error to end-users, suggest ways to fix the error, or provide alternative approaches to answering the query.

Currently the SPARQL 1.1 Protocol does not give any guidance for how errors encountered during evaluation of a query/update should be represented. The text that does discuss error responses says only that:

The response body of a failed query request is implementation defined. Implementations may use HTTP content negotiation to provide human-readable or machine-processable (or both) information about the failed query request.

Previous work

RFC-7807 specifies a JSON format for expressing "machine-readable details of errors in a HTTP response". The use of RFC-7807 with JSON-LD might allow an RDF-friendly representation of error details.

Considerations for backward compatibility

This is purely additive as SPARQL 1.1 Protocol does not specify specific formats for encoding errors.

Generalize SERVICE for non-SPARQL endpoints

Currently, the SERVICE clause can only be used for querying SPARQL endpoints,
while the ability to query over non-SPARQL endpoints would also be useful.

Why?

From the SPARQL 1.1 federated query spec:

SERVICE calls depend on the SPARQL Protocol [SPROT] which transfers serialized RDF documents making blank nodes unique between service calls.

This allows queries like the following:

SELECT *
WHERE {
  SERVICE <http://example.org/sparql> {
     ?s ?p ?o .
  } 
}

However, since RDF is being published on the Web as Linked Data using various RDF serializations (Turtle, TriG, JSON-LD, ...), it would be useful to be able to query these documents with the SERVICE keyword as well.

This would enable queries like the following:

SELECT *
WHERE {
  SERVICE <http://example.org/me.rdf> {
     ?s ?p ?o .
  } 
}

This would require a SPARQL engine to:

  1. Fetch the remote document (with content negotiation for the supported RDF serializations).
  2. Parse the document, based on its content type, into in-memory RDF.
  3. Query over the in-memory RDF.

Previous work

The Comunica framework automatically detects the type of source in SERVICE clauses,
and queries the source appropriately.

Considerations for backward compatibility

This would be fully backwards-compatible, as existing SERVICE clauses for SPARQL endpoints would still work without changes to the query string.

Ability to query remote named graphs

Why?

Clauses referring to named graphs assume these are stored in the local triple store. E.g.

SELECT ... FROM <mygraph> WHERE { ... }
SELECT ... WHERE { GRAPH <mygraph> { ... } }

But this does not allow querying a named graph from a remote endpoint, in short a "remote named graph".

What could it look like?

An endpoint could evaluate a "FROM graph" or "GRAPH graph { }" by first looking for the named graph in the local triple store, then, if it cannot be found, try to dereference its URI and fetch its content locally.
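A sketch of the intended behaviour (the graph IRI is hypothetical): if the named graph below is not found in the local store, the endpoint would dereference the IRI and evaluate the pattern over the fetched triples.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?label
WHERE {
  GRAPH <http://example.org/data/graph1> {
    ?s rdfs:label ?label .
  }
}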

Previous work

None that I'm aware of. But this may be somehow related to issue #15 about a "LOAD WHERE" clause.

Considerations for backward-compatibility

Possibly conflicting with SPARQL 1.1: if a named graph does not exist in the local triple store, what happens today, is it created or just considered empty?

Whole-query-call scope for BNODE(...) with parameter

Current status

In SPARQL 1.1, the function BNODE(...), when used with an argument (a simple literal or an xsd:string), creates/reuses a blank node associated with that literal within the scope of a single solution mapping.
Given this choice of scope, the version with an argument does not add expressiveness to the language, since the same behavior can be replicated by binding the expression BNODE() to a variable and then reusing that variable.

Missing expressiveness

In CONSTRUCT queries, there is often the need to generate new resources that are referenced across multiple solution mappings.
This can currently be done by generating appropriate URIs with the function IRI(...).
There is no way to generate blank nodes playing the same role in the output graph.

Proposal

I propose to:

  • extend the expected argument of BNODE(...) to be any RDF term;
  • consider the whole query call as scope for the association with the given RDF term (i.e., every invocation of BNODE(...) with the same argument inside the same query call will return the same blank node).
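A sketch of how this could be used in a CONSTRUCT query (the ex: properties are hypothetical); under the proposed semantics, every solution mapping sharing the same ?person value would reuse the same blank node for ?addr:

PREFIX ex: <http://example.org/>

CONSTRUCT {
  ?person ex:hasAddress ?addr .
  ?addr   ex:city ?city ;
          ex:street ?street .
}
WHERE {
  ?person ex:city ?city ;
          ex:street ?street .
  # Proposed: the same argument yields the same blank node across the whole query call.
  BIND(BNODE(?person) AS ?addr)
}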

Implementation cost

For the implementations I know of, the SPARQL 1.1 semantics require more work than the ones proposed here: to check whether an existing blank node has to be reused, a different blank-node map has to be maintained for each query call and each solution binding; in this proposal, a single map per query call is enough.

Backward compatibility

This proposal, as described so far, would not be backwards compatible (it changes the semantics of an existing function), but:

  • it is quite possible that this would not be a problem in practice, if (as I guess) the version of BNODE(...) with argument is not currently much used;
  • to avoid the problem altogether, the function with the new semantics could be given a new name (e.g., BNODE_UNIQUE(...)) while the function BNODE(...) could keep its previous semantics.

Extend the protocol for errors in federated queries

Developers have difficulty debugging SPARQL queries because the error messages are not uniform enough and SPARQL editors are not able to correctly catch these messages.
Most importantly, error messages disappear between SPARQL services in federated queries. SPARQL services MUST forward the error messages of sub-queries.

Dynamic function invocation

Why?

Currently SPARQL has a fairly robust extension function mechanism that allows for arbitrary functions to be referred to by URIs. This allows for vendors to implement useful extensions and for those to be even interoperable if vendors publish definitions of the semantics of their functions, since then other vendors can also add implementations associated with the same URI.

However, there is currently no way to do dynamic function invocation, i.e. cases where you want to run different functions based on some conditions. The best you can do currently is to use IF to select between different functions, known in advance, based on some condition, e.g.

BIND(IF(?x > 0, ex:a(?x), ex:b(?x)) AS ?z)

This doesn't cover the use case of data-driven function selection, e.g. where the data itself encodes the desired function. So the following is currently illegal:

BIND(?x(?y) AS ?z)

Existing Solutions

Apache Jena currently supports dynamic function invocation by introducing a new built-in function called CALL() in its ARQ syntax:

BIND(CALL(?x, ?y) AS ?z)

This is an n-ary function, i.e. it can take any number of arguments. Its semantics are defined as follows:

  • If there are zero arguments, raise an error
  • Evaluate the first argument (which may itself be an arbitrary expression)
  • If the result is not a URI, raise an error
  • If the result is a URI, treat it as an extension function and attempt to evaluate the identified function, passing in the remaining arguments (which again may be arbitrary expressions), and return its result (or error)

Named solution sets

"another gap in SPARQL that I have felt, and that Bryan Thompson
aptly suggested a few years ago, is that SPARQL does not provide any
mechanism for naming or saving solution sets, even though they are a
fundamental concept in SPARQL. On a number of occasions I have wished
that I could save an intermediate result set and then refer to it later,
in producing final results."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0300.html

standardized communication of partial results

Why?

Some SPARQL implementations (e.g., Virtuoso) have an anytime query answer feature. This means that in case a timeout (explicitly given by the user or implicit by server-side settings) is exceeded, the query returns partial results.

An example of this behavior is the following query, which is going to return partial results (the counts you will see are incorrect, which is easily verified by looking at the number of distinct ?p being reported, or by fixing any of them and counting again; there is an implicit 30-second timeout on the DBpedia SPARQL endpoint, after which the counts of triples encountered until the timeout are reported):

SELECT ?p (COUNT(*) AS ?c) {
 ?s ?p ?o
}
GROUP BY ?p

While this is a nice (if not necessary) feature for providing public SPARQL endpoints, the current situation is sub-optimal (if not dangerous) in several ways:

  • it is on by default (while most developers would expect some kind of error handling in case of a timeout) and cannot be switched off
  • a 200 is used to communicate the partial result
  • the SPARQL result document itself does not contain information about being complete or incomplete (there is an HTTP response header, but as a 200 is used, developers are currently unlikely to handle it)
  • there is no clear indication for end-users telling them that "these counts are incomplete" in the web-interface

Previous work

Virtuoso's anytime query feature; this issue has been discussed before.

Maybe related: #7

Proposed solution

Standardize this in some way, talk about expected behavior, make it more explicit, reduce the risk for misinterpretation of results.

Ideas for this include:

  • define a way to explicitly ask for partial and/or complete results
    • if I ask for a complete result, run the above query, and run out of time, then I'd expect a timeout error
    • if I ask for a partial result, run the above query, and run out of time, then I'd expect a partial result
  • define recommendations about the default and how completeness info should be presented to developers and users
  • define a HTTP response code for partial result documents (maybe depending on which completeness mode the user asked for and whether this was explicit or not)
  • extend the SPARQL result syntax to include completeness information

Considerations for backwards compatibility

Depends on how this is implemented. Parts might be possible as an extension; changing the status code, for example, might not be.

Test suite

The SPARQL 1.1 test suite is useful, but

  • there are bugs that can't be fixed because the process is over
  • There are some harnesses, but no harness as a service that any dev can easily use to continuously test their implementation.
  • The Implementation Reports are generated from EARL rdf test results, which is great. But afaik these are submitted by devs and taken at face value.

@BorderCloud (Karima Rafes) has been valiantly running http://sparqlscore.com/ for 4 years (see documentation), has added some tests and fixed some, and has given up on others because of ambiguities in the spec.

She proposed, and I support, that whatever 1.2 features are standardized by this group should have tests. I also put forward that this group should try to fix the 1.1 test suite problems, and help the W3C host a continuous testing harness.

The biggest improvements needed on this testing site are

  • more flexible result comparison by the test runner, e.g. using JSON-LD canonicalization (c14n) to make comparison easier
  • logistical issues, e.g. what to use as the counterparty server for federated queries

Karima please add more from recent emails

Protocol to access the logs of SPARQL queries

To develop new autocompletion features for SPARQL queries [1], it is necessary to allow the collection of users' queries. Via the protocol, the user should be able to request that their query be public or private, and the SPARQL service should be able to share public queries without transformation (i.e. with the original variable names and comments).

[1] Designing scientific SPARQL queries using autocompletion by snippets
K Rafes, S Abiteboul, S Cohen-Boulakia, B Rance - 2018 IEEE 14th International Conference on e-Science, 2018

SPARQL-friendly lists

It is very hard[7] to query RDF lists, using standard SPARQL, while returning item ordering. This inability to conveniently handle such a basic data construct seems brain-dead to developers who have grown to take lists for granted.

"On my wish list are . . . generic structures like nested lists as first class citizens"
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0170.html

IDEA: Jena's list:index property

Apache Jena offers one potential (though non-standard)
way to ease this pain, by defining a list:index property:
https://jena.apache.org/documentation/query/rdf_lists.html
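A brief sketch of how this looks with Jena's property function (ex:ingredients and the data behind it are hypothetical):

PREFIX list: <http://jena.apache.org/ARQ/list#>
PREFIX ex:   <http://example.org/>

SELECT ?index ?member
WHERE {
  ?recipe ex:ingredients ?listHead .
  # Jena's non-standard property function yields each list member with its position.
  ?listHead list:index (?index ?member) .
}
ORDER BY ?index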

IDEA: Add lists as a fundamental concept in RDF

As proposed by David Wood and James Leigh
prior to the RDF 1.1 work.[8]
https://www.w3.org/2009/12/rdf-ws/papers/ws14

The unnamed/default graph should have a standard name

At present the unnamed/default graph has no standard name. This means that, when writing code that manipulates graphs, one must special-case the unnamed/default graph. It also violates one of the Axioms of Web Architecture: "Any resource of significance should be given a URI."

I think the unnamed/default graph should have a standard name, such as http://www.w3.org/1999/02/22-rdf-syntax-ns#defaultGraph ( rdf:defaultGraph ). Implied references to the unnamed/default graph in SPARQL, TriG, etc., should be understood as short-hand for this graph name.
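As an illustration only (rdf:defaultGraph is the proposed name, not an existing term), quad-aware queries could then address the default graph explicitly:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?s ?p ?o
WHERE {
  # Hypothetical: the default graph addressed by an explicit name.
  GRAPH rdf:defaultGraph { ?s ?p ?o }
}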

Support window functions

SPARQL should add support for window functions. This would increase expressivity and address some existing use cases such as "limit per resource".

Why

Window functions would allow computing values that are unavailable in SPARQL 1.1 queries:

  • row numbering and ranking
  • quantiles
  • moving averages
  • running totals

These can be used to address use cases such as limiting the result set to a specific number of results for each resource ("limit per resource"). For example, consider a query to retrieve information about web posts:

SELECT ?post ?title ?author ?date WHERE {
	?post a sioc:Post ;
		dc:title ?title ;
		sioc:has_creator ?author
}

Given that a post can have any number of titles and authors, we might wish to restrict our query to providing information about at most two authors for any individual post. This isn't easily done using standard SPARQL, but can be addressed using window functions.

Previous work

  • I've implemented window functions (with the strawman syntax shown below) in Kineo.
  • Window functions in SQLServer
  • Window functions in SQLite
  • Window functions in PostgreSQL

Proposed solution

Using a RANK window function, we can filter the result set of the example query above with a HAVING clause:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
SELECT ?post ?title ?author ?date WHERE {
	?post a sioc:Post ;
		dc:title ?title ;
		sioc:has_creator ?author
}
HAVING (RANK() OVER (PARTITION BY ?post ORDER BY ?author) <= 2)

This will take the result set from matching the basic graph pattern and partition it into groups based on the value of ?post. Within each partition, rows will be sorted by ?author, and then assigned an increasing integer rank. Finally, these rows will be filtered to keep only those with a rank less than or equal to 2. The final result set will be the concatenation of rows in each partition.

Beyond this use case, existing aggregates (e.g. AVG and SUM) can be used with windows to support things like moving averages and running totals.
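For instance, reusing the same strawman OVER syntax (the ex: properties and the ROWS clause are hypothetical), a moving average over the current and two preceding rows might look like:

PREFIX ex: <http://example.org/>

SELECT ?day ?value (AVG(?value) OVER (ORDER BY ?day ROWS 2 PRECEDING) AS ?movingAvg)
WHERE {
  ?obs ex:day ?day ;
       ex:value ?value .
}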

Considerations for backward compatibility

None.

Simplify writing non-exhaustive queries

Why?

Currently, it is easy to write exhaustive queries in SPARQL, but hard to write non-exhaustive queries. For example, it is much easier to write “give me all first names, all last names, and all birthdates of a person” than it is to say “give me any first name, last name, birthdate”. Unfortunately, for any query processor, it is easier to answer the first—and this especially holds for processors that are not databases, such as link-traversers.

(Proposed by @RubenVerborgh)

Abstract proposal

We need syntax to make simple queries simple. Even though it might (partly) just be syntactic sugar for a combination of MIN, OPTIONAL, etc., such syntactic sugar matters, because queries that are easier to write will be written more often. Currently, many queries have a much wider semantics than what is intended (e.g., you don’t really need all first names of a person; in most cases, just one will do).

Previous work

Unknown

Considerations for backward compatibility

This should not change the meaning of existing clauses, only the introduction of new ones.


Concrete proposals

Hereafter, a couple of concrete example proposals are listed, but alternatives are welcomed.

OPTIONAL

Querying optional foaf properties of a person can get quite verbose:

SELECT *
WHERE {
    OPTIONAL {
        <http://example.org/person1> foaf:name ?name.
    }
    OPTIONAL {
        <http://example.org/person1> foaf:mbox ?mail.
    }
    OPTIONAL {
        <http://example.org/person1> foaf:img ?img.
    }
}

Syntactic sugar for OPTIONAL clauses could be added, for example, by making optional variables start with two question marks:

SELECT *
WHERE {
    <http://example.org/person1> foaf:name ??name, ??mail, ??img.
}

Database Cursors / keyset pagination

Why?

Paginating large result sets through LIMIT/OFFSET is horrendously inefficient, as each new page requires the database to first materialise all earlier pages (so the total work grows quadratically with the depth of pagination). This means the further you drill into the results, the longer each page takes to return, until you're timing out.

It would be useful if SPARQL 1.2 and its protocol could support a more sophisticated form of pagination with cursors or keysets, which would allow databases to effectively remember where they were in returning a large solution sequence, and provide an id for the next page that allows the database to restore its state/location within the results across queries.

It would likely need both changes to the SPARQL protocol as well as the query language.

Previous work

https://www.postgresql.org/docs/9.2/plpgsql-cursors.html
https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/

Extend the protocol to support IRI autocompletion via keyword/text search

This autocompletion of IRIs does not presuppose any prior knowledge of the SPARQL service and the ontologies it contains. The user writes keywords, in the language of their choice, to obtain a list of suggestions for relevant IRIs. It is then enough to choose one so that the SPARQL editor can insert it into the current query. This type of feature is requested by all kinds of users.
For the moment, I use the Wikidata API to build the demonstrator of this autocompletion, but ideally all SPARQL services should offer this search over their IRIs when possible [1].

[1] Designing scientific SPARQL queries using autocompletion by snippets
K Rafes, S Abiteboul, S Cohen-Boulakia, B Rance - 2018 IEEE 14th International Conference on e-Science, 2018

Miscellaneous features wanted in SPARQL

From Adrian Gschwend in #44:

the place where I see most need right now is discussing the future of SPARQL. I guess we could have SPARQL 1.2 defined relatively fast with a bunch of stuff which is missing in the current spec but would be very useful in the real world. Some of them are implemented by various stores but not standardized so it's store-proprietary. For bigger stuff like the PATH concept implemented in Stardog it would make sense to think about SPARQL 2.0. There we would be allowed to break things IMHO. Again, a lot of stuff would either be syntactic sugar or modest extensions of the spec.

allow CONSTRUCT subqueries in FROM clauses

Why?

This adds a lot of flexibility, for example to do ad-hoc partitioning or (partial) alignment of graphs that use different vocabularies.

Example:

g1:
g1:a skos:prefLabel "a"@en .
g1:a skos:prefLabel "A"@de .
g1:b rdfs:label "b" .

g2:
g2:A rdfs:label "A2" .

Query:
SELECT ?s ?al
FROM {
 CONSTRUCT {
  ?u :anyLabel ?l .
 }
 FROM <g1>
 FROM <g2> 
 WHERE {
  ?x ?p ?rl .
  VALUES ?p { skos:prefLabel rdfs:label }
  BIND (STR(?rl) AS ?l)
  BIND (IRI(REPLACE(STR(?x), "g2", "g1")) AS ?u)
 }
}
WHERE {
 ?s :anyLabel ?al .
}

Result:
?s    ?al
g1:a  "a"
g1:a  "A"
g1:a  "A2"
g1:b  "b"

Previous work

None that I'm aware of.

Considerations for backward compatibility

Extension only.

CONSTRUCT FRAMED

Why?

I don't really understand the ideas raised in #39, but a perhaps smaller yet seemingly related problem I've encountered many times is that handling raw RDF triples can at times be awkward. Often you want them framed into objects so you can process resource objects one at a time and know you have all the requested properties for each object. Often you don't really want to have to process the entire stream of results to group all the triples yourself. If you have a library like RDF4J or Jena you can load your triples into a Model or memory store, but you may not always have such tools available.

Whilst databases are often already under load, they are frequently better placed to consume results for framing than their clients.

Proposed solution

It would be nice to be able to pass this burden onto the database in some circumstances, i.e. a query of something like:

CONSTRUCT FRAMED 
{ ?s ?p ?o } 
WHERE 
{ ?s ?p ?o }

would return all matches framed into resource objects... e.g. a JSONLD result stream of results grouped into resource objects: [{,,,}, {,,,}, {,,,}, {,,,}].

Such a proposal would require CONSTRUCT FRAMED queries to use a response format that can handle the framing, i.e. they would require a frame-oriented format (something like JSON(-LD)/XML). Technically, fully beautified Turtle could also fulfil the requirement; however, Turtle is typically read in a triple-oriented manner, not a resource-oriented one, and the point is to guarantee consumers can process each subject one at a time.

Previous work

  • JSON/LD Framing.

Considerations for backward compatibility

It requires an additive change to syntax.

CONSTRUCT GRAPH

Why?

Named graphs are increasingly used for data management and data modelling. SPARQL 1.1 Query offers no way to produce RDF quads in named graphs. The CONSTRUCT clause is limited to producing RDF triples. This is possible only indirectly via SPARQL 1.1 Update by running an INSERT update operation, followed by dumping the created named graphs. This work-around requires write access to a SPARQL endpoint and post-processing the exported data, since there's no standards-based way to request data in a quad-based RDF format.

What could it look like?

CONSTRUCT query form can be extended to produce named graphs.

Previous work

Apache Jena already allows constructing named graphs (https://jena.apache.org/documentation/query/construct-quad.html).
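A minimal sketch in that (non-standard) construct-quad style, with a hypothetical rule for deriving the target graph from each subject:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  GRAPH ?g { ?s rdfs:label ?label }
}
WHERE {
  ?s rdfs:label ?label .
  # Hypothetical: derive a named-graph IRI from the subject.
  BIND(IRI(CONCAT(STR(?s), "/meta")) AS ?g)
}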

Considerations for backward compatibility

This is a change that grows SPARQL, so its implementation would not break any SPARQL 1.1 applications. It is backwards-compatible; however, it is not forwards-compatible, since non-SPARQL 1.2 implementations would break due to the required changes in query syntax.

Define Aggregate

Why?

Define new aggregates using function definitions.

Previous work

Linked Data Script Language
http://ns.inria.fr/sparql-extension/

Proposed solution

The aggregate() operator returns the list of values as a new list datatype.
New aggregates, e.g. median, are then computed by defining functions over that list:

select (aggregate(?n) as ?list) (us:median(?list) as ?m)
where {
  ?x rdf:value ?n
}

function us:median(?list) {
  xt:get(xt:sort(?list), xsd:integer(xt:size(?list) / 2))
}

Considerations for backward compatibility

SPARQL 1.2 Extended Aggregate Functions

I'd like to see a few more aggregate functions in the spec, in particular for statistical evaluation.

For example, I have noticed that simple functions like median are missing from SPARQL 1.1.

Currently, the following aggregates are mentioned in the SPARQL 1.1 spec:

'COUNT' '(' 'DISTINCT'?( '*' | Expression ) ')'  
'SUM' '(' 'DISTINCT'? Expression ')' 
'MIN' '(' 'DISTINCT'? Expression ')'  
'MAX' '(' 'DISTINCT'? Expression ')'  
'AVG' '(' 'DISTINCT'? Expression ')' 
'SAMPLE' '(' 'DISTINCT'? Expression ')' 
'GROUP_CONCAT' '(' 'DISTINCT'? Expression ( ';' 'SEPARATOR' '=' String )? ')'
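To illustrate the gap, a strawman sketch of what a built-in MEDIAN aggregate could look like if it were added (ex:reading is a hypothetical property):

PREFIX ex: <http://example.org/>

SELECT ?sensor (MEDIAN(?reading) AS ?medianReading)
WHERE {
  ?sensor ex:reading ?reading .
}
GROUP BY ?sensor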

As an extension, it might be attractive to add basic cluster-analysis features here as well.

It might be a good idea to take a look at current implementations and compare features, complexity and performance for inclusion.

What is SPARQL 1.2?

This is to discuss how far one sees compatibility with SPARQL 1.1. Here the goal is to find the philosophy of where we want to go, what would we dare to break. At this early stage this is about framing opinions, opinions which are allowed to change.

How does earning commiter status work?

As received from ajs6f on the public mailing list:

Under "Decision Process" I find "It is expected that participants can earn Committer status through a history of valuable contributions as is common in open source projects", but I don't see a process described by which folks are invited or confirmed to be committers, or who actually does the invitation or confirmation. Did you have thoughts about that?

ajs6f
